[Users] Checkpoint/recovery inconsistencies

Luca Baiotti baiotti at ile.osaka-u.ac.jp
Tue Jul 19 21:43:03 CDT 2011


This is really good news! Thank you and congratulations. It was really a 
long-standing problem (and hard to investigate).

Questions:

- In the diff, which version of Carpet do the changes refer to? (All?)

- When do you plan to commit the changes to the repositories?


Thanks,

Luca



On 20/7/11 11:21 AM, Erik Schnetter wrote:
> Christian Ott, Peter Diener, and I tracked down a set of
> checkpointing/recovery inconsistencies in the Einstein Toolkit. This
> means that many variables had different values after checkpointing and
> then recovering them, which should not be the case. We were careful
> not to use MPI, OpenMP, or any compiler optimisations, so
> these inconsistencies represent real changes.
>
> We found a series of problems, and we now have one possible
> correction. Unfortunately, this requires changes to several thorns,
> mostly to schedule.ccl declarations, but also to Carpet. In addition,
> we find that it is impossible to re-calculate PseudoEvolution
> variables after recovery -- they have to be checkpointed. In other
> words, variables such as ADMBase, TmunuBase, or other variables with 3
> timelevels need to be checkpointed to be consistent.
>
> (Many of the inconsistencies are small, and are caused by differences
> in the discretisation error. That is, they will vanish in the
> continuum limit. However, one of the design principles of the Einstein
> Toolkit is that results should be, as far as possible, independent of
> the number of MPI processes, OpenMP threads, and
> checkpointing/recovery. Disabling checkpointing of such variables
> could be implemented depending on a parameter, which would also need
> to ensure that these variables are then recalculated -- with slightly
> different values -- after recovery. In particular, this would require
> a new schedule bin in Carpet that is executed after recovery for all
> timelevels, or alternatively a new schedule option that requests this
> behaviour from Carpet.)
>
>
>
> In detail, the changes necessary to correct these inconsistencies are:
>
> 1. All variables with 3 timelevels need to be checkpointed. This
> affects ADMBase, TmunuBase, ML_ADMConstraints, ML_ADMQuantities, and
> the BSSN constraints in ML_BSSN. Since this depends on the number of
> timelevels, care has to be taken to make the right decision at run
> time.
>
> 2. The schedule group MoL_PseudoEvolution must not be scheduled in
> post_recover_variables -- since these variables are checkpointed, they
> do not need to be recalculated.
>
> 3. Carpet cannot execute the post_recover_variables bin on the past
> timelevels, since applying boundary conditions to past timelevels
> includes prolongation, and the time interpolation necessary for this
> would have a different discretisation error.
>
> 4. Carpet traversed the PostRestrict bin in the incorrect order. It
> traversed from finest to coarsest (the same order in which restriction
> has to be applied), but because fine grid boundaries may be
> interpolated from coarse grid boundaries, this bin must be traversed
> from coarsest to finest.
>
> 5. The scheduling of GRHydro and HydroBase is off in certain places.
>
> 6. The scheduling of NaNChecker leaves the NaNmask uninitialised after recovery.
>
> 7. The scheduling of various Kranc-generated thorns is off in certain places.
>
> 8. Certain schedule items requiring the ADMBase variables are executed
> in the incorrect order, i.e. before the ADMBase variables were
> available.
>
> I attach a diff of a possible set of corrections. However:
> - This diff does not correct the handling of the TmunuBase variables -- they
> are still not checkpointed if they have multiple timelevels. I believe
> this should be done by thorn TmunuBase itself.
> - Similarly, handling checkpointing of the ADMBase variables should be
> moved from ML_BSSN_Helper to ADMBase itself.
> - I think a new schedule group MoL_PseudoEvolutionBoundaries would
> make sense, as this would simplify the schedule of thorns which use
> MoL_PseudoEvolution.
>
> -erik

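To illustrate point 4 of the quoted message (the traversal order of the
PostRestrict bin), here is a minimal toy sketch in Python. It is not Carpet
code. The names post_restrict and PHYSICAL_BC and the one-value-per-level
"boundary" are assumptions made up for this example. The only thing it is
meant to show is that if a fine level's boundary is filled from the next
coarser level's boundary, the coarser level has to be processed first.

```python
# Toy model, not Carpet: each refinement level holds one "boundary" value.
# PHYSICAL_BC is a hypothetical value supplied by the outer boundary condition.
PHYSICAL_BC = 1.0


def post_restrict(levels, order):
    """Apply a toy PostRestrict step to the levels, visiting them in `order`.

    levels[0] is the coarsest level. Level 0 takes its boundary from the
    physical boundary condition; every finer level copies the boundary of
    the next coarser level (a stand-in for interpolating from it).
    """
    levels = [dict(level) for level in levels]  # work on a copy
    for l in order:
        if l == 0:
            levels[l]["boundary"] = PHYSICAL_BC
        else:
            levels[l]["boundary"] = levels[l - 1]["boundary"]
    return levels


# All boundaries start out stale (0.0) before the bin is run.
stale = [{"boundary": 0.0} for _ in range(3)]

coarse_to_fine = post_restrict(stale, [0, 1, 2])
fine_to_coarse = post_restrict(stale, [2, 1, 0])

print([level["boundary"] for level in coarse_to_fine])
print([level["boundary"] for level in fine_to_coarse])
```

In this toy model, the coarsest-to-finest traversal leaves every level with
the fresh boundary value, while the finest-to-coarsest traversal leaves the
finer levels holding the stale values they copied before the coarser levels
were updated.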
