[Users] Checkpoint/recovery inconsistencies

Tue Jul 19 21:21:00 CDT 2011

Christian Ott, Peter Diener, and I tracked down a set of
checkpointing/recovery inconsistencies in the Einstein Toolkit. This
means that many variables had different values after checkpointing and
then recovering them, which should not be the case. We were careful
not to use MPI, OpenMP, or any kind of compiler optimisation, so
these inconsistencies represent real changes.
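
A minimal sketch of this kind of round-trip check, written in Python
rather than within Cactus. The toy step function, the state layout, and
the use of pickle as the checkpoint format are all assumptions made for
illustration; this is not the procedure used for the Einstein Toolkit
tests. The idea is only that a run interrupted by a checkpoint and a
recovery should reproduce an uninterrupted run bit for bit:

```python
import pickle


def step(state):
    """Advance the toy state by one iteration (periodic three-point average)."""
    u = state["u"]
    n = len(u)
    new_u = [0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[(i + 1) % n] for i in range(n)]
    return {"iteration": state["iteration"] + 1, "u": new_u}


def evolve(state, nsteps):
    """Apply `step` nsteps times; the input state is left unmodified."""
    for _ in range(nsteps):
        state = step(state)
    return state


def checkpoint(state):
    """Serialise the full state (hypothetical stand-in for a checkpoint file)."""
    return pickle.dumps(state)


def recover(blob):
    """Restore the state from a checkpoint."""
    return pickle.loads(blob)


def bits(state):
    """Hexadecimal form of every value, so that the comparison is bitwise."""
    return [x.hex() for x in state["u"]]


initial = {"iteration": 0, "u": [0.1 * i * i for i in range(8)]}

# Reference: 10 iterations without interruption.
reference = evolve(initial, 10)

# Same run, but checkpointed after 4 iterations and then recovered.
recovered = recover(checkpoint(evolve(initial, 4)))
resumed = evolve(recovered, 6)

consistent = bits(resumed) == bits(reference)
print("consistent" if consistent else "INCONSISTENT")
```

Comparing the hexadecimal representation instead of using a tolerance
is deliberate: differences at the level of the discretisation error
would otherwise go unnoticed.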

We found a series of problems, and we now have one possible set of
corrections. Unfortunately, this requires changes to several thorns,
mostly to schedule.ccl declarations, but also to Carpet. In addition,
we find that it is impossible to re-calculate PseudoEvolution
variables after recovery -- they have to be checkpointed. In other
words, variables such as those in ADMBase or TmunuBase, or other
variables with 3 timelevels, need to be checkpointed to be consistent.

(Many of the inconsistencies are small, and are caused by differences
in the discretisation error. That is, they will vanish in the
continuum limit. However, one of the design principles of the Einstein
Toolkit is that results should be as independent as possible of
the number of MPI processes, OpenMP threads, and
checkpointing/recovery. Disabling checkpointing of such variables
could be implemented depending on a parameter, which would also need
to ensure that these variables are then recalculated -- with slightly
different values -- after recovery. In particular, this would require
a new schedule bin in Carpet that is executed after recovery for all
timelevels, or alternatively a new schedule option that requests this
behaviour from Carpet.)

In detail, the changes necessary to correct these inconsistencies are:

1. All variables with 3 timelevels need to be checkpointed. This
affects ADMBase, TmunuBase, ML_ADMConstraints, ML_ADMQuantities, and
the BSSN constraints in ML_BSSN. Since this depends on the number of
timelevels, care has to be taken to make the right decision at run
time.

2. The schedule group MoL_PseudoEvolution must not be scheduled in
post_recover_variables -- since these variables are checkpointed, they
do not need to be recalculated.

3. Carpet cannot execute the post_recover_variables bin on the past
timelevels, since applying boundary conditions to past timelevels
includes prolongation, and the time interpolation necessary for this
would have a different discretisation error.

4. Carpet traversed the PostRestrict bin in the incorrect order. It
traversed from finest to coarsest (the same order in which restriction
has to be applied), but because fine grid boundaries may be
interpolated from coarse grid boundaries, this bin must be traversed
from coarsest to finest.

5. The scheduling of GRHydro and HydroBase is off in certain places.

6. The scheduling of NaNChecker leaves the NaNmask uninitialised after recovery.

7. The scheduling of various Kranc-generated thorns is off in certain places.

8. Certain schedule items requiring the ADMBase variables are executed
in the incorrect order, i.e. before the ADMBase variables are
available.
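
The ordering problem in item 4 can be illustrated with a toy model in
Python. This is only a sketch under simplifying assumptions, not
Carpet's implementation: each refinement level is reduced to one
interior value and one boundary value, the coarsest level sets its
boundary from its own interior, and each finer level copies its
boundary (standing in for interpolation) from the next coarser level's
boundary. The function names restrict and post_restrict are
hypothetical.

```python
def make_levels(nlevels, value):
    """Level 0 is the coarsest; every level starts in a consistent state."""
    return [{"interior": value, "boundary": value} for _ in range(nlevels)]


def restrict(levels):
    """Restriction, finest to coarsest: each coarse interior takes the finer value."""
    for lev in range(len(levels) - 2, -1, -1):
        levels[lev]["interior"] = levels[lev + 1]["interior"]


def post_restrict(levels, order):
    """Re-apply boundary conditions on each level, visiting levels in `order`."""
    for lev in order:
        if lev == 0:
            # Coarsest level: toy outer boundary condition from its own interior.
            levels[lev]["boundary"] = levels[lev]["interior"]
        else:
            # Finer level: boundary taken from the next coarser level's boundary.
            levels[lev]["boundary"] = levels[lev - 1]["boundary"]


def run(order_name):
    """Update the finest level, restrict, then traverse in the requested order."""
    levels = make_levels(3, 1.0)
    levels[-1]["interior"] = 2.0  # the finest level has been updated
    restrict(levels)
    n = len(levels)
    if order_name == "coarsest_to_finest":
        order = range(n)
    else:
        order = range(n - 1, -1, -1)
    post_restrict(levels, order)
    return [level["boundary"] for level in levels]


print(run("finest_to_coarsest"))  # finer boundaries still hold stale values
print(run("coarsest_to_finest"))  # all boundaries agree with the restricted data
```

In this toy, traversing from finest to coarsest fills each fine
boundary from a coarse boundary that has not yet been updated, whereas
traversing from coarsest to finest leaves all levels consistent.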

I attach a diff of a possible set of corrections. However:
- This diff does not correct the handling of the TmunuBase variables --
they are still not checkpointed if they have multiple timelevels. I
believe this should be done by thorn TmunuBase itself.
- Similarly, handling checkpointing of the ADMBase variables should be
moved from ML_BSSN_Helper to ADMBase itself.
- I think a new schedule group MoL_PseudoEvolutionBoundaries would
make sense, as this would simplify the schedule of thorns which use
MoL_PseudoEvolution.

-erik

-- 
Erik Schnetter <schnetter at cct.lsu.edu>   http://www.cct.lsu.edu/~eschnett/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DIFF
Type: application/octet-stream
Size: 19388 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20110719/38320f33/attachment-0001.obj