[Users] Checkpoint/recovery inconsistencies

Wed Jul 20 07:28:21 CDT 2011

On 20 Jul 2011, at 03:21, Erik Schnetter wrote:

> Christian Ott, Peter Diener, and I tracked down a set of
> checkpointing/recovery inconsistencies in the Einstein Toolkit. This
> means that many variables had different values after checkpointing and
> then recovering them, which should not be the case. We were careful
> not to use MPI, OpenMP, or any kind of compiler optimisations, so
> these inconsistencies represent real changes.
> 
> We found a series of problems, and we have now one possible
> correction. Unfortunately, this requires changes to several thorns,
> mostly to schedule.ccl declarations, but also to Carpet. In addition,
> we find that it is impossible to re-calculate PseudoEvolution
> variables after recovery -- they have to be checkpointed. In other
> words, variables such as ADMBase, TmunuBase, or other variables with 3
> timelevels need to be checkpointed to be consistent.
> 
> (Many of the inconsistencies are small, and are caused by differences
> in the discretisation error. That is, they will vanish in the
> continuum limit. However, one of the design principles of the Einstein
> Toolkit is that results should be as much as possible independent of
> the number of MPI processes, OpenMP threads, and
> checkpointing/recovery. Disabling checkpointing of such variables
> could be implemented depending on a parameter, which would also need
> to ensure that these variables are then recalculated -- with slightly
> different values -- after recovery. In particular, this would require
> a new schedule bin in Carpet that is executed after recovery for all
> timelevels, or alternatively a new schedule option that requests this
> behaviour from Carpet.)

Did you find cases where the evolved variables had different values after recovery, or only analysis variables?  Can you estimate the order of the errors introduced?  Are the errors introduced only at refinement boundaries?

> 7. The scheduling of various Kranc generated thorns is off in certain places.

Could you elaborate on this?  Does Kranc itself have to be modified?

> - I think a new schedule group MoL_PseudoEvolutionBoundaries would
> make sense, as this would simplify the schedule of thorns which use
> MoL_PseudoEvolution.

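The problem with this is that if you have two functions scheduled in MoL_PseudoEvolution and the second one uses values computed by the first, those values will not yet be correct at the boundaries when the second function runs, because the boundary conditions would only be applied afterwards, in the new group.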
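
To illustrate the concern, here is a toy 1D sketch in plain Python. It is not Cactus or MoL code: the grid, the periodic "boundary condition", and the names apply_boundary, derivative and pseudo_evolve are all made up for this example. The first function computes A from an evolved variable u, and the second computes B from A with a stencil that reads A's ghost points. If the boundary step only runs once at the end, B comes out wrong at the points next to the boundary.

```python
import math

N = 16                    # number of interior points (hypothetical toy grid)
H = 2.0 * math.pi / N     # grid spacing on the periodic domain [0, 2*pi)


def apply_boundary(v):
    """Fill the two ghost points by periodicity (stand-in for a boundary condition)."""
    v[0] = v[N]
    v[N + 1] = v[1]


def derivative(v):
    """Centred first derivative on interior points only; ghost points are left at 0."""
    d = [0.0] * (N + 2)
    for i in range(1, N + 1):
        d[i] = (v[i + 1] - v[i - 1]) / (2.0 * H)
    return d


# "Evolved" variable u = sin(x); index 0 and N + 1 are ghost points.
x = [(i - 1) * H for i in range(N + 2)]
u = [math.sin(xi) for xi in x]
apply_boundary(u)


def pseudo_evolve(boundary_after_each):
    """Run two chained pseudo-evolved calculations: A = du/dx, then B = dA/dx."""
    a = derivative(u)             # first function: fills A in the interior only
    if boundary_after_each:
        apply_boundary(a)         # A's ghost points become valid before B needs them
    b = derivative(a)             # second function: reads A's ghost points
    apply_boundary(b)             # single boundary step at the very end
    return b


# The centred stencil applied twice to sin(x) gives exactly -(sin(H)/H)^2 * sin(x).
s = math.sin(H) / H
expected = [-(s * s) * math.sin(xi) for xi in x]


def errors(b):
    """Absolute error of B at each interior point."""
    return [abs(b[i] - expected[i]) for i in range(1, N + 1)]


err_each = errors(pseudo_evolve(True))
err_deferred = errors(pseudo_evolve(False))

print("max error, boundary applied after each function:", max(err_each))
print("errors at the two edge points, boundary applied only at the end:",
      err_deferred[0], err_deferred[-1])
```

In this toy setup the deferred variant is only wrong at the interior points adjacent to the boundary, since those are the only ones whose stencil reaches into A's unfilled ghost points. A wider stencil or a longer chain of dependent functions would push the error further into the grid.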

-- 
Ian Hinder
ian.hinder at aei.mpg.de