[Users] Checkpoint/recovery inconsistencies

Thu Jul 21 14:10:05 CDT 2011

On Wed, Jul 20, 2011 at 8:28 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>
> On 20 Jul 2011, at 03:21, Erik Schnetter wrote:
>
>> Christian Ott, Peter Diener, and I tracked down a set of
>> checkpointing/recovery inconsistencies in the Einstein Toolkit. This
>> means that many variables had different values after checkpointing and
>> then recovering them, which should not be the case. We were careful
>> not to use MPI, OpenMP, or any kind of compiler optimisations, so
>> these inconsistencies represent real changes.
>>
>> We found a series of problems, and we now have one possible
>> correction. Unfortunately, this requires changes to several thorns,
>> mostly to schedule.ccl declarations, but also to Carpet. In addition,
>> we find that it is impossible to re-calculate PseudoEvolution
>> variables after recovery -- they have to be checkpointed. In other
>> words, variables such as ADMBase, TmunuBase, or other variables with 3
>> timelevels need to be checkpointed to be consistent.
>>
>> (Many of the inconsistencies are small, and are caused by differences
>> in the discretisation error. That is, they will vanish in the
>> continuum limit. However, one of the design principles of the Einstein
>> Toolkit is that results should be as independent as possible of
>> the number of MPI processes, OpenMP threads, and
>> checkpointing/recovery. Disabling checkpointing of such variables
>> could be implemented depending on a parameter, which would also need
>> to ensure that these variables are then recalculated -- with slightly
>> different values -- after recovery. In particular, this would require
>> a new schedule bin in Carpet that is executed after recovery for all
>> timelevels, or alternatively a new schedule option that requests this
>> behaviour from Carpet.)
>
> Did you find cases where the evolved variables had different values
> after recovery, or only analysis variables? Can you estimate the order
> of the errors introduced? Are the errors introduced only at refinement
> boundaries?

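Yes, evolved variables were affected as well. The errors are of order
1e-4. And yes, they are introduced only at refinement boundaries.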

>> 7. The scheduling of various Kranc-generated thorns is off in certain places.
>
> Could you elaborate on this?  Does Kranc itself have to be modified?

I'm not sure yet. It probably doesn't have to be modified, but I'd
like to introduce PseudoEvolutionBoundaries, and would like to use
this group in Kranc (for the *_bc_group).

>> - I think a new schedule group MoL_PseudoEvolutionBoundaries would
>> make sense, as this would simplify the schedule of thorns which use
>> MoL_PseudoEvolution.
>
>
> The problem with this is that if you have two functions scheduled in
> MoL_PseudoEvolution and the second one uses values computed in the
> first, these values will not be correct in the boundaries.

Boundary conditions would be scheduled in both groups. Sometimes only
the boundary conditions are needed, and that is where
PseudoEvolutionBoundaries comes in. The alternative is to schedule the
boundary conditions explicitly in postrestrictinitial, postrestrict,
postregridinitial, and postregrid, which is cumbersome.
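
As a rough sketch of the difference in a schedule.ccl file: the thorn
name MyThorn, the routines MyThorn_Calc and MyThorn_ApplyBCs, and the
group my_vars are made up for illustration, and
MoL_PseudoEvolutionBoundaries is only the proposed group, not something
that exists yet.

```
# Explicit alternative: repeat the boundary routine in every bin where
# only boundary conditions are needed.
schedule MyThorn_ApplyBCs AT CCTK_POSTRESTRICTINITIAL
{
  LANG: C
  SYNC: my_vars
} "Apply boundary conditions after initial restriction"

schedule MyThorn_ApplyBCs AT CCTK_POSTRESTRICT
{
  LANG: C
  SYNC: my_vars
} "Apply boundary conditions after restriction"

schedule MyThorn_ApplyBCs AT CCTK_POSTREGRIDINITIAL
{
  LANG: C
  SYNC: my_vars
} "Apply boundary conditions after initial regridding"

schedule MyThorn_ApplyBCs AT CCTK_POSTREGRID
{
  LANG: C
  SYNC: my_vars
} "Apply boundary conditions after regridding"

# With the proposed group: the calculation stays in MoL_PseudoEvolution,
# and the boundary routine is scheduled in both groups.
schedule MyThorn_ApplyBCs IN MoL_PseudoEvolution AFTER MyThorn_Calc
{
  LANG: C
  SYNC: my_vars
} "Apply boundary conditions after the pseudo-evolution calculation"

schedule MyThorn_ApplyBCs IN MoL_PseudoEvolutionBoundaries
{
  LANG: C
  SYNC: my_vars
} "Apply boundary conditions only"
```

The second form assumes that MoL (or Carpet) takes care of running
MoL_PseudoEvolutionBoundaries in the four bins listed above, so that
individual thorns do not have to.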

-erik

-- 
Erik Schnetter <schnetter at cct.lsu.edu>   http://www.cct.lsu.edu/~eschnett/