[Users] Checkpoint/recovery inconsistencies

Tue Jul 19 21:59:53 CDT 2011

On Tue, Jul 19, 2011 at 10:43 PM, Luca Baiotti
<baiotti at ile.osaka-u.ac.jp> wrote:
> This is really good news! Thank you and congratulations. It was really a
> long-standing problem (and hard to investigate).
>
> Questions:
>
> - In the diff, which version of Carpet do the changes refer to? (All?)

This refers to the Mercurial version.

> - When do you plan to commit the changes to the repositories?

Very soon.

-erik

> Thanks,
>
> Luca
>
>
>
> On 20/7/11 11:21 AM, Erik Schnetter wrote:
>> Christian Ott, Peter Diener, and I tracked down a set of
>> checkpointing/recovery inconsistencies in the Einstein Toolkit. This
>> means that many variables had different values after checkpointing and
>> then recovering them, which should not be the case. We were careful
>> not to use MPI, OpenMP, or any kind of compiler optimisations, so
>> these inconsistencies represent real changes.
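
As an illustration of the kind of check described above, here is a minimal sketch in Python. It compares every variable of an uninterrupted run against the same variable after checkpointing and recovery, bit for bit. The function names, the dictionary layout, and the variable names ("lapse", "constraint") are assumptions made up for this example; this is not the tooling that was actually used.

```python
import struct


def bits(x):
    """Return the raw IEEE-754 bytes of a double, so that comparison is bit-for-bit."""
    return struct.pack("<d", x)


def find_inconsistencies(reference, recovered):
    """Return the sorted names of variables whose values differ between two runs.

    Both arguments map a (hypothetical) variable name to a flat list of
    grid-point values taken at the same iteration.
    """
    differing = []
    for name in sorted(reference):
        ref_vals = reference[name]
        rec_vals = recovered.get(name)
        # A missing variable or a different number of points is also a mismatch.
        if rec_vals is None or len(rec_vals) != len(ref_vals):
            differing.append(name)
            continue
        if any(bits(a) != bits(b) for a, b in zip(ref_vals, rec_vals)):
            differing.append(name)
    return differing


# Made-up data: "constraint" differs only by round-off (0.1 + 0.2 is not 0.3).
reference = {"lapse": [1.0, 0.5], "constraint": [0.3, 0.0]}
recovered = {"lapse": [1.0, 0.5], "constraint": [0.1 + 0.2, 0.0]}
print(find_inconsistencies(reference, recovered))  # prints ['constraint']
```

Comparing raw bytes rather than using a tolerance is deliberate in this sketch: without MPI, OpenMP, or compiler optimisations, even a round-off-level difference counts as a real change.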
>>
>> We found a series of problems, and we now have one possible
>> correction. Unfortunately, this requires changes to several thorns,
>> mostly to schedule.ccl declarations, but also to Carpet. In addition,
>> we find that it is impossible to re-calculate PseudoEvolution
>> variables after recovery -- they have to be checkpointed. In other
>> words, variables such as ADMBase, TmunuBase, or other variables with 3
>> timelevels need to be checkpointed to be consistent.
>>
>> (Many of the inconsistencies are small, and are caused by differences
>> in the discretisation error. That is, they will vanish in the
>> continuum limit. However, one of the design principles of the Einstein
>> Toolkit is that results should be, as far as possible, independent of
>> the number of MPI processes, OpenMP threads, and
>> checkpointing/recovery. Disabling checkpointing of such variables
>> could be implemented depending on a parameter, which would also need
>> to ensure that these variables are then recalculated -- with slightly
>> different values -- after recovery. In particular, this would require
>> a new schedule bin in Carpet that is executed after recovery for all
>> timelevels, or alternatively a new schedule option that requests this
>> behaviour from Carpet.)
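
The run-time decision discussed above could be sketched roughly as follows, in Python. Everything here is hypothetical: the function name, the `checkpoint_enabled` parameter, and the returned labels are invented for illustration and are not existing Cactus or Carpet options. The sketch also assumes that a variable with a single timelevel can simply be recalculated after recovery, which the discussion above implies but does not state outright.

```python
def recovery_action(timelevels, checkpoint_enabled=True):
    """Decide how a pseudo-evolved variable gets its values back after recovery.

    timelevels         -- number of active timelevels, known only at run time
    checkpoint_enabled -- hypothetical parameter; False trades exact
                          reproducibility for smaller checkpoint files
    """
    if timelevels <= 1:
        # Assumption: with no past timelevels there is nothing that
        # recalculation could get wrong.
        return "recalculate"
    if checkpoint_enabled:
        # Past timelevels cannot be recalculated consistently, so store them.
        return "checkpoint"
    # Would need a new schedule bin (or schedule option) that runs after
    # recovery on all timelevels; values then differ slightly.
    return "recalculate_all_timelevels"


print(recovery_action(3))
print(recovery_action(3, checkpoint_enabled=False))
print(recovery_action(1))
```

The third branch is the case that would need new support in Carpet; the first two correspond to what the proposed correction already does.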
>>
>>
>>
>> The necessary changes to correct these inconsistencies are in detail:
>>
>> 1. All variables with 3 timelevels need to be checkpointed. This
>> affects ADMBase, TmunuBase, ML_ADMConstraints, ML_ADMQuantities, and
>> the BSSN constraints in ML_BSSN. Since this depends on the number of
>> timelevels, care has to be taken to make the right decision at run
>> time.
>>
>> 2. The schedule group MoL_PseudoEvolution must not be scheduled in
>> post_recover_variables -- since these variables are checkpointed, they
>> do not need to be recalculated.
>>
>> 3. Carpet cannot execute the post_recover_variables bin on the past
>> timelevels, since applying boundary conditions to past timelevels
>> includes prolongation, and the time interpolation necessary for this
>> would have a different discretisation error.
>>
>> 4. Carpet traversed the PostRestrict bin in the incorrect order. It
>> traversed from finest to coarsest (the same order in which restriction
>> has to be applied), but because fine grid boundaries may be
>> interpolated from coarse grid boundaries, this bin must be traversed
>> from coarsest to finest.
>>
>> 5. The scheduling of GRHydro and HydroBase is off in certain places.
>>
>> 6. The scheduling of NaNChecker leaves the NaNmask uninitialised after recovery.
>>
>> 7. The scheduling of various Kranc-generated thorns is off in certain places.
>>
>> 8. Certain schedule items requiring the ADMBase variables are executed
>> in the incorrect order, i.e. before the ADMBase variables were
>> available.
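
To make the traversal-order issue in item 4 concrete, here is a toy model in Python. It is not Carpet code. It assumes one boundary value per refinement level, with the coarsest level (index 0) taking its value from a physical boundary condition and every finer level copying from the next coarser one as a stand-in for prolongation. All names are invented for this example.

```python
def update_boundaries(order, num_levels, physical_value=1.0):
    """Update one boundary value per refinement level in the given order.

    Level 0 is the coarsest. All boundaries start out stale (0.0).
    """
    boundary = [0.0] * num_levels
    for level in order:
        if level == 0:
            # Coarsest level: apply the physical boundary condition.
            boundary[level] = physical_value
        else:
            # Finer level: "interpolate" from the next coarser boundary.
            boundary[level] = boundary[level - 1]
    return boundary


NUM_LEVELS = 3

# Finest to coarsest: each fine level reads a coarse value not yet updated.
fine_to_coarse = update_boundaries(range(NUM_LEVELS - 1, -1, -1), NUM_LEVELS)

# Coarsest to finest: each fine level sees an up-to-date coarse boundary.
coarse_to_fine = update_boundaries(range(NUM_LEVELS), NUM_LEVELS)

print(fine_to_coarse)  # prints [1.0, 0.0, 0.0]
print(coarse_to_fine)  # prints [1.0, 1.0, 1.0]
```

In this toy model the finest-to-coarsest order leaves stale values on every level but the coarsest, which is the dependency that makes coarsest-to-finest the required order for PostRestrict.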
>>
>> I attach a diff of a possible set of corrections. However:
>> - This diff does not correct the handling of the TmunuBase variables -- they
>> are still not checkpointed if they have multiple timelevels. I believe
>> this should be done by thorn TmunuBase itself.
>> - Similarly, handling checkpointing of the ADMBase variables should be
>> moved from ML_BSSN_Helper to ADMBase itself.
>> - I think a new schedule group MoL_PseudoEvolutionBoundaries would
>> make sense, as this would simplify the schedule of thorns which use
>> MoL_PseudoEvolution.
>>
>> -erik

-- 
Erik Schnetter <schnetter at cct.lsu.edu>   http://www.cct.lsu.edu/~eschnett/