[Users] restart failure from checkpoint

Ian Hinder ian.hinder at aei.mpg.de
Mon Feb 14 10:36:32 CST 2011


On 14 Feb 2011, at 15:20, Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] wrote:

> Hi Peter (and others).
> 
> I think we've been having similar checkpoint recovery issues, and we'll
> try out the parameter you mention, in case it helps.
> 
> But why does the processor decomposition have to change? Can this be
> overridden?
> 
> It sounds very inefficient to force something different at checkpoint
> recovery when presumably (at least for our evolutions) all that happened
> is we ran out of queue time and had to resubmit the job.

It is not intentional - see the discussion on the Cactus mailing list:

 	[Developers] Warn if grid structure changes upon recovery
	02-Jun-2010
	http://cactuscode.org/pipermail/developers/2010-June/005980.html

As I recall, we had some ideas about how to fix this, but probably no one has had time to implement them.

> 
> 
> Bernard
> 
> On 2/9/11 7:15 PM, "Peter Diener" <diener at cct.lsu.edu> wrote:
> 
>> Hi Hee Il,
>> 
>> It happens occasionally that Carpet decides to use a different processor
>> decomposition on restart from a checkpoint file. That means that a given
>> processor may need to read from more than one checkpoint file in order to
>> read all its relevant data. This is what the warnings below indicate.
>> Note that these are just warnings and do not mean that the restart
>> fails. CarpetIOHDF5 will automatically open other files and find the
>> additional data it needs. This will use more memory, since it will have
>> to read all the meta-data from all the other files until it finds all
>> the data it needs. On some machines this may cause the job to abort due
>> to running out of memory. If this is the case for you, you can try to
>> set the parameter:
>> 
>> CarpetIOHDF5::open_one_input_file_at_a_time = yes
>> 
>> This will cause CarpetIOHDF5 to open only one file at a time while
>> looking for the data, which reduces the memory usage. It will also slow
>> the recovery down (how much depends on how many checkpoint files it
>> needs to access), since it takes time to read in the meta-data. But with
>> a bit of patience and perseverance it should finally succeed.
>> 
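
To illustrate why a changed decomposition makes a process read more than
one checkpoint file, here is a toy 1D model in Python. It is only a sketch
and not how Carpet actually distributes its grids: the function names, the
100-point domain and the shifted boundaries are all made up for
illustration. Each rank is assumed to have written its own index range to
its own file; on recovery, a rank needs every file whose range overlaps its
new range.

```python
def split_evenly(npoints, nprocs):
    """Return half-open index ranges [lo, hi), one per rank (toy model)."""
    base, extra = divmod(npoints, nprocs)
    ranges = []
    lo = 0
    for rank in range(nprocs):
        hi = lo + base + (1 if rank < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def files_needed(new_ranges, old_ranges):
    """For each new rank, list the old ranks (files) whose range overlaps."""
    needed = []
    for new_lo, new_hi in new_ranges:
        overlapping = [
            old_rank
            for old_rank, (old_lo, old_hi) in enumerate(old_ranges)
            if old_lo < new_hi and new_lo < old_hi
        ]
        needed.append(overlapping)
    return needed


# Hypothetical layouts: 4 ranks over 100 points.
old = split_evenly(100, 4)                       # layout at checkpoint time
new = [(0, 30), (30, 50), (50, 70), (70, 100)]   # assumed layout on recovery

same = files_needed(old, old)      # unchanged decomposition
changed = files_needed(new, old)   # shifted decomposition

for rank, files in enumerate(changed):
    print(f"rank {rank} reads files {files}")
```

In this toy case an unchanged layout needs exactly one file per rank, while
the shifted layout makes ranks 0 and 3 read two files each.
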
>> Cheers,
>> 
>>  Peter
>> 
>> 
>> On Thu, 10 Feb 2011, Hee Il Kim wrote:
>> 
>>> Hi,
>>> 
>>> I keep encountering restart failures with the following messages. This
>>> seems to happen quite often at the moment (3 out of 6 restarts). It
>>> doesn't seem to be related to NaNs. Is there any way to check my
>>> system's I/O with Cactus, other than the I/O benchmark given on the
>>> Cactus homepage?
>>> 
>>> Thanks,
>>> 
>>> Hee Il
>>> 
>>> #################
>>> ....
>>> INFO (ADMMacros): Spatial finite differencing order: 4
>>> INFO (Time): Timestep set to 0.0703125 (courant_static)
>>> INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 4
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> ...
>>> ...
>>> 
>>> 
>>> 
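
For an overview of which variables are affected, something like the
following Python sketch could summarise warnings of the kind quoted above.
The function name and the regular expression are assumptions based only on
the message format shown in this thread; they are not part of Cactus or
CarpetIOHDF5. The text is flattened first so that warnings wrapped across
several lines by a mail client or terminal still match.

```python
import re
from collections import defaultdict

# Assumed message format, taken from the warnings quoted in this thread.
PATTERN = re.compile(
    r"Variable (\S+) on rl (\d+) and tl (\d+) not read completely"
)


def summarize_incomplete_reads(log_text):
    """Map (variable, reflevel) -> sorted list of timelevels that were flagged."""
    flat = " ".join(log_text.split())  # collapse all whitespace, undoing wrapping
    summary = defaultdict(set)
    for name, rl, tl in PATTERN.findall(flat):
        summary[(name, int(rl))].add(int(tl))
    return {key: sorted(tls) for key, tls in summary.items()}


# Small sample in the wrapped form that mail clients tend to produce.
sample = """
WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 0
not
read completely. Will have to look for it in other files.
WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl
1
not read completely. Will have to look for it in other files.
"""

result = summarize_incomplete_reads(sample)
for (name, rl), tls in sorted(result.items()):
    print(f"{name} rl={rl} timelevels={tls}")
```

A long list of variables on the same refinement level would be consistent
with a changed decomposition on that level rather than with a damaged file.
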
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
> 

-- 
Ian Hinder
ian.hinder at aei.mpg.de


