[Users] restart failure from checkpoint
Ian Hinder
ian.hinder at aei.mpg.de
Mon Feb 14 10:36:32 CST 2011
On 14 Feb 2011, at 15:20, Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] wrote:
> Hi Peter (and others).
>
> I think we've been having similar checkpoint recovery issues, and we'll
> try out the parameter you mention, in case it helps.
>
> But why does the processor decomposition have to change? Can this be
> overridden?
>
> It sounds very inefficient to force something different at checkpoint
> recovery when presumably (at least for our evolutions) all that happened
> is we ran out of queue time and had to resubmit the job.
It is not intentional - see the discussion on the Cactus mailing list:
[Developers] Warn if grid structure changes upon recovery
02-Jun-2010
http://cactuscode.org/pipermail/developers/2010-June/005980.html
As I recall, we had some ideas about how to fix this, but probably no one has had time to implement them.
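
To get an overview of how much data a recovery had to fetch from other files, the CarpetIOHDF5 warnings quoted in this thread can be tallied per variable. The following is only a minimal Python sketch and is not part of Cactus or Carpet: the function name summarize_warnings is made up for illustration, and the regular expression assumes that each warning sits on a single line, as it would in the log file itself.

```python
import re
from collections import defaultdict

# Pattern for the CarpetIOHDF5 warning text as quoted in this thread.
# Assumption: the whole warning is on one line of the log.
WARNING_RE = re.compile(
    r"\(CarpetIOHDF5\): Variable (?P<var>\S+) on rl (?P<rl>\d+) "
    r"and tl (?P<tl>\d+) not read completely"
)


def summarize_warnings(lines):
    """Map (variable, refinement level) to the sorted affected time levels."""
    found = defaultdict(set)
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            key = (match.group("var"), int(match.group("rl")))
            found[key].add(int(match.group("tl")))
    return {key: sorted(tls) for key, tls in found.items()}
```

Passing an open log file (or any iterable of lines) to summarize_warnings returns a dictionary that shows which variables and refinement levels were incomplete in the first file read. It says nothing about whether the recovery eventually succeeded.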
>
>
> Bernard
>
> On 2/9/11 7:15 PM, "Peter Diener" <diener at cct.lsu.edu> wrote:
>
>> Hi Hee Il,
>>
>> It happens occasionally that Carpet decides to use a different processor
>> decomposition on restart from a checkpoint file. That means that a given
>> processor may need to read from more than one checkpoint file in order to
>> read all its relevant data. This is what the warnings below indicate.
>> Note that these are just warnings and do not mean that the restart
>> has failed.
>> CarpetIOHDF5 will automatically open other files and find the additional
>> data it needs. This will use more memory, since it will have to read all
>> the meta-data from all the other files until it finds all the data it
>> needs. On some machines this may cause the job to abort due to running
>> out of memory. If this is the case for you, you can try to set the parameter:
>>
>> CarpetIOHDF5::open_one_input_file_at_a_time = yes
>>
>> This will cause CarpetIOHDF5 to open only one file at a time while looking
>> for the data, which reduces the memory usage. It will also slow the
>> recovery down (how much depends on how many checkpoint files it needs to
>> access), since it takes time to read in the meta-data. But with a bit of
>> patience and perseverance it should finally succeed.
>>
>> Cheers,
>>
>> Peter
>>
>>
>> On Thu, 10 Feb 2011, Hee Il Kim wrote:
>>
>>> Hi,
>>>
>>> I keep encountering restart failures with the following messages. This
>>> happens quite often at the moment (3 out of 6 restarts). It doesn't seem
>>> to be related to NaNs. Is there any way to check my system's I/O with
>>> Cactus, other than the I/O benchmark given on the Cactus homepage?
>>>
>>> Thanks,
>>>
>>> Hee Il
>>>
>>> #################
>>> ....
>>> INFO (ADMMacros): Spatial finite differencing order: 4
>>> INFO (Time): Timestep set to 0.0703125 (courant_static)
>>> INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 4
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>>> ...
>>> ...
>>>
>>>
>>>
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
--
Ian Hinder
ian.hinder at aei.mpg.de