[Users] restart failure from checkpoint

Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] bernard.j.kelly at nasa.gov
Mon Feb 14 08:20:49 CST 2011


Hi Peter (and others).

I think we've been having similar checkpoint recovery issues, and we'll
try out the parameter you mention, in case it helps.

But why does the processor decomposition have to change? Can this be
overridden?

It sounds very inefficient to force something different at checkpoint
recovery when presumably (at least for our evolutions) all that happened
is we ran out of queue time and had to resubmit the job.
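
To make concrete what a changed decomposition means at recovery (the
situation Peter describes below), here is a minimal sketch. It is only
an illustration under stated assumptions: a 1D grid, one checkpoint file
per writing process, and made-up helper names (split, files_needed) that
are not part of Carpet or CarpetIOHDF5. The real decomposition is 3D and
done per refinement level.

```python
# Hypothetical 1D illustration; none of these names come from Carpet.

def split(npoints, nprocs):
    """Split the index range [0, npoints) into nprocs contiguous chunks."""
    base, extra = divmod(npoints, nprocs)
    chunks = []
    start = 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

def files_needed(old_chunks, new_chunks):
    """For each new rank, list the old ranks (assumed: one checkpoint
    file per old rank) whose chunk overlaps the new rank's chunk."""
    needed = []
    for lo, hi in new_chunks:
        overlapping = [rank for rank, (olo, ohi) in enumerate(old_chunks)
                       if olo < hi and lo < ohi]
        needed.append(overlapping)
    return needed

npoints = 120
written = split(npoints, 4)  # decomposition used when checkpointing
same = split(npoints, 4)     # identical decomposition on restart
# Same number of processes, but a different split on restart (assumed).
changed = [(0, 40), (40, 60), (60, 100), (100, 120)]

print(files_needed(written, same))     # each rank reads only its own file
print(files_needed(written, changed))  # some ranks need two files
```

With an identical decomposition every rank finds all its data in its own
file; with the changed one, ranks 0 and 2 each have to open a second
file, which is the case the warnings below report.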


Bernard

On 2/9/11 7:15 PM, "Peter Diener" <diener at cct.lsu.edu> wrote:

>Hi Hee Il,
>
>It happens occasionally that Carpet decides to use a different processor
>decomposition on restart from a checkpoint file. That means that a given
>processor may need to read from more than one checkpoint file in order to
>read all its relevant data. This is what the warnings below indicate.
>Note that these are just warnings and do not mean that the restart
>fails. CarpetIOHDF5 will automatically open other files and find the
>additional data it needs. This will use more memory, since it will have
>to read all the meta-data from all the other files until it finds all
>the data it needs. On some machines this may cause the job to abort due
>to running out of memory. If this is the case for you, you can try to
>set the parameter:
>
>CarpetIOHDF5::open_one_input_file_at_a_time = yes
>
>This will cause CarpetIOHDF5 to only open one file at a time while
>looking for the data and will reduce the memory usage. It will also slow
>the recovery down (how much depends on how many checkpoint files it
>needs to access) since it takes time to read in the meta-data. But with
>a bit of patience and perseverance it should finally succeed.
>
>Cheers,
>
>   Peter
>
>
>On Thu, 10 Feb 2011, Hee Il Kim wrote:
>
>> Hi,
>> 
>> I keep encountering restart failures with the following messages. They
>> seem to happen quite often at the moment (3 out of 6 restarts). They
>> don't seem to be related to NaNs. Is there any way to check my system's
>> I/O with Cactus, other than the I/O benchmark given on the Cactus
>> homepage?
>> 
>> Thanks,
>> 
>> Hee Il
>> 
>> #################
>> ....
>> INFO (ADMMacros): Spatial finite differencing order: 4
>> INFO (Time): Timestep set to 0.0703125 (courant_static)
>> INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 4
>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 2 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 0 not read completely. Will have to look for it in other files.
>> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 1 not read completely. Will have to look for it in other files.
>> ...
>> ...
>> 
>> 
>>
>_______________________________________________
>Users mailing list
>Users at einsteintoolkit.org
>http://lists.einsteintoolkit.org/mailman/listinfo/users


