[Users] restart failure from checkpoint

Peter Diener diener at cct.lsu.edu
Wed Feb 9 18:15:13 CST 2011


Hi Hee Il,

It happens occasionally that Carpet decides to use a different processor 
decomposition on restart from a checkpoint file. That means that a given 
processor may need to read from more than one checkpoint file in order to 
read all its relevant data. This is what the warnings below indicate. 
Note that these are just warnings and do not mean that the restart has failed. 
CarpetIOHDF5 will automatically open other files and find the additional
data it needs. This will use more memory, since it will have to read all 
the meta-data from all the other files until it finds all the data it 
needs. On some machines this may cause the job to abort due to running out 
of memory. If this is the case for you, you can try to set the parameter:

CarpetIOHDF5::open_one_input_file_at_a_time = yes

This will cause CarpetIOHDF5 to open only one file at a time while looking 
for the data and will reduce the memory usage. It will also slow the 
recovery down (how much depends on how many checkpoint files it needs to 
access) since it takes time to read in the meta-data. But with a bit of 
patience and perseverance it should finally succeed.
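
For orientation, the recovery part of a parameter file might look roughly 
like the sketch below. Only open_one_input_file_at_a_time comes from the 
discussion above; the IO::recover and IO::recover_dir settings and the 
directory name are assumptions for illustration, so check them against 
your own parameter file rather than copying them as they are.

```
# Sketch of a recovery block in a Cactus parameter file.
# Assumed for illustration: the recovery mode and the directory name.
IO::recover     = "autoprobe"
IO::recover_dir = "checkpoints"    # hypothetical directory

# From the discussion above: open checkpoint files one at a time
# to reduce memory usage during recovery (slower, but lighter).
CarpetIOHDF5::open_one_input_file_at_a_time = yes
```

If the job was not running out of memory on restart, the last line should 
not be needed; the warnings alone are harmless.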

Cheers,

   Peter


On Thu, 10 Feb 2011, Hee Il Kim wrote:

> Hi,
> 
> I have been encountering restart failures with the following messages. This
> happens quite often at the moment (3 out of 6 restarts). It doesn't seem to
> be related to NaNs. Is there any way to check my system I/O with Cactus,
> other than the I/O benchmark given on the Cactus homepage?
> 
> Thanks,
> 
> Hee Il
> 
> #################
> ....
> INFO (ADMMacros): Spatial finite differencing order: 4
> INFO (Time): Timestep set to 0.0703125 (courant_static)
> INFO (CarpetIOHDF5): reading grid variables on mglevel 0 reflevel 4
> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and
> tl 0 not read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and
> tl 1 not read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable AHFINDERDIRECT::ahmask on rl 4 and
> tl 2 not read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 0 not
> read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 1 not
> read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::dens on rl 4 and tl 2 not
> read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 0 not
> read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 1 not
> read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::tau on rl 4 and tl 2 not
> read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 0
> not read completely. Will have to look for it in other files.
> WARNING[L1,P0] (CarpetIOHDF5): Variable GRHYDRO::scon[0] on rl 4 and tl 1
> not read completely. Will have to look for it in other files.
> ...
> ...
> 
> 
>

