[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Mon Mar 11 06:57:09 CDT 2013
#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
Reporter: hinder | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Cactus | Version:
Resolution: | Keywords:
---------------------+------------------------------------------------------
Comment (by hinder):
Yes, simfactory should be modified to implement this feature. I just
created #1286 to track this.
Summarising the discussion:
* A simple workaround for this problem is to checkpoint only at
termination, to terminate based on walltime, and to use shorter
walltimes to reduce the amount of time wasted if a problem occurs. For
this to be usable, job chaining must be working on the machine.
* The use of a common "checkpoints" directory has no useful effect and
can be dropped, because Cactus only deletes checkpoint files written
by the current job.
* Cactus could write a "tag" file before any output operation that
could lose data if a write error occurs. This could be provided by a
flesh or aliased function, e.g.
CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite. The semantics would be that
if an error occurs between these two functions being called (leaving
the tag in place), the simulation output (possibly just this one file)
should be considered invalid. Writers of HDF5 files would be modified
to call these functions; a sketch of one possible tag-file
implementation is given after this list.
* HDF5 files could be written "safely": the file would be copied to a
temporary file first, then modified, then moved back atomically (see
the second sketch after this list). This incurs a disk-space and speed
penalty, especially for 3D output, but the penalty might not be large
(it should be measured), and the scheme avoids having to ignore
certain checkpoint files and spend extra CPU time recomputing data. In
this case, the CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite functions
would not be called.
* SimFactory could be enhanced to "expire" checkpoints during cleanup
(i.e. between restarts) according to a particular policy, e.g. to
reduce the disk space used. It could check the tags written by Cactus
to determine which checkpoint files to remove and which to link to the
new restart for recovery. It would never recover from a checkpoint
file unless all of the data in it had been determined to be "good".
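
To make the semantics of the proposed tag mechanism concrete, here is
a minimal C sketch. CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite do not
exist in the flesh yet; the names, signatures and the
"<filename>.unsafe" tag-file convention below are assumptions for
illustration only.

    /* Illustrative only: one possible tag-file implementation of the
     * proposed CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite semantics. */
    #include <stdio.h>

    /* Create "<filename>.unsafe" before a write that could lose data. */
    int CCTK_StartUnsafeWrite(const char *filename)
    {
      char tagname[4096];
      snprintf(tagname, sizeof tagname, "%s.unsafe", filename);
      FILE *tag = fopen(tagname, "w");
      if (!tag)
        return -1;              /* could not create the tag */
      fclose(tag);
      return 0;
    }

    /* Remove the tag once the write has completed successfully.  If
     * the job dies in between, the tag stays behind and marks the
     * output (possibly just this one file) as suspect. */
    int CCTK_EndUnsafeWrite(const char *filename)
    {
      char tagname[4096];
      snprintf(tagname, sizeof tagname, "%s.unsafe", filename);
      return remove(tagname);
    }

    /* Intended use in an HDF5 writer (illustrative):
     *
     *   CCTK_StartUnsafeWrite(filename);
     *   ... open, modify and close the HDF5 file ...
     *   CCTK_EndUnsafeWrite(filename);
     */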
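
The "safe write" scheme is copy, modify, rename. A minimal C sketch,
again for illustration only: safe_update, copy_file and the writer
callback are hypothetical names, a real implementation would live in
the HDF5 writers, and rename() is atomic only when the temporary and
final files are on the same file system.

    #include <errno.h>
    #include <stdio.h>

    /* Byte-for-byte copy of 'src' to 'dst'; returns 0 on success.  A
     * missing 'src' is not an error: the writer then creates a fresh
     * file. */
    static int copy_file(const char *src, const char *dst)
    {
      FILE *in = fopen(src, "rb");
      if (!in)
        return errno == ENOENT ? 0 : -1;
      FILE *out = fopen(dst, "wb");
      if (!out) {
        fclose(in);
        return -1;
      }
      char buf[1 << 16];
      size_t n;
      int status = 0;
      while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n) {
          status = -1;
          break;
        }
      if (ferror(in))
        status = -1;
      fclose(in);
      if (fclose(out) != 0)
        status = -1;
      return status;
    }

    /* Copy 'filename' to a temporary name, let 'writer' (standing in
     * for the HDF5 output routine) modify the copy, then commit the
     * result with an atomic rename().  On failure the original file
     * is left untouched. */
    int safe_update(const char *filename,
                    int (*writer)(const char *tmpname, void *data),
                    void *data)
    {
      char tmpname[4096];
      snprintf(tmpname, sizeof tmpname, "%s.tmp", filename);

      if (copy_file(filename, tmpname) != 0)
        return -1;
      if (writer(tmpname, data) != 0) {
        remove(tmpname);           /* write failed: discard the copy */
        return -1;
      }
      return rename(tmpname, filename);  /* atomic replace */
    }
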
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:9>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit