[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Mon Mar 11 06:57:09 CDT 2013
#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
Reporter: hinder | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Cactus | Version:
Resolution: | Keywords:
---------------------+------------------------------------------------------
Comment (by hinder):
Yes, simfactory should be modified to implement this feature. I just
created #1286 to track this.
Summarising the discussion:
* A simple workaround for this problem is to checkpoint only at
termination, to terminate based on walltime, and to use shorter
walltimes to reduce the amount of time wasted if a problem occurs. For
this to be usable, job chaining must be working on the machine.
* The use of a common "checkpoints" directory has no useful effect and
can be dropped, because Cactus only deletes checkpoint files written
by the current job.
* Cactus could write a "tag" file before any output operation that
could lose data if a write error occurs. This could be provided by a
flesh or aliased function, e.g.
CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite. The semantics would be that
if an error occurs between these two functions being called (leaving
the tag in place), the simulation output (possibly just this one file)
should be considered invalid. Writers of HDF5 files would be modified
to call these functions; a sketch of one possible tag-file
implementation is given after this list.
* HDF5 files could be written "safely": the file would be copied to a
temporary file first, then modified, then moved back atomically (see
the second sketch after this list). This incurs a disk-space and speed
penalty, especially for 3D output, but the penalty might not be large
(it should be measured), and the scheme avoids having to ignore
certain checkpoint files and spend extra CPU time recomputing data. In
this case, the CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite functions
would not be called.
* SimFactory could be enhanced to "expire" checkpoints during cleanup
(i.e. between restarts) according to a particular policy, e.g. to
reduce the disk space used. It could check the tags written by Cactus
to determine which checkpoint files to remove and which to link to the
new restart for recovery. It would never recover from a checkpoint
file unless all of the data in it had been determined to be "good".
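
To make the semantics of the proposed tag mechanism concrete, here is
a minimal C sketch. CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite do not
exist in the flesh yet; the names, signatures and the
"<filename>.unsafe" tag-file convention below are assumptions for
illustration only.

    /* Illustrative only: one possible tag-file implementation of the
     * proposed CCTK_StartUnsafeWrite/CCTK_EndUnsafeWrite semantics. */
    #include <stdio.h>

    /* Create "<filename>.unsafe" before a write that could lose data. */
    int CCTK_StartUnsafeWrite(const char *filename)
    {
      char tagname[4096];
      snprintf(tagname, sizeof tagname, "%s.unsafe", filename);
      FILE *tag = fopen(tagname, "w");
      if (!tag)
        return -1;              /* could not create the tag */
      fclose(tag);
      return 0;
    }

    /* Remove the tag once the write has completed successfully.  If
     * the job dies in between, the tag stays behind and marks the
     * output (possibly just this one file) as suspect. */
    int CCTK_EndUnsafeWrite(const char *filename)
    {
      char tagname[4096];
      snprintf(tagname, sizeof tagname, "%s.unsafe", filename);
      return remove(tagname);
    }

    /* Intended use in an HDF5 writer (illustrative):
     *
     *   CCTK_StartUnsafeWrite(filename);
     *   ... open, modify and close the HDF5 file ...
     *   CCTK_EndUnsafeWrite(filename);
     */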
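
The "safe write" scheme is copy, modify, rename. A minimal C sketch,
again for illustration only: safe_update, copy_file and the writer
callback are hypothetical names, a real implementation would live in
the HDF5 writers, and rename() is atomic only when the temporary and
final files are on the same file system.

    #include <errno.h>
    #include <stdio.h>

    /* Byte-for-byte copy of 'src' to 'dst'; returns 0 on success.  A
     * missing 'src' is not an error: the writer then creates a fresh
     * file. */
    static int copy_file(const char *src, const char *dst)
    {
      FILE *in = fopen(src, "rb");
      if (!in)
        return errno == ENOENT ? 0 : -1;
      FILE *out = fopen(dst, "wb");
      if (!out) {
        fclose(in);
        return -1;
      }
      char buf[1 << 16];
      size_t n;
      int status = 0;
      while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n) {
          status = -1;
          break;
        }
      if (ferror(in))
        status = -1;
      fclose(in);
      if (fclose(out) != 0)
        status = -1;
      return status;
    }

    /* Copy 'filename' to a temporary name, let 'writer' (standing in
     * for the HDF5 output routine) modify the copy, then commit the
     * result with an atomic rename().  On failure the original file
     * is left untouched. */
    int safe_update(const char *filename,
                    int (*writer)(const char *tmpname, void *data),
                    void *data)
    {
      char tmpname[4096];
      snprintf(tmpname, sizeof tmpname, "%s.tmp", filename);

      if (copy_file(filename, tmpname) != 0)
        return -1;
      if (writer(tmpname, data) != 0) {
        remove(tmpname);           /* write failed: discard the copy */
        return -1;
      }
      return rename(tmpname, filename);  /* atomic replace */
    }
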
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:9>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit