[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue Mar 5 08:58:45 CST 2013
#1283: Missing data in HDF5 files
--------------------+-------------------------------------------------------
Reporter: hinder | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Cactus | Version:
Keywords: |
--------------------+-------------------------------------------------------
If a simulation runs out of disk space while writing an HDF5 file, the
simulation will terminate. The HDF5 file being written might then be
corrupt, and all data in it may be irretrievable. In that case,
restarting from the last-written checkpoint file can leave a "gap" in the
data, covering the period between the start of the failed restart and the
last checkpoint that restart wrote.
Steps to reproduce:
• Start a simulation which checkpoints periodically and consists of several restarts
• Keep all checkpoint files
• Restart 0000 completes successfully and checkpoints at iteration i1
• Restart 0001 checkpoints once, after some evolution, at iteration i2
• Restart 0001 terminates abnormally while writing an HDF5 output file at iteration i3
• The output file is corrupted and unrecoverable, so it contains no usable data from iteration i1 to iteration i3
• Restart 0002 starts at iteration i2, since this is the last available checkpoint
• The simulation continues to the end, but the data from the corrupted HDF5 file between iterations i1 and i2 is permanently lost
Possible solutions:
1. Write HDF5 files safely, e.g. by first copying the file to a
new temporary file, performing the write on the copy, and then atomically
moving the temporary file over the original. The original file would then
survive a crash during the write of the new file, but the extra copy could
be very expensive for 3D output files. (A rough sketch of this approach
follows the list.)
2. Start a new set of HDF5 files after each checkpoint. This
seems to be the simplest and most efficient solution, but it requires
readers of HDF5 files to be modified to take the additional files into
account.
3. On recovery, check the consistency of all HDF5 files from the
previous restart(s), and recover from the latest checkpoint file for
which all earlier HDF5 files are valid. We could use code that checks the
HDF5 file itself, or some other flagging mechanism that indicates the HDF5
writes completed successfully; e.g. we could rename the HDF5 file to .tmp
during writes and rename it back after a successful write. This is
complex and requires Cactus or simfactory to look into previous restarts;
it also applies only to HDF5 files and requires breaking several
abstraction barriers. (A rough sketch of such a validity check also
follows the list.)
4. Wait for HDF5 journalling support. As far as I know, only
metadata journalling is planned, which is probably not enough, and in any
case the HDF5 developers are not actively working on the next version at
the moment due to lack of funding.
5. Checkpoint only on termination of the simulation.
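
To make option 1 concrete, here is a minimal sketch of a
write-to-a-copy-then-rename wrapper. It assumes the output file already
exists and that the temporary copy lives on the same filesystem, so that
rename(2) is atomic; safe_hdf5_append, copy_file and the writer callback
are hypothetical names for illustration, not existing Cactus or HDF5 API:

  /* Sketch of option 1 (hypothetical helper, not existing Cactus code):
     write into a temporary copy of the output file and atomically rename
     it over the original, so the previous version survives a crash
     mid-write.  Assumes the output file already exists. */
  #include <hdf5.h>
  #include <stdio.h>

  /* Copy 'src' to 'dst' byte for byte; returns 0 on success. */
  static int copy_file(const char *src, const char *dst)
  {
    FILE *in = fopen(src, "rb"), *out = fopen(dst, "wb");
    char buf[1 << 16];
    size_t n;
    int ok = (in != NULL && out != NULL);

    while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0)
      ok = (fwrite(buf, 1, n, out) == n);
    if (in) fclose(in);
    if (out) ok = (fclose(out) == 0) && ok;
    return ok ? 0 : -1;
  }

  /* Append output to a temporary copy of 'filename' via the caller's
     'writer' callback, then rename(2) the copy over the original.
     The rename is atomic on POSIX when both names are on the same
     filesystem, so a crash at any point leaves either the old or the
     new file intact, never a half-written one. */
  int safe_hdf5_append(const char *filename,
                       int (*writer)(hid_t file, void *data), void *data)
  {
    char tmpname[4096];
    hid_t file;
    int status;

    snprintf(tmpname, sizeof tmpname, "%s.tmp", filename);
    if (copy_file(filename, tmpname) != 0) return -1;

    file = H5Fopen(tmpname, H5F_ACC_RDWR, H5P_DEFAULT);
    if (file < 0) return -1;
    status = writer(file, data);
    if (H5Fclose(file) < 0 || status != 0) return -1;

    /* Only a fully successful write replaces the original file. */
    return rename(tmpname, filename) == 0 ? 0 : -1;
  }

As noted above, copying a large 3D output file before every write is where
the cost of this scheme would show up.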
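
For the consistency check in option 3, something along the following lines
might be enough to decide whether an output file from a previous restart is
usable. hdf5_file_is_valid is a hypothetical name; the check simply combines
the standard H5Fis_hdf5 signature test with an attempt to open the file
read-only:

  /* Sketch of the consistency check in option 3 (hypothetical helper).
     A file is treated as usable only if it carries the HDF5 signature
     and can actually be opened, which catches truncated or otherwise
     corrupted files left behind by an interrupted write. */
  #include <hdf5.h>

  int hdf5_file_is_valid(const char *filename)
  {
    htri_t is_hdf5 = H5Fis_hdf5(filename);  /* checks the file signature */
    hid_t file;

    if (is_hdf5 <= 0)
      return 0;                 /* missing, unreadable, or not an HDF5 file */

    file = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0)
      return 0;                 /* signature present but metadata corrupt */

    H5Fclose(file);
    return 1;
  }

The rename-to-.tmp variant mentioned in option 3 would make the check even
cheaper: the presence of the file under its final name would itself indicate
that the last write completed, while a leftover .tmp file would flag an
interrupted one.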
In reality, we do not keep all checkpoint files. I usually keep just the
last checkpoint file. I believe that a Cactus simulation will only delete
checkpoint files which it has itself written, which means that there will
generally be one checkpoint file kept per restart: the last one written.
This means that you can always recover from the above situation by
rerunning the restart during which the problem occurred. However, keeping
one checkpoint file per restart is a problem in itself, and we should fix
that as well; once fixed, the potential for losing data after an
interrupted write operation returns.
Thoughts?
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283>
Einstein Toolkit <http://einsteintoolkit.org>