[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue Mar 5 08:58:45 CST 2013
#1283: Missing data in HDF5 files
--------------------+-------------------------------------------------------
Reporter: hinder | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Cactus | Version:
Keywords: |
--------------------+-------------------------------------------------------
If a simulation runs out of disk space while writing an HDF5 file, the
simulation will terminate. The HDF5 file being written might then be
corrupt, and all data in it may be irretrievable. In that case,
restarting from the last-written checkpoint file can leave a "gap" in the
data, covering the period between the start of the failed restart and the
last checkpoint that restart wrote.
Steps to reproduce:
• Start a simulation which checkpoints periodically and consists of several restarts
• Keep all checkpoint files
• Restart 0000 completes successfully and checkpoints at iteration i1
• Restart 0001 checkpoints once, after some evolution, at iteration i2
• Restart 0001 terminates abnormally while writing an HDF5 output file at iteration i3
• The output file is corrupted and unrecoverable, so it contains no usable data from iteration i1 to iteration i3
• Restart 0002 starts at iteration i2, since this is the last available checkpoint
• The simulation continues to the end, but the data from the corrupted HDF5 file between iterations i1 and i2 is permanently lost
Possible solutions:
1. Write HDF5 files safely, e.g. by first copying the file to a
new temporary file, performing the write on the copy, and then atomically
moving the temporary file over the original. The original file would then
survive a crash during the write of the new file, but the extra copy could
be very expensive for 3D output files. (A rough sketch of this approach
follows the list.)
2. Start a new set of HDF5 files after each checkpoint. This
seems to be the simplest and most efficient solution, but it requires
readers of HDF5 files to be modified to take the additional files into
account.
3. On recovery, check the consistency of all HDF5 files from the
previous restart(s), and recover from the latest checkpoint file for
which all earlier HDF5 files are valid. We could use code that checks the
HDF5 file itself, or some other flagging mechanism that indicates the HDF5
writes completed successfully; e.g. we could rename the HDF5 file to .tmp
during writes and rename it back after a successful write. This is
complex and requires Cactus or simfactory to look into previous restarts;
it also applies only to HDF5 files and requires breaking several
abstraction barriers. (A rough sketch of such a validity check also
follows the list.)
4. Wait for HDF5 journalling support. As far as I know, only
metadata journalling is planned, which is probably not enough, and in any
case the HDF5 developers are not actively working on the next version at
the moment due to lack of funding.
5. Checkpoint only on termination of the simulation.
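
To make option 1 concrete, here is a minimal sketch of a
write-to-a-copy-then-rename wrapper. It assumes the output file already
exists and that the temporary copy lives on the same filesystem, so that
rename(2) is atomic; safe_hdf5_append, copy_file and the writer callback
are hypothetical names for illustration, not existing Cactus or HDF5 API:

  /* Sketch of option 1 (hypothetical helper, not existing Cactus code):
     write into a temporary copy of the output file and atomically rename
     it over the original, so the previous version survives a crash
     mid-write.  Assumes the output file already exists. */
  #include <hdf5.h>
  #include <stdio.h>

  /* Copy 'src' to 'dst' byte for byte; returns 0 on success. */
  static int copy_file(const char *src, const char *dst)
  {
    FILE *in = fopen(src, "rb"), *out = fopen(dst, "wb");
    char buf[1 << 16];
    size_t n;
    int ok = (in != NULL && out != NULL);

    while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0)
      ok = (fwrite(buf, 1, n, out) == n);
    if (in) fclose(in);
    if (out) ok = (fclose(out) == 0) && ok;
    return ok ? 0 : -1;
  }

  /* Append output to a temporary copy of 'filename' via the caller's
     'writer' callback, then rename(2) the copy over the original.
     The rename is atomic on POSIX when both names are on the same
     filesystem, so a crash at any point leaves either the old or the
     new file intact, never a half-written one. */
  int safe_hdf5_append(const char *filename,
                       int (*writer)(hid_t file, void *data), void *data)
  {
    char tmpname[4096];
    hid_t file;
    int status;

    snprintf(tmpname, sizeof tmpname, "%s.tmp", filename);
    if (copy_file(filename, tmpname) != 0) return -1;

    file = H5Fopen(tmpname, H5F_ACC_RDWR, H5P_DEFAULT);
    if (file < 0) return -1;
    status = writer(file, data);
    if (H5Fclose(file) < 0 || status != 0) return -1;

    /* Only a fully successful write replaces the original file. */
    return rename(tmpname, filename) == 0 ? 0 : -1;
  }

As noted above, copying a large 3D output file before every write is where
the cost of this scheme would show up.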
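
For the consistency check in option 3, something along the following lines
might be enough to decide whether an output file from a previous restart is
usable. hdf5_file_is_valid is a hypothetical name; the check simply combines
the standard H5Fis_hdf5 signature test with an attempt to open the file
read-only:

  /* Sketch of the consistency check in option 3 (hypothetical helper).
     A file is treated as usable only if it carries the HDF5 signature
     and can actually be opened, which catches truncated or otherwise
     corrupted files left behind by an interrupted write. */
  #include <hdf5.h>

  int hdf5_file_is_valid(const char *filename)
  {
    htri_t is_hdf5 = H5Fis_hdf5(filename);  /* checks the file signature */
    hid_t file;

    if (is_hdf5 <= 0)
      return 0;                 /* missing, unreadable, or not an HDF5 file */

    file = H5Fopen(filename, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0)
      return 0;                 /* signature present but metadata corrupt */

    H5Fclose(file);
    return 1;
  }

The rename-to-.tmp variant mentioned in option 3 would make the check even
cheaper: the presence of the file under its final name would itself indicate
that the last write completed, while a leftover .tmp file would flag an
interrupted one.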
In reality, we do not keep all checkpoint files. I usually keep just the
last checkpoint file. I believe that a Cactus simulation will only delete
checkpoint files which it has itself written, which means that there will
generally be one checkpoint file kept per restart: the last one written.
This means that you can always recover from the above situation by
rerunning the restart during which the problem occurred. However, keeping
one checkpoint file per restart is a problem in itself, and we should fix
that as well; once fixed, the potential for losing data after an
interrupted write operation returns.
Thoughts?
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283>
Einstein Toolkit <http://einsteintoolkit.org>