[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue Mar 5 10:29:29 CST 2013
#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
  Reporter:  hinder  |      Owner:
      Type:  defect  |     Status:  new
  Priority:  major   |  Milestone:
 Component:  Cactus  |    Version:
Resolution:          |   Keywords:
---------------------+------------------------------------------------------
Comment (by hinder):
> Starting a new HDF5 file for every restart is what Simfactory does by
> default. This is why every restart has its own directory. I implemented
> it this way because I was bitten by this problem in the past. People
> didn't like it.
I agree (and always have) that this is the correct approach. It separates
the different restarts so that a catastrophic corruption in one restart
does not destroy the restarts before it. However, it doesn't solve the
problem entirely: by checkpointing more frequently than once per restart,
we have introduced a finer granularity, and recovery is based on this
finer scale rather than on the coarse scale of restarts. This means that
you can recover a simulation from a checkpoint written partway through a
"corrupted" restart, and yet not have the output data from that restart.
> Simfactory needs a mechanism to check whether the output produced by a
> restart is complete and consistent, before using a checkpoint file
> produced by this restart. This could be implemented by Cactus writing a
> "tag file" that is only present while the output is consistent. That is,
> this tag file would be deleted before HDF5 output is started, and
> re-created afterwards if there were no errors. During cleanup, Simfactory
> would disable checkpoint files from incomplete or inconsistent restarts.
> This requires keeping checkpoint files in the restart directories, and
> requires keeping checkpoint files from previous restarts.
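For concreteness, the cleanup check could look something like the
following minimal Python sketch (the tag file name "OUTPUT_CONSISTENT"
and the checkpoint naming scheme are made up for illustration; this is
not existing Simfactory code):

    import os

    def disable_unsafe_checkpoints(restart_dir):
        """Disable checkpoint files from a restart whose output may be
        incomplete, so that recovery will not pick them up."""
        tag_file = os.path.join(restart_dir, "OUTPUT_CONSISTENT")
        if os.path.exists(tag_file):
            return  # tag file present: output is consistent
        for name in os.listdir(restart_dir):
            if name.startswith("checkpoint.") and name.endswith(".h5"):
                path = os.path.join(restart_dir, name)
                os.rename(path, path + ".disabled")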
An alternative would be to implement this in Cactus. We could add a flag
to the checkpoint file which says "all data before this checkpoint file is
safe". This flag would be set to "false" at the start of OutputGH and
then reset to "true" once OutputGH completes successfully. Thorns which
write files outside this bin could call an aliased or flesh function to
indicate that they are writing files in a potentially destructive way.
Cactus would only manipulate checkpoint files that it had created in the
current restart, but it would mark all such files in this way. A simple
implementation of this flag would be to append a ".tmp" suffix to the
checkpoint file names during file writes which might corrupt data. A more
complicated implementation would be to add this information to the
checkpoint file itself, or to some other external file. During recovery,
Cactus won't find the checkpoint files which were renamed before the
catastrophic write, so it will continue from the last "good" checkpoint
file, which will likely be the one that the previous restart started from.
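To illustrate the sequence (in Cactus this would of course be written in
C, in the flesh or in an I/O thorn; this Python sketch with invented
names just shows the logic):

    import os

    def guarded_output(checkpoint_files, write_output):
        """Flag this restart's checkpoint files as unsafe while a
        potentially destructive write is in progress."""
        for f in checkpoint_files:
            os.rename(f, f + ".tmp")   # mark: earlier data may be unsafe
        write_output()                 # e.g. the OutputGH bin; may crash
        for f in checkpoint_files:
            os.rename(f + ".tmp", f)   # unmark: output completed cleanly

If the job dies inside the write, the renames are never undone, so the
".tmp" files are invisible to recovery and we fall back to an older
checkpoint, which is exactly the behaviour described above.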
> This is what Simfactory did in the past (and there is also an option to
> manually restart from an earlier restart). People didn't like this safety
> measure. Do we need to go back to it?
I actually reviewed the discussion surrounding this decision earlier
today. The aim was to reduce disk space usage by putting all the
checkpoint files in the same directory, allowing Cactus to delete all the
old ones rather than keeping one per restart. However, as far as I can
tell, this did not have the desired effect, because Cactus will not delete
previous checkpoint files, only those from the current run. So the only
benefit to using a common checkpoint directory is that it is slightly
easier to handle from the command line, as all the checkpoint files are in
the same place.
We said that we would rely on the logic in Cactus to do this, since it was
established and hence debugged and likely working. It just didn't do
quite what we assumed it would do!
To solve the disk space issue, we need to write new code. Simfactory
could, during cleanup (i.e., between restarts), delete old checkpoint
files according to some scheme. Alternatively, this could be implemented
in Cactus via a new parameter, say "remove_previous_checkpoint_files".
The former would work with checkpoint files in the restart directories,
whereas the latter would only work if all restarts shared a common
checkpoint directory.
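A rough sketch of the Simfactory-side variant, assuming a retention
policy of "keep the last two restarts" and an invented file layout:

    import os

    def prune_old_checkpoints(restart_dirs, keep_last=2):
        """Delete checkpoint files from all but the most recent
        `keep_last` restart directories to reclaim disk space."""
        for d in sorted(restart_dirs)[:-keep_last]:
            for name in os.listdir(d):
                if name.startswith("checkpoint.") and name.endswith(".h5"):
                    os.remove(os.path.join(d, name))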
> To go back to it, we would discourage people from writing anywhere
> other than into restart-specific directories.
> If this is inconvenient, it can probably be fixed via post-processing,
> even automatically during the cleanup stage. That is what the cleanup
> stage is for...
You mean we would discourage people from using "../checkpoints" for the
checkpoint and recovery directory? I think I agree that this would be a
good thing.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:6>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit