[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue Mar 5 10:29:29 CST 2013
#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
  Reporter:  hinder  |      Owner:
      Type:  defect  |     Status:  new
  Priority:  major   |  Milestone:
 Component:  Cactus  |    Version:
Resolution:          |   Keywords:
---------------------+------------------------------------------------------
Comment (by hinder):
> Starting a new HDF5 file for every restart is what Simfactory does by
> default. This is why every restart has its own directory. I implemented
> it this way because I was bitten by this problem in the past. People
> didn't like it.
I agree (and always have) that this is the correct approach. It separates
the different restarts so that a catastrophic corruption in one restart
does not destroy the restarts before it. However, it doesn't solve the
problem entirely: by checkpointing more frequently than once per restart,
we have introduced a finer granularity, and recovery is based on this
finer scale rather than on the coarse scale of restarts. This means that
you can recover a simulation from a checkpoint written partway through a
"corrupted" restart, and yet not have the output data from that restart.
> Simfactory needs a mechanism to check whether the output produced by a
> restart is complete and consistent, before using a checkpoint file
> produced by this restart. This could be implemented by Cactus writing a
> "tag file" that is only present while the output is consistent. That is,
> this tag file would be deleted before HDF5 output is started, and
> re-created afterwards if there were no errors. During cleanup, Simfactory
> would disable checkpoint files from incomplete or inconsistent restarts.
> This requires keeping checkpoint files in the restart directories, and
> requires keeping checkpoint files from previous restarts.
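For concreteness, the cleanup check could look something like the
following minimal Python sketch (the tag file name "OUTPUT_CONSISTENT"
and the checkpoint naming scheme are made up for illustration; this is
not existing Simfactory code):

    import os

    def disable_unsafe_checkpoints(restart_dir):
        """Disable checkpoint files from a restart whose output may be
        incomplete, so that recovery will not pick them up."""
        tag_file = os.path.join(restart_dir, "OUTPUT_CONSISTENT")
        if os.path.exists(tag_file):
            return  # tag file present: output is consistent
        for name in os.listdir(restart_dir):
            if name.startswith("checkpoint.") and name.endswith(".h5"):
                path = os.path.join(restart_dir, name)
                os.rename(path, path + ".disabled")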
An alternative would be to implement this in Cactus. We could add a flag
to the checkpoint file which says "all data before this checkpoint file is
safe". This flag would be set to "false" at the start of OutputGH and
then reset to "true" once OutputGH completes successfully. Thorns which
write files outside this bin could call an aliased or flesh function to
indicate that they are writing files in a potentially destructive way.
Cactus would only manipulate checkpoint files that it had created in the
current restart, but it would mark all such files in this way. A simple
implementation of this flag would be to append a ".tmp" suffix to the
checkpoint file names during file writes which might corrupt data. A more
complicated implementation would be to add this information to the
checkpoint file itself, or to some other external file. During recovery,
Cactus won't find the checkpoint files which were renamed before the
catastrophic write, so it will continue from the last "good" checkpoint
file, which will likely be the one that the previous restart started from.
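To illustrate the sequence (in Cactus this would of course be written in
C, in the flesh or in an I/O thorn; this Python sketch with invented
names just shows the logic):

    import os

    def guarded_output(checkpoint_files, write_output):
        """Flag this restart's checkpoint files as unsafe while a
        potentially destructive write is in progress."""
        for f in checkpoint_files:
            os.rename(f, f + ".tmp")   # mark: earlier data may be unsafe
        write_output()                 # e.g. the OutputGH bin; may crash
        for f in checkpoint_files:
            os.rename(f + ".tmp", f)   # unmark: output completed cleanly

If the job dies inside the write, the renames are never undone, so the
".tmp" files are invisible to recovery and we fall back to an older
checkpoint, which is exactly the behaviour described above.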
> This is what Simfactory did in the past (and there is also an option to
> manually restart from an earlier restart). People didn't like this safety
> measure. Do we need to go back to it?
I actually reviewed the discussion surrounding this decision earlier
today. The aim was to reduce disk space usage by putting all the
checkpoint files in the same directory, allowing Cactus to delete all the
old ones rather than keeping one per restart. However, as far as I can
tell, this did not have the desired effect, because Cactus will not delete
previous checkpoint files, only those from the current run. So the only
benefit to using a common checkpoint directory is that it is slightly
easier to handle from the command line, as all the checkpoint files are in
the same place.
We said that we would rely on the logic in Cactus to do this, since it was
established and hence debugged and likely working. It just didn't do
quite what we assumed it would do!
To solve the disk space issue, we need to write new code. Simfactory
could, during cleanup (i.e., between restarts), delete old checkpoint
files according to some scheme. Alternatively, this could be implemented
in Cactus via a new parameter, say "remove_previous_checkpoint_files".
The former would work with checkpoint files in the restart directories,
whereas the latter would only work if all restarts shared a common
checkpoint directory.
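A rough sketch of the Simfactory-side variant, assuming a retention
policy of "keep the last two restarts" and an invented file layout:

    import os

    def prune_old_checkpoints(restart_dirs, keep_last=2):
        """Delete checkpoint files from all but the most recent
        `keep_last` restart directories to reclaim disk space."""
        for d in sorted(restart_dirs)[:-keep_last]:
            for name in os.listdir(d):
                if name.startswith("checkpoint.") and name.endswith(".h5"):
                    os.remove(os.path.join(d, name))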
> To go back to it, we would discourage people from writing anywhere
> other than into restart-specific directories.
> If this is inconvenient, it can probably be fixed via post-processing,
> even automatically during the cleanup stage. That is what the cleanup
> stage is for...
You mean we would discourage people from using "../checkpoints" for the
checkpoint and recovery directory? I think I agree that this would be a
good thing.
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:6>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit