[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files

Tue Mar 5 09:15:49 CST 2013

#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
  Reporter:  hinder  |       Owner:     
      Type:  defect  |      Status:  new
  Priority:  major   |   Milestone:     
 Component:  Cactus  |     Version:     
Resolution:          |    Keywords:     
---------------------+------------------------------------------------------

Comment (by knarf):

 Regardless of how often you checkpoint and how many checkpoints you keep,
 you lose data in all cases. You can always 'recover' by re-running whatever
 is missing. The question is: how much do you lose, and can/should you
 recover automatically, or would it be acceptable/better to leave that to
 the user?

 Assuming a simulation stops because it detected a problem with I/O (e.g.
 disk full) I would rather not have it restart without the user explicitly
 telling it to. 'sim submit NAME' doesn't count as 'explicitly' to me.
 Depending on the situation, a user should then probably investigate what
 happened and take steps accordingly, e.g., restarting the last 'restart'
 completely if otherwise data would be missing. It would be hard to
 correctly automate this.

 Having said that: trying to minimize the data loss is certainly
 worthwhile. Starting new files every time the code checkpoints would be a
 nice solution. Implementing this in Cactus would touch some thorns
 (assuming you do the same for all output types), and while you are right
 about the readers, there aren't so many of them; I don't think that would
 be a problem.
 You would get all this automatically though if you do a restart every time
 you checkpoint. Assuming you checkpoint every 6 hours, set your walltime
 to 6 hours and submit more restarts. Of course that is likely quite
 impractical on some systems that have limits on queues. What we could do
 is to have simfactory support this: run 'RunScript' a couple of times
 instead of once, and set the max_walltime in the Cactus parameter file to
 the real_walltime/count. That way you get real restarts without the
 queuing system knowing and without any changes to Cactus or reader code.
 You add some overhead from reading the checkpoints though.
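
 To make the arithmetic concrete, here is a rough sketch of the walltime
 split described above. It is not simfactory code: the function names are
 made up for illustration, and I am assuming walltimes are given in hours.
 It only computes the per-run max_walltime and the window each run of
 'RunScript' would cover within the single queued job.

```python
def split_walltime(real_walltime_hours, count):
    """Per-run walltime in hours: real_walltime / count.

    Hypothetical helper; the unit (hours) is an assumption.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if real_walltime_hours <= 0:
        raise ValueError("real_walltime_hours must be positive")
    return real_walltime_hours / count


def segment_schedule(real_walltime_hours, count):
    """List of (run_index, start_hour, end_hour) for each internal restart."""
    per_run = split_walltime(real_walltime_hours, count)
    return [(i, i * per_run, (i + 1) * per_run) for i in range(count)]


if __name__ == "__main__":
    # A 24-hour job slot split into 4 runs of the run script.
    for index, start, end in segment_schedule(24, 4):
        print(f"run {index}: hours {start:g} to {end:g}, "
              f"max_walltime = {end - start:g}")
```

 With a 24-hour slot and count = 4, each run gets a max_walltime of 6
 hours, which matches the 'checkpoint every 6 hours' example. In practice
 you would probably want to leave some margin per run for reading and
 writing the checkpoints.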

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:1>
Einstein Toolkit <http://einsteintoolkit.org>