[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files

Einstein Toolkit trac-noreply at einsteintoolkit.org
Tue Mar 5 09:37:14 CST 2013


#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
  Reporter:  hinder  |       Owner:     
      Type:  defect  |      Status:  new
  Priority:  major   |   Milestone:     
 Component:  Cactus  |     Version:     
Resolution:          |    Keywords:     
---------------------+------------------------------------------------------

Comment (by hinder):

 Replying to [comment:1 knarf]:

 > Assuming a simulation stops because it detected a problem with I/O
 > (e.g. disk full), I would rather not have it restart without the user
 > explicitly telling it to. 'sim submit NAME' doesn't count as 'explicitly'
 > to me. Depending on the situation, a user should then probably investigate
 > what happened and take steps accordingly, e.g., restarting the last
 > 'restart' completely if data would otherwise be missing. It would be hard
 > to correctly automate this.

 At the moment, running out of quota essentially causes all your
 simulations to die immediately, and all the queued restarts also die,
 until you have nothing left in the queue.  It would be nice to have a
 feature that places a hold on your queued jobs when disk space runs low.
 One could imagine simfactory monitoring quota usage periodically, holding
 your jobs and sending you an email if you run low on quota.  But I think
 this is a discussion for another ticket.
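
 A rough sketch of what such a monitor could look like (purely
 hypothetical: the threshold, paths, job ids and addresses are made up,
 and 'qhold' assumes a PBS/Torque-style scheduler; SLURM would use
 'scontrol hold' instead):

    import shutil, smtplib, subprocess
    from email.message import EmailMessage

    # Placeholder values; a real version would read these from
    # simfactory's machine database.
    SCRATCH = "/scratch/hinder/simulations"
    MIN_FREE_GB = 100
    QUEUED_JOB_IDS = ["1234567", "1234568"]

    def free_gb(path):
        # Free space on the filesystem; a real quota check would have to
        # parse the site's 'quota' or 'lfs quota' output instead.
        return shutil.disk_usage(path).free / 2**30

    if free_gb(SCRATCH) < MIN_FREE_GB:
        # Hold the queued restarts so they do not start and immediately die.
        for jobid in QUEUED_JOB_IDS:
            subprocess.run(["qhold", jobid], check=False)
        # Notify the user by email.
        msg = EmailMessage()
        msg["Subject"] = "simfactory: low disk space, queued jobs held"
        msg["From"] = "simfactory@localhost"
        msg["To"] = "user@localhost"
        msg.set_content("Free space on %s is below %d GB; queued jobs held."
                        % (SCRATCH, MIN_FREE_GB))
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)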

 > Having said that: trying to minimize the data loss is certainly
 > worthwhile. Starting new files every time the code checkpoints would be a
 > nice solution. Implementing this in Cactus would touch some thorns
 > (assuming you do the same for all output types), and while you are right
 > about the readers, there aren't so many of them; I don't think that would
 > be a problem.

 This would increase the number of inodes used, which has been a problem
 in the past on datura.  I usually checkpoint every 3 hours, so in a 24-hour
 job that is a factor of 8 increase in the number of inodes.  This is
 really a shortcoming of the HDF5 library: it does not guarantee that a
 file will be recoverable after an interrupted write.
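
 For what it's worth, it is at least easy to detect files that were
 truncated by an interrupted write after the fact; a minimal sketch using
 h5py (the file pattern is just an example, and walking the metadata does
 not catch every possible form of damage):

    import glob
    import h5py

    # Hypothetical output-file pattern; adjust to the actual layout.
    for fname in sorted(glob.glob("output-????/*/*.h5")):
        try:
            with h5py.File(fname, "r") as f:
                # Visiting every object forces the metadata to be read,
                # which catches most truncation from an interrupted write;
                # reading the datasets themselves would be a stronger check.
                f.visit(lambda name: None)
        except (OSError, RuntimeError) as err:
            print("possibly damaged:", fname, "--", err)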

 > You would get all this automatically though if you do a restart every
 > time you checkpoint. Assuming you checkpoint every 6 hours, set your
 > walltime to 6 hours and submit more restarts. Of course that is likely
 > quite impractical on some systems that have limits on queues.

 You know, I hadn't thought of that!  I think I will start to do this.
 Thanks! It has the same issue with inodes though.

 > What we could do is to have simfactory support this: run 'RunScript' a
 > couple of times instead of once, and set the max_walltime in the Cactus
 > parameter file to the real_walltime/count. That way you get real restarts
 > without the queuing system knowing and without Cactus or reader code
 > change. You add some overhead from reading the checkpoints though.
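
 (For concreteness, a multi-segment RunScript along those lines might boil
 down to something like the sketch below.  The executable name, parameter
 file, core count and segment count are placeholders, and it assumes the
 parameter file sets the per-segment walltime to real_walltime/count and
 enables checkpoint recovery, e.g. via IO::recover = "autoprobe".)

    import subprocess

    # Run several Cactus "segments" inside a single queue allocation; each
    # segment terminates after its share of the walltime and the next one
    # recovers from the checkpoint it wrote.
    SEGMENTS = 4
    CMD = ["mpirun", "-np", "64", "./cactus_sim", "my_simulation.par"]

    for segment in range(SEGMENTS):
        print("starting segment", segment + 1, "of", SEGMENTS)
        result = subprocess.run(CMD)
        if result.returncode != 0:
            # Stop early if a segment fails (e.g. disk full) instead of
            # burning the rest of the allocation.
            raise SystemExit(result.returncode)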

 Yes, the overhead of re-reading the checkpoints might be significant on
 some systems.  The "correct" solution to the problem is to implement full
 journalling in HDF5.  In the absence of that, we might want to consider
 copying the HDF5 file to a temporary file before writing to it.  I wonder
 how bad this would be from a performance point of view.  If the filesystem
 were in any sense "modern", it would use copy-on-write at the block level,
 which should make this very fast.  In reality, it will probably do a full
 copy.
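
 As an illustration of the copy-before-write idea, a sketch only: on a
 filesystem with reflink support, 'cp --reflink=auto' makes a cheap
 block-level copy-on-write clone, and otherwise it (or the shutil.copy2
 fallback for non-GNU cp) does a full copy.

    import os
    import shutil
    import subprocess

    def safe_update(h5file, write_func):
        # Write to a copy of the file and atomically replace the original,
        # so an interrupted write never damages the existing data.
        tmp = h5file + ".tmp"
        if subprocess.run(["cp", "--reflink=auto", h5file, tmp]).returncode != 0:
            shutil.copy2(h5file, tmp)
        write_func(tmp)          # append the new output to the copy
        os.replace(tmp, h5file)  # atomic rename on POSIX filesystems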

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:2>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

