[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue Mar 5 09:37:14 CST 2013
#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
Reporter: hinder | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Cactus | Version:
Resolution: | Keywords:
---------------------+------------------------------------------------------
Comment (by hinder):
Replying to [comment:1 knarf]:
> Assuming a simulation stops because it detected a problem with I/O
> (e.g. disk full), I would rather not have it restart without the user
> explicitly telling it to. 'sim submit NAME' doesn't count as
> 'explicitly' to me. Depending on the situation, a user should then
> probably investigate what happened and take steps accordingly, e.g.,
> restarting the last 'restart' completely if data would otherwise be
> missing. It would be hard to automate this correctly.
At the moment, running out of quota essentially causes all your
simulations to die immediately, and all the queued restarts die as well,
until you have nothing left in the queue. What would be nice is a
feature that places a hold on your queued jobs when disk space runs low.
One could imagine simfactory monitoring quota usage periodically,
holding your jobs and sending you an email if you are running low. But I
think this is a discussion for another ticket.
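Just to make that concrete, something along the following lines, run
periodically from cron, could do the holding and the notification. This
is only a sketch, not anything that exists in simfactory: it assumes a
PBS/Torque-style scheduler (qselect/qhold), uses free disk space as a
stand-in for the real quota, and all paths and thresholds are made up.

#!/usr/bin/env python3
# Sketch of the "hold queued jobs when disk space runs low" idea.
# Assumes a PBS/Torque-style scheduler; a real version would parse the
# site's `quota` or `lfs quota` output instead of using free space.
import getpass
import shutil
import subprocess

SCRATCH   = "/path/to/scratch"   # filesystem the simulations write to
THRESHOLD = 50 * 1024**3         # hold jobs once less than ~50 GB remain

def queued_jobs(user):
    """Return the IDs of this user's queued (not yet running) jobs."""
    out = subprocess.run(["qselect", "-u", user, "-s", "Q"],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

def hold(jobid):
    """Ask the scheduler to place a user hold on one queued job."""
    subprocess.run(["qhold", jobid], check=True)

if __name__ == "__main__":
    free = shutil.disk_usage(SCRATCH).free
    if free < THRESHOLD:
        jobs = queued_jobs(getpass.getuser())
        for jobid in jobs:
            hold(jobid)
        print(f"Low disk space ({free / 1024**3:.1f} GB free): "
              f"held {len(jobs)} queued job(s); notify the user here.")

Releasing the jobs again once space has been freed would work the same
way with qrls.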
> Having said that: trying to minimize the data loss is certainly
> worthwhile. Starting new files every time the code checkpoints would
> be a nice solution. Implementing this in Cactus would touch some
> thorns (assuming you do the same for all output types), and while you
> are right about the readers, there aren't so many of them; I don't
> think that would be a problem.
You would get an increase in the number of inodes used, which has been a
problem on datura in the past. I usually checkpoint every 3 hours, so in
a 24 hour job that is a factor of 8 increase in the number of inodes.
This is really a shortcoming of the HDF5 library: it does not guarantee
that a file will be recoverable after an interrupted write.
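Spelled out (the per-restart file count is a made-up figure, just for
illustration):

walltime_hours         = 24
checkpoint_every_hours = 3
n_output_files         = 200   # e.g. one HDF5 file per output variable

segments = walltime_hours // checkpoint_every_hours   # = 8
print(f"{segments}x more inodes per restart: "
      f"{n_output_files} -> {segments * n_output_files}")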
> You would get all this automatically though if you do a restart every
> time you checkpoint. Assuming you checkpoint every 6 hours, set your
> walltime to 6 hours and submit more restarts. Of course that is likely
> quite impractical on some systems that have limits on queues.
You know, I hadn't thought of that! I think I will start to do this.
Thanks! It has the same issue with inodes though.
> What we could do is to have simfactory support this: run 'RunScript' a
> couple of times instead of once, and set the max_walltime in the
> Cactus parameter file to real_walltime/count. That way you get real
> restarts without the queuing system knowing and without Cactus or
> reader code changes. You add some overhead from reading the
> checkpoints though.
Yes, which might be significant on some systems. The "correct" solution
to the problem is to implement full journalling in HDF5. In the absence
of that, we might want to consider copying the HDF5 file to a temporary
file before writing to it. I wonder how bad this would be from a
performance point of view. If the filesystem were in any sense "modern",
it would use copy-on-write at the block level, which should make this
very fast. In reality, it will probably do a full copy.
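To make the copy idea concrete, here is one way to read it, sketched
with h5py (only an illustration, not anything in the Cactus I/O thorns;
the function and dataset names are made up): write into a copy of the
file and atomically rename it over the original once the write has gone
through, so that an interrupted job can at worst lose the newest output
rather than corrupt the whole file.

import os
import shutil
import h5py

def append_safely(filename, dataset, data):
    """Write a new dataset via a temporary copy of the file."""
    tmp = filename + ".tmp"
    shutil.copy2(filename, tmp)   # full copy; a COW filesystem could reflink
    with h5py.File(tmp, "a") as f:
        f.create_dataset(dataset, data=data)
        f.flush()
    os.replace(tmp, filename)     # atomic rename on POSIX filesystems

# e.g. append_safely("phi.xy.h5", "phi it=1024 tl=0 rl=0", phi_slice)

The shutil.copy2 step is exactly where a copy-on-write filesystem would
make this cheap; everywhere else it is the full copy I was worried
about.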
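And going back to the RunScript idea above: the loop simfactory would
need is tiny; the real work is dividing the walltime, setting
max_walltime accordingly, and making sure each inner run recovers from
the checkpoint the previous one wrote. A rough sketch (not simfactory
code; the count and the script name are placeholders):

import subprocess
import sys

INNER_RESTARTS = 4   # e.g. a 24 h queue slot split into 4 x 6 h Cactus runs

for i in range(INNER_RESTARTS):
    print(f"Starting inner restart {i + 1}/{INNER_RESTARTS}", flush=True)
    ret = subprocess.call(["./RunScript"])
    if ret != 0:     # stop early if Cactus died instead of terminating cleanly
        sys.exit(ret)

Each inner run would then start fresh output files, which is where the
"real restarts without the queuing system knowing" comes from.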
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:2>
Einstein Toolkit <http://einsteintoolkit.org>