[ET Trac] [Einstein Toolkit] #1283: Missing data in HDF5 files
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Tue Mar 5 09:15:49 CST 2013
#1283: Missing data in HDF5 files
---------------------+------------------------------------------------------
Reporter: hinder | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: Cactus | Version:
Resolution: | Keywords:
---------------------+------------------------------------------------------
Comment (by knarf):
Regardless of how often you checkpoint and how many checkpoints you keep,
you lose data in all cases. You can always 'recover' by re-running whatever is
missing. The question is: how much do you lose, and can/should you recover
automatically, or would it be acceptable/better to leave that to the user?
Assuming a simulation stops because it detected a problem with I/O (e.g.
disk full) I would rather not have it restart without the user explicitly
telling it to. 'sim submit NAME' doesn't count as 'explicitly' to me.
Depending on the situation, a user should then probably investigate what
happened and take steps accordingly, e.g., restarting the last 'restart'
completely if otherwise data would be missing. It would be hard to
correctly automate this.
Having said that: trying to minimize the data loss is certainly
worthwhile. Starting new files every time the code checkpoints would be a
nice solution. Implementing this in Cactus would touch several thorns
(assuming you do the same for all output types), and while you are right
about the readers, there aren't so many of them; I don't think that would
be a problem.
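
To make the "new files at every checkpoint" idea concrete, here is a minimal sketch of what the writer and reader sides could look like. The naming scheme (`<variable>.seg<NNNN>.h5`) and both function names are assumptions made up for illustration; nothing like this exists in Cactus as far as this ticket is concerned.

```python
import re

# Hypothetical naming scheme: one file per variable per checkpoint segment.
_SEGMENT_RE = re.compile(r"^(?P<var>.+)\.seg(?P<seg>\d{4,})\.h5$")


def segment_filename(variable, segment):
    """Name of the output file for `variable` in checkpoint segment `segment`."""
    return f"{variable}.seg{segment:04d}.h5"


def files_for_variable(variable, filenames):
    """Return the segment files belonging to `variable`, in segment order.

    A reader would open these one after another instead of a single file.
    """
    matches = []
    for name in filenames:
        m = _SEGMENT_RE.match(name)
        if m and m.group("var") == variable:
            matches.append((int(m.group("seg")), name))
    return [name for _, name in sorted(matches)]
```

With a layout like this, a write that fails partway through only affects the file of the current segment; everything up to the last checkpoint sits in files that were already closed.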
You would get all this automatically though if you do a restart every time
you checkpoint. Assuming you checkpoint every 6 hours, set your walltime
to 6 hours and submit more restarts. Of course that is likely quite
impractical on some systems that have limits on queues. What we could do
is to have simfactory support this: run 'RunScript' a couple of times
instead of once, and set the max_walltime in the Cactus parameter file to
the real_walltime/count. That way you get real restarts without the
queuing system knowing and without changes to Cactus or reader code. You do
add some overhead from reading the checkpoints, though.
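
The walltime arithmetic for the simfactory variant can be sketched as follows. This is only an illustration of the real_walltime/count split; the function names and the dictionary layout are hypothetical and not part of simfactory, and the hours unit is an assumption.

```python
def split_walltime(real_walltime_hours, count):
    """Walltime per segment when one queue allocation hosts `count` restarts."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if real_walltime_hours <= 0:
        raise ValueError("real_walltime_hours must be positive")
    return real_walltime_hours / count


def plan_segments(real_walltime_hours, count):
    """One entry per RunScript invocation inside a single allocation.

    Each entry carries the max_walltime value that would go into the
    Cactus parameter file for that restart (assumed to be in hours).
    """
    per_segment = split_walltime(real_walltime_hours, count)
    return [{"restart": i, "max_walltime": per_segment} for i in range(count)]


# Example: a 24 hour allocation split into 4 restarts of 6 hours each.
plan = plan_segments(24.0, 4)
```

Every restart after the first also pays for reading a checkpoint, so in practice the usable evolution time per segment is somewhat less than real_walltime/count.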
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/1283#comment:1>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit