[ET Trac] [Einstein Toolkit] #316: Checkpoint recovery nonfunctional
Einstein Toolkit
trac-noreply at einsteintoolkit.org
Sun Feb 27 10:43:06 CST 2011
#316: Checkpoint recovery nonfunctional
------------------------+---------------------------------------------------
Reporter: hinder | Owner: mthomas
Type: defect | Status: new
Priority: blocker | Milestone:
Component: SimFactory | Version:
Keywords: regression |
------------------------+---------------------------------------------------
Checkpoint recovery is nonfunctional in SimFactory 2 (it has regressed
since it was last fixed in ticket #60).
Using the attached parameter file, I submit a simulation on Datura:
simfactory2/bin/sim --machine datura --config sim2_datura create-submit parfiles/cptest.par 12 1:00:00
This parameter file terminates the Cactus run after 1 minute and dumps a
checkpoint file. I then manually remove the output-0000-active symlink.
This is necessary because I have disabled the automatic cleanup in the
main() function, which was cleaning up restarts that were still attempting
to run, and manual cleanup doesn't work (see ticket #315).
I then resubmit the simulation:
simfactory2/bin/sim --machine datura submit parfiles/cptest.par
and observe that the checkpoint files from the first restart are never
hard-linked into the new restart's output directory. The job does not
recover from the checkpoint, and instead starts from initial data.
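
One way to confirm the missing links independently of the log is to compare
the two restart directories directly. The following is only a diagnostic
sketch: the function name is hypothetical, and the "checkpoint.chkpt*" file
pattern is an assumption that should be adjusted to whatever the parameter
file actually configures.

```python
import glob
import os


def hardlinked_checkpoints(prev_dir, new_dir, pattern="checkpoint.chkpt*"):
    """Return names of checkpoint files in prev_dir that also exist in
    new_dir and refer to the same underlying file.

    The default pattern is an assumed naming scheme, not taken from the
    SimFactory source.
    """
    shared = []
    for path in sorted(glob.glob(os.path.join(prev_dir, pattern))):
        name = os.path.basename(path)
        candidate = os.path.join(new_dir, name)
        # os.path.samefile compares device and inode numbers, so it is true
        # for hard links (and also for a symlink that resolves to the file).
        if os.path.exists(candidate) and os.path.samefile(path, candidate):
            shared.append(name)
    return shared
```

An empty result for the first and second restart directories would match
the behaviour described above.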
The log file is attached.
Looking at the code, it appears that checkpoint linking is conditional on
the from-restart-id parameter being passed to simfactory, which I think
has something to do with job chaining. I can't find anywhere in the code
that sets this option, which is probably why the linking is not happening.
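
To illustrate one possible fallback, the sketch below links checkpoint files
from the most recent earlier restart whenever no from-restart-id is supplied.
It is not the SimFactory implementation: the function names, the
"output-NNNN" directory regex, and the "checkpoint.chkpt*" file pattern are
all assumptions made for the example.

```python
import glob
import os
import re

# Assumed restart directory naming; the trailing $ deliberately excludes
# entries such as the output-0000-active symlink.
RESTART_RE = re.compile(r"^output-(\d{4})$")


def restart_ids(sim_dir):
    """Return the sorted restart ids found as output-NNNN directories."""
    ids = []
    for name in os.listdir(sim_dir):
        match = RESTART_RE.match(name)
        if match and os.path.isdir(os.path.join(sim_dir, name)):
            ids.append(int(match.group(1)))
    return sorted(ids)


def link_checkpoints(sim_dir, new_id, from_restart_id=None,
                     pattern="checkpoint.chkpt*"):
    """Hard-link checkpoint files from an earlier restart into restart new_id.

    If from_restart_id is None, fall back to the latest restart whose id is
    lower than new_id. Returns the list of newly created link paths.
    """
    if from_restart_id is None:
        earlier = [i for i in restart_ids(sim_dir) if i < new_id]
        if not earlier:
            return []  # first restart: nothing to recover from
        from_restart_id = earlier[-1]
    src = os.path.join(sim_dir, "output-%04d" % from_restart_id)
    dst = os.path.join(sim_dir, "output-%04d" % new_id)
    linked = []
    for path in sorted(glob.glob(os.path.join(src, pattern))):
        target = os.path.join(dst, os.path.basename(path))
        if not os.path.exists(target):
            os.link(path, target)  # hard link, so no data is copied
            linked.append(target)
    return linked
```

With a fallback like this, a plain resubmission would pick up the previous
restart's checkpoints even when nothing sets from-restart-id explicitly.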
--
Ticket URL: <https://trac.einsteintoolkit.org/ticket/316>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit