[ET Trac] [Einstein Toolkit] #316: Checkpoint recovery nonfunctional

Einstein Toolkit trac-noreply at einsteintoolkit.org
Sun Feb 27 10:43:06 CST 2011


#316: Checkpoint recovery nonfunctional
------------------------+---------------------------------------------------
 Reporter:  hinder      |       Owner:  mthomas
     Type:  defect      |      Status:  new    
 Priority:  blocker     |   Milestone:         
Component:  SimFactory  |     Version:         
 Keywords:  regression  |  
------------------------+---------------------------------------------------
 Checkpoint recovery is nonfunctional in SimFactory 2 (it has regressed
 since it was last fixed in ticket #60).

 Using the attached parameter file, I submit a simulation on Datura:

   simfactory2/bin/sim --machine datura --config sim2_datura create-submit parfiles/cptest.par 12 1:00:00

 This parameter file terminates the Cactus run after 1 minute and dumps a
 checkpoint file.  I then manually remove the output-0000-active symlink.
 This is necessary because I have disabled the automatic cleanup in the
 main() function, which was cleaning up restarts that were still attempting
 to run, and manual cleanup doesn't work (see ticket #315).

 I then resubmit the simulation:

   simfactory2/bin/sim --machine datura submit parfiles/cptest.par

 and observe that the checkpoint files from the first restart are never
 hardlinked into the output directory.  The job does not recover, and
 instead starts from initial data.

 Log file is attached.

 Looking at the code, it appears that the checkpoint linking is conditional
 on the from-restart-id option being passed to simfactory, which I think
 is related to job chaining.  I can't find anywhere in the code that sets
 this option, so this is probably why the linking is not happening.
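
 One possible way to avoid depending on from-restart-id is to fall back to
 the most recent earlier restart when the option is absent.  The sketch
 below only illustrates that idea and is not taken from the SimFactory
 source: the function choose_source_restart, its arguments, and the
 assumption that restart directories are named output-NNNN (inferred from
 the output-0000-active symlink) are hypothetical.

```python
import os
import re

# Assumed naming scheme: "output-0000", "output-0001", ...
# Names such as "output-0000-active" deliberately do not match.
RESTART_DIR = re.compile(r"^output-(\d{4})$")


def choose_source_restart(sim_dir, new_restart_id, from_restart_id=None):
    """Pick the restart whose checkpoints should be linked.

    If from_restart_id is given (e.g. by job chaining), use it as is.
    Otherwise fall back to the highest restart id below new_restart_id.
    Returns None when there is no earlier restart to recover from.
    """
    if from_restart_id is not None:
        return from_restart_id
    earlier = []
    for name in os.listdir(sim_dir):
        match = RESTART_DIR.match(name)
        if match and int(match.group(1)) < new_restart_id:
            earlier.append(int(match.group(1)))
    return max(earlier) if earlier else None
```

 With a fallback of this kind, a plain "submit" of an existing simulation
 would still find a restart to link checkpoints from, while an explicit
 from-restart-id would keep taking precedence.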

-- 
Ticket URL: <https://trac.einsteintoolkit.org/ticket/316>
Einstein Toolkit <http://einsteintoolkit.org>
The Einstein Toolkit

