[Users] simfactory job restart

Ian Hinder ian.hinder at aei.mpg.de
Mon Jul 28 05:33:36 CDT 2014


On 23 Jul 2014, at 18:32, Vassilios Mewes <vassilios.mewes at uv.es> wrote:

> Hello all,
> 
> A simulation has crashed without checkpointing (there was a filesystem error on the cluster).
> 
> How can I restart it? Do I need to delete the incomplete output-xxxx and output-xxxx-active folders? Or is that not necessary, and will simfactory automatically find the latest valid checkpoint in simulation time and restart from there?

SimFactory is supposed to recover from this situation gracefully. However, I have in the past seen it notice that there are no checkpoint files in the last restart and then start again from the initial data. Perhaps somebody forgot to write a test case for this situation. This is not a problem if you use a checkpoint directory shared between restarts (e.g. "../checkpoints"). If you are not using a shared checkpoint directory, I recommend deleting the output-NNNN and output-NNNN-active directories/links for the failed restarts, that is, the restarts that contain no checkpoint files.
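
To see which restarts are affected before deleting anything, something like the following sketch might help. It only lists the output-NNNN directories that contain no checkpoint files and removes nothing. The function name restarts_without_checkpoints is made up for this illustration and is not part of SimFactory. The file pattern "checkpoint.chkpt*" is an assumption about how your checkpoint files are named, so adjust it to match your parameter file.

```python
import re
from pathlib import Path

# Restart directories are named output-NNNN, as described above.
# The pattern deliberately does not match output-NNNN-active.
RESTART_RE = re.compile(r"^output-\d{4}$")


def restarts_without_checkpoints(sim_dir, pattern="checkpoint.chkpt*"):
    """List restart directories under sim_dir that hold no checkpoint files.

    'pattern' is an assumed checkpoint file name pattern, not something
    queried from SimFactory. Nothing is deleted; this is a dry run.
    """
    empty = []
    for entry in sorted(Path(sim_dir).iterdir()):
        # Skip links (such as output-NNNN-active) and plain files.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if not RESTART_RE.match(entry.name):
            continue
        # Search the whole restart directory tree for checkpoint files.
        if not any(p.is_file() for p in entry.rglob(pattern)):
            empty.append(entry)
    return empty
```

Calling restarts_without_checkpoints() on the top-level directory of the simulation returns the restart directories that are candidates for deletion. If you use a shared checkpoint directory such as "../checkpoints", every restart will be listed, because the checkpoints then live outside the output-NNNN directories; in that case no cleanup is needed.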

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder


