[Users] simfactory job restart
Ian Hinder
ian.hinder at aei.mpg.de
Mon Jul 28 05:33:36 CDT 2014
On 23 Jul 2014, at 18:32, Vassilios Mewes <vassilios.mewes at uv.es> wrote:
> Hello all,
>
> A simulation has crashed without checkpointing (there was a filesystem error on the cluster).
>
> How can I restart it? Do I need to delete the incomplete output-xxxx and output-xxxx-active folders? Or is that unnecessary, and will simfactory automatically find the latest valid checkpoint in simulation time and restart from there?
SimFactory is supposed to recover from this situation gracefully. However, I have in the past seen it find no checkpoint files in the most recent restart and then start again from the initial data. Perhaps somebody forgot to write a test case for this situation. This is not a problem if you use a checkpoint directory shared between restarts (i.e. "../checkpoints"), because the checkpoints then do not live inside any single restart. If you are not using a shared checkpoint directory, I recommend deleting the output-NNNN and output-NNNN-active directories/links for the failed restarts before resubmitting.
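
If it helps, here is a minimal Python sketch for listing the failed restarts before deleting anything. It is not part of SimFactory. It assumes that restarts are named output-NNNN with four digits, that checkpoint files are written somewhere inside each restart, and that their names start with "checkpoint.chkpt"; adjust the pattern to match your parameter file. The function names are hypothetical. It only reports candidates and removes nothing.

```python
import re
from pathlib import Path

# Assumption: restart directories are named output-NNNN (four digits).
RESTART_RE = re.compile(r"^output-(\d{4})$")


def restarts_without_checkpoints(sim_dir, checkpoint_glob="checkpoint.chkpt*"):
    """Return the trailing output-NNNN directories that hold no checkpoint files.

    The search walks back from the highest-numbered restart and stops at the
    first restart that contains at least one file matching checkpoint_glob.
    The default glob is an assumption about the checkpoint file names.
    """
    sim_dir = Path(sim_dir)
    restarts = sorted(
        (
            p
            for p in sim_dir.iterdir()
            if p.is_dir() and not p.is_symlink() and RESTART_RE.match(p.name)
        ),
        key=lambda p: int(RESTART_RE.match(p.name).group(1)),
    )
    failed = []
    for restart in reversed(restarts):
        # rglob searches the restart directory recursively.
        if any(restart.rglob(checkpoint_glob)):
            break
        failed.append(restart)
    return list(reversed(failed))


def removal_candidates(sim_dir, checkpoint_glob="checkpoint.chkpt*"):
    """Return failed restarts plus any matching output-NNNN-active entries."""
    candidates = []
    for restart in restarts_without_checkpoints(sim_dir, checkpoint_glob):
        candidates.append(restart)
        active = restart.with_name(restart.name + "-active")
        # is_symlink() also catches a dangling link, which exists() misses.
        if active.exists() or active.is_symlink():
            candidates.append(active)
    return candidates
```

Calling removal_candidates() on the simulation directory and printing the result gives the list to review by hand. If every restart is returned, there is no checkpoint to recover from at all. With a shared checkpoint directory this check is not needed.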
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder