[Users] simfactory job restart

Vassilios Mewes vassilios.mewes at uv.es
Mon Jul 28 08:20:17 CDT 2014


Hello Ian,

yes, this has happened to me. I didn't use a shared checkpoint directory,
and some runs started again from the initial data. Will deleting all
output-NNNN folders that lack a valid termination checkpoint remedy this?
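
Before deleting anything, it may help to list which restarts actually lack
checkpoint files. Below is a minimal Python sketch, not part of SimFactory.
The function name and the file pattern "checkpoint.chkpt*" are assumptions
made for illustration; adjust the pattern to whatever your parameter file
produces. It only checks that a matching file exists, not that the
checkpoint is valid, and it deletes nothing.

```python
import re
from pathlib import Path

# Restart directories are named output-NNNN in this thread; four digits is
# an assumption. output-NNNN-active links are deliberately not matched.
RESTART_DIR = re.compile(r"output-\d{4}")


def restarts_without_checkpoints(sim_dir, pattern="checkpoint.chkpt*"):
    """Return names of output-NNNN directories under sim_dir that contain
    no file matching `pattern` (hypothetical checkpoint file pattern)."""
    missing = []
    for entry in sorted(Path(sim_dir).iterdir()):
        # Skip anything that is not a plain output-NNNN directory.
        if not (entry.is_dir() and RESTART_DIR.fullmatch(entry.name)):
            continue
        # Search the restart directory recursively for checkpoint files.
        if not any(p.is_file() for p in entry.rglob(pattern)):
            missing.append(entry.name)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python list_restarts.py /path/to/simulation
    for name in restarts_without_checkpoints(sys.argv[1]):
        print(name)
```

The output is a list of candidate directories to inspect by hand before
removing them together with their output-NNNN-active links.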

To use a shared checkpoint directory for future runs, is it sufficient to
just set

io::checkpoint_dir = "../checkpoints"
io::recover_dir    = "../checkpoints"

in the parameter file?

best wishes,

Vassili


On Mon, Jul 28, 2014 at 12:33 PM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:

>
> On 23 Jul 2014, at 18:32, Vassilios Mewes <vassilios.mewes at uv.es> wrote:
>
> > Hello all,
> >
> > a simulation has crashed without checkpointing (there was a filesystem
> > error on the cluster)
> >
> > how can I restart it? do I need to delete the uncompleted output-xxxx
> > and output-xxxx-active folders? or is that not necessary and simfactory
> > will automatically find the latest valid checkpoint in simulation time
> > and restart from there?
>
> SimFactory is supposed to recover from this situation gracefully. However,
> I have in the past seen it notice that there are no checkpoint files in the
> last restart, and then start again from the initial data.  Perhaps somebody
> forgot to write a test case for this situation.  This is not a problem if
> you use a checkpoint directory shared between restarts (i.e.
> "../checkpoints").  If you are not using a shared checkpoint directory, I
> recommend deleting the output-NNNN and output-NNNN-active directories/links
> for the failed restarts.
>
> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
>
>
>