[Users] Restarting simulations using simfactory
Atul Kedia
akedia at nd.edu
Thu Aug 27 13:18:17 CDT 2020
Hi Erik,
Thanks for your comments.
I ran a few tests last week and found that the "sim submit" command
actually works on my cluster, but in an odd way. The "nohup" error I shared
earlier is only the screen output; the simulation continues to run in
the background. This is inconvenient, though, because the job does not show
up as running in my cluster's queue. It appears as running only when I list
the simulations with ./simfactory/bin/sim list-simulations. So I will
have to continue using the 'run' command.
> If that is true, then I would create a new simulation (i.e. use a
> different name for the simulation), and modify the parameter file to
> point to the checkpoint files in the old simulations as restart files.
> The respective parameters are provided by the "IOUtil" thorn. This
> way, you sidestep Simfactory's automatic restart mechanism, and you
> are only using the "sim run" command you already know is working for
> your machine. The disadvantage is that you need to modify the
> parameter file each time you restart to point to the checkpoint files
> written by the previous run.
>
I am working on this method now and trying to figure out which parameters
need to be changed. As far as I can gather, I should only need to change
the recover mode to 'manual' and specify the path and file name of the
checkpoint. I will let you know if I make some progress on it.
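
A minimal sketch of that par-file change, assuming the recovery parameters
are IOUtil's IO::recover, IO::recover_dir and IO::recover_file. The function
name, the directory and the checkpoint file name are hypothetical
placeholders; the exact form of the file name expected in 'manual' mode
should be checked against the IOUtil documentation.

```python
import re

# Recovery parameters of the IOUtil thorn (implementation name "IO").
RECOVERY_PARAMS = ("recover", "recover_dir", "recover_file")


def with_manual_recovery(par_text, recover_dir, recover_file):
    """Return par_text with the recovery parameters set for a manual restart.

    recover_dir and recover_file are placeholders supplied by the caller;
    nothing here checks that the checkpoint actually exists.
    """
    # Match existing settings such as 'IO::recover = "no"' (case-insensitive)
    # so that no recovery parameter ends up being set twice in the new file.
    pattern = re.compile(
        r"^\s*(?:IO|IOUtil)::(?:%s)\s*=" % "|".join(RECOVERY_PARAMS),
        re.IGNORECASE,
    )
    kept = [line for line in par_text.splitlines() if not pattern.match(line)]
    kept += [
        'IO::recover      = "manual"',
        'IO::recover_dir  = "%s"' % recover_dir,
        'IO::recover_file = "%s"' % recover_file,
    ]
    return "\n".join(kept) + "\n"
```

The returned text would be written to the par file of the new simulation,
which is then started with the "sim run" command as before. All other
settings, including the checkpointing parameters, are left untouched.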
Thank you,
best regards,
Atul
> > Warning: job status is U
> > Warning: Job chaining requested but job id 999999 is not in the queue. Its status is U. Aborting submission.
> >
> > I guess the issue is with the "status is U" part now.
> >
> > Best regards,
> > Atul.
> >
> > On Sat, Aug 15, 2020 at 8:57 PM Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> >>
> >> Atul
> >>
> >> "sim run" starts a simulation right away. Have you tried "sim submit"
> >> instead? This should check whether the simulation is still active, and
> >> if so, deactivate it before running the next restart.
> >>
> >> -erik
> >>
> >> On Sat, Aug 15, 2020 at 7:54 PM Atul Kedia <akedia at nd.edu> wrote:
> >> >
> >> > Hello,
> >> >
> >> > I want to restart a simulation to make it run for longer. It stopped
> >> > at the time requested in my par file. Checkpoints are enabled in the
> >> > par file.
> >> >
> >> > I have increased the time in the par file at <sim_name>/output-0000/
> >> > and at <sim_name>/SIMFACTORY/par, and I tried the commands:
> >> >
> >> > simfactory/bin/sim cleanup <sim_name>
> >> > followed by
> >> > simfactory/bin/sim run <sim_name>
> >> > and added a line "jobid = 999999" as suggested at:
> >> > http://lists.einsteintoolkit.org/pipermail/users/2018-September/006528.html
> >> >
> >> > and I get the error message:
> >> > "Error: Internal error: Cannot submit simulation <sim_name> because it is already active"
> >> >
> >> > Another email thread I used for reference was this one:
> >> > http://lists.einsteintoolkit.org/pipermail/users/2018-May/006281.html
> >> >
> >> > I am using ET_Mayer with the default simfactory that it comes with
> >> > (simfactory 2, I think).
> >> >
> >> > Any help would be really appreciated.
> >> >
> >> > Thank you,
> >> >
> >> > --
> >> > Atul Kedia
> >> > PhD student,
> >> > Physics department,
> >> > University of Notre Dame.
> >> > _______________________________________________
> >> > Users mailing list
> >> > Users at einsteintoolkit.org
> >> > http://lists.einsteintoolkit.org/mailman/listinfo/users
> >>
> >>
> >>
> >> --
> >> Erik Schnetter <schnetter at cct.lsu.edu>
> >> http://www.perimeterinstitute.ca/personal/eschnetter/