[Users] Restarting simulations using simfactory

Erik Schnetter schnetter at cct.lsu.edu
Thu Aug 27 13:37:37 CDT 2020


On Thu, Aug 27, 2020 at 2:18 PM Atul Kedia <akedia at nd.edu> wrote:
>
> Hi Erik,
>
> Thanks for your comments.
>
> I ran a few tests last week and found that the "sim submit" command actually works on my cluster, but in an odd way. The "nohup" error I shared earlier is only screen output; the simulation continues to run in the background. This is inconvenient, though, because the run does not appear as a job in my cluster's queue. It shows up as running only when I list the simulations with ./simfactory/bin/sim list-simulations.

This is not how Simfactory is supposed to behave. It seems that
Simfactory is not correctly set up for your cluster. The mechanism
you're describing is the one that's useful on a workstation or laptop.

> So I will have to continue using the 'run' command.
>
>> If that is true, then I would create a new simulation (i.e. use a
>> different name for the simulation), and modify the parameter file to
>> point to the checkpoint files in the old simulations as restart files.
>> The respective parameters are provided by the "IOUtil" thorn. This
>> way, you sidestep Simfactory's automatic restart mechanism, and you
>> are only using the "sim run" command you already know is working for
>> your machine. The disadvantage is that you need to modify the
>> parameter file each time you restart to point to the checkpoint files
>> written by the previous run.
>>
> I am working on this method now and trying to figure out which parameters need to be changed. As far as I can gather, I should only need to set the recover mode to 'manual' and specify the path and file name of the checkpoint. I will let you know if I make some progress on it.

I think you want "recover = auto" instead; that's the generally useful setting.
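
As a minimal sketch, the restart section of the new simulation's
parameter file could look like the fragment below. The parameter names
are the IOUtil ones as I remember them, so please check them against
the thorn's param.ccl. The directory is a placeholder (an assumption,
not your actual layout); use whatever directory the old simulation
wrote its checkpoint files to.

```
# Recover from the most recent checkpoint found in recover_dir
IO::recover      = "auto"

# Placeholder path: the directory holding the old simulation's checkpoints
IO::recover_dir  = "/path/to/old_simulation/checkpoints"

# Base name of the checkpoint files (this is IOUtil's default base name)
IO::recover_file = "checkpoint.chkpt"
```

As far as I understand it, "auto" picks the latest checkpoint matching
that base name, so you should not have to spell out an iteration number
in the file name, which "manual" would require.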

-erik

> thank you,
> best regards,
> Atul
>
>>
>> > Warning: job status is U
>> > Warning: Job chaining requested but job id 999999 is not in the queue. Its status is U. Aborting submission.
>> >
>> > I guess the issue is with the "status is U" part now.
>> >
>> > Best regards,
>> > Atul.
>> >
>> > On Sat, Aug 15, 2020 at 8:57 PM Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>> >>
>> >> Atul
>> >>
>> >> "sim run" starts a simulation right away. Have you tried "sim submit"
>> >> instead? This should check whether the simulation is still active, and
>> >> if so, deactivate it before running the next restart.
>> >>
>> >> -erik
>> >>
>> >> On Sat, Aug 15, 2020 at 7:54 PM Atul Kedia <akedia at nd.edu> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > I want to restart a simulation to make it run for longer. It stopped at the final time I had set in my par file. Checkpointing is enabled in the par file.
>> >> >
>> >> > I have increased the final time in the par files at <sim_name>/output-0000/ and at <sim_name>/SIMFACTORY/par, and I tried the commands:
>> >> >
>> >> > simfactory/bin/sim cleanup <sim_name>
>> >> > followed by
>> >> > simfactory/bin/sim run <sim_name>
>> >> > I also added the line "jobid = 999999", as suggested at: http://lists.einsteintoolkit.org/pipermail/users/2018-September/006528.html
>> >> >
>> >> > and I get the error message:
>> >> > "Error: Internal error: Cannot submit simulation <sim_name> because it is already active"
>> >> >
>> >> > Another email thread I used for reference was this one: http://lists.einsteintoolkit.org/pipermail/users/2018-May/006281.html
>> >> >
>> >> > I am using ET_Mayer with the default simfactory that it comes with (simfactory 2, I think).
>> >> >
>> >> > Any help would be really appreciated.
>> >> >
>> >> > Thank you,
>> >> >
>> >> > --
>> >> > Atul Kedia
>> >> > PhD student,
>> >> > Physics department,
>> >> > University of Notre Dame.
>> >> > _______________________________________________
>> >> > Users mailing list
>> >> > Users at einsteintoolkit.org
>> >> > http://lists.einsteintoolkit.org/mailman/listinfo/users
>> >>
>> >>
>> >>
>> >> --
>> >> Erik Schnetter <schnetter at cct.lsu.edu>
>> >> http://www.perimeterinstitute.ca/personal/eschnetter/



-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/

