[Users] Checkpointing with Cactus/Simfactory
ian.hinder at aei.mpg.de
Wed Aug 31 16:39:04 CDT 2016
On 31 Aug 2016, at 23:02, dumsani <g14n8326 at campus.ru.ac.za> wrote:
> Hi All,
> I have some very long BH simulations to run and I'd like to checkpoint
> for these. I haven't really done checkpointing before. But what I
> know is that checkpointing information can be specified in the parameter
> file (for use by Cactus), and also that Simfactory
> does seem to have some stuff to do with or handle checkpointing
> ("restart-id", etc.). Of course, scheduling systems (e.g. PBSPro) at
> HPCs would have support for checkpointing but I don't want to use that.
> Probably it is only best to use that to set the walltime.
> So, my main question is: Assuming I set a maximum walltime of 12 hours,
> and I set my simulation to dump checkpoints every 3hrs (in
> walltime units), how do I *restart* my job at the end of the 12 hrs
> using Simfactory in a way that the simulation starts off from the last
> checkpoint it dropped before terminating? What extra command line
> options should I pass to the submit command of Simfactory?
You don't need anything extra; just "sim submit <simulationname>".
– If the job has completed already, the next job will be queued.
– If the job is queued or running, the next job will be queued with a dependency, so that it only starts when the previous one finishes. (The dependency logic is in the submit script of the machine; it's possible that the machine you are using does not have this defined. Look for references to "chain" in the submit scripts of other machines in case you need to add this to your own.)
You can also use
TerminationTrigger::max_walltime = @WALLTIME_HOURS@
TerminationTrigger::on_remaining_walltime = 30 # minutes
TerminationTrigger::output_remtime_every_minutes = 30
This will cause Cactus to cleanly terminate 30 minutes before the end of the job's walltime (as a margin). If you additionally use
IO::checkpoint_on_terminate = yes
then you will get a checkpoint written. Without this, your job will be unceremoniously killed by the scheduler, leaving you with up to 3 hours of wasted computer time, possibly corrupted output files, and duplicate data.
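For the scenario in the question (a checkpoint every 3 hours of walltime, and each new job continuing from the latest checkpoint), the parameter file also needs periodic checkpointing and recovery switched on. A minimal sketch, assuming the standard IOUtil parameters and Carpet's HDF5 checkpointing; the "../checkpoints" directory is only a convention seen in Einstein Toolkit example parameter files, so check it against your own setup:
IO::checkpoint_every_walltime_hours = 3   # periodic checkpoint, measured in walltime
IO::checkpoint_dir = "../checkpoints"     # assumed location, shared by all restarts
IO::recover_dir    = "../checkpoints"     # look for checkpoints in the same place
IO::recover        = "autoprobe"          # recover if a checkpoint exists, else start from initial data
IOHDF5::checkpoint = yes                  # assumes CarpetIOHDF5 writes the checkpoints
With a 12-hour walltime and the 30-minute margin above, this would give periodic checkpoints after roughly 3, 6 and 9 hours, plus a final one when Cactus terminates at about 11.5 hours.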
It is also convenient to use
TerminationTrigger::termination_from_file = yes
TerminationTrigger::termination_file = "terminate.txt"
TerminationTrigger::create_termination_file = yes
This will create a file called "terminate.txt" in the output directory. If you add a "1" to this file, Cactus will terminate immediately (and checkpoint, if you have set checkpoint_on_terminate as above). You can then resubmit the simulation if you like. This allows you to easily stop and start simulations without losing any runtime.