[Users] The simulation stops suddenly, problem with restarting from checkpoint, and with number of processors

Erik Schnetter schnetter at cct.lsu.edu
Wed Sep 11 12:24:36 CDT 2019


Hassan

The last lines of the simulation output might not include the error
message. There should be two files in the output directory, one ending
in *.out, the other ending in *.err. The latter might have an actual
error message.
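
A minimal sketch of that check, assuming Python 3 is available on the machine. The helper names (`find_logs`, `tail`, `show_logs`) are made up for illustration and are not part of Simfactory or Cactus. It lists every *.out and *.err file below a directory and prints the last lines of each, so an error message that only went to the *.err file is easier to spot:

```python
from collections import deque
from pathlib import Path


def find_logs(output_dir):
    """Return all *.out and *.err files found anywhere below output_dir, sorted."""
    root = Path(output_dir)
    return sorted(p for pattern in ("*.out", "*.err") for p in root.rglob(pattern))


def tail(path, n=20):
    """Return the last n lines of a text file (fewer if the file is shorter)."""
    with open(path, errors="replace") as f:
        return list(deque(f, maxlen=n))


def show_logs(output_dir, n=20):
    """Print the last n lines of every *.out and *.err file below output_dir."""
    for log in find_logs(output_dir):
        print(f"==== {log} ====")
        print("".join(tail(log, n)), end="")
```

For example, `show_logs("/home/cosmo/simulations/the-last-one/output-0001")` would use the restart directory that appears in the quoted submit output below; whether the files sit directly in that directory or in a subdirectory is an assumption, which is why the search is recursive.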

To see whether all cores are used, you can look at the startup output
of Carpet. This would be near the beginning of the *.out file above
(within the first 1000 lines or so). To get more detailed output, you
can activate the thorn "SystemTopology" in your parameter file. This
will provide more details regarding cores and threads in your output.
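
A rough sketch of how one might pull the relevant startup lines out of the *.out file, again in Python. The exact wording of Carpet's startup messages is not quoted in this thread, so the filter only assumes that the lines mention "Carpet" together with "process" or "thread"; the function name and the keywords are hypothetical and may need adjusting:

```python
from itertools import islice


def carpet_startup_lines(out_file, max_lines=1000, keywords=("process", "thread")):
    """Return lines among the first max_lines that mention Carpet and any keyword."""
    hits = []
    with open(out_file, errors="replace") as f:
        # Only the beginning of the file is read, since the startup output
        # is expected near the top.
        for line in islice(f, max_lines):
            lower = line.lower()
            if "carpet" in lower and any(k in lower for k in keywords):
                hits.append(line.rstrip("\n"))
    return hits
```

Printing the returned lines should show how many processes and threads Carpet reports. To get the additional SystemTopology output, the thorn would normally be activated through an ActiveThorns line in the parameter file.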

-erik


On Wed, Sep 11, 2019 at 12:19 PM Hassan Khalvati <hassan.kh92 at gmail.com> wrote:
>
> Dear All,
> I had a simulation that ran for nearly 5 days and stopped today for no apparent reason: no error message and no termination notice.
> The first thing I need help with is that I cannot find out why the simulation stopped. The last lines of the simulation output are attached as a text file.
>
>
> The second problem is that I cannot restart from the checkpoint. There is an error:
>
>  ./simfactory/bin/sim submit the-last-one   --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> Error: job id is negative
> Aborting Simfactory.
>
>
> I looked through the email archives and did what Roland suggested, adding a jobid line (jobid = 999999) to the properties.ini file, but I am still getting errors:
>
> ./simfactory/bin/sim submit the-last-one   --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> Warning: job status is U
> Warning: job status is U
> Assigned restart id: 1
> Warning: Too many used cores per node specified: specified ppn-used=56 (ppn is 28)
> Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> Submit finished, job id is 8907
>
>
>
> I changed the procs lines in the properties.ini file, and I am again getting an error:
>
>
> ./simfactory/bin/sim submit the-last-one   --parfile=par/bbh-2res-1mass-10sep-final.par
> Assigned restart id: 1
> Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> Submit finished, job id is 10517
>
> Finally, I am confused about the "ppn, procs, and ..." options in Simfactory. I have attached my CPU information. It is a dual 14-core Xeon(R) CPU E5-2680 with 2 threads per core. My submission command was:
> ./simfactory/bin/sim create-run the-last-one   --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56 --ppn-used=56
> but the properties.ini file contains:
> numprocs        = 4
> nodeprocs       = 4
> numthreads      = 14
> I have also attached the properties.ini file. Is it using only 4 cores? I looked through the Simfactory docs and the ET wiki, but I cannot get a clear picture of how the options for the number of processors work. With the command line mentioned above (--procs=56 --ppn-used=56), the simulation was performing well, but I want to know whether it is using all of the processors on my system. I would be grateful if anyone could help me with each of these issues.
>
> Attachments are:
> parameter file,
> properties.ini,
> simulation-last-lines,
> CPU info,
> and the log.txt file.
>
>
>
> Sincerely,
> Hassan
>
>
> --
>
> Hassan Khalvati
> Sharif University of Technology, Tehran
> Hassan.Khalvati at physics.sharif.edu
> Hassan.kh92 at gmail.com
>
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users



-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/

