[Users] The simulation stops suddenly, problem with restarting from checkpoint, and with number of processors

Hassan Khalvati hassan.kh92 at gmail.com
Thu Sep 12 05:39:04 CDT 2019


Dear Erik,
Thank you for your reply, but there are no *.out or *.err files in the
output directory or anywhere else. Is there an option that I should have
activated in order to save these files?
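
For reference, a minimal sketch of how such a search could be done, so the
result can be reproduced. The directory is the simulation path that appears
in the submit output quoted below; the function name find_log_files is
only illustrative and is not part of Simfactory.

```python
from pathlib import Path


def find_log_files(sim_dir, suffixes=(".out", ".err")):
    """Return every file below sim_dir whose extension is in suffixes, sorted."""
    root = Path(sim_dir)
    if not root.is_dir():
        # Nothing to search if the directory does not exist.
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes
    )


if __name__ == "__main__":
    # Path taken from the submit output in this thread; adjust as needed.
    for path in find_log_files("/home/cosmo/simulations/the-last-one"):
        print(path, path.stat().st_size, "bytes")
```

An empty result means no file with either extension exists anywhere under
the simulation directory, including the output-NNNN subdirectories.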

Hassan

On Wed, 11 Sep 2019 at 21:54, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> Hassan
>
> The last lines of the simulation output might not include the error
> message. There should be two files in the output directory, one ending
> in *.out, the other ending in *.err. The latter might have an actual
> error message.
>
> To see whether all cores are used, you can look at the startup output
> of Carpet. This would be near the beginning of the *.out file above
> (within the first 1000 lines or so). To get more detailed output, you
> can activate the thorn "SystemTopology" in your parameter file. This
> will provide more details regarding cores and threads in your output.
>
> -erik
>
>
> On Wed, Sep 11, 2019 at 12:19 PM Hassan Khalvati <hassan.kh92 at gmail.com>
> wrote:
> >
> > Dear All,
> > I had a simulation that ran for nearly 5 days and stopped today for no
> > apparent reason, with no error and no termination message.
> > The first thing I need help with is that I cannot find out why the
> > simulation stopped. The last lines of the simulation output are
> > attached as a text file.
> >
> >
> > The second problem is that I cannot restart from the checkpoint. There
> > is an error:
> >
> >  ./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> > Error: job id is negative
> > Aborting Simfactory.
> >
> >
> > I looked in the email archives and did what Roland suggested, adding a
> > line for the job id (jobid = 999999) to the properties.ini file, but I
> > am still getting errors:
> >
> > ./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> > Warning: job status is U
> > Warning: job status is U
> > Assigned restart id: 1
> > Warning: Too many used cores per node specified: specified ppn-used=56 (ppn is 28)
> > Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> > Submit finished, job id is 8907
> >
> >
> >
> > I changed the procs lines in the properties.ini file, and I am again
> > getting an error:
> >
> >
> > ./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par
> > Assigned restart id: 1
> > Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> > Submit finished, job id is 10517
> >
> > And finally, I am confused about the "ppn", "procs", and related
> > options in Simfactory. I have attached my CPU information. It is a
> > dual 14-core Xeon(R) CPU E5-2680 system, with 2 threads per core. My
> > submission command was:
> > ./simfactory/bin/sim create-run the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56 --ppn-used=56
> > but in the properties.ini file, it is mentioned that:
> > numprocs        = 4
> > nodeprocs       = 4
> > numthreads      = 14
> > I have also attached the properties.ini file. Is it using only 4 cores?
> > I looked in the Simfactory docs and on the ET wiki, but I cannot get a
> > clear picture of how the options for the number of processors work.
> > However, with the command line mentioned above (--procs=56
> > --ppn-used=56), the simulation was performing well. I want to know
> > whether it is using all the processors on my system or not. I would be
> > grateful if anyone could help me with each of these issues.
> >
> > Attachments are:
> > parameter file,
> > properties.ini,
> > simulation-last-lines,
> > CPU info,
> > and the log.txt file.
> >
> >
> >
> > Sincerely,
> > Hassan
> >
> >
> > --
> >
> > Hassan Khalvati
> > Sharif University of Technology, Tehran
> > Hassan.Khalvati at physics.sharif.edu
> > Hassan.kh92 at gmail.com
> >
> > _______________________________________________
> > Users mailing list
> > Users at einsteintoolkit.org
> > http://lists.einsteintoolkit.org/mailman/listinfo/users
>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>
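
On the core counts quoted above: one reading that is consistent with the
numbers in properties.ini is that numprocs counts MPI processes and
numthreads counts threads per process, so that numprocs = procs / numthreads
and nodeprocs = ppn-used / numthreads. This relation is an assumption
inferred from the quoted values, not something confirmed in this thread, and
the function name layout is only illustrative. A small sketch of the
arithmetic:

```python
def layout(procs, num_threads, ppn_used):
    """Split requested core counts into processes and threads per process.

    Assumes (not confirmed in this thread) that
    numprocs = procs / numthreads and nodeprocs = ppn_used / numthreads.
    """
    if procs % num_threads or ppn_used % num_threads:
        raise ValueError("core counts must be multiples of num_threads")
    return {
        "numprocs": procs // num_threads,
        "nodeprocs": ppn_used // num_threads,
        "numthreads": num_threads,
    }


# Hardware described in the quoted message: 2 CPUs x 14 cores x 2 threads.
physical_cores = 2 * 14
hardware_threads = physical_cores * 2

if __name__ == "__main__":
    result = layout(procs=56, num_threads=14, ppn_used=56)
    print(result)
    print("threads in total:", result["numprocs"] * result["numthreads"])
    print("physical cores:", physical_cores, "hardware threads:", hardware_threads)
```

Under that assumption, numprocs = 4 with numthreads = 14 would mean 4 x 14 =
56 threads in total, which matches the 56 hardware threads of the machine
rather than only 4 cores. It is also twice the 28 physical cores, which
would fit the "ppn is 28" warning in the quoted submit output.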


-- 

Hassan Khalvati