[Users] The simulation stops suddenly, problem with restarting from checkpoint, and with number of processors

Haas, Roland rhaas at illinois.edu
Thu Oct 3 10:43:09 CDT 2019


Hello Hassan, Erik,

does this issue still persist? Since there are no *.out and no *.err
files, would you mind providing a list of the files that are present?

I.e., the output of something like:

ls -lR /home/cosmo/simulations/the-last-one/output-0001/

as well as the simfactory log file
(/home/cosmo/simulations/the-last-one/log.txt), please?

Note that these files should always be saved. Their names are given in
the SubmitScript (exactly how depends on the queuing system used), so it
would help if you could include the SubmitScript and RunScript files
as well.
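
If pasting raw ls output is awkward, the same listing can be sketched in
Python. This is only a rough illustration: the helper names list_files and
stdout_stderr_files are made up here, the path is the one from this thread,
and it assumes the stdout/stderr files really end in .out and .err.

```python
from pathlib import Path


def list_files(root):
    """Return sorted (relative path, size in bytes) pairs for all files below root."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


def stdout_stderr_files(listing):
    """Pick out the entries whose names end in .out or .err."""
    return [name for name, _ in listing if name.endswith((".out", ".err"))]


if __name__ == "__main__":
    # Output directory named in this thread; adjust for other simulations.
    out_dir = "/home/cosmo/simulations/the-last-one/output-0001"
    listing = list_files(out_dir)
    for name, size in listing:
        print(f"{size:12d}  {name}")
    print("stdout/stderr candidates:", stdout_stderr_files(listing) or "none found")
```

An empty candidate list would confirm that no *.out or *.err file exists
anywhere below the output directory.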

Yours,
Roland

> Dear Erik,
> Thank you for your reply, but there are no *.out or *.err files in the
> output directory or anywhere else. Is there an option that I should have
> activated to save these files?
> 
> Hassan
> 
> On Wed, 11 Sep 2019 at 21:54, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> 
> > Hassan
> >
> > The last lines of the simulation output might not include the error
> > message. There should be two files in the output directory, one ending
> > in *.out, the other ending in *.err. The latter might have an actual
> > error message.
> >
> > To see whether all cores are used, you can look at the startup output
> > of Carpet. This would be near the beginning of the *.out file above
> > (within the first 1000 lines or so). To get more detailed output, you
> > can activate the thorn "SystemTopology" in your parameter file. This
> > will provide more details regarding cores and threads in your output.
> >
> > -erik
> >
> >
> > On Wed, Sep 11, 2019 at 12:19 PM Hassan Khalvati <hassan.kh92 at gmail.com>
> > wrote:  
> > >
> > > Dear All,
> > > I had a simulation running for nearly 5 days, and it stopped today
> > > for no apparent reason, with no errors and no termination message.
> > > The first thing I need help with is that I cannot find out why the
> > > simulation stopped. The last lines of the simulation output are
> > > attached as a text file.
> > >
> > >
> > > The second problem is that I cannot restart from the checkpoint.
> > > There is an error:
> > >
> > >  ./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> > > Error: job id is negative
> > > Aborting Simfactory.
> > >
> > >
> > > I searched the email archives and did what Roland suggested, adding
> > > a jobid line (jobid = 999999) to the properties.ini file, but I am
> > > still getting errors:
> > >
> > > ./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> > > Warning: job status is U
> > > Warning: job status is U
> > > Assigned restart id: 1
> > > Warning: Too many used cores per node specified: specified ppn-used=56 (ppn is 28)
> > > Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> > > Submit finished, job id is 8907
> > >
> > >
> > >
> > > I changed the procs lines in the properties.ini file, and I am again
> > > getting an error:
> > >
> > >
> > > ./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par
> > > Assigned restart id: 1
> > > Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> > > Submit finished, job id is 10517
> > >
> > > And finally, I am confused about the "ppn", "procs", and related
> > > options in Simfactory. I have attached my CPU information. It is a
> > > dual 14-core Xeon(R) CPU E5-2680 system, with 2 threads per core. My
> > > submission command was:
> > > ./simfactory/bin/sim create-run the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56 --ppn-used=56
> > > but the properties.ini file contains:
> > > numprocs        = 4
> > > nodeprocs       = 4
> > > numthreads      = 14
> > > I have also attached the properties.ini file. Is it using only 4 cores?
> > > I looked in the Simfactory docs and also the ET wiki, but I cannot get
> > > a clear picture of how the processor-count options work. However, with
> > > the same command line mentioned above (--procs=56 --ppn-used=56), the
> > > simulation was performing well. I want to know whether it is using the
> > > total number of processors on my system or not. I would be grateful
> > > if anyone could help me with each of these issues.
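
A quick arithmetic check on the numbers quoted above may help. This is only a
sketch, and it rests on an assumption that is not confirmed in this thread:
that numprocs is the number of MPI processes and numthreads is the number of
OpenMP threads per process. The function names are made up for illustration.

```python
def total_threads(numprocs, numthreads):
    """Threads in total if each of numprocs processes runs numthreads threads."""
    return numprocs * numthreads


def cpu_count(sockets, cores_per_socket, threads_per_core):
    """Number of CPUs the machine exposes for a given threads-per-core setting."""
    return sockets * cores_per_socket * threads_per_core


# Values quoted from properties.ini (interpretation assumed, see above).
requested = total_threads(numprocs=4, numthreads=14)

# Two 14-core CPUs, counted without and with 2 hardware threads per core.
physical_cores = cpu_count(2, 14, 1)
logical_cpus = cpu_count(2, 14, 2)

print(requested, physical_cores, logical_cpus)  # prints: 56 28 56
```

Under that assumption, the properties.ini values add up to the 56 given
with --procs=56, which matches the logical CPU count, while the 28 physical
cores match the "ppn is 28" in the warning.
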
> > >
> > > Attachments are:
> > > parameter file,
> > > properties.ini,
> > > simulation-last-lines,
> > > CPU info,
> > > and the log.txt file.
> > >
> > >
> > >
> > > Sincerely,
> > > Hassan
> > >
> > >
> > > --
> > >
> > > Hassan Khalvati
> > > Sharif University of Technology, Tehran
> > > Hassan.Khalvati at physics.sharif.edu
> > > Hassan.kh92 at gmail.com
> > >
> > > _______________________________________________
> > > Users mailing list
> > > Users at einsteintoolkit.org
> > > http://lists.einsteintoolkit.org/mailman/listinfo/users  
> >
> >
> >
> > --
> > Erik Schnetter <schnetter at cct.lsu.edu>
> > http://www.perimeterinstitute.ca/personal/eschnetter/
> >  
> 
> 



-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .