[Users] The simulation stops suddenly, problem with restarting from checkpoint, and with number of processors

Haas, Roland rhaas at illinois.edu
Wed Oct 23 10:24:45 CDT 2019


Hello Hassan,

Using create-run is likely not the best way to run these simulations.
create-run is really intended for interactive use: it sends stdout and
stderr to the screen and never to a file, which is why there are no
*.out and *.err files.

The command to use to run production simulations is "create-submit" (or
just "submit") which will start the simulation in the background.
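
As a rough sketch, a background submission of the simulation discussed
in this thread could look like the following. The simulation name,
parameter file and --procs value are copied from the commands quoted
further down. Plain "submit" is used here on the assumption that the
simulation directory already exists. The snippet only prints the
command (a dry run); it does not start anything.

```shell
# Dry run: build and print the submit command instead of executing it.
# Name, parameter file and --procs are taken from the quoted messages;
# adjust them for your own setup.
sim=./simfactory/bin/sim
submit_cmd="$sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56"
echo "$submit_cmd"
# To actually submit, run the printed command from the Cactus directory.
```

After a real submission, stdout and stderr should end up in files
ending in *.out and *.err in the output directory rather than on
screen.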

I cannot really make a good guess as to why the simulation ran faster /
slower using the options given as there are usually too many things
interacting.

Looking at the RunScript files, nothing seems to be obviously wrong,
though OpenMPI's core binding could affect things. Please see
https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.6
for a description of the options available. Usually you will want
something like --bind-to-socket or --bind-to-none (the exact name of
the option depends a bit on the OpenMPI version used).
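
To illustrate the two spellings, here is a hedged sketch of what the
mpirun line in a RunScript might look like. The rank count, the
executable name ./cactus_sim and the way the parameter file is passed
are placeholders, not taken from the attached RunScript. The
--report-bindings option asks mpirun to print where each rank was
bound, which helps to check the result. The snippet only prints the
two command lines.

```shell
# Placeholder values: adjust rank count, executable and parameter file.
np=4
exe=./cactus_sim
par=par/bbh-2res-1mass-10sep-final.par

# Open MPI 1.6-style spelling (one hyphenated option).
old_style="mpirun -np $np --bind-to-socket --report-bindings $exe $par"
# Newer Open MPI releases take the binding target as a separate argument.
new_style="mpirun -np $np --bind-to socket --report-bindings $exe $par"

echo "$old_style"
echo "$new_style"
```

Which of the two forms is accepted depends on the installed version;
"mpirun --help" lists the options that are actually available.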

I am also not sure (without seeing at least the out files and possibly
the ini files for the machine) how many threads / MPI ranks were used
without the --num-threads option. If it ended up using only threads
and a single MPI rank, then a speed difference of a factor of 2 or so
is not unlikely.
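
The arithmetic behind this can be sketched as follows, assuming that
Simfactory derives the number of MPI ranks as the total core count
(--procs) divided by the threads per rank (--num-threads). That
assumption would be consistent with the properties.ini values quoted
below (numprocs = 4, numthreads = 14, for --procs=56), but it is not
confirmed by the out files. The helper name ranks_for is made up for
this illustration.

```shell
# Assumption: MPI ranks = total cores (--procs) / threads per rank (--num-threads).
ranks_for() {
    echo $(( $1 / $2 ))
}

hybrid=$(ranks_for 56 14)        # 56 cores, 14 threads per rank
threads_only=$(ranks_for 56 56)  # 56 cores, all used as threads of one rank

echo "14 threads per rank: $hybrid MPI ranks"
echo "56 threads per rank: $threads_only MPI rank(s)"
```

The second case is the "just threads and a single MPI rank" layout
mentioned above.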

Are you running this on a Mac or a Linux box? 

Yours,
Roland

> Dear Roland,
> sorry for the delay.
> I could resume the simulation based on your suggestion to add a line
> for jobid, but with =999985, not =999999.
> The *.err and *.out files are still missing. Please find the files attached.
> 
> Here is the script I used to run the simulation
> 
>   ./simfactory/bin/sim create-run the-last-one \
>   --parfile=par/bbh-2res-1mass-10sep-final.par --ppn-used=56 \
>   --num-threads=14 --num-smt=2
> 
> 
> There is another problem besides this. Before, I used ppn-used=56 and
> procs=56 in the command above, and "htop" showed all 56 threads at 100%
> usage. But when I changed to what is written in the line above
> ( --ppn-used=56 --num-threads=14 --num-smt=2 ), "htop" showed that only
> a portion of the cores were being used and none was running at 100%.
> However, the simulation was *running more than four times faster*.
> I am still confused about specifying the number of cores.
> 
> Bests,
> Hassan
> 
> On Thu, 3 Oct 2019 at 19:13, Haas, Roland <rhaas at illinois.edu> wrote:
> 
> > Hello Hassan, Erik,
> >
> > does this issue still persist? Since there are no *.out and no *.err
> > file, would you mind providing a list of the files that are present?
> >
> > I.e. the output of something like:
> >
> > ls -lR /home/cosmo/simulations/the-last-one/output-0001/
> >
> > as well as the simfactory log file
> > (/home/cosmo/simulations/the-last-one/log.txt), please?
> >
> > Note that the files should always be saved. Their names are given in
> > the SubmitScript (exactly how depends on the queuing system used), so
> > it may be good if you could include the SubmitScript and RunScript
> > files as well.
> >
> > Yours,
> > Roland
> >  
> > > Dear Erik,
> > > Thank you for your reply, but there are no *.out or *.err files in the
> > > output directory or anywhere else. Is there an option that I should have
> > > activated to save these files?
> > >
> > > Hassan
> > >
> > > On Wed, 11 Sep 2019 at 21:54, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> > >  
> > > > Hassan
> > > >
> > > > The last lines of the simulation output might not include the error
> > > > message. There should be two files in the output directory, one ending
> > > > in *.out, the other ending in *.err. The latter might have an actual
> > > > error message.
> > > >
> > > > To see whether all cores are used, you can look at the startup output
> > > > of Carpet. This would be near the beginning of the *.out file above
> > > > (within the first 1000 lines or so). To get more detailed output, you
> > > > can activate the thorn "SystemTopology" in your parameter file. This
> > > > will provide more details regarding cores and threads in your output.
> > > >
> > > > -erik
> > > >
> > > >
> > > > On Wed, Sep 11, 2019 at 12:19 PM Hassan Khalvati <hassan.kh92 at gmail.com> wrote:
> > > > >
> > > > > Dear All,
> > > > > I had a simulation running for nearly 5 days, and it stopped today for
> > > > > no apparent reason, with no errors and no termination message.
> > > > > The first thing I need help with is that I cannot find out why the
> > > > > simulation stopped. The last lines of the simulation output are
> > > > > attached as a text file.
> > > > >
> > > > >
> > > > > The second problem is that I cannot restart from the checkpoint.
> > > > > There is an error:
> > > > >
> > > > > ./simfactory/bin/sim submit the-last-one \
> > > > >   --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> > > > > Error: job id is negative
> > > > > Aborting Simfactory.
> > > > >
> > > > >
> > > > > I looked in the email archives and did what Roland suggested, adding
> > > > > a line for jobid (jobid = 999999) to the properties.ini file, but I
> > > > > am still getting errors:
> > > > >
> > > > > ./simfactory/bin/sim submit the-last-one \
> > > > >   --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
> > > > > Warning: job status is U
> > > > > Warning: job status is U
> > > > > Assigned restart id: 1
> > > > > Warning: Too many used cores per node specified: specified ppn-used=56 (ppn is 28)
> > > > > Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> > > > > Submit finished, job id is 8907
> > > > >
> > > > >
> > > > >
> > > > > I changed the lines for procs in the properties.ini file, and I am
> > > > > getting an error again:
> > > > >
> > > > >
> > > > > ./simfactory/bin/sim submit the-last-one \
> > > > >   --parfile=par/bbh-2res-1mass-10sep-final.par
> > > > > Assigned restart id: 1
> > > > > Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
> > > > > Submit finished, job id is 10517
> > > > >
> > > > > And finally, I am confused about the options for the "ppn, procs, and
> > > > > ..." numbers in Simfactory. I have attached my CPU information. It is
> > > > > a dual 14-core Xeon(R) CPU E5-2680 with 2 threads per core. My
> > > > > submission command was:
> > > > > ./simfactory/bin/sim create-run the-last-one \
> > > > >   --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56 --ppn-used=56
> > > > > but in the properties.ini file, it is mentioned that:
> > > > > numprocs        = 4
> > > > > nodeprocs       = 4
> > > > > numthreads      = 14
> > > > > I have also attached the properties.ini file. Is it using only 4 cores?
> > > > > I looked in the Simfactory docs and also the ET wiki, but I cannot get a
> > > > > clear picture of how the options for the number of processors work.
> > > > > However, with the command line mentioned above (--procs=56
> > > > > --ppn-used=56) the simulation was performing well. I want to know
> > > > > whether it is using all the processors on my system or not. I would be
> > > > > grateful if anyone could help me with each of these issues.
> > > > >
> > > > > Attachments are:
> > > > > parameter file,
> > > > > properties.ini,
> > > > > simulation-last-lines,
> > > > > CPU info,
> > > > > and the log.txt file.
> > > > >
> > > > >
> > > > >
> > > > > Sincerely,
> > > > > Hassan
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Hassan Khalvati
> > > > > Sharif University of Technology, Tehran
> > > > > Hassan.Khalvati at physics.sharif.edu
> > > > > Hassan.kh92 at gmail.com
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Erik Schnetter <schnetter at cct.lsu.edu>
> > > > http://www.perimeterinstitute.ca/personal/eschnetter/
> > > >  
> > >
> > >  
> >
> >
> >
> > --
> > My email is as private as my paper mail. I therefore support encrypting
> > and signing email messages. Get my PGP key from http://pgp.mit.edu .
> >  
> 
> 



-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .