[Users] The simulation stops suddenly, problem with restarting from checkpoint, and with number of processors

Hassan Khalvati hassan.kh92 at gmail.com
Wed Sep 11 11:18:09 CDT 2019


Dear All,
I had a simulation that ran for nearly 5 days and stopped today for no
apparent reason, with no error message and no termination message.
The first thing I need help with is that I cannot find out why the
simulation stopped. The last lines of the simulation output are attached as
a text file.


The second problem is that I cannot restart from the checkpoint. There is
an error:

./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
Error: job id is negative
Aborting Simfactory.


I searched the email archives and did what Roland suggested there, adding
a line for the job id (jobid = 999999) to the properties.ini file, but I am
still getting errors:

./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56
Warning: job status is U
Warning: job status is U
Assigned restart id: 1
Warning: Too many used cores per node specified: specified ppn-used=56 (ppn is 28)
Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
Submit finished, job id is 8907



I then changed the lines for procs in the properties.ini file, and I am
again getting an error:


./simfactory/bin/sim submit the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par
Assigned restart id: 1
Executing submit command: exec nohup /home/cosmo/simulations/the-last-one/output-0001/SIMFACTORY/SubmitScript < /dev/null > /dev/null 2> /dev/null & echo $!
Submit finished, job id is 10517

Finally, I am confused about the "ppn", "procs", and related options in
Simfactory. I have attached my CPU information. The machine has two
14-core Xeon(R) E5-2680 CPUs, with 2 threads per core. My submission
command was:
./simfactory/bin/sim create-run the-last-one --parfile=par/bbh-2res-1mass-10sep-final.par --procs=56 --ppn-used=56
but the properties.ini file contains:
numprocs        = 4
nodeprocs       = 4
numthreads      = 14
I have also attached the properties.ini file. Is it using only 4 cores? I
looked through the Simfactory docs and the ET wiki, but I cannot get a clear
picture of how the options for the number of processors work. With the
command line mentioned above (--procs=56 --ppn-used=56), the simulation was
performing well, but I want to know whether it is using all the processors
on my system. I would be grateful if anyone could help me with any of these
issues.
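
As a rough cross-check, the values in the attached log (procs = 56,
numthreads = 14, numprocs = 4, nodeprocs = 4, ppn = 28, ppnused = 56,
nodes = 1, procsrequested = 28) are consistent with the arithmetic sketched
below. This is only an assumption inferred from those logged values, not
Simfactory's actual code; the function name `layout` and the formulas in it
are hypothetical. If this reading is right, numprocs = 4 would be the number
of MPI processes, each running 14 threads (56 threads in total), rather than
4 cores.

```python
import math


def layout(procs, num_threads, ppn, ppn_used):
    """Hypothetical reconstruction of the derived values seen in the log.

    All formulas here are assumptions chosen to reproduce the logged
    numbers; they are not taken from the Simfactory source.
    """
    num_procs = procs // num_threads      # assumed: total MPI processes
    node_procs = ppn_used // num_threads  # assumed: MPI processes per node
    nodes = math.ceil(procs / ppn_used)   # assumed: number of nodes needed
    procs_requested = nodes * ppn         # assumed: cores requested in total
    return {
        "numprocs": num_procs,
        "nodeprocs": node_procs,
        "nodes": nodes,
        "procsrequested": procs_requested,
        "total_threads": num_procs * num_threads,
        # True when more cores per node are used than the machine entry
        # declares, which is the condition the ppn-used warning reports.
        "ppn_used_exceeds_ppn": ppn_used > ppn,
    }


# Values from the create-run command and the attached log.
result = layout(procs=56, num_threads=14, ppn=28, ppn_used=56)
for name, value in result.items():
    print(f"{name:22s} = {value}")
```

With the inputs from the log, the derived numprocs, nodeprocs, nodes, and
procsrequested match the logged ones, and the last entry corresponds to the
"Too many used cores per node" warning (ppn-used=56 while ppn is 28).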

Attachments are:
parameter file,
properties.ini,
simulation-last-lines,
CPU info,
and the log.txt file.



Sincerely,
Hassan


-- 
Hassan Khalvati
Sharif University of Technology, Tehran
Hassan.Khalvati at physics.sharif.edu
Hassan.kh92 at gmail.com
-------------- next part --------------
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::Creating simulation the-last-one
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::Simulation directory: /home/cosmo/simulations/the-last-one
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::Simulation Properties:
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::[properties]
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::machine         = cosmo-Super-Server
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::simulationid    = simulation-the-last-one-cosmo-Super-Server-cosmo-Super-Server-cosmo-2019.09.08-13.49.21-40523
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::sourcedir       = /home/cosmo/ET/Cactus
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::configuration   = sim
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::configid        = config-sim-cosmo-Super-Server-home-cosmo-ET-Cactus
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::buildid         = build-sim-cosmo-Super-Server-cosmo-2019.05.07-09.43.14-31763
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::testsuite       = False
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::executable      = /home/cosmo/simulations/the-last-one/SIMFACTORY/exe/cactus_sim
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::optionlist      = /home/cosmo/simulations/the-last-one/SIMFACTORY/cfg/OptionList
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::submitscript    = /home/cosmo/simulations/the-last-one/SIMFACTORY/run/SubmitScript
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::runscript       = /home/cosmo/simulations/the-last-one/SIMFACTORY/run/RunScript
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::parfile         = /home/cosmo/simulations/the-last-one/SIMFACTORY/par/bbh-2res-1mass-10sep-final.par
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::
[LOG:2019-09-08 13:49:21] restart.create(simulationName, parfile)::Simulation the-last-one created
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::Creating new properties because this is an independant run, not a run following a submit
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::Determined the following properties
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::[properties]
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::machine         = cosmo-Super-Server
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::simulationid    = simulation-the-last-one-cosmo-Super-Server-cosmo-Super-Server-cosmo-2019.09.08-13.49.21-40523
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::sourcedir       = /home/cosmo/ET/Cactus
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::configuration   = sim
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::configid        = config-sim-cosmo-Super-Server-home-cosmo-ET-Cactus
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::buildid         = build-sim-cosmo-Super-Server-cosmo-2019.05.07-09.43.14-31763
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::testsuite       = False
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::executable      = /home/cosmo/simulations/the-last-one/SIMFACTORY/exe/cactus_sim
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::optionlist      = /home/cosmo/simulations/the-last-one/SIMFACTORY/cfg/OptionList
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::submitscript    = /home/cosmo/simulations/the-last-one/SIMFACTORY/run/SubmitScript
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::runscript       = /home/cosmo/simulations/the-last-one/SIMFACTORY/run/RunScript
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::parfile         = /home/cosmo/simulations/the-last-one/SIMFACTORY/par/bbh-2res-1mass-10sep-final.par
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::numprocs        = 4
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::nodeprocs       = 4
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::numthreads      = 14
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::hostname        = cosmo-Super-Server
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::ppn             = 28
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::ppnused         = 56
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::procsrequested  = 28
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::pbsSimulationName= the-last-one-00
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::cpufreq         = 
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::user            = cosmo
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::memory          = 0
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::nodes           = 1
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::procs           = 56
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::numsmt          = 1
[LOG:2019-09-08 13:49:21] restart.userRun(simulationName)::
[LOG:2019-09-08 13:49:21] self.makeActive()::Simulation the-last-one with restart-id 0 has been made active
[LOG:2019-09-08 13:49:21] self.run(debug)::Prepping for execution/run
[LOG:2019-09-08 13:49:21] checkpointing = self.PrepareCheckpointing(recover_id)::PrepareCheckpointing: max_restart_id: -1
[LOG:2019-09-08 13:49:21] self.run(debug)::Defined substitution properties for execution/run
[LOG:2019-09-08 13:49:21] self.run(debug)::{'SIMULATION_ID': 'simulation-the-last-one-cosmo-Super-Server-cosmo-Super-Server-cosmo-2019.09.08-13.49.21-40523', 'NODE_PROCS': 4, 'PPN_USED': 56, 'PPN': 28, 'CPUFREQ': None, 'USER': 'cosmo', 'RUNDIR': '/home/cosmo/simulations/the-last-one/output-0000', 'NODES': 1, 'SIMULATION_NAME': 'the-last-one', 'NUM_THREADS': 14, 'EXECUTABLE': '/home/cosmo/simulations/the-last-one/SIMFACTORY/exe/cactus_sim', 'PROCS_REQUESTED': 28, 'RESTART_ID': 0, 'NUM_SMT': 1, 'CONFIGURATION': 'sim', 'PROCS': 56, 'SUBMITSCRIPT': '/home/cosmo/simulations/the-last-one/SIMFACTORY/run/SubmitScript', 'MACHINE': 'cosmo-Super-Server', 'PARFILE': '/home/cosmo/simulations/the-last-one/output-0000/bbh-2res-1mass-10sep-final.par', 'SOURCEDIR': '/home/cosmo/ET/Cactus', 'HOSTNAME': 'cosmo-Super-Server', 'RUNDEBUG': 0, 'NUM_PROCS': 4, 'SCRIPTFILE': '/home/cosmo/simulations/the-last-one/SIMFACTORY/run/SubmitScript', 'MEMORY': '0', 'SHORT_SIMULATION_NAME': 'the-last-one-00'}
[LOG:2019-09-08 13:49:21] self.run(debug)::Executing run command: /home/cosmo/simulations/the-last-one/output-0000/SIMFACTORY/RunScript
-------------- next part --------------
A non-text attachment was scrubbed...
Name: properties.ini
Type: application/octet-stream
Size: 1179 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20190911/215e4b82/attachment-0004.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bbh-2res-1mass-10sep-final.par
Type: application/octet-stream
Size: 21926 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20190911/215e4b82/attachment-0005.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simulation-last-lines
Type: application/octet-stream
Size: 39788 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20190911/215e4b82/attachment-0006.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpu
Type: application/octet-stream
Size: 1547 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20190911/215e4b82/attachment-0007.obj 

