Dear friends,

We are trying to run the examples listed on the website to test whether they work on our HPC cluster.


It goes well with a simple TOV equation run, which takes 2-5 minutes. But when we simulate the GW150914 BH merger, or solve the TOV equation with high precision over a long time, the run stops at some iteration without any further output, not even an error. What really confuses us is that different tests stop at different points. Could you please help us find out what goes wrong with the simulation? I will attach the log.txt and parameter files to this mail. Thanks for your time!

Here we used a partition with 144 nodes. Sometimes, with specific --procs and --num-threads values in the shell file, the simulation finished successfully. At other times it ran into the problem above. In the two neutron star output files, the job stopped at two different iteration points and was cancelled either due to a timeout or by hand.


Can you tell us the simfactory command line that you used to submit the simulation?

From the log file, it looks like the command line might be wrong, or the machine might not be set up correctly in simfactory.

[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::numprocs        = 8
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::nodeprocs       = 8
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::numthreads      = 18
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::hostname        = b01.hpc.pku.edu.cn
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::ppn             = 144
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::ppnused         = 144
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::procsrequested  = 144
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::pbsSimulationName= GW150914-0000
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::cpufreq         =
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::user            = 1801110076
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::memory          = 0
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::nodes           = 1
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::procs           = 144
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::numsmt          = 1

In particular, ppn = 144 looks wrong.
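
As a rough sanity check of the numbers in the log, here is a minimal Python sketch. It assumes the usual simfactory convention that procs is the total number of cores, numthreads is the number of threads per process, and ppnused is the number of cores used on each node. The function names parse_layout and derive_layout are hypothetical helpers for illustration and are not part of simfactory.

```python
import math
import re

# Relevant lines copied from the log above.
LOG = """\
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::numprocs        = 8
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::nodeprocs       = 8
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::numthreads      = 18
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::ppn             = 144
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::ppnused         = 144
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::nodes           = 1
[LOG:2019-06-01 16:07:23] restart.userRun(simulationName)::procs           = 144
"""


def parse_layout(log_text):
    """Collect integer 'key = value' pairs from simfactory-style log lines."""
    pattern = re.compile(r"::(\w+)\s*=\s*(\d+)\s*$")
    values = {}
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def derive_layout(values):
    """Recompute the layout from procs, numthreads and ppnused.

    Assumed relations (not taken from the simfactory source):
      MPI processes      = procs / numthreads
      nodes              = ceil(procs / ppnused)
      processes per node = ppnused / numthreads
    """
    return {
        "numprocs": values["procs"] // values["numthreads"],
        "nodes": math.ceil(values["procs"] / values["ppnused"]),
        "nodeprocs": values["ppnused"] // values["numthreads"],
    }


layout = parse_layout(LOG)
derived = derive_layout(layout)
for key, value in derived.items():
    status = "matches log" if layout[key] == value else "differs from log"
    print(f"{key:10s} derived={value:4d} logged={layout[key]:4d} ({status})")
```

Under these assumptions the log is internally consistent: 8 MPI processes with 18 threads each use 144 cores, and with ppn = 144 all of them are placed on a single node. That is only plausible if one node really has 144 cores, which is why the ppn value is the thing to check.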

Erik, can you confirm?

If it's trying to run on too few nodes, it will run out of memory, as Steve suggested.

