[Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station

白济民 beki-cat at sjtu.edu.cn
Fri May 28 11:36:38 CDT 2021


Hi Roland,
Thanks for your detailed explanation. I'm now wondering how I could address the issue of mismatched MPI stacks between compiling and running the simulation, and I'm looking forward to your help. I'm new to running MPI programs. I hope I can succeed in running the example this weekend and reproduce the desired result.
Yours sincerely, 
Jimmy


----- Original Message -----
From: "白济民" <beki-cat at sjtu.edu.cn>
To: "users" <users at einsteintoolkit.org>
Cc: "1614603292" <1614603292 at qq.com>
Sent: Thursday, May 27, 2021 9:19:10 PM
Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station

Hi Roland,
Thanks for your detailed explanation! I'm now wondering how I could address the issue of mismatched MPI stacks between compiling and running the simulation, and I'm looking forward to your help.
Yours sincerely,
Jimmy

----- Original Message -----
From: "Roland Haas" <rhaas at illinois.edu>
To: "白济民" <beki-cat at sjtu.edu.cn>
Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
Sent: Tuesday, May 25, 2021 10:56:12 PM
Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station

Hello Jimmy,

the --procs and --num-threads options are only used by the submit (and
create-submit, run, and create-run) sub-commands. Using them with the
"build" command has no effect.

"-Roe" is a raw Cactus option (see
http://einsteintoolkit.org/usersguide/UsersGuide.html#x1-176000D, though -R still needs to be documented). It must be added to the "RunScript" file in configs/sim/RunScript just after the "@EXECUTABLE@" placeholder, i.e.:

mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@

becomes

mpirun -np @NUM_PROCS@ @EXECUTABLE@ -Roe -L 3 @PARFILE@
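If you prefer to make that edit from the shell, here is a minimal sketch. It works on a scratch copy of the run-script line shown above; for a real configuration you would edit configs/sim/RunScript in place instead:

```shell
# Scratch copy of the run-script line shown above; when doing this for
# real, point sed at configs/sim/RunScript instead.
cat > RunScript <<'EOF'
mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
EOF

# Insert -Roe right after the @EXECUTABLE@ placeholder (keeps a .bak backup):
sed -i.bak 's/@EXECUTABLE@/@EXECUTABLE@ -Roe/' RunScript
cat RunScript
```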

Simfactory documentation can be found here:

http://simfactory.org/info/documentation/

and

https://docs.einsteintoolkit.org/et-docs/Simulation_Factory_Advanced_Tutorial

though both are somewhat difficult to use.

However, if you have only ever used --procs and --num-threads with
build, then that is the reason the code fails: you must use --procs
and --num-threads with the submit command.
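To make the arithmetic concrete (numbers taken from the commands quoted below; the create-submit line is shown commented out as a sketch): simfactory starts procs / num-threads MPI ranks, so these two options together determine whether you get more than one rank.

```shell
# simfactory starts (procs / num-threads) MPI ranks; with the values
# used in this thread that gives 52 / 26 = 2 ranks.
procs=52
num_threads=26
echo "MPI ranks: $((procs / num_threads))"

# The options belong on submit/create-submit, not build, e.g.:
#   ./simfactory/bin/sim create-submit bns_merger \
#       --procs 52 --num-threads 26 \
#       --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --walltime 24:0:0
```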

Looking at the start of the error file that you included:

--8<--
export CACTUS_NUM_PROCS=2
export CACTUS_NUM_THREADS=26

mpirun -np 2 /home/bai/simulations/bns_merger_3/SIMFACTORY/exe/cactus_sim -L 3 /home/bai/simulations/bns_merger_3/output-0000/nsnstohmns.par
--8<--

two MPI processes are started and the expected number of MPI ranks
(2) is recorded correctly in CACTUS_NUM_PROCS.

In the *.out file there will be lines like this (not all next to each
other):

INFO (Carpet): MPI is enabled
INFO (Carpet): Carpet is running on 6 processes
INFO (Carpet): This is process 0
INFO (Carpet): OpenMP is enabled
INFO (Carpet): This process contains 2 threads, this is thread 0
INFO (Carpet): There are 12 threads in total
INFO (Carpet): There are 2 threads per process
INFO (Carpet): This process runs on host ekohaes8, pid=22663
INFO (Carpet): This process runs on 12 cores: 0-5, 12-17
INFO (Carpet): Thread 0 runs on 12 cores: 0-5, 12-17
INFO (Carpet): Thread 1 runs on 12 cores: 0-5, 12-17

and you should check that these match what you expect.
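One quick way to check is to grep for the Carpet summary lines and compare the numbers against procs / num-threads. A sketch using a fabricated excerpt of a .out file (the real file sits in your simulation's output-0000 directory, and the numbers here are illustrative):

```shell
# Fabricated sample of the Carpet startup summary; replace sample.out
# with your simulation's actual .out file.
cat > sample.out <<'EOF'
INFO (Carpet): Carpet is running on 2 processes
INFO (Carpet): There are 52 threads in total
INFO (Carpet): There are 26 threads per process
EOF

# Extract just the process/thread summary for a quick sanity check:
grep -E 'running on|threads in total|threads per process' sample.out
```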

You may also want to make sure that there are no "leftover" Cactus
processes around (i.e. when no simulation is running, "top" does not
show any cactus_sim).
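For example (a sketch; pgrep is assumed to be available, and the process name matches the executable from your job script):

```shell
# List any lingering Cactus processes; prints nothing from pgrep (and
# pgrep exits nonzero) when the machine is clean.
pgrep -a cactus_sim || echo "no leftover cactus_sim processes"
```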

The very many level 1 errors and the duplicated lines in the ASCII
output file are almost certainly due to the simulation being started
twice, which in turn is probably due to mismatched MPI stacks, yes.

You can set:

IO::abort_on_io_errors = "yes"

which will make Cactus abort on errors from HDF5 instead of trying to
continue.

Yours,
Roland

> Hi Roland,
> I'm sorry, I made several typos in the command lines in my previous reply; they should be:
> "./simfactory/bin/sim create-submit bns_merger --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par -Roe --walltime 24:0:0"
> it returns: "sim.py: error: no such option: -R"
> and,
> Instead, I built the ET and ran the simulation via the commands:
> --8<--
> simfactory/bin/sim build  --procs 52 --num-threads 26 --thornlist thornlists/nsnstohmns.th 
> ./simfactory/bin/sim create-submit bns_merger_4 --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --walltime 24:0:0
> --8<--
> Yours sincerely:
> Jimmy
> 
> 
> ----- Original Message -----
> From: "白济民" <beki-cat at sjtu.edu.cn>
> To: "users" <users at einsteintoolkit.org>
> Cc: "1614603292" <1614603292 at qq.com>
> Sent: Tuesday, May 25, 2021 12:32:18 PM
> Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> 
> Hi Roland,
> Thanks for your patience. However, when I execute the command with "-Roe" added:
> "./simfactory/bin/sim create-submit bns_merger --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --Roe --walltime 24:0:0"
> it returns: "sim.py: error: no such option: -R"
> 
> Instead, I built the ET and ran the simulation via the commands:
> --8<--
> simfactory/bin/sim build  --procs 52 --num-threads 26 --thornlist thornlists/nsnstohmns.th 
> ./simfactory/bin/sim create-submit bns_merger_4 --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par -Roe --walltime 24:0:0
> --8<--
> 
> When I look at the file "mp_Psi4_l2_m2_r300.00" that I'm interested in (I have uploaded this file for clarity), it contains duplicated lines with the same records. I wonder whether this shows that the simulation was started twice; I guess this is the case of mismatched MPI ranks, and I would like to avoid it.
> I also notice a large number of level-1 errors in the err file (it is too large, so I extracted 1000 lines with grep to upload for clarity). I wonder why they occur; is this also a consequence of mismatched MPI ranks?
> Yours sincerely:
> Jimmy
> 
> 
> 
> ----- Original Message -----
> From: "Roland Haas" <rhaas at illinois.edu>
> To: "白济民" <beki-cat at sjtu.edu.cn>
> Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> Sent: Monday, May 24, 2021 10:29:24 PM
> Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> 
> Hello Jimmy,
> 
> ok, in case you are already giving simfactory options that should
> result in multiple MPI ranks (e.g. --procs 26 --num-threads 13), then
> you are most likely facing the issue that the MPI stack used to
> compile the code is not the same as the one used to run it. This
> should, however, have resulted in a different error (namely Carpet
> reporting that something is inconsistent between CACTUS_NUM_PROCS and
> the number of MPI ranks), which is why I suggested the issue might be
> the simfactory command line used. I explain how to check this at the
> end of the email.
> 
> Can you provide the exact (not simplified or otherwise modified)
> simfactory command line you used? Otherwise this is very hard to
> diagnose remotely.
> 
> Note that the ini files just provide defaults, and e.g. the one you
> provided will, since you set num-threads to 26, use a single MPI rank
> until you ask for more procs/cores than 26. I.e. this command:
> 
> ./simfactory/bin/sim submit --procs 26 --parfile ...
> 
> will use 1 MPI rank. Instead you must use a command line like the one I
> provided as an example before:
> 
> ./simfactory/bin/sim submit --procs 26 --num-threads 13 ...
> 
> that explicitly asks for procs and num-threads such that more than 1
> MPI rank is created.
> 
> Mismatched MPI stacks tend to manifest themselves in that, instead of
> N MPI ranks, Carpet reports just 1 MPI rank but the simulation is
> started N times.
> 
> To check whether this is the case you would add the "-Roe" option to
> the Cactus command line which causes it to write output from each MPI
> rank to a file CCTK_ProcN.out where N is the MPI rank.
> 
> You should run this, then check and provide the (complete, please
> do not abridge them) output files.
> 
> Carpet reports the total number of MPI ranks that it uses in there.
> 
> Yours,
> Roland
> 
> > Hi Roland,
> > Thanks for your advice; I understand that I need more than 1 MPI rank to run the simulation. I managed to change the related parameters in my mdb/machines ini file as follows:
> > --8<--
> > # Source tree management
> > sourcebasedir   = /home/bai/ET
> > optionlist      = generic.cfg
> > submitscript    = generic.sub
> > runscript       = generic.run
> > make            = make -j@MAKEJOBS@
> > basedir         = /home/bai/simulations
> > ppn             = 52
> > max-num-threads = 26
> > num-threads     = 26
> > nodes           = 1
> > submit          = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
> > getstatus       = ps @JOB_ID@
> > --8<--
> > so that I can use the "./simfactory/bin/sim setup-silent" command to run simfactory using the machine's default settings.
> > 
> > However, when I run the simulation, it aborts and the same level 0 warning occurs together with the following notice:
> > --8<--
> > WARNING level 0 from host dell-Precision-7920-Tower process 0
> >   while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
> >   in thorn RotatingSymmetry180, file /home/bai/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:492:  
> >   -> TAT/Slab can only be used if there is a single local component per MPI process    
> > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > Rank 0 with PID 74149 received signal 6
> > Writing backtrace to nsnstohmns/backtrace.0.txt
> > -----------------------------------------------------------------------------
> > It seems that [at least] one of the processes that was started with
> > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > more than one process did not invoke MPI_INIT -- mpirun was only
> > notified of the first one, which was on node n0).
> > 
> > mpirun can *only* be used with MPI programs (i.e., programs that
> > invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
> > to run non-MPI programs over the lambooted nodes.
> > -----------------------------------------------------------------------------
> > --8<--
> > For clarity, I have uploaded the machine.ini file.
> > Yours sincerely,
> > Jimmy
> > 
> > ----- Original Message -----
> > From: "Roland Haas" <rhaas at illinois.edu>
> > To: "白济民" <beki-cat at sjtu.edu.cn>
> > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > Sent: Friday, May 21, 2021 10:02:06 PM
> > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > 
> > Hello Jimmy,
> > 
> > the error is the level 0 warning at the end of the err file:
> > 
> > --8<--
> > WARNING level 0 from host dell-Precision-7920-Tower process 0
> >   while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
> >   in thorn RotatingSymmetry180, file /home/bai/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:492:  
> >   -> TAT/Slab can only be used if there is a single local component per MPI process    
> > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' 
> > --8<--
> > 
> > namely "TAT/Slab can only be used if there is a single local component
> > per MPI process". 
> > 
> > To avoid this you will have to use more than 1 MPI rank (the
> > technical description is a bit complicated).
> > 
> > When using simulation factory you must ensure that the values for
> > --procs / --cores (total number of threads created) and --num-threads
> > (number of threads per MPI rank) are such that there are at least 2 MPI
> > ranks.
> > 
> > Eg:
> > 
> > ./simfactory/bin/sim submit --cores 12 --num-threads 6 ...
> > 
> > or when using mpirun directly the equivalent would be:
> > 
> > export OMP_NUM_THREADS=6
> > mpirun -n 2 ...
> > 
> > Yours,
> > Roland
> >   
> > > Hello,
> > >     I met a problem when running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" from the ET gallery on my own work station, and I'm looking forward to your help.
> > >     It aborts unexpectedly after running for a few minutes. The end of the output error file reads as follows:
> > > 
> > >     cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > >     Rank 0 with PID 73447 received signal 6
> > >     Writing backtrace to nsnstohmns/backtrace.0.txt
> > >     Aborted (core dumped)
> > > 
> > >     I have also uploaded the entire error file for clarity.
> > > 
> > >     I built the ET using 64 processors by using the following command:
> > >     simfactory/bin/sim build -j64 --thornlist thornlists/nsnstohmns.th
> > >     
> > >     and I ran the simulation using 20 processors by using the following command:
> > >     ./simfactory/bin/sim create-submit bns_merger /home/bai/ET/Cactus/par/nsnstohmns.par 20 24:0:0
> > >     
> > > Yours sincerely:
> > > Jimmy
> > >             
> >   
> 
> 


-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .

