[Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station

Roland Haas rhaas at illinois.edu
Thu Jun 10 12:11:22 CDT 2021


Hello Jimmy,

glad to have been able to help.

Yours,
Roland

> Hi Roland,
> Thanks for your help! I've run the sample code successfully. This is my first step in exploring the ETK and I'm looking forward to making progress with it.
> Yours sincerely,
> Jimmy
> 
> ----- Original Message -----
> From: "Roland Haas" <rhaas at illinois.edu>
> To: "白济民" <beki-cat at sjtu.edu.cn>
> Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> Sent: Tuesday, June 1, 2021, 11:20:26 PM
> Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> 
> Hello Jimmy,
> 
> indeed Carpet reports a single MPI rank in the static-tov-np2.out file:
> 
> --8<--
> INFO (SystemTopology): MPI process-to-host mapping:
> This is MPI process 0 of 1
> MPI hosts:
>   0: dell-Precision-7920-Tower
> This MPI process runs on host 0 of 1
> On this host, this is MPI process 0 of 1
> --8<--
> 
> On a workstation my guess would be that either:
> 
> * there are multiple conflicting MPI stacks installed (eg OpenMPI and
>   MPICH/MVAPICH), which you can check using your package manager (eg
>   dpkg --list or rpm -qa; see the sketch below this list)
> 
> * somehow Cactus failed to detect an MPI stack and built its own, which
>   then conflicts with a separately installed MPI stack
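> 
> For the first possibility, a quick way to list installed MPI packages (a sketch only; it assumes a Debian/Ubuntu or an RPM-based system, adjust to your package manager):
> 
> --8<--
> # Debian/Ubuntu
> dpkg --list | grep -i -E 'openmpi|mpich|mvapich'
> # RPM-based systems (Fedora, CentOS, ...)
> rpm -qa | grep -i -E 'openmpi|mpich|mvapich'
> --8<--
> 
> If more than one MPI implementation shows up, keeping only one of them usually avoids the conflict.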
> 
> Looking at the ldd output, and given that MPI does not show up, my
> guess is that it is the second bullet and eg you will find a directory
> configs/sim/scratch/external/MPI with the self-compiled library.
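> 
> A quick way to check whether that directory exists:
> 
> --8<--
> ls configs/sim/scratch/external/MPI
> --8<--
> 
> If it is there (with, eg, a bin/mpirun inside), Cactus built its own MPI.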
> 
> To make this work you *either* have to make sure you use the mpirun tool
> compiled as part of that MPI stack, which you will (hopefully) find in
> both:
> 
> exe/sim/mpirun
> 
> and
> 
> configs/sim/scratch/external/MPI/bin/mpirun
> 
> and you have to put the *full* path to it into 
> 
> configs/sim/RunScript (for your current build) and
> repos/simfactory2/mdb/run/generic.run (for future builds).
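> 
> For example, the mpirun line in configs/sim/RunScript would then look something like this (a sketch only; /home/jimmy/Cactus is an assumed location of your Cactus tree, and the remaining options should stay as they already are in your RunScript):
> 
> --8<--
> /home/jimmy/Cactus/exe/sim/mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
> --8<--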
> 
> Alternatively you can try and understand why Cactus did not find your
> installed MPI stack. This can be caused by only installing the runtime
> libraries (eg libopenmpi3 in Debian/Ubuntu) rather than the development
> package (eg libopenmpi-dev). The simplest way to ensure that the
> required packages are installed is to consult the top part of:
> 
> https://github.com/nds-org/jupyter-et/blob/master/CactusTutorial.ipynb
> 
> where we list the required packages for a number of OS and package
> managers.
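> 
> On Debian/Ubuntu, for example, the development package mentioned above would be installed with something like (a sketch; the notebook lists the complete package set per OS):
> 
> --8<--
> sudo apt-get install libopenmpi-dev
> --8<--
> 
> After installing it you would reconfigure and rebuild Cactus so that it picks up the system MPI instead of its self-compiled one.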
> 
> Yours,
> Roland
> 
> > Hi Roland,
> > Thanks for your detailed explanation. What I used to build and run the gallery code is a single remote workstation. I tried the command:
> > --8>--    
> > export OMP_NUM_THREADS=6
> > mpirun -np 1 exe/cactus_sim par/static_tov.par  
> > --8>--    
> > I got the desired output. I attach the output file named tov-static-np1.out.
> > However, when I try the command:  
> > --8>--    
> > export OMP_NUM_THREADS=6
> > mpirun -np 2 exe/cactus_sim par/static_tov.par  
> > --8>--    
> > I found that the program just started twice, as can be seen in my attached output file named tov-static-np2.out.
> > Namely, I expect the desired output from Carpet reporting the number of processes to be:
> > INFO (Carpet): Carpet is running on 2 processes
> > rather than showing:
> > INFO (Carpet): Carpet is running on 1 processes 
> > twice.
> > The same thing happens when running the bns example from the gallery.
> > When running:  
> > --8>--    
> > ./simfactory/bin/sim execute 'bash -li'
> > ldd exe/cactus_sim  
> > --8>--    
> > its output is attached in the file ldd.out, and when I run the command "which mpirun", the bash output is simply:
> > /usr/bin/mpirun
> > Yours sincerely,
> > Jimmy
> > 
> > 
> > 
> > ----- Original Message -----
> > From: "Roland Haas" <rhaas at illinois.edu>
> > To: "白济民" <beki-cat at sjtu.edu.cn>
> > Cc: "users" <users at einsteintoolkit.org>
> > Sent: Saturday, May 29, 2021, 1:09:29 AM
> > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > 
> > Hello Jimmy,
> > 
> > right now I am only guessing that mismatching MPI stacks could be the
> > issue.
> > 
> > Without having seen the full out and err files this is pretty hard to
> > diagnose (please try and attach them to your emails).
> > 
> > As far as making sure it is the correct MPI stack, there is not that
> > much I can suggest.
> > 
> > Usually on a cluster you want to make sure that the same MPI modules
> > are loaded during compilation and when you run, by adding them to the
> > envsetup variable of simfactory (see
> > https://docs.einsteintoolkit.org/et-docs/Configuring_a_new_machine ).
> > 
> > A trick that I find useful is to compile the code, then use simfactory
> > to get a (login, interactive) shell with the same modules loaded using:
> > 
> > ./simfactory/bin/sim execute 'bash -li'
> > 
> > In there one can then run "ldd exe/cactus_sim", which shows the location
> > of the MPI library that Cactus linked against.
> > 
> > Then check which mpirun executable is used (you seem to have used the
> > mpirun from generic.run, which may or may not work fine on a cluster),
> > eg by running
> > 
> > which mpirun
> > 
> > that shows the full path of the mpirun command used.
> > 
> > This path should "match" the MPI libraries (ie be in the same directory
> > structure).
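> > 
> > As a purely hypothetical example of what "matching" output could look like (paths made up for illustration):
> > 
> > --8<--
> > $ ldd exe/cactus_sim | grep libmpi
> >         libmpi.so.40 => /usr/lib/x86_64-linux-gnu/libmpi.so.40 (0x...)
> > $ which mpirun
> > /usr/bin/mpirun
> > --8<--
> > 
> > Here both the library and the launcher come from the same system-wide OpenMPI installation under /usr, so they belong to the same stack.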
> > 
> > Note: are you trying this on a cluster or just on a single workstation? If
> > just a single workstation then you can also test things by getting a
> > shell with the modules loaded as described above, then run:
> > 
> > export OMP_NUM_THREADS=1
> > mpirun -n 1 exe/cactus_sim par/static_tov.par
> > 
> > which will (if I made no typos) start Cactus using mpirun with a
> > single MPI rank (-n 1) and 1 OpenMP thread. If this also fails then at
> > least it gives you a simpler test case with fewer moving parts to
> > diagnose.
> > 
> > If you *are* on a cluster, then the best choice is to contact that
> > cluster help desk who should be able to help you get things running
> > (since they know their cluster).
> > 
> > Yours,
> > Roland
> >   
> > > Hi Roland,
> > > Thanks for your detailed explanation. I'm now wondering how I could address the issue of mismatching MPI stacks between compiling and running the simulation, and I'm looking forward to your help. I'm new to running MPI programs. I hope I can succeed in running the example this weekend and reproduce the desired result.
> > > Yours sincerely, 
> > > Jimmy
> > > 
> > > 
> > > ----- Original Message -----
> > > From: "白济民" <beki-cat at sjtu.edu.cn>
> > > To: "users" <users at einsteintoolkit.org>
> > > Cc: "1614603292" <1614603292 at qq.com>
> > > Sent: Thursday, May 27, 2021, 9:19:10 PM
> > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > > 
> > > Hi Roland,
> > > Thanks for your detailed explanation! I'm now wondering how I could address the issue of mismatching MPI stacks between compiling and running the simulation, and I'm looking forward to your help.
> > > Yours sincerely,
> > > Jimmy
> > > 
> > > ----- Original Message -----
> > > From: "Roland Haas" <rhaas at illinois.edu>
> > > To: "白济民" <beki-cat at sjtu.edu.cn>
> > > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > > Sent: Tuesday, May 25, 2021, 10:56:12 PM
> > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > > 
> > > Hello Jimmy,
> > > 
> > > the --procs and --num-threads options are only used by the submit (and
> > > create-submit and run and create-run) sub-commands. Using them with the
> > > "build" command will not have any effect.
> > > 
> > > "-Roe" is a raw Cactus option (see
> > > https://urldefense.com/v3/__http://einsteintoolkit.org/usersguide/UsersGuide.html*x1-176000D__;Iw!!DZ3fjg!vobS_jxCUmVt6VY8Msy0SUDtmmoXapH--VbfcWYLTM5sQyUfLVeWs410ZiFRecpv$  though -R still needs to be document) it must be added to the "RunScript" file in configs/sim/RunScript just after the "@EXECUTABLE@" placeholder ie:
> > > 
> > > mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
> > > 
> > > becomes
> > > 
> > > mpirun -np @NUM_PROCS@ @EXECUTABLE@ -Roe -L 3 @PARFILE@
> > > 
> > > Simfactory documentation can be found here:
> > > 
> > > http://simfactory.org/info/documentation/
> > > 
> > > and
> > > 
> > > https://docs.einsteintoolkit.org/et-docs/Simulation_Factory_Advanced_Tutorial
> > > 
> > > though both are somewhat difficult to use.
> > > 
> > > However, if you only ever used --procs and --num-threads with build, then
> > > this is the reason the code fails: you must use --procs and
> > > --num-threads with the submit command.
> > > 
> > > Looking at the start of the error file that you included:
> > > 
> > > --8<--
> > > export CACTUS_NUM_PROCS=2
> > > export CACTUS_NUM_THREADS=26
> > > 
> > > mpirun -np 2 /home/bai/simulations/bns_merger_3/SIMFACTORY/exe/cactus_sim -L 3 /home/bai/simulations/bns_merger_3/output-0000/nsnstohmns.par
> > > --8<--
> > > 
> > > there are 2 MPI executables started and the expected number of MPI ranks
> > > (2) is recorded correctly in CACTUS_NUM_PROCS.
> > > 
> > > In the *.out file there will be lines like this (not all next to each
> > > other):
> > > 
> > > INFO (Carpet): MPI is enabled
> > > INFO (Carpet): Carpet is running on 6 processes
> > > INFO (Carpet): This is process 0
> > > INFO (Carpet): OpenMP is enabled
> > > INFO (Carpet): This process contains 2 threads, this is thread 0
> > > INFO (Carpet): There are 12 threads in total
> > > INFO (Carpet): There are 2 threads per process
> > > INFO (Carpet): This process runs on host ekohaes8, pid=22663
> > > INFO (Carpet): This process runs on 12 cores: 0-5, 12-17
> > > INFO (Carpet): Thread 0 runs on 12 cores: 0-5, 12-17
> > > INFO (Carpet): Thread 1 runs on 12 cores: 0-5, 12-17
> > > 
> > > and you should check that these match what you expect.
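> > > 
> > > A quick way to pull these lines out of the out file (a sketch; adjust the simulation name and path to yours):
> > > 
> > > --8<--
> > > grep 'INFO (Carpet)' ~/simulations/bns_merger_4/output-0000/*.out
> > > --8<--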
> > > 
> > > You may also want to make sure that there are no "leftover" Cactus
> > > processes around (ie when not running a simulation, "top" does not show
> > > any cactus_sim).
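> > > 
> > > A one-line check for that (assuming pgrep is available):
> > > 
> > > --8<--
> > > pgrep -af cactus_sim   # should print nothing while no simulation is running
> > > --8<--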
> > > 
> > > The very many level 1 errors and the duplicate lines in the ASCII
> > > output file are almost certainly due to the simulation being started
> > > twice, which in turn is probably due to mismatching MPI stacks, yes.
> > > 
> > > You can set:
> > > 
> > > IO::abort_on_io_errors = "yes"
> > > 
> > > which will make Cactus abort on errors from HDF5 instead of trying to
> > > continue.
> > > 
> > > Yours,
> > > Roland
> > >     
> > > > Hi Roland,
> > > > I'm sorry, I made several typos in the command lines in my previous reply; they should be:
> > > > "./simfactory/bin/sim create-submit bns_merger --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par -Roe --walltime 24:0:0"
> > > > it returns: "sim.py: error: no such option: -R"
> > > > and,
> > > > Instead, I built the ET and ran the simulation via the commands:
> > > > --8<--
> > > > simfactory/bin/sim build  --procs 52 --num-threads 26 --thornlist thornlists/nsnstohmns.th 
> > > > ./simfactory/bin/sim create-submit bns_merger_4 --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --walltime 24:0:0
> > > > --8<--
> > > > Yours sincerely:
> > > > Jimmy
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > From: "白济民" <beki-cat at sjtu.edu.cn>
> > > > To: "users" <users at einsteintoolkit.org>
> > > > Cc: "1614603292" <1614603292 at qq.com>
> > > > Sent: Tuesday, May 25, 2021, 12:32:18 PM
> > > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > > > 
> > > > Hi Roland,
> > > > Thanks for your patience. However, when I execute the command adding the "-Roe" Cactus option:
> > > > "./simfactory/bin/sim create-submit bns_merger --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --Roe --walltime 24:0:0"
> > > > it returns: "sim.py: error: no such option: -R"
> > > > 
> > > > Instead, I built the ET and ran the simulation via the commands:
> > > > --8<--
> > > > simfactory/bin/sim build  --procs 52 --num-threads 26 --thornlist thornlists/nsnstohmns.th 
> > > > ./simfactory/bin/sim create-submit bns_merger_4 --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par -Roe --walltime 24:0:0
> > > > --8<--
> > > > 
> > > > When I look at the file "mp_Psi4_l2_m2_r300.00" that I'm interested in (I upload this file for reference), it has duplicate lines with the same records. I wonder whether this shows that the simulation was started 2 times; I guess this is the case of mismatching MPI ranks, which I'm trying to avoid.
> > > > I also notice a large number of level-1 errors in the err file (it is too large, so I extracted 1000 lines to upload for reference), and I wonder why they occur. Is this also a consequence of mismatching MPI ranks?
> > > > Yours sincerely:
> > > > Jimmy
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > From: "Roland Haas" <rhaas at illinois.edu>
> > > > To: "白济民" <beki-cat at sjtu.edu.cn>
> > > > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > > > Sent: Monday, May 24, 2021, 10:29:24 PM
> > > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > > > 
> > > > Hello Jimmy,
> > > > 
> > > > ok, in case you are already giving options to simfactory that should
> > > > result in multiple MPI ranks (eg --procs 26 --num-threads 13) then you
> > > > are most likely facing the issue that the MPI stack used to compile the
> > > > code is not the same as the one used to run the code. This should
> > > > however have resulted in a different error (namely Carpet reporting
> > > > an inconsistency between CACTUS_NUM_PROCS and the number
> > > > of MPI ranks), which is why I suggested the issue might be the
> > > > simfactory command line used. I explain how to check this at the end
> > > > of the email.
> > > > 
> > > > Can you provide the exact (not simplified or otherwise
> > > > modified) simfactory command line you used? Otherwise this is very hard
> > > > to diagnose remotely.
> > > > 
> > > > Note that the ini files just provide defaults, and eg the one you
> > > > provided will, since you set num-threads to 26, use a single MPI rank
> > > > until you ask for more than 26 procs/cores. Ie this command:
> > > > 
> > > > ./simfactory/bin/sim submit --procs 26 --parfiles ...
> > > > 
> > > > will use 1 MPI rank. Instead you must use a command line like the one I
> > > > provided as an example before:
> > > > 
> > > > ./simfactory/bin/sim submit --procs 26 --num-threads 13 ...
> > > > 
> > > > that explicitly asks for procs and num-threads such that more than 1
> > > > MPI rank is created (here 26 procs / 13 threads per rank = 2 MPI ranks).
> > > > 
> > > > Mismatched MPI stacks tend to manifest themselves in that, instead of
> > > > N MPI ranks, Carpet reports just 1 MPI rank, but the simulation is
> > > > started N times.
> > > > 
> > > > To check whether this is the case you would add the "-Roe" option to
> > > > the Cactus command line which causes it to write output from each MPI
> > > > rank to a file CCTK_ProcN.out where N is the MPI rank.
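> > > > 
> > > > Once such a run has started, a quick check would be (a sketch; run it in the simulation's output directory):
> > > > 
> > > > --8<--
> > > > grep 'Carpet is running' CCTK_Proc*.out
> > > > --8<--
> > > > 
> > > > With a working 2-rank setup each file should report "Carpet is running on 2 processes"; if every file reports 1 process, the MPI stacks are mismatched.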
> > > > 
> > > > You should run this, check, and provide the (complete, please
> > > > do not abridge them) output files.
> > > > 
> > > > Carpet reports the total number of MPI ranks that it uses in there.
> > > > 
> > > > Yours,
> > > > Roland
> > > >       
> > > > > Hi Roland,
> > > > > Thanks for your advice; I know now that I need more than 1 MPI rank to run the simulation. I managed to change the related parameters in my mdb/machines .ini file as follows:
> > > > > --8<--
> > > > > # Source tree management
> > > > > sourcebasedir   = /home/bai/ET
> > > > > optionlist      = generic.cfg
> > > > > submitscript    = generic.sub
> > > > > runscript       = generic.run
> > > > > make            = make -j@MAKEJOBS@
> > > > > basedir         = /home/bai/simulations
> > > > > ppn             = 52
> > > > > max-num-threads = 26
> > > > > num-threads     = 26
> > > > > nodes           = 1
> > > > > submit          = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
> > > > > getstatus       = ps @JOB_ID@
> > > > > --8<--
> > > > > so that I can use the "./simfactory/bin/sim setup-silent" command to run simfactory using the machine's default settings.
> > > > > 
> > > > > However, when I run the simulation, it aborts and the same level 0 warning occurs together with the following notice:
> > > > > --9<--
> > > > > WARNING level 0 from host dell-Precision-7920-Tower process 0
> > > > >   while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
> > > > >   in thorn RotatingSymmetry180, file /home/bai/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:492:        
> > > > >   -> TAT/Slab can only be used if there is a single local component per MPI process          
> > > > > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > > > > Rank 0 with PID 74149 received signal 6
> > > > > Writing backtrace to nsnstohmns/backtrace.0.txt
> > > > > -----------------------------------------------------------------------------
> > > > > It seems that [at least] one of the processes that was started with
> > > > > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > > > > more than one process did not invoke MPI_INIT -- mpirun was only
> > > > > notified of the first one, which was on node n0).
> > > > > 
> > > > > mpirun can *only* be used with MPI programs (i.e., programs that
> > > > > invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
> > > > > to run non-MPI programs over the lambooted nodes.
> > > > > -----------------------------------------------------------------------------
> > > > > --9<--
> > > > > For reference, I upload the machine.ini file.
> > > > > Yours sincerely,
> > > > > Jimmy
> > > > > 
> > > > > ----- Original Message -----
> > > > > From: "Roland Haas" <rhaas at illinois.edu>
> > > > > To: "白济民" <beki-cat at sjtu.edu.cn>
> > > > > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > > > > Sent: Friday, May 21, 2021, 10:02:06 PM
> > > > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > > > > 
> > > > > Hello Jimmy,
> > > > > 
> > > > > the error is the level 0 warning at the end of the err file:
> > > > > 
> > > > > --8<--
> > > > > WARNING level 0 from host dell-Precision-7920-Tower process 0
> > > > >   while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
> > > > >   in thorn RotatingSymmetry180, file /home/bai/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:492:        
> > > > >   -> TAT/Slab can only be used if there is a single local component per MPI process          
> > > > > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' 
> > > > > --8<--
> > > > > 
> > > > > namely "TAT/Slab can only be used if there is a single local component
> > > > > per MPI process". 
> > > > > 
> > > > > To avoid this you will have to use more than 1 MPI rank (the technical
> > > > > description is a bit complicated).
> > > > > 
> > > > > When using simulation factory you must ensure that the values for
> > > > > --procs / --cores (total number of threads created) and --num-threads
> > > > > (number of threads per MPI rank) are such that there are at least 2 MPI
> > > > > ranks.
> > > > > 
> > > > > Eg:
> > > > > 
> > > > > ./simfactory/bin/sim submit --cores 12 --num-threads 6 ...
> > > > > 
> > > > > or when using mpirun directly the equivalent would be:
> > > > > 
> > > > > export OMP_NUM_THREADS=6
> > > > > mpirun -n 2 ...
> > > > > 
> > > > > Yours,
> > > > > Roland
> > > > >         
> > > > > > Hello,
> > > > > >     I met a problem when running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" from the ET gallery on my own workstation, and I'm looking forward to your help.
> > > > > >     It aborts unexpectedly after running for a few minutes. The end of the output error file reads as follows:
> > > > > > 
> > > > > >     cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > > > > >     Rank 0 with PID 73447 received signal 6
> > > > > >     Writing backtrace to nsnstohmns/backtrace.0.txt
> > > > > >     Aborted (core dumped)
> > > > > > 
> > > > > >     I also uploaded the entire error file for reference.
> > > > > > 
> > > > > >     I built the ET using 64 processors by using the following command:
> > > > > >     simfactory/bin/sim build -j64 --thornlist thornlists/nsnstohmns.th
> > > > > >     
> > > > > >     and I ran the simulation using 20 processors by using the following command:
> > > > > >     ./simfactory/bin/sim create-submit bns_merger /home/bai/ET/Cactus/par/nsnstohmns.par 20 24:0:0
> > > > > >     
> > > > > > Yours sincerely:
> > > > > > Jimmy
> > > > > >                   
> > > > >         
> > > > 
> > > >       
> > > 
> > >     
> > 
> >   
> 
> 


-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .

