[Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
白济民
beki-cat at sjtu.edu.cn
Thu Jun 10 11:54:36 CDT 2021
Hi Roland,
Thanks for your help! I've run the sample code successfully. This is the start of my exploration of the ETK, and I'm looking forward to making progress with it.
Yours sincerely,
Jimmy
----- Original Message -----
From: "Roland Haas" <rhaas at illinois.edu>
To: "白济民" <beki-cat at sjtu.edu.cn>
Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
Sent: Tuesday, June 1, 2021 11:20:26 PM
Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
Hello Jimmy,
indeed Carpet reports a single MPI rank in the static-tov-np2.out file:
--8<--
INFO (SystemTopology): MPI process-to-host mapping:
This is MPI process 0 of 1
MPI hosts:
0: dell-Precision-7920-Tower
This MPI process runs on host 0 of 1
On this host, this is MPI process 0 of 1
--8<--
On a workstation my guess would be that either:
* there are multiple conflicting MPI stacks installed (eg OpenMPI and
MPICH/MVAPICH), which you can check using your package manager (eg
dpkg --list or rpm -qa)
* somehow Cactus failed to detect an MPI stack and built its own, which
then conflicts with an MPI stack that may be installed on the system
Looking at the ldd output, and given that MPI does not show up, my
guess is that it is the second bullet and eg you will find a directory
configs/sim/scratch/external/MPI with the self-compiled library.
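For example, a quick way to check both possibilities (a sketch assuming a
Debian/Ubuntu system and the default configuration name "sim"; adjust the
package-manager command and paths to your setup) is:
--8<--
# list system-wide MPI packages that might conflict
dpkg --list | grep -i -E 'openmpi|mpich|mvapich'
# see whether Cactus compiled its own MPI during the build
ls configs/sim/scratch/external/MPI/bin
--8<--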
To make this work you *either* have to make sure to use the mpirun tool
compiled as part of that MPI stack, which you will find in (both,
hopefully):
exe/sim/mpirun
and
configs/sim/scratch/external/MPI/bin/mpirun
and you have to put the *full* path to it into
configs/sim/RunScript (for your current build) and
repos/simfactory2/mdb/run/generic.run (for future builds).
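For illustration only (the placeholder names follow generic.run, and the
mpirun path assumes the self-compiled MPI stack mentioned above; your path
may differ), the edited line in configs/sim/RunScript could then read:
--8<--
/home/bai/ET/Cactus/configs/sim/scratch/external/MPI/bin/mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
--8<--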
Alternatively you can try and understand why Cactus did not find your
installed MPI stack. This can be caused by only installing the runtime
libraries (eg libopenmpi3 in Debian/Ubuntu) rather than the development
package (eg libopenmpi-dev). The simplest way to ensure that the
required packages are installed is to consult the top part of:
https://github.com/nds-org/jupyter-et/blob/master/CactusTutorial.ipynb
where we list the required packages for a number of OS and package
managers.
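For example, on Debian/Ubuntu (a sketch; package names for other
distributions are listed in the notebook above) the development package
would be installed with:
--8<--
sudo apt-get install libopenmpi-dev
--8<--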
Yours,
Roland
> Hi Roland,
> Thanks for your detailed explanation. What I used to build and run the gallery code is a single remote workstation. I tried the command:
> --8>--
> export OMP_NUM_THREADS=6
> mpirun -np 1 exe/cactus_sim par/static_tov.par
> --8>--
> I got the desired output. I attach the output file named tov-static-np1.out.
> However, when I try the command:
> --8>--
> export OMP_NUM_THREADS=6
> mpirun -np 2 exe/cactus_sim par/static_tov.par
> --8>--
> I found that the program just started twice, as can be seen in my attached output file named tov-static-np2.out.
> Namely, I expect the desired output from Carpet for the number of running processes to be:
> INFO (Carpet): Carpet is running on 2 processes
> rather than showing:
> INFO (Carpet): Carpet is running on 1 processes
> twice.
> The same thing happens when running the bns example from the gallery.
> When running:
> --8>--
> ./simfactory/bin/sim execute 'bash -li'
> ldd exe/cactus_sim
> --8>--
> its output is attached in the file ldd.out, and when running the command "which mpirun", the bash output is simply:
> /usr/bin/mpirun
> Yours sincerely,
> Jimmy
>
>
>
> ----- Original Message -----
> From: "Roland Haas" <rhaas at illinois.edu>
> To: "白济民" <beki-cat at sjtu.edu.cn>
> Cc: "users" <users at einsteintoolkit.org>
> Sent: Saturday, May 29, 2021 1:09:29 AM
> Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
>
> Hello Jimmy,
>
> right now I am only guessing that mismatching MPI stacks could be the
> issue.
>
> Without having seen the full out and err files this is pretty hard to
> diagnose (please try and attach them to your emails).
>
> As far as making sure it is the correct MPI stack, there is not that
> much I can suggest.
>
> Usually on a cluster you want to make sure that the same MPI modules
> are loaded during compilation and when you run, by adding them to the
> envsetup variable of simfactory (see
> https://docs.einsteintoolkit.org/et-docs/Configuring_a_new_machine ).
>
> A trick that I find useful is to compile the code, then use simfactory
> to get a (login, interactive) shell with the same modules loaded using:
>
> ./simfactory/bin/sim execute 'bash -li'
>
> In there one can then run "ldd exe/cactus_sim", which shows the location
> of the MPI library that Cactus linked against.
>
> Then check which mpirun executable is used (you seem to have used
> mpirun from generic.run, which may or may not work fine on a cluster),
> eg by running
>
> which mpirun
>
> that shows the full path of the mpirun command used.
>
> This path should "match" (ie be in the same directory structure as) the
> MPI libraries.
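> 
> As a purely hypothetical illustration (paths invented for the example), a
> consistent setup would look something like:
> 
> --8<--
> $ which mpirun
> /usr/bin/mpirun
> $ ldd exe/cactus_sim | grep libmpi
>         libmpi.so.40 => /usr/lib/x86_64-linux-gnu/libmpi.so.40 (0x...)
> --8<--
> 
> ie both the launcher and the library come from the same installation tree.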
>
> Note: are you trying this on a cluster or just a single workstation? If
> just a single workstation then you can also test things by getting a
> shell with the modules loaded as described above, then run:
>
> export OMP_NUM_THREADS=1
> mpirun -n 1 exe/cactus_sim par/tov-static.par
>
> which will (if I made no typos) start Cactus using mpirun with a
> single MPI rank (-n 1) and 1 OpenMP thread. If this also fails, then at
> least it gives you a simpler test case with fewer moving parts to
> diagnose.
>
> If you *are* on a cluster, then the best choice is to contact that
> cluster help desk who should be able to help you get things running
> (since they know their cluster).
>
> Yours,
> Roland
>
> > Hi Roland,
> > Thanks for your detailed explanation. I'm now wondering how I could address the issue of mismatching MPI stacks between compiling and running the simulation, and I'm looking forward to your help. I'm new to running MPI programs. I hope I can succeed in running the example this weekend and reproduce the desired result.
> > Yours sincerely,
> > Jimmy
> >
> >
> > ----- Original Message -----
> > From: "白济民" <beki-cat at sjtu.edu.cn>
> > To: "users" <users at einsteintoolkit.org>
> > Cc: "1614603292" <1614603292 at qq.com>
> > Sent: Thursday, May 27, 2021 9:19:10 PM
> > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> >
> > Hi Roland,
> > Thanks for your detailed explanation! I'm now wondering how I could address the issue of mismatching MPI stacks between compiling and running the simulation, and I'm looking forward to your help.
> > Yours sincerely,
> > Jimmy
> >
> > ----- Original Message -----
> > From: "Roland Haas" <rhaas at illinois.edu>
> > To: "白济民" <beki-cat at sjtu.edu.cn>
> > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > Sent: Tuesday, May 25, 2021 10:56:12 PM
> > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> >
> > Hello Jimmy,
> >
> > the --procs and --num-threads options are only used by the submit (and
> > create-submit and run and create-run) sub-commands. Using them with the
> > "build" command will not have any effect.
> >
> > "-Roe" is a raw Cactus option (see
> > https://urldefense.com/v3/__http://einsteintoolkit.org/usersguide/UsersGuide.html*x1-176000D__;Iw!!DZ3fjg!vobS_jxCUmVt6VY8Msy0SUDtmmoXapH--VbfcWYLTM5sQyUfLVeWs410ZiFRecpv$ though -R still needs to be document) it must be added to the "RunScript" file in configs/sim/RunScript just after the "@EXECUTABLE@" placeholder ie:
> >
> > mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
> >
> > becomes
> >
> > mpirun -np @NUM_PROCS@ @EXECUTABLE@ -Roe -L 3 @PARFILE@
> >
> > Simfactory documentation can be found here:
> >
> > http://simfactory.org/info/documentation/
> >
> > and
> >
> > https://docs.einsteintoolkit.org/et-docs/Simulation_Factory_Advanced_Tutorial
> >
> > though both are somewhat difficult to use.
> >
> > However, if you only ever used --procs and --num-threads with build, then
> > this is the reason for the code failing: you must use --procs
> > and --num-threads with the submit command.
> >
> > Looking at the start of the error file that you included:
> >
> > --8<--
> > export CACTUS_NUM_PROCS=2
> > export CACTUS_NUM_THREADS=26
> >
> > mpirun -np 2 /home/bai/simulations/bns_merger_3/SIMFACTORY/exe/cactus_sim -L 3 /home/bai/simulations/bns_merger_3/output-0000/nsnstohmns.par
> > --8<--
> >
> > there are 2 MPI executables started and the expected number of MPI ranks
> > (2) is recorded correctly in CACTUS_NUM_PROCS.
> >
> > In the *.out file there will be lines like this (not all next to each
> > other):
> >
> > INFO (Carpet): MPI is enabled
> > INFO (Carpet): Carpet is running on 6 processes
> > INFO (Carpet): This is process 0
> > INFO (Carpet): OpenMP is enabled
> > INFO (Carpet): This process contains 2 threads, this is thread 0
> > INFO (Carpet): There are 12 threads in total
> > INFO (Carpet): There are 2 threads per process
> > INFO (Carpet): This process runs on host ekohaes8, pid=22663
> > INFO (Carpet): This process runs on 12 cores: 0-5, 12-17
> > INFO (Carpet): Thread 0 runs on 12 cores: 0-5, 12-17
> > INFO (Carpet): Thread 1 runs on 12 cores: 0-5, 12-17
> >
> > and you should check that these match what you expect.
> >
> > You may also want to make sure that there are no "leftover" Cactus
> > processes around (ie when not running a simulation, "top" does not show
> > any cactus_sim).
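> > 
> > A quick way to check for such leftovers (a sketch assuming a standard
> > Linux userland) is:
> > 
> > --8<--
> > # list any surviving Cactus processes with their command lines
> > pgrep -af cactus_sim
> > --8<--
> > 
> > which should print nothing when no simulation is running.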
> >
> > The very many level-1 errors and the duplicate lines in the ASCII
> > output file are almost certainly due to the simulation being started
> > twice, which in turn is probably due to mismatching MPI stacks, yes.
> >
> > You can set:
> >
> > IO::abort_on_io_errors = "yes"
> >
> > which will make Cactus abort on errors from HDF5 instead of trying to
> > continue.
> >
> > Yours,
> > Roland
> >
> > > Hi Roland,
> > > I'm sorry, I made several typos in the command lines in my previous reply; they should be:
> > > "./simfactory/bin/sim create-submit bns_merger --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par -Roe --walltime 24:0:0"
> > > it returns: "sim.py: error: no such option: -R"
> > > and,
> > > Instead, I built the ET and ran the simulation via the commands:
> > > --8<--
> > > simfactory/bin/sim build --procs 52 --num-threads 26 --thornlist thornlists/nsnstohmns.th
> > > ./simfactory/bin/sim create-submit bns_merger_4 --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --walltime 24:0:0
> > > --8<--
> > > Yours sincerely:
> > > Jimmy
> > >
> > >
> > > ----- Original Message -----
> > > From: "白济民" <beki-cat at sjtu.edu.cn>
> > > To: "users" <users at einsteintoolkit.org>
> > > Cc: "1614603292" <1614603292 at qq.com>
> > > Sent: Tuesday, May 25, 2021 12:32:18 PM
> > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > >
> > > Hi Roland,
> > > Thanks for your patience. However, when I execute the command adding "-Roe" in Cactus:
> > > "./simfactory/bin/sim create-submit bns_merger --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par --Roe --walltime 24:0:0"
> > > it returns: "sim.py: error: no such option: -R"
> > >
> > > Instead, I built the ET and ran the simulation via the commands:
> > > --8<--
> > > simfactory/bin/sim build --procs 52 --num-threads 26 --thornlist thornlists/nsnstohmns.th
> > > ./simfactory/bin/sim create-submit bns_merger_4 --procs 52 --num-threads 26 --parfile /home/bai/ET/Cactus/par/nsnstohmns.par -Roe --walltime 24:0:0
> > > --8<--
> > >
> > > When I look at the file "mp_Psi4_l2_m2_r300.00" that I'm interested in (I upload this file for reference), it has duplicate lines with the same records, and I wonder
> > > whether this shows that the simulation was started 2 times; I guess this is the case of mismatching MPI ranks, and I'm looking forward to avoiding this.
> > > I also notice that the err file contains a large number of level-1 errors (it is too large, so I grep 1000 lines for uploading), and I wonder
> > > why they occur; is this also a consequence of mismatching MPI ranks?
> > > Yours sincerely:
> > > Jimmy
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Roland Haas" <rhaas at illinois.edu>
> > > To: "白济民" <beki-cat at sjtu.edu.cn>
> > > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > > Sent: Monday, May 24, 2021 10:29:24 PM
> > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > >
> > > Hello Jimmy,
> > >
> > > ok, in case you are already giving options to simfactory that should
> > > result in multiple MPI ranks (eg --procs 26 --num-threads 13) then you
> > > are most likely facing an issue that the MPI stack used to compile the
> > > code is not the same as the one used to run the code. This should
> > > however have resulted in a different error (namely Carpet reporting
> > > that something is inconsistent with a CACTUS_NUM_PROCS and the number
> > > of MPI ranks), which is why I suggested the issue might be the
> > > simfactory command line used. I explain how to check this at the end
> > > of the email.
> > >
> > > Can you provide the exact (not simplified or otherwise
> > > modified) simfactory command line you used? Otherwise this is very hard
> > > to diagnose remotely.
> > >
> > > Note that the ini files just provide defaults and eg the one you
> > > provided will, since you set num-threads to 26, use a single MPI rank
> > > until you ask for more procs/cores than 26. Ie this command:
> > >
> > > ./simfactory/bin/sim submit --procs 26 --parfiles ...
> > >
> > > will use 1 MPI rank. Instead you must use a command line like the one I
> > > provided as an example before:
> > >
> > > ./simfactory/bin/sim submit --procs 26 --num-threads 13 ...
> > >
> > > that explicitly asks for procs and num-threads such that more than 1
> > > MPI rank is created.
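> > > 
> > > In other words, the number of MPI ranks is procs divided by num-threads.
> > > A sketch using the numbers for your 52-core machine (for illustration
> > > only):
> > > 
> > > --8<--
> > > # 52 total threads / 26 OpenMP threads per rank = 2 MPI ranks
> > > ./simfactory/bin/sim submit --procs 52 --num-threads 26 ...
> > > --8<--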
> > >
> > > Having mismatched MPI stacks tends to manifest itself in that instead of
> > > N MPI ranks Carpet reports just 1 MPI rank but the simulation is
> > > started N times.
> > >
> > > To check whether this is the case you would add the "-Roe" option to
> > > the Cactus command line which causes it to write output from each MPI
> > > rank to a file CCTK_ProcN.out where N is the MPI rank.
> > >
> > > You should run this and check and provide the (complete, please
> > > do not abridge them) output files.
> > >
> > > Carpet reports the total number of MPI ranks that it uses in there.
> > >
> > > Yours,
> > > Roland
> > >
> > > > Hi Roland,
> > > > Thanks for your advice; I now know that I need more than 1 MPI rank to run the simulation. I managed to change the related parameters in my mdb/machines .ini file as follows:
> > > > --8<--
> > > > # Source tree management
> > > > sourcebasedir = /home/bai/ET
> > > > optionlist = generic.cfg
> > > > submitscript = generic.sub
> > > > runscript = generic.run
> > > > make = make -j@MAKEJOBS@
> > > > basedir = /home/bai/simulations
> > > > ppn = 52
> > > > max-num-threads = 26
> > > > num-threads = 26
> > > > nodes = 1
> > > > submit = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
> > > > getstatus = ps @JOB_ID@
> > > > --8<--
> > > > so that I can use the "./simfactory/bin/sim setup-silent" command to run simfactory using the machine's default settings.
> > > >
> > > > However, when I run the simulation, it aborts and the same level 0 warning occurs together with the following notice:
> > > > --9<--
> > > > WARNING level 0 from host dell-Precision-7920-Tower process 0
> > > > while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
> > > > in thorn RotatingSymmetry180, file /home/bai/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:492:
> > > > -> TAT/Slab can only be used if there is a single local component per MPI process
> > > > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > > > Rank 0 with PID 74149 received signal 6
> > > > Writing backtrace to nsnstohmns/backtrace.0.txt
> > > > -----------------------------------------------------------------------------
> > > > It seems that [at least] one of the processes that was started with
> > > > mpirun did not invoke MPI_INIT before quitting (it is possible that
> > > > more than one process did not invoke MPI_INIT -- mpirun was only
> > > > notified of the first one, which was on node n0).
> > > >
> > > > mpirun can *only* be used with MPI programs (i.e., programs that
> > > > invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> > > > to run non-MPI programs over the lambooted nodes.
> > > > -----------------------------------------------------------------------------
> > > > --9<--
> > > > For reference, I upload the machine.ini file.
> > > > Yours sincerely,
> > > > Jimmy
> > > >
> > > > ----- Original Message -----
> > > > From: "Roland Haas" <rhaas at illinois.edu>
> > > > To: "白济民" <beki-cat at sjtu.edu.cn>
> > > > Cc: "users" <users at einsteintoolkit.org>, "1614603292" <1614603292 at qq.com>
> > > > Sent: Friday, May 21, 2021 10:02:06 PM
> > > > Subject: Re: [Users] a problem met in running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" on Jimmy's own work station
> > > >
> > > > Hello Jimmy,
> > > >
> > > > the error is the level 0 warning at the end of the err file:
> > > >
> > > > --8<--
> > > > WARNING level 0 from host dell-Precision-7920-Tower process 0
> > > > while executing schedule bin BoundaryConditions, routine RotatingSymmetry180::Rot180_ApplyBC
> > > > in thorn RotatingSymmetry180, file /home/bai/ET/Cactus/configs/sim/build/RotatingSymmetry180/rotatingsymmetry180.c:492:
> > > > -> TAT/Slab can only be used if there is a single local component per MPI process
> > > > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0'
> > > > --8<--
> > > >
> > > > namely "TAT/Slab can only be used if there is a single local component
> > > > per MPI process".
> > > >
> > > > To avoid this you will have to use more than 1 MPI rank (the technical
> > > > description is a bit complicated).
> > > >
> > > > When using simulation factory you must ensure that the values for
> > > > --procs / --cores (total number of threads created) and --num-threads
> > > > (number of threads per MPI rank) are such that there are at least 2 MPI
> > > > ranks.
> > > >
> > > > Eg:
> > > >
> > > > ./simfactory/bin/sim submit --cores 12 --num-threads 6 ...
> > > >
> > > > or when using mpirun directly the equivalent would be:
> > > >
> > > > export OMP_NUM_THREADS=6
> > > > mpirun -n 2 ...
> > > >
> > > > Yours,
> > > > Roland
> > > >
> > > > > Hello,
> > > > > I met a problem when running the sample code "Binary, inspiraling neutron stars forming a hypermassive neutron star" from the ET gallery on my own workstation, and I'm looking forward to your help.
> > > > > It aborts unexpectedly after running for a few minutes. The end of the output error file reads as follows:
> > > > >
> > > > > cactus_sim: /home/bai/ET/Cactus/configs/sim/build/Carpet/helpers.cc:275: int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > > > > Rank 0 with PID 73447 received signal 6
> > > > > Writing backtrace to nsnstohmns/backtrace.0.txt
> > > > > Aborted (core dumped)
> > > > >
> > > > > I also uploaded the entire error file for reference.
> > > > >
> > > > > I built the ET using 64 processors by using the following command:
> > > > > simfactory/bin/sim build -j64 --thornlist thornlists/nsnstohmns.th
> > > > >
> > > > > and I ran the simulation using 20 processors by using the following command:
> > > > > ./simfactory/bin/sim create-submit bns_merger /home/bai/ET/Cactus/par/nsnstohmns.par 20 24:0:0
> > > > >
> > > > > Yours sincerely:
> > > > > Jimmy
> > > > >
> > > >
> > >
> > >
> >
> >
>
>
--
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .