[Users] BNSM/TOV simulation error
Roland Haas
rhaas at illinois.edu
Wed Mar 22 09:06:01 CDT 2023
Hello Spandan Sarma,
I am afraid I am not aware of any method other than trial and error.
This is somewhat similar to the situation of scaling up a simulation
setup where one also starts with a small number of nodes then increases
the number of nodes until the speed is acceptable for the cost of the
nodes used.
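
To illustrate such a scaling study, here is a minimal Python sketch. It
is not part of the Einstein Toolkit; the function names, the 70%
efficiency threshold and the timings are made-up assumptions. The speed
can be whatever the run reports, for example simulation time per
wall-clock hour.

```python
def parallel_efficiency(measurements):
    """Return (nodes, efficiency) pairs relative to the smallest run.

    measurements: list of (nodes, speed) tuples sorted by node count.
    """
    base_nodes, base_speed = measurements[0]
    return [
        (nodes, (speed / base_speed) / (nodes / base_nodes))
        for nodes, speed in measurements
    ]


def largest_efficient_node_count(measurements, min_efficiency=0.7):
    """Largest node count reached before efficiency drops below the threshold."""
    chosen = measurements[0][0]
    for nodes, efficiency in parallel_efficiency(measurements):
        if efficiency < min_efficiency:
            break
        chosen = nodes
    return chosen


# Hypothetical timings (nodes, speed); replace with numbers from real test runs.
timings = [(1, 10.0), (2, 19.0), (4, 34.0), (8, 44.0)]
print(largest_efficient_node_count(timings))
```

In this made-up data set the efficiency falls to 0.55 at 8 nodes, so 4
nodes is the last count that meets the assumed threshold.
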
In principle, the code should never fail due to too many MPI ranks; it
should either just become slow (due to increased communication overhead)
or output an error message (e.g. when there are not enough points to have
even a single point on an MPI rank).
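
A back-of-the-envelope estimate shows why more ranks mean more
communication. The sketch below assumes a single cubic grid split into
equal cubic blocks, one per rank, with a hypothetical ghost-zone width
of 3 points. Carpet's actual domain decomposition is more involved, so
the numbers indicate a trend only.

```python
def block_estimate(n_per_dim, n_ranks, ghost_width=3):
    """Estimate block size and ghost-zone overhead per MPI rank.

    Assumes an n_per_dim**3 grid split into n_ranks equal cubic blocks.
    ghost_width is an assumed value, not read from any parameter file.
    Returns (points per block edge, ghost points / interior points).
    """
    total_points = n_per_dim ** 3
    if total_points < n_ranks:
        raise ValueError("fewer grid points than MPI ranks")
    side = (total_points / n_ranks) ** (1.0 / 3.0)
    ghost_ratio = ((side + 2 * ghost_width) ** 3 - side ** 3) / side ** 3
    return side, ghost_ratio


# Hypothetical 64^3 grid on 8 and on 64 ranks.
for ranks in (8, 64):
    side, ratio = block_estimate(64, ranks)
    print(f"{ranks} ranks: edge ~{side:.1f} points, ghost/interior ~{ratio:.2f}")
```

In this model the ghost-to-interior ratio for a 64^3 grid grows from
about 0.67 on 8 ranks to about 1.6 on 64 ranks, so each rank spends a
growing share of its effort on points it only exchanges with neighbours.
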
Yours,
Roland
> Dear Roland,
>
> Thank you so much for the help. I included your suggestions and tried
> running the TOV with 16 cores with increased resolution, and it worked
> successfully. I have submitted a BNSM simulation making similar relevant
> changes and am awaiting its result.
>
> Also, is there any way other than trial and error to calculate how many MPI
> ranks are too many for a simulation?
>
> Regards,
> Spandan Sarma
>
> On Mon, Mar 20, 2023 at 9:45 PM Roland Haas <rhaas at illinois.edu> wrote:
>
> > Hello Spandan Sarma,
> >
> > Not having looked very carefully yet, one thing that has turned out to be
> > an issue recently is that the gallery example (see
> > http://einsteintoolkit.org/gallery/bns/index.html ) is "small" and set
> > up to run (see the web page) for 24 hours using 12 cores. Running on many
> > more cores (MPI ranks really) can lead to these issues.
> >
> > So the first step would be to make sure that you run small enough (I
> > would try for no more than 24 or so MPI ranks, and more than 8
> > threads per MPI rank usually does not help) and verify that the example
> > works.
> >
> > Then, you can increase the resolution (i.e. decrease the dx, dy, dz
> > parameters in the parameter file *.par) to make sure that the NSs are
> > resolved well (resolution on the refinement level that contains them
> > better than, say, 200 m at least) and slowly scale up the number of cores
> > until you have acceptable run speed.
> >
> > Based on your log files there were 16 MPI ranks for the TOV example
> > (which last ran on 5 MPI ranks) and 144 MPI ranks for BNS (which was
> > last run on 12 MPI ranks). In particular the latter one is "too many"
> > and I suspect the error is due to that.
> >
> > Yours,
> > Roland
> >
> > > Hello,
> > >
> > > I was trying to run the BNSM simulation from the ET gallery on the
> > > institute cluster KANAD at IISER Bhopal in the short queue (max nodes: 16;
> > > walltime: 24 hrs) of our queuing system, but the following error came up:
> > >
> > > The grid structure is inconsistent. It is impossible to continue.
> > >
> > > WARNING level 0 from host n16 process 0
> > > in thorn CarpetLib, file /home2/shamims/ET_short1/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > > -> The grid structure is inconsistent. It is impossible to continue.
> > > cactus_sim: /home2/shamims/ET_short1/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > > Rank 0 with PID 4473 received signal 6
> > > Writing backtrace to nsnstohmns1/backtrace.0.txt
> > >
> > > WARNING level 0 from host n63 process 128
> > > in thorn CarpetLib, file /home2/shamims/ET_short1/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > > -> The grid structure is inconsistent. It is impossible to continue.
> > > cactus_sim: /home2/shamims/ET_short1/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > > Rank 128 with PID 1350 received signal 6
> > > Writing backtrace to nsnstohmns1/backtrace.128.txt
> > >
> > > WARNING level 0 from host n63 process 141
> > > in thorn CarpetLib, file /home2/shamims/ET_short1/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > > -> The grid structure is inconsistent. It is impossible to continue.
> > > cactus_sim: /home2/shamims/ET_short1/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > >
> > >
> > > After this issue, I tried performing the simulation using the same
> > > parameter file in the debug queue (max: 1 node), and it worked fine. But
> > > upon trying out the TOV simulation example in the debug queue, the same
> > > error came up:
> > >
> > >
> > > WARNING level 0 from host n85 process 0
> > > in thorn CarpetLib, file /home2/shamims/ET_debug/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > > -> The grid structure is inconsistent. It is impossible to continue.
> > >
> > > WARNING level 0 from host n85 process 0
> > > in thorn CarpetLib, file /home2/shamims/ET_debug/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > > -> The grid structure is inconsistent. It is impossible to continue.
> > > cactus_sim: /home2/shamims/ET_debug/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > >
> > >
> > > I am unable to understand what the issue is. I have attached parameter
> > > files, the runscript, and the output files for both the simulations (TOV
> > > and BNSM) for reference. Thanks in advance for the help.
> > >
> > > Regards,
> >
> >
> > --
> > My email is as private as my paper mail. I therefore support encrypting
> > and signing email messages. Get my PGP key from http://pgp.mit.edu .
> >
>
>
--
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .