[Users] Thorn setup taking too much time in cluster

Shamim Haque 1910511 shamims at iiserb.ac.in
Tue Jun 6 03:04:45 CDT 2023


Hello Roland,

Thanks for pointing me to the links. I'll check out the resources.

Regards
Shamim Haque
Senior Research Fellow (SRF)
Department of Physics
IISER Bhopal

On Fri, Jun 2, 2023 at 3:24 AM Roland Haas <rhaas at illinois.edu> wrote:

> Hello Shamim Haque,
>
> "the grid structure inconsistent. Impossible to continue" is a fairly
> generic error that Carpet outputs if it detects that the grid structure
> has become inconsistent and it is impossible to continue the
> simulation.
>
> This condition is detected by a low-level routine in Carpet that no
> longer has access to the higher-level information that was passed to
> Carpet and that is the actual cause of the inconsistent grid
> structure.
>
> The mailing list has a couple of similar reports that you could take a
> look at and that may point to a possible solution.
>
> For this sort of error, more details would be needed to do any useful
> diagnosis. At least the full log file (stdout and stderr) of the
> simulation up to the point when it aborted would be needed along with
> possibly some more log files that record the grid structure.
>
> Here are some possibly relevant email threads:
>
> https://lists.einsteintoolkit.org/pipermail/users/2023-March/008881.html
>
> https://lists.einsteintoolkit.org/pipermail/users/2021-February/007792.html
>
> maybe also:
>
> https://bitbucket.org/einsteintoolkit/tickets/issues/2599/nsnstohmns-cannot-be-reproduced-using
>
> https://bitbucket.org/einsteintoolkit/tickets/issues/2516/different-simulation-result-with-different
>
> Yours,
> Roland
>
> > Dear Steve,
> >
> > Thank you for your reply. I tried the same simulation with a finer grid,
> > and the simulation started working, albeit very slowly (apparently due
> > to slow inter-node communication), but it did run. I could see a few
> > iterations complete during the final couple of hours of the walltime.
> >
> > It turns out that a simulation with GRHydro, in such cases (where the
> > grid needs to be finer), ends with the error "the grid structure
> > inconsistent. Impossible to continue". On the other hand, a simulation
> > with IllinoisGRMHD stops abruptly during the thorn setup (somewhere
> > around the SpaceMask and AHFinderDirect setup).
> >
> > Later I tried to see if I could speed up the simulation, but it looks
> > like inter-node communication on this HPC system is very slow, which
> > may be an inherent problem since the system is very old.
> >
> > Regards
> > Shamim Haque
> > Senior Research Fellow (SRF)
> > Department of Physics
> > IISER Bhopal
> >
> > On Tue, May 23, 2023 at 10:08 PM Steven R. Brandt <sbrandt at cct.lsu.edu>
> > wrote:
> >
> > > Sorry that no one has replied to you in a while. Are you still
> > > experiencing this difficulty?
> > >
> > > --Steve
> > > On 4/4/2023 3:08 AM, Shamim Haque 1910511 wrote:
> > >
> > > Dear Steven,
> > >
> > > I assure you that this was the first time I submitted the simulation.
> > > I used "sim create-submit", which will not submit the job if a
> > > simulation with the same name has already been created.
> > >
> > > Secondly, the same message also appears in the output files from the
> > > debug queue (1 node, with GRHydro) and the high-memory queue (3 nodes,
> > > with IllinoisGRMHD), where the simulation ran successfully. I have
> > > attached the output files for reference.
> > >
> > > Regards
> > > Shamim Haque
> > > Senior Research Fellow (SRF)
> > > Department of Physics
> > > IISER Bhopal
> > >
> > > On Tue, Apr 4, 2023 at 12:35 AM Steven R. Brandt <sbrandt at cct.lsu.edu>
> > > wrote:
> > >
> > >> I see this error message in your output:
> > >>
> > >>   -> No HDF5 checkpoint files with basefilename 'checkpoint.chkpt'
> > >> and file extension '.h5' found in recovery directory
> > >> 'nsns_toy1.2_DDME2BPS_quark_1.2vs1.6M_40km_g25'
> > >>
> > >> I suspect you did a "sim submit" for a job, got a failure, and did a
> > >> second "sim submit" without purging. That immediately triggered the
> > >> error. Then, for some reason, MPI didn't shut down cleanly and the
> > >> processes hung doing nothing until they used up the walltime.
> > >>
> > >> --Steve
> > >> On 4/2/2023 5:16 AM, Shamim Haque 1910511 wrote:
> > >>
> > >> Hello,
> > >>
> > >> I am trying to run a binary neutron star merger (BNSM) simulation
> > >> using IllinoisGRMHD on the HPC cluster Kanad at IISER Bhopal. While I
> > >> have verified that the parfile runs fine on the debug queue (1 node)
> > >> and the high-memory queue (3 nodes), I am unable to run the simulation
> > >> on a queue with 9 nodes (144 cores).
> > >>
> > >> The output file suggests that the setup of the listed thorns does not
> > >> complete within 24 hours, which is the maximum walltime for this queue.
> > >>
> > >> Is there a way to sort out this issue? I have attached the parfile and
> > >> outfile for reference.
> > >>
> > >> Regards
> > >> Shamim Haque
> > >> Senior Research Fellow (SRF)
> > >> Department of Physics
> > >> IISER Bhopal
> > >>
> > >> _______________________________________________
> > >> Users mailing list
> > >> Users at einsteintoolkit.org
> > >> http://lists.einsteintoolkit.org/mailman/listinfo/users
> > >
>
> --
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://pgp.mit.edu .
>