[Users] Thorn setup taking too much time in cluster

Roland Haas rhaas at illinois.edu
Thu Jun 1 16:53:51 CDT 2023


Hello Shamim Haque,

"the grid structure inconsistent. Impossible to continue" is a fairly
generic error that Carpet outputs if it detects that the grid structure
has become inconsistent and it is impossible to continue the
simulation. 

This condition is detected by a low-level routine in Carpet that no
longer has access to the higher-level information that was passed to
Carpet and that is the actual cause of the inconsistent grid
structure.

The mailing list has a couple of similar reports that you could take a
look at and that may point to a possible solution.

For this sort of error, more details are needed to do any useful
diagnosis: at least the full log file (stdout and stderr) of the
simulation up to the point when it aborted, and possibly some additional
log files that record the grid structure.

Here are some possibly relevant email threads:

https://lists.einsteintoolkit.org/pipermail/users/2023-March/008881.html

https://lists.einsteintoolkit.org/pipermail/users/2021-February/007792.html

maybe also:

https://bitbucket.org/einsteintoolkit/tickets/issues/2599/nsnstohmns-cannot-be-reproduced-using

https://bitbucket.org/einsteintoolkit/tickets/issues/2516/different-simulation-result-with-different

Yours,
Roland

> Dear Steve,
> 
> Thank you for your reply. I tried the same simulation with a finer grid,
> and the simulation started working, even though it was very slow (it looks
> like this is due to slow inter-node communication), but it did work. I could
> see a few iterations during the final couple of hours of the walltime.
> 
> It turns out that, in such cases (where the grid needs to be finer), a
> simulation with GRHydro ends with an error saying, "*the grid structure
> inconsistent. Impossible to continue*". On the other hand, a simulation
> with IllinoisGRMHD stops abruptly during the thorn setup (somewhere around
> the SpaceMask and AHFinderDirect setup).
> 
> Later I tried to see if I could speed up the simulation, but it looks like
> the inter-node communication on the HPC is very slow, which may be an
> inherent problem with the machine since it is a very old one.
> 
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
> 
> On Tue, May 23, 2023 at 10:08 PM Steven R. Brandt <sbrandt at cct.lsu.edu>
> wrote:
> 
> > Sorry that no one has replied to you in a while. Are you still
> > experiencing this difficulty?
> >
> > --Steve
> > On 4/4/2023 3:08 AM, Shamim Haque 1910511 wrote:
> >
> > Dear Steven,
> >
> > I assure you that I submitted the simulation only once. I used
> > "sim create-submit" to submit the simulation, which would not have
> > submitted the job if one with the same name had been run earlier.
> >
> > Secondly, I found this same message appearing in the output files from the
> > debug queue (1 node, with GRHydro) and the high-memory nodes (3 nodes, with
> > IllinoisGRMHD), where the simulation ran successfully. I have attached the
> > output files for reference.
> >
> > Regards
> > Shamim Haque
> > Senior Research Fellow (SRF)
> > Department of Physics
> > IISER Bhopal
> >
> >
> > On Tue, Apr 4, 2023 at 12:35 AM Steven R. Brandt <sbrandt at cct.lsu.edu>
> > wrote:
> >  
> >> I see this error message in your output:
> >>  
> >>   -> No HDF5 checkpoint files with basefilename 'checkpoint.chkpt'
> >> and file extension '.h5' found in recovery directory
> >> 'nsns_toy1.2_DDME2BPS_quark_1.2vs1.6M_40km_g25'
> >>
> >> I suspect you did a "sim submit" for a job, got a failure, and did a
> >> second "sim submit" without purging. That immediately triggered the error.
> >> Then, for some reason, MPI didn't shut down cleanly and the processes hung
> >> doing nothing until they used up the walltime.
> >>
> >> --Steve
> >> On 4/2/2023 5:16 AM, Shamim Haque 1910511 wrote:
> >>
> >> Hello,
> >>
> >> I am trying to run a BNSM simulation using IllinoisGRMHD on the HPC Kanad
> >> at IISER Bhopal. While I have tested that the parfile runs fine on the
> >> debug queue (1 node) and the high-memory queue (3 nodes), I am unable to
> >> run the simulation in a queue with 9 nodes (144 cores).
> >>
> >> The output file suggests that the setup of the listed thorns does not
> >> complete within 24 hours, which is the maximum walltime for this queue.
> >>
> >> Is there a way to sort out this issue? I have attached the parfile and
> >> outfile for reference.
> >>
> >> Regards
> >> Shamim Haque
> >> Senior Research Fellow (SRF)
> >> Department of Physics
> >> IISER Bhopal
> >>
> >> _______________________________________________
> >> Users mailing list
> >> Users at einsteintoolkit.org
> >> http://lists.einsteintoolkit.org/mailman/listinfo/users
> >>  
> >  

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .