[Users] Thorn setup taking too much time in cluster

Shamim Haque 1910511 shamims at iiserb.ac.in
Tue May 23 13:43:14 CDT 2023


Dear Steve,

Thank you for your reply. I tried the same simulation with a finer grid,
and it ran, although very slowly (apparently because of slow inter-node
communication). I could see a few iterations completing in the final couple
of hours of the walltime.

It turns out that in such cases (where the grid needs to be finer), a
simulation with GRHydro ends with an error saying, "the grid structure
inconsistent. Impossible to continue". A simulation with IllinoisGRMHD, on
the other hand, stops abruptly during the thorn setup (somewhere around
the SpaceMask and AHFinderDirect setup).
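
As a rough illustration of why a coarse grid can be a problem on many
cores, the sketch below estimates how many grid points per side each
process would own. It assumes, purely for illustration, a cubic grid split
evenly into near-cubic blocks, and it assumes that a block thinner than the
ghost-zone width is a warning sign. Neither assumption is confirmed as the
cause of the errors above, and the function names and the example numbers
(other than the 144 cores) are hypothetical.

```python
def points_per_process_side(n_side: int, n_procs: int) -> float:
    """Approximate grid points per side owned by one process.

    Assumes (hypothetically) a cubic grid of n_side**3 points divided
    evenly into near-cubic blocks, one block per process.
    """
    if n_side <= 0 or n_procs <= 0:
        raise ValueError("n_side and n_procs must be positive")
    return n_side / n_procs ** (1.0 / 3.0)


def decomposition_looks_too_thin(n_side: int, n_procs: int,
                                 ghost_width: int) -> bool:
    """Heuristic only: flag blocks narrower than the ghost-zone width."""
    return points_per_process_side(n_side, n_procs) < ghost_width


if __name__ == "__main__":
    # 144 cores is from the thread; grid sizes and ghost width are made up.
    for n_side in (12, 60, 120):
        per_side = points_per_process_side(n_side, 144)
        flag = decomposition_looks_too_thin(n_side, 144, ghost_width=3)
        print(f"{n_side}^3 grid on 144 processes: "
              f"about {per_side:.1f} points per side, too thin: {flag}")
```

If the estimate comes out close to or below the ghost-zone width, either a
finer grid or fewer processes would give each process more points to own.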

Later I tried to speed up the simulation, but the inter-node communication
on this HPC appears to be very slow, which may be an inherent limitation
since the cluster is very old.

Regards
Shamim Haque
Senior Research Fellow (SRF)
Department of Physics
IISER Bhopal

On Tue, May 23, 2023 at 10:08 PM Steven R. Brandt <sbrandt at cct.lsu.edu>
wrote:

> Sorry that no one has replied to you in a while. Are you still
> experiencing this difficulty?
>
> --Steve
> On 4/4/2023 3:08 AM, Shamim Haque 1910511 wrote:
>
> Dear Steven,
>
> I assure you that this was the first time I submitted the simulation. I
> used "sim create-submit", which would not submit the job if a simulation
> with the same name had been created earlier.
>
> Secondly, I found this same message in the output files from the
> debug queue (1 node, with GRHydro) and the high memory queue (3 nodes, with
> IllinoisGRMHD), where the simulation ran successfully. I have attached the
> output files for reference.
>
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
>
> On Tue, Apr 4, 2023 at 12:35 AM Steven R. Brandt <sbrandt at cct.lsu.edu>
> wrote:
>
>> I see this error message in your output:
>>
>>   -> No HDF5 checkpoint files with basefilename 'checkpoint.chkpt'
>> and file extension '.h5' found in recovery directory
>> 'nsns_toy1.2_DDME2BPS_quark_1.2vs1.6M_40km_g25'
>>
>> I suspect you did a "sim submit" for a job, got a failure, and did a
>> second "sim submit" without purging. That immediately triggered the error.
>> Then, for some reason, MPI didn't shut down cleanly and the processes hung
>> doing nothing until they used up the walltime.
>>
>> --Steve
>> On 4/2/2023 5:16 AM, Shamim Haque 1910511 wrote:
>>
>> Hello,
>>
>> I am trying to run a BNSM simulation using IllinoisGRMHD on HPC Kanad at
>> IISER Bhopal. While I have verified that the parfile runs fine on the
>> debug queue (1 node) and the high memory queue (3 nodes), I am unable to
>> run the simulation in a queue with 9 nodes (144 cores).
>>
>> The output file suggests that the setup of the listed thorns does not
>> complete within 24 hours, which is the maximum walltime for this queue.
>>
>> Is there a way to sort out this issue? I have attached the parfile and
>> outfile for reference.
>>
>> Regards
>> Shamim Haque
>> Senior Research Fellow (SRF)
>> Department of Physics
>> IISER Bhopal
>>
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
>>
>