[Users] Issue with npernode value in MPI

Shamim Haque 1910511 shamims at iiserb.ac.in
Wed May 1 14:35:52 CDT 2024


Hello Steve,

Thanks for pointing this out. I'll try to write a fresh runscript by
looking at the example runscripts.

> Since you're using slurm, MPI should be smart enough that you don't need
> to pass -n or -npernode.

I don't need to pass -n as well? I can see -n @NUM_PROCS@ in the SBATCH
runscripts that use openmpi (for example, graham and expanse). Could you
please explain a little what the simplest/safest mpi execution command to
start with would be, and how we can build on it to optimise things further?
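
For instance, would something like the following be a safe starting point,
relying on OpenMPI picking up the node list and placement from the SLURM
allocation, so that only the total process count is passed (this is just my
guess from the example scripts)?

    time mpiexec -n @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@

Or should even -n @NUM_PROCS@ be dropped, leaving mpiexec to take everything
from the allocation?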

> How did you get a Runscript and Submitscript for this machine? Did you
> create them yourself?

My first attempt was on an HPC at my home institute, IISER Bhopal, which had
PBS. Then I installed ETK on the NSM facility (India), which has SLURM. I
changed most of the settings in the machinefile and submitscript for SLURM
(by looking at the available example scripts) and copied the runscript from
the PBS setup, which had been working fine.

The current HPC also has SLURM, so I copied all the scripts from the NSM
facility. Everything always worked fine, so I was never really sceptical
about the runscript, especially because the error has shown up only on the
current HPC, and only occasionally.

Now I can see from the example runscripts that the mpi execution commands
for SBATCH look very different from the PBS ones.
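
For instance, the line I carried over (which computes the per-node count by
hand),

    time mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@ @EXECUTABLE@ -L 3 @PARFILE@

looks quite unlike the SBATCH examples, which seem to pass at most
-n @NUM_PROCS@ and leave process placement to SLURM.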

Regards
Shamim Haque
Senior Research Fellow (SRF)
Department of Physics
IISER Bhopal

On Wed, May 1, 2024 at 11:05 PM Steven Brandt <sbrandt at cct.lsu.edu> wrote:

> Hello Shamim,
>
> The error says that you're calling MPI with the wrong parameters,
> specifically -npernode. Since you're using slurm, MPI should be smart
> enough that you don't need to pass -n or -npernode. How did you get a
> Runscript and Submitscript for this machine? Did you create them yourself?
>
> --Steve
> On 5/1/2024 6:54 AM, Shamim Haque 1910511 wrote:
>
> Hi all,
>
> I am attempting an ETK installation on the KALINGA cluster at NISER, India.
> This cluster has 40 cores per node and uses the SLURM workload manager.
>
> I compiled ETK with gcc-7.5 and openmpi-4.0.5 (machinefile, optionlist,
> submitscript, and runscript attached). The installation is mostly alright,
> as I can run parfiles for test TOV and BNS mergers.
>
> I tried to run a simulation with procs=160 (4 nodes) and num-threads=1, but
> landed with this error (error file also attached):
>
> + mpiexec -n 640 -npernode 40.0 /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/SIMFACTORY/exe/cactus_sim -L 3 /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/output-0000/eos20_dx25_r500_rg7.par
>
> ----------------------------------------------------------------------------
> Open MPI has detected that a parameter given to a command line option does
> not match the expected format:
>
>   Option: npernode
>   Param:  40.0
>
> This is frequently caused by omitting to provide the parameter to an
> option that requires one. Please check the command line and try again.
> ----------------------------------------------------------------------------
>
> Strangely, this error is not at all regular. Mostly, the error won't
> appear, and the simulation works just fine (with no changes being made in
> the scripts or simfactory command). In fact, this exact simulation has
> worked fine before. Since I am unable to find the source of this issue, I
> am also unable to recreate the error on my own. But it does kick in
> occasionally.
>
> My command for mpi execution in runscript looks like this:
>
> time mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@
> @EXECUTABLE@ -L 3 @PARFILE@
>
> If I replace @(@PPN_USED@ / @NUM_THREADS@)@ with a desired value, then the
> script always works. My simfactory command looks like this:
>
> ./simfactory/bin/sim create-submit dx25_r500_rg7_t30_p640-1_2
> --parfile=par-smooth/scale_test/eos20_dx25_r500_rg7.par --queue=large1
> --procs=640 --num-threads=1 --walltime=00:45:00
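>
> For example, in my case (40 cores per node and --num-threads=1), hard-coding
> the per-node value gives a runscript line that has always worked so far:
>
> time mpiexec -n @NUM_PROCS@ -npernode 40 @EXECUTABLE@ -L 3 @PARFILE@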
>
> I am unable to work out how to solve this issue. Any help is appreciated.
> Please let me know if you need more information. Thank you.
>
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users
>