[Users] Issue with npernode value in MPI

Shamim Haque 1910511 shamims at iiserb.ac.in
Fri May 10 05:41:27 CDT 2024


Hello Steve,

My Runscript is now working fine after removing -npernode. Thanks for the
help.
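
A note on the 40.0 in the quoted error below: the runscript computes the
-npernode value as @(@PPN_USED@ / @NUM_THREADS@)@. If simfactory evaluates
that expression with Python 3 semantics (an assumption; nothing in this
thread confirms it), the / operator always returns a float, which would give
40.0 rather than 40. The sketch below only illustrates that arithmetic.
npernode_arg is a hypothetical helper, not part of simfactory, and the sketch
does not explain why the error appeared only occasionally.

```python
def npernode_arg(ppn_used, num_threads, integer=True):
    """Build the '-npernode N' fragment of an mpiexec command line.

    Hypothetical helper for illustration only. ppn_used and num_threads
    stand in for the values substituted for @PPN_USED@ and @NUM_THREADS@.
    """
    if integer:
        # Floor division of two ints stays an int: 40 // 1 -> 40
        value = ppn_used // num_threads
    else:
        # True division always yields a float in Python 3: 40 / 1 -> 40.0
        value = ppn_used / num_threads
    return f"-npernode {value}"


# 40 cores per node and 1 thread per rank, as in the quoted job.
print(npernode_arg(40, 1, integer=False))  # prints "-npernode 40.0"
print(npernode_arg(40, 1))                 # prints "-npernode 40"
```

Whether // is accepted inside @( ... )@ in a runscript has not been tested
here; dropping -npernode, as described above, avoids the question entirely.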

Regards
Shamim Haque
Senior Research Fellow (SRF)
Department of Physics
IISER Bhopal


On Thu, May 2, 2024 at 1:05 AM Shamim Haque 1910511 <shamims at iiserb.ac.in>
wrote:

> Hello Steve,
>
> Thanks for pointing this out. I'll try to write a fresh runscript by
> looking at the example runscripts.
>
>> Since you're using slurm, MPI should be smart enough that you don't need
>> to pass -n, -npernode,
>
> So I don't need to pass -n either? I can see -n @NUM_PROCS@ in the SBATCH
> runscripts that use OpenMPI (for example, graham and expanse). Could you
> please explain what the simplest and safest MPI execution command to start
> with would be, and how it can then be extended to optimise performance?
>
>> How did you get a Runscript and Submitscript for this machine? Did you
>> create them yourself?
>
> My first attempt was on an HPC system at my home institute, IISER Bhopal,
> which uses PBS. Then I installed ETK at the NSM facility (India), which uses
> SLURM. I adapted most of the machinefile and submitscript for SLURM (by
> looking at the available example scripts) and copied the runscript from the
> PBS setup, which was working fine.
>
> The current HPC also has SLURM, so I copied all the scripts from the NSM
> facility. They always worked, so I never had reason to doubt the runscript,
> especially because the error has shown up only on the current HPC, and only
> occasionally.
>
> Now, I can see from example runscripts that mpi execution commands for
> SBATCH look very different from PBS ones.
>
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
>
> On Wed, May 1, 2024 at 11:05 PM Steven Brandt <sbrandt at cct.lsu.edu> wrote:
>
>> Hello Shamim,
>>
>> The error says that you're calling MPI with the wrong parameters,
>> specifically -npernode. Since you're using slurm, MPI should be smart enough
>> that you don't need to pass -n or -npernode. How did you get a Runscript and
>> Submitscript for this machine? Did you create them yourself?
>>
>> --Steve
>> On 5/1/2024 6:54 AM, Shamim Haque 1910511 wrote:
>>
>> Hi all,
>>
>> I am installing ETK on the KALINGA cluster at NISER, India. This cluster
>> has 40 procs per node and uses the SLURM workload manager.
>>
>> I compiled ETK with gcc-7.5 and openmpi-4.0.5 (attached the machinefile,
>> optionlist, submitscript and runscript). The installation is mostly
>> alright, as I can run parfiles for test TOV and BNS mergers.
>>
>> I tried to run a simulation with procs=160 (nodes 4) and num-threads=1
>> but landed with this error (error file also attached):
>>
>> + mpiexec -n 640 -npernode 40.0
>> /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/SIMFACTORY/exe/cactus_sim
>> -L 3
>> /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/output-0000/eos20_dx25_r500_rg7.par
>> ----------------------------------------------------------------------------
>> Open MPI has detected that a parameter given to a command line option does
>> not match the expected format:
>>
>>   Option: npernode
>>   Param:  40.0
>>
>> This is frequently caused by omitting to provide the parameter to an option
>> that requires one. Please check the command line and try again.
>> ----------------------------------------------------------------------------
>>
>> Strangely, the error is intermittent. Most of the time it does not appear
>> and the simulation runs fine, with no changes made to the scripts or to the
>> simfactory command. In fact, this exact simulation has worked before. Since
>> I cannot find the source of the issue, I also cannot reproduce the error on
>> demand, but it does show up occasionally.
>>
>> My command for mpi execution in runscript looks like this:
>>
>> time mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@
>> @EXECUTABLE@ -L 3 @PARFILE@
>>
>> If I replace @(@PPN_USED@ / @NUM_THREADS@)@ with a fixed value, then the
>> script always works. My simfactory command looks like this:
>>
>> ./simfactory/bin/sim create-submit dx25_r500_rg7_t30_p640-1_2
>> --parfile=par-smooth/scale_test/eos20_dx25_r500_rg7.par --queue=large1
>> --procs=640 --num-threads=1 --walltime=00:45:00
>>
>> I am unable to understand how to solve this issue. Any help with this
>> issue is appreciated. Please let me know if you need more information.
>> Thank you.
>>
>> Regards
>> Shamim Haque
>> Senior Research Fellow (SRF)
>> Department of Physics
>> IISER Bhopal
>>
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users
>>
>