[Users] Issue with npernode value in MPI

Steven R. Brandt sbrandt at cct.lsu.edu
Fri May 10 10:08:24 CDT 2024


Cool. If anyone else uses that machine, you can submit your 
ini/cfg/runscript/submitscript files to Simfactory. Thanks!

On 5/10/2024 5:41 AM, Shamim Haque 1910511 wrote:
> Hello Steve,
>
> My Runscript is now working fine after removing -npernode. Thanks for 
> the help.
>
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
>
> On Thu, May 2, 2024 at 1:05 AM Shamim Haque 1910511 
> <shamims at iiserb.ac.in> wrote:
>
>     Hello Steve,
>
>     Thanks for pointing this out. I'll try to write a fresh runscript
>     by looking at example runscripts.
>
>     Since you're using slurm, MPI should be smart enough that you
>     don't need to pass -n, -npernode,
>
>     I don't need to pass -n either? I can see -n @NUM_PROCS@ in the
>     SBATCH runscripts that use openmpi (example - graham, expanse).
>     Can you please explain a little bit about what should be the
>     simplest/safest mpi execution command to start with? And how can
>     we build it further to optimise it more?
>
>     How did you get a Runscript and Submitscript for this machine? Did
>     you create them yourself?
>
>     My first attempt was at an HPC at my home institute IISER Bhopal,
>     which had PBS. Then, I installed ETK in the NSM facility (India),
>     which has SLURM. I changed most of the stuff in machinefile and
>     submitscript as per SLURM (by looking at available example scripts)
>     and copied the runscript from PBS, which was working fine.
>
>     The current HPC also has SLURM, so I copied all the scripts from
>     the NSM facility. It always worked alright, so I was never quite
>     sceptical about the runscript, especially because the error has
>     shown up only in the current HPC and only occasionally.
>
>     Now, I can see from example runscripts that mpi execution commands
>     for SBATCH look very different from PBS ones.
>
>     Regards
>     Shamim Haque
>     Senior Research Fellow (SRF)
>     Department of Physics
>     IISER Bhopal
>
>     On Wed, May 1, 2024 at 11:05 PM Steven Brandt
>     <sbrandt at cct.lsu.edu> wrote:
>
>         Hello Shamim,
>
>         The error says that you're calling MPI with the wrong
>         parameters, specifically -npernode. Since you're using slurm,
>         MPI should be smart enough that you don't need to pass -n,
>         -npernode. How did you get a Runscript and Submitscript for
>         this machine? Did you create them yourself?
>
>         --Steve
>
>         On 5/1/2024 6:54 AM, Shamim Haque 1910511 wrote:
>>         Hi all,
>>
>>         I am attempting ETK installation in KALINGA Cluster at NISER,
>>         India. This cluster has 40 procs per node and SLURM workload
>>         manager.
>>
>>         I compiled ETK with gcc-7.5 and openmpi-4.0.5 (attached the
>>         machinefile, optionlist, submitscript and runscript). The
>>         installation is mostly alright, as I can run parfiles for
>>         test TOV and BNS mergers.
>>
>>         I tried to run a simulation with procs=160 (nodes 4) and
>>         num-threads=1 but landed with this error (error file also
>>         attached):
>>
>>         + mpiexec -n 640 -npernode 40.0
>>         /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/SIMFACTORY/exe/cactus_sim
>>         -L 3
>>         /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/output-0000/eos20_dx25_r500_rg7.par
>>         ----------------------------------------------------------------------------
>>         Open MPI has detected that a parameter given to a command line
>>         option does not match the expected format:
>>
>>           Option: npernode
>>           Param:  40.0
>>
>>         This is frequently caused by omitting to provide the parameter
>>         to an option that requires one. Please check the command line
>>         and try again.
>>         ----------------------------------------------------------------------------
>>
>>         Strangely, this error is not at all regular. Mostly, the
>>         error won't appear, and the simulation works just fine (with
>>         no changes being made in the scripts or simfactory
>>         command). In fact, this exact simulation has worked fine
>>         before. Since I am unable to find the source of this issue, I
>>         am also unable to recreate the error on my own. But it does
>>         kick in occasionally.
>>
>>         My command for mpi execution in runscript looks like this:
>>
>>         time mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ /
>>         @NUM_THREADS@)@ @EXECUTABLE@ -L 3 @PARFILE@
>>
>>         If I replace @(@PPN_USED@ / @NUM_THREADS@)@ with a desired
>>         value, then the script always works. My simfactory command
>>         looks like this:
>>
>>         ./simfactory/bin/sim create-submit
>>         dx25_r500_rg7_t30_p640-1_2
>>         --parfile=par-smooth/scale_test/eos20_dx25_r500_rg7.par
>>         --queue=large1 --procs=640 --num-threads=1 --walltime=00:45:00
>>
>>         I am unable to understand how to solve this issue. Any help
>>         with this issue is appreciated. Please let me know if you
>>         need more information. Thank you.
>>
>>         Regards
>>         Shamim Haque
>>         Senior Research Fellow (SRF)
>>         Department of Physics
>>         IISER Bhopal
>>         _______________________________________________
>>         Users mailing list
>>         Users at einsteintoolkit.org
>>         http://lists.einsteintoolkit.org/mailman/listinfo/users
>
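
Note on the "40.0" in the error above: the thread does not establish where
the trailing ".0" comes from. One plausible explanation, offered here only
as an assumption, is that the @( ... )@ expression in the runscript is
evaluated by a Python 3 interpreter, where "/" is true division and always
returns a float. The sketch below reproduces just that arithmetic. The
variable and function names are hypothetical stand-ins for @PPN_USED@ and
@NUM_THREADS@, not Simfactory code.

```python
# Hypothetical stand-ins for the runscript macros; values from the failing run.
ppn_used = 40      # @PPN_USED@: processes per node on this cluster
num_threads = 1    # @NUM_THREADS@: from --num-threads=1


def npernode_true_division(ppn, threads):
    """Same arithmetic as '@PPN_USED@ / @NUM_THREADS@' under Python 3.

    In Python 3, '/' is true division and returns a float even when
    the operands divide evenly.
    """
    return ppn / threads


def npernode_floor_division(ppn, threads):
    """Same arithmetic with '//' (floor division), which keeps an int."""
    return ppn // threads


# What would be substituted after '-npernode' in each case.
bad = str(npernode_true_division(ppn_used, num_threads))
good = str(npernode_floor_division(ppn_used, num_threads))

print("-npernode", bad)
print("-npernode", good)
```

If this assumption holds, writing the expression with "//" would keep the
value an integer for anyone who still wants to pass -npernode. The fix that
was actually confirmed in this thread is simply removing -npernode from the
runscript.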