[Users] Issue with npernode value in MPI
Steven R. Brandt
sbrandt at cct.lsu.edu
Fri May 10 10:08:24 CDT 2024
Cool. If anyone else uses that machine, consider submitting your
ini/cfg/runscript/submitscript to Simfactory. Thanks!
On 5/10/2024 5:41 AM, Shamim Haque 1910511 wrote:
> Hello Steve,
>
> My Runscript is now working fine after removing -npernode. Thanks for
> the help.
>
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
>
> On Thu, May 2, 2024 at 1:05 AM Shamim Haque 1910511
> <shamims at iiserb.ac.in> wrote:
>
> Hello Steve,
>
> Thanks for pointing this out. I'll try to write a fresh runscript
> by looking at the example runscripts.
>
> Since you're using slurm, MPI should be smart enough that you
> don't need to pass -n, -npernode,
>
> So I don't need to pass -n either? I can see -n @NUM_PROCS@ in the
> SBATCH runscripts that use Open MPI (for example, graham and expanse).
> Could you please explain what the simplest/safest MPI execution
> command to start with would be, and how we can build on it to
> optimise it further?
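As a rough illustration of the question above, the sketch below assembles a launch command that passes at most `-n` and never `-npernode`, which matches the change reported to work in this thread. It assumes Open MPI's `mpiexec` can take process placement from the Slurm allocation, which depends on how Open MPI was built on the cluster. The function `build_mpi_command` and the file names are hypothetical and are not part of Simfactory.

```python
def build_mpi_command(executable, parfile, num_procs=None, launcher="mpiexec"):
    """Return the launch command as a list of arguments.

    Hypothetical helper for illustration only.
    num_procs=None leaves the process count to the launcher;
    -npernode is deliberately never emitted.
    """
    cmd = [launcher]
    if num_procs is not None:
        # Reject floats such as 640.0 so that "-n 640.0" can never be produced.
        if not isinstance(num_procs, int) or num_procs < 1:
            raise ValueError(f"num_procs must be a positive integer, got {num_procs!r}")
        cmd += ["-n", str(num_procs)]
    # "-L 3" mirrors the logging option used in the runscript from the report.
    cmd += [executable, "-L", "3", parfile]
    return cmd


# Placeholder file names, not the paths from the report.
print(" ".join(build_mpi_command("cactus_sim", "run.par", num_procs=640)))
print(" ".join(build_mpi_command("cactus_sim", "run.par")))
```

Starting from the shortest command that runs and adding placement options one at a time makes it easier to see which option the launcher rejects.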
>
> How did you get a Runscript and Submitscript for this machine? Did
> you create them yourself?
>
> My first attempt was on an HPC cluster at my home institute, IISER
> Bhopal, which had PBS. Then I installed ETK at the NSM facility (India),
> which has SLURM. I changed most of the machinefile and submitscript
> to suit SLURM (by looking at the available example scripts) and copied
> the runscript from the PBS setup, where it was working fine.
>
> The current HPC cluster also has SLURM, so I copied all the scripts
> from the NSM facility. They always worked there, so I never doubted
> the runscript, especially because the error has shown up only on the
> current HPC cluster, and only occasionally.
>
> Now, I can see from example runscripts that mpi execution commands
> for SBATCH look very different from PBS ones.
>
> Regards
> Shamim Haque
> Senior Research Fellow (SRF)
> Department of Physics
> IISER Bhopal
>
> On Wed, May 1, 2024 at 11:05 PM Steven Brandt
> <sbrandt at cct.lsu.edu> wrote:
>
> Hello Shamim,
>
> The error says that you're calling MPI with the wrong
> parameters, specifically -npernode. Since you're using slurm,
> MPI should be smart enough that you don't need to pass -n or
> -npernode. How did you get a Runscript and Submitscript for
> this machine? Did you create them yourself?
>
> --Steve
>
> On 5/1/2024 6:54 AM, Shamim Haque 1910511 wrote:
>> Hi all,
>>
>> I am attempting to install ETK on the KALINGA cluster at NISER,
>> India. This cluster has 40 procs per node and uses the SLURM
>> workload manager.
>>
>> I compiled ETK with gcc-7.5 and openmpi-4.0.5 (attached the
>> machinefile, optionlist, submitscript and runscript). The
>> installation is mostly alright, as I can run parfiles for
>> test TOV and BNS mergers.
>>
>> I tried to run a simulation with procs=160 (nodes 4) and
>> num-threads=1 but landed with this error (error file also
>> attached):
>>
>> + mpiexec -n 640 -npernode 40.0
>> /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/SIMFACTORY/exe/cactus_sim
>> -L 3
>> /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/output-0000/eos20_dx25_r500_rg7.par
>> ----------------------------------------------------------------------------
>> Open MPI has detected that a parameter given to a command line
>> option does not match the expected format:
>>
>> Option: npernode
>> Param: 40.0
>>
>> This is frequently caused by omitting to provide the parameter
>> to an option that requires one. Please check the command line
>> and try again.
>> ----------------------------------------------------------------------------
>>
>> Strangely, this error is not at all regular. Mostly, the
>> error won't appear, and the simulation works just fine (with
>> no changes being made in the scripts or simfactory
>> command). In fact, this exact simulation has worked fine
>> before. Since I am unable to find the source of this issue, I
>> am also unable to recreate the error on my own. But it does
>> kick in occasionally.
>>
>> My command for mpi execution in runscript looks like this:
>>
>> time mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@ @EXECUTABLE@ -L 3 @PARFILE@
>>
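A possible explanation for the `40.0` in the error above, offered as a sketch rather than a confirmed diagnosis: if the `@( ... )@` expression in the runscript is evaluated by Python (an assumption about Simfactory's internals that this thread does not confirm), then `/` under Python 3 is true division and returns a float even when both operands are integers. The variable names below are illustrative only.

```python
# Assumption: the runscript expression is evaluated as a Python expression,
# here imitated with eval(). Values are taken from the report.
ppn_used = 40      # 40 procs per node
num_threads = 1    # --num-threads=1

# Python 3 true division: always a float, even for integer operands.
true_div = eval(f"{ppn_used} / {num_threads}")
# Floor division: stays an integer for integer operands.
floor_div = eval(f"{ppn_used} // {num_threads}")

print(f"-npernode {true_div}")   # -npernode 40.0
print(f"-npernode {floor_div}")  # -npernode 40
```

If this is the cause, integer division (`//`) or an explicit `int(...)` in the expression would keep the value in the form Open MPI expects. This does not by itself explain why the failure was only occasional; the fix reported elsewhere in this thread was to drop -npernode altogether.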
>> If I replace @(@PPN_USED@ / @NUM_THREADS@)@ with a desired
>> value, then the script always works. My simfactory command
>> looks like this:
>>
>> ./simfactory/bin/sim create-submit
>> dx25_r500_rg7_t30_p640-1_2
>> --parfile=par-smooth/scale_test/eos20_dx25_r500_rg7.par
>> --queue=large1 --procs=640 --num-threads=1 --walltime=00:45:00
>>
>> I am unable to understand how to solve this issue. Any help
>> with this issue is appreciated. Please let me know if you
>> need more information. Thank you.
>>
>> Regards
>> Shamim Haque
>> Senior Research Fellow (SRF)
>> Department of Physics
>> IISER Bhopal
>>
>> _______________________________________________
>> Users mailing list
>> Users at einsteintoolkit.org
>> http://lists.einsteintoolkit.org/mailman/listinfo/users