[Users] Error in BNS Simulation at Cluster
Steven R. Brandt
sbrandt at cct.lsu.edu
Thu Aug 12 10:03:10 CDT 2021
Ideally, all the #PBS directives should go into your SubmitScript.
However, that's not your problem.
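
For illustration, the directives from the job script quoted below could be carried over into a SubmitScript roughly as follows. This is only a sketch: the @...@ names are assumed to match the placeholders in the submit scripts shipped with simfactory, and they should be verified against those files before use.

```shell
#! /bin/bash
# Sketch of a simfactory SubmitScript; simfactory fills in the @...@ fields.
# Placeholder names are assumptions; compare with the shipped submit scripts.
#PBS -N @SIMULATION_NAME@
#PBS -j oe
#PBS -V
#PBS -l select=@NODES@:ncpus=@PPN@
#PBS -l walltime=@WALLTIME@
#PBS -q @QUEUE@
#PBS -o @RUNDIR@/@SIMULATION_NAME@.out
cd @SOURCEDIR@
@SIMFACTORY@ run @SIMULATION_NAME@ --machine=@MACHINE@ --restart-id=@RESTART_ID@ @FROM_RESTART_COMMAND@
```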
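I am wondering whether the mpirun you launch with matches the MPI that
Cactus was configured with. A mismatch could give you the error you see:
each process typically starts as its own single-rank job, so Cactus
finds 1 MPI process instead of 32.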
Cactus configures MPI based on whichever mpic++ (or mpicc) is in your PATH
at the time of configuration. Did you load the same module before
compiling as you do for running? (module load
/home2/mallick/ET/Cactus/openmpi-x86_64)
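
One quick way to check this is to compare where mpicc and mpirun resolve to, once in the shell used for building and once inside the job. A minimal sketch, assuming both tools sit in the same bin directory of a given MPI installation (usual, but not guaranteed); the function name mpi_prefix_match is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report whether the first mpicc and mpirun in PATH
# live in the same directory. Returns 0 on match, 1 on mismatch, 2 if
# either tool is missing.
mpi_prefix_match() {
    local cc run
    cc=$(command -v mpicc) || { echo "mpicc not found in PATH"; return 2; }
    run=$(command -v mpirun) || { echo "mpirun not found in PATH"; return 2; }
    echo "mpicc : $cc"
    echo "mpirun: $run"
    if [ "$(dirname "$cc")" = "$(dirname "$run")" ]; then
        echo "same bin directory"
        return 0
    fi
    echo "different bin directories"
    return 1
}
```

Calling mpi_prefix_match right after the module load line, both on the login node and in the job script, shows whether the two environments agree.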
Note, some builds of MPI now use the PMI interface and don't have an
actual mpirun command. Please make sure this isn't what's going on here.
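
To see which launcher, if any, the loaded module actually provides, something like the following can be run after the module load. It is a sketch only: the candidate list (mpirun, mpiexec, srun) is an assumption, a site may use a different launcher, and find_mpi_launcher is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the full path of the first MPI launcher
# found in PATH, or fail with a message on stderr if none is found.
find_mpi_launcher() {
    local name path
    for name in mpirun mpiexec srun; do
        if path=$(command -v "$name"); then
            echo "$path"
            return 0
        fi
    done
    echo "no MPI launcher found in PATH" >&2
    return 1
}
```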
Curiosity question: Is your batch queue system Torque, or are you using
the PBS interface to Slurm?
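
One rough way to answer this from inside a running job is to look at the environment the scheduler sets up. The sketch below assumes that Slurm exports SLURM_JOB_ID and that PBS-style systems export PBS_JOBID; a PBS compatibility layer on top of Slurm may set both, so Slurm is checked first. detect_scheduler is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: guess the batch system from job environment variables.
# Prints "slurm", "pbs", or "none" (not inside a recognised batch job).
detect_scheduler() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "slurm"
    elif [ -n "${PBS_JOBID:-}" ]; then
        echo "pbs"
    else
        echo "none"
    fi
}
```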
--Steve
On 8/12/2021 9:44 AM, Shamim Haque wrote:
> For that, I write a separate jobscript and submit it to the queue.
> That jobscript looks like this:
>
> #!/bin/bash
> #PBS -N nsns2
> #PBS -j oe
> #PBS -V
> #PBS -v TEMP=/scratch1
> #PBS -o job_nsns2.out
> #PBS -e job_nsns2.err
> #PBS -l select=2:ncpus=16
> #PBS -l walltime=24:10:00
> #PBS -q p-queue
> export OMP_NUM_THREADS=1
> cd /home2/mallick/ET/Cactus
> CPUS=`cat $PBS_NODEFILE | wc -l`
> DATE=`date +%c`
> echo Job started at $DATE
> module load /home2/mallick/ET/Cactus/openmpi-x86_64
>
> /home2/mallick/ET/Cactus/simfactory/bin/sim whoami
>
> time mpirun /home2/mallick/ET/Cactus/simfactory/bin/sim create-submit \
>   nsns30 --basedir=/home2/mallick/simulations --procs=32 --ppn=16 \
>   --num-threads=1 --num-smt=1 --ppn-used=16 --parfile \
>   /home2/mallick/ET/Cactus/parfile/nsns_vlr_mass_diff.par \
>   --walltime=24:00:00
> DATE=`date +%c`
> echo Job finished at $DATE
>
> Regards
> Shamim Haque
> Junior Research Fellow (JRF)
> Department of Physics
> IISER Bhopal
>
> On Thu, Aug 12, 2021 at 7:52 PM Steven R. Brandt <sbrandt at cct.lsu.edu> wrote:
>
> Something I'm not understanding. You want to run on 2 nodes, but
> you don't seem to be using a batch queue system... so how does MPI
> know which two nodes to use? Does this machine have slurm installed?
>
> --Steve
>
> On 8/12/2021 12:28 AM, Shamim Haque wrote:
>> Hi Steven,
>>
>> I used the generic Submitscript, no change in that. The Runscript
>> is as follows:
>>
>> echo "Preparing:"
>> set -x # Output commands
>> set -e # Abort on errors
>> cd @RUNDIR@-active
>> echo "Checking:"
>> pwd
>> hostname
>> date
>> echo "Environment:"
>> module load /home2/mallick/ET/Cactus/openmpi-x86_64
>> export CACTUS_NUM_PROCS=@NUM_PROCS@
>> export CACTUS_NUM_THREADS=@NUM_THREADS@
>> export GMON_OUT_PREFIX=gmon.out
>> export OMP_NUM_THREADS=@NUM_THREADS@
>> env | sort > SIMFACTORY/ENVIRONMENT
>> echo "Starting:"
>> export CACTUS_STARTTIME=$(date +%s)
>> mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
>>
>> echo "Stopping:"
>> date
>> echo "Done."
>>
>> I have attached these files here as well. All the output/error
>> files are attached in my previous mail.
>>
>> Regards
>> Shamim Haque
>> Junior Research Fellow (JRF)
>> Department of Physics
>> IISER Bhopal
>>
>> On Thu, Aug 12, 2021 at 12:48 AM Steven R. Brandt <sbrandt at cct.lsu.edu> wrote:
>>
>> What SubmitScript and RunScript are you using? Can you show
>> us? Thanks.
>>
>> --Steve
>>
>> On 8/10/2021 2:35 AM, Shamim Haque wrote:
>>> Hello,
>>>
>>> I am trying to run the BNS simulation on the cluster at
>>> IISER Bhopal. Upon using 2 nodes (16x2 cores) my simulation
>>> stalled at this message:
>>> The environment variable CACTUS_NUM_PROCS is set to 32, but
>>> there are 1 MPI processes. This may indicate a severe
>>> problem with the MPI startup mechanism.
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
>>>
>>> The command fed into the simfactory via jobscript is as follows:
>>>
>>> time mpirun /home2/mallick/ET/Cactus/simfactory/bin/sim \
>>>   create-submit nsns30 --basedir=/home2/mallick/simulations \
>>>   --procs=32 --ppn=16 --num-threads=1 --num-smt=1 \
>>>   --ppn-used=16 --parfile \
>>>   /home2/mallick/ET/Cactus/parfile/nsns_vlr_mass_diff.par \
>>>   --walltime=24:00:00
>>>
>>> I could not figure out the issue. I am also struggling with
>>> setting up the machine scripts as per the cluster, so I am
>>> not sure if that is somehow hampering the simulation.
>>>
>>> Thanks in advance for helping me with this issue. I have
>>> attached the concerned scripts and outputs for reference.
>>>
>>> Regards
>>> Shamim Haque
>>> Junior Research Fellow (JRF)
>>> Department of Physics
>>> IISER Bhopal
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at einsteintoolkit.org
>>> http://lists.einsteintoolkit.org/mailman/listinfo/users