[Users] Error in BNS Simulation at Cluster

Steven R. Brandt sbrandt at cct.lsu.edu
Thu Aug 12 10:03:10 CDT 2021


Ideally, all the #PBS directives should go into your SubmitScript. 
However, that is not the cause of your error.

I am wondering whether the mpirun you launch with matches the MPI that 
Cactus was configured with. A mismatch could give you the error you see.

Cactus configures MPI based on which mpic++ (or mpicc) is in your PATH 
at the time of configuration. Did you do the same module load before 
compiling as you do for running? (module load 
/home2/mallick/ET/Cactus/openmpi-x86_64)
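
One rough way to check this is to compare where the launcher and the 
compiler wrapper live after the module load. The sketch below is only a 
heuristic: it assumes both commands sit in the bin directory of the same 
MPI installation, which may not hold for every packaging. The function 
names prefix_of and same_mpi_prefix are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the install prefix (the parent of the bin
# directory) of a command found on PATH, with symlinks resolved.
prefix_of() {
    local exe
    exe=$(command -v "$1") || return 1
    exe=$(readlink -f "$exe")
    dirname "$(dirname "$exe")"
}

# Hypothetical helper: compare the prefixes of a launcher and a compiler
# wrapper. Returns 0 on match, 1 on mismatch, 2 if either is missing.
same_mpi_prefix() {
    local launcher compiler
    launcher=$(prefix_of "$1") || { echo "missing: $1"; return 2; }
    compiler=$(prefix_of "$2") || { echo "missing: $2"; return 2; }
    if [ "$launcher" = "$compiler" ]; then
        echo "match: $launcher"
    else
        echo "mismatch: $1 in $launcher, $2 in $compiler"
        return 1
    fi
}

# Run this after the same module load used for compiling and running.
same_mpi_prefix mpirun mpicc || true
```

Running ldd on the Cactus executable and looking for the libmpi line 
gives a complementary check of which MPI library it actually links.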

Note that some builds of MPI now use the PMI interface and don't have an 
actual mpirun command. Please make sure this isn't what's going on here.
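
To see which launcher is actually available on the compute node, 
something like the following could go near the top of the job script. It 
is a minimal sketch: find_launcher is a made-up name, and the candidate 
list (mpirun, mpiexec, srun) is an assumption about what this cluster 
might provide.

```shell
#!/bin/bash
# Hypothetical helper: print the first command from the argument list
# that exists on PATH, or "none" (with a non-zero status) if none does.
find_launcher() {
    local cand
    for cand in "$@"; do
        if command -v "$cand" >/dev/null 2>&1; then
            echo "$cand"
            return 0
        fi
    done
    echo "none"
    return 1
}

# Report what is available after the module load.
echo "launcher: $(find_launcher mpirun mpiexec srun || true)"
```

If only srun turns up, the MPI was probably built to be started through 
the scheduler, and the RunScript would need to use that instead of 
mpirun.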

Curiosity question: Is your batch queue system Torque, or are you using 
the PBS interface to Slurm?
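
One way to tell from inside a running job is to look at the environment 
variables the scheduler sets. This is a sketch under the assumption that 
Slurm exports SLURM_JOB_ID and that Torque or PBS exports PBS_JOBID; 
Slurm is tested first because its PBS compatibility layer may also set 
PBS-style variables. The name detect_scheduler is made up here.

```shell
#!/bin/bash
# Hypothetical helper: guess the batch system from job environment
# variables. Prints "slurm", "pbs" or "unknown".
detect_scheduler() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "slurm"
    elif [ -n "${PBS_JOBID:-}" ]; then
        echo "pbs"
    else
        echo "unknown"
    fi
}

# Run inside the job script; outside a job it prints "unknown".
echo "scheduler: $(detect_scheduler)"
```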

--Steve

On 8/12/2021 9:44 AM, Shamim Haque wrote:
> For that, I write a separate jobscript and submit it in the queue. 
> That jobscript looks like this:
>
> #!/bin/bash
> #PBS -N nsns2
> #PBS -j oe
> #PBS -V
> #PBS -v TEMP=/scratch1
> #PBS -o job_nsns2.out
> #PBS -e job_nsns2.err
> #PBS -l select=2:ncpus=16
> #PBS -l walltime=24:10:00
> #PBS -q p-queue
> export OMP_NUM_THREADS=1
> cd /home2/mallick/ET/Cactus
> CPUS=`cat $PBS_NODEFILE | wc -l`
> DATE=`date +%c`
> echo Job started at $DATE
> module load /home2/mallick/ET/Cactus/openmpi-x86_64/
> /home2/mallick/ET/Cactus/simfactory/bin/sim whoami
>
> time mpirun /home2/mallick/ET/Cactus/simfactory/bin/sim create-submit 
> nsns30 --basedir=/home2/mallick/simulations --procs=32 --ppn=16 
> --num-threads=1 --num-smt=1 --ppn-used=16 --parfile 
> /home2/mallick/ET/Cactus/parfile/nsns_vlr_mass_diff.par 
> --walltime=24:00:00
> DATE=`date +%c`
> echo Job finished at $DATE
>
> Regards
> Shamim Haque
> Junior Research Fellow (JRF)
> Department of Physics
> IISER Bhopal
>
> On Thu, Aug 12, 2021 at 7:52 PM Steven R. Brandt
> <sbrandt at cct.lsu.edu> wrote:
>
>     Something I'm not understanding. You want to run on 2 nodes, but
>     you don't seem to be using a batch queue system... so how does MPI
>     know which two nodes to use? Does this machine have slurm installed?
>
>     --Steve
>
>     On 8/12/2021 12:28 AM, Shamim Haque wrote:
>>     Hi Steven,
>>
>>     I used the generic Submitscript, no change in that. The Runscript
>>     is as follows:
>>
>>     echo "Preparing:"
>>     set -x                          # Output commands
>>     set -e                          # Abort on errors
>>     cd @RUNDIR@-active
>>     echo "Checking:"
>>     pwd
>>     hostname
>>     date
>>     echo "Environment:"
>>     module load /home2/mallick/ET/Cactus/openmpi-x86_64
>>     export CACTUS_NUM_PROCS=@NUM_PROCS@
>>     export CACTUS_NUM_THREADS=@NUM_THREADS@
>>     export GMON_OUT_PREFIX=gmon.out
>>     export OMP_NUM_THREADS=@NUM_THREADS@
>>     env | sort > SIMFACTORY/ENVIRONMENT
>>     echo "Starting:"
>>     export CACTUS_STARTTIME=$(date +%s)
>>     mpirun -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
>>     echo "Stopping:"
>>     date
>>     echo "Done."
>>
>>     I have attached these files here as well. All the output/error
>>     files are attached in my previous mail.
>>
>>     Regards
>>     Shamim Haque
>>     Junior Research Fellow (JRF)
>>     Department of Physics
>>     IISER Bhopal
>>
>>     On Thu, Aug 12, 2021 at 12:48 AM Steven R. Brandt
>>     <sbrandt at cct.lsu.edu> wrote:
>>
>>         What SubmitScript and RunScript are you using? Can you show
>>         us? Thanks.
>>
>>         --Steve
>>
>>         On 8/10/2021 2:35 AM, Shamim Haque wrote:
>>>         Hello,
>>>
>>>         I am trying to run the BNS simulation on the cluster at
>>>         IISER Bhopal. Upon using 2 nodes (16x2 cores) my simulation
>>>         stalled at this message:
>>>         The environment variable CACTUS_NUM_PROCS is set to 32, but
>>>         there are 1 MPI processes. This may indicate a severe
>>>         problem with the MPI startup mechanism.
>>>         APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
>>>
>>>         The command fed into the simfactory via jobscript is as follows:
>>>
>>>         time mpirun /home2/mallick/ET/Cactus/simfactory/bin/sim
>>>         create-submit nsns30 --basedir=/home2/mallick/simulations
>>>         --procs=32 --ppn=16 --num-threads=1 --num-smt=1
>>>         --ppn-used=16 --parfile
>>>         /home2/mallick/ET/Cactus/parfile/nsns_vlr_mass_diff.par
>>>         --walltime=24:00:00
>>>
>>>         I could not figure out the issue. I am also struggling with
>>>         setting up the machine scripts as per the cluster, so I am
>>>         not sure if that is somehow hampering the simulation.
>>>
>>>         Thanks in advance for helping me with this issue. I have
>>>         attached the concerned scripts and outputs for reference.
>>>
>>>         Regards
>>>         Shamim Haque
>>>         Junior Research Fellow (JRF)
>>>         Department of Physics
>>>         IISER Bhopal
>>>         _______________________________________________
>>>         Users mailing list
>>>         Users at einsteintoolkit.org  <mailto:Users at einsteintoolkit.org>
>>>         http://lists.einsteintoolkit.org/mailman/listinfo/users  <http://lists.einsteintoolkit.org/mailman/listinfo/users>