[Users] Getting too many threads per process started on "remote" nodes

Shoup, Anthony shoup.31 at osu.edu
Fri Jan 31 08:03:19 CST 2020


Hi Erik,

Thanks for the info. I will try that.  Tony…

From: Erik Schnetter <schnetter at cct.lsu.edu>
Sent: Friday, January 31, 2020 9:01 AM
To: Shoup, Anthony <shoup.31 at osu.edu>
Cc: Einstein Toolkit Users <users at einsteintoolkit.org>
Subject: Re: [Users] Getting too many threads per process started on "remote" nodes

Anthony

This sounds as if the environment variable OMP_NUM_THREADS was not sent to the second node. This would be the fault of the mpirun command. You might need to use a particular option.
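
With Open MPI (which the original message says is in use), one such option is -x, which exports a named environment variable to the launched processes, including those on remote nodes. The following is a minimal sketch rather than a tested fix: build_mpirun_cmd is a hypothetical helper introduced here only to show where "-x OMP_NUM_THREADS" would go in the runscript's mpirun line; the hostfile path and the @EXECUTABLE@ and @PARFILE@ placeholders are taken from the runscript in the original message.

```shell
#!/bin/sh
# Hypothetical helper (not part of simfactory): prints the mpirun command
# line with Open MPI's -x option added, so that OMP_NUM_THREADS is
# exported to every rank, including ranks started on remote nodes.
build_mpirun_cmd() {
    hostfile=$1
    nprocs=$2
    shift 2
    # The remaining arguments are the executable and its options.
    printf '%s\n' "mpirun --hostfile $hostfile -np $nprocs -x OMP_NUM_THREADS $*"
}

# The variable must be set in the launching shell for -x to pick it up.
OMP_NUM_THREADS=1
export OMP_NUM_THREADS

# Print the line that would replace the mpirun call in the runscript.
build_mpirun_cmd /home/mpiuser/mpi-hosts 2 @EXECUTABLE@ -L 3 @PARFILE@
```

The printed line differs from the runscript's mpirun line only by "-x OMP_NUM_THREADS". Whether this resolves the 16-thread behaviour on the second node has not been verified in this thread.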

-erik

On Thu, Jan 30, 2020 at 22:17 Shoup, Anthony <shoup.31 at osu.edu> wrote:
Hi all,

I am running ETK (2019_10) on a home-built cluster consisting of two nodes (8 cores, 16 threads, 64 GB, 4.3 GHz each). I just finished my second node and am trying to run a simulation (BBHMedRes) across both nodes. For starters I am running just one process (one thread per process) on each node. When I execute my simfactory submit command, I get one process with one thread on the node I submitted the simulation from. However, I get one process with 16 threads on the second node, which I don't want. When I run on just the first node, the number of processes and threads per process are exactly what I specify in the simfactory submit command. If I submit the simulation on the second node and run only on the second node, I again get exactly the processes/threads I specify. It's only when I run on multiple nodes that I don't get the number of processes/threads that I specify. Is there something I am doing wrong? I am using OpenMPI.
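
One way to check what each rank actually sees is to launch a trivial command through the same mpirun and hostfile and print the variable on every node. A minimal sketch, assuming a POSIX shell on both nodes; report_threads is a hypothetical helper, not part of simfactory or Cactus:

```shell
#!/bin/sh
# Hypothetical diagnostic helper: prints the node name and the value of
# OMP_NUM_THREADS as seen by the calling process ("unset" if missing).
report_threads() {
    printf '%s: OMP_NUM_THREADS=%s\n' "$(uname -n)" "${OMP_NUM_THREADS:-unset}"
}

# Local check. To see each rank's view, run the same printf under mpirun, e.g.
#   mpirun --hostfile /home/mpiuser/mpi-hosts -np 2 sh -c \
#     'printf "%s: OMP_NUM_THREADS=%s\n" "$(uname -n)" "${OMP_NUM_THREADS:-unset}"'
report_threads
```

A node that prints "unset" would fall back to the OpenMP runtime's default, which is typically one thread per hardware thread; that would be consistent with the 16 threads seen on the second node.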

Thanks for any help, Tony...

Relevant data is:


  1.  RunScript:

#!/bin/sh

# This runscript is used internally by simfactory as a template during the
# sim setup and sim setup-silent commands
# Edit at your own risk

echo "Preparing:"
set -x                          # Output commands
set -e                          # Abort on errors

cd @RUNDIR@-active

echo "Checking:"
pwd
hostname
date

echo "Environment:"
export CACTUS_NUM_PROCS=@NUM_PROCS@
export CACTUS_NUM_THREADS=@NUM_THREADS@
export GMON_OUT_PREFIX=gmon.out
export OMP_NUM_THREADS=@NUM_THREADS@
env | sort > SIMFACTORY/ENVIRONMENT

echo "Starting:"
export CACTUS_STARTTIME=$(date +%s)

if [ ${CACTUS_NUM_PROCS} = 1 ]; then
    if [ @RUNDEBUG@ -eq 0 ]; then
     @EXECUTABLE@ -L 3 @PARFILE@
    else
     gdb --args @EXECUTABLE@ -L 3 @PARFILE@
    fi
else
mpirun --hostfile /home/mpiuser/mpi-hosts -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
fi

echo "Stopping:"
date
echo "Done."


  2.  mpi-hosts file:

localhost slots=1
RZNode2 slots=1


  3.  simfactory submit command: ./simfactory/bin/sim submit BBHMedRes --parfile=par/BBHMedRes.par --procs=2 --num-smt=1 --num-threads=1 --ppn-used=1 --ppn=1 --walltime=99:0:0 | cat
  4.  Machine file on first node (RZNode1):

[RZNode1]

# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk
# Machine description
nickname        = RZNode1
name            = RZNode1
location        = somewhere
description     = Whatever
status          = personal

# Access to this machine
hostname        = RZNode1
aliaspattern    = ^generic\.some\.where$

# Source tree management
sourcebasedir   = /home/Cactus
optionlist      = generic.cfg
submitscript    = generic.sub
runscript       = generic.run
make            = make -j@MAKEJOBS@
basedir         = /home/mpiuser/simulations
ppn             = 1   # was 16
max-num-threads = 1   # was 16
num-threads     = 1   # was 16
nodes           = 2
submit          = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
getstatus       = ps @JOB_ID@
stop            = kill @JOB_ID@
submitpattern   = (.*)
statuspattern   = "^ *@JOB_ID@ "
queuedpattern   = $^
runningpattern  = ^
holdingpattern  = $^
exechost        = echo localhost
exechostpattern = (.*)
stdout          = cat @SIMULATION_NAME@.out
stderr          = cat @SIMULATION_NAME@.err
stdout-follow   = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err

  5.  Machine file on second node (RZNode2):

[RZNode2]

# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk
# Machine description
nickname        = RZNode2
name            = RZNode2
location        = somewhere
description     = Whatever
status          = personal

# Access to this machine
hostname        = RZNode2
aliaspattern    = ^generic\.some\.where$

# Source tree management
sourcebasedir   = /home/ET_2019_10
optionlist      = generic.cfg
submitscript    = generic.sub
runscript       = generic.run
make            = make -j@MAKEJOBS@
basedir         = /home/mpiuser/simulations
ppn             = 1
max-num-threads = 1
num-threads     = 1
nodes           = 1
submit          = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
getstatus       = ps @JOB_ID@
stop            = kill @JOB_ID@
submitpattern   = (.*)
statuspattern   = "^ *@JOB_ID@ "
queuedpattern   = $^
runningpattern  = ^
holdingpattern  = $^
exechost        = echo localhost
exechostpattern = (.*)
stdout          = cat @SIMULATION_NAME@.out
stderr          = cat @SIMULATION_NAME@.err
stdout-follow   = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err


Anthony Shoup PhD, Senior Lecturer
College of Arts & Sciences, College of Engineering Departments of Physics, Astronomy, EEIC
315 Science Bldg. | 4250 Campus Dr. Lima, OH 45807
419-995-8018 Office | 419-516-2257 Mobile
shoup.31 at osu.edu | osu.edu
_______________________________________________
Users mailing list
Users at einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users
--
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/
