[Users] Getting too many threads per process started on "remote" nodes
Shoup, Anthony
shoup.31 at osu.edu
Thu Jan 30 21:16:48 CST 2020
Hi all,
I am running ETK (2019_10) on a home-built cluster consisting of two nodes (8 cores, 16 threads, 64 GB, 4.3 GHz each). I just finished my second node and am trying to run a simulation (BBHMedRes) across both nodes. For starters I am running just one process (one thread per process) on each node. When I execute my simfactory submit command, I get one process with one thread on the node I submitted the simulation from. However, I get one process with 16 threads on the second node, which I don't want. When I run on just the first node, the number of processes and threads per process are exactly what I specify in the simfactory submit command. If I submit the simulation on the second node and run only on the second node, I also get exactly the processes/threads I specify. It's only when I run on multiple nodes that I don't get the number of processes/threads that I specify. Is there something I am doing wrong? I am using OpenMPI.
Thanks for any help, Tony...
Relevant data is:
1. RunScript:
#!/bin/sh
# This runscript is used internally by simfactory as a template during the
# sim setup and sim setup-silent commands
# Edit at your own risk
echo "Preparing:"
set -x # Output commands
set -e # Abort on errors
cd @RUNDIR@-active
echo "Checking:"
pwd
hostname
date
echo "Environment:"
export CACTUS_NUM_PROCS=@NUM_PROCS@
export CACTUS_NUM_THREADS=@NUM_THREADS@
export GMON_OUT_PREFIX=gmon.out
export OMP_NUM_THREADS=@NUM_THREADS@
env | sort > SIMFACTORY/ENVIRONMENT
echo "Starting:"
export CACTUS_STARTTIME=$(date +%s)
if [ ${CACTUS_NUM_PROCS} = 1 ]; then
if [ @RUNDEBUG@ -eq 0 ]; then
@EXECUTABLE@ -L 3 @PARFILE@
else
gdb --args @EXECUTABLE@ -L 3 @PARFILE@
fi
else
mpirun --hostfile /home/mpiuser/mpi-hosts -np @NUM_PROCS@ @EXECUTABLE@ -L 3 @PARFILE@
fi
echo "Stopping:"
date
echo "Done."
2. mpi-hosts file:
localhost slots=1
RZNode2 slots=1
3. simfactory submit command: ./simfactory/bin/sim submit BBHMedRes --parfile=par/BBHMedRes.par --procs=2 --num-smt=1 --num-threads=1 --ppn-used=1 --ppn=1 --walltime=99:0:0 | cat
4. Machine file on first node (RZNode1):
[RZNode1]
# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk
# Machine description
nickname = RZNode1
name = RZNode1
location = somewhere
description = Whatever
status = personal
# Access to this machine
hostname = RZNode1
aliaspattern = ^generic\.some\.where$
# Source tree management
sourcebasedir = /home/Cactus
optionlist = generic.cfg
submitscript = generic.sub
runscript = generic.run
make = make -j@MAKEJOBS@
basedir = /home/mpiuser/simulations
ppn = 1 # was 16
max-num-threads = 1 # was 16
num-threads = 1 # was 16
nodes = 2
submit = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
getstatus = ps @JOB_ID@
stop = kill @JOB_ID@
submitpattern = (.*)
statuspattern = "^ *@JOB_ID@ "
queuedpattern = $^
runningpattern = ^
holdingpattern = $^
exechost = echo localhost
exechostpattern = (.*)
stdout = cat @SIMULATION_NAME@.out
stderr = cat @SIMULATION_NAME@.err
stdout-follow = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err
5. Machine file on second node (RZNode2):
[RZNode2]
# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk
# Machine description
nickname = RZNode2
name = RZNode2
location = somewhere
description = Whatever
status = personal
# Access to this machine
hostname = RZNode2
aliaspattern = ^generic\.some\.where$
# Source tree management
sourcebasedir = /home/ET_2019_10
optionlist = generic.cfg
submitscript = generic.sub
runscript = generic.run
make = make -j@MAKEJOBS@
basedir = /home/mpiuser/simulations
ppn = 1
max-num-threads = 1
num-threads = 1
nodes = 1
submit = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
getstatus = ps @JOB_ID@
stop = kill @JOB_ID@
submitpattern = (.*)
statuspattern = "^ *@JOB_ID@ "
queuedpattern = $^
runningpattern = ^
holdingpattern = $^
exechost = echo localhost
exechostpattern = (.*)
stdout = cat @SIMULATION_NAME@.out
stderr = cat @SIMULATION_NAME@.err
stdout-follow = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err
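A quick way to see which thread setting each process actually receives is to print OMP_NUM_THREADS together with the host name. The following is only a minimal sketch and not part of simfactory: the function name report_threads is made up for illustration, and the commented-out mpirun lines assume Open MPI, whose -x option exports an environment variable to all ranks.

```shell
#!/bin/sh
# Diagnostic sketch (hypothetical helper, not part of simfactory):
# report the OpenMP thread setting seen by the current process.
report_threads() {
    # ${VAR:-unset} expands to "unset" when the variable is missing or empty.
    # uname -n prints the node's host name.
    echo "$(uname -n): OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"
}

# Local check on the node where this script runs.
report_threads

# To run the same check on every rank, the one-liner can be handed to mpirun
# (left commented out so the sketch runs without MPI installed):
# mpirun --hostfile /home/mpiuser/mpi-hosts -np 2 \
#     sh -c 'echo "$(uname -n): OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'
#
# With Open MPI, adding -x exports the variable to all ranks explicitly:
# mpirun --hostfile /home/mpiuser/mpi-hosts -np 2 -x OMP_NUM_THREADS \
#     sh -c 'echo "$(uname -n): OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'
```

If the second node reports "unset" in the first mpirun form but the expected value in the -x form, the variable exported in the runscript is not reaching the remote process.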
Anthony Shoup PhD, Senior Lecturer
College of Arts & Sciences, College of Engineering Departments of Physics, Astronomy, EEIC
315 Science Bldg. | 4250 Campus Dr. Lima, OH 45807
419-995-8018 Office | 419-516-2257 Mobile
shoup.31 at osu.edu | osu.edu