[Users] Getting too many threads per process started on "remote" nodes

Erik Schnetter schnetter at cct.lsu.edu
Fri Jan 31 08:01:10 CST 2020


Anthony,

This sounds as if the environment variable OMP_NUM_THREADS was not passed
on to the second node. That would be the fault of the mpirun command; you
might need to use a particular option.
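
For OpenMPI, the option in question is most likely "-x", which tells mpirun
to export a named environment variable to the processes it launches on
remote nodes. A minimal sketch of how the mpirun line in your runscript
could look (the hostfile path and the @...@ placeholders are copied from
your runscript below; forwarding the CACTUS_* variables as well is just a
suggestion):

    mpirun --hostfile /home/mpiuser/mpi-hosts -np @NUM_PROCS@ \
        -x OMP_NUM_THREADS -x CACTUS_NUM_PROCS -x CACTUS_NUM_THREADS \
        @EXECUTABLE@ -L 3 @PARFILE@

To check whether the variable actually reaches RZNode2, something like

    mpirun --hostfile /home/mpiuser/mpi-hosts -np 2 \
        sh -c 'echo "$(hostname): OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'

should print the value (or "unset") once per node.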

-erik

On Thu, Jan 30, 2020 at 22:17 Shoup, Anthony <shoup.31 at osu.edu> wrote:

> Hi all,
>
> I am running ETK (2019_10) on a home-built cluster consisting of two nodes
> (8 cores / 16 threads, 64 GB, 4.3 GHz each).  I just finished my second node
> and am trying to run a simulation (BBHMedRes) over both nodes. For starters
> I am just running one process (one thread per process) on each node.  When
> I execute my simfactory submit command, I get one process with one thread
> on the node I submitted the simulation on.  However, I get one process with
> 16 threads on the second node, which I don't want.  When I run on just the
> first node, the number of processes and threads per process is exactly what
> I specify in the simfactory submit command.  If I submit the simulation on
> the second node and run only on that node, I again get exactly the
> processes/threads that I specify.  It's only when I run on multiple nodes
> that I don't get the number of processes/threads that I specify.  Is there
> something I am doing wrong?  I am using OpenMPI.
>
> Thanks for any help, Tony...
>
> Relevant data is:
>
>
>    1. RunScript:
>
>    #!/bin/sh
>
>    # This runscript is used internally by simfactory as a template during the
>    # sim setup and sim setup-silent commands
>    # Edit at your own risk
>
>    echo "Preparing:"
>    set -x                          # Output commands
>    set -e                          # Abort on errors
>
>    cd @RUNDIR@-active
>
>    echo "Checking:"
>    pwd
>    hostname
>    date
>
>    echo "Environment:"
>    export CACTUS_NUM_PROCS=@NUM_PROCS@
>    export CACTUS_NUM_THREADS=@NUM_THREADS@
>    export GMON_OUT_PREFIX=gmon.out
>    export OMP_NUM_THREADS=@NUM_THREADS@
>    env | sort > SIMFACTORY/ENVIRONMENT
>
>    echo "Starting:"
>    export CACTUS_STARTTIME=$(date +%s)
>
>    if [ ${CACTUS_NUM_PROCS} = 1 ]; then
>        if [ @RUNDEBUG@ -eq 0 ]; then
>         @EXECUTABLE@ -L 3 @PARFILE@
>        else
>         gdb --args @EXECUTABLE@ -L 3 @PARFILE@
>        fi
>    else
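>        # NOTE: OpenMPI's mpirun may not export locally set environment
>        # variables (such as OMP_NUM_THREADS above) to processes started on
>        # remote hosts; adding "-x OMP_NUM_THREADS" here would likely be
>        # needed (see the reply above).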
>        mpirun --hostfile /home/mpiuser/mpi-hosts -np @NUM_PROCS@ \
>            @EXECUTABLE@ -L 3 @PARFILE@
>    fi
>
>    echo "Stopping:"
>    date
>    echo "Done."
>
>    2. mpi-hosts file:
>
>    localhost slots=1
>    RZNode2 slots=1
>
>    3. simfactory submit command: ./simfactory/bin/sim submit BBHMedRes
>       --parfile=par/BBHMedRes.par --procs=2 --num-smt=1 --num-threads=1
>       --ppn-used=1 --ppn=1 --walltime=99:0:0 | cat
>
>    4. Machine file on first node (RZNode1):
>
>
>    [RZNode1]
>
>    # This machine description file is used internally by simfactory as a
>    # template during the sim setup and sim setup-silent commands
>    # Edit at your own risk
>    # Machine description
>    nickname        = RZNode1
>    name            = RZNode1
>    location        = somewhere
>    description     = Whatever
>    status          = personal
>
>    # Access to this machine
>    hostname        = RZNode1
>    aliaspattern    = ^generic\.some\.where$
>
>    # Source tree management
>    sourcebasedir   = /home/Cactus
>    optionlist      = generic.cfg
>    submitscript    = generic.sub
>    runscript       = generic.run
>    make            = make -j@MAKEJOBS@
>    basedir         = /home/mpiuser/simulations
>    ppn             = 1   # was 16
>    max-num-threads = 1   # was 16
>    num-threads     = 1   # was 16
>    nodes           = 2
>    submit          = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
>    getstatus       = ps @JOB_ID@
>    stop            = kill @JOB_ID@
>    submitpattern   = (.*)
>    statuspattern   = "^ *@JOB_ID@ "
>    queuedpattern   = $^
>    runningpattern  = ^
>    holdingpattern  = $^
>    exechost        = echo localhost
>    exechostpattern = (.*)
>    stdout          = cat @SIMULATION_NAME@.out
>    stderr          = cat @SIMULATION_NAME@.err
>    stdout-follow   = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err
>
>    5. Machine file on second node (RZNode2):
>
>    [RZNode2]
>
>    # This machine description file is used internally by simfactory as a
>    # template during the sim setup and sim setup-silent commands
>    # Edit at your own risk
>    # Machine description
>    nickname        = RZNode2
>    name            = RZNode2
>    location        = somewhere
>    description     = Whatever
>    status          = personal
>
>    # Access to this machine
>    hostname        = RZNode2
>    aliaspattern    = ^generic\.some\.where$
>
>    # Source tree management
>    sourcebasedir   = /home/ET_2019_10
>    optionlist      = generic.cfg
>    submitscript    = generic.sub
>    runscript       = generic.run
>    make            = make -j@MAKEJOBS@
>    basedir         = /home/mpiuser/simulations
>    ppn             = 1
>    max-num-threads = 1
>    num-threads     = 1
>    nodes           = 1
>    submit          = exec nohup @SCRIPTFILE@ < /dev/null > @RUNDIR@/@SIMULATION_NAME@.out 2> @RUNDIR@/@SIMULATION_NAME@.err & echo $!
>    getstatus       = ps @JOB_ID@
>    stop            = kill @JOB_ID@
>    submitpattern   = (.*)
>    statuspattern   = "^ *@JOB_ID@ "
>    queuedpattern   = $^
>    runningpattern  = ^
>    holdingpattern  = $^
>    exechost        = echo localhost
>    exechostpattern = (.*)
>    stdout          = cat @SIMULATION_NAME@.out
>    stderr          = cat @SIMULATION_NAME@.err
>    stdout-follow   = tail -n 100 -f @SIMULATION_NAME@.out @SIMULATION_NAME@.err
>
>
>
>
> *Anthony Shoup* PhD, Senior Lecturer
> College of Arts & Sciences, College of Engineering Departments of
> Physics, Astronomy, EEIC
> 315 Science Bldg. | 4250 Campus Dr. Lima, OH 45807
> 419-995-8018 Office | 419-516-2257 Mobile
> *shoup.31 at osu.edu*
> *osu.edu*
> _______________________________________________
> Users mailing list
> Users at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users
>
-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/