[Users] [External] Re: Running with SLURM

Warren, Jessica Sawyer warrenjs at iun.edu
Fri Aug 19 16:42:18 CDT 2022


Hi Roland and Peter,

No worries, thank you for the follow-up!  I added the memory directive and received the same segfault error (and the backtrace looks the same as well).  Each node on Quartz has 512GB for 128 cores, so I requested 4GB per core.  For example, a submission with 10 cores (2 processes with 5 threads each) had:

#SBATCH --mem=40GB

I repeated the attempt with different setups but the same memory scaling of 4 GB per core, such as 2 processes with 1 thread each (so 8GB memory requested), and it failed the same way.  I also tried using an entire node (8 processes with 16 threads each) and all its memory (--mem=0), and that failed as well.
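For reference, the arithmetic behind these requests can be sketched as below. This is only an illustration, not the actual submission script: slurm_directives is a hypothetical helper name, the 4 GB per core default comes from the 512GB / 128 cores figure above, and since --mem is a per-node limit the sketch assumes that all processes are placed on a single node.

```shell
#!/bin/bash
# Sketch: derive SLURM directives from an MPI process / thread layout.
# Assumption: all tasks run on one node, so the per-node --mem value
# is simply (processes * threads * GB per core).
slurm_directives() {
    local procs=$1          # number of MPI processes
    local threads=$2        # threads per MPI process
    local gb_per_core=${3:-4}   # default: 512 GB / 128 cores = 4 GB
    local cores=$(( procs * threads ))
    echo "#SBATCH --nodes=1"
    echo "#SBATCH --ntasks=${procs}"
    echo "#SBATCH --cpus-per-task=${threads}"
    echo "#SBATCH --mem=$(( cores * gb_per_core ))GB"
}

slurm_directives 2 5    # 10 cores, as in the 40GB example above
slurm_directives 2 1    # 2 cores, as in the 8GB example above
```

The whole-node attempt is the special case where --mem=0 is used instead of a computed value.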

I'm using the default static_tov.par, no changes:

IO::out_dir      = $parfile

IOScalar::outScalar_every = 32
IOScalar::one_file_per_group = yes
IOScalar::outScalar_vars  = "
 HydroBase::rho
 HydroBase::press
 HydroBase::eps
 HydroBase::vel
 ADMBase::lapse
 ADMBase::metric
 ADMBase::curv
 ML_ADMConstraints::ML_Ham
 ML_ADMConstraints::ML_mom

"

Thank you,
Jessica

Dr. Jessica S. Warren
Physics Lecturer
Indiana University Northwest
warrenjs at iun.edu

________________________________
From: Roland Haas
Sent: Thursday, August 18, 2022 3:36 PM
To: Warren, Jessica Sawyer
Cc: users at einsteintoolkit.org
Subject: Re: [Users] [External] Re: Running with SLURM

Hello Jessica,

Sorry for the delay in responding.

We discussed your issue during today's weekly Einstein Toolkit call
(http://lists.einsteintoolkit.org/pipermail/users/2022-August/008660.html)
and there were some suggestions.

The error you see is somewhat puzzling. The segfault happens in an MPI
call that is part of the Carpet Reduction call (during output).

The puzzling thing is that this is not the first MPI call in this run
(there are some much earlier) nor the first reduction (e.g. there are
reductions for the min/max output that you saw on screen).

There was one suggestion that was mentioned:

* Peter Diener noted that he had seen issues on one cluster with errors
  and segfaults later in the run, where he needed to explicitly pass a
  "#SBATCH --mem=XGB" to sbatch to request that memory is available

From the fact that you can see output to screen but the failure is in a
reduction, it sounds to me like the issue is somehow encountered while
executing code in the CarpetIOScalar thorn (IOBasic is to screen,
IOScalar is to disk). Are you passing any "strange" options or special
variables to its outScalar_vars option?
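
One way to check this is to pull the IOScalar settings out of the parameter file. A minimal sketch, assuming a POSIX awk and that a multi-line string value is closed by a later line containing a double quote; show_ioscalar is a hypothetical helper name, not part of the toolkit:

```shell
#!/bin/bash
# Sketch: print every IOScalar:: parameter from a Cactus parameter file,
# including a multi-line quoted value such as outScalar_vars.
show_ioscalar() {
    awk '
        /^[ \t]*IOScalar::/ {
            print
            # gsub here only counts the double quotes on the line;
            # exactly one means a string was opened but not closed
            n = gsub(/"/, "\"")
            in_str = (n == 1)
            next
        }
        in_str {
            print
            # the first later line containing a quote closes the string
            if (index($0, "\"")) in_str = 0
        }
    ' "$1"
}

# Usage (the file name is whatever parameter file is being run):
#   show_ioscalar static_tov.par
```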

Yours,
Roland

> Hi Roland,
>
> The admins reinstalled openmpi and it now runs the hello script
> correctly.  However, the Toolkit would still produce seg faults after
> srun.  Switching to mvapich seems to have largely done the trick
> though, as the TOV job is now able to start executing.  As long as
> there is only 1 MPI process (with however many threads), the TOV job
> runs to completion correctly.  However, anytime there are multiple
> MPI processes, it crashes at the first time iteration:
>
> INFO (TOVSolver): Done interpolation.
> ---------------------------------------------------------------------------
> Iteration      Time |              ADMBASE::alp |            HYDROBASE::rho
>                     |      minimum      maximum |      minimum      maximum
> ---------------------------------------------------------------------------
>         0     0.000 |    0.6698612    0.9966374 | 1.000000e-10    0.0012800
> Rank 1 with PID 3964893 received signal 11
> Writing backtrace to static_tov/backtrace.1.txt
> srun: error: c40: task 1: Segmentation fault (core dumped)
>
> The backtrace is attached, as well as the last portion of the output,
> and it looks like the issue is tied to Carpet.  Are there some
> settings in the parameter file that need adjusting or setting to fix
> this?  Or perhaps specific settings for the number of ranks and
> threads?
>
> Thank you,
> Jessica
>
>
> Dr. Jessica S. Warren
> Physics Lecturer
> Indiana University Northwest
> warrenjs at iun.edu
>
> ________________________________
> From: Roland Haas
> Sent: Thursday, August 11, 2022 8:32 AM
> To: Warren, Jessica Sawyer
> Cc: users at einsteintoolkit.org
> Subject: Re: [Users] [External] Re: Running with SLURM
>
> Hello Jessica,
>
> If you get the same error from hello-world and from Cactus then it
> would seem that there is still something off with the MPI stack.
>
> The -lmpi_cxx option instructs the linker to link in C++ bindings for
> MPI though for just the hello world example, it being C code, this is
> not required and -lmpi alone is sufficient.
>
> I would see two options that would let you get running somewhat
> quickly:
>
> 1. report your issues with OpenMPI and hello-world (including link to
> the source code on the web, and the exact command line to compile) to
> the admins and ask them for help
>
> 1.5 instead of using gcc to compile for OpenMPI, use the official MPI
> compiler wrapper mpicc, which would just be:
>
> mpicc -o hello hello.c
>
> that is, you do not have to pass any library or include options. If
> this fails, I would definitely talk to the admins.
>
> 2. compile hello-world using mvapich. For this the easiest way is to
> make sure to load the mvapich module and then use the same compiler
> wrapper invocation to compile:
>
> mpicc -o hello hello.c
>
> If 2 works then you can also compile the Einstein Toolkit with
> mvapich. You have to make sure to load the correct module before
> compiling the toolkit and then ExternalLibraries/MPI should figure
> out (from the mpicc wrapper) how to compile the toolkit.
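
Since both MPI stacks provide a wrapper called mpicc, it can help to confirm which one is actually picked up after loading a module. A minimal sketch, assuming that Open MPI wrappers accept --showme and MPICH-derived wrappers such as mvapich accept -show; describe_wrapper is a hypothetical helper name, and the module name in the comment is an assumption to be checked against "module avail":

```shell
#!/bin/bash
# Sketch: report where an MPI compiler wrapper lives and what it would run.
# module load mvapich    # hypothetical module name, site specific
describe_wrapper() {
    local wrapper=${1:-mpicc}
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper: not found"
        return 1
    fi
    echo "$wrapper: $(command -v "$wrapper")"
    # Print the underlying compiler command line: try the Open MPI
    # spelling first, then the MPICH / mvapich spelling.
    "$wrapper" --showme 2>/dev/null || "$wrapper" -show 2>/dev/null
}

describe_wrapper mpicc || true
```

The printed include and library paths show which MPI installation the hello example, and later the toolkit build, would be linked against.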
>
> Yours,
> Roland
>
>
> > Hi Roland,
> >
> > Thank you so much.  The compute nodes are able to be used for
> > compilation, and the directories match what is listed in
> > make.MPI.defn.  When doing the 'hello' example you linked to, it was
> > unable to compile due to a linker error (/usr/bin/ld: cannot find
> > -lmpi_cxx).  I re-ran it in verbose mode and found the directory it
> > was searching did exist and did have lmpi but not lmpi_cxx.  The
> > admins said they had had some issues installing openmpi (couldn't
> > recall exactly what), and recommended mvapich (since that does have
> > lmpicxx installed and is their preferred implementation).  However,
> > they reinstalled openmpi in an effort to get that to work and it did
> > allow the 'hello' script to compile, but when executed it produced:
> >
> > --------------------------------------------------------------------------
> > No OpenFabrics connection schemes reported that they were able to be
> > used on a specific port.  As such, the openib BTL (OpenFabrics
> > support) will be disabled for this port.
> >
> >   Local host:           h1
> >   Local device:         mlx5_0
> >   Local port:           1
> >   CPCs attempted:       rdmacm, udcm
> > --------------------------------------------------------------------------
> > Hello world from processor h1.quartz.uits.iu.edu, rank 0 out of 1
> > processors
> >
> > Similarly, doing the TOV job via sbatch, after the srun command it
> > gave the same OpenFabrics message (for each MPI rank) and then the
> > same segmentation faults as before.  I've contacted the admins about
> > this and am waiting to hear back.  Do you have any recommendations -
> > perhaps it would be easier to try switching over to mvapich?  If so,
> > could you point me to some resources on how to reconfigure?
> >
> > Thank you,
> > Jessica
> >
> > Dr. Jessica S. Warren
> > Physics Lecturer
> > Indiana University Northwest
> > warrenjs at iun.edu
> > ________________________________
> > From: Roland Haas <rhaas at illinois.edu>
> > Sent: Tuesday, August 9, 2022 9:48 AM
> > To: Warren, Jessica Sawyer <warrenjs at iun.edu>
> > Cc: users at einsteintoolkit.org <users at einsteintoolkit.org>
> > Subject: [External] Re: [Users] Running with SLURM
> >
> > Hello Jessica,
> >
> > You may also find something useful in the setting up a new machine
> > seminar presentation:
> >
> > https://www.einsteintoolkit.org/seminars/2022_02_24/index.html
> >
> > Yours,
> > Roland
> >
> > --
> > My email is as private as my paper mail. I therefore support
> > encrypting and signing email messages. Get my PGP key from
> > http://pgp.mit.edu .
>
>
> --
> My email is as private as my paper mail. I therefore support
> encrypting and signing email messages. Get my PGP key from
> http://pgp.mit.edu .


--
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .