<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Hi Roland and Peter,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
No worries, thank you for the follow-up! I added the memory directive and received the same segfault error (and the backtrace looks the same as well). Each node on Quartz has 512GB for 128 cores, so I requested 4GB per core. For example, a submission with
10 cores (2 processes with 5 threads each) had:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<i>#SBATCH --mem=40GB</i><br>
</div>
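<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
For reference, this is a sketch of the kind of resource-request header I mean for that 10-core case (2 MPI ranks with 5 OpenMP threads each, 4 GB per core); the executable name in the srun line is a placeholder rather than copied verbatim from my script:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<div><i># 2 MPI ranks x 5 OpenMP threads = 10 cores, at 4 GB per core</i></div>
<div><i>#SBATCH --nodes=1</i></div>
<div><i>#SBATCH --ntasks=2</i></div>
<div><i>#SBATCH --cpus-per-task=5</i></div>
<div><i>#SBATCH --mem=40GB</i></div>
<div><i><br>
</i></div>
<div><i># match the OpenMP thread count to the cores allocated per rank</i></div>
<div><i>export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK</i></div>
<div><i># launch the 2 ranks (executable name is a placeholder)</i></div>
<div><i>srun -n 2 ./cactus_sim static_tov.par</i></div>
</div>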
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
I repeated the attempt with a different setup but the same memory scaling of 4 GB per core, namely 2 processes with 1 thread each (so 8 GB of memory requested), and it failed the same way. I also tried using an entire node (8 processes with 16 threads each) and all of its memory (--mem=0), and that failed as well. <br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
I'm using the default <i>static_tov.par</i>, no changes:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<i><br>
</i></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<i>IO::out_dir = $parfile</i></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
<div><i>IOScalar::outScalar_every = 32</i></div>
<div><i>IOScalar::one_file_per_group = yes</i></div>
<div><i>IOScalar::outScalar_vars = "</i></div>
<div><i> HydroBase::rho</i></div>
<div><i> HydroBase::press</i></div>
<div><i> HydroBase::eps</i></div>
<div><i> HydroBase::vel</i></div>
<div><i> ADMBase::lapse</i></div>
<div><i> ADMBase::metric</i></div>
<div><i> ADMBase::curv</i></div>
<div><i> ML_ADMConstraints::ML_Ham</i></div>
<div><i> ML_ADMConstraints::ML_mom</i></div>
<div><i><br>
</i></div>
<div><i>"</i></div>
</div>
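<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Nothing in there looks strange to me. As a next test (my assumption being that an empty variable list is a fair way to switch this thorn's output off), I was planning to rerun with the scalar output disabled to see whether the reduction segfault goes away, i.e.:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<div><i># diagnostic only: request no scalar variables, so CarpetIOScalar has nothing to reduce</i></div>
<div><i>IOScalar::outScalar_vars = ""</i></div>
</div>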
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Thank you,</div>
<div class="elementToProof">Jessica
<div id="Signature">
<div>
<div name="divtagdefaultwrapper" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:; margin:0">
<b><br>
</b></div>
<div name="divtagdefaultwrapper" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:; margin:0">
<b>Dr. Jessica S. Warren</b> </div>
<div name="divtagdefaultwrapper" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:; margin:0">
<div>Physics Lecturer</div>
<div>Indiana University Northwest</div>
<div>warrenjs@iun.edu</div>
</div>
<div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0);">
<br>
<hr tabindex="-1" style="display:inline-block; width:98%;">
<b>From:</b> Roland Haas<br>
<b>Sent:</b> Thursday, August 18, 2022 3:36 PM<br>
<b>To:</b> Warren, Jessica Sawyer<br>
<b>Cc:</b> users@einsteintoolkit.org<br>
<b>Subject:</b> Re: [Users] [External] Re: Running with SLURM
<div><br>
</div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">Hello Jessica,<br>
<br>
Sorry for the delay in responding.<br>
<br>
We discussed your issue during today's weekly Einstein Toolkit call<br>
(<a href="http://lists.einsteintoolkit.org/pipermail/users/2022-August/008660.html" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable">http://lists.einsteintoolkit.org/pipermail/users/2022-August/008660.html</a>)<br>
and there were some suggestions.<br>
<br>
The error you see is somewhat puzzling. The segfault happens in a MPI<br>
call that is part of the Carpet Reduction call (during output).<br>
<br>
The puzzling thing is that this is not the first MPI call in this run<br>
(there are some much earlier), nor the first reduction (e.g. there are<br>
reductions for the min/max output that you saw on screen).<br>
<br>
One suggestion was mentioned:<br>
<br>
* Peter Diener noted that he had seen issues on one cluster with errors<br>
and segfaults later in the run, where he needed to explicitly pass<br>
"#SBATCH --mem=XGB" to sbatch to request that the memory be available.<br>
<br>
From the fact that you can see output to screen but the failure is in a<br>
reduction, it sounds to me like the issue is somehow encountered while<br>
executing code in the CarpetIOScalar thorn (IOBasic writes to screen,<br>
IOScalar writes to disk). Are you passing any "strange" options or special<br>
variables to its outScalar_vars option?<br>
<br>
Yours,<br>
Roland<br>
<br>
> Hi Roland,<br>
> <br>
> The admins reinstalled openmpi and it now runs the hello script<br>
> correctly. However, the Toolkit would still produce seg faults after<br>
> srun. Switching to mvapich seems to have largely done the trick<br>
> though, as the TOV job is now able to start executing. As long as<br>
> there is only 1 MPI process (with however many threads), the TOV job<br>
> runs to completion correctly. However, anytime there are multiple<br>
> MPI processes, it crashes at the first time iteration:<br>
> <br>
> INFO (TOVSolver): Done interpolation.<br>
> ---------------------------------------------------------------------------<br>
> Iteration      Time |    ADMBASE::alp           |  HYDROBASE::rho<br>
>                     |    minimum      maximum   |  minimum       maximum<br>
> ---------------------------------------------------------------------------<br>
>         0     0.000 |  0.6698612    0.9966374   |  1.000000e-10  0.0012800<br>
> Rank 1 with PID 3964893 received signal 11<br>
> Writing backtrace to static_tov/backtrace.1.txt<br>
> srun: error: c40: task 1: Segmentation fault (core dumped)<br>
> <br>
> The backtrace is attached, as well as the last portion of the output,<br>
> and it looks like the issue is tied to Carpet. Are there some<br>
> settings in the parameter file that need adjusting or setting to fix<br>
> this? Or perhaps specific settings for the number of ranks and<br>
> threads?<br>
> <br>
> Thank you,<br>
> Jessica<br>
> <br>
> <br>
> Dr. Jessica S. Warren<br>
> Physics Lecturer<br>
> Indiana University Northwest<br>
> warrenjs@iun.edu<br>
> <br>
> ________________________________<br>
> From: Roland Haas<br>
> Sent: Thursday, August 11, 2022 8:32 AM<br>
> To: Warren, Jessica Sawyer<br>
> Cc: users@einsteintoolkit.org<br>
> Subject: Re: [Users] [External] Re: Running with SLURM<br>
> <br>
> Hello Jessica,<br>
> <br>
> If you get the same error from hello-world and from Cactus then it<br>
> would seem that there is still something off with the MPI stack.<br>
> <br>
> The -lmpi_cxx option instructs the linker to link in the C++ bindings<br>
> for MPI; for just the hello world example, which is C code, this is not<br>
> required and -lmpi alone is sufficient.<br>
> <br>
> I would see two options that would let you get running somewhat<br>
> quickly:<br>
> <br>
> 1. report your issues with OpenMPI and hello-world (including link to<br>
> the source code on the web, and the exact command line to compile) to<br>
> the admins and ask them for help<br>
> <br>
> 1.5 instead of using gcc to compile for OpenMPI, use the official MPI<br>
> compiler wrapper mpicc, which would just be:<br>
> <br>
> mpicc -o hello hello.c<br>
> <br>
> that is, you do not have to pass any library or include options. If<br>
> this fails, I would definitely talk to the admins.<br>
> <br>
> 2. compile hello-world using mvapich. For this, the easiest way is to<br>
> make sure to load the mvapich module and then use the same compiler<br>
> wrapper invocation to compile:<br>
> <br>
> mpicc -o hello hello.c<br>
> <br>
> If 2 works then you can also compile the Einstein Toolkit with<br>
> mvapich. You have to make sure to load the correct module before<br>
> compiling the toolkit and then ExternalLibraries/MPI should figure<br>
> out (from the mpicc wrapper) how to compile the toolkit.<br>
> <br>
> Yours,<br>
> Roland<br>
> <br>
> <br>
> > Hi Roland,<br>
> ><br>
> > Thank you so much. The compute nodes can be used for compilation,<br>
> > and the directories match what is listed in make.MPI.defn. When<br>
> > compiling the 'hello' example you linked to, it failed with a linker<br>
> > error (/usr/bin/ld: cannot find -lmpi_cxx). I re-ran it in verbose<br>
> > mode and found that the directory it was searching did exist and did<br>
> > have libmpi but not libmpi_cxx. The admins said they had had some<br>
> > issues installing openmpi (couldn't recall exactly what), and<br>
> > recommended mvapich (since that does have its C++ bindings library<br>
> > installed and is their preferred implementation). However, they<br>
> > reinstalled openmpi in an effort to get it to work, and it did allow<br>
> > the 'hello' script to compile, but when executed it produced:<br>
> ><br>
> > --------------------------------------------------------------------------<br>
> > No OpenFabrics connection schemes reported that they were able to be<br>
> > used on a specific port. As such, the openib BTL (OpenFabrics<br>
> > support) will be disabled for this port.<br>
> ><br>
> > Local host: h1<br>
> > Local device: mlx5_0<br>
> > Local port: 1<br>
> > CPCs attempted: rdmacm, udcm<br>
> > --------------------------------------------------------------------------<br>
> > Hello world from processor h1.quartz.uits.iu.edu, rank 0 out of 1<br>
> > processors<br>
> ><br>
> > Similarly, doing the TOV job via sbatch, after the srun command it<br>
> > gave the same OpenFabrics message (for each MPI rank) and then the<br>
> > same segmentation faults as before. I've contacted the admins about<br>
> > this and am waiting to hear back. Do you have any recommendations -<br>
> > perhaps it would be easier to try switching over to mvapich? If so,<br>
> > could you point me to some resources on how to reconfigure?<br>
> ><br>
> > Thank you,<br>
> > Jessica<br>
> ><br>
> > Dr. Jessica S. Warren<br>
> > Physics Lecturer<br>
> > Indiana University Northwest<br>
> > warrenjs@iun.edu<br>
> > ________________________________<br>
> > From: Roland Haas <rhaas@illinois.edu><br>
> > Sent: Tuesday, August 9, 2022 9:48 AM<br>
> > To: Warren, Jessica Sawyer <warrenjs@iun.edu><br>
> > Cc: users@einsteintoolkit.org <users@einsteintoolkit.org><br>
> > Subject: [External] Re: [Users] Running with SLURM<br>
> ><br>
> > Hello Jessica,<br>
> ><br>
> > You may also find something useful in the setting up a new machine<br>
> > seminar presentation:<br>
> ><br>
> > <a href="https://www.einsteintoolkit.org/seminars/2022_02_24/index.html" target="_blank" rel="noopener noreferrer">https://www.einsteintoolkit.org/seminars/2022_02_24/index.html</a><br>
> ><br>
> > Yours,<br>
> > Roland<br>
> ><br>
> > --<br>
> > My email is as private as my paper mail. I therefore support<br>
> > encrypting and signing email messages. Get my PGP key from<br>
> > <a href="http://pgp.mit.edu" target="_blank" rel="noopener noreferrer">http://pgp.mit.edu</a><br>
> > . <br>
> <br>
> <br>
> --<br>
> My email is as private as my paper mail. I therefore support<br>
> encrypting and signing email messages. Get my PGP key from<br>
> <a href="http://pgp.mit.edu" target="_blank" rel="noopener noreferrer">http://pgp.mit.edu</a><br>
> .<br>
<br>
<br>
-- <br>
My email is as private as my paper mail. I therefore support encrypting<br>
and signing email messages. Get my PGP key from <a href="http://pgp.mit.edu" target="_blank" rel="noopener noreferrer" data-auth="NotApplicable">
http://pgp.mit.edu</a> .<br>
</div>
</span></font></div>
</div>
</div>
</div>
</div>
</body>
</html>