<div dir="ltr"><div>Dear Sir,</div><div><br></div><div><div>I am a little worried about the performance because this is a new cluster we have and it is supposed to be performing well. I am inclined to think that some libraries/compiler <span class="gmail-gr_ gmail-gr_1804 gmail-gr-alert gmail-gr_gramm gmail-gr_inline_cards gmail-gr_run_anim gmail-Style gmail-multiReplace" id="gmail-1804">options / settings</span> might be the bottleneck.<span class="gmail-gr_ gmail-gr_1904 gmail-gr-alert gmail-gr_gramm gmail-gr_inline_cards gmail-gr_run_anim gmail-Punctuation gmail-multiReplace" id="gmail-1904"></span></div><div><span class="gmail-gr_ gmail-gr_1904 gmail-gr-alert gmail-gr_gramm gmail-gr_inline_cards gmail-gr_run_anim gmail-Punctuation gmail-multiReplace" id="gmail-1904"><br></span></div><div><span class="gmail-gr_ gmail-gr_1904 gmail-gr-alert gmail-gr_gramm gmail-gr_inline_cards gmail-gr_run_anim gmail-Punctuation gmail-multiReplace" id="gmail-1904"><br></span></div></div><div>I am presently running two simulations, both using the same parameter file GW150914.rpar. <br></div><div><br></div><div>The first one is using mpich-3.3.1, the same as in the simulation mentioned in the previous thread. I am using one node consisting of 2*16 cores, and 32 mpiprocs. <br></div><div><br></div><div>The second one is using openmpi-3.1.2 with openmp. It uses 128 procs, distributed among 16 mpiprocs and 8 openmp threads per mpiproc. Since I have 32 PPN, it is launching 4 mpiprocs per node.</div><div><br></div><div>I am herewith attaching the carpet-timing..asc file from both these runs. </div><div><br></div><div>Thanking you <br></div><div><br></div><div>Regards,<br></div><div>Vaishak<br></div><div> <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 4, 2019 at 8:05 PM Haas, Roland <<a href="mailto:rhaas@illinois.edu">rhaas@illinois.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello Vaishak,<br>
<br>
I do not see anything obviously wrong with the setup.<br>
<br>
It uses 128 MPI ranks on the 4 nodes, which fits with there being 2x16<br>
cores per node.<br>
<br>
Looking at the timer tree output at iteration 1024 (search for<br>
"gettimeof " and you will find the spot), out of 5977s spent during<br>
Evolve about 2143s were spent in "syncs", which is communication, and<br>
about the same amount of time in "thorns", which is computation.<br>
While this ratio is not great (spending about as much time sending data<br>
as doing computation), it is also not unheard of.<br>
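<br>
If it helps, the relevant block can be pulled out of the standard output<br>
with something like the sketch below (GW150914.out is just a placeholder<br>
for whatever your stdout file is actually called):<br>
<br>
# find the timer tree sections in the log<br>
grep -n "gettimeof " GW150914.out<br>
# or jump straight to the first one in a pager<br>
less +/gettimeof GW150914.out<br>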
<br>
Getting the original output files for the gallery data from Zenodo<br>
(link is on the gallery page):<br>
<br>
wget <a href="https://zenodo.org/record/155394/files/GW150914_28.tar.xz?download=1" rel="noreferrer" target="_blank">https://zenodo.org/record/155394/files/GW150914_28.tar.xz?download=1</a><br>
<br>
you can see (in GW150914_28/output-0000/GW150914_28.out) that that one<br>
took about 137s for syncs and 198s for thorns, so the same ratio but<br>
about a factor of 10 faster.<br>
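<br>
For reference, assuming the archive was saved under its original name<br>
(wget may keep the "?download=1" suffix in the file name otherwise), the<br>
comparison can be reproduced with something along these lines; GNU tar<br>
picks up the xz compression automatically:<br>
<br>
tar xf GW150914_28.tar.xz<br>
grep -n "gettimeof " GW150914_28/output-0000/GW150914_28.out<br>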
<br>
I am grasping at straws here, but sometimes having too many MPI ranks<br>
can be detrimental if there is not enough work to split up (OpenMP can<br>
be a bit more forgiving in that respect; the original gallery run<br>
used 120 cores on 10 nodes with 6 OpenMP threads per MPI rank).<br>
<br>
Since each node has lots of RAM (more than the 96GB required to run the<br>
simulation), can you try and see what would happen if you were to run<br>
on only a single node?<br>
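<br>
In case it is useful, a single-node test along those lines could look<br>
roughly like the sketch below; the executable name, parameter file name<br>
and mpirun options are placeholders and will depend on your MPI stack<br>
and batch system:<br>
<br>
# 1 node with 2x16 cores: e.g. 4 MPI ranks x 8 OpenMP threads<br>
# (GW150914.par stands for the .par file generated from GW150914.rpar)<br>
export OMP_NUM_THREADS=8<br>
mpirun -np 4 ./exe/cactus_sim GW150914.par<br>
<br>
# or, without OpenMP, 32 MPI ranks on the single node<br>
export OMP_NUM_THREADS=1<br>
mpirun -np 32 ./exe/cactus_sim GW150914.par<br>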
<br>
Also if you could add the parameter:<br>
<br>
Carpet::output_timers_every = 1024<br>
<br>
and then provide the files carpet-timing-statistics*.asc, that would let<br>
us know in even more detail where the time is spent.<br>
<br>
Running for a short time (2048 iterations) is enough to get data to<br>
compare.<br>
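<br>
For completeness, a minimal sketch of what that could look like in the<br>
parameter file; the two Cactus::* lines are only an assumption about how<br>
to cut the test short after 2048 iterations and should be reverted for a<br>
production run:<br>
<br>
Carpet::output_timers_every = 1024<br>
<br>
# stop the timing test after 2048 iterations<br>
Cactus::terminate   = "iteration"<br>
Cactus::cctk_itlast = 2048<br>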
<br>
Yours,<br>
Roland<br>
<br>
> Dear All,<br>
> <br>
> I am running the GW150914 simulation using the parameter file available at<br>
> the ETK gallery (GW150914-ETK gallery<br>
> <<a href="https://einsteintoolkit.org/gallery/bbh/index.html" rel="noreferrer" target="_blank">https://einsteintoolkit.org/gallery/bbh/index.html</a>>), using 128 cores.<br>
> <br>
> Each compute node consists of 2x16 Intel Skylake cores (Intel(R) Xeon(R)<br>
> Gold 6142 CPU @ 2.60GHz) and 384 GB RAM. I have compiled and am running the<br>
> Einstein Toolkit without OpenMP, using mpich-3.3.1.<br>
> <br>
> <br>
> The issue is that the simulation seems to be running at a very slow pace.<br>
> It is completing only about 1.3 units of physical time per hour. At this<br>
> rate it would take about 54 days to complete 1700 units, in contrast to<br>
> 2.8 days on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, as per the<br>
> details of the example run of GW150914 available at the gallery<br>
> (GW150914-ETK gallery<br>
> <<a href="https://einsteintoolkit.org/gallery/bbh/index.html" rel="noreferrer" target="_blank">https://einsteintoolkit.org/gallery/bbh/index.html</a>>).<br>
> <br>
> I have also tried using Intel MPI (impi) but with similar results.<br>
> <br>
> I am also attaching the out file from the simulation.<br>
> <br>
> Looking forward to your inputs.<br>
> <br>
> <br>
> Thanks and regards,<br>
> <br>
> <br>
> <br>
> <br>
> <br>
> Vaishak P<br>
> <br>
> PhD Scholar,<br>
> Shyama Prasad Mukherjee Fellow<br>
> Inter-University Center for Astronomy and Astrophysics (IUCAA)<br>
> Pune, India<br>
<br>
<br>
<br>
-- <br>
My email is as private as my paper mail. I therefore support encrypting<br>
and signing email messages. Get my PGP key from <a href="http://pgp.mit.edu" rel="noreferrer" target="_blank">http://pgp.mit.edu</a> .<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Regards,<br>Vaishak P<br><br>PhD Scholar,<br>Shyama Prasad Mukherjee Fellow<br>
Inter-University Center for Astronomy and Astrophysics (IUCAA)<br>
Pune, India<div><br></div></div></div>