<div dir="ltr">Hi Erik,<div><br></div><div>> You mention that mvapich has the best performance. Is there any reason<br>> to use any other MPI implementation?<br></div><div><br></div><div>The version of mvapich required is the latest one. Among the two AMD</div><div>clusters I have access to, only Expanse has this version and it is not even </div><div>the default one. </div><div><br></div><div>The performance with OpenMPI is the exactly the same on one node,<br></div><div>but worse on two nodes.</div><div><br></div><div>> Did you check that mvapich is configured correctly? Does it use the<br>> network efficiently?<br><br>How do I do this? Is it on my end, or on the system's end?<br><br>> You need to use SystemTopology, or ensure otherwise that the way<br>> threads and processes are mapped to hardware is reasonable.<br></div><div><br></div><div>I am using SystemTopology.</div><div><br></div><div>> What is the ratio of ghost/buffer to actually evolved grid points in your setup?<br></div><div><br></div><div>Is there a quick way to find this out? I am using 14 refinement levels, so</div><div>I bet I have a lot of buffer zones. (However, the domain is big.)</div><div><br></div><div>> If MPI performance is slow, then the usual way out is to use OpenMP.<br>> You implied using 4 threads per process; did you try using 8 threads<br>> per process or more? This will also reduce memory consumption since<br>> there are fewer ghost zones. Unfortunately, OpenMP multi-threading in<br>> Carpet is not as efficient as it could be. CarpetX is much better in<br>> this respect.<br></div><div><br></div><div>I found that on one node, using 4 threads is noticeably faster than</div><div>anything else. The admins also recommended this configuration, as it</div><div>maps nicely to the hardware. I haven't tried on multiple nodes, though.</div><div><br></div><div>> Our way of discretizing equations (high-order methods with 3 ghost<br>> zones, AMR with buffer zones), combined with having many evolved<br>> variables, require a lot of storage, and also have a rather high<br>> parallelization overhead. A few ways out (none are production ready)<br>> are:<br>> - Use DGFE instead of finite differences; see e.g. Jonah Miller's PhD<br>> thesis and the respective McLachlan branch<br>> - Avoid buffer zones by using an improved time interpolation scheme<br>> (I've seen papers, I don't know about 3d code)<br>> - Switch to CarpetX to avoid subcycling in time.<br></div><div><br></div><div>The same parameter file scales "acceptably" on Frontera, so I should be</div><div>able to use the same algorithms on Expanse too and obtain some form</div><div>of scaling.</div><div><br></div><div>> If memory usage is high only on a single node, then this is probably<br>> caused by a serial code. Known serial pieces of code are ASCII output,<br>> wave extraction, or the apparent horizon finder. 

> Finally, if OpenMP performance is bad, you can try using only every
> second core and leaving the remainder idle, and see whether this
> helps.

Yes, this is possible, but I would like to try to find better solutions.

> > - Avoid buffer zones by using an improved time interpolation scheme
> > (I've seen papers, I don't know about 3d code)
> Eloisa, Ian, Erik, and I worked on this for a bit quite a while ago. The
> results can be found in branch ianhinder/rkprol:
>
> git clone -b ianhinder/rkprol https://bitbucket.org/cactuscode/cactusnumerical.git
>
> Note that this is actually a change to MoL, not so much Carpet. This
> implementation should work, but is not optimized and most likely still
> has (way) too much communication. Based on two presentations at the ET
> meeting in Stockholm:
>
> https://docs.einsteintoolkit.org/et-docs/ET_Workshop_2015
>
> by Bishop Mongwane and Saran Tunyasuvunakool.

If what I report here is a common problem on new AMD systems, it might
be worth seeing whether this can find its way into master.

Gabriele


On Fri, Aug 27, 2021 at 9:57 AM Erik Schnetter <schnetter@cct.lsu.edu> wrote:

Gabriele

Thanks for your thoughts.

I have some general remarks:

You mention that mvapich has the best performance. Is there any reason
to use any other MPI implementation?

Did you check that mvapich is configured correctly? Does it use the
network efficiently?
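
For example, something along these lines (a rough sketch, not a tuned
benchmark) times large messages between two ranks placed on different
nodes, and the result can be compared with the interconnect's nominal
bandwidth:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1 << 24;   /* 16 MiB messages */
  const int reps = 100;
  char *buf = malloc(n);

  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int i = 0; i < reps; ++i) {
    if (rank == 0) {
      MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
  }
  double t1 = MPI_Wtime();

  /* 2*reps messages of n bytes crossed the network in (t1-t0) seconds. */
  if (rank == 0)
    printf("bandwidth: %g GB/s\n", 2.0 * reps * n / (t1 - t0) / 1e9);

  free(buf);
  MPI_Finalize();
  return 0;
}

The OSU micro-benchmarks (osu_bw, osu_latency), which come from the same
group that develops mvapich, measure the same thing more carefully.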

You need to use SystemTopology, or ensure otherwise that the way
threads and processes are mapped to hardware is reasonable.
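
A quick way to verify the actual placement (a Linux-specific sketch,
not part of any thorn) is to have every thread of every rank report
the core it runs on, and check that ranks and threads do not pile up
on the same cores:

/* build with: mpicc -fopenmp placement.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char host[256];
  gethostname(host, sizeof host);

  /* Each OpenMP thread prints the core it is currently running on. */
  #pragma omp parallel
  {
    printf("host %s  rank %d  thread %d  ->  core %d\n",
           host, rank, omp_get_thread_num(), sched_getcpu());
  }

  MPI_Finalize();
  return 0;
}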

What is the ratio of ghost/buffer to actually evolved grid points in your setup?
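
As a rough illustration with made-up numbers (not taken from your setup):
a block of 40^3 evolved points with 3 ghost zones on every face stores
46^3 points, so about a third of its storage and synchronization volume
is ghost data, and smaller blocks are proportionally worse; buffer zones
at refinement boundaries add to this.

/* Toy estimate of ghost-zone overhead for a cubic block of N^3 evolved
 * points with G ghost zones per face (illustrative numbers only). */
#include <stdio.h>

int main(void) {
  const int G = 3;   /* ghost zones per face */
  for (int N = 20; N <= 80; N += 20) {
    double total   = (double)(N + 2*G) * (N + 2*G) * (N + 2*G);
    double evolved = (double)N * N * N;
    printf("N = %2d: ghost fraction = %4.1f %%\n",
           N, 100.0 * (1.0 - evolved / total));
  }
  return 0;
}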

If MPI performance is slow, then the usual way out is to use OpenMP.
You implied using 4 threads per process; did you try using 8 threads
per process or more? This will also reduce memory consumption since
there are fewer ghost zones. Unfortunately, OpenMP multi-threading in
Carpet is not as efficient as it could be. CarpetX is much better in
this respect.

Our way of discretizing equations (high-order methods with 3 ghost
zones, AMR with buffer zones), combined with having many evolved
variables, requires a lot of storage and also has a rather high
parallelization overhead. A few ways out (none are production-ready)
are:
- Use DGFE instead of finite differences; see e.g. Jonah Miller's PhD
thesis and the respective McLachlan branch
- Avoid buffer zones by using an improved time interpolation scheme
(I've seen papers, I don't know about 3d code)
- Switch to CarpetX to avoid subcycling in time.

If memory usage is high only on a single node, then this is probably
caused by a serial code. Known serial pieces of code are ASCII output,
wave extraction, or the apparent horizon finder. Try disabling these
to see which (if any) is the culprit.

Finally, if OpenMP performance is bad, you can try using only every
second core and leaving the remainder idle, and see whether this
helps.

-erik

On Fri, Aug 27, 2021 at 12:45 PM Gabriele Bozzola
<bozzola.gabriele@gmail.com> wrote:
>
> Hello,
>
> Last week I opened a PR to add the configuration files
> for Expanse to simfactory. Expanse is an example of
> the new generation of AMD supercomputers. Others are
> Anvil, one of the other new XSEDE machines, and Puma,
> the newest cluster at The University of Arizona.
>
> I have some experience with Puma and Expanse and
> I would like to share some thoughts, some of which come
> from interacting with the admins of Expanse. The problem
> is that I am finding terrible multi-node performance on both
> these machines, and I don't know if this will be a common
> thread among new AMD clusters.
>
> These supercomputers have similar characteristics.
>
> First, they have very high cores/node count (typically
> 128/node) but low memory per core (typically 2 GB / core).
> In these conditions, it is very easy to have a job killed by
> the OOM daemon. My suspicion is that it is rank 0 that
> goes out of memory, and the entire run is aborted.
>
> Second, depending on the MPI implementation, MPI collective
> operations can be extremely expensive. I was told that
> the best implementation is mvapich 2.3.6 (at the moment).
> This seems to be due to the high core count.
>
> I found that the code does not scale well. This is possibly
> related to the previous point. If your job can fit on a single node,
> it will run wonderfully. However, when you perform the same
> simulation on two nodes, the code will actually be slower.
> This indicates that there's no strong scaling at all from
> 1 node to 2 (128 to 256 cores, or 32 to 64 MPI ranks).
> Using mvapich 2.3.6 improves the situation, but it is still
> faster to use fewer nodes.
>
> (My benchmark is a par file I've tested extensively on Frontera)
>
> I am working with Expanse's support staff to see what we can
> do, but I wonder if anyone has had a positive experience with
> this architecture and has some tips to share.
>
> Gabriele
>
> _______________________________________________
> Users mailing list
> Users@einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users


-- 
Erik Schnetter <schnetter@cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/