[Users] Einstein Toolkit and modern AMD supercomputer

Gabriele Bozzola bozzola.gabriele at gmail.com
Fri Aug 27 12:18:18 CDT 2021


Hi Erik,

> You mention that mvapich has the best performance. Is there any reason
> to use any other MPI implementation?

The required version of mvapich is the latest one. Of the two AMD
clusters I have access to, only Expanse has this version, and it is not
even the default one.

The performance with OpenMPI is exactly the same on one node,
but worse on two nodes.

> Did you check that mvapich is configured correctly? Does it use the
> network efficiently?

How do I check this? Is it something on my end, or on the system's end?
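
For anyone else hitting this, here is roughly what I plan to try first
(mpiname is an mvapich utility, and the OSU micro-benchmarks ship with
mvapich; $MVAPICH_DIR is a placeholder for the module's install prefix,
which I have not confirmed on Expanse):

    # Show how mvapich was built and configured
    mpiname -a

    # Point-to-point bandwidth between two nodes using the bundled
    # OSU micro-benchmarks
    srun -N 2 -n 2 --ntasks-per-node=1 \
        $MVAPICH_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

    # Allreduce latency at the rank count used in the actual runs
    # (2 nodes x 32 ranks per node)
    srun -N 2 -n 64 \
        $MVAPICH_DIR/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce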

> You need to use SystemTopology, or ensure otherwise that the way
> threads and processes are mapped to hardware is reasonable.

I am using SystemTopology.
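
To double-check that the resulting mapping is really reasonable, I have
been looking at the binding reports (MV2_SHOW_CPU_BINDING is an mvapich
environment variable; the srun flag is plain Slurm; executable and
parameter-file names below are placeholders):

    # Ask mvapich to print each rank's CPU affinity at startup
    export MV2_SHOW_CPU_BINDING=1

    # Or let Slurm report the binding mask it applies to each task
    srun --cpu-bind=verbose,cores ./cactus_sim parfile.par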

> What is the ratio of ghost/buffer to actually evolved grid points in your
> setup?

Is there a quick way to find this out? I am using 14 refinement levels, so
I bet I have a lot of buffer zones. (However, the domain is big.)
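
If there is no better way, my plan was to read it off Carpet's
per-level grid-structure report in the standard output. The grep
pattern below is only my guess at the wording of those INFO lines, and
simulation.out is a placeholder for the run's stdout file:

    # Dump the most recent grid-structure report; comparing "active"
    # points to "owned"/"total" points per level gives a rough idea of
    # how much is buffers and ghosts
    grep -A 40 "Grid structure" simulation.out | tail -n 45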

> If MPI performance is slow, then the usual way out is to use OpenMP.
> You implied using 4 threads per process; did you try using 8 threads
> per process or more? This will also reduce memory consumption since
> there are fewer ghost zones. Unfortunately, OpenMP multi-threading in
> Carpet is not as efficient as it could be. CarpetX is much better in
> this respect.

I found that on one node, using 4 threads is noticeably faster than
anything else. The admins also recommended this configuration, as it
maps nicely to the hardware. I haven't tried on multiple nodes, though.
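
When I do try it on multiple nodes, I will submit something along these
lines with simfactory (simulation and parameter-file names are
placeholders; the numbers assume 128 cores per node on Expanse):

    # 2 nodes x 128 cores, 8 OpenMP threads per MPI rank
    # (32 ranks total instead of 64, so fewer ghost zones)
    ./simfactory/bin/sim submit bbh_8threads \
        --parfile=bbh.par \
        --machine=expanse \
        --procs=256 --num-threads=8 --ppn-used=128 \
        --walltime=24:00:00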

> Our way of discretizing equations (high-order methods with 3 ghost
> zones, AMR with buffer zones), combined with having many evolved
> variables, require a lot of storage, and also have a rather high
> parallelization overhead. A few ways out (none are production ready)
> are:
> - Use DGFE instead of finite differences; see e.g. Jonah Miller's PhD
> thesis and the respective McLachlan branch
> - Avoid buffer zones by using an improved time interpolation scheme
> (I've seen papers, I don't know about 3d code)
> - Switch to CarpetX to avoid subcycling in time.

The same parameter file scales "acceptably" on Frontera, so I would
expect the same algorithms to give some form of scaling on Expanse
as well.

> If memory usage is high only on a single node, then this is probably
> caused by a serial code. Known serial pieces of code are ASCII output,
> wave extraction, or the apparent horizon finder. Try disabling these
> to see which (if any) is the culprit.

Carpet prints out a total memory usage that is much less than the
available memory, but the simulation still crashes with an out-of-memory
error. Is it possible that one MPI rank uses much more memory than the
others?
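
To find out which rank is the problem, I was going to compare Slurm's
per-task accounting after the crash, and possibly turn on CarpetLib's
periodic per-process memory statistics (print_memstats_every, if I
remember the parameter name correctly):

    # Which task used the most resident memory, and on which node
    # (<jobid> is the Slurm job id of the crashed run)
    sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode,State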

> Finally, if OpenMP performance is bad, you can try using only every
> second core and leaving the remainder idle, and see whether this
> helps.
Yes, that is possible, but I would first like to look for better solutions.

> > - Avoid buffer zones by using an improved time interpolation scheme
> > (I've seen papers, I don't know about 3d code)
> Eloisa, Ian, I and Erik worked on this for a bit quite a while ago. The
> results can be found in branch ianhinder/rkprol:
>
> git clone -b ianhinder/rkprol https://bitbucket.org/cactuscode/cactusnumerical.git
>
> note that this is actually a change to MoL, not so much Carpet. This
> implementation should work, but is not optimized and most likely still
> has (way) too much communication. Based on two presentations at the ET
> meeting in Stockholm:
>
> https://docs.einsteintoolkit.org/et-docs/ET_Workshop_2015
>
> by Bishop Mongwane and Saran Tunyasuvunakool.

If what I report here is a common problem on new AMD systems, it might be
worth seeing whether this work can find its way into master.

Gabriele


On Fri, Aug 27, 2021 at 9:57 AM Erik Schnetter <schnetter at cct.lsu.edu>
wrote:

> Gabriele
>
> Thanks for your thoughts.
>
> I have some general remarks:
>
> You mention that mvapich has the best performance. Is there any reason
> to use any other MPI implementation?
>
> Did you check that mvapich is configured correctly? Does it use the
> network efficiently?
>
> You need to use SystemTopology, or ensure otherwise that the way
> threads and processes are mapped to hardware is reasonable.
>
> What is the ratio of ghost/buffer to actually evolved grid points in your
> setup?
>
> If MPI performance is slow, then the usual way out is to use OpenMP.
> You implied using 4 threads per process; did you try using 8 threads
> per process or more? This will also reduce memory consumption since
> there are fewer ghost zones. Unfortunately, OpenMP multi-threading in
> Carpet is not as efficient as it could be. CarpetX is much better in
> this respect.
>
> Our way of discretizing equations (high-order methods with 3 ghost
> zones, AMR with buffer zones), combined with having many evolved
> variables, require a lot of storage, and also have a rather high
> parallelization overhead. A few ways out (none are production ready)
> are:
> - Use DGFE instead of finite differences; see e.g. Jonah Miller's PhD
> thesis and the respective McLachlan branch
> - Avoid buffer zones by using an improved time interpolation scheme
> (I've seen papers, I don't know about 3d code)
> - Switch to CarpetX to avoid subcycling in time.
>
> If memory usage is high only on a single node, then this is probably
> caused by a serial code. Known serial pieces of code are ASCII output,
> wave extraction, or the apparent horizon finder. Try disabling these
> to see which (if any) is the culprit.
>
> Finally, if OpenMP performance is bad, you can try using only every
> second core and leaving the remainder idle, and see whether this
> helps.
>
> -erik
>
>
> On Fri, Aug 27, 2021 at 12:45 PM Gabriele Bozzola
> <bozzola.gabriele at gmail.com> wrote:
> >
> > Hello,
> >
> > Last week I opened a PR to add the configuration files
> > for Expanse to simfactory. Expanse is an example of
> > the new generation of AMD supercomputers. Others are
> > Anvil, one of the other new XSEDE machines, or Puma,
> > the newest cluster at The University of Arizona.
> >
> > I have some experience with Puma and Expanse and
> > I would like to share some thoughts, some of which come
> > from interacting with the admins of Expanse. The problem
> > is that I am finding terrible multi-node performance on both
> > these machines, and I don't know if this will be a common
> > thread among new AMD clusters.
> >
> > These supercomputers have similar characteristics.
> >
> > First, they have very high cores/node count (typically
> > 128/node) but low memory per core (typically 2 GB / core).
> > In these conditions, it is very easy to have a job killed by
> > the OOM daemon. My suspicion is that it is rank 0 that
> > goes out of memory, and the entire run is aborted.
> >
> > Second, depending on the MPI implementation, MPI collective
> > operations can be extremely expensive. I was told that
> > the best implementation is mvapich 2.3.6 (at the moment).
> > This seems to be due to the high core count.
> >
> > I found that the code does not scale well. This is possibly
> > related to the previous point. If your job can fit on a single node,
> > it will run wonderfully. However, when you perform the same
> > simulation on two nodes, the code will actually be slower.
> > This indicates that there's no strong scaling at all from
> > 1 node to 2 (128 to 256 cores, or 32 to 64 MPI ranks).
> > Using mvapich 2.3.6 improves the situation, but it is still
> > faster to use fewer nodes.
> >
> > (My benchmark is a par file I've tested extensively on Frontera)
> >
> > I am working with Expanse's support staff to see what we can
> > do, but I wonder if anyone has had a positive experience with
> > this architecture and has some tips to share.
> >
> > Gabriele
> >
>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>