[Users] XSEDE's Expanse and failing tests

Wed Aug 18 10:19:52 CDT 2021

Hello Gabriele,

Thank you for contributing these.

The test suites are quick-running parfiles with small grids, so running
them on large numbers of MPI ranks (they are designed for 1 or 2 MPI
ranks) can lead to unexpected situations (such as an MPI rank having no
grid points at all).
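
To illustrate how a rank can end up empty, here is a minimal sketch that
assumes a plain 1-D split of n grid points over r ranks, with the
remainder handed to the lowest ranks. The function name points_on_rank
is hypothetical, and the real domain decomposition in the driver is
three-dimensional and accounts for ghost zones, so this only shows the
arithmetic of the problem:

```shell
# Hypothetical helper: number of grid points owned by one rank in a
# simple 1-D block decomposition (an assumption, not the driver's scheme).
points_on_rank() {
    n=$1      # total number of grid points
    r=$2      # number of MPI ranks
    rank=$3   # rank index, starting at 0
    base=$((n / r))
    extra=$((n % r))
    # The first "extra" ranks each take one additional point.
    if [ "$rank" -lt "$extra" ]; then
        echo $((base + 1))
    else
        echo "$base"
    fi
}

# A 20-point grid on 2 ranks gives 10 points each; on 32 ranks the
# highest ranks own nothing.
points_on_rank 20 2 0
points_on_rank 20 32 31
```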

Generally, if the tests work for 1, 2, and 4 ranks (4 being the largest
number of procs requested by any test.ccl file) then this is sufficient.

In principle even running on more MPI ranks should work, so if you know
which tests fail with the larger number of MPI ranks and were to list
them in a ticket, maybe someone could look into this.

Note that you can undersubscribe a compute node, in particular for
tests, if you do not need or want to use all cores.
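
As a rough sketch of the arithmetic, assuming 128 cores per node (the
figure used in the quoted message below) and a hypothetical helper
threads_per_rank, the number of OpenMP threads per MPI rank is the
number of cores you choose to use divided by the number of ranks:

```shell
# Hypothetical helper: OpenMP threads per MPI rank for a given number
# of cores in use and a given number of ranks (integer division).
threads_per_rank() {
    cores=$1
    ranks=$2
    echo $((cores / ranks))
}

# Fully subscribed node: 128 cores, 8 ranks.
threads_per_rank 128 8

# Undersubscribed node for a test: use only 8 of the cores with 2 ranks.
OMP_NUM_THREADS=$(threads_per_rank 8 2)
export OMP_NUM_THREADS
echo "$OMP_NUM_THREADS"
```

How the reduced core count is passed to the launcher or the queueing
system depends on the machine setup and is not shown here.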

Can you create a pull request for the "linux" architecture file with
the changes for the AMD compiler you found, please? So far it seems you
mostly only changed the detection part. Does it then not also require
some changes in the "set values" part of the file, e.g. default values
for optimization, preprocessor or so?
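
For orientation, the detection step can be sketched as a small
standalone function that classifies a compiler from its --version
banner. The function name detect_vendor and the example banner strings
are assumptions made for illustration. This is not the code in the
"linux" file, and it deliberately leaves out the "set values" part,
since suitable default flags for the AMD compilers are not established
in this thread:

```shell
# Hypothetical helper: map a compiler's --version banner to a vendor tag.
# AMD is checked first because the banner of a clang-based AMD compiler
# may also mention other toolchains.
detect_vendor() {
    banner=$1
    case "$banner" in
        *AMD*)                         echo AMD ;;
        *"Free Software Foundation"*)  echo GNU ;;
        *)                             echo unknown ;;
    esac
}

# Typical use (CC is whatever compiler the build is configured with):
#   LINUX_C_COMP=$(detect_vendor "$($CC --version 2>&1)")
detect_vendor "AMD clang version (example banner)"
```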

Yours,
Roland

> Hello,
> 
> Two days ago, I opened a PR to the simfactory repo to add Expanse,
> the newest machine at the San Diego Supercomputer Center, based on
> AMD Epyc "Rome" CPUs and part of XSEDE. In the meantime, I realized
> that some tests are failing miserably, but I couldn't figure out why.
> 
> Before I describe what I found, let me start with a side note on AMD
> compilers.
> 
> <side note>
> 
> There are four compilers available on Expanse: GNU, Intel, AMD, and PGI.
> I did not touch the PGI compilers. I briefly tried (and failed) to compile
> with the AMD compilers (aocc and flang). I did not try hard, and it seems
> that most of the libraries on Expanse are compiled with gcc anyway.
> 
> A first step to support these compilers is adding the lines:
> 
>    elif test "`$F90 --version 2>&1 | grep AMD`" ; then
>      LINUX_F90_COMP=AMD
>    else
> 
>  elif test "`$CC --version 2>&1 | grep AMD`" ; then
>    LINUX_C_COMP=AMD
>  fi
> 
>  elif test "`$CXX --version 2>&1 | grep AMD`" ; then
>    LINUX_CXX_COMP=AMD
>  fi
> 
> in the obvious places in flesh/lib/make/known-architecture/linux.
> 
> </side note>
> 
> I successfully compiled the Einstein Toolkit with
> - gcc 10.2.0 and OpenMPI 4.0.4
> - gcc 9.2.0 and OpenMPI 4.0.4
> - intel 2019 and Intel MPI 2019
> 
> I noticed that some tests, like ADMMass/tov_carpet.par, gave
> completely incorrect results. For example, the expected value is 1.3,
> but I would find 1.6.
> 
> I disabled all the optimizations, but the test kept failing. In the
> end, I noticed that if I ran with 8/16/32 MPI processes per node, and
> the corresponding number of OpenMP threads (128/N_MPI), the test
> would fail, but if I ran with 4/2/1 MPI processes, the test would pass.
> 
> Most of my experiments were with gcc 10, but the test also fails with
> the Intel suite.
> 
> I tried increasing OMP_STACKSIZE to a very large value, but
> it didn't help.
> 
> Any idea of what the problem might be?
> 
> Gabriele

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .