[Users] strong scaling tests on Stampede (TACC)

Erik Schnetter schnetter at cct.lsu.edu
Fri Feb 22 09:09:34 CST 2013


Bruno

The amount of data output varies greatly between simulations, as do the
intervals between outputs. I/O also usually scales very differently from
the computation itself. It is therefore customary (at least while trying
to understand results) to test evolution and output separately. That is,
the evolution benchmark would probably output only the maximum of rho, and
an I/O benchmark would output (and/or recover) just Minkowski data without
time evolution.

Regarding the number of threads: To determine the ideal number of threads
to use, we run single-node benchmarks of code that is well parallelised
via OpenMP. Code that is not really parallel (e.g. some initial data
routines) would distort these results, as would using a large number of
cores, since that also brings parallel (MPI) scalability into play. Using
this "ideal" number of threads, we then run scalability benchmarks to see
where parallel scaling breaks down.
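
To make "where parallel scaling breaks down" concrete, here is a minimal
sketch of how one might compute strong-scaling speedup and efficiency from
a list of (cores, seconds) pairs. The function names, the 50% threshold,
and the timings are all made up for illustration; they are not
measurements from this thread.

```python
def strong_scaling(timings):
    """Return (cores, speedup, efficiency) rows for a fixed problem size.

    `timings` is a list of (cores, seconds) pairs; the run with the fewest
    cores is used as the baseline.
    """
    timings = sorted(timings)
    base_cores, base_time = timings[0]
    rows = []
    for cores, seconds in timings:
        speedup = base_time / seconds
        # Ideal speedup equals the ratio of core counts.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


def breakdown_point(rows, threshold=0.5):
    """First core count whose efficiency falls below `threshold`, or None."""
    for cores, _, efficiency in rows:
        if efficiency < threshold:
            return cores
    return None


# Hypothetical timings (seconds in the evolution), NOT real benchmark data.
example = [(16, 1000.0), (32, 520.0), (64, 290.0), (128, 200.0), (256, 180.0)]
rows = strong_scaling(example)
for cores, speedup, efficiency in rows:
    print(f"{cores:5d} cores  speedup {speedup:6.2f}  efficiency {efficiency:5.2f}")
print("efficiency first drops below 50% at", breakdown_point(rows), "cores")
```

Which threshold counts as "breaking down" is a judgment call; 50% is only
a placeholder here.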

For any given physics situation, one then has to strike a compromise:
- non-parallel sections of the code prefer using 1 OpenMP thread
- parallel MPI scaling prefers using as few MPI processes as possible, i.e.
using many OpenMP threads
The balance thus shifts with the number of cores -- the more cores you use,
the more OpenMP threads you will also want to use to counteract MPI
scaling problems.
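
The candidate layouts for this compromise can be enumerated with a small
sketch like the following. It assumes 16 cores per node (my assumption,
consistent with the --num-threads=16 runs quoted below) and only considers
thread counts that divide a node evenly; the function name is hypothetical.

```python
def layouts(total_cores, cores_per_node=16):
    """List (mpi_processes, threads_per_process) pairs for `total_cores`.

    Only thread counts that divide both the node size and the total core
    count are kept, so that no MPI process spans more than one node.
    """
    result = []
    for threads in range(1, cores_per_node + 1):
        if cores_per_node % threads == 0 and total_cores % threads == 0:
            result.append((total_cores // threads, threads))
    return result


for processes, threads in layouts(1024):
    print(f"{processes:5d} MPI processes x {threads:2d} OpenMP threads")
```

Each of these layouts would then be benchmarked; the enumeration itself
says nothing about which one is fastest.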

As others have said, binding threads to cores and binding memory to
processes is also very important, and is visible in particular on
single-node benchmarks.

For any benchmark I run, I also look at detailed timer output. The standard
Cactus timers are not good enough for this; you will have to use
TimerReport or Carpet's timers instead. One interesting quantity is the
fraction of the time spent in the actual evolution thorns (not just
CCTK_EVOL, which also measures some infrastructure tasks). The other
interesting effect to watch is how this time distribution changes as the
number of MPI processes increases -- this shows the effect of scaling
problems. It should show, e.g., that ASCII output becomes more time
consuming, or that synchronisation or load balancing takes more time.
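
As a rough illustration of this kind of comparison, the sketch below turns
per-category timer totals into fractions and prints them side by side for
two process counts. The category names and all numbers are invented for
the example; they are not the output format of TimerReport or Carpet, and
real timer output would have to be parsed into this shape first.

```python
def fractions(timers):
    """Convert {category: seconds} into {category: fraction of total}."""
    total = sum(timers.values())
    return {name: seconds / total for name, seconds in timers.items()}


# Hypothetical timer totals in seconds, keyed by number of MPI processes.
runs = {
    64: {"evolution_thorns": 800.0, "synchronisation": 100.0,
         "ascii_output": 50.0, "other": 50.0},
    1024: {"evolution_thorns": 50.0, "synchronisation": 20.0,
           "ascii_output": 25.0, "other": 5.0},
}

shares = {procs: fractions(timers) for procs, timers in runs.items()}
for name in runs[64]:
    row = "  ".join(f"{procs:5d} procs: {shares[procs][name]:5.1%}"
                    for procs in sorted(runs))
    print(f"{name:18s} {row}")
```

A shrinking share for the evolution thorns as the process count grows is
the signature of a scaling problem elsewhere.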

Finally, I found it very difficult to come up with a "good" benchmark
parameter file. Such a benchmark should run both on few and on many cores,
should contain all the relevant thorns, should not do I/O, should not
contain anything that is known not to scale, should not encounter NaNs or
con2prim problems, should be "close" to the parameter files that people
actually want to use, etc.

I think it's time to create a wiki page for benchmarking! There we could
describe (a) the tools available (timers, etc.), (b) the pitfalls to avoid
(e.g. measure evolution and I/O separately), and (c) discuss results that
we find.

In this case -- I think this means we should improve ASCII output! Writing
ASCII files is always slow, but collecting data onto a single process
shouldn't be. That's no more than a reduction operation, and with
InfiniBand bandwidths of tens of Gigabytes per second, we are very far away
from what the hardware performance allows us to do. Let's open a bug report
for this. We should either correct this, or prominently warn people about
it.
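
A back-of-the-envelope sketch of why the collection step should be cheap:
the function below estimates the pure transfer time for gathering 1D
output onto one process. The point count, the number of variables, and the
deliberately pessimistic 1 GB/s bandwidth are all assumptions chosen for
illustration, and the estimate ignores latency and the cost of the
many-to-one communication pattern.

```python
def gather_time_seconds(points, variables, bytes_per_value=8,
                        bandwidth_bytes_per_s=1e9):
    """Transfer time for `points` x `variables` double-precision values."""
    return points * variables * bytes_per_value / bandwidth_bytes_per_s


# Hypothetical 1D output: 10,000 points along a line, 20 grid functions.
estimate = gather_time_seconds(points=10_000, variables=20)
print(f"estimated transfer time per output: {estimate * 1e3:.2f} ms")
```

Even with these conservative numbers the transfer takes on the order of
milliseconds, so a large slowdown has to come from somewhere other than
raw bandwidth.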

-erik



On Thu, Feb 21, 2013 at 8:31 PM, Bruno Giacomazzo <
bruno.giacomazzo at jila.colorado.edu> wrote:

> Hi,
>         I did some strong scaling tests on Stampede (
> https://www.xsede.org/tacc-stampede) with both Whisky and GRHydro (using
> the development version of ET in both cases). I used a Carpet par file that
> Roberto DePietri provided me and that he used for similar tests on an
> Italian cluster (I have attached the GRHydro version, the Whisky one is
> similar except for using Whisky instead of GRHydro). I used both Intel MPI
> and MVAPICH, and I did both pure MPI and MPI/OpenMP runs.
>
>         I have attached a text file with my results. The first column is
> the name of the run (if it starts with mvapich it used mvapich otherwise it
> used Intel MPI), the second one is the number of cores (option --procs in
> simfactory), the third one the number of threads (--num-threads), the
> fourth one the time in seconds spent in CCTK_EVOL, and the fifth one the
> walltime in seconds (i.e., the total time used by the run as measured on
> the cluster). I have also attached a couple of figures that show CCTK_EVOL
> vs #cores and walltime vs #cores (only for Intel MPI runs).
>
>         First of all, in pure MPI runs (--num-threads=1) I was unable to
> run on more than 1024 cores using Intel MPI (the run either crashed
> before iteration zero or hung). There was no problem when using
> --num-threads=8 or --num-threads=16. I also noticed that scaling was
> particularly bad in pure MPI runs and that a lot of time was spent outside
> CCTK_EVOL (both with Intel MPI and MVAPICH). After speaking with Roberto, I
> found out that the problem is due to 1D ASCII output (which is active in
> that parfile), which makes the runs particularly slow above ~100 cores on
> this machine. In plot_scaling_walltime_all.pdf I also plot two pure MPI
> runs without 1D ASCII output, and the scaling is much better in this
> case (the time spent in CCTK_EVOL is identical to the case with 1D output,
> and hence I didn't plot them in the other figure). I didn't try using 1D
> HDF5 output instead. Does anyone use it?
>
>         According to my tests, --num-threads=16 performs better than
> --num-threads=8 (which is the current default value in simfactory) and
> Intel MPI seems to be better than MVAPICH. Is there a particular reason for
> using 8 instead of 16 threads as the default simfactory value on Stampede?
>
>         Let me know if you have any comments or suggestions.
>
> Cheers,
> Bruno
>
> Dr. Bruno Giacomazzo
> JILA - University of Colorado
> 440 UCB
> Boulder, CO 80309
> USA
>
> Tel.  : +1 303-492-5170
> Fax  : +1 303-492-5235
> email : bruno.giacomazzo at jila.colorado.edu
> web: http://www.brunogiacomazzo.org
>
> ----------------------------------------------------------------------
> There are only 10 types of people in the world:
> Those who understand binary, and those who don't
> ----------------------------------------------------------------------


-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/