[Users] Using Stampede2 SKX

Roland Haas roland.haas at physics.gatech.edu
Wed Feb 21 12:21:11 CST 2018


Hello Jim,

thank you for benchmarking these. I have just updated the defaults in
simfactory to 2 threads per MPI rank (i.e. 24 MPI ranks per node), since
this gave you the fastest simulation when using hwloc (though not without
it).

I suspect the hwloc requirement is due to a poor default thread layout
in TACC's environment.
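
A quick way to check the layout on a node is a throwaway diagnostic
(not part of any thorn) that has each OpenMP thread report which core
it is currently running on, for example:

    // thread_placement.cc -- throwaway diagnostic, Linux-specific.
    // Build with e.g. "g++ -fopenmp thread_placement.cc" and run one
    // copy per MPI rank to see where the threads actually end up.
    #include <cstdio>
    #include <omp.h>
    #include <sched.h>

    int main()
    {
    #pragma omp parallel
      {
    #pragma omp critical
        std::printf("thread %d of %d is on core %d\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    sched_getcpu());
      }
      return 0;
    }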

I have pushed (hopefully saner) settings for task binding into a branch
rhaas/stampede2 (git pull ; git checkout rhaas/stampede2). If you have
time, would you mind benchmarking using those settings as well, please?

Yours,
Roland

> Very good! That looks like a 25% speed improvement in the mid-range of #MPI
> processes per node.
> 
> It also looks as if the maximum speed is achieved by using between 8 and 24
> MPI processes per node, i.e. between 2 and 6 OpenMP threads per MPI process.
> 
> -erik
> 
> On Mon, Feb 19, 2018 at 10:07 AM, James Healy <jchsma at rit.edu> wrote:
> 
> > Hello all,
> >
> > I followed up on our discussion from a few weeks ago by redoing the
> > scaling tests, this time with hwloc and SystemTopology turned on.  I
> > attached a plot showing the difference between using and not using the
> > OpenMP tasks changes to prolongation.  I also attached the stdout files
> > (including the TimerReport output) for the ranks=24 runs, with tasks on
> > and off, both using hwloc.
> >
> > Thanks,
> > Jim
> >
> >
> > On 01/26/2018 10:26 AM, Roland Haas wrote:
> >  
> >> Hello Jim,
> >>
> >> thank you very much for giving this a spin.
> >>
> >> Yours,
> >> Roland
> >>
> >>> Hi Erik, Roland, all,
> >>>
> >>> After our discussion on last week's telecon, I followed Roland's
> >>> instructions for getting the branch that changes how Carpet handles
> >>> prolongation with respect to OpenMP.  I reran my simple scaling test
> >>> on Stampede2 Skylake nodes using this branch of Carpet
> >>> (rhaas/openmp-tasks) to test its scalability.
> >>>
> >>> Attached is a plot showing run speeds for a range of node counts and
> >>> for different ways of distributing the 48 threads per node between MPI
> >>> processes and OpenMP threads.  I did this for three versions of the
> >>> ETK: 1. a fresh checkout of ET_2017_06; 2. ET_2017_06 with Carpet
> >>> switched to rhaas/openmp-tasks (labelled "Test On"); 3. the same
> >>> checkout as #2, but without the parameters that enable the new
> >>> prolongation code (labelled "Test Off").  Run speeds were taken from
> >>> Carpet::physical_time_per_hour at iteration 256, with no IO or
> >>> regridding.
> >>>
> >>> For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much
> >>> difference between the three trials.  However, for 16 and 24 nodes
> >>> (768 and 1152 cores), we see some improvement in run speed (10-15%)
> >>> for many choices of thread distribution, again with a slight
> >>> preference for 8 ranks/node.
> >>>
> >>> I also ran the previous test (not using the openmp-tasks branch) on
> >>> Comet, and found results similar to before.
> >>>
> >>> Thanks,
> >>> Jim
> >>>
> >>> On 01/21/2018 01:07 PM, Erik Schnetter wrote:
> >>>  
> >>>> James
> >>>>
> >>>> I looked at OpenMP performance in the Einstein Toolkit a few months
> >>>> ago, and I found that Carpet's prolongation operators are not well
> >>>> parallelized.  There is a branch in Carpet (and a few related thorns)
> >>>> that applies a different OpenMP parallelization strategy, which seems
> >>>> to be more efficient.  We are currently looking into cherry-picking
> >>>> the relevant changes from this branch (there are also many unrelated
> >>>> changes, since I experimented a lot) and putting them back into the
> >>>> master branch.
> >>>>
> >>>> These changes only help with prolongation, which seems to be a major
> >>>> contributor to the lack of OpenMP scalability.  I experimented with
> >>>> other changes as well.  My findings (unfortunately without good
> >>>> solutions so far) are:
> >>>>
> >>>> - The standard OpenMP parallelization of loops over grid functions is
> >>>> not good for data cache locality.  I experimented with padding arrays,
> >>>> ensuring that loop boundaries align with cache line boundaries, etc.,
> >>>> but this never worked quite satisfactorily -- MPI parallelization is
> >>>> still faster than OpenMP.  In effect, the only reason to use OpenMP is
> >>>> that one eventually runs into MPI's scalability limits, and OpenMP's
> >>>> lack of scalability is then the lesser problem.  (Loop (1) in the
> >>>> sketch below shows the kind of parallelization I mean.)
> >>>>
> >>>> - We could overlap calculations with communication.  To do so, I have
> >>>> experimental changes that break loops over grid functions into tiles.
> >>>> Outer tiles need to wait for communication (synchronization or
> >>>> parallelization) to finish, while inner tiles can be calculated right
> >>>> away; loop (2) in the sketch below illustrates the idea.
> >>>> Unfortunately, OpenMP does not support open-ended threads like this,
> >>>> so I'm using Qthreads <https://github.com/Qthreads/qthreads> and
> >>>> FunHPC <https://bitbucket.org/eschnett/funhpc.cxx> for this.  The
> >>>> respective changes to Carpet, the scheduler, and thorns are
> >>>> significant, and I couldn't demonstrate any performance improvements
> >>>> yet.  However, once we have removed other, more prominent causes of
> >>>> non-scalability, I hope that this approach will become interesting.
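> >>>>
> >>>> To make these two points a bit more concrete, here is a rough,
> >>>> schematic sketch (deliberately simplified, and not Carpet's actual
> >>>> code: one grid-function component is stored as a flat array, and
> >>>> wait_for_ghost_exchange() is only a placeholder for whatever
> >>>> completes the asynchronous boundary synchronization):
> >>>>
> >>>>     #include <algorithm>
> >>>>
> >>>>     // (1) The usual OpenMP parallelization of a loop over one
> >>>>     // grid-function component: the k/j iterations are shared among
> >>>>     // the threads.  Padding ni so that each j-row starts on a
> >>>>     // cache-line boundary is the kind of tweak mentioned above.
> >>>>     void update_naive(double* u, const double* rhs,
> >>>>                       int ni, int nj, int nk, double dt)
> >>>>     {
> >>>>     #pragma omp parallel for collapse(2)
> >>>>       for (int k = 0; k < nk; ++k)
> >>>>         for (int j = 0; j < nj; ++j)
> >>>>           for (int i = 0; i < ni; ++i) {
> >>>>             const int idx = i + ni * (j + nj * k);
> >>>>             u[idx] += dt * rhs[idx];
> >>>>           }
> >>>>     }
> >>>>
> >>>>     // Placeholder: block until the asynchronous ghost-zone exchange
> >>>>     // has completed (a no-op in this sketch).
> >>>>     static void wait_for_ghost_exchange() {}
> >>>>
> >>>>     // (2) Tiled variant sketching the overlap of communication and
> >>>>     // computation: tiles that do not touch the ghost region are
> >>>>     // updated right away, tiles near the boundary first wait for the
> >>>>     // exchange.  Note that blocking inside an OpenMP task ties up the
> >>>>     // worker thread, which is the limitation mentioned above;
> >>>>     // Qthreads/FunHPC side-step this with many lightweight threads.
> >>>>     void update_tiled(double* u, const double* rhs,
> >>>>                       int ni, int nj, int nk, int gh, double dt)
> >>>>     {
> >>>>       const int tile = 16;  // illustrative tile extent in j and k
> >>>>     #pragma omp parallel
> >>>>     #pragma omp single
> >>>>       for (int k0 = 0; k0 < nk; k0 += tile)
> >>>>         for (int j0 = 0; j0 < nj; j0 += tile) {
> >>>>           bool touches_ghosts = k0 < gh || k0 + tile > nk - gh ||
> >>>>                                 j0 < gh || j0 + tile > nj - gh;
> >>>>     #pragma omp task firstprivate(j0, k0, touches_ghosts)
> >>>>           {
> >>>>             if (touches_ghosts)
> >>>>               wait_for_ghost_exchange();
> >>>>             for (int k = k0; k < std::min(k0 + tile, nk); ++k)
> >>>>               for (int j = j0; j < std::min(j0 + tile, nj); ++j)
> >>>>                 for (int i = 0; i < ni; ++i) {
> >>>>                   const int idx = i + ni * (j + nj * k);
> >>>>                   u[idx] += dt * rhs[idx];
> >>>>                 }
> >>>>           }
> >>>>         }
> >>>>     }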
> >>>>
> >>>> I haven't been attending the ET phone calls recently because Monday
> >>>> mornings aren't good for me schedule-wise.  If you are interested, we
> >>>> can make sure that we both attend the same call and then discuss this.
> >>>> We should also make sure that Roland Haas is attending.
> >>>>
> >>>> -erik
> >>>>
> >>>>
> >>>> On Sat, Jan 20, 2018 at 10:21 AM, James Healy <jchsma at rit.edu> wrote:
> >>>>
> >>>>      Hello all,
> >>>>
> >>>>      I am trying to run on the new Skylake processors on Stampede2 and
> >>>>      while the run speeds we are obtaining are very good, we are
> >>>>      concerned that we aren't optimizing properly when it comes to
> >>>>      OpenMP.  For instance, we see the best speeds when we use 8 MPI
> >>>>      processes per node (with 6 threads each, for a total of 48
> >>>>      threads/node).  Based on the architecture, we were expecting the
> >>>>      best speeds with 2 MPI processes/node.  Here is what I have tried:
> >>>>
> >>>>       1. Using the simfactory files for stampede2-skx (config file, run
> >>>>          and submit scripts, and modules loaded) I compiled a version
> >>>>          of ET_2017_06 using LazEv (RIT's evolution thorn) and
> >>>>          McLachlan and submitted a series of runs that change both the
> >>>>          number of nodes used, and how I distribute the 48 threads/node
> >>>>          between MPI processes.
> >>>>       2. I used a standard low-resolution grid, with no IO or
> >>>>          regridding.  Parameter file attached.
> >>>>       3. Run speeds are measured from Carpet::physical_time_per_hour at
> >>>>          iteration 256.
> >>>>       4. I tried both with and without hwloc/SystemTopology.
> >>>>       5. For both McLachlan and LazEv, I see similar results, with 2
> >>>>          MPI/node giving the worst results (see attached plot for
> >>>>          McLachlan) and a slight preference for 8 MPI/node.
> >>>>
> >>>>      So my questions are:
> >>>>
> >>>>       1. Have any tests been run by other users on Stampede2 SKX?
> >>>>       2. Should we expect 2 MPI/node to be the optimal choice?
> >>>>       3. If so, are there any other configurations we can try that
> >>>>          could help optimize?
> >>>>
> >>>>      Thanks in advance!
> >>>>
> >>>>      Jim Healy
> >>>>
> >>>>
> >>>>      _______________________________________________
> >>>>      Users mailing list
> >>>>      Users at einsteintoolkit.org
> >>>>      http://lists.einsteintoolkit.org/mailman/listinfo/users
> >>>>
> >>>>
> >>>>  
> >>>> -- 
> >>>> Erik Schnetter <schnetter at cct.lsu.edu>
> >>>> http://www.perimeterinstitute.ca/personal/eschnetter/
> >>>>
> >>>>  
> >>>  
> >>
> >>  
> >  
> 
> 



-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .

