[Users] Using Stampede2 SKX

Erik Schnetter schnetter at cct.lsu.edu
Wed Feb 21 18:54:22 CST 2018


Jim,

You can enable the thorn SystemTopology but disable the setting that makes
it change CPU bindings. It will then report the bindings that OpenMP chooses.
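
For example, a parameter-file fragment along these lines (the parameter
name here is my assumption from memory -- check SystemTopology's param.ccl
for the exact name):

  ActiveThorns = "hwloc SystemTopology"
  # NOTE: parameter name assumed; report bindings but do not change them
  SystemTopology::set_thread_bindings = "no"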

The problem with setting the bindings via OpenMP is that OpenMP is not
aware of the multiple MPI processes, and thus cannot prevent threads from
different MPI processes from binding to the same core. Whether this is a
problem in practice will then be evident from SystemTopology's output.
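
For a quick standalone check, a minimal sketch that prints where each
thread of each rank is currently running (sched_getcpu is glibc-specific;
compile with "mpicc -fopenmp"):

#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
#pragma omp parallel
  {
    /* if two (rank, thread) pairs report the same CPU, the bindings
       of different MPI processes overlap */
    printf("rank %d thread %d on cpu %d\n",
           rank, omp_get_thread_num(), sched_getcpu());
  }
  MPI_Finalize();
  return 0;
}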

-erik


On Wed, Feb 21, 2018 at 6:51 PM, James Healy <jchsma at rit.edu> wrote:

> Hi Roland, all,
>
> I tried the changes Roland made to the runscript on stampede2. The goal
> was to see whether, by choosing a different OpenMP binding than
> Stampede2's default, we can achieve better run speeds without enabling
> hwloc/SystemTopology.  The answer is yes.
>
> I looked at the case with 2 threads per MPI rank (24 ranks per node) on
> 4, 8, 16, and 24 nodes, in 4 different situations.  Attached is a plot
> showing the results.
>
> The lines are labeled "Binding" with a Yes or No, and "h/ST" for
> hwloc/SystemTopology with either a Yes or No.  The runs with "Binding Y"
> include the following 2 lines in the runscript:
>
> export OMP_PLACES=cores
> export OMP_PROC_BIND=close
>
> There is no noticeable difference between the two binding choices when
> hwloc/ST is active.  But when hwloc/ST is not active, setting the
> bindings as above brings the run speeds in line with the hwloc/ST runs.
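>
> For reference, the binding-related part of the runscript then looks
> roughly like this (executable and parameter-file names are placeholders;
> ibrun is TACC's MPI launcher):
>
> export OMP_NUM_THREADS=2    # threads per MPI rank
> export OMP_PLACES=cores     # one place per physical core
> export OMP_PROC_BIND=close  # pack each rank's threads onto adjacent cores
> ibrun ./cactus_sim lowres.par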
>
> Thanks,
> Jim
>
>
> On 02/21/2018 01:21 PM, Roland Haas wrote:
>
>> Hello Jim,
>>
>> thank you for benchmarking these. I have just updated the defaults in
>> simfactory to 2 threads per MPI rank (i.e. 24 ranks per node), since
>> this gave you the fastest simulation when using hwloc (though not
>> without it).
>>
>> I suspect the hwloc requirement is due to a bad default layout of the
>> threads by TACC.
>>
>> I have pushed (hopefully saner) settings for task binding into a branch
>> rhaas/stampede2 (git pull ; git checkout rhaas/stampede2). If you have
>> time, would you mind benchmarking using those settings as well, please?
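>>
>> Concretely, that is (assuming the branch lives in the simfactory
>> repository, which is where the run and submit scripts live):
>>
>>   cd simfactory   # path inside your Cactus checkout; adjust as needed
>>   git pull
>>   git checkout rhaas/stampede2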
>>
>> Yours,
>> Roland
>>
>>> Very good! That looks like a 25% speed improvement in the mid-range of
>>> #MPI processes per node.
>>>
>>> It also looks as if the maximum speed is achieved by using between 8
>>> and 24 MPI processes per node, i.e. between 2 and 6 OpenMP threads per
>>> MPI process.
>>>
>>> -erik
>>>
>>> On Mon, Feb 19, 2018 at 10:07 AM, James Healy <jchsma at rit.edu> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I followed up on our discussion from a few weeks ago by redoing the
>>>> scaling tests with hwloc and SystemTopology turned on.  I attached a
>>>> plot showing the difference when using or not using the OpenMP tasks
>>>> changes to prolongation.  I also attached the stdout files for the
>>>> ranks=24 case, with tasks on and off (both with hwloc), including the
>>>> output from TimerReport.
>>>>
>>>> Thanks,
>>>> Jim
>>>>
>>>>
>>>> On 01/26/2018 10:26 AM, Roland Haas wrote:
>>>>
>>>>
>>>>> Hello Jim,
>>>>>
>>>>> thank you very much for giving this a spin.
>>>>>
>>>>> Yours,
>>>>> Roland
>>>>>
>>>>>> Hi Erik, Roland, all,
>>>>>>
>>>>>> After our discussion on last week's telecon, I followed Roland's
>>>>>> instructions on how to get the branch which has changes to how Carpet
>>>>>> handles prolongation with respect to OpenMP.  I reran my simple
>>>>>> scaling
>>>>>> test on Stampede Skylake nodes using this branch of Carpet
>>>>>> (rhaas/openmp-tasks) to test the scalability.
>>>>>>
>>>>>> Attached is a plot showing the speeds for a variety of numbers of
>>>>>> nodes and for different ways of distributing the 48 threads on each
>>>>>> node between MPI processes and OpenMP threads.  I did this for three
>>>>>> versions of the ETK:
>>>>>>
>>>>>>  1. A fresh checkout of ET_2017_06.
>>>>>>  2. ET_2017_06 with Carpet switched to rhaas/openmp-tasks (labelled
>>>>>>     "Test On").
>>>>>>  3. The checkout from #2, but without the parameters that enable the
>>>>>>     new prolongation code (labelled "Test Off").
>>>>>>
>>>>>> The run speeds were grabbed at iteration 256 from
>>>>>> Carpet::physical_time_per_hour.  No IO or regridding.
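>>>>>>
>>>>>> (The speed readout comes from Carpet's info output; a parameter-file
>>>>>> fragment like the following prints it every iteration:
>>>>>>
>>>>>>   IOBasic::outInfo_every = 1
>>>>>>   IOBasic::outInfo_vars  = "Carpet::physical_time_per_hour"
>>>>>> )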
>>>>>>
>>>>>> For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much
>>>>>> difference between the 3 trials.  However, for 16 and 24 nodes (768
>>>>>> and 1152 cores), we see some improvement in run speed (10-15%) for
>>>>>> many choices of thread distribution, again with a slight preference
>>>>>> for 8 ranks/node.
>>>>>>
>>>>>> I also ran the previous test (not using the openmp-tasks branch) on
>>>>>> Comet, and found results similar to before.
>>>>>>
>>>>>> Thanks,
>>>>>> Jim
>>>>>>
>>>>>> On 01/21/2018 01:07 PM, Erik Schnetter wrote:
>>>>>>
>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> I looked at OpenMP performance in the Einstein Toolkit a few months
>>>>>>> ago, and I found that Carpet's prolongation operators are not well
>>>>>>> parallelized. There is a branch in Carpet (and a few related thorns)
>>>>>>> that applies a different OpenMP parallelization strategy, which
>>>>>>> seems to be more efficient. We are currently looking into
>>>>>>> cherry-picking the relevant changes from this branch (there are also
>>>>>>> many unrelated changes, since I experimented a lot) and putting them
>>>>>>> back into the master branch.
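>>>>>>>
>>>>>>> To sketch the idea (a simplification, not the actual branch code):
>>>>>>> rather than putting "#pragma omp parallel for" inside each operator,
>>>>>>> one OpenMP task is created per component, so independent components
>>>>>>> can be prolongated concurrently:
>>>>>>>
>>>>>>> // hypothetical function and variable names, for illustration only
>>>>>>> #pragma omp parallel
>>>>>>> #pragma omp single
>>>>>>> for (int c = 0; c < ncomponents; ++c) {
>>>>>>>   #pragma omp task firstprivate(c)
>>>>>>>   prolongate_component(c);  // one task per independent component;
>>>>>>>                             // all tasks finish at the implicit
>>>>>>>                             // barrier ending the parallel region
>>>>>>> }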
>>>>>>>
>>>>>>> These changes only help with prolongation, which seems to be a major
>>>>>>> contributor to non-OpenMP-scalability. I experimented with other
>>>>>>> changes as well. My findings (unfortunately without good solutions
>>>>>>> so far) are:
>>>>>>>
>>>>>>> - The standard OpenMP parallelization of loops over grid functions
>>>>>>> is not good for data cache locality. I experimented with padding
>>>>>>> arrays, ensuring that loop boundaries align with cache line
>>>>>>> boundaries, etc., but this never worked quite satisfactorily -- MPI
>>>>>>> parallelization is still faster than OpenMP. In effect, the only
>>>>>>> reason to use OpenMP is that one eventually runs into MPI's
>>>>>>> scalability limits, at which point OpenMP's non-scalability is the
>>>>>>> lesser evil.
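>>>>>>>
>>>>>>> As an illustration of the padding idea (a sketch, not Carpet code):
>>>>>>> round the fastest-varying dimension up to whole cache lines, so that
>>>>>>> threads working on different rows never share a cache line:
>>>>>>>
>>>>>>> #include <stdlib.h>
>>>>>>>
>>>>>>> #define CACHELINE 64  /* bytes */
>>>>>>>
>>>>>>> /* round the row length up to a whole number of cache lines */
>>>>>>> static size_t padded_ni(size_t ni) {
>>>>>>>   const size_t per_line = CACHELINE / sizeof(double); /* 8 doubles */
>>>>>>>   return (ni + per_line - 1) / per_line * per_line;
>>>>>>> }
>>>>>>>
>>>>>>> /* allocate an ni x nj grid function with padded, aligned rows */
>>>>>>> static double *alloc_gf(size_t ni, size_t nj) {
>>>>>>>   return aligned_alloc(CACHELINE,
>>>>>>>                        padded_ni(ni) * nj * sizeof(double));
>>>>>>> }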
>>>>>>>
>>>>>>> - We could overlap calculations with communication. To do so, I have
>>>>>>> experimental changes that break loops over grid functions into
>>>>>>> tiles. Outer tiles need to wait for communication (synchronization
>>>>>>> or prolongation) to finish, while inner tiles can be calculated
>>>>>>> right away. Unfortunately, OpenMP does not support open-ended
>>>>>>> threads like this, so I'm using Qthreads
>>>>>>> <https://github.com/Qthreads/qthreads> and FunHPC
>>>>>>> <https://bitbucket.org/eschnett/funhpc.cxx> for this. The respective
>>>>>>> changes to Carpet, the scheduler, and thorns are significant, and I
>>>>>>> couldn't prove any performance improvements yet. However, once we
>>>>>>> have removed other, more prominent causes of non-scalability, I hope
>>>>>>> that this will become interesting.
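>>>>>>>
>>>>>>> A toy sketch of the tile idea, using OpenMP task dependences instead
>>>>>>> of Qthreads/FunHPC (function names are placeholders; the caveat is
>>>>>>> that an OpenMP task blocking in MPI ties up its worker thread, which
>>>>>>> Qthreads/FunHPC avoid):
>>>>>>>
>>>>>>> int ghosts_ready = 0;  /* dependence sentinel */
>>>>>>> #pragma omp parallel
>>>>>>> #pragma omp single
>>>>>>> {
>>>>>>>   #pragma omp task depend(out: ghosts_ready)
>>>>>>>   exchange_ghost_zones();            /* the MPI communication */
>>>>>>>   for (int t = 0; t < ntiles; ++t) {
>>>>>>>     if (is_interior(t)) {
>>>>>>>       #pragma omp task firstprivate(t)
>>>>>>>       compute_tile(t);               /* inner tile: runs right away */
>>>>>>>     } else {
>>>>>>>       #pragma omp task firstprivate(t) depend(in: ghosts_ready)
>>>>>>>       compute_tile(t);               /* outer tile: waits for ghosts */
>>>>>>>     }
>>>>>>>   }
>>>>>>> }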
>>>>>>>
>>>>>>> I haven't been attending the ET phone calls recently because Monday
>>>>>>> mornings aren't good for me schedule-wise. If you are interested,
>>>>>>> then we can ensure that we both attend at the same time and then
>>>>>>> discuss this. We need to make sure that Roland Haas is then also
>>>>>>> attending.
>>>>>>>
>>>>>>> -erik
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jan 20, 2018 at 10:21 AM, James Healy <jchsma at rit.edu> wrote:
>>>>>>>
>>>>>>>       Hello all,
>>>>>>>
>>>>>>>       I am trying to run on the new Skylake processors on Stampede2,
>>>>>>>       and while the run speeds we are obtaining are very good, we
>>>>>>>       are concerned that we aren't optimizing properly when it comes
>>>>>>>       to OpenMP.  For instance, we see the best speeds when we use
>>>>>>>       8 MPI processes per node (with 6 threads each, for a total of
>>>>>>>       48 threads/node).  Based on the architecture, we were
>>>>>>>       expecting to see the best speeds with 2 MPI/node.  Here is
>>>>>>>       what I have tried:
>>>>>>>
>>>>>>>        1. Using the simfactory files for stampede2-skx (config file,
>>>>>>>           run and submit scripts, and modules loaded), I compiled a
>>>>>>>           version of ET_2017_06 using LazEv (RIT's evolution thorn)
>>>>>>>           and McLachlan, and submitted a series of runs that vary
>>>>>>>           both the number of nodes used and how the 48 threads/node
>>>>>>>           are distributed between MPI processes (see the sketch
>>>>>>>           after this list).
>>>>>>>        2. I use a standard low-resolution grid, with no IO or
>>>>>>>           regridding.  Parameter file attached.
>>>>>>>        3. Run speeds are measured from
>>>>>>>           Carpet::physical_time_per_hour at iteration 256.
>>>>>>>        4. I tried both with and without hwloc/SystemTopology.
>>>>>>>        5. For both McLachlan and LazEv, I see similar results, with 2
>>>>>>>           MPI/node giving the worst results (see attached plot for
>>>>>>>           McLachlan) and a slight preference for 8 MPI/node.
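>>>>>>>
>>>>>>>       As a sketch of one such submission (simulation and parameter
>>>>>>>       file names are made up; --procs is the total core count and
>>>>>>>       --num-threads the OpenMP threads per rank):
>>>>>>>
>>>>>>>       # 4 nodes x 48 cores = 192 cores; 6 threads/rank => 8 ranks/node
>>>>>>>       ./simfactory/bin/sim create-submit skx_n4_t6 \
>>>>>>>           --machine=stampede2-skx --parfile=lowres.par \
>>>>>>>           --procs=192 --num-threads=6 --walltime=2:00:00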
>>>>>>>
>>>>>>>       So my questions are:
>>>>>>>
>>>>>>>        1. Have any other users run tests on Stampede2 SKX?
>>>>>>>        2. Should we expect 2 MPI/node to be the optimal choice?
>>>>>>>        3. If so, are there any other configurations we can try that
>>>>>>>           could help optimize?
>>>>>>>
>>>>>>>       Thanks in advance!
>>>>>>>
>>>>>>>       Jim Healy
>>>>>>>
>>>>>>>
>>>>>>>       _______________________________________________
>>>>>>>       Users mailing list
>>>>>>>       Users at einsteintoolkit.org
>>>>>>>       http://lists.einsteintoolkit.org/mailman/listinfo/users
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Erik Schnetter <schnetter at cct.lsu.edu>
>>>>>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>


-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/