<div dir="ltr">Very good! That looks like a 25% speed improvement in the mid-range of the number of MPI processes per node.<div><br></div><div>It also looks as if the maximum speed is achieved by using between 8 and 24 MPI processes per node, i.e. between 2 and 6 OpenMP threads per MPI process.<br><div><br></div><div>-erik</div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 19, 2018 at 10:07 AM, James Healy <span dir="ltr">&lt;<a href="mailto:jchsma@rit.edu" target="_blank">jchsma@rit.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello all,<br>

<br>

I followed up on our discussion from a few weeks ago by redoing the scaling tests, this time with hwloc and SystemTopology turned on.  I attached a plot showing the difference between enabling and disabling the OpenMP tasks changes to prolongation.  I also attached the stdout files for the ranks=24 runs, with tasks on and off (both with hwloc), which include the output from TimerReport.<br>

<br>

Thanks,<br>

Jim<div class="HOEnZb"><div class="h5"><br>

<br>

On 01/26/2018 10:26 AM, Roland Haas wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hello Jim,<br>

<br>

thank you very much for giving this a spin.<br>

<br>

Yours,<br>

Roland<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Erik, Roland, all,<br>

<br>

After our discussion on last week&#39;s telecon, I followed Roland&#39;s instructions on how to get the branch which has changes to how Carpet handles prolongation with respect to OpenMP.  I reran my simple scaling test on Stampede Skylake nodes using this branch of Carpet (rhaas/openmp-tasks) to test the scalability.<br>

<br>

Attached is a plot showing the speeds for various numbers of nodes and for different distributions of the 48 threads per node between MPI processes and OpenMP threads.  I did this for three versions of the ETK:  1. A fresh checkout of ET_2017_06.  2. ET_2017_06 with Carpet switched to the rhaas/openmp-tasks branch (labelled &quot;Test On&quot;).  3. The checkout from #2, but without the parameters that enable the new prolongation code (labelled &quot;Test Off&quot;).  The run speeds were taken at iteration 256 from Carpet::physical_time_per_hour.  No IO or regridding.<br>

<br>

For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn&#39;t much difference between the 3 trials.  However, for 16 and 24 nodes (768 and 1152 cores), we see some improvement in run speed (10-15%) for many choices of thread distribution, again with a slight preference for 8 ranks/node.<br>
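
<br>

As a rough illustration of the arithmetic behind these numbers (a sketch only, not part of the test setup; the function names are made up, and 48 cores per node is taken from the runs described above), the possible splits of a node between MPI ranks and OpenMP threads and the total core counts can be listed like this:<br>

<pre>
```python
# Illustrative only: CORES_PER_NODE = 48 matches the Stampede2 Skylake
# runs described above; the function names are hypothetical.
CORES_PER_NODE = 48


def rank_thread_splits(cores_per_node=CORES_PER_NODE):
    """Return (MPI ranks per node, OpenMP threads per rank) pairs that use every core."""
    return [
        (ranks, cores_per_node // ranks)
        for ranks in range(1, cores_per_node + 1)
        if cores_per_node % ranks == 0
    ]


def total_cores(nodes, cores_per_node=CORES_PER_NODE):
    """Total core count for a job on the given number of nodes."""
    return nodes * cores_per_node


if __name__ == "__main__":
    for ranks, threads in rank_thread_splits():
        print(f"{ranks:2d} ranks/node x {threads:2d} threads/rank")
    for nodes in (4, 8, 16, 24):
        print(f"{nodes:2d} nodes = {total_cores(nodes)} cores")
```
</pre>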

<br>

I also ran the previous test (not using the openmp-tasks branch) on Comet, and found results similar to those from before.<br>

<br>

Thanks,<br>

Jim<br>

<br>

On 01/21/2018 01:07 PM, Erik Schnetter wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

James<br>

<br>

I looked at OpenMP performance in the Einstein Toolkit a few months ago, and I found that Carpet&#39;s prolongation operators are not well parallelized. There is a branch in Carpet (and a few related thorns) that applies a different OpenMP parallelization strategy, which seems to be more efficient. We are currently looking into cherry-picking the relevant changes from this branch (there are also many unrelated changes, since I experimented a lot) and putting them back into the master branch.<br>

<br>

These changes only help with prolongation, which seems to be a major contributor to OpenMP non-scalability. I experimented with other changes as well. My findings (unfortunately without good solutions so far) are:<br>

<br>

- The standard OpenMP parallelization of loops over grid functions is not good for data cache locality. I experimented with padding arrays, ensuring that loop boundaries align with cache line boundaries, etc., but this never worked quite satisfactorily -- MPI parallelization is still faster than OpenMP. In effect, the only reason to use OpenMP is that one has hit MPI&#39;s scalability limits, at which point OpenMP&#39;s non-scalability is the lesser problem.<br>

<br>

- We could overlap calculations with communication. To do so, I have experimental changes that break loops over grid functions into tiles. Outer tiles need to wait for communication (synchronization or parallelization) to finish, while inner tiles can be calculated right away. Unfortunately, OpenMP does not support open-ended threads like this, so I&#39;m using Qthreads &lt;<a href="https://github.com/Qthreads/qthreads" rel="noreferrer" target="_blank">https://github.com/Qthreads/qthreads</a>&gt; and FunHPC &lt;<a href="https://bitbucket.org/eschnett/funhpc.cxx" rel="noreferrer" target="_blank">https://bitbucket.org/eschnett/funhpc.cxx</a>&gt; for this. The respective changes to Carpet, the scheduler, and thorns are significant, and I couldn&#39;t prove any performance improvements yet. However, once we have removed other, more prominent causes of non-scalability, I hope that this will become interesting.<br>
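
<br>

To make the tiling idea concrete, here is a minimal sketch (hypothetical, not the actual Carpet changes; it assumes a 2D grid and that only tiles touching the grid edge depend on communicated data) that separates the tiles that can be computed right away from those that have to wait:<br>

<pre>
```python
# Hypothetical sketch, not Carpet's code: split a 2D grid into tiles and
# separate tiles that touch the grid edge (assumed to need data from
# communication) from interior tiles (computable right away).
from itertools import product


def classify_tiles(grid_shape, tile_shape):
    """Return (inner, outer) lists of tile origins for a grid cut into tiles."""
    ny, nx = grid_shape
    ty, tx = tile_shape
    inner, outer = [], []
    for j, i in product(range(0, ny, ty), range(0, nx, tx)):
        # A tile is "outer" if it starts at the low edge or reaches the high edge.
        touches_edge = (
            j == 0 or i == 0 or min(j + ty, ny) == ny or min(i + tx, nx) == nx
        )
        (outer if touches_edge else inner).append((j, i))
    return inner, outer


if __name__ == "__main__":
    inner, outer = classify_tiles((16, 16), (4, 4))
    print("compute first (no communication needed):", inner)
    print("compute after communication finishes:", outer)
```
</pre>

A task-based runtime could then start the inner tiles immediately and schedule the outer tiles once the communication has completed.<br>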

<br>

I haven&#39;t been attending the ET phone calls recently because Monday mornings aren&#39;t good for me schedule-wise. If you are interested, then we can ensure that we both attend at the same time and then discuss this. We need to make sure that Roland Haas is then also attending.<br>

<br>

-erik<br>

<br>

<br>

On Sat, Jan 20, 2018 at 10:21 AM, James Healy &lt;<a href="mailto:jchsma@rit.edu" target="_blank">jchsma@rit.edu</a>&gt; wrote:<br>

<br>

     Hello all,<br>

<br>

     I am trying to run on the new skylake processors on Stampede2 and<br>

     while the run speeds we are obtaining are very good, we are<br>

     concerned that we aren&#39;t optimizing properly when it comes to<br>

     OpenMP.  For instance, we see the best speeds when we use 8 MPI<br>

     processes per node (with 6 threads each, for a total of 48<br>

     threads/node).  Based on the architecture, we were expecting to<br>

     see the best speeds with 2 MPI/node.  Here is what I have tried:<br>

<br>

      1. Using the simfactory files for stampede2-skx (config file, run<br>

         and submit scripts, and modules loaded) I compiled a version<br>

         of ET_2017_06 using LazEv (RIT&#39;s evolution thorn) and<br>

         McLachlan and submitted a series of runs that change both the<br>

         number of nodes used, and how I distribute the 48 threads/node<br>

         between MPI processes.<br>

      2. I use a standard low resolution grid, with no IO or<br>

         regridding.  Parameter file attached.<br>

      3. Run speeds are measured from Carpet::physical_time_per_hour at<br>

         iteration 256.<br>

      4. I tried both with and without hwloc/SystemTopology.<br>

      5. For both McLachlan and LazEv, I see similar results, with 2<br>

         MPI/node giving the worst results (see attached plot for<br>

         McLachlan) and a slight preference for 8 MPI/node.<br>

<br>

     So my questions are:<br>

<br>

      1. Have any tests been run by other users on stampede2 skx?<br>

      2. Should we expect 2 MPI/node to be the optimal choice?<br>

      3. If so, are there any other configurations we can try that<br>

         could help optimize?<br>

<br>

     Thanks in advance!<br>

<br>

     Jim Healy<br>

<br>

<br>

     _____________________________<wbr>__________________<br>

     Users mailing list<br>

     <a href="mailto:Users@einsteintoolkit.org" target="_blank">Users@einsteintoolkit.org</a><br>

     <a href="http://lists.einsteintoolkit.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.einsteintoolkit.org/mailman/listinfo/users</a><br>

<br>

<br>

<br>

  -- Erik Schnetter &lt;<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>&gt;<br>

<a href="http://www.perimeterinstitute.ca/personal/eschnetter/" rel="noreferrer" target="_blank">http://www.perimeterinstitute.<wbr>ca/personal/eschnetter/</a><br>

  <br>

</blockquote></blockquote>

<br>

<br>

</blockquote>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Erik Schnetter &lt;<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>&gt;<br><a href="http://www.perimeterinstitute.ca/personal/eschnetter/" target="_blank">http://www.perimeterinstitute.ca/personal/eschnetter/</a></div><div><br></div></div></div>

</div>