<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Hi Erik, Roland, all,<br>

      <br>

      After our discussion on last week's telecon, I followed Roland's

      instructions on how to get the branch which has changes to how

      Carpet handles prolongation with respect to OpenMP.  I reran my

      simple scaling test on Stampede Skylake nodes using this branch of

      Carpet (rhaas/openmp-tasks) to test the scalability.  <br>

      <br>

      Attached is a plot showing the run speeds for several node counts

      and for different ways of splitting the 48 threads on each node between

      MPI processes and OpenMP threads.  I did this for three versions

      of the ETK.  1. Fresh checkout of ET_2017_06.  2. The ET_2017_06

      with Carpet switched to the rhaas/openmp-tasks branch (labelled "Test

      On").  3. The checkout from #2 again, but without the

      parameters to enable the new prolongation code (labelled "Test

      Off").  The run speeds used were grabbed at iteration 256 from

      Carpet::physical_time_per_hour.  No IO or regridding.<br>

      <br>

      For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much

      difference between the 3 trials.  However, for 16 and 24 nodes

      (768 and 1152 cores), we see some improvement in run speed

      (10-15%) for many choices of distribution of threads, again with a

      slight preference for 8 ranks/node.  <br>

      <br>
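      For reference, the ways of splitting 48 threads/node and the core counts quoted above follow from simple arithmetic. Here is a minimal Python sketch, assuming one OpenMP thread per core and the same number of threads on every rank; the function names are only illustrative and are not part of the ETK or simfactory:<br>
<pre>
```python
# Assumption: 48 cores per Skylake node, one OpenMP thread per core.
CORES_PER_NODE = 48


def rank_thread_splits(cores_per_node=CORES_PER_NODE):
    """Return (ranks_per_node, threads_per_rank) pairs that fill a node exactly."""
    return [(ranks, cores_per_node // ranks)
            for ranks in range(1, cores_per_node + 1)
            if cores_per_node % ranks == 0]


def total_cores(nodes, cores_per_node=CORES_PER_NODE):
    """Total number of cores used by a run on the given number of nodes."""
    return nodes * cores_per_node


if __name__ == "__main__":
    for ranks, threads in rank_thread_splits():
        print(f"{ranks:2d} ranks/node x {threads:2d} threads/rank")
    for nodes in (4, 8, 16, 24):
        print(f"{nodes:2d} nodes = {total_cores(nodes)} cores")
```
</pre>
      Among others, this lists the 2 ranks x 24 threads and 8 ranks x 6 threads layouts discussed in this thread.<br>
      <br>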

      I also ran the previous test (not using the openmp-tasks branch)

      on Comet, and found results similar to the earlier ones.<br>

      <br>

      Thanks,<br>

      Jim<br>

      <br>

      On 01/21/2018 01:07 PM, Erik Schnetter wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CADKQjjfTkaNWWMgohEwOVnGtFmdy4H_A6ixhrDtKgVmXYkUACQ@mail.gmail.com">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <div dir="ltr">James

        <div><br>

        </div>

        <div>I looked at OpenMP performance in the Einstein Toolkit a

          few months ago, and I found that Carpet's prolongation

          operators are not well parallelized. There is a branch in

          Carpet (and a few related thorns) that applies a different

          OpenMP parallelization strategy, which seems to be more

          efficient. We are currently looking into cherry-picking the

          relevant changes from this branch (there are also many

          unrelated changes, since I experimented a lot) and putting

          them back into the master branch.</div>

        <div><br>

        </div>

        <div>These changes only help with prolongation, which seems to

          be a major contributor to non-OpenMP-scalability. I

          experimented with other changes as well. My findings

          (unfortunately without good solutions so far) are:</div>

        <div><br>

        </div>

        <div>- The standard OpenMP parallelization of loops over grid

          functions is not good for data cache locality. I experimented

          with padding arrays, ensuring that loop boundaries align with

          cache line boundaries, etc., but this never worked quite

          satisfactorily -- MPI parallelization is still faster than

          OpenMP. In effect, the only reason to use OpenMP is that one

          eventually runs into MPI's scalability limits, at which point

          OpenMP's non-scalability is the lesser evil.</div>

        <div><br>

        </div>

        <div>- We could overlap calculations with communication. To do

          so, I have experimental changes that break loops over grid

          functions into tiles. Outer tiles need to wait for

          communication (synchronization or prolongation) to finish,

          while inner tiles can be calculated right away. Unfortunately,

          OpenMP does not support open-ended threads like this, so I'm

          using Qthreads &lt;<a

            href="https://github.com/Qthreads/qthreads"

            moz-do-not-send="true">https://github.com/Qthreads/qthreads</a>&gt;

          and FunHPC &lt;<a

            href="https://bitbucket.org/eschnett/funhpc.cxx"

            moz-do-not-send="true">https://bitbucket.org/eschnett/funhpc.cxx</a>&gt;

          for this. The respective changes to Carpet, the scheduler, and

          thorns are significant, and I haven't been able to demonstrate any

          performance improvement yet. However, once we have removed other,

          more prominent causes of non-scalability, I hope that this will

          become interesting.</div>
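        <div><br>
        </div>
        <div>To illustrate the tiling idea, here is a minimal Python sketch. It is not the actual Carpet, Qthreads, or FunHPC code: all function names are hypothetical, a single background thread stands in for the communication layer, and the grid is a square block of n x n owned points with a stencil of a given radius.</div>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor


def tiles_1d(n, tile):
    """Split the index range [0, n) into half-open (lo, hi) tiles."""
    return [(lo, min(lo + tile, n)) for lo in range(0, n, tile)]


def is_outer(lo, hi, n, radius):
    """True if a stencil of this radius applied on [lo, hi) reads ghost points."""
    owned = range(0, n)
    return (lo - radius) not in owned or (hi - 1 + radius) not in owned


def classify(n, tile, radius):
    """Return (inner, outer) lists of 2D tiles for an n x n block of owned points."""
    inner, outer = [], []
    for (ilo, ihi) in tiles_1d(n, tile):
        for (jlo, jhi) in tiles_1d(n, tile):
            touches = is_outer(ilo, ihi, n, radius) or is_outer(jlo, jhi, n, radius)
            (outer if touches else inner).append(((ilo, ihi), (jlo, jhi)))
    return inner, outer


def run_step(n, tile, radius, exchange_ghosts, compute_tile):
    """Start the ghost exchange, compute inner tiles meanwhile, then outer tiles."""
    inner, outer = classify(n, tile, radius)
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(exchange_ghosts)  # communication runs in the background
        for t in inner:
            compute_tile(t)                     # reads only owned points
        pending.result()                        # wait until communication is done
    for t in outer:
        compute_tile(t)                         # ghost data is available now
    return len(inner), len(outer)
```
</pre>
        <div>For example, classify(16, 4, 1) yields 4 inner and 12 outer tiles, so only a quarter of the work could overlap with communication at this block size; larger blocks per process shift that ratio towards the inner tiles.</div>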

        <div><br>

        </div>
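        <div>As an illustration of the cache-line alignment mentioned in the first point, here is a minimal Python sketch of the index arithmetic only. It assumes 64-byte cache lines, 8-byte (double precision) grid function elements, and an array whose base address is itself line-aligned; the function names are hypothetical and this is not what Carpet does internally.</div>
<pre>
```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines
ELEM_BYTES = 8         # assumption: double-precision elements


def padded_extent(n, elem_bytes=ELEM_BYTES, line_bytes=CACHE_LINE_BYTES):
    """Round a row of n elements up to a whole number of cache lines."""
    per_line = line_bytes // elem_bytes
    return -(-n // per_line) * per_line  # ceiling division, then scale back


def aligned_chunks(n, threads, elem_bytes=ELEM_BYTES, line_bytes=CACHE_LINE_BYTES):
    """Split [0, n) into at most `threads` chunks that start on a cache line."""
    per_line = line_bytes // elem_bytes
    lines = -(-n // per_line)                      # cache lines covering the row
    lines_per_thread = max(1, -(-lines // threads))
    step = lines_per_thread * per_line
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]
```
</pre>
        <div>With these assumptions, no two threads write to the same cache line within a row, which avoids false sharing but does not by itself fix the locality problem described above.</div>
        <div><br>
        </div>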

        <div>I haven't been attending the ET phone calls recently

          because Monday mornings aren't good for me schedule-wise. If

          you are interested, we can arrange to both attend the same

          call and discuss this then. We should make sure that

          Roland Haas attends as well.</div>

        <div><br>

        </div>

        <div>-erik</div>

        <div><br>

        </div>

      </div>

      <div class="gmail_extra"><br>

        <div class="gmail_quote">On Sat, Jan 20, 2018 at 10:21 AM, James

          Healy <span dir="ltr">&lt;<a href="mailto:jchsma@rit.edu"

              target="_blank" moz-do-not-send="true">jchsma@rit.edu</a>&gt;</span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF">

              <p>Hello all,</p>

              <p>I am trying to run on the new Skylake processors on

                Stampede2 and while the run speeds we are obtaining are

                very good, we are concerned that we aren't optimizing

                properly when it comes to OpenMP.  For instance, we see

                the best speeds when we use 8 MPI processes per node

                (with 6 threads each, for a total of 48

                threads/node).  Based on the architecture, we were

                expecting to see the best speeds with 2 MPI/node.  Here

                is what I have tried:</p>

              <ol>

                <li>Using the simfactory files for stampede2-skx (config

                  file, run and submit scripts, and modules loaded) I

                  compiled a version of ET_2017_06 using LazEv (RIT's

                  evolution thorn) and McLachlan and submitted a series

                  of runs that change both the number of nodes used, and

                  how I distribute the 48 threads/node between MPI

                  processes.<br>

                </li>

                <li>I use a standard low resolution grid, with no IO or

                  regridding.  Parameter file attached.</li>

                <li>Run speeds are measured from

                  Carpet::physical_time_per_hour at iteration 256. <br>

                </li>

                <li>I tried both with and without hwloc/SystemTopology.<br>

                </li>

                <li>For both McLachlan and LazEv, I see similar results,

                  with 2 MPI/node giving the worst results (see attached

                  plot for McLachlan) and a slight preference for 8

                  MPI/node.<br>

                </li>

              </ol>

              <p>So my questions are:</p>

              <ol>

                <li>Have any other users run tests on

                  Stampede2 SKX?<br>

                </li>

                <li>Should we expect 2 MPI/node to be the optimal

                  choice? <br>

                </li>

                <li>If so, are there any other configurations we can try

                  that could help optimize?</li>

              </ol>

              <p>Thanks in advance!</p>

              <p>Jim Healy</p>

            </div>


          </blockquote>

        </div>

        <br>

        <br clear="all">

        <div><br>

        </div>

        -- <br>

        <div class="gmail_signature" data-smartmail="gmail_signature">

          <div dir="ltr">

            <div>Erik Schnetter &lt;<a

                href="mailto:schnetter@cct.lsu.edu" target="_blank"

                moz-do-not-send="true">schnetter@cct.lsu.edu</a>&gt;<br>

              <a

                href="http://www.perimeterinstitute.ca/personal/eschnetter/"

                target="_blank" moz-do-not-send="true">http://www.perimeterinstitute.ca/personal/eschnetter/</a></div>

            <div><br>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    <p><br>

    </p>

  </body>

</html>