<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body><div><font face="sans-serif">Hi, for another application we found that 4-8 MPI ranks per node were necessary to saturate the network bandwidth. Since that application was limited by network bandwidth, this was key to performance. Joel&nbsp;</font></div><div><br></div><div><br></div><div><br></div><div id="composer_signature"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><div style="font-size:85%;color:#575757">Sent from my Samsung Galaxy S8</div></div><div><br></div><div style="font-size:100%;color:#000000"><!-- originalMessage --><div>-------- Original message --------</div><div>From: James Healy &lt;jchsma@rit.edu&gt; </div><div>Date: 1/20/18  10:21 AM  (GMT-05:00) </div><div>To: Einstein Toolkit Users &lt;users@einsteintoolkit.org&gt;, Yosef Zlochower &lt;yosef@astro.rit.edu&gt;, Carlos Lousto &lt;lousto@astro.rit.edu&gt; </div><div>Subject: [Users] Using Stampede2 SKX </div><div><br></div></div>

    <p>Hello all,</p>

    <p>I am trying to run on the new Skylake processors on Stampede2, and
      while the run speeds we are obtaining are very good, we are
      concerned that we aren't optimizing properly when it comes to
      OpenMP.&nbsp; For instance, we see the best speeds when we use 8 MPI
      processes per node (with 6 threads each, for a total of 48
      threads/node).&nbsp; Based on the architecture, we were expecting to
      see the best speeds with 2 MPI processes/node.&nbsp; Here is what I have tried:</p>

    <ol>

      <li>Using the simfactory files for stampede2-skx (config file, run

        and submit scripts, and modules loaded) I compiled a version of

        ET_2017_06 using LazEv (RIT's evolution thorn) and McLachlan and

        submitted a series of runs that change both the number of nodes

        used, and how I distribute the 48 threads/node between MPI

        processes.<br>

      </li>

      <li>I use a standard low resolution grid, with no IO or

        regridding.&nbsp; Parameter file attached.</li>

      <li>Run speeds are measured from Carpet::physical_time_per_hour at

        iteration 256. <br>

      </li>

      <li>I tried both with and without hwloc/SystemTopology.<br>

      </li>

      <li>For both McLachlan and LazEv, I see similar results, with 2

        MPI/node giving the worst results (see attached plot for

        McLachlan) and a slight preference for 8 MPI/node.<br>

      </li>

    </ol>
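
    <p>For reference, the rank/thread splits scanned above can be
      enumerated with a short script. This is only a minimal sketch: the
      48 cores/node figure comes from the description above, while the
      two-socket layout, the contiguous packing of ranks onto cores, and
      the helper names <code>layouts</code> and <code>fits_in_socket</code>
      are assumptions made for illustration, not part of simfactory or
      the Einstein Toolkit.</p>

```python
# Hypothetical helper: list the ways to split one node's cores into
# MPI ranks x OpenMP threads. 48 cores/node is taken from the post;
# the dual-socket layout is an assumption about the SKX nodes.
CORES_PER_NODE = 48
SOCKETS_PER_NODE = 2  # assumption


def layouts(cores=CORES_PER_NODE):
    """Return (ranks_per_node, threads_per_rank) pairs that use every core."""
    return [(r, cores // r) for r in range(1, cores + 1) if cores % r == 0]


def fits_in_socket(ranks, sockets=SOCKETS_PER_NODE):
    """True if the ranks divide evenly among the sockets, so that (with
    contiguous packing) no rank's threads span two sockets."""
    return ranks % sockets == 0


if __name__ == "__main__":
    for ranks, threads in layouts():
        note = "" if fits_in_socket(ranks) else "  (a rank would straddle sockets)"
        print(f"{ranks:2d} ranks x {threads:2d} threads{note}")
```

    <p>Under these assumptions, both 2 MPI/node (24 threads each) and
      8 MPI/node (6 threads each) keep every rank within one socket, so
      socket placement alone would not explain a difference between them.</p>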

    <p>So my questions are:</p>

    <ol>

      <li>Have any other users run tests on Stampede2 SKX?<br>

      </li>

      <li>Should we expect 2 MPI/node to be the optimal choice? <br>

      </li>

      <li>If so, are there any other configurations we can try that

        could help optimize?</li>

    </ol>

    <p>Thanks in advance!</p>

    <p>Jim Healy</p>

  </body></html>