<div dir="ltr">On Tue, Feb 19, 2013 at 1:24 PM, Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] <span dir="ltr"><<a href="mailto:bernard.j.kelly@nasa.gov" target="_blank">bernard.j.kelly@nasa.gov</a>></span> wrote:<br>
<div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Ian (and Frank and Erik). Thanks for the further insight on the<br>
profiling.<br>
<br>
<br>
[Please ignore the new mail that just came through with the 400KB<br>
attachment. That was my first attempt that was held for moderation because<br>
of the attachment size. Then I sent the slimmed-down attachments, but this<br>
was still in the pipeline.]<br>
<br>
I was looking at *all* the processor outputs (that is, all the<br>
TimerReport_XXXXXX files), but not necessarily at all fields in all of<br>
them. I concentrated on the CCTK_EVOL section of the report, and then only<br>
looked closely at discrepancies between a sample "longer SelectBoundConds"<br>
processor and each of the five or six "shorter SelectBoundConds"<br>
processors. I suppose to do a more complete job, I'd have to start<br>
scripting ...<br>
<br>
Anyway, I *hadn't* been using those profiling parameters before, so my<br>
conclusions were probably dodgy as you say. After your reply I re-enabled<br>
them and restarted the run. Since it's so slow, I'm now looking at the<br>
TimerReports from earlier in the new run, and no longer see any<br>
discrepancies between different processors (that is, there don't seem to<br>
be any "shorter SelectBoundConds" processors any more).<br>
<br>
So if *all* the processors are showing essentially the same information,<br>
and the "schedule_barriers" and "sync_barriers" are in place, then there's<br>
no significant load imbalance? And yet it is slow as hell ...<br></blockquote><div><br></div><div style>With schedule barriers, load imbalance is hidden in these barriers. That is, you would need to measure how much time each process spends in these barriers. I expect that some processes will spend 0s there, while others will spend 50,000s there. That would be your load imbalance.</div>
<div style><br></div><div style>-erik</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
I'm now testing with the actual repository McLachlan instead.<br>
<span class="HOEnZb"><font color="#888888"><br>
Bernard<br>
</font></span><div class="im HOEnZb"><br>
On 2/18/13 3:40 PM, "Ian Hinder" <<a href="mailto:ian.hinder@aei.mpg.de">ian.hinder@aei.mpg.de</a>> wrote:<br>
<br>
><br>
>On 18 Feb 2013, at 21:11, "Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF<br>
>MARYLAND BALTIMORE COUNTY]" <<a href="mailto:bernard.j.kelly@nasa.gov">bernard.j.kelly@nasa.gov</a>> wrote:<br>
><br>
>> [re-sent, with smaller attachment]<br>
>><br>
>> Hi Roland, and thanks for your reply. I'm still a bit confused, I<br>
>>confess<br>
>> (see below) ...<br>
><br>
>><br>
>><br>
>>><br>
</div><div class="HOEnZb"><div class="h5">>>>> I wouldn't mind, but while trying to understand why ML_BSSN was<br>
>>>>evolving<br>
>>>> so slowly on one of our machines, I looked at the TimerReport files,<br>
>>>>and<br>
>>>> saw that SelectBoundConds was taking *much* more time (like 20 times<br>
>>>>as<br>
>>>> long) than the actual RHS calculation routines.<br>
>>> The long time is most likely caused by the fact that the boundary<br>
>>> selection routine tends to be the one calling SYNC, which means it is<br>
>>> the one that does an MPI wait (if there is load imbalance) and<br>
>>> communicates data for buffer zone prolongation etc.<br>
>><br>
>> So it might be spending most of the time waiting for other cores to<br>
>>catch<br>
>> up?<br>
><br>
>If you look at timer output just for one process, you will almost<br>
>certainly reach erroneous conclusions due to things like this. I<br>
>recommend looking at the output on all processes (yes, performance<br>
>profiling is hard).<br>
><br>
>> But if it's really waiting for prior routines to finish on other<br>
>> processors, then on the handful of cores where SBC appears significantly<br>
>> *quicker* than usual (e.g. ~50,000 seconds instead of ~100,000) I should<br>
>> see earlier routines taking correspondingly *longer*, right? But I<br>
>>don't.<br>
><br>
>It may also be that timings change significantly from one iteration to<br>
>the next. Have you set your CPU affinity settings correctly?<br>
><br>
>I recommend setting the parameters<br>
><br>
>Carpet::schedule_barriers = yes<br>
>Carpet::sync_barriers = yes<br>
><br>
>This will insert an MPI barrier before and after each scheduled function<br>
>call and sync. Then you can rely on the timings of the individual<br>
>functions, and also see how much time is spent waiting to catch up (i.e.<br>
>in load imbalance). At the moment, the function timers for functions<br>
>which do communication will include time spent waiting for the other<br>
>process to catch up.<br>
><br>
>> I'm attaching TimerReport files for two cores on the same (128-core)<br>
>> evolution. Core 000 is typical. Line 184 (the most up-to-date instance<br>
>>of<br>
>> "large" SBC behaviour) shows about 100K seconds spent cumulatively over<br>
>> the simulation so far. Core 052 shows only about half as much time used<br>
>>in<br>
>> the same routine, but I can't see what other EVOL routines might be<br>
>>taking<br>
>> up the slack.<br>
>><br>
>> (Note, BTW, that what I'm running isn't vanilla ML_BSSN, but a locally<br>
>> modified version called MH_BSSN. The scheduling and most routines are<br>
>> almost identical to McLachlan.)<br>
>><br>
>> Bernard<br>
>><br>
>><br>
>><br>
>>><br>
>>> Yours,<br>
>>> Roland<br>
>>><br>
>>> --<br>
>>> My email is as private as my paper mail. I therefore support encrypting<br>
>>> and signing email messages. Get my PGP key from <a href="http://keys.gnupg.net" target="_blank">http://keys.gnupg.net</a>.<br>
>>><br>
>><br>
>><br>
>><TimerReports_LATEST_BJK.tgz><br>
><br>
>--<br>
>Ian Hinder<br>
><a href="http://numrel.aei.mpg.de/people/hinder" target="_blank">http://numrel.aei.mpg.de/people/hinder</a><br>
><br>
<br>
_______________________________________________<br>
Users mailing list<br>
<a href="mailto:Users@einsteintoolkit.org">Users@einsteintoolkit.org</a><br>
<a href="http://lists.einsteintoolkit.org/mailman/listinfo/users" target="_blank">http://lists.einsteintoolkit.org/mailman/listinfo/users</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>><br><a href="http://www.perimeterinstitute.ca/personal/eschnetter/" target="_blank">http://www.perimeterinstitute.ca/personal/eschnetter/</a>
</div></div>