[Users] logic of scheduling SelectBoundConds in McLachlan?

Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] bernard.j.kelly at nasa.gov
Tue Feb 19 12:24:53 CST 2013

Hi Ian (and Frank and Erik). Thanks for the further insight on the

I was looking at *all* the processor outputs (that is, all the
TimerReport_XXXXXX files), but not necessarily at all fields in all of
them. I concentrated on the CCTK_EVOL section of the report, and then only
looked closely at discrepancies between a sample "longer SelectBoundConds"
processor and each of the five or six "shorter SelectBoundConds"
processors. I suppose to do a more complete job, I'd have to start
scripting ...
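
A minimal sketch of what such a script might look like, in Python. It
assumes the per-process timer values have already been pulled out of the
TimerReport_XXXXXX files into a plain dict; the parsing step is left out
because the report layout isn't reproduced here. The function names, the
timer labels and the 20% threshold are hypothetical, and the numbers are
only illustrative round figures of the kind discussed in this thread.

```python
from statistics import mean


def timer_spread(timings):
    """Map each timer name to (min, mean, max) seconds across processes.

    timings: {process_id: {timer_name: seconds}} (assumed layout).
    A timer missing on a process is counted as 0.0 seconds there.
    """
    names = set()
    for per_proc in timings.values():
        names.update(per_proc)
    spread = {}
    for name in sorted(names):
        values = [per_proc.get(name, 0.0) for per_proc in timings.values()]
        spread[name] = (min(values), mean(values), max(values))
    return spread


def flag_discrepancies(timings, rel_tol=0.2):
    """Return timers whose (max - min) exceeds rel_tol times the mean."""
    flagged = []
    for name, (lo, avg, hi) in timer_spread(timings).items():
        if avg > 0 and (hi - lo) / avg > rel_tol:
            flagged.append(name)
    return flagged


# Illustrative values only: two processes, hypothetical timer labels.
example = {
    0: {"SelectBoundConds": 100000.0, "RHS": 5000.0},
    52: {"SelectBoundConds": 50000.0, "RHS": 5000.0},
}
print(flag_discrepancies(example))
```

Run over all processes and all sections of the report rather than just
CCTK_EVOL, something like this would show which timers differ between
processes without having to compare files by eye.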

Anyway, I *hadn't* been using those profiling parameters before, so my
conclusions were probably dodgy, as you say. After your reply I re-enabled
them and restarted the run. Since it's so slow, I'm now looking at the
TimerReports from earlier in the new run, and no longer see any
discrepancies between different processors (that is, there don't seem to
be any "shorter SelectBoundConds" processors any more).

So if *all* the processors are showing essentially the same information,
and the "schedule_barriers" and "sync_barriers" are in place, then there's
no significant load imbalance? And yet it is slow as hell ...
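
One way to put a number on that question, as a rough sketch in Python: with
the barriers in place, the per-process time in a given function should
reflect work rather than waiting, so the idle fraction can be estimated
from how far the average process falls short of the slowest one. The
function name and the sample values are hypothetical, and the formula is a
generic load-imbalance estimate, not something taken from Carpet.

```python
def imbalance_fraction(work_seconds):
    """Estimate the fraction of time the average process would sit idle.

    work_seconds: per-process time spent in one function, excluding
    waits (assumed to be what the timers report once barriers are on).
    Returns 1 - mean/max, which is 0.0 when every process takes equally long.
    """
    slowest = max(work_seconds)
    if slowest <= 0:
        return 0.0
    average = sum(work_seconds) / len(work_seconds)
    return 1.0 - average / slowest


# Identical timings on every process: no imbalance in this function.
print(imbalance_fraction([10.0, 10.0, 10.0, 10.0]))
# One process takes twice as long as the other: a quarter of the time is idle.
print(imbalance_fraction([5.0, 10.0]))
```

A value near zero for every function would only say that the work is evenly
spread; it would not by itself explain why the run as a whole is slow.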

I'm now testing with the actual repository McLachlan instead.


On 2/18/13 3:40 PM, "Ian Hinder" <ian.hinder at aei.mpg.de> wrote:

>On 18 Feb 2013, at 21:11, "Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF
>MARYLAND BALTIMORE COUNTY]" <bernard.j.kelly at nasa.gov> wrote:
>> [re-sent, with smaller attachment]
>> Hi Roland, and thanks for your reply. I'm still a bit confused, I
>> (see below) ...
>>>> I wouldn't mind, but while trying to understand why ML_BSSN was running
>>>> so slowly on one of our machines, I looked at the TimerReport files, and
>>>> saw that SelectBoundConds was taking *much* more time (like 20 times as
>>>> long) than the actual RHS calculation routines.
>>> The long time is most likely caused by the fact that the boundary
>>> selection routine tends to be the one calling SYNC which means it is
>>> one that does an MPI wait (if there is load imbalance) and communicates
>>> data for buffer zone prolongation etc.
>> So it might be spending most of the time waiting for other cores to catch
>> up?
>If you look at timer output just for one process, you will almost
>certainly reach erroneous conclusions due to things like this.  I
>recommend looking at the output on all processes (yes, performance
>profiling is hard).
>> But if it's really waiting for prior routines to finish on other
>> processors, then on the handful of cores where SBC appears significantly
>> *quicker* than usual (e.g. ~50,000 seconds instead of ~100,000) I should
>> see earlier routines taking correspondingly *longer*, right? But I
>It may also be that timings change significantly from one iteration to
>the next.  Have you set your CPU affinity settings correctly?
>I recommend setting the parameters
>Carpet::schedule_barriers = yes
>Carpet::sync_barriers = yes
>This will insert an MPI barrier before and after each scheduled function
>call and sync.  Then you can rely on the timings of the individual
>functions, and also see how much time is spent waiting to catch up (i.e.
>in load imbalance).  At the moment, the function timers for functions
>which do communication will include time spent waiting for the other
>process to catch up.
>> I'm attaching TimerReport files for two cores on the same (128-core)
>> evolution. Core 000 is typical. Line 184 (the most up-to-date instance of
>> "large" SBC behaviour) shows about 100K seconds spent cumulatively over
>> the simulation so far. Core 052 shows only about half as much time used in
>> the same routine, but I can't see what other EVOL routines might be taking
>> up the slack.
>> (Note, BTW, that what I'm running isn't vanilla ML_BSSN, but a locally
>> modified version called MH_BSSN. The scheduling and most routines are
>> almost identical to McLachlan)
>> Bernard
>Ian Hinder

