[Users] logic of scheduling SelectBoundConds in McLachlan?
Ian Hinder
ian.hinder at aei.mpg.de
Tue Feb 19 15:16:01 CST 2013
On 19 Feb 2013, at 20:16, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> On Tue, Feb 19, 2013 at 1:24 PM, Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] <bernard.j.kelly at nasa.gov> wrote:
> > Hi Ian (and Frank and Erik). Thanks for the further insight on the
> > profiling.
> >
> > [Please ignore the new mail that just came through with the 400 KB
> > attachment. That was my first attempt, which was held for moderation because
> > of the attachment size. Then I sent the slimmed-down attachments, but this
> > was still in the pipeline.]
> >
> > I was looking at *all* the processor outputs (that is, all the
> > TimerReport_XXXXXX files), but not necessarily at all fields in all of
> > them. I concentrated on the CCTK_EVOL section of the report, and then only
> > looked closely at discrepancies between a sample "longer SelectBoundConds"
> > processor and each of the five or six "shorter SelectBoundConds"
> > processors. I suppose to do a more complete job, I'd have to start
> > scripting ...
> >
> > Anyway, I *hadn't* been using those profiling parameters before, so my
> > conclusions were probably dodgy, as you say. After your reply I re-enabled
> > them and restarted the run. Since it's so slow, I'm now looking at the
> > TimerReports from earlier in the new run, and no longer see any
> > discrepancies between different processors (that is, there don't seem to
> > be any "shorter SelectBoundConds" processors any more).
> >
> > So if *all* the processors are showing essentially the same information,
> > and "schedule_barriers" and "sync_barriers" are in place, then there's
> > no significant load imbalance? And yet it is slow as hell ...
>
> With schedule barriers, load imbalance is hidden in these barriers. That is, you would need to measure how much time each process spends in these barriers. I expect that some processes will spend 0s there, while others will spend 50,000s there. That would be your load imbalance.
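To make the measurement Erik describes concrete, here is a minimal Python sketch. It assumes the per-process barrier wait times have already been pulled out of the TimerReport files into a dictionary keyed by MPI rank. The extraction step, the function name `barrier_imbalance`, and the sample numbers are all hypothetical. The rank that waits *least* is the one the others are waiting for, and the spread between the longest and shortest waits is the time lost to imbalance.

```python
def barrier_imbalance(barrier_seconds):
    """Summarize load imbalance from per-process barrier wait times.

    barrier_seconds maps an MPI rank to the total time (in seconds) that
    rank spent waiting in barriers. (Hypothetical helper; the TimerReport
    parsing that would produce this dictionary is not shown.)
    """
    if not barrier_seconds:
        raise ValueError("no barrier timings given")
    times = list(barrier_seconds.values())
    shortest = min(times)
    longest = max(times)
    # The rank that arrives last at each barrier barely waits there, so the
    # rank with the smallest total wait is the one holding everyone else up.
    slowest_rank = min(barrier_seconds, key=barrier_seconds.get)
    return {
        "min_wait": shortest,
        "max_wait": longest,
        "spread": longest - shortest,  # time lost to imbalance
        "slowest_rank": slowest_rank,
    }


# Hypothetical numbers echoing the "0 s versus 50,000 s" example above.
waits = {0: 50000.0, 1: 49800.0, 2: 0.0, 3: 50100.0}
summary = barrier_imbalance(waits)
print(summary["spread"], summary["slowest_rank"])  # prints: 50100.0 2
```

If every rank reports a similar, small barrier time, the slowdown is unlikely to be caused by load imbalance.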
When I added the sync barriers, I added timers on all the barriers. You should see timer entries named ".../Barrier". Do you see these, and are they taking a lot of time? The timer names are hierarchical, so you should be able to see which functions' barriers are causing the slowdown.
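A minimal Python sketch of that lookup follows. It assumes the timer names use "/" as the hierarchy separator (as in ".../Barrier") and that one process's TimerReport has already been parsed into a dictionary mapping timer name to seconds. The parsing step, the function name `barrier_timers`, and the timer names in the example are hypothetical. The sketch keeps only timers whose last path component is "Barrier" and sorts their parent paths by time, so the function whose barrier accumulates the most time comes first.

```python
def barrier_timers(timers, leaf="Barrier"):
    """Return (parent path, seconds) pairs for barrier timers, largest first.

    timers maps a hierarchical timer name such as "A/B/Barrier" to seconds
    (assumed "/"-separated; hypothetical input format).
    """
    found = []
    for name, seconds in timers.items():
        parts = name.split("/")
        if parts[-1] == leaf:
            # The parent path (assumed) identifies the function the barrier belongs to.
            found.append(("/".join(parts[:-1]), seconds))
    found.sort(key=lambda item: item[1], reverse=True)
    return found


# Hypothetical timer names and values for one process.
timers = {
    "CCTK_EVOL/MoL_Evolution/RHS/Barrier": 12.5,
    "CCTK_EVOL/MoL_Evolution/SelectBoundConds/Barrier": 340.0,
    "CCTK_EVOL/MoL_Evolution/SelectBoundConds": 2.0,
    "CCTK_EVOL/Analysis/Barrier": 0.4,
}
for path, seconds in barrier_timers(timers):
    print(f"{seconds:10.1f} s  {path}")
```

Running this over each process's report and comparing the top entries shows which function's barrier dominates and on which processes.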
When I have run tests with schedule barriers enabled, they did not impose a penalty anywhere near as large as the one you are describing: maybe 30%, no more.
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder