[Users] logic of scheduling SelectBoundConds in McLachlan?

Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] bernard.j.kelly at nasa.gov
Wed Feb 20 14:33:19 CST 2013


Hi Ian.

None of the TimerReport.XXXXXX.txt files contains "Barrier" (in any capitalisation). Does it depend on anything other than the parameters you specified? I'm attaching one process's TimerReport (only the last output, for compactness) and the associated parfile.

Bernard

From: Ian Hinder <ian.hinder at aei.mpg.de>
Date: Tuesday, February 19, 2013 4:16 PM
To: Erik Schnetter <schnetter at cct.lsu.edu>
Cc: Bernard Kelly <bernard.j.kelly at nasa.gov>, "users at einsteintoolkit.org" <users at einsteintoolkit.org>
Subject: Re: [Users] logic of scheduling SelectBoundConds in McLachlan?


On 19 Feb 2013, at 20:16, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

On Tue, Feb 19, 2013 at 1:24 PM, Kelly, Bernard J. (GSFC-660.0)[UNIVERSITY OF MARYLAND BALTIMORE COUNTY] <bernard.j.kelly at nasa.gov> wrote:
Hi Ian (and Frank and Erik). Thanks for the further insight on the
profiling.


[Please ignore the new mail that just came through with the 400KB
attachment. That was my first attempt that was held for moderation because
of the attachment size. Then I sent the slimmed-down attachments, but this
was still in the pipeline.]

I was looking at *all* the processor outputs (that is, all the
TimerReport.XXXXXX.txt files), but not necessarily at all fields in all of
them. I concentrated on the CCTK_EVOL section of the report, and then only
looked closely at discrepancies between a sample "longer SelectBoundConds"
processor and each of the five or six "shorter SelectBoundConds"
processors. I suppose that to do a more complete job, I'd have to start
scripting ...
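A comparison like this could be scripted. The sketch below is a minimal Python example, not part of the Cactus tools: it assumes only that each TimerReport.XXXXXX.txt file is plain text, and it lists, per file, the lines that mention a given timer name case-insensitively (the same check as searching for "Barrier" in any capitalisation). The function names and the glob pattern are hypothetical.

```python
import re
import sys
from pathlib import Path


def load_reports(directory, pattern="TimerReport.*.txt"):
    """Read every per-process TimerReport file into {file name: text}.

    The glob pattern is an assumption based on the file names in this thread.
    """
    return {
        path.name: path.read_text(errors="replace")
        for path in sorted(Path(directory).glob(pattern))
    }


def find_matches(reports, needle="barrier"):
    """Return {file name: [lines containing needle]}, ignoring case."""
    regex = re.compile(re.escape(needle), re.IGNORECASE)
    return {
        name: [line for line in text.splitlines() if regex.search(line)]
        for name, text in reports.items()
    }


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, lines in find_matches(load_reports(directory)).items():
        # Files whose match count differs from the rest deserve a closer look.
        print(f"{name}: {len(lines)} matching line(s)")
```

Passing a different `needle` (for example a routine name) gives a quick per-process count of where that timer appears.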

Anyway, I *hadn't* been using those profiling parameters before, so my
conclusions were probably dodgy, as you say. After your reply I re-enabled
them and restarted the run. Since the run is so slow, I'm now looking at the
TimerReports from early in the new run, and I no longer see any
discrepancies between different processors (that is, there don't seem to
be any "shorter SelectBoundConds" processors any more).

So if *all* the processors are showing essentially the same information,
and the "schedule_barriers" and "sync_barriers" are in place, then there's
no significant load imbalance? And yet it is slow as hell ...

With schedule barriers, load imbalance is hidden in these barriers. That is, you would need to measure how much time each process spends in these barriers. I expect that some processes will spend 0s there, while others will spend 50,000s there. That would be your load imbalance.
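The point that barriers hide load imbalance can be made concrete with a little arithmetic. The Python sketch below assumes each process's total barrier waiting time has already been extracted somehow (that step is not shown); the process that waits least is the one the others are waiting for, and the spread between the longest and shortest waits estimates the time lost to imbalance. The function name and the numbers, which only echo the 0 s versus 50,000 s example, are hypothetical.

```python
def barrier_imbalance(wait_times):
    """Summarise load imbalance from per-process barrier waiting times.

    wait_times maps a process rank to the seconds it spent waiting in
    barriers. Returns (rank of the slowest process, longest wait, spread).
    """
    # The slowest process arrives at each barrier last, so it waits least.
    slowest = min(wait_times, key=wait_times.get)
    longest = max(wait_times.values())
    # The spread approximates how long the fastest processes sat idle.
    spread = longest - wait_times[slowest]
    return slowest, longest, spread


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from this run.
    waits = {0: 0.0, 1: 50000.0, 2: 49000.0, 3: 48500.0}
    rank, longest, spread = barrier_imbalance(waits)
    print(f"slowest rank {rank}; longest wait {longest:.0f} s; spread {spread:.0f} s")
```

If every process reports a similar, small barrier time, the slowdown is more likely elsewhere than in load imbalance.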

When I added the sync barriers, I added timers on all the barriers. You should see timer entries named ".../Barrier". Do you see these, and are they taking a lot of time? The timer names are hierarchical, so you should be able to see which functions' barriers are causing the slowdown.
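Because the timer names are hierarchical, barrier time can be attributed to the routine that owns each barrier. The Python sketch below assumes the timers have already been read into a mapping from a slash-separated name, such as the hypothetical "CCTK_EVOL/SomeRoutine/Barrier", to seconds; it sums the time per parent path and sorts the largest first. Both the parsing step and the example names are assumptions, not the actual TimerReport format.

```python
from collections import defaultdict


def barrier_time_by_parent(timers):
    """Sum the time of every '.../Barrier' timer, keyed by its parent path.

    timers maps a slash-separated timer name to seconds. Returns a list of
    (parent path, total seconds) pairs, largest first.
    """
    totals = defaultdict(float)
    for name, seconds in timers.items():
        parent, sep, leaf = name.rpartition("/")
        # Only count leaf timers literally named "Barrier".
        if sep and leaf == "Barrier":
            totals[parent] += seconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical timer names and times, for illustration only.
    example = {
        "CCTK_EVOL/RoutineA/Barrier": 2.0,
        "CCTK_EVOL/RoutineB/Barrier": 5.0,
        "CCTK_EVOL/RoutineA": 1.0,
    }
    for parent, seconds in barrier_time_by_parent(example):
        print(f"{seconds:8.1f} s  {parent}")
```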

When I have done tests using schedule barriers, they did not impose a huge penalty like the one you are describing.  Maybe 30%, no more.

--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: TimerReport.000000.txt
Url: http://lists.einsteintoolkit.org/pipermail/users/attachments/20130220/ccf4d9c5/attachment-0001.txt 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: X2_U0_d8.2_1o96_McHahndol.par
Type: application/octet-stream
Size: 16114 bytes
Desc: X2_U0_d8.2_1o96_McHahndol.par
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20130220/ccf4d9c5/attachment-0001.obj 

