[Users] Benchmarking results for McLachlan rewrite

Ian Hinder ian.hinder at aei.mpg.de
Fri Jul 24 16:32:22 CDT 2015


On 24 Jul 2015, at 23:01, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> On Fri, Jul 24, 2015 at 3:43 PM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
> 
> On 24 Jul 2015, at 20:39, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
> 
>> On Fri, Jul 24, 2015 at 1:58 PM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>> 
>> On 24 Jul 2015, at 19:42, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>> 
>>> On Fri, Jul 24, 2015 at 1:39 PM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>>> 
>>> On 24 Jul 2015, at 19:15, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>> 
>>>> On Fri, Jul 24, 2015 at 11:57 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>>>> 
>>>> On 8 Jul 2015, at 16:53, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>>>> 
>>>>> 
>>>>> On 8 Jul 2015, at 15:14, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>>>> 
>>>>>> I added a second benchmark, using a Thornburg04 patch system, 8th order finite differencing, and 4th order patch interpolation. The results are
>>>>>> 
>>>>>> original: 8.53935e-06 sec
>>>>>> rewrite:  8.55188e-06 sec
>>>>>> 
>>>>>> this time with 1 thread per MPI process, since that was most efficient in both cases. Most of the time is spent in inter-patch interpolation, which is much more expensive than in a "regular" case since this benchmark is run on a single node and hence with very small grids.
>>>>>> 
>>>>>> With these numbers under our belt, can we merge the rewrite branch?
>>>>> 
>>>>> The "jacobian" benchmark that I gave you was still a pure kernel benchmark, involving no interpatch interpolation.  It just measured the speed of the RHSs when Jacobians were included.  I would also not use a single-threaded benchmark with very small grid sizes; this might have been fastest in this artificial case, but in practice I don't think we would use that configuration.  The benchmark you have now run seems to be more of a "complete system" benchmark, which is useful, but different.
>>>>> 
>>>>> I think it is important that the kernel itself has not gotten slower, even if the kernel is not currently a major contributor to runtime.  We specifically split out the advection derivatives because they made the code with 8th order and Jacobians a fair bit slower.  I would just like to see that this is not still the case with the new version, which has changed the way this is handled.
>>>> 
>>>> I have now run my benchmarks on both the original and the rewritten McLachlan.  I seem to find that the ML_BSSN_* functions in
>>>> Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns, excluding the constraint calculations, are between 11% and 15% slower with the rewrite branch, depending on the details of the evolution.  See attached plot.  This is on Datura with quite old CPUs (Intel Xeon CPU X5650 2.67GHz).
>>>> 
>>>> What exactly do you measure -- which bins or routines? Does this involve communication? Are you using thorn Dissipation?
>>> 
>>> 
>>> I take all the timers in Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns that start with ML_BSSN_ and eliminate the ones containing "constraints" (case insensitive).  This is running on one node with two MPI processes and 6 threads per process.  Threads are correctly bound to cores.  There is ghostzone exchange between the processes, so yes, there is communication in the ML_BSSN_SelectBCs SYNC calls, but it is node-local.
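
In code, the timer selection above boils down to the following predicate.  This is only an illustration, with a hand-rolled case-insensitive search; it is not the actual analysis script, and the function names here are made up.

  /* Illustration only, not the actual analysis script.  Select ML_BSSN_*
     timers and drop the constraint calculations, matching "constraints"
     case-insensitively. */
  #include <string.h>
  #include <ctype.h>

  /* Case-insensitive substring test (strcasestr is not standard C). */
  static int contains_ci(const char *s, const char *sub)
  {
    const size_t n = strlen(sub);
    for (; *s != '\0'; ++s) {
      size_t i = 0;
      while (i < n
             && tolower((unsigned char) s[i]) == tolower((unsigned char) sub[i]))
        ++i;
      if (i == n)
        return 1;
    }
    return 0;
  }

  static int timer_selected(const char *name)
  {
    return strncmp(name, "ML_BSSN_", 8) == 0
           && !contains_ci(name, "constraints");
  }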
>>> 
>>> Can you include thorn Dissipation in the "before" case, and use McLachlan's dissipation in the "after" case?
>> 
>> There is no dissipation in either case.
>> 
>> The output data is in
>> 
>> 	http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334
>> 	http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542
>> 
>> including the parameter files.
>> 
>> Actually, what I said before was wrong; the timers I am using are under "thorns", not "syncs", so even the node-local communication should not be counted.
>> 
>> McLachlan has not been optimized for runs without dissipation. If you think this is important, then we can introduce a special case. I expect this to improve performance. However, running BSSN without dissipation is not what one would do in production, so I didn't investigate this case.
> 
> I agree that runs without dissipation are not relevant, but since I usually use the Dissipation thorn, I didn't include dissipation in the benchmark, which was a benchmark of McLachlan alone.  I assume that McLachlan now always calculates the dissipation term, even when it is zero, and that this is what you mean by "not optimised"?  This will introduce a performance regression for any simulation which takes its dissipation from the Dissipation thorn (if this is the reason for the increased benchmark time, then presumably only at the ~15% level for the kernel, and correspondingly less for a whole simulation).  Since McLachlan's dissipation was previously very slow, this is presumably what most existing parameter files use.
> 
> Regarding switching to McLachlan for dissipation: McLachlan's dissipation is a bit more limited than the Dissipation thorn's.  It looks like McLachlan is hard-coded to use dissipation of order 1+fdOrder, rather than allowing the dissipation order to be chosen separately; lower orders are sometimes used as an optimisation, the effect on convergence being judged to be minimal.  More critically, there is no way to specify different dissipation orders on different refinement levels, which is typically done in production binary simulations.
> 
> In other words, you are asking for a version of ML_BSSN where it is efficient to not use dissipation. Currently, that means that dissipation is disabled. The question is -- should this be the default?
> 
> Do you think it is faster to use dissipation from McLachlan than to use that provided by Dissipation?
> 
> Yes, I think so. 

I don't know.  Without knowing performance numbers, it is difficult to judge.  Since people may be using McLachlan's dissipation in their parameter files (even though it is slow), it's probably not a good idea to disable it by default. 

Is it possible to make McLachlan efficient when dissipation is disabled while keeping the dissipation code in place, e.g. by wrapping it in a conditional?  If the condition is a scalar, this should be fine even with vectorisation, no?
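
Something like the following is what I have in mind.  This is only a rough sketch with made-up names (epsDiss, rhs, u, diss_stencil), not the actual Kranc-generated code: the test is on a runtime scalar, so it is loop-invariant, the vectorised loop body is unchanged when dissipation is enabled, and the stencil is never evaluated when it is disabled.

  /* Rough sketch only; not the real McLachlan/Kranc code.  All names here
     (epsDiss, rhs, u, diss_stencil) are placeholders. */

  /* Stand-in for the (expensive) high-order dissipation stencil. */
  static inline double diss_stencil(const double *u, int i)
  {
    return u[i-2] - 4.0*u[i-1] + 6.0*u[i] - 4.0*u[i+1] + u[i+2];
  }

  void add_dissipation(double *restrict rhs, const double *restrict u,
                       double epsDiss, int npoints)
  {
    if (epsDiss != 0.0) {
      /* Scalar, loop-invariant guard: the loop body is straight-line code
         and vectorises exactly as it would without the guard. */
      for (int i = 2; i < npoints - 2; ++i)
        rhs[i] += epsDiss * diss_stencil(u, i);
    }
    /* epsDiss == 0: the stencil is never evaluated, so disabling
       dissipation costs only one scalar comparison per loop nest. */
  }

In the generated code the guard would of course have to wrap the dissipation contribution inside the main RHS loops rather than a separate pass like this, but the same argument applies: a branch on a scalar should not interfere with vectorisation.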

-- 
Ian Hinder
http://members.aei.mpg.de/ianhin
