<html><head><meta http-equiv="Content-Type" content="text/html charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On 24 Jul 2015, at 20:39, Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu">schnetter@cct.lsu.edu</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr">On Fri, Jul 24, 2015 at 1:58 PM, Ian Hinder <span dir="ltr"><<a href="mailto:ian.hinder@aei.mpg.de" target="_blank">ian.hinder@aei.mpg.de</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><br><div><span class=""><div>On 24 Jul 2015, at 19:42, Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>> wrote:</div><br><blockquote type="cite"><div dir="ltr">On Fri, Jul 24, 2015 at 1:39 PM, Ian Hinder <span dir="ltr"><<a href="mailto:ian.hinder@aei.mpg.de" target="_blank">ian.hinder@aei.mpg.de</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><span><br><div><div>On 24 Jul 2015, at 19:15, Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>> wrote:</div><br><blockquote type="cite"><div dir="ltr">On Fri, Jul 24, 2015 at 11:57 AM, Ian Hinder <span dir="ltr"><<a href="mailto:ian.hinder@aei.mpg.de" target="_blank">ian.hinder@aei.mpg.de</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><br><div><span><div>On 8 Jul 2015, at 16:53, Ian Hinder <<a 
href="mailto:ian.hinder@aei.mpg.de" target="_blank">ian.hinder@aei.mpg.de</a>> wrote:</div><br><blockquote type="cite"><div style="word-wrap:break-word"><br><div><div>On 8 Jul 2015, at 15:14, Erik Schnetter <<a href="mailto:schnetter@cct.lsu.edu" target="_blank">schnetter@cct.lsu.edu</a>> wrote:</div><br><blockquote type="cite"><div dir="ltr">I added a second benchmark, using a Thornburg04 patch system, 8th order finite differencing, and 4th order patch interpolation. The results are<div><br></div><div><div style="margin:0px;font-size:10px;font-family:Menlo">original: 8.53935e-06 sec</div><div style="margin:0px;font-size:10px;font-family:Menlo">rewrite: 8.55188e-06 sec</div><div style="margin:0px;font-size:10px;font-family:Menlo"><br></div><div style="margin:0px;font-size:10px;font-family:Menlo"><span style="font-family:arial,sans-serif;font-size:small">this time with 1 thread per MPI process, since that was most efficient in both cases. Most of the time is spent in inter-patch interpolation, which is much more expensive than in a "regular" case since this benchmark is run on a single node and hence with very small grids.</span><br></div><div style="margin:0px;font-size:10px;font-family:Menlo"><span style="font-family:arial,sans-serif;font-size:small"><br></span></div><div style="margin:0px;font-size:10px;font-family:Menlo"><span style="font-family:arial,sans-serif;font-size:small">With these numbers under our belt, can we merge the rewrite branch?</span></div></div></div></blockquote><div><br></div><div>The "jacobian" benchmark that I gave you was still a pure kernel benchmark, involving no interpatch interpolation. It just measured the speed of the RHSs when Jacobians were included. I would also not use a single-threaded benchmark with very small grid sizes; this might have been fastest in this artificial case, but in practice I don't think we would use that configuration. 
The benchmark you have now run seems to be more of a "complete system" benchmark, which is useful, but different.</div><div><br></div><div>I think it is important that the kernel itself has not gotten slower, even if the kernel is not currently a major contributor to runtime. We specifically split out the advection derivatives because they made the code with 8th order and Jacobians a fair bit slower. I would just like to see that this is not still the case with the new version, which has changed the way this is handled.</div></div></div></blockquote><div><br></div></span><div>I have now run my benchmarks on both the original and the rewritten McLachlan. I seem to find that the ML_BSSN_* functions in</div><div>Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns, excluding the constraint calculations, are between 11% and 15% slower with the rewrite branch, depending on the details of the evolution. See attached plot. This is on Datura with quite old CPUs (Intel Xeon CPU X5650 2.67GHz).</div></div></div></blockquote><div><br></div><div>What exactly do you measure -- which bins or routines? Does this involve communication? Are you using thorn Dissipation?</div></div></div></div></blockquote></div><div><br></div></span><div>I take all the timers in Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns that start with ML_BSSN_ and eliminate the ones containing "constraints" (case insensitive). This is running on two processes, one node, 6 threads per node. Threads are correctly bound to cores. 
There is ghostzone exchange between the processes, so yes, there is communication in the ML_BSSN_SelectBCs SYNC calls, but it is node-local.</div></div></blockquote><div><br></div><div>Can you include thorn Dissipation in the "before" case, and use McLachlan's dissipation in the "after" case?</div></div></div></div></blockquote><div><br></div></span><div>There is no dissipation in either case.</div><div><br></div><div>The output data is in</div><div><br></div><div><span style="white-space:pre-wrap">        </span><a href="http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334" target="_blank">http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334</a></div><div><span style="white-space:pre-wrap">        </span><a href="http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542" target="_blank">http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542</a></div><div><br></div><div>including the parameter files.</div><div><br></div><div>Actually, what I said before was wrong; the timers I am using are under "thorns", not "syncs", so even the node-local communication should not be counted.</div></div></div></blockquote><div><br></div><div>McLachlan has not been optimized for runs without dissipation. If you think this is important, then we can introduce a special case. I expect this to improve performance. However, running BSSN without dissipation is not what one would do in production, so I didn't investigate this case.</div></div></div></div></blockquote><div><br></div><div>I agree that runs without dissipation are not relevant, but since I usually use the Dissipation thorn, I didn't include it in the benchmark, which was a benchmark of McLachlan. I assume that McLachlan now always calculates the dissipation term, even when it is zero, and that is what you mean by "not optimised"? 
This will introduce a performance regression (if this is the reason for the increased benchmark time, then presumably only on the level of ~15% for the kernel, hence less for a whole simulation) for any simulation which uses dissipation from the Dissipation thorn. Since McLachlan's dissipation was previously very slow, this is presumably what most existing parameter files use. </div><div><br></div><div>Regarding switching to use McLachlan for dissipation: McLachlan's dissipation is a bit more limited than the Dissipation thorn; it looks like McLachlan is hard-coded to use dissipation of order 1+fdOrder, rather than the dissipation order being chosen separately. Sometimes lower orders are used as an optimisation (the effect on convergence being judged to be minimal). More critically, there is no way to specify different dissipation orders on different refinement levels, which is typically done in production binary simulations.</div><div><br></div><div>Do you think it is faster to use dissipation from McLachlan than to use that provided by Dissipation?</div><div><br></div></div><div apple-content-edited="true">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>-- </div><div>Ian Hinder</div><div><a href="http://members.aei.mpg.de/ianhin">http://members.aei.mpg.de/ianhin</a></div></div></div></div></div>
</div>
<br></body></html>