[Users] cactus performance

Jose Fiestas Iquira jafiestas at lbl.gov
Thu Mar 29 01:55:30 CDT 2012


Hello,

I reduced the simulation time by setting Cactus::cctk_final_time = 0.01 in
order to measure performance with CrayPat. It ran only 8 iterations. I used
16 and 24 cores for testing, and obtained almost the same performance
(~1310 s simulation time, and ~16 MFlops).

This reminds me of Fig. 2 in the reference you sent,
http://arxiv.org/abs/1111.3344

which I don't really understand. I would expect shorter times with a larger
number of cores. Why does that not happen here?

I am using McLachlan to simulate a binary system, so all my questions
concern this specific application. Do you think it will scale, in the sense
that the simulation time becomes shorter the more cores I use?
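To make the comparison concrete, here is a minimal sketch (in plain Python, not part of Cactus or CrayPat) of the strong-scaling check implied above: with ideal scaling, the wall-clock time should drop in proportion to the core count, and the ratio of the ideal time to the measured time gives the parallel efficiency. The numbers are the measured values quoted in this message (~1310 s on both 16 and 24 cores).

```python
def parallel_efficiency(t_base: float, n_base: int, t: float, n: int) -> float:
    """Efficiency of a run on n cores relative to a base run on n_base cores.

    Ideal strong scaling: t_ideal = t_base * n_base / n.
    Efficiency = t_ideal / t (1.0 means perfect scaling).
    """
    t_ideal = t_base * n_base / n
    return t_ideal / t

# Measured values from this thread: same ~1310 s on 16 and 24 cores.
eff = parallel_efficiency(t_base=1310.0, n_base=16, t=1310.0, n=24)
print(f"efficiency at 24 cores: {eff:.2f}")  # ~0.67, i.e. no speedup observed
```

An efficiency near 1.0 at 24 cores would mean the run sped up as expected; the ~0.67 here reflects that the measured time did not change at all, consistent with the run being too short (only 8 iterations) or too small per core to scale.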

Thanks,
Jose



On Wed, Mar 21, 2012 at 5:08 AM, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> On Tue, Mar 20, 2012 at 10:45 PM, Frank Loeffler <knarf at cct.lsu.edu>
> wrote:
> > Hi,
> >
> > On Tue, Mar 20, 2012 at 05:14:38PM -0700, Jose Fiestas Iquira wrote:
> >> Is there documentation about the performance of Cactus ETK on large
> machines? I
> >> have some questions regarding best performance according to initial
> >> conditions, calculation time required, etc.
> >
> > Performance very much depends on the specific setup. One poorly scaling
> > function can ruin an otherwise perfect run.
> >
> >> If there are performance plots like Flops vs. Number of nodes would
> help me
> >> as well.
> >
> > Flops are very problem-dependent. There is no such thing as flops/s for
> > Cactus, not even for one given machine. If we talk about the Einstein
> > equations and a typical production run, I would expect a few percent of
> > the peak performance of any given CPU, as we are most of the time bound
> by
> > memory bandwidth.
>
> I would like to add some more numbers to Frank's description:
>
> On some problems (e.g. evaluating the BSSN equations with a
> higher-order stencil), I have measured more than 20% of the
> theoretical peak performance. The bottleneck seems to be L1 data cache
> accesses, because the BSSN equation kernels require a large number of
> local (temporary) variables.
>
> If you look for parallel scaling, then e.g.
> <http://arxiv.org/abs/1111.3344> contains a scaling graph for the BSSN
> equations evolved with mesh refinement. This shows that, for this
> benchmark, the Einstein Toolkit scales well to more than 12k cores.
>
> -erik
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>

