[Users] Benchmarking

Erik Schnetter schnetter at cct.lsu.edu
Fri May 5 10:52:48 CDT 2017


On Fri, May 5, 2017 at 11:37 AM, Khamesra, Bhavesh <
bhaveshkhamesra at gatech.edu> wrote:
>
> Hi Erik, thanks for the reply. I tried playing with the num-threads
> option in the machine files and was able to run QC0 on the development
> node. Reducing num-threads to 17 while keeping the number of cores at 64

Bhavesh

This combination doesn't make sense; the number of cores needs to be a
multiple of the number of threads. I would try 68 cores and 4 threads; as I
mentioned, using fewer threads is currently more efficient.
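
For concreteness, a submission along these lines gives a sane split on a
single KNL node (the simulation name, parameter file, machine name, and
walltime below are only placeholders for whatever you actually use):

  ./simfactory/bin/sim create-submit qc0_knl_bench \
      --parfile=par/qc0.par --machine=stampede-knl \
      --procs=68 --num-threads=4 --walltime=2:00:00

Here --procs is the total number of cores requested and --num-threads the
number of OpenMP threads per MPI process, so this runs 17 MPI processes
with 4 threads each.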

> showed some increase in the speed, but it is still quite low - around
> 13-16M/hour compared to the 55-65M/hour on Stampede. For GW150914, the
> speed on KNL is around 3.5-4M/hour compared to 12M/hour on Stampede. I
> also briefly looked at the TimerReport output, but no particular thorn
> stood out. I will study it in more detail.

When you measured these speeds, how many nodes did you use each time?

Yes, you will need to look at the timer output to find out, e.g., whether
the time is spent doing I/O.
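
If it is not already active in your parameter file, the TimerReport thorn
gives a convenient periodic summary. A minimal sketch of the parameters I
would set (the specific values are only examples):

  ActiveThorns = "TimerReport"
  TimerReport::out_every                  = 512
  TimerReport::output_all_timers_together = yes
  TimerReport::n_top_timers               = 40

This prints the most expensive timers every 512 iterations, which is
usually enough to see whether the time goes into the evolution itself,
into regridding, or into I/O.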

You can also look at how much memory is used per core. As a general rule,
using more memory per core is more efficient. If you are using only a very
small fraction of the available memory (<10%), or if there are many more
ghost points than interior points, then your setup is likely inefficient.
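
To make the ghost-point criterion concrete (with purely illustrative
numbers): with 3 ghost zones on each face, a local block of 20^3 interior
points actually carries 26^3 = 17576 points, of which fewer than half
(8000) are interior; with 40^3 interior points per component the ghost
fraction drops to roughly a third. If your per-process blocks are that
small, fewer MPI processes for the same problem size will reduce this
overhead.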

> In general, how can I find the optimal values of these 'tuning knobs'
> (other than by trial and error), and what are the constraints on them?
> What are the general options/parameters I can change to boost the
> performance? I also had several questions about various options in the
> machine files and about optimization and MPI in general. Can you suggest
> some reference where I can read more about this?

That is a very good question. I do not have a good answer to it. (This is
why it is a good question.) This information is, unfortunately, only passed
on from grad student to grad student (or from postdoc to postdoc). There
really should be a tutorial and some larger documentation that addresses
this.

Before you can optimize the values of the knobs, you need to know what
knobs there actually are.

The best advice I can offer is to ask many questions while keeping notes,
or to visit an experienced Einstein Toolkit user and camp out near their
desk, asking many questions. You can also ask people for their tuned
parameter files and compare them with yours, and ask about their run times
and speeds so that you have a basis for comparison. (You are already doing
this.)

There are three kinds of knobs that influence performance:
- knobs that change the physics; usually, when you write a paper, you want
to keep these fixed
- knobs that change the numerical approximation, such as the resolution or
grid structure
- knobs that change how the simulation runs, such as the number of cores
or threads, or how often to do output

When running a simulation, you want to begin by tuning the third kind of
knob. Once you have found a reasonable optimum, you can move on to the
second kind and experiment with what resolution etc. you need to answer a
particular physics question.
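
A cheap way to do this first round of tuning is a short scan over the
process/thread split at fixed total core count, comparing the speed
(M/hour) that each run reports. A rough sketch, again with placeholder
simulation names, parameter file, machine name, and walltime:

  # one KNL node, 68 cores; each thread count below divides 68 evenly
  for nthreads in 1 2 4 17; do
    ./simfactory/bin/sim create-submit gw150914_nt${nthreads} \
        --parfile=par/GW150914.par --machine=stampede-knl \
        --procs=68 --num-threads=${nthreads} --walltime=1:00:00
  done

A fraction of an hour of evolution per run is usually enough to read off a
stable speed.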

> Lastly, the crash of the GW150914 run in the normal queue doesn't seem
> to be due to this reason (but I may be wrong). The error file shows
> segmentation fault errors. I was browsing through past tickets and found
> that you had also encountered a similar segfault issue on KNL. Were you
> able to resolve it? I am attaching the error file; could you please look
> at it?

I have heard of such a segfault before. I assume it is caused by using too
many processes or too many threads for a particular resolution. I have not
yet reproduced it, and I don't know what causes it. It would be helpful if
you could produce a stack backtrace or similar. On the other hand, if this
segfault only appears for very inefficient configurations, then there is no
urgent need to debug this, as people won't be interested in using such
configurations anyway.
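
If you do want to try capturing one anyway (a sketch, not a
Stampede-specific recipe): rebuild with debugging symbols, e.g. by setting
in the option list

  DEBUG    = yes
  OPTIMISE = no

and then, if the batch system leaves a core file behind, inspect it with
gdb:

  gdb ./exe/cactus_sim core.<pid>
  (gdb) bt

where cactus_sim stands for whatever your configuration's executable is
actually called. Whether core files are written at all depends on the
site's ulimit settings.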

All the best. Please keep asking.

-erik


>
>
> Thanks
>
> .............................
>
> Bhavesh Khamesra
>
> Graduate Student
>
> Centre of Relativistic Astrophysics
>
> Georgia Institute of Technology
>
> ________________________________
> From: schnetter at gmail.com <schnetter at gmail.com> on behalf of Erik
> Schnetter <schnetter at cct.lsu.edu>
> Sent: Wednesday, May 3, 2017 4:59:16 PM
> To: Khamesra, Bhavesh
> Cc: users at einsteintoolkit.org
> Subject: Re: [Users] Benchmarking
>
> Bhavesh
>
> To be exact, the remedy for this particular Slab error is not to use
> more cores, but to use more MPI processes. You can keep the number of
> cores constant if you reduce the number of OpenMP threads per MPI
> process.
>
> Given that you are benchmarking, you should experiment with these
> parameters anyway, as performance can depend crucially on them. Usually,
> using fewer threads and more processes is more efficient for small core
> counts.
>
> Finally, comparing only the overall run time is not sufficient to make a
> statement about performance. Each run has several "tuning knobs", and
> choosing the right values for these is important to achieve good
> performance. Using the default settings will often lead to quite poor
> performance. Cactus timer output, as well as experience with performing
> runs on HPC systems, is indispensable for getting good performance.
>
> -erik
>
>
> On Tue, May 2, 2017 at 5:09 PM, Khamesra, Bhavesh <
> bhaveshkhamesra at gatech.edu> wrote:
>>
>> Hi, I have sent a pull request with the option list for Stampede-KNL to
>> the simfactory repo on Bitbucket. I have tested this with a couple of
>> thornlists, including einsteintoolkit.th and GW150914.th. This is still
>> in an experimental stage, so it would be great if someone else could
>> also test it.
>>
>>
>> Working on benchmarking the performance on Stampede KNL, I was able to
>> do some test runs using the GW150914 simulation. However, I have been
>> running into some issues with it.
>>
>>
>> 1. I tried running the QC0 simulation on both Stampede SandyBridge and
>> KNL. While it runs fine on Stampede, it crashes on KNL with this error:
>>
>> while executing schedule bin BoundaryConditions, routine
>> RotatingSymmetry180::Rot180_ApplyBC in thorn RotatingSymmetry180, file
>> /work/04082/tg833814/Cactus_ETK_dev/arrangements/CactusNumerical/RotatingSymmetry180/src/rotatingsymmetry180.c:460:
>>   -> TAT/Slab can only be used if there is a single local component per
>> MPI process
>> TACC: MPI job exited with code: 134
>>
>> I looked at previous tickets and found that the solution is to increase
>> the number of cores. But if the same simulation can be run on Stampede
>> on 64 cores, why does it require a higher number of cores on KNL? Or is
>> it some other issue?
>>
>> 2. I was able to run GW150914 on the development queue (68 cores); the
>> speed on Stampede was around 12.9M, while on KNL it was around 2.4M. To
>> understand the reason for such low speeds, I tried running this on a
>> higher number of cores on Stampede (128), and it runs at a speed of
>> around 20.9M (tested for 12 hours). However, when doing the same in the
>> normal queue on KNL, the simulation crashes after a couple of iterations
>> with a segmentation fault error. Also, before crashing, the speed on KNL
>> is around 4.2M. I have attached the error file of the simulation.
>>
>>
>> Could someone please look at this? Let me know if you need any other
>> information.
>>
>>
>> Thanks
>>
>> .............................
>>
>> Bhavesh Khamesra
>>
>> Graduate Student
>>
>> Centre of Relativistic Astrophysics
>>
>> Georgia Institute of Technology
>>
>>
>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>



--
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/