[Users] ET on KNL.

Wed Mar 1 14:43:33 CST 2017

On 1 Mar 2017, at 15:07, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> On Wed, Mar 1, 2017 at 7:04 AM, Eloisa Bentivegna <eloisa.bentivegna at ct.infn.it> wrote:
> On 28/02/17 23:17, David Radice wrote:
> > Hello Eloisa,
> >
> > sorry for the delay in the reply. For the record, I did manage to
> > compile and run ET on KNL (stampede), but I did not manage to run any
> > benchmark with it yet. The current status is:
> >
> > * intel-17: the compiler fails to compile Carpet and either gives an
> > internal error or segfaults.
> > * gcc-6.3: used to compile and run with Erik's spack installation (it
> > is currently broken). I did not really manage to benchmark it since
> > even a low-resolution TOV test did not run to completion (meaning less
> > than 4 coarse grid steps) within 30 minutes on 4 nodes.
> >
> > This was using the current stable release of the ET (2016-11) and
> > WhiskyTHC. You might have more luck with GRHydro / pure-vacuum runs.
> 
> Hi David and all,
> 
> thanks for all the help. It turned out that consolidating my
> configuration made things significantly better: I was using Intel 16 (to
> avoid the Carpet problem with Intel 17) along with a strange mix of
> libraries (mostly compiled with Intel 17, the only ones available on
> Marconi), and that seemed to impact the performance quite strongly. With
> everything Intel 17 (and using -no-vec on bbox.cc), I now obtain a
> runspeed on a Marconi KNL node which is around 80% of a Xeon E5 v4.
> 
> There are still some puzzling features, though. One is that using
> -no-vec, along with the settings:
> 
> VECTORISE                       = no
> VECTORISE_ALIGNED_ARRAYS        = no
> VECTORISE_INLINE                = no
> VECTORISE_ALIGN_FOR_CACHE       = no
> VECTORISE_ALIGN_INTERIOR        = no
> 
> I would expect that VECTORISE=yes (keep the others to "no") might improve performance, in particular if you do not use hyperthreading, so that each thread has more L1 cache space available.
> 
> in my optionlist, I obtain essentially the same throughput. This is a
> vacuum McLachlan run with very little else turned on (but I can run a
> QC0 benchmark for definiteness, if people are interested). I too am
> using the November release.
> 
> Second, hyperthreading decreases the runspeed significantly. I am using
> 272 threads on the 68-core KNL, and as far as I can gather from the
> Carpet output, all of the cores are engaged. More cores are reported,
> however, than available on the node:
> 
> INFO (Carpet): MPI is enabled
> INFO (Carpet): Carpet is running on 1 processes
> INFO (Carpet): This is process 0
> INFO (Carpet): OpenMP is enabled
> INFO (Carpet): This process contains 272 threads, this is thread 0
> INFO (Carpet): There are 272 threads in total
> INFO (Carpet): There are 272 threads per process
> INFO (Carpet): This process runs on host r098c04s01, pid=2465840
> INFO (Carpet): This process runs on 272 cores: 0-271
> INFO (Carpet): Thread 0 runs on 1 core: 0
> INFO (Carpet): Thread 1 runs on 1 core: 68
> INFO (Carpet): Thread 2 runs on 1 core: 136
> INFO (Carpet): Thread 3 runs on 1 core: 204
> …
> 
> The nomenclature is inconsistent since it changes so often. This output looks correct. (Carpet cannot easily distinguish between hyperthreads and cores.) As long as there is only one thread per core, this is fine.
> 
> Notice that I am requesting hyperthreading by using num-smt=4 and
> num-threads=272. Is this correct?
> 
> This looks correct. You might also need to play with "ppn=" and "ppn-used=".

Eloisa: does Carpet report the vector size of the KNL as 8? From the Wikipedia entry, I would expect that to be the case, but I think you mentioned to me that it was reporting 4.
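
For reference, the expected figure follows from the register width: AVX-512 registers are 512 bits wide and a double is 64 bits. A minimal sketch of that arithmetic follows. The function name is made up for illustration, and reading a reported 4 as a sign of 256-bit AVX code is an assumption, not something confirmed in this thread.

```python
def doubles_per_vector(register_bits, element_bits=64):
    """Number of floating-point elements that fit in one SIMD register."""
    return register_bits // element_bits

# 512-bit registers (AVX-512, as on KNL) versus 256-bit registers (AVX/AVX2).
print("512-bit:", doubles_per_vector(512))
print("256-bit:", doubles_per_vector(256))
```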

Erik: is the thread assignment optimal? Back when I was playing with the KNCs, it was necessary to interleave the threads in a very specific way, so that threads on the same core were not blocked waiting for memory reads at the same time, due to being close to each other on the grid. I don't remember the exact details of this. I think that you want each core (i.e. each consecutive set of 4 reported cores, which are actually hardware threads) to have Carpet threads from very different parts of the grid. I'm not sure that the thread assignment above does this.
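
To make the question concrete, here is a minimal sketch of how the reported "cores" might map onto physical cores. It assumes the common Linux numbering on a 68-core KNL, in which logical CPUs c, c+68, c+136 and c+204 are the four hardware threads of physical core c. That numbering is an assumption and should be checked on the actual node (for example with lscpu). The function name is hypothetical; the thread-to-CPU pairs are only the four lines shown in the Carpet output.

```python
N_CORES = 68  # physical cores on the KNL discussed in this thread

def physical_core(logical_cpu, n_cores=N_CORES):
    """Physical core of a logical CPU, ASSUMING hardware-thread siblings
    are numbered c, c + n_cores, c + 2*n_cores, ..."""
    return logical_cpu % n_cores

# OpenMP thread -> logical CPU, copied from the Carpet output quoted above.
reported = {0: 0, 1: 68, 2: 136, 3: 204}

for thread, cpu in reported.items():
    print(f"thread {thread}: logical CPU {cpu} -> physical core {physical_core(cpu)}")
```

Under that assumption, the four threads shown all land on the same physical core. If consecutive OpenMP threads also work on neighbouring parts of the grid, that would be the situation the interleaving was meant to avoid.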

I attach a very brief report of some results I obtained in 2015 after attending a KNC workshop.
> Conclusions: By using 244 threads, with the domain split into tiles of size 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they become available, the MIC was able to outperform the single CPU by a factor of 1.5. The same tiling strategy was used on the CPU, as it has been found to give good performance there in the past. Since we have not yet optimised the code for the MIC architecture, we believe that further speed improvements will be possible, and that solving the Einstein equations on the MIC architecture should be feasible. 
> 
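
The tiling described in the conclusions can be illustrated with a short sketch. It splits a 3D grid into tiles of 8 × 4 × 4 points and lists them as independent work items, which a pool of threads could then pick up one at a time as they become free. The grid size and function name are hypothetical, and this is not the LoopControl implementation.

```python
from itertools import product

def tiles(shape, tile=(8, 4, 4)):
    """Yield (start, stop) index triples covering a grid of the given shape.

    Tiles that would extend past the upper boundary are clipped to the grid.
    """
    ranges = [range(0, n, t) for n, t in zip(shape, tile)]
    for start in product(*ranges):
        stop = tuple(min(s + t, n) for s, t, n in zip(start, tile, shape))
        yield start, stop

# Hypothetical 64^3 grid, chosen only for illustration.
work = list(tiles((64, 64, 64)))
print("number of tiles:", len(work))
print("first tile:", work[0])
print("last tile:", work[-1])
```

Each entry in the list is a self-contained block of the grid, so handing entries out dynamically corresponds to the "one per tile as they become available" assignment mentioned in the report.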

Eloisa, are you using LoopControl?  There are tiling parameters which can also help with performance on these devices.

-- 
Ian Hinder
http://members.aei.mpg.de/ianhin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: micbenchmarks.pdf
Type: application/pdf
Size: 135824 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/users/attachments/20170301/271a4eed/attachment-0001.pdf 