[Users] ET on KNL.

Ian Hinder ian.hinder at aei.mpg.de
Thu Mar 2 09:03:38 CST 2017


On 2 Mar 2017, at 14:37, Erik Schnetter <schnetter at cct.lsu.edu> wrote:

> I am currently redesigning the tiling infrastructure, in part to allow multithreading via Qthreads instead of OpenMP and to allow aligning arrays with cache-line boundaries. The new approach (different from the current LoopControl) is to choose a fixed tile size, either globally or per loop, and then assign individual tiles to threads. This also works well with DG derivatives, where the DG element size dictates a granularity for the tile size, and with the new efficient tiled derivative operators. Most of this is still in flux. I have seen large efficiency improvements in the RHS calculation, but two puzzling items remain:
> 
> (1) It remains more efficient to use MPI than multi-threading for parallelization, at least on regular CPUs. On KNL my results are still somewhat random.

When using MPI instead of multi-threading on the same number of cores, each component will be smaller, so more of it is likely to fit in the cache.  Would that explain this observation?
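
For concreteness, the tile-per-thread scheme you describe is, as I understand it, roughly the following.  This is only a sketch: the function name, the 8x4x4 tile size and the dummy RHS are illustrative, not your actual code.

#define TI 8   /* illustrative tile size */
#define TJ 4
#define TK 4

void rhs_tiled(int ni, int nj, int nk,
               const double *restrict u, double *restrict rhs)
{
  /* number of tiles in each direction, rounding up */
  int nti = (ni + TI - 1) / TI;
  int ntj = (nj + TJ - 1) / TJ;
  int ntk = (nk + TK - 1) / TK;

  /* one iteration of the collapsed loop = one tile; dynamic scheduling
     hands tiles to threads as they become free */
#pragma omp parallel for collapse(3) schedule(dynamic)
  for (int tk = 0; tk < ntk; ++tk)
  for (int tj = 0; tj < ntj; ++tj)
  for (int ti = 0; ti < nti; ++ti) {
    int k0 = tk*TK, k1 = k0+TK < nk ? k0+TK : nk;
    int j0 = tj*TJ, j1 = j0+TJ < nj ? j0+TJ : nj;
    int i0 = ti*TI, i1 = i0+TI < ni ? i0+TI : ni;
    for (int k = k0; k < k1; ++k)
    for (int j = j0; j < j1; ++j)
    for (int i = i0; i < i1; ++i) {
      int idx = i + ni*(j + nj*k);
      rhs[idx] = -u[idx];   /* stand-in for the real RHS stencil */
    }
  }
}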

> (2) MoL_Add is quite expensive compared to the RHS evaluation.

That is indeed odd.
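
Although, thinking about it some more: MoL_Add is essentially a linear combination of grid functions, i.e. a streaming update with essentially no arithmetic to hide the memory traffic behind, so if bandwidth is the bottleneck it would show up there first.  Per evolved grid function it does something like the following (an illustrative sketch, not the actual MoL source):

void mol_add(int npoints, double scale,
             const double *restrict rhs, double *restrict var)
{
  /* streaming var += scale*rhs: two loads and one store per point,
     two flops -- limited purely by memory bandwidth */
#pragma omp parallel for
  for (int idx = 0; idx < npoints; ++idx)
    var[idx] += scale * rhs[idx];
}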

> The main thing that has changed since our last round of thorough benchmarks is that CPUs have become much more powerful while memory bandwidth hasn't kept pace. I'm beginning to think that things such as vectorization or parallelization basically don't matter any more, as long as we ensure that we pull data from memory into caches efficiently.
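
For what it's worth, a back-of-envelope estimate points the same way (the numbers here are purely illustrative): if an RHS evaluation reads and writes of order a hundred double-precision grid functions per point, that is roughly 1 kB of memory traffic per point when nothing is reused from cache; at a stream bandwidth of order 100 GB/s, a node then cannot process more than roughly 10^8 points per second, no matter how well the arithmetic is vectorised or threaded.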
> 
> I have not yet collected PAPI statistics.
> 
> -erik
> 
> 
> On Thu, Mar 2, 2017 at 6:57 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
> 
> On 1 Mar 2017, at 22:10, David Radice <dradice at astro.princeton.edu> wrote:
> 
>> Hi Ian, Erik, Eloisa,
>> 
>>> I attach a very brief report of some results I obtained in 2015 after attending a KNC workshop.
>>>> Conclusions: By using 244 threads, with the domain split into tiles of size 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they become available, the MIC was able to outperform the single CPU by a factor of 1.5. The same tiling strategy was used on the CPU, as it has been found to give good performance there in the past. Since we have not yet optimised the code for the MIC architecture, we believe that further speed improvements will be possible, and that solving the Einstein equations on the MIC architecture should be feasible.
>>>> 
>>> Eloisa, are you using LoopControl?  There are tiling parameters which can also help with performance on these devices.
>> 
>> How does tiling work with LoopControl? Is it documented somewhere? I naively thought that the point of tiling was to have chunks of data stored contiguously in memory...
> 
> Ideally yes, but this would need to be done in Carpet rather than LoopControl, and I think you would then require ghost zones around each tile.  Since we have huge numbers of ghost zones, I'm not sure it would be practical.
> 
> LoopControl has parameters such as tilesize and loopsize, but Erik would know better how to use these. It was a long time ago, and I can't immediately find my parameter files.
> 
>> BTW, at the moment I am using this macro for all of my loop needs:
>> 
>> #define UTILS_LOOP3(NAME,I,SI,EI,J,SJ,EJ,K,SK,EK)                              \
>>    _Pragma("omp for collapse(3)")                                             \
>>    for(int I = SI; I < EI; ++I)                                               \
>>    for(int J = SJ; J < EJ; ++J)                                               \
>>    for(int K = SK; K < EK; ++K)
>> 
>> How would I convert it to something equivalent using LoopControl?
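
From memory, the LoopControl equivalent looks roughly like the following; LoopControl does its own tiling and thread decomposition inside an OpenMP parallel region, so the "omp for" goes away.  Please check loopcontrol.h for the exact macro signature -- treat this as a sketch rather than a definitive answer.

#include <loopcontrol.h>

/* sketch only -- check loopcontrol.h for the exact argument order;
   i,j,k are the loop indices, SI..EK the loop bounds, and ni,nj,nk
   the local array shape (e.g. cctk_lsh) */
#pragma omp parallel
LC_LOOP3(name, i, j, k,
         SI, SJ, SK,
         EI, EJ, EK,
         ni, nj, nk)
{
  /* loop body goes here */
}
LC_ENDLOOP3(name);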
>> 
>> Thanks,
>> 
>> David
>> 
>> PS. Seeing that Eloisa was able to compile bbox.cc with intel-17.0.0 by using -no-vec, I made a patch that disables vectorization using pragmas inside bbox.cc (to avoid having to compile that file manually):
>> 
>> https://bitbucket.org/eschnett/carpet/pull-requests/16/carpetlib-fix-compilation-with-intel-1700/diff
> 
> -- 
> Ian Hinder
> http://members.aei.mpg.de/ianhin
> 
> 
> 
> 
> -- 
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/

-- 
Ian Hinder
http://members.aei.mpg.de/ianhin
