[Users] ET on KNL.

Thu Mar 2 07:37:03 CST 2017

I am currently redesigning the tiling infrastructure, also to allow
multithreading via Qthreads instead of OpenMP and to allow for aligning
arrays with cache line boundaries. The new approach (different from the
current LoopControl) is to choose a fixed tile size, either globally or per
loop, and then assign individual tiles to threads. This also works well
with DG derivatives, where the DG element size dictates a granularity for
the tile size, and with the new efficient tiled derivative operators. Most of this
is still in flux. I have seen large efficiency improvements in the RHS
calculation, but two puzzling items remain:

(1) It remains more efficient to use MPI than multi-threading for
parallelization, at least on regular CPUs. On KNL my results are still
somewhat random.

(2) MoL_Add is quite expensive compared to the RHS evaluation.
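
For illustration only, here is a minimal sketch of the fixed-tile idea described above. It is not the new infrastructure itself. The function name scale_tiled, the stand-in kernel (out = 2 * in), and the 8 x 4 x 4 tile size (borrowed from the KNC report quoted below) are all assumptions. It uses OpenMP with schedule(dynamic) rather than Qthreads, so that each tile is handed to whichever thread is free next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed tile size; 8 x 4 x 4 is the size quoted in the
   KNC report, not necessarily what the new infrastructure uses. */
enum { TILE_I = 8, TILE_J = 4, TILE_K = 4 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Number of tiles of size t needed to cover n points. */
static int num_tiles(int n, int t) { return (n + t - 1) / t; }

/* Apply a stand-in kernel (out = 2 * in) to an ni x nj x nk array with
   i varying fastest. Each tile is one unit of work for one thread. */
void scale_tiled(const double *in, double *out, int ni, int nj, int nk)
{
  assert(ni > 0 && nj > 0 && nk > 0);
  const int nti = num_tiles(ni, TILE_I);
  const int ntj = num_tiles(nj, TILE_J);
  const int ntk = num_tiles(nk, TILE_K);
  const int ntiles = nti * ntj * ntk;

  /* Without OpenMP the pragma is ignored and the loop runs serially. */
#pragma omp parallel for schedule(dynamic)
  for (int t = 0; t < ntiles; ++t) {
    /* Recover the tile's origin from the flat tile index. */
    const int i0 = (t % nti) * TILE_I;
    const int j0 = (t / nti % ntj) * TILE_J;
    const int k0 = (t / (nti * ntj)) * TILE_K;
    /* Clip partial tiles at the upper domain boundary. */
    const int i1 = min_int(i0 + TILE_I, ni);
    const int j1 = min_int(j0 + TILE_J, nj);
    const int k1 = min_int(k0 + TILE_K, nk);

    for (int k = k0; k < k1; ++k)
      for (int j = j0; j < j1; ++j)
        for (int i = i0; i < i1; ++i) {
          const size_t idx = (size_t)i + (size_t)ni * ((size_t)j + (size_t)nj * (size_t)k);
          out[idx] = 2.0 * in[idx];
        }
  }
}
```

With GCC or Clang this would be compiled with -fopenmp to enable threading. The sketch does not address cache-line alignment of the arrays; that would need an aligned allocator and padded strides on top of this.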

The main thing that changed since our last round of thorough benchmarks is
that CPUs have become much more powerful while memory bandwidth hasn't. I'm
beginning to think that things such as vectorization or parallelization
basically don't matter any more if we ensure that we pull data from memory
into caches efficiently.

I have not yet collected PAPI statistics.

-erik

On Thu, Mar 2, 2017 at 6:57 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:

>
> On 1 Mar 2017, at 22:10, David Radice <dradice at astro.princeton.edu> wrote:
>
> Hi Ian, Erik, Eloisa,
>
> I attach a very brief report of some results I obtained in 2015 after
> attending a KNC workshop.
>
> Conclusions: By using 244 threads, with the domain split into tiles of
> size 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they
> become available, the MIC was able to outperform the single CPU by a factor
> of 1.5. The same tiling strategy was used on the CPU, as it has been found
> to give good performance there in the past. Since we have not yet optimised
> the code for the MIC architecture, we believe that further speed
> improvements will be possible, and that solving the Einstein equations on
> the MIC architecture should be feasible.
>
> Eloisa, are you using LoopControl?  There are tiling parameters which can
> also help with performance on these devices.
>
>
> How does tiling work with LoopControl? Is it documented somewhere? I
> naively thought that the point of tiling was to have chunks of data stored
> contiguously in memory...
>
>
> Ideally yes, but this would need to be done in Carpet not LoopControl, and
> I think you would then require ghost zones around each tile.  Since we have
> huge numbers of ghost zones, I'm not sure it is practical.
>
> LoopControl has parameters such as tilesize and loopsize, but Erik would
> know better how to use these. It was a long time ago, and I can't
> immediately find my parameter files.
>
> BTW, at the moment I am using this macro for all of my loop needs:
>
> #define UTILS_LOOP3(NAME,I,SI,EI,J,SJ,EJ,K,SK,EK)  \
>    _Pragma("omp for collapse(3)")                   \
>    for(int I = SI; I < EI; ++I)                     \
>    for(int J = SJ; J < EJ; ++J)                     \
>    for(int K = SK; K < EK; ++K)
>
> How would I convert it to something equivalent using LoopControl?
>
> Thanks,
>
> David
>
> PS. Seeing that Eloisa was able to compile bbox.cc with intel-17.0.0
> using -no-vec, I made a patch that disables vectorization via pragmas inside
> bbox.cc (to avoid having to compile it manually):
>
> https://bitbucket.org/eschnett/carpet/pull-requests/16/carpetlib-fix-compilation-with-intel-1700/diff
>
>
> --
> Ian Hinder
> http://members.aei.mpg.de/ianhin
>
>

-- 
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/