[Users] ET on KNL.
Erik Schnetter
schnetter at cct.lsu.edu
Thu Mar 2 10:03:21 CST 2017
On Thu, Mar 2, 2017 at 10:03 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>
> On 2 Mar 2017, at 14:37, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>
> I am currently redesigning the tiling infrastructure, also to allow
> multithreading via Qthreads instead of OpenMP and to allow for aligning
> arrays with cache line boundaries. The new approach (different from the
> current LoopControl) is to choose a fixed tile size, either globally or per
> loop, and then assign individual tiles to threads. This also works well
> with DG derivatives, where the DG element size dictates a granularity for
> the tile size, and with the new efficient tiled derivative operators. Most of this
> is still in flux. I have seen large efficiency improvements in the RHS
> calculation, but two puzzling items remain:
>
> (1) It remains more efficient to use MPI than multi-threading for
> parallelization, at least on regular CPUs. On KNL my results are still
> somewhat random.
>
>
> When using MPI rather than multi-threading on the same number of cores, each
> component will be smaller, meaning that more of it is likely to fit in the
> cache. Would that explain this observation?
>
My wild guess is that an explicit MPI parallelization exhibits more data
locality, leading to better performance.
-erik
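
For concreteness, here is a rough sketch of the "fixed tile size, whole
tiles handed out to threads" approach described above. This is not the
actual Carpet/LoopControl code; the tile sizes and the loop_tiled/kernel/MIN
names are placeholders:

    /* Sketch only: fixed tile size, whole tiles assigned to OpenMP
       threads; not the actual Carpet/LoopControl implementation. */
    #define TILE_I 8
    #define TILE_J 4
    #define TILE_K 4
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void loop_tiled(int ni, int nj, int nk,
                    void (*kernel)(int i0, int i1, int j0, int j1,
                                   int k0, int k1))
    {
      const int nti = (ni + TILE_I - 1) / TILE_I;
      const int ntj = (nj + TILE_J - 1) / TILE_J;
      const int ntk = (nk + TILE_K - 1) / TILE_K;
    #pragma omp parallel for collapse(3) schedule(dynamic)
      for (int tk = 0; tk < ntk; ++tk)
        for (int tj = 0; tj < ntj; ++tj)
          for (int ti = 0; ti < nti; ++ti)
            /* each task is one whole tile */
            kernel(ti * TILE_I, MIN((ti + 1) * TILE_I, ni),
                   tj * TILE_J, MIN((tj + 1) * TILE_J, nj),
                   tk * TILE_K, MIN((tk + 1) * TILE_K, nk));
    }

Handing out whole tiles means each thread touches a compact block of the
arrays, which is roughly the locality that an MPI decomposition gives each
rank by construction.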
> (2) MoL_Add is quite expensive compared to the RHS evaluation.
>
>
> That is indeed odd.
>
> The main thing that has changed since our last round of thorough benchmarks
> is that CPUs have become much more powerful while memory bandwidth hasn't
> kept pace. I'm
> beginning to think that things such as vectorization or parallelization
> basically don't matter any more if we ensure that we pull data from memory
> into caches efficiently.
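
As a back-of-the-envelope illustration of why something like MoL_Add ends up
bandwidth-bound (a sketch, not the actual MoL code):

    /* Sketch, not the actual MoL_Add implementation: the per-point
       work of an RK update is essentially one multiply-add per state
       variable,
           y[i] += dt * k[i];
       i.e. 2 flops against about 24 bytes of memory traffic (load y,
       load k, store y) in double precision, or roughly 0.08 flop/byte.
       A current CPU needs on the order of 10 flop/byte to be
       compute-bound, so once the data no longer fits in cache this
       loop is limited purely by memory bandwidth, and vectorizing or
       threading the arithmetic cannot speed it up. */
    void mol_add_sketch(double *restrict y, const double *restrict k,
                        double dt, int n)
    {
      for (int i = 0; i < n; ++i)
        y[i] += dt * k[i];
    }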
>
> I have not yet collected PAPI statistics.
>
> -erik
>
>
> On Thu, Mar 2, 2017 at 6:57 AM, Ian Hinder <ian.hinder at aei.mpg.de> wrote:
>
>>
>> On 1 Mar 2017, at 22:10, David Radice <dradice at astro.princeton.edu>
>> wrote:
>>
>> Hi Ian, Erik, Eloisa,
>>
>> I attach a very brief report of some results I obtained in 2015 after
>> attending a KNC workshop.
>>
>> Conclusions: By using 244 threads, with the domain split into tiles of
>> size 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they
>> become available, the MIC was able to outperform the single CPU by a factor
>> of 1.5. The same tiling strategy was used on the CPU, as it has been found
>> to give good performance there in the past. Since we have not yet optimised
>> the code for the MIC architecture, we believe that further speed
>> improvements will be possible, and that solving the Einstein equations on
>> the MIC architecture should be feasible.
>>
>> Eloisa, are you using LoopControl? There are tiling parameters which can
>> also help with performance on these devices.
>>
>>
>> how does tiling work with LoopControl? Is it documented somewhere? I
>> naively thought that the point of tiling was to have chunks of data stored
>> contiguously in memory...
>>
>>
>> Ideally yes, but this would need to be done in Carpet not LoopControl,
>> and I think you would then require ghost zones around each tile. Since we
>> have huge numbers of ghost zones, I'm not sure it is practical.
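
(For scale, assuming for example 3 ghost points per face: an 8 x 4 x 4 tile
would grow to 14 x 10 x 10, i.e. 1400 points around only 128 interior ones,
so per-tile ghost zones would multiply the memory footprint and traffic by
roughly an order of magnitude rather than reduce it.)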
>>
>> LoopControl has parameters such as tilesize and loopsize, but Erik would
>> know better how to use these. It was a long time ago, and I can't
>> immediately find my parameter files.
>>
>> BTW, at the moment I am using this macro for all of my loop needs:
>>
>> #define UTILS_LOOP3(NAME,I,SI,EI,J,SJ,EJ,K,SK,EK)  \
>>     _Pragma("omp for collapse(3)")                  \
>>     for(int I = SI; I < EI; ++I)                    \
>>     for(int J = SJ; J < EJ; ++J)                    \
>>     for(int K = SK; K < EK; ++K)
>>
>> How would I convert it to something equivalent using LoopControl?
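
A rough sketch of what the LoopControl form would look like (hedged: I have
not checked this against the current loopcontrol.h, so verify the exact macro
arguments; the bound names si/ei/... and the extents ni/nj/nk below are
placeholders):

    /* Sketch of a LoopControl-based equivalent, assuming the usual
       LC_LOOP3(name, loop variables, lower bounds, upper bounds,
       local array extents) argument order -- check loopcontrol.h.
       LoopControl does its own tiling and work distribution, so the
       explicit "omp for collapse(3)" goes away; the loop is usually
       placed inside an enclosing "#pragma omp parallel" region. */
    #include <loopcontrol.h>

    #pragma omp parallel
    LC_LOOP3(utils_loop3, i, j, k,
             si, sj, sk,     /* lower bounds */
             ei, ej, ek,     /* upper bounds */
             ni, nj, nk)     /* local array extents */
    {
      /* loop body, e.g. indexing via CCTK_GFINDEX3D(cctkGH, i, j, k) */
    }
    LC_ENDLOOP3(utils_loop3);

The tilesize/loopsize parameters mentioned above would then control how
LoopControl blocks this iteration space at run time.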
>>
>> Thanks,
>>
>> David
>>
>> PS. Seeing that Eloisa was able to compile bbox.cc with the Intel 17.0.0
>> compiler using -no-vec, I made a patch to disable vectorization using
>> pragmas inside
>> bbox.cc (to avoid having to compile it manually):
>>
>> https://bitbucket.org/eschnett/carpet/pull-requests/16/carpetlib-fix-compilation-with-intel-1700/diff
>>
>>
>> --
>> Ian Hinder
>> http://members.aei.mpg.de/ianhin
>>
>>
>
>
> --
> Erik Schnetter <schnetter at cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>
>
> --
> Ian Hinder
> http://members.aei.mpg.de/ianhin
>
>
--
Erik Schnetter <schnetter at cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/