[Performanceoptimization-wg] SIMD Puzzle

Tue Jul 23 08:35:26 CDT 2019

Hi all,

I used NRPy+ to create a "minimal example" SIMD-intrinsics-enabled PDE
solver kernel -- solving the scalar wave equation in 3 spatial dimensions.

With 256-bit AVX + FMA intrinsics, neither the Intel nor the GNU compiler
reports success at fully vectorizing the RHS evaluation loop. For example,
the Intel compiler emits this cryptic message when compiling the innermost loop:

         remark #15310: loop was not vectorized: operation cannot be vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]

The line it refers to loads data from memory via
_mm256_loadu_pd(&a).

The entire source code is attached to this email, and I've been compiling
using

icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5 \
    -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 \
    -qopt-prefetch=4 ScalarWave_Playground-SIMD.c \
    -o ScalarWave_Playground-SIMD

for Intel 19, and for GNU (gcc 9):

gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native \
    ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed \
    -o ScalarWave_Playground-SIMD -lm

When I look at the Intel-generated annotated assembly of the innermost RHS
loop (using icc -S -g -restrict -align -qopenmp -xHost -O2 -qopt-report=5
-qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1
-qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see many 256-bit "ymmX"
registers and corresponding instructions that seem to be consistent with the
*math* intrinsics. I can't decipher much beyond that, though. Notably, I did
not see any assembler instructions that look like _mm256_loadu_pd().
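
For anyone else reading the assembly: Intel's Intrinsics Guide lists vmovupd as the instruction behind _mm256_loadu_pd(), so the loads would be expected under that mnemonic, or folded into the memory operand of an arithmetic instruction, rather than under a name resembling the intrinsic. A minimal sketch for checking this in isolation follows. The function name add4 is hypothetical and not taken from the attached code, and the scalar branch exists only so the file still builds when AVX is not enabled.

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the attached code):
   adds 4 doubles from a to 4 doubles from b and writes them to out. */
void add4(const double *a, const double *b, double *out)
{
#if defined(__AVX__)
    /* Unaligned 256-bit loads; typically emitted as vmovupd. */
    const __m256d va = _mm256_loadu_pd(a);
    /* The compiler may fold this load into the add's memory operand. */
    const __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
#else
    /* Scalar fallback so the sketch compiles without AVX enabled. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Compiling this with -S (for example gcc -O2 -mavx -S, or the icc -S line above) and searching the output for vmovupd and vaddpd should show where the loads ended up.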

I fiddled around a bit with what goes inside the _mm256_loadu_pd(), just to
see what might be causing the cryptic remark above. I found that if I
remove the dependence on the innermost loop variable "i0" from certain of
these arguments (obviously this breaks the functionality, but the compiler
doesn't care), then the compiler is able to vectorize that loop.
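
To make the "i0 inside the load" pattern concrete without opening the attachments, here is a minimal 1D sketch of the kind of loop being discussed. It is not the NRPy+-generated kernel: the function name rhs_1d, the arrays u and rhs, and the factor invdx2 are all hypothetical, and the stencil is a plain second difference rather than the actual 3D scalar wave RHS. The scalar loop handles the remainder and doubles as a fallback when AVX and FMA are not enabled.

```c
#include <stddef.h>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical 1D analogue of an RHS loop:
   rhs[i0] = (u[i0-1] - 2*u[i0] + u[i0+1]) * invdx2 for interior points.
   Assumes u and rhs each hold n doubles; points 0 and n-1 are not written. */
void rhs_1d(const double *restrict u, double *restrict rhs,
            size_t n, double invdx2)
{
    if (n < 3) return;
    size_t i0 = 1;
#if defined(__AVX__) && defined(__FMA__)
    const __m256d neg2  = _mm256_set1_pd(-2.0);
    const __m256d scale = _mm256_set1_pd(invdx2);
    /* Each iteration handles 4 doubles; every load address depends on i0. */
    for (; i0 + 4 <= n - 1; i0 += 4) {
        const __m256d um = _mm256_loadu_pd(&u[i0 - 1]);
        const __m256d uc = _mm256_loadu_pd(&u[i0]);
        const __m256d up = _mm256_loadu_pd(&u[i0 + 1]);
        /* fmadd(a, b, c) computes a*b + c, i.e. -2*uc + (um + up). */
        const __m256d lap = _mm256_fmadd_pd(neg2, uc, _mm256_add_pd(um, up));
        _mm256_storeu_pd(&rhs[i0], _mm256_mul_pd(lap, scale));
    }
#endif
    /* Scalar remainder, and the full loop when AVX+FMA are not enabled. */
    for (; i0 < n - 1; i0++)
        rhs[i0] = (u[i0 - 1] - 2.0 * u[i0] + u[i0 + 1]) * invdx2;
}
```

In this sketch the loop already advances i0 by 4 and operates on __m256d values, so replacing &u[i0] with a loop-invariant address would be the 1D counterpart of the experiment described above.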

Note that the version of the code that does not use intrinsics is about
1.8x slower with either compiler, so the intrinsics do seem to provide a
real benefit. Still, I am discouraged that the compiler reports that the
inner loop cannot be fully vectorized.

Any tips would be greatly appreciated!

-Zach

*     *     *
Prof. Zachariah Etienne
West Virginia University
https://math.wvu.edu/~zetienne/
https://blackholesathome.net
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ScalarWave_NGHOSTS.h
Type: text/x-chdr
Size: 93 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/performanceoptimization-wg/attachments/20190723/767f734a/attachment-0004.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ScalarWave_ExactSolution.h
Type: text/x-chdr
Size: 1326 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/performanceoptimization-wg/attachments/20190723/767f734a/attachment-0005.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ScalarWave_Playground-SIMD.c
Type: text/x-csrc
Size: 15060 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/performanceoptimization-wg/attachments/20190723/767f734a/attachment-0006.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ScalarWave_RHSs-SIMD.h
Type: text/x-chdr
Size: 5176 bytes
Desc: not available
Url : http://lists.einsteintoolkit.org/pipermail/performanceoptimization-wg/attachments/20190723/767f734a/attachment-0007.bin