[Performanceoptimization-wg] SIMD Puzzle

Erik Schnetter schnetter at gmail.com
Tue Jul 23 10:08:27 CDT 2019


Zach

The compiler is correct: The intrinsic _mm256_loadu_pd cannot be
vectorized because ... it is already vectorized! If you are using
intrinsics for vectorization, then the compiler does not need to
perform any work.

Did you look at the generated code? Try calling the compiler with the
"-S" option to get assembler output, or use "objdump -d" to
disassemble the object file. You should see lots of "ymm" registers
mentioned in the memory-access and arithmetic instructions.

-erik

On Tue, Jul 23, 2019 at 9:35 AM Zach Etienne <zachetie at gmail.com> wrote:
>
> Hi all,
>
> I used NRPy+ to create a "minimal example" SIMD-intrinsics-enabled PDE solver kernel -- solving the scalar wave equation in 3 spatial dimensions.
>
> With AVX256+FMA intrinsics, neither the Intel nor the GNU compiler reports success at fully vectorizing the RHS evaluation loop. E.g., the Intel compiler emits this cryptic message when compiling the innermost loop:
>
>          remark #15310: loop was not vectorized: operation cannot be vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]
>
> The line it's referring to loads data from memory via _mm256_loadu_pd(&a).
>
> The entire source code is attached to this email, and I've been compiling using
>
> icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o ScalarWave_Playground-SIMD
>
> for Intel 19, and for GNU (gcc 9):
>
> gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o ScalarWave_Playground-SIMD -lm
>
> When I look at the Intel-generated annotated assembly of the innermost RHS loop (using icc -S -g -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see many 256-bit "ymmX" registers and corresponding instructions that seem consistent with the *math* intrinsics. I can't decipher much beyond that, though. Notably, I didn't see any assembler instructions corresponding to _mm256_loadu_pd().
>
> I fiddled around a bit with what goes inside the _mm256_loadu_pd(), just to see what might be causing the cryptic remark above. I found that if I remove the dependence on the innermost loop variable "i0" in certain places (obviously this would break the functionality, but the compiler doesn't care), then the compiler is able to vectorize that loop.
>
> Note that the version of the code that does not use intrinsics is about 1.8x slower with either compiler, so I think intrinsics are providing some good benefit. However, I am discouraged by the compiler telling me that the inner loop cannot be fully vectorized.
>
> Any tips would be greatly appreciated!
>
> -Zach
>
> *     *     *
> Prof. Zachariah Etienne
> West Virginia University
> https://math.wvu.edu/~zetienne/
> https://blackholesathome.net
> _______________________________________________
> performanceoptimization-wg mailing list
> performanceoptimization-wg at einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg



-- 
Erik Schnetter <schnetter at gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/

