[Performanceoptimization-wg] SIMD Puzzle

Erik Schnetter schnetter at gmail.com
Tue Jul 23 10:38:12 CDT 2019


Zach

The generated code looks good. If you want to profile it to see where
it spends its time, I recommend the Linux "perf" utilities: first
record a run, then report the results. This will tell you which
machine instructions take the most time.
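The workflow might look like this (a sketch, assuming a Linux system with
perf installed; the binary name is taken from the compile commands quoted
below in this thread):

```shell
# Sample the program while it runs (call stacks included via -g):
perf record -g ./ScalarWave_Playground-SIMD

# Summarize where the time went, per function:
perf report --stdio

# Drill down to per-instruction costs in the hot loop;
# look for the ymm loads/FMAs dominating the samples:
perf annotate --stdio
```

perf annotate is the step that maps samples onto individual machine
instructions, which is what you want for inspecting the vectorized inner loop.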

-erik

On Tue, Jul 23, 2019 at 11:23 AM Zach Etienne <zachetie at gmail.com> wrote:
>
> Hi Erik,
>
> Thanks for your reassuring reply.
>
> > Did you look at the generated code? Try calling the compiler with the "-S" option to get assembler output, or use "objdump -d" to disassemble the object file. You should see lots of "ymms" mentioned in your memory access and arithmetic operations.
>
> Yep, as I mentioned, the assembler output from `-S -g` does indeed contain lots of "ymms" in the innermost loop. The annotated assembler for the innermost loop also gives precisely the same remark about not being able to vectorize an operation:
>
>                 # optimization report
>                 # LOOP WITH USER VECTOR INTRINSICS
>                 # %s was not vectorized: operation cannot be vectorized
>                 # VECTOR TRIP COUNT IS ESTIMATED CONSTANT
>
> -Zach
>
> *     *     *
> Prof. Zachariah Etienne
> West Virginia University
> https://math.wvu.edu/~zetienne/
> https://blackholesathome.net
>
>
> On Tue, Jul 23, 2019 at 11:08 AM Erik Schnetter <schnetter at gmail.com> wrote:
>>
>> Zach
>>
>> The compiler is correct: the intrinsic _mm256_loadu_pd cannot be
>> vectorized because ... it is already vectorized! If you are using
>> intrinsics for vectorization, then there is no vectorization work
>> left for the compiler to do.
>>
>> Did you look at the generated code? Try calling the compiler with the
>> "-S" option to get assembler output, or use "objdump -d" to
>> disassemble the object file. You should see lots of "ymms" mentioned
>> in your memory access and arithmetic operations.
>>
>> -erik
>>
>> On Tue, Jul 23, 2019 at 9:35 AM Zach Etienne <zachetie at gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I used NRPy+ to create a "minimal example" SIMD-intrinsics-enabled PDE solver kernel -- solving the scalar wave equation in 3 spatial dimensions.
>> >
>> > With AVX256+FMA intrinsics, neither the Intel nor the GNU compiler reports success at fully vectorizing the RHS evaluation loop. E.g., the Intel compiler emits this cryptic message when compiling the innermost loop:
>> >
>> >          remark #15310: loop was not vectorized: operation cannot be vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]
>> >
>> > The line it refers to loads data from memory via _mm256_loadu_pd(&a).
>> >
>> > The entire source code is attached to this email, and I've been compiling using
>> >
>> > icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o ScalarWave_Playground-SIMD
>> >
>> > for Intel 19, and for GNU (gcc 9):
>> >
>> > gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o ScalarWave_Playground-SIMD -lm
>> >
>> > When I look at the Intel-generated annotated assembly of the innermost RHS loop (using icc -S -g -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see many 256-bit "ymmX" registers and corresponding instructions that seem consistent with the *math* intrinsics. I can't decipher much beyond that, though. Notably, I didn't see any assembler instructions that obviously correspond to _mm256_loadu_pd().
>> >
>> > I fiddled around a bit with what goes inside the _mm256_loadu_pd(), just to see what might be causing the cryptic remark above. I found that if I remove the dependence on the innermost-loop variable "i0" from certain loads (obviously this would break the functionality, but the compiler doesn't care), then the compiler is able to vectorize that loop.
>> >
>> > Note that the version of the code that does not use intrinsics is about 1.8x slower with either compiler, so I think intrinsics are providing some good benefit. However, I am discouraged by the compiler telling me that the inner loop cannot be fully vectorized.
>> >
>> > Any tips would be greatly appreciated!
>> >
>> > -Zach
>> >
>> > *     *     *
>> > Prof. Zachariah Etienne
>> > West Virginia University
>> > https://math.wvu.edu/~zetienne/
>> > https://blackholesathome.net
>> > _______________________________________________
>> > performanceoptimization-wg mailing list
>> > performanceoptimization-wg at einsteintoolkit.org
>> > http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg
>>
>>
>>
>> --
>> Erik Schnetter <schnetter at gmail.com>
>> http://www.perimeterinstitute.ca/personal/eschnetter/



-- 
Erik Schnetter <schnetter at gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/

