[Performanceoptimization-wg] SIMD Puzzle

Erik Schnetter schnetter at gmail.com
Tue Jul 23 12:51:52 CDT 2019


Zach

Unless the compiler can prove that the individual grid functions in
the 1D array don't overlap, you'll have to use individual pointers.
You might be able to add if or assert statements that prove to the
compiler that the maximum values of i, j, k rule out any overlap, but
you can never be sure that the compiler will actually exploit this.

Personally, I don't trust auto-vectorization any more, except in
really simple cases. I always use intrinsics or equivalent for
important loops.

-erik

On Tue, Jul 23, 2019 at 11:56 AM Zach Etienne <zachetie at gmail.com> wrote:
>
> Thanks so much for your replies, Erik & Roland!
>
> I have a follow-up question:
>
> By default, NRPy+/SENR-generated codes store all gridfunctions in a single 1D array, which the compiler tends to treat very conservatively. E.g., to read gxx and write to gxx_rhs, you might see something like (without compiler intrinsics, for clarity):
>
> const double gxx = in_gfs[IDX4(GXX,i,j,k)];
>
> ...
>
> rhs_gfs[IDX4(GXX,i,j,k)] = blah;
>
> respectively.
>
> This 4D-indexed 1D layout resulted in "vectorization cannot be performed because of possible data dependencies" messages, because all gridfunctions were being read from one 1D array and written to another 1D array. As a workaround, and to keep the 1D-array style *outside* the RHS evaluation function, I passed the start address of each gridfunction to the RHS function as a separate pointer declared with the restrict keyword.
>
> My question is, should this workaround be necessary, or is there another, more straightforward approach?
>
> -Zach
>
> *     *     *
> Prof. Zachariah Etienne
> West Virginia University
> https://math.wvu.edu/~zetienne/
> https://blackholesathome.net
>
>
> On Tue, Jul 23, 2019 at 11:45 AM Haas, Roland <rhaas at illinois.edu> wrote:
>>
>> Hello all,
>>
>> just to stoke your paranoia :-)
>>
>> Erik's comment of checking the disassembled output is definitely the
>> right way to go.
>>
>> I vaguely remember having heard that, at least for some versions of
>> some compilers, once you use intrinsics the compiler will no longer
>> try to auto-vectorize the code (presumably the enclosing function),
>> since it takes your use of intrinsics as an indication that you know
>> what you are doing.
>>
>> Yours,
>> Roland
>>
>> > Hi Erik,
>> >
>> > Thanks for your reassuring reply.
>> >
>> > > Did you look at the generated code? Try calling the compiler with the
>> > > "-S" option to get assembler output, or use "objdump -d" to disassemble
>> > > the object file. You should see lots of "ymms" mentioned in your memory
>> > > access and arithmetic operations.
>> >
>> > Yep, as I mentioned, the `-S -g` commented assembler indeed did output lots
>> > of "ymms" in the innermost loop. Also, the annotated assembler on the
>> > innermost loop gave precisely the same remark about not being able to
>> > vectorize an operation:
>> >
>> >                 # optimization report
>> >                 # LOOP WITH USER VECTOR INTRINSICS
>> >                 # %s was not vectorized: operation cannot be vectorized
>> >                 # VECTOR TRIP COUNT IS ESTIMATED CONSTANT
>> >
>> > -Zach
>> >
>> > *     *     *
>> > Prof. Zachariah Etienne
>> > West Virginia University
>> > https://math.wvu.edu/~zetienne/
>> > https://blackholesathome.net
>> >
>> >
>> > On Tue, Jul 23, 2019 at 11:08 AM Erik Schnetter <schnetter at gmail.com> wrote:
>> >
>> > > Zach
>> > >
>> > > The compiler is correct: The intrinsic _mm256_loadu_pd cannot be
>> > > vectorized because ... it is already vectorized! If you are using
>> > > intrinsics for vectorization, then the compiler does not need to
>> > > perform any work.
>> > >
>> > > Did you look at the generated code? Try calling the compiler with the
>> > > "-S" option to get assembler output, or use "objdump -d" to
>> > > disassemble the object file. You should see lots of "ymms" mentioned
>> > > in your memory access and arithmetic operations.
>> > >
>> > > -erik
>> > >
>> > > On Tue, Jul 23, 2019 at 9:35 AM Zach Etienne <zachetie at gmail.com> wrote:
>> > > >
>> > > > Hi all,
>> > > >
>> > > > I used NRPy+ to create a "minimal example" SIMD-intrinsics-enabled
>> > > > PDE solver kernel -- solving the scalar wave equation in 3 spatial
>> > > > dimensions.
>> > > >
>> > > > With AVX256+FMA intrinsics, neither the Intel nor the GNU compiler
>> > > > reports success at fully vectorizing the RHS eval loop. E.g., the
>> > > > Intel compiler yields this cryptic message when compiling the
>> > > > innermost loop:
>> > > >
>> > > >          remark #15310: loop was not vectorized: operation cannot be
>> > > >          vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]
>> > > >
>> > > > The line it's referring to has to do with loading data from memory:
>> > > > _mm256_loadu_pd(&a).
>> > > >
>> > > > The entire source code is attached to this email, and I've been
>> > > > compiling using
>> > > >
>> > > > icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5
>> > > > -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1
>> > > > -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o
>> > > > ScalarWave_Playground-SIMD
>> > > >
>> > > > for Intel 19, and for GNU (gcc 9):
>> > > >
>> > > > gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native
>> > > > ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o
>> > > > ScalarWave_Playground-SIMD -lm
>> > > >
>> > > > When I look at the Intel-generated annotated assembly of the
>> > > > innermost RHS loop (using icc -S -g -restrict -align -qopenmp -xHost
>> > > > -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec
>> > > > -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see
>> > > > many 256-bit "ymmX"'s and corresponding instructions that seem
>> > > > consistent with the *math* intrinsics. I can't decipher much beyond
>> > > > that, though. Notably, I didn't see any assembler instructions
>> > > > corresponding to _mm256_loadu_pd().
>> > > >
>> > > > I fiddled around a bit with what goes inside the _mm256_loadu_pd(),
>> > > > just to see what might be causing the cryptic remark above. I found
>> > > > that if I remove the dependence on the innermost loop variable "i0"
>> > > > in certain loads (obviously this would break the functionality, but
>> > > > the compiler doesn't care), then it is capable of vectorizing that
>> > > > loop.
>> > > >
>> > > > Note that the version of the code that does not use intrinsics is
>> > > > about 1.8x slower with either compiler, so I think the intrinsics
>> > > > are providing a real benefit. However, I am discouraged by the
>> > > > compiler telling me that the inner loop cannot be fully vectorized.
>> > > >
>> > > > Any tips would be greatly appreciated!
>> > > >
>> > > > -Zach
>> > > >
>> > > > *     *     *
>> > > > Prof. Zachariah Etienne
>> > > > West Virginia University
>> > > > https://math.wvu.edu/~zetienne/
>> > > > https://blackholesathome.net
>> > > > _______________________________________________
>> > > > performanceoptimization-wg mailing list
>> > > > performanceoptimization-wg at einsteintoolkit.org
>> > > > http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg
>> > >
>> > >
>> > >
>> > > --
>> > > Erik Schnetter <schnetter at gmail.com>
>> > > http://www.perimeterinstitute.ca/personal/eschnetter/
>> > >
>>
>>
>>
>> --
>> My email is as private as my paper mail. I therefore support encrypting
>> and signing email messages. Get my PGP key from http://keys.gnupg.net.



-- 
Erik Schnetter <schnetter at gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/

