[Performanceoptimization-wg] SIMD Puzzle

Zach Etienne zachetie at gmail.com
Tue Jul 23 10:56:15 CDT 2019


Thanks so much for your replies, Erik & Roland!

I have a follow-up question:

By default NRPy+/SENR codes store all gridfunctions in a 1D array, which
the compiler tends to treat very conservatively. E.g., to access gxx and
write to gxx_rhs, you might see something like (without compiler intrinsics
for clarity):

const double gxx = in_gfs[IDX4(GXX,i,j,k)];

...

rhs_gfs[IDX4(GXX,i,j,k)] = blah;

respectively.

This flattened 4D indexing resulted in "vectorization cannot be performed
because of possible data dependencies" types of messages, because all
gridfunctions were being read from one 1D array and written to another 1D
array, and the compiler could not prove the two arrays do not alias. As a
workaround, and to keep the 1D array style *outside* the RHS eval function,
I passed the start address of each gridfunction to the RHS function
separately, declaring each pointer with the restrict keyword.

My question is, should this workaround be necessary, or is there another,
more straightforward approach?

-Zach

*     *     *
Prof. Zachariah Etienne
West Virginia University
https://math.wvu.edu/~zetienne/
https://blackholesathome.net


On Tue, Jul 23, 2019 at 11:45 AM Haas, Roland <rhaas at illinois.edu> wrote:

> Hello all,
>
> just to stoke your paranoia :-)
>
> Erik's comment of checking the disassembled output is definitely the
> right way to go.
>
> I kind of vaguely remember having heard that, at least for some
> version of some compilers, once you use intrinsics the compiler will no
> longer try to auto-vectorize the code (presumably the function) since
> it takes you using intrinsics as an indication that you know what you
> are doing.
>
> Yours,
> Roland
>
> > Hi Erik,
> >
> > Thanks for your reassuring reply.
> >
> > > Did you look at the generated code? Try calling the compiler with the
> > > "-S" option to get assembler output, or use "objdump -d" to disassemble
> > > the object file. You should see lots of "ymms" mentioned in your memory
> > > access and arithmetic operations.
> >
> > Yep, as I mentioned, the `-S -g` commented assembler indeed did output
> > lots of "ymms" in the innermost loop. Also, the annotated assembler on
> > the innermost loop gave precisely the same remark about not being able
> > to vectorize an operation:
> >
> >                 # optimization report
> >                 # LOOP WITH USER VECTOR INTRINSICS
> >                 # %s was not vectorized: operation cannot be vectorized
> >                 # VECTOR TRIP COUNT IS ESTIMATED CONSTANT
> >
> > -Zach
> >
> > *     *     *
> > Prof. Zachariah Etienne
> > West Virginia University
> > https://math.wvu.edu/~zetienne/
> > https://blackholesathome.net
> >
> >
> > On Tue, Jul 23, 2019 at 11:08 AM Erik Schnetter <schnetter at gmail.com> wrote:
> >
> > > Zach
> > >
> > > The compiler is correct: The intrinsic _mm256_loadu_pd cannot be
> > > vectorized because ... it is already vectorized! If you are using
> > > intrinsics for vectorization, then the compiler does not need to
> > > perform any work.
> > >
> > > Did you look at the generated code? Try calling the compiler with the
> > > "-S" option to get assembler output, or use "objdump -d" to
> > > disassemble the object file. You should see lots of "ymms" mentioned
> > > in your memory access and arithmetic operations.
> > >
> > > -erik
> > >
> > > On Tue, Jul 23, 2019 at 9:35 AM Zach Etienne <zachetie at gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I used NRPy+ to create a "minimal example" SIMD-intrinsics-enabled
> > > > PDE solver kernel -- solving the scalar wave equation in 3 spatial
> > > > dimensions.
> > > >
> > > > With AVX256+FMA intrinsics, neither Intel nor GNU compilers report
> > > > success at fully vectorizing the RHS eval loop. E.g., the Intel
> > > > compiler yields the cryptic message when compiling the innermost loop:
> > > >
> > > >          remark #15310: loop was not vectorized: operation cannot
> > > >          be vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]
> > > >
> > > > The line it's referring to has to do with loading data from memory:
> > > > _mm256_loadu_pd(&a).
> > > >
> > > > The entire source code is attached to this email, and I've been
> > > > compiling using
> > > >
> > > > icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5
> > > > -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1
> > > > -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o ScalarWave_Playground-SIMD
> > > >
> > > > for Intel 19, and for GNU (gcc 9):
> > > >
> > > > gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native
> > > > ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o
> > > > ScalarWave_Playground-SIMD -lm
> > > >
> > > > When I look at the Intel-generated annotated assembly of the
> > > > innermost RHS loop (using icc -S -g -restrict -align -qopenmp -xHost
> > > > -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec
> > > > -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see
> > > > many 256-bit "ymmX"'s and corresponding instructions that seem to be
> > > > consistent with the *math* intrinsics. I can't decipher much beyond
> > > > that, though. Notably, I didn't see any assembler instructions
> > > > corresponding to _mm256_loadu_pd().
> > > >
> > > > I fiddled around a bit with what goes inside the _mm256_loadu_pd(),
> > > > just to see what might be causing the cryptic remark above. I found
> > > > that if I remove the dependence on the innermost loop variable "i0"
> > > > in certain places (obviously this would break the functionality, but
> > > > the compiler doesn't care), then it is capable of vectorizing that
> > > > loop.
> > > >
> > > > Note that the version of the code that does not use intrinsics is
> > > > about 1.8x slower with either compiler, so I think intrinsics are
> > > > providing some good benefit. However, I am discouraged by the
> > > > compiler telling me that the inner loop cannot be fully vectorized.
> > > >
> > > > Any tips would be greatly appreciated!
> > > >
> > > > -Zach
> > > >
> > > > *     *     *
> > > > Prof. Zachariah Etienne
> > > > West Virginia University
> > > > https://math.wvu.edu/~zetienne/
> > > > https://blackholesathome.net
> > > > _______________________________________________
> > > > performanceoptimization-wg mailing list
> > > > performanceoptimization-wg at einsteintoolkit.org
> > > > http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg
> > >
> > >
> > >
> > > --
> > > Erik Schnetter <schnetter at gmail.com>
> > > http://www.perimeterinstitute.ca/personal/eschnetter/
> > >
>
>
>
> --
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://keys.gnupg.net.
>

