[Performanceoptimization-wg] SIMD Puzzle

Tue Jul 23 10:45:48 CDT 2019

Hello all,

just to stoke your paranoia :-)

Erik's suggestion to check the disassembled output is definitely the
right way to go.

I vaguely remember hearing that, at least for some versions of some
compilers, once you use intrinsics the compiler no longer tries to
auto-vectorize the surrounding code (presumably the enclosing
function), since it takes your use of intrinsics as an indication that
you know what you are doing.
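
For what it's worth, here is a minimal sketch of the kind of check I
mean. It is not taken from Zach's attached code: the names axpy_scalar,
axpy_avx and axpy_demo are made up for illustration, and the target
attribute and __builtin_cpu_supports are GCC/Clang extensions, so I am
assuming one of those compilers. The same operation is written once as
a plain loop (left to the auto-vectorizer) and once with AVX/FMA
intrinsics, so the two can be compared in the "-S" or "objdump -d"
output.

```c
/* Sketch only: axpy_scalar, axpy_avx and axpy_demo are hypothetical names,
 * not taken from ScalarWave_Playground-SIMD.c.
 * Inspect with e.g. "gcc -O2 -S file.c" or "objdump -d file.o". */
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define DEMO_X86 1
#else
#define DEMO_X86 0
#endif

/* Plain loop: whether this gets vectorized is up to the compiler. */
static void axpy_scalar(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

#if DEMO_X86
/* Hand-vectorized loop: four doubles per iteration in one ymm register.
 * The target attribute (a GCC/Clang extension) enables AVX2+FMA for this
 * function only, so no -mavx2 -mfma is needed on the command line. */
__attribute__((target("avx2,fma")))
static void axpy_avx(double a, const double *x, double *y, size_t n)
{
    const __m256d va = _mm256_set1_pd(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads; in the assembler these usually show up as
         * vmovupd, or are folded into the memory operand of the FMA. */
        __m256d vx = _mm256_loadu_pd(&x[i]);
        __m256d vy = _mm256_loadu_pd(&y[i]);
        vy = _mm256_fmadd_pd(va, vx, vy);   /* vy = va * vx + vy */
        _mm256_storeu_pd(&y[i], vy);
    }
    for (; i < n; i++)                      /* scalar remainder */
        y[i] += a * x[i];
}
#endif

/* Runs both versions on the same input; returns 0 if they agree. */
int axpy_demo(void)
{
    enum { N = 10 };
    double x[N], y_ref[N], y_simd[N];
    for (int i = 0; i < N; i++) {
        x[i] = (double)i;
        y_ref[i] = y_simd[i] = 1.0;
    }
    axpy_scalar(2.0, x, y_ref, N);
#if DEMO_X86
    /* Only take the intrinsics path if the CPU actually supports it. */
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        axpy_avx(2.0, x, y_simd, N);
    else
#endif
        axpy_scalar(2.0, x, y_simd, N);
    for (int i = 0; i < N; i++)
        if (y_ref[i] != y_simd[i])
            return 1;
    return 0;
}
```

If you call axpy_demo() from a main() and look at the assembler, the
body of axpy_avx should be full of ymm registers whatever the
vectorization report says about that loop. Whether axpy_scalar uses
them as well depends on the compiler and the flags.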

Yours,
Roland

> Hi Erik,
> 
> Thanks for your reassuring reply.
> 
> > Did you look at the generated code? Try calling the compiler with the
> > "-S" option to get assembler output, or use "objdump -d" to disassemble
> > the object file. You should see lots of "ymms" mentioned in your memory
> > access and arithmetic operations.
> 
> Yep, as I mentioned, the annotated assembler from `-S -g` did indeed show
> lots of "ymms" in the innermost loop. The annotation on that loop also
> gave precisely the same remark about not being able to vectorize an
> operation:
> 
>                 # optimization report
>                 # LOOP WITH USER VECTOR INTRINSICS
>                 # %s was not vectorized: operation cannot be vectorized
>                 # VECTOR TRIP COUNT IS ESTIMATED CONSTANT
> 
> -Zach
> 
> *     *     *
> Prof. Zachariah Etienne
> West Virginia University
> https://math.wvu.edu/~zetienne/
> https://blackholesathome.net
> 
> 
> On Tue, Jul 23, 2019 at 11:08 AM Erik Schnetter <schnetter at gmail.com> wrote:
> 
> > Zach
> >
> > The compiler is correct: The intrinsic _mm256_loadu_pd cannot be
> > vectorized because ... it is already vectorized! If you are using
> > intrinsics for vectorization, then the compiler does not need to
> > perform any work.
> >
> > Did you look at the generated code? Try calling the compiler with the
> > "-S" option to get assembler output, or use "objdump -d" to
> > disassemble the object file. You should see lots of "ymms" mentioned
> > in your memory access and arithmetic operations.
> >
> > -erik
> >
> > On Tue, Jul 23, 2019 at 9:35 AM Zach Etienne <zachetie at gmail.com> wrote:  
> > >
> > > Hi all,
> > >
> > > I used NRPy+ to create a "minimal example" SIMD-intrinsics-enabled PDE
> > > solver kernel -- solving the scalar wave equation in 3 spatial dimensions.
> > >
> > > With AVX256+FMA intrinsics, neither Intel nor GNU compilers report
> > > success at fully vectorizing the RHS eval loop. E.g., the Intel compiler
> > > yields the cryptic message when compiling the innermost loop:
> > >
> > >          remark #15310: loop was not vectorized: operation cannot be vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]
> > >
> > > The line it's referring to has to do with loading data from memory
> > > _mm256_loadu_pd(&a).
> > >
> > > The entire source code is attached to this email, and I've been
> > > compiling using
> > >
> > > icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o ScalarWave_Playground-SIMD
> > >
> > > for Intel 19, and for GNU (gcc 9):
> > >
> > > gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o ScalarWave_Playground-SIMD -lm
> > >
> > > When I look at the Intel-generated annotated assembly of the innermost
> > > RHS loop (using icc -S -g -restrict -align -qopenmp -xHost -O2
> > > -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec
> > > -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see many
> > > 256-bit "ymmX"'s and corresponding instructions that seem to be consistent
> > > with the *math* intrinsics. I can't decipher much beyond that, though.
> > > Notably I didn't see any assembler instructions that look like
> > > _mm256_loadu_pd().
> > >
> > > I fiddled around a bit with what goes inside the _mm256_loadu_pd(), just
> > > to see what might be causing the cryptic remark above. I found that if I
> > > remove dependence on the innermost loop variable "i0" on certain (obviously
> > > this would break the functionality, but the compiler doesn't care), then it
> > > is capable of vectorizing that loop.
> > >
> > > Note that the version of the code that does not use intrinsics is about
> > > 1.8x slower with either compiler, so I think intrinsics are providing some
> > > good benefit. However, I am discouraged by the compiler telling me that the
> > > inner loop cannot be fully vectorized.
> > >
> > > Any tips would be greatly appreciated!
> > >
> > > -Zach
> > >
> > > *     *     *
> > > Prof. Zachariah Etienne
> > > West Virginia University
> > > https://math.wvu.edu/~zetienne/
> > > https://blackholesathome.net
> > > _______________________________________________
> > > performanceoptimization-wg mailing list
> > > performanceoptimization-wg at einsteintoolkit.org
> > > http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg
> >
> >
> >
> > --
> > Erik Schnetter <schnetter at gmail.com>
> > http://www.perimeterinstitute.ca/personal/eschnetter/
> >  

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.