<div dir="ltr">Hi all,<div><br></div><div>I used NRPy+ to create a &quot;minimal example&quot; SIMD-intrinsics-enabled PDE solver kernel that solves the scalar wave equation in 3 spatial dimensions.</div><div><br></div><div>With AVX256+FMA intrinsics, neither the Intel nor the GNU compiler reports success at fully vectorizing the RHS evaluation loop. E.g., the Intel compiler yields this cryptic message when compiling the innermost loop:</div><div><br></div><div>         remark #15310: loop was not vectorized: operation cannot be vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]<br></div><div><br></div><div>The line it refers to loads data from memory via _mm256_loadu_pd(&amp;a).</div><div><br></div><div>The entire source code is attached to this email, and I&#39;ve been compiling using</div><div><br></div><div>icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o ScalarWave_Playground-SIMD<br></div><div><br></div><div>for Intel 19, and for GNU (gcc 9):</div><div><br></div><div>gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o ScalarWave_Playground-SIMD -lm<br></div><div><br></div><div>When I look at the Intel-generated annotated assembly of the innermost RHS loop (using icc -S -g -restrict -align -qopenmp -xHost -O2 -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see many 256-bit &quot;ymmX&quot; registers and corresponding instructions that seem to be consistent with the *math* intrinsics. I can&#39;t decipher much beyond that, though. Notably, I didn&#39;t see any assembler instructions that look like _mm256_loadu_pd().</div><div><br></div><div>I fiddled around a bit with what goes inside the _mm256_loadu_pd(), just to see what might be causing the cryptic remark above. 
I found that if I remove the dependence on the innermost loop variable &quot;i0&quot; in certain of those load arguments (obviously this would break the functionality, but the compiler doesn&#39;t care), then the compiler is able to vectorize that loop. </div><div><br></div><div>Note that the version of the code that does not use intrinsics is about 1.8x slower with either compiler, so I think the intrinsics are providing a real benefit. However, I am discouraged by the compiler telling me that the inner loop cannot be fully vectorized.</div><div><br></div><div>Any tips would be greatly appreciated!</div><div><br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="font-size:12.8px">-Zach</div><div style="font-size:12.8px"><br></div><span style="font-size:12.8px">*     *     *</span><br style="font-size:12.8px"><span style="font-size:12.8px">Prof. Zachariah Etienne</span><br style="font-size:12.8px"><div style="font-size:12.8px">West Virginia University</div><div><font color="#0000ee"><u><a href="https://math.wvu.edu/~zetienne/" target="_blank">https://math.wvu.edu/~zetienne/</a></u></font></div><div><a href="https://blackholesathome.net" style="font-size:12.8px" target="_blank">https://blackholesathome.net</a><br></div></div></div></div></div></div></div></div></div></div></div></div></div></div>