<div dir="ltr">Thanks so much for your replies, Erik &amp; Roland!<div><br></div><div>I have a follow-up question:<div><br></div><div>By default, NRPy+/SENR codes store all gridfunctions in a single 1D array, which the compiler tends to treat very conservatively. E.g., to access gxx and write to gxx_rhs, you might see something like (without compiler intrinsics for clarity):<br></div><div><br></div><div>const double gxx = in_gfs[IDX4(GXX,i,j,k)];</div><div><br></div><div>...</div><div><br></div><div>rhs_gfs[IDX4(GXX,i,j,k)] = blah;</div><div><br></div><div>respectively.</div><div><br></div><div>This flattened 4D indexing led to messages such as &quot;vectorization cannot be performed because of possible data dependencies&quot;, because all gridfunctions were being read from one 1D array and written to another 1D array. As a workaround, and to keep the 1D array style *outside* the RHS eval function, I passed the start address of each gridfunction to the RHS function separately, as a restrict-qualified pointer.</div><div><br></div><div>My question: should this workaround be necessary, or is there a more straightforward approach?</div><div><br clear="all"><div><div dir="ltr" class="m_5790956014382495057gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="font-size:12.8px">-Zach</div><div style="font-size:12.8px"><br></div><span style="font-size:12.8px">*     *     *</span><br style="font-size:12.8px"><span style="font-size:12.8px">Prof. 
Zachariah Etienne</span><br style="font-size:12.8px"><div style="font-size:12.8px">West Virginia University</div><div><font color="#0000ee"><u><a href="https://math.wvu.edu/~zetienne/" target="_blank">https://math.wvu.edu/~zetienne/</a></u></font></div><div><a href="https://blackholesathome.net" style="font-size:12.8px" target="_blank">https://blackholesathome.net</a><br></div></div></div></div></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 23, 2019 at 11:45 AM Haas, Roland &lt;<a href="mailto:rhaas@illinois.edu" target="_blank">rhaas@illinois.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello all,<br>

<br>

just to stoke your paranoia :-)<br>

<br>

Erik&#39;s comment of checking the disassembled output is definitely the<br>

right way to go.<br>

<br>

I kind of vaguely remember having heard that, at least for some<br>

version of some compilers, once you use intrinsics the compiler will no<br>

longer try to auto-vectorize the code (presumably the function) since<br>

it takes you using intrinsics as an indication that you know what you<br>

are doing.<br>

<br>

Yours,<br>

Roland<br>

<br>

&gt; Hi Erik,<br>

&gt; <br>

&gt; Thanks for your reassuring reply.<br>

&gt; <br>

&gt; &gt; Did you look at the generated code? Try calling the compiler with the<br>

&gt; &gt; &quot;-S&quot; option to get assembler output, or use &quot;objdump -d&quot; to disassemble the<br>

&gt; &gt; object file. You should see lots of &quot;ymms&quot; mentioned in your memory access<br>

&gt; &gt; and arithmetic operations.<br>

&gt; <br>

&gt; Yep, as I mentioned, the `-S -g` commented assembler indeed did output lots<br>

&gt; of &quot;ymms&quot; in the innermost loop. Also, the annotated assembler on the<br>

&gt; innermost loop gave precisely the same remark about not being able to<br>

&gt; vectorize an operation:<br>

&gt; <br>

&gt;                 # optimization report<br>

&gt;                 # LOOP WITH USER VECTOR INTRINSICS<br>

&gt;                 # %s was not vectorized: operation cannot be vectorized<br>

&gt;                 # VECTOR TRIP COUNT IS ESTIMATED CONSTANT<br>

&gt; <br>

&gt; -Zach<br>

&gt; <br>

&gt; *     *     *<br>

&gt; Prof. Zachariah Etienne<br>

&gt; West Virginia University<br>

&gt; <a href="https://math.wvu.edu/~zetienne/" rel="noreferrer" target="_blank">https://math.wvu.edu/~zetienne/</a><br>

&gt; <a href="https://blackholesathome.net" rel="noreferrer" target="_blank">https://blackholesathome.net</a><br>

&gt; <br>

&gt; <br>

&gt; On Tue, Jul 23, 2019 at 11:08 AM Erik Schnetter &lt;<a href="mailto:schnetter@gmail.com" target="_blank">schnetter@gmail.com</a>&gt; wrote:<br>

&gt; <br>

&gt; &gt; Zach<br>

&gt; &gt;<br>

&gt; &gt; The compiler is correct: The intrinsic _mm256_loadu_pd cannot be<br>

&gt; &gt; vectorized because ... it is already vectorized! If you are using<br>

&gt; &gt; intrinsics for vectorization, then the compiler does not need to<br>

&gt; &gt; perform any work.<br>

&gt; &gt;<br>

&gt; &gt; Did you look at the generated code? Try calling the compiler with the<br>

&gt; &gt; &quot;-S&quot; option to get assembler output, or use &quot;objdump -d&quot; to<br>

&gt; &gt; disassemble the object file. You should see lots of &quot;ymms&quot; mentioned<br>

&gt; &gt; in your memory access and arithmetic operations.<br>

&gt; &gt;<br>

&gt; &gt; -erik<br>

&gt; &gt;<br>

&gt; &gt; On Tue, Jul 23, 2019 at 9:35 AM Zach Etienne &lt;<a href="mailto:zachetie@gmail.com" target="_blank">zachetie@gmail.com</a>&gt; wrote:  <br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; Hi all,<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; I used NRPy+ to create a &quot;minimal example&quot; SIMD-intrinsics-enabled PDE<br>

&gt; &gt; &gt; solver kernel -- solving the scalar wave equation in 3 spatial dimensions.<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; With AVX256+FMA intrinsics, neither the Intel nor the GNU compiler reports<br>

&gt; &gt; &gt; success at fully vectorizing the RHS eval loop. E.g., the Intel compiler<br>

&gt; &gt; &gt; yields this cryptic message when compiling the innermost loop:<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt;          remark #15310: loop was not vectorized: operation cannot be<br>

&gt; &gt; &gt;          vectorized   [ ScalarWave/ScalarWave_RHSs-SIMD.h(31,52) ]<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; The line it refers to loads data from memory using<br>

&gt; &gt; &gt; _mm256_loadu_pd(&amp;a).<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; The entire source code is attached to this email, and I&#39;ve been<br>

&gt; &gt; &gt; compiling using<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; icc -restrict -align -qopenmp -xHost -O2 -qopt-report=5<br>

&gt; &gt; &gt; -qopt-report-phase ipo -qopt-report-phase vec -vec-threshold1<br>

&gt; &gt; &gt; -qopt-prefetch=4 ScalarWave_Playground-SIMD.c -o ScalarWave_Playground-SIMD<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; for Intel 19, and for GNU (gcc 9):<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; gcc -fsimd-cost-model=unlimited -Ofast -fopenmp -march=native<br>

&gt; &gt; &gt; ScalarWave_Playground-SIMD.c -fopt-info-vec-optimized-missed -o<br>

&gt; &gt; &gt; ScalarWave_Playground-SIMD -lm<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; When I look at the Intel-generated annotated assembly of the innermost<br>

&gt; &gt; &gt; RHS loop (using icc -S -g -restrict -align -qopenmp -xHost -O2<br>

&gt; &gt; &gt; -qopt-report=5 -qopt-report-phase ipo -qopt-report-phase vec<br>

&gt; &gt; &gt; -vec-threshold1 -qopt-prefetch=4 ScalarWave_Playground-SIMD.c), I see many<br>

&gt; &gt; &gt; 256-bit &quot;ymmX&quot;&#39;s and corresponding instructions that seem to be consistent<br>

&gt; &gt; &gt; with the *math* intrinsics. I can&#39;t decipher much beyond that, though.<br>

&gt; &gt; &gt; Notably I didn&#39;t see any assembler instructions that look like<br>

&gt; &gt; &gt; _mm256_loadu_pd().<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; I fiddled around a bit with what goes inside the _mm256_loadu_pd(), just<br>

&gt; &gt; &gt; to see what might be causing the cryptic remark above. I found that if I<br>

&gt; &gt; &gt; remove dependence on the innermost loop variable &quot;i0&quot; on certain (obviously<br>

&gt; &gt; &gt; this would break the functionality, but the compiler doesn&#39;t care), then it<br>

&gt; &gt; &gt; is capable of vectorizing that loop.<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; Note that the version of the code that does not use intrinsics is about<br>

&gt; &gt; &gt; 1.8x slower with either compiler, so I think intrinsics are providing some<br>

&gt; &gt; &gt; good benefit. However, I am discouraged by the compiler telling me that the<br>

&gt; &gt; &gt; inner loop cannot be fully vectorized.<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; Any tips would be greatly appreciated!<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; -Zach<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; *     *     *<br>

&gt; &gt; &gt; Prof. Zachariah Etienne<br>

&gt; &gt; &gt; West Virginia University<br>

&gt; &gt; &gt; <a href="https://math.wvu.edu/~zetienne/" rel="noreferrer" target="_blank">https://math.wvu.edu/~zetienne/</a><br>

&gt; &gt; &gt; <a href="https://blackholesathome.net" rel="noreferrer" target="_blank">https://blackholesathome.net</a><br>

&gt; &gt; &gt; _______________________________________________<br>

&gt; &gt; &gt; performanceoptimization-wg mailing list<br>

&gt; &gt; &gt; <a href="mailto:performanceoptimization-wg@einsteintoolkit.org" target="_blank">performanceoptimization-wg@einsteintoolkit.org</a><br>

&gt; &gt; &gt; <a href="http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg" rel="noreferrer" target="_blank">http://lists.einsteintoolkit.org/mailman/listinfo/performanceoptimization-wg</a><br>

&gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt; --<br>

&gt; &gt; Erik Schnetter &lt;<a href="mailto:schnetter@gmail.com" target="_blank">schnetter@gmail.com</a>&gt;<br>

&gt; &gt; <a href="http://www.perimeterinstitute.ca/personal/eschnetter/" rel="noreferrer" target="_blank">http://www.perimeterinstitute.ca/personal/eschnetter/</a><br>

&gt; &gt;  <br>

<br>

<br>

<br>

-- <br>

My email is as private as my paper mail. I therefore support encrypting<br>

and signing email messages. Get my PGP key from <a href="http://keys.gnupg.net" rel="noreferrer" target="_blank">http://keys.gnupg.net</a>.<br>

</blockquote></div></div></div>
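<div><br></div><div>A minimal C sketch of the workaround described at the top of this message: the caller keeps the packed 1D arrays, and the kernel receives one restrict-qualified pointer per gridfunction. The layout behind IDX3/IDX4 (i fastest, gridfunction index slowest), the explicit Nx, Ny, Nz arguments, the function names rhs_one_gf and rhs_eval, and the placeholder RHS expression are all assumptions for illustration, not the actual NRPy+/SENR definitions. The restrict qualifiers are only valid if in_gfs and rhs_gfs never overlap.</div><div><br></div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gridfunction indices, for illustration only. */
enum { GXX = 0, GXY = 1, NUM_GFS = 2 };

/* Assumed layout: i is the fastest index, the gridfunction index g the slowest. */
#define IDX3(i, j, k, Nx, Ny) \
    ((size_t)(i) + (size_t)(Nx) * ((size_t)(j) + (size_t)(Ny) * (size_t)(k)))
#define IDX4(g, i, j, k, Nx, Ny, Nz) \
    (IDX3(i, j, k, Nx, Ny) + (size_t)(Nx) * (size_t)(Ny) * (size_t)(Nz) * (size_t)(g))

/* Kernel: one restrict-qualified pointer per gridfunction, so the compiler
 * may assume that the loads from gxx and the stores to gxx_rhs never alias. */
static void rhs_one_gf(const double *restrict gxx, double *restrict gxx_rhs,
                       int Nx, int Ny, int Nz)
{
    assert(Nx > 0 && Ny > 0 && Nz > 0);
    for (int k = 0; k < Nz; k++) {
        for (int j = 0; j < Ny; j++) {
            for (int i = 0; i < Nx; i++) {
                const size_t idx = IDX3(i, j, k, Nx, Ny);
                /* Placeholder standing in for the real RHS expression ("blah"). */
                gxx_rhs[idx] = 2.0 * gxx[idx];
            }
        }
    }
}

/* Wrapper: the packed 1D arrays stay in use outside the kernel; only the
 * start address of each gridfunction is handed to it. */
void rhs_eval(const double *in_gfs, double *rhs_gfs, int Nx, int Ny, int Nz)
{
    rhs_one_gf(in_gfs + IDX4(GXX, 0, 0, 0, Nx, Ny, Nz),
               rhs_gfs + IDX4(GXX, 0, 0, 0, Nx, Ny, Nz),
               Nx, Ny, Nz);
}
```

<div><br></div><div>Calling rhs_eval(in_gfs, rhs_gfs, Nx, Ny, Nz) fills only the GXX block of rhs_gfs in this sketch. Whether a given compiler then vectorizes the inner loop is best confirmed from its optimization report or the disassembly, as discussed in the thread.</div>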