[Users] Using AVX-512 (Skylake and/or Knights Landing)

Erik Schnetter eschnetter at perimeterinstitute.ca
Wed Mar 7 13:52:28 CST 2018

A few days ago I was asked about the performance improvements from using
AVX-512 instructions in the Einstein Toolkit, i.e. the new machine
instructions offered by Intel's Knights Landing and Skylake CPUs. I thought
I should reply here in public since other might be interested as well.

If you want to learn more about what I write below, go to <
http://agner.org/optimize/> and look at the file "microarchitecture.pdf".

You might be aware that modern CPUs have instructions that execute multiple
operations simultaneously, called SIMD (Single Instruction, Multiple Data).
This is different from and in addition to the so-called "superscalar"
execution, which allows executing multiple instructions (up to three or
four) in a single cycle. Using SIMD instructions is an important ingredient
to obtaining good performance in a compute-bound inner loop.

The theoretical performance maximum of a single Skylake core can be
calculated as follows:

- use a fused multiply-add (FMA) instruction (i.e. use an instruction that
calculates x*y+z as single operation), counting as 2 Flop
- use SIMD to perform FMA operations on 8 vector elements simultaneously
(factor 8 speedup)
- arrange the code cleverly so that 2 of the 3 or 4 superscalar operations
can be FMA (the others will need to be load or store instructions) (another
factor 2 speedup)
- run this e.g. at 2.5 GHz

This leads to a theoretical performance maximum of 2 Flop * 8 * 2 * 2.5 GHz
= 80 GFlop/s per core. If a compute node has 40 cores, then one can achieve
3.2 TFlop/s per node.

Of course, these are all theoretical numbers, and all sorts of limitations
will get in the way of achieving this. The point of my note here is to
describe what some of these other limitations are, how to find out whether
they are getting in the way, and if so, how to avoid them.

If you are using a Knights Landing CPU (and if you are, you will know),
then the most annoying limitation is that the CPU can in principle execute
4 instructions per cycle -- but it can, unfortunately, only decode 2 or so
instructions per cycle. Now, the instructions that can be decoded can be
more complicated than the instructions that are executed. For mainly
historical reasons, the machine instructions that are read from memory by
the CPU are first decoded and then split into micro-operations, and what is
executed are actually micro-operations. If things go well, 2 instructions
are read and split into 4 micro-operations. In practice, things don't
usually go well, and the bottleneck of a calculation might actually be the
part of the CPU that reads instructions from memory, not the part that
executes them. However, this only applies to Knights Landing, not to
Skylake: Skylake has no problems in this respect. Since it's quite
difficult to analyze and remedy this, I will ignore this here.

To make the compiler use FMA instructions instead of separate multiply and
add instructions, it helps to use the "-ffast-math" flag. Without this
flag, the compiler cannot apply the associativity laws for real numbers to
floating-point numbers, and this might be necessary to re-order the
operations in such a way that FMA instructions can be used. Since
re-associating operations can change the rounding error that is incurred,
the IEEE standard does not allow it, assuming (quite preposterously in most
cases) that the programmer put great care in how floating-point expressions
are written. For example, the expression $x + y + z$ is evaluated as $(x +
y) + z$, and without "-ffast-math", this cannot be changed to $x + (y + z)$
or $(x + z) + y$. Of course, in the majority of all cases, the programmer
didn't intend to enforce a particular order. Allowing the compiler to
re-order expressions in such a way can improve performance for various
reasons. Apart from being able to use FMA instructions, reordering can also
improve instruction latency, or reduce the number of registers that are
needed. One should also add that GCC and Clang do follow the IEEE standard
by default, while the Intel compiler does not. So be careful when comparing
performance between compilers.

How can one check whether FMA instructions are used? The only reliable way
is to look at the generated assembler output. To do so, compile the code
(with optimizations enabled, of course), and locate the respective "*.o"
file that contains e.g. the BSSN right-hand side, or the hydrodynamics flux
calculations. The Unix command to disassemble an object file is "objdump"
("man objdump"). Try e.g. "objdump -d BSSN.o". It's best to redirect the
output to another file, e.g. "BSSN.s", since it will be large (very large),
and is best examined in an editor with a comfortable search function. On
MacOS, the command to disassemble would be "otool -tv BSSN.o"

The disassembled output might look somewhat like this:

0000000000000150        leaq    (%rdi,%rsi), %rax
0000000000000154        vmovupd (%rax,%rdx), %ymm1
0000000000000159        movq    %rsi, %rax
000000000000015c        subq    %rdx, %rax
000000000000015f        vsubpd  (%rdi,%rax), %ymm1, %ymm1
0000000000000164        movq    %rdx, %rax
0000000000000167        subq    %rsi, %rax
000000000000016a        negq    %rsi
000000000000016d        subq    %rdx, %rsi
0000000000000170        vsubpd  (%rdi,%rax), %ymm1, %ymm1
0000000000000175        vaddpd  (%rdi,%rsi), %ymm1, %ymm1
000000000000017a        vmulpd  %ymm0, %ymm1, %ymm0
000000000000017e        retq

This is the assembler code corresponding to a small function. The first
line is an externally visible label, here a C++-"mangled" function name.
You can use the command "c++filt" to convert this to a human-readable
function signature. The other lines start with a number (the address in the
object file), followed by a machine instruction and its arguments. There
are literally hundreds of different machine instructions, usually with
rather quaint semantics that only make sense in the context of 1976's
technology <https://en.wikipedia.org/wiki/Intel_8086>. On Intel CPUs, the
modern floating-point instructions start with the letter "v" for "vector".
The second-to-last letter indicates whether this is a scalar ("s") or SIMD
instruction ("p"). The last letter indicates whether this is a single
precision ("s") or double precision ("d") operation. For example, "vsubpd"
is SIMD floating point double-precision subtraction. There are three
arguments -- the first two are the source operands, the last is the
destination, i.e. where to the result is written. "%" indicates a register
(which is fast to access), and registers have quaint names as well. The
first letter of the floating-point register indicates whether a SIMD
instruction access 128 bits ("x"), 256 bits ("y"), or 512 bits ("z"). Since
a double precision number has 64 bits, the code here acts on 4-vectors.
(This code is taken from my laptop, which does not support 512-bit
vectors.) Parentheses indicate a pointer that is dereferenced, so e.g.
"(%rdi,%rax)" means that the rdi and rax registers are first added, and the
result is then interpreted as "double*" and dereferenced. Presumably,
either rdi or rax contains a pointer, and the other contains an offset.
Which is which needs to be inferred from context.

<http://www.felixcloutier.com/x86/> lists all instructions and gives a
detailed explanation. That's not a good resource for learning, but a good
resource for the expert to study the details. Intel also provides manuals
for their CPUs in the form of huge pdf files that are, however, much less
fun to search.

In this function, there are only four floating-point instructions
altogether: vsubpd (twice), vaddpd, and vmulpd (obviously subtraction,
addition, and multiplication). This function implements, in fact, a centred
second-order finite difference operator evaluating a second derivative,
namely a mixed derivative in the y-z-directions. There is also a "vmovupd"
instruction in the beginning, which loads a value from memory into a
register without performing a floating-point operation. The three additions
and subtractions calculate the differences, the final multiplication takes
the grid spacing into account. In this case, there is no fused multiply-add
instruction, since it doesn't fit with the order of operations. Thus we're
forced to forego at least half of the theoretical peak performance. Of
course, the integer instructions in between the floating point instructions
(mov, add ("lea" -- yes, instructions are quaint), sub, neg) also need to
be executed, so that we can't keep the floating-point unit 100% busy anyway.

Here is another function. (I am only showing really small functions here;
actual functions can be hundreds or thousands of instructions long.)

0000000000004e50        vmovapd (%rip), %ymm2
0000000000004e58        vmulpd  0x8(%rdi), %ymm2, %ymm2
0000000000004e5d        vmovupd 0x10(%rdi), %ymm1
0000000000004e62        vmovupd -0x18(%rdi), %ymm3
0000000000004e67        vmovupd -0x10(%rdi), %ymm4
0000000000004e6c        vmovupd 0x18(%rdi), %ymm5
0000000000004e71        vfmadd231pd     (%rip), %ymm4, %ymm2
0000000000004e7a        vfmsub132pd     (%rip), %ymm3, %ymm1
0000000000004e83        vaddpd  %ymm2, %ymm1, %ymm1
0000000000004e87        vmovupd -0x8(%rdi), %ymm2
0000000000004e8c        vfmadd132pd     (%rip), %ymm5, %ymm2
0000000000004e95        vaddpd  %ymm2, %ymm1, %ymm1
0000000000004e99        vmulpd  %ymm0, %ymm1, %ymm0
0000000000004e9d        retq

Note that the name of the function is interesting. It ends in ".isra.7",
which indicates that this function was automatically generated by GCC by
inter-procedural optimization <
GCC determined that it could copy the definition a function, and then
modify the copy to adapt it to one particular place where it is called,
producing a faster function. And this is indeed an optimized copy -- there
are no integer operations at all! This is quite rare, which is why I'm
showing it here. As above, the "vmovupd" instructions load data from
memory, and the "0x8(%rdi)" syntax indicates a pointer dereference with an
offset, here 8. In C, this would be "*(rdi+8)". This function calculates a
6th order accurate finite difference, namely the derivative in the
x-direction. Here, there are FMA instructions present -- three of them,
called "vfmadd231pd", "vfmsub132pd", and "vfmadd132pd". The "sub" variant
calculates "x*y-z" (probably -- I didn't look it up). There are also
variants (not used here) that calculate "-x*y+z" and "-x*y-z". The
permutation of "123" in the instruction name indicates which of the 3
arguments corresponds to x, y, and z (I think -- I didn't look it up).
Overall, while there are no integer operations, there are still a number of
move instructions that load data from memory, so this function won't be
achieving 100% of the theoretical floating point performance either.

In my experience, there is always something getting in the way of achieving
100% of the theoretical floating-point performance. Achieving 20% on a
Skylake CPU (or 10% on a Knights Landing CPU) is, in my opinion, very good,
certainly at the level of using bold face in funding proposals or opening a
bottle of Whisky from the last century. Of course, achieving 20% of the
theoretical peak performance in a loop kernel is only an interesting
achievement if a large fraction of the overall execution time is spent
there. If most of the time is spent finding horizons or performing in I/O,
or if there is a significant multi-threading overhead, then those problems
need to be addressed instead.

The issue that is most likely spoiling performance are memory accesses.
Let's play another game with typical numbers:

- memory bandwidth: 100 GByte/s
- 8 byte/double
- 2 memory banks ("sockets") per node
- 20 cores per socket
- memory needs to be accessed to both read inputs and write results
- combined: 100 GByte/s / (8 byte/double) * 2 / 40 / 2 = 0.3 GMop/s

That is, we can access about 0.3e9 double precision numbers per second per
core. If we compare this to the 80e9 floating-point operations that a core
can perform, and if we take these numbers as order of magnitude only (which
is what they are), then we have a disparity of a factor of 100: For each
double precision number read from or written to memory, we can -- on
average -- perform 100 floating-point operations for free (i.e. in the same
time). Put differently, each number read from memory needs to be re-used
about 100 times. If not, then memory access is the bottleneck, and CPU
floating point performance is simply irrelevant.

The answer, of course, is using a cache. A cache is a very small but fast
memory that sits in between the CPU and the memory. Caches are complicated
(I won't go into details here), but in a first order approximation we can
assume that a cache remembers that last n bytes that were read from or
written to memory, and will provide instantaneous access to these data. To
make things more complicated, Intel CPUs have several levels of cache
daisy-chained. There is an L1 cache holding 32 kByte, and L2 cache holding
256 kByte or more, and an L3 cache holding several MByte. The L1 and L2
cache are specific to a particular core, the L3 cache is shared between all
cores in the same NUMA domain, and has about 1 MByte per core.

The L1 cache is so small that it can usually hold only data for a few grid
points when evaluating the BSSN RHS (if at all), or a few cells for a hydro
code. If you assume a SIMD vector length of 8, and 8 bytes per double, then
the L1 cache can hold 512 separate values. Assume a bit of overhead etc.,
and you'll see that it's difficult to fit even a full BSSN RHS evaluation
in there. If a single loop iteration doesn't fit into the L1 cache, then
performance suffers dramatically, and it then makes sense to split the loop
kernel into several, even if this involves repeating a few operations. For
the BSSN equations, McLachlan uses 3 sequential loops instead of a single
one, although the reason there is that this is required to make the
instructions (!) fit into the L1 instruction cache (!), and isn't motivated
by the size of the data cache. Instruction caches are even more complicated
than data caches since it's not even officially documented what exactly is
cached; however, heroic efforts have been made to reverse engineer this to
be able to write efficient compilers. (See Agner's document above.)

Assuming that a code's inner loop has been written in such a way that the
temporary variables used during a single iteration fit into the L1 cache,
the question is how data can be reused. Data reuse happens via derivative
operators that access neighbouring points. The usual approach taken here is
to use "loop tiling" (or "loop blocking"). In principle, compilers should
be able to do this automatically; in practice, they do not. The idea is
rather straightforward: One splits a large 3D loops into small tiles of a
certain size (e.g. 8x4x4), and then has an outer loop nest iterating over
tiles, and an inner loop nest iterating over the points within a tile. The
inner loops can then be aggressively optimized, in particular if the number
of iterations is known in advance. That might make it necessary to enforce
that the total domain size is always a multiple of 8 (why not?), or one can
add unused padding points at the domain edge. This can be well worth the
performance benefit. A tile size of 8x4x4 allows using a single SIMD vector
in the x direction, maybe (partially?) unrolling the loop in the y
direction, etc. Obviously, the optimal tile sizes need to be determined via
trial and error (cache behaviour is complicated and cannot really be

The outer loop nest can then also be parallelized via OpenMP. Communication
between OpenMP threads, which includes starting or stopping a thread etc.,
happens via the L3 caches that exchange data between each other. As a rule
of thumb, this takes several hundred nanoseconds in the best possible case,
and about 1 μs in practice (again, all numbers are orders of magnitude).
Compare this to the floating-point performance, and it becomes clear that
even handling a single tile might not be enough to offset multi-threading
overhead. It makes sense to group multiple tiles into a group (e.g. a
"pencil" of tiles in the x direction) that is dispatched by OpenMP.

Regarding OpenMP: I find it is crucial for performance to bind OpenMP
threads to a particular core during execution. The operating system usually
does not do a good job, and whenever a thread moves, all its data need to
be moved to higher cache levels or even the main memory, which is expensive.

While memory is accessed in terms of bytes, this is these days merely a
fantasy that is upheld for compatibility with programming languages such as
C. Actual memory is highly parallel, and is fastest if accessed in
multiples of a "cache line", which is today 64 bytes on high-end Intel
CPUs. In other words, reading one byte reads 64 bytes from memory, and
writing one byte requires first reading 64 bytes, then overwriting 1 byte,
and then writing back 64 bytes. (Yes, writing 64 bytes can thus be faster
than writing 1 byte.) If two cores write to the same cache line, then this
cache line needs to be repeatedly read and written back to the L3 cache.
This is also called "false sharing", and should be avoided in
performance-critial multi-threaded code. (Reading from the same cache line
is fine -- only writing is slow, since writing requires exclusive access.)
To avoid this, it is necessary to align arrays to ensure that each tile
starts at a multiple of 64 bytes. This can be done manually, and there are
also special function posix_memalign for this. To ensure that not just the
first point, but all x boundary points of a tile start at a multiple of 64
bytes, it is also necessary to ensure that the number of points in the x
direction is a multiple of 8, padding the array with unused points if

The abovementioned performance stumbling blocks have the endearing tendency
to interact in highly nonlinear (read: almost random) ways. At times,
changing one thing can have a dramatic unintended side effect onto another
thing, which makes it difficult to understand what is going on. It can even
be that making one thing worse can have a side effect that is several times
larger, and which improves things overall. One has to be quite careful
about various "effects" that people are finding. I recall a code with
comments stating that copying the extrinsic curvature into a temporary
array made things faster. By the time I examined the code, I could not
confirm this. In principle, copying data uses more space in the cache and
thus reduces performance. Who knows what other effects this copying had --
maybe the compiler treated the code differently, maybe it inhibited an
"optimization" that made things worse in this case, etc. Blindly trying
things to improve performance is good if it leads to an improved
performance, but it cannot replace actual understanding of performance
bottlenecks, and one should not fall prey to over-generalizing findings
that worked in one case.

If you are interested in understanding your code's performance, then the
most important step I recommend it to measure how fast the code actually
is, when compared to the theoretically possible performance. This is,
unfortunately, quite tedious. One would hope that there are tools that
simply output e.g. "5% of peak -- optimize these three functions", but HPC
as a field is too far away from mainstream programming to offer such tools.
Hence one has to calculate these numbers by hand. (And if a tool promises
to measure these things automatically at run time, then be very careful: BY
DEFINITION, this percentage cannot be measured at run time. See below.)

The three traditionally important performance measures in HPC are:
- floating-point operations (Flop/sec)
- memory accesses (bytes/sec)
- communication bandwidth (bytes/sec)

Latencies are also important, but out-of-order execution of CPU instruction
and memory accesses, as well as compiler optimizations, usually make these
less important in practice. The exception is communication latency, which
isn't optimized by the compiler since e.g. MPI communication is handled by
a library and is not a built-in language feature.

As a note -- be careful with units. I like to measure bandwidth in "double
precision numbers per second", since that's what interests me. Industry
standard is "bytes per second". Hardware vendors are easily tempted to
quote "bits per second", since this leads to larger numbers. And since they
want to sound clever, they never spell out quite what they mean. "GB/s"
could mean either "GigaByte" or "GigaBit". (And then theres "GT/s", which
are "GigaTransactions per second", where a "transaction" probably has 64
bits. Maybe. There's a long Wikipedia article about DDR memory.)

For each of these three measures, the hardware imposes a hard limit. This
limit can, with suitable study of literature, be determined ahead of time.
There should also be benchmarks that are designed to measure the ideal
case, and which should come close to obtaining this performance. For a
typical HPC system, its web site will quote the first and third number, but
not always the second. As life has it, the second (memory bandwidth) might
actually be the most relevant one. It is also important to recall to which
subsystem the performance measure applies: average memory bandwidth per
core? per socket? per node? ... how is a socket defined?

To see how code a particular code is doing, one needs to determine the
requirements of the algorithm that the code implements. This is a property
of the algorithm, not of the implementation. This is thus -- and I'm
repeating myself here -- not something that can be measured at run time. In
other words, one needs to count the number of USEFUL operations, not the
number of operations that a code or a CPU performs after optimizations have
been applied. (It would otherwise be very easy to artificially inflate the
number of operations.) In fact, modern CPUs perform "speculative execution"
of instructions, and in many cases, speculatively executed instructions
yield invalid results that are then discarded. Obviously, speculatively
executed instructions must not be counted when measuring the performance of
a code. Compilers may also insert additional floating-point operations or
memory accesses if they find it likely that this will speed up execution.
Obviously, these additional operations must also not be counted.

Thus, to determine the fraction of the theoretical peak performance that a
code achieves, one needs to count (manually, most likely; or by using an
instrumented C++ class) the number of floating-point operations and memory
accesses that the algorithm (e.g. fourth-order finite differences with an
fourth-order Runge-Kutta integrator) requires, and compare this to the
running time of the code and the theoretical peak performance of the
system. This tells you how far off you are, and also which of the
performance criteria (floating-point, memory accesses, etc.) are the

To make a last attempt to drive my point home: Measuring "fraction of peak
performance" only makes sense if one counts only the useful operations, not
if one counts the operations that happened to be executed at run time.

A good tool for seeing which machine instructions take how long is "perf".
perf <https://perf.wiki.kernel.org/> is a Linux utility that is distributed
with the Linux kernel. You can use "perf record" when running an executable
(for a few seconds -- more is probably not needed), and "perf report" to
interactively analyze the results.

- if much time is spent in "div" or "sqrt" or "exp" instructions, then
using approximate SIMD implementations might help
- if there are a few slow instructions that access memory, then optimizing
cache use might help
- if the slow instructions are spread all over the place, then the code
might be well balanced, and further optimizations are difficult
- perf can be used to measure any number of things, such as e.g. CPU cycles
(run time), memory accesses, cache misses, etc.

Good luck!


Erik Schnetter <eschnetter at perimeterinstitute.ca>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.einsteintoolkit.org/pipermail/users/attachments/20180307/b4587616/attachment-0001.html 

More information about the Users mailing list