[Commits] [svn:einsteintoolkit] incoming/MemSpeed/ (Rev. 89)

schnetter at cct.lsu.edu schnetter at cct.lsu.edu
Fri Jun 21 23:54:26 CDT 2013


User: eschnett
Date: 2013/06/21 11:54 PM

Modified:
 /MemSpeed/
  README, configuration.ccl, interface.ccl, schedule.ccl
 /MemSpeed/doc/
  documentation.tex
 /MemSpeed/src/
  memspeed.cc

Log:
 Add documentation

File Changes:

Directory: /MemSpeed/
=====================

File [modified]: README
Delta lines: +6 -4
===================================================================
--- MemSpeed/README	2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/README	2013-06-22 04:54:25 UTC (rev 89)
@@ -1,9 +1,11 @@
 Cactus Code Thorn MemSpeed
-Author(s)    : Erik Schnetter <schnetter at gmail.com>
-Maintainer(s): Erik Schnetter <schnetter at gmail.com>
-Licence      : n/a
+Author(s)    : Erik Schnetter <eschnetter at perimeterinstitute.ca>
+Maintainer(s): Erik Schnetter <eschnetter at perimeterinstitute.ca>
+Licence      : LGPL
 --------------------------------------------------------------------------
 
 1. Purpose
 
-Determine the latencies and bandwidths of caches and main memory.
+Determine the speed of the CPU, as well as latencies and bandwidths of
+caches and main memory. These provide ideal but real-world values
+against which the performance of other routines can be compared.

File [modified]: configuration.ccl
Delta lines: +1 -0
===================================================================
--- MemSpeed/configuration.ccl	2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/configuration.ccl	2013-06-22 04:54:25 UTC (rev 89)
@@ -1,3 +1,4 @@
 # Configuration definitions for thorn MemSpeed
 
+# Floating point benchmarks use explicit vectorization
 REQUIRES Vectors

File [modified]: interface.ccl
Delta lines: +1 -0
===================================================================
--- MemSpeed/interface.ccl	2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/interface.ccl	2013-06-22 04:54:25 UTC (rev 89)
@@ -6,6 +6,7 @@
 
 
 
+# Obtain information about caches and memory sizes from thorn hwloc
 CCTK_INT FUNCTION GetCacheInfo1                            \
     (CCTK_POINTER_TO_CONST ARRAY OUT names,                \
      CCTK_INT              ARRAY OUT types,                \

File [modified]: schedule.ccl
Delta lines: +1 -1
===================================================================
--- MemSpeed/schedule.ccl	2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/schedule.ccl	2013-06-22 04:54:25 UTC (rev 89)
@@ -4,4 +4,4 @@
 {
   LANG: C
   OPTIONS: meta
-} "Measure memory and cache speeds"
+} "Measure CPU, memory, cache speeds"

Directory: /MemSpeed/doc/
=========================

File [modified]: documentation.tex
Delta lines: +349 -26
===================================================================
--- MemSpeed/doc/documentation.tex	2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/doc/documentation.tex	2013-06-22 04:54:25 UTC (rev 89)
@@ -79,63 +79,386 @@
 \begin{document}
 
 % The author of the documentation
-\author{Erik Schnetter \textless schnetter at gmail.com\textgreater}
+\author{Erik Schnetter \textless eschnetter at perimeterinstitute.ca\textgreater}
 
 % The title of the document (not necessarily the name of the Thorn)
 \title{MemSpeed}
 
 % the date your document was last changed, if your document is in CVS,
-% please use:
-%    \date{$ $Date: 2004-01-07 14:12:39 -0600 (Wed, 07 Jan 2004) $ $}
-\date{June 17 2013}
+\date{June 22, 2013}
 
 \maketitle
 
 % Do not delete next line
 % START CACTUS THORNGUIDE
 
-% Add all definitions used in this documentation here
-%   \def\mydef etc
-
-% Add an abstract for this thorn's documentation
 \begin{abstract}
-
+  Determine the speed of the CPU, as well as latencies and bandwidths
+  of caches and main memory. These provide ideal but real-world
+  values against which the performance of other routines can be
+  compared.
 \end{abstract}
 
-% The following sections are suggestive only.
-% Remove them or add your own.
+\section{Measuring Maximum Speeds}
 
-\section{Introduction}
+This thorn measures the maximum practical speed that can be attained
+on a particular system. This speed will be somewhat lower than the
+theoretical peak performance listed in a system's hardware
+description.
 
-\section{Physical System}
+This thorn measures
+\begin{itemize}
+\item CPU floating-point performance (GFlop/s),
+\item CPU integer performance (GIop/s),
+\item Cache/memory read latency (ns),
+\item Cache/memory read bandwidth (GByte/s),
+\item Cache/memory write latency (ns),
+\item Cache/memory write bandwidth (GByte/s).
+\end{itemize}
 
-\section{Numerical Implementation}
+Theoretical performance values for memory access are often quoted in
+slightly different units. For example, bandwidth is often measured in
+GT/s (Giga-Transactions per second), where a transaction transfers a
+certain number of bytes, usually a cache line (e.g. 64 bytes).
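+
+As a purely illustrative (hypothetical) example, an interface quoted
+at $4$~GT/s that transfers one $64$-byte cache line per transaction
+would correspond to
+\[
+  4 \cdot 10^{9}\,\mathrm{T/s} \times 64\,\mathrm{Byte/T}
+  = 256\,\mathrm{GByte/s}.
+\]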
 
-\section{Using This Thorn}
+A detailed understanding of the results requires some knowledge of how
+CPUs, caches, and memories operate. \cite{lmbench-usenix} provides a
+good introduction to this as well as to benchmark design.
+\cite{mhz-usenix} is also a good read (by the same authors), and their
+(somewhat dated) software \emph{lmbench} is available at
+\cite{lmbench}.
 
-\subsection{Obtaining This Thorn}
 
-\subsection{Basic Usage}
 
-\subsection{Special Behaviour}
+\section{Algorithms}
 
-\subsection{Interaction With Other Thorns}
+We use the following algorithms to determine the maximum speeds. We
+believe these algorithms and their implementations are adequate for
+current architectures, but this may need to change in the future.
 
-\subsection{Examples}
+Each benchmark is run $N$ times, where $N$ is automatically chosen
+such that the total run time is larger than $1$ second. (If a
+benchmark finishes too quickly, then $N$ is increased and the
+benchmark is repeated.)
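+
+A rough sketch of this harness (the helper name is illustrative; it
+assumes OpenMP is available for \texttt{omp\_get\_wtime}, otherwise
+the thorn falls back to \texttt{gettimeofday}):
+\begin{verbatim}
+  #include <omp.h>
+
+  // Run "benchmark(count)" repeatedly, increasing the iteration
+  // count until one run takes at least min_elapsed seconds;
+  // return the time per iteration of the final run
+  template <typename F>
+  double time_benchmark(F benchmark, double min_elapsed = 1.0)
+  {
+    long count = 1000000;
+    for (;;) {
+      const double t0 = omp_get_wtime();
+      benchmark(count);
+      const double elapsed = omp_get_wtime() - t0;
+      if (elapsed >= min_elapsed) return elapsed / count;
+      // Aim 10% beyond the target run time, but grow the iteration
+      // count by at least 2x and at most 10x per attempt
+      double factor = 1.1 * min_elapsed / elapsed;
+      if (factor < 2.0) factor = 2.0;
+      if (factor > 10.0) factor = 10.0;
+      count = (long)(count * factor);
+    }
+  }
+\end{verbatim}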
 
-\subsection{Support and Feedback}
+\subsection{CPU floating-point performance}
+\label{sec:flop}
 
-\section{History}
+CPUs for HPC systems are typically tuned for dot-product-like
+operations, where multiplications and additions alternate. We measure
+the floating-point performance with the following calculation:
+\begin{verbatim}
+  for (int i=0; i<N; ++i) {
+    s := c_1 * s + c_2
+  }
+\end{verbatim}
+where $s$ is suitably initialized and $c_1$ and $c_2$ are suitably
+chosen to avoid overflow, e.g. $s=1.0$, $c_1=1.1$, $c_2=-0.1$.
 
-\subsection{Thorn Source Code}
+$s$ is a double precision variable. The loop over $i$ is explicitly
+unrolled $8$ times, is explicitly vectorized using LSUThorns/Vectors,
+and uses fma (fused multiply-add) instructions where available. This
+should ensure that the loop runs very close to the maximum possible
+speed. As usual, each (scalar) multiplication and addition is counted
+as one Flop (floating point operation).
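+
+A simplified scalar sketch of one such kernel (the function name is
+illustrative; the actual implementation uses the vector type and
+\texttt{kmadd} from LSUThorns/Vectors instead of plain doubles):
+\begin{verbatim}
+  #include <cstddef>
+
+  double flop_kernel(std::ptrdiff_t N)
+  {
+    // 8 independent accumulators so the multiply-adds can overlap
+    double s0=1.0, s1=1.0, s2=1.0, s3=1.0,
+           s4=1.0, s5=1.0, s6=1.0, s7=1.0;
+    for (std::ptrdiff_t i=0; i<N; ++i) {
+      s0 = 1.1*s0 - 0.1;  s1 = 1.1*s1 - 0.1;
+      s2 = 1.1*s2 - 0.1;  s3 = 1.1*s3 - 0.1;
+      s4 = 1.1*s4 - 0.1;  s5 = 1.1*s5 - 0.1;
+      s6 = 1.1*s6 - 0.1;  s7 = 1.1*s7 - 0.1;
+    }
+    // Consume the results so the compiler cannot drop the loop;
+    // Flops per call: N iterations * 8 updates * 2 operations
+    volatile double use = s0+s1+s2+s3+s4+s5+s6+s7;
+    return use;
+  }
+\end{verbatim}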
 
-\subsection{Thorn Documentation}
+\subsection{CPU integer performance}
 
-\subsection{Acknowledgements}
+Many modern CPUs can handle integers in two different ways, treating
+integers either as data, or as pointers and array indices. For
+example, integers may be stored in two different sets of registers
+depending on their use. We are here interested in the performance of
+pointers and array indices. Most modern CPUs cannot vectorize these
+operations (some GPUs can), and we therefore do not employ
+vectorization in this benchmark.
 
+In general, array index calculations require addition and
+multiplication. For example, accessing the element $A(i,j)$ of a
+two-dimensional array requires calculating $i + n_i \cdot j$, where
+$n_i$ is the number of elements allocated in the $i$ direction.
 
+However, general integer multiplications are expensive, and are not
+necessary if the array is accessed in a loop, since a running index
+$p$ can instead be kept. Accessing neighbouring elements (e.g. in
+stencil calculations) requires only addition and multiplication with
+small constants. In the example above, assuming that $p$ is the linear
+index corresponding to $A(i,j)$, accessing $A(i+1,j)$ requires
+calculating $p+1$, and accessing $A(i,j+2)$ requires calculating $p +
+2 \cdot n_i$. We thus base our benchmark on integer additions and
+integer multiplications with small constants.
+
+We measure the integer performance with the following
+calculation:
+\begin{verbatim}
+  for (int i=0; i<N; ++i) {
+    s := b + c * s
+  }
+\end{verbatim}
+where $b$ is a constant defined at run time, and $c$ is a small
+integer constant ($c = 1 \ldots 8$) known at compile time.
+
+$s$ is an integer variable of the same size as a pointer, i.e. 64 bit
+on a 64-bit system. The loop over $i$ is explicitly unrolled $8$
+times, each time with a different value for $c$. Each addition and
+multiplication is counted as one Iop (integer operation).
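+
+A simplified sketch of this kernel (the function name is
+illustrative; unsigned pointer-sized integers are used here so that
+wrap-around is well defined, whereas the actual implementation
+derives the accumulators from array addresses):
+\begin{verbatim}
+  #include <cstddef>
+  #include <cstdint>
+
+  std::uintptr_t iop_kernel(std::ptrdiff_t N, std::uintptr_t b)
+  {
+    // 8 accumulators, each multiplied by a different small
+    // compile-time constant c = 1 ... 8
+    std::uintptr_t s0=0, s1=0, s2=0, s3=0, s4=0, s5=0, s6=0, s7=0;
+    for (std::ptrdiff_t i=0; i<N; ++i) {
+      s0 = b + 1*s0;  s1 = b + 2*s1;
+      s2 = b + 3*s2;  s3 = b + 4*s3;
+      s4 = b + 5*s4;  s5 = b + 6*s5;
+      s6 = b + 7*s6;  s7 = b + 8*s7;
+    }
+    // Iops per call: N iterations * 8 updates, each counted as
+    // one addition and one multiplication
+    return s0+s1+s2+s3+s4+s5+s6+s7;
+  }
+\end{verbatim}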
+
+\subsection{Cache/memory read latency}
+
+Memory read access latency is measured by reading small amounts of
+data from random locations in memory. This random access pattern
+defeats caches, since it offers no locality that a cache could
+exploit. To ensure that the read operations are executed
+sequentially, each read operation needs to depend on the previous. The
+idea for the algorithm below was taken from \cite{lmbench-usenix}.
+
+To implement this, we set up a large linked list where the elements
+are randomly ordered. Traversing this linked list then measures the
+memory read latency. This is done as in the following pseudo-code:
+\begin{verbatim}
+  struct L { L* next; };
+  ... set up large circular list ...
+  L* ptr = head;
+  for (int i=0; i<N; ++i) {
+    ptr = ptr->next;
+  }
+\end{verbatim}
+
+To reduce the overhead of the for loop, we explicitly unroll the loop
+100 times.
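+
+A sketch of how such a randomly ordered circular list can be built
+(the helper name is illustrative, and std::shuffle is used here for
+clarity; the thorn itself steps through the array with a fixed large
+pseudo-random offset instead):
+\begin{verbatim}
+  #include <algorithm>
+  #include <cstddef>
+  #include <numeric>
+  #include <random>
+  #include <vector>
+
+  // Each element stores a pointer to the next element; following
+  // the pointers visits all elements in a pseudo-random order
+  std::vector<void*> make_chain(std::ptrdiff_t nmax)
+  {
+    std::vector<void*> array(nmax);
+    std::vector<std::ptrdiff_t> order(nmax);
+    std::iota(order.begin(), order.end(), 0);
+    std::mt19937_64 rng(42);
+    std::shuffle(order.begin() + 1, order.end(), rng);
+    for (std::ptrdiff_t k = 0; k < nmax; ++k)
+      array[order[k]] = &array[order[(k + 1) % nmax]];
+    return array;
+  }
+
+  // Traversal, as in the pseudo-code above:
+  //   void* ptr = &array[0];
+  //   ptr = *(void**)ptr;   // repeated, explicitly unrolled
+\end{verbatim}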
+
+\label{sec:sizes}
+We use the \emph{hwloc} library to determine the sizes of the
+available data caches, of the NUMA-node-local memory, and of the
+global memory. We perform this benchmark once for each cache level,
+and once each for the local and global memory:
+\begin{itemize}
+\item for a cache, the list occupies 3/4 of the cache;
+\item for the local memory, the list occupies 1/2 of the memory;
+\item for the global memory, the list skips the local memory, and
+  occupies 1/4 of the remaining global memory.
+\end{itemize}
+
+To skip the local memory, we allocate an array of the size of the
+local memory. Assuming that the operating system prefers to allocate
+local memory first, this ensures that all further allocations use
+non-local memory. We do not test this assumption.
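+
+A minimal sketch of this ``skip'' allocation (the helper name is
+illustrative; initializing the array forces the operating system to
+actually back it with physical pages):
+\begin{verbatim}
+  #include <cstddef>
+  #include <vector>
+
+  // Allocate and touch skipsize bytes that the benchmark will never
+  // use, so that later allocations land in non-local memory
+  std::vector<char> make_skiparray(std::ptrdiff_t skipsize)
+  {
+    return std::vector<char>(skipsize, 1);
+  }
+\end{verbatim}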
+
+\subsection{Cache/memory read bandwidth}
+
+Memory read access bandwidth is measured by reading a large,
+contiguous amount of data from memory. This access pattern benefits
+from caches (if the amount is less than the cache size), and also
+benefits from prefetching (which may be performed either by the
+compiler or by the hardware). This thus presents an ideal case where
+memory is read as fast as possible.
+
+To ensure that data are actually read from memory, it is necessary to
+consume the data, i.e. to perform some operations on them. We assume
+that a floating-point dot-product is among the fastest operations, and
+thus use the following algorithm:
+\begin{verbatim}
+  for (int i=0; i<N; i+=2) {
+    s := m[i] * s + m[i+1]
+  }
+\end{verbatim}
+
+As in section \ref{sec:flop} above, $s$ is a double precision
+variable. $m[i]$ denotes the memory accesses. The loop over $i$ is
+explicitly unrolled $8$ times, is explicitly vectorized using
+LSUThorns/Vectors, and uses fma (fused multiply-add) instructions
+where available. This should ensure that the loop runs very close to
+the maximum possible speed.
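+
+A scalar sketch of this kernel (the function name is illustrative;
+the vectorization and full unrolling of the actual implementation are
+omitted, and nmax is assumed to be a multiple of 8):
+\begin{verbatim}
+  #include <cstddef>
+
+  // Stream through the array, consuming the data with multiply-add
+  // operations; the caller must use the return value
+  double read_kernel(const double* array, std::ptrdiff_t nmax)
+  {
+    double s0=0.0, s1=0.0, s2=0.0, s3=0.0;
+    for (std::ptrdiff_t n=0; n<nmax; n+=8) {
+      s0 = array[n+0]*s0 + array[n+1];
+      s1 = array[n+2]*s1 + array[n+3];
+      s2 = array[n+4]*s2 + array[n+5];
+      s3 = array[n+6]*s3 + array[n+7];
+    }
+    // Bytes read per call: nmax * sizeof(double)
+    return s0+s1+s2+s3;
+  }
+\end{verbatim}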
+
+To measure the bandwidth of each cache level as well as the local and
+global memory, the same array sizes as in section \ref{sec:sizes} are
+used.
+
+\subsection{Cache/memory write latency}
+\label{sec:write-latency}
+
+The notion of a ``write latency'' does not really make sense, as write
+operations to different memory locations do not depend on each other.
+This benchmark therefore measures the rate at which independent write
+requests can be handled. However, since writing partial cache lines
+also requires reading them, this benchmark is also influenced by read
+performance.
+
+To measure the write latency, we use the following algorithm, writing
+a single byte to random locations in memory:
+\begin{verbatim}
+  char array[N];
+  char* ptr = ...;
+  for (int i=0; i<N; ++i) {
+    *ptr = 1;
+    ptr += ...;
+  }
+\end{verbatim}
+
+In the loop, the pointer is increased by a pseudo-random amount,
+while ensuring that it stays within the bounds of the array.
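+
+A sketch of this kernel that mirrors the implementation (the function
+name is illustrative): the array size is a power of two, so the
+modulo operation reduces to a bit mask, and the step is a fixed large
+constant.
+\begin{verbatim}
+  #include <cstddef>
+  #include <vector>
+
+  void write_latency_kernel(std::vector<char>& array,
+                            std::ptrdiff_t max_count)
+  {
+    const std::ptrdiff_t size = array.size();   // a power of two
+    const std::ptrdiff_t size_mask = size - 1;
+    const std::ptrdiff_t offset = 0xa1d2d5ff;   // "random" step
+    std::ptrdiff_t n = 0;
+    for (std::ptrdiff_t count=0; count<max_count; ++count) {
+      array[n & size_mask] = 1;  // single-byte store
+      n += offset;
+    }
+  }
+\end{verbatim}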
+
+To measure the write latency of each cache level as well as the local
+and global memory, the same array sizes as in section \ref{sec:sizes}
+are used as a starting point. For efficiency reasons, these sizes are
+then rounded down to the nearest power of two.
+
+\subsection{Cache/memory write bandwidth}
+
+Memory write access bandwidth is measured by writing a large,
+contiguous amount of data to memory, in a manner very similar to
+measuring read bandwidth. The major difference is that the written
+data do not need to be consumed by the CPU, which simplifies the
+implementation.
+
+We use \emph{memset} to write data into an array, assuming that the
+memset function is already heavily optimized.
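+
+A sketch of one timed pass (the function name is illustrative; the
+volatile read of a single element keeps the compiler from optimizing
+the writes away):
+\begin{verbatim}
+  #include <cstddef>
+  #include <cstring>
+  #include <vector>
+
+  void write_bandwidth_pass(std::vector<char>& array,
+                            std::ptrdiff_t max_count)
+  {
+    const std::size_t size = array.size();
+    for (std::ptrdiff_t count=0; count<max_count; ++count) {
+      std::memset(&array[0], count % 256, size);
+      volatile char use = array[std::size_t(count) % size];
+      (void)use;
+    }
+    // Bytes written: max_count * size
+  }
+\end{verbatim}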
+
+To measure the bandwidth of each cache level as well as the local and
+global memory, the same array sizes as in section \ref{sec:sizes} are
+used.
+
+
+
+\section{Caveats}
+
+This benchmark should work out of the box on all systems.
+
+The only major caveat is that it allocates more than half of the
+system's memory for its benchmarks, and this can severely degrade
+system performance if run on an interactive system (laptop or
+workstation). If run with MPI, then only the root process will run the
+benchmark.
+
+Typical memory bandwidth numbers are in the range of multiple GByte/s.
+Given today's memory amounts of many GByte, this means that this
+benchmark will run for tens of seconds. In addition to benchmarking
+memory access, the operating system also needs to allocate the memory,
+which is surprisingly slow. A typical total execution time is several
+minutes.
+
+
+
+\section{Example Results}
+
+The XSEDE system Kraken at NICS reports the following performance
+numbers (measured on June 21, 2013):
+\begin{verbatim}
+INFO (MemSpeed): Measuring CPU, cache, and memory speeds:
+  CPU floating point performance: 10.396 Gflop/sec for each PU
+  CPU integer performance: 6.23736 Giop/sec for each PU
+  Read latency:
+    D1 cache read latency: 1.15434 nsec
+    L2 cache read latency: 5.82695 nsec
+    L3 cache read latency: 29.4962 nsec
+    local memory read latency: 135.264 nsec
+    global memory read latency: 154.1 nsec
+  Read bandwidth:
+    D1 cache read bandwidth: 72.3597 GByte/sec for 1 PUs
+    L2 cache read bandwidth: 20.7431 GByte/sec for 1 PUs
+    L3 cache read bandwidth: 9.51587 GByte/sec for 6 PUs
+    local memory read bandwidth: 5.19518 GByte/sec for 6 PUs
+    global memory read bandwidth: 4.03817 GByte/sec for 12 PUs
+  Write latency:
+    D1 cache write latency: 0.24048 nsec
+    L2 cache write latency: 2.8294 nsec
+    L3 cache write latency: 9.32924 nsec
+    local memory write latency: 47.5912 nsec
+    global memory write latency: 58.1591 nsec
+  Write bandwidth:
+    D1 cache write bandwidth: 39.3172 GByte/sec for 1 PUs
+    L2 cache write bandwidth: 12.9614 GByte/sec for 1 PUs
+    L3 cache write bandwidth: 5.5553 GByte/sec for 6 PUs
+    local memory write bandwidth: 4.48227 GByte/sec for 6 PUs
+    global memory write bandwidth: 3.16998 GByte/sec for 12 PUs
+\end{verbatim}
+
+The XSEDE system Kraken at NICS also reports the following system
+configuration via thorn hwloc (reported on June 21, 2013):
+\begin{verbatim}
+INFO (hwloc): Extracting CPU/cache/memory properties:
+  There are 1 PUs per core (aka hardware SMT threads)
+  There are 1 threads per core (aka SMT threads used)
+  Cache (unknown name) has type "data" depth 1
+    size 65536 linesize 64 associativity 2 stride 32768, for 1 PUs
+  Cache (unknown name) has type "unified" depth 2
+    size 524288 linesize 64 associativity 16 stride 32768, for 1 PUs
+  Cache (unknown name) has type "unified" depth 3
+    size 6291456 linesize 64 associativity 48 stride 131072, for 6 PUs
+  Memory has type "local" depth 1
+    size 8589541376 pagesize 4096, for 6 PUs
+  Memory has type "global" depth 1
+    size 17179082752 pagesize 4096, for 12 PUs
+\end{verbatim}
+
+Kraken's CPUs identify themselves as
+\verb+6-Core AMD Opteron(tm) Processor 23 (D0)+.
+
+Let us examine and partially interpret these numbers. (While the
+particular results will be different for other systems, the general
+behaviour will often be similar.)
+
+Kraken's compute nodes run at 2.6~GHz and execute 4 Flop/cycle,
+leading to a theoretical peak performance of 10.4~GFlop/s. Our
+measured number of 10.396~GFlop/s is surprisingly close.
+
+For integer performance, we expect half of the floating point
+performance since we cannot make use of vectorization, which yields a
+factor of two on this architecture. The reported number of 6.2~GIop/s
+is somewhat larger. We assume that the compiler found some way to
+optimize the code that we did not foresee, i.e. that this benchmark is
+not optimally designed. Still, the results are close.
+
+The read latency for the D1 cache is here difficult to measure
+exactly, since it is so fast, and the cycle time and thus the natural
+uncertainty is about 0.38~ns. (We assume this could be measured
+accurately with sufficient effort, but we do not completely trust our
+benchmark algorithm.) We thus conclude that the D1 cache has a latency
+of about 1~ns or less. A similar argument holds for the D1 read
+bandwidth -- we conclude that the true bandwidth is 72~GB/s or higher.
+
+The L2 cache has a higher read latency and a lower read bandwidth (it
+is also significantly larger than the D1 cache). We now consider
+these performance numbers to be trustworthy.
+
+The L3 cache again has slightly slower read performance than the L2
+cache. The major difference is that the L3 cache is shared between six
+cores, so that the bandwidth will be shared if several cores access it
+simultaneously.
+
+The local memory has a slightly lower read bandwidth, and a
+significantly higher read latency than the L3 cache. The global memory
+is measurably slower than the local memory, but not by a large margin.
+
+The write latencies are, as expected, lower than the read latencies
+(see section \ref{sec:write-latency}).
+
+The write bandwidths are, surprisingly, only about half as large as
+the read bandwidths. This could either be a true property of the
+system architecture, or may be caused by write-allocating cache lines.
+The latter means that, as soon as a cache line is partially written,
+the cache fills it by reading from memory or the next higher cache
+level, although this is not actually necessary as the whole cache line
+will eventually be written. This additional read from memory
+effectively halves the observed write bandwidth. In principle, the
+memset function should use appropriate write instructions to avoid
+these unnecessary reads, but this may either not be the case, or the
+hardware may not offer such write instructions.
+
+
+
 \begin{thebibliography}{9}
-
+  
+\bibitem{lmbench-usenix}{Larry McVoy, Carl Staelin, \emph{lmbench:
+    Portable tools for performance analysis}, 1996, Usenix,
+  \url{http://www.bitmover.com/lmbench/lmbench-usenix.pdf}}
+  
+\bibitem{mhz-usenix}{Carl Staelin, Larry McVoy, \emph{mhz: Anatomy of
+    a micro-benchmark}, 1998, Usenix,
+  \url{http://www.bitmover.com/lmbench/mhz-usenix.pdf}}
+  
+\bibitem{lmbench}{\emph{LMbench -- Tools for Performance Analysis},
+  \url{http://www.bitmover.com/lmbench}}
+  
 \end{thebibliography}
 
 % Do not delete next line

Directory: /MemSpeed/src/
=========================

File [modified]: memspeed.cc
Delta lines: +64 -2
===================================================================
--- MemSpeed/src/memspeed.cc	2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/src/memspeed.cc	2013-06-22 04:54:25 UTC (rev 89)
@@ -14,11 +14,14 @@
 
 
 
+// OpenMP is only used to provide an easy-to-use low-latency timer
+
 #ifdef _OPENMP
 #  include <omp.h>
 #else
 #  include <sys/time.h>
 namespace {
+  // Fall back to gettimeofday if OpenMP is not available
   double omp_get_wtime()
   {
     timeval tv;
@@ -32,13 +35,17 @@
 
 namespace {
   
+  // Information about the CPU, as determined by this thorn
   struct cpu_info_t {
     double flop_speed;
     double iop_speed;
   };
   cpu_info_t cpu_info;
   
+  // Information about each cache level and the memory, as obtained
+  // from hwloc and determined by this thorn
   struct cache_info_t {
+    // Information obtained from hwloc
     string    name;
     int       type;
     ptrdiff_t size;
@@ -46,6 +53,7 @@
     int       stride;
     int       num_pus;
     
+    // Information determined by this thorn
     double read_latency;
     double read_bandwidth;
     double write_latency;
@@ -55,6 +63,7 @@
   
   
   
+  // Query hwloc about each cache level and the memory
   void load_cache_info()
   {
     const int num_cache_levels = GetCacheInfo1(0, 0, 0, 0, 0, 0, 0);
@@ -88,17 +97,27 @@
     if (verbose) {
       printf("\n");
     }
-    double min_elapsed = 1.0;
+    // Run the benchmark for at least this long
+    double min_elapsed = 1.0;   // seconds
+    // Run the benchmark initially for this many iterations
     ptrdiff_t max_count = 1000000;
-    double elapsed = 0.0;
+    // Duration of the last benchmark run
+    double elapsed = 0.0;       // seconds
+    // Loop until the run time of the benchmark is longer than the
+    // minimum run time
     for (;;) {
       if (verbose) {
         printf("    iterations=%td...", max_count);
         fflush(stdout);
       }
+      // Start timing
       const double t0 = omp_get_wtime();
       CCTK_REAL_VEC s0, s1, s2, s3, s4, s5, s6, s7;
       s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = vec_set1(1.0);
+      // Explicitly unrolled loop, performing multiply-add operations.
+      // See latex file for a more detailed description. Note: The
+      // constants have been chosen so that the results don't over- or
+      // underflow
       for (ptrdiff_t count=0; count<max_count; ++count) {
         s0 = kmadd(vec_set1(1.1), s0, vec_set1(-0.1));
         s1 = kmadd(vec_set1(1.1), s1, vec_set1(-0.1));
@@ -109,17 +128,27 @@
         s6 = kmadd(vec_set1(1.1), s6, vec_set1(-0.1));
         s7 = kmadd(vec_set1(1.1), s7, vec_set1(-0.1));
       }
+      // Store sum of results into a volatile variable, so that the
+      // compiler does not optimize away the calculation
       volatile CCTK_REAL_VEC use_s CCTK_ATTRIBUTE_UNUSED =
         kadd(kadd(kadd(s0, s1), kadd(s2, s3)),
              kadd(kadd(s4, s5), kadd(s6, s7)));
+      // End timing
       const double t1 = omp_get_wtime();
       elapsed = t1 - t0;
       if (verbose) {
         printf(" time=%g sec\n", elapsed);
       }
+      // Are we done?
       if (elapsed >= min_elapsed) break;
+      // Estimate how many iterations we need. Run 1.1 times longer to
+      // ensure we don't fall short by a tiny bit. Increase the number
+      // of iterations by a factor of at least 2 and at most 10.
       max_count *= llrint(max(2.0, min(10.0, 1.1 * min_elapsed / elapsed)));
     }
+    // Calculate CPU performance: max_count is the number of
+    // iterations, 8 is the unroll factor, CCTK_REAL_VEC_SIZE is the
+    // vector size, and there are 2 operations in each kmadd.
     cpu_info.flop_speed = max_count * 8 * CCTK_REAL_VEC_SIZE * 2 / elapsed;
     if (verbose) {
       printf("    result:");
@@ -137,6 +166,8 @@
     if (verbose) {
       printf("\n");
     }
+    // The basic benchmark harness is the same as above, no comments
+    // here
     double min_elapsed = 1.0;
     ptrdiff_t max_count = 1000000;
     double elapsed = 0.0;
@@ -149,6 +180,8 @@
       vector<CCTK_REAL> base(1000);
       ptrdiff_t s0, s1, s2, s3, s4, s5, s6, s7;
       s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = 0; 
+      // Explicitly unrolled loop, performing integer multiply and add
+      // operations. See latex file for a more detailed description.
       for (ptrdiff_t count=0; count<max_count; ++count) {
         s0 = ptrdiff_t(&base[  s0]);
         s1 = ptrdiff_t(&base[2*s1]);
@@ -178,9 +211,15 @@
   
   
   
+  // Determine the size (in bytes) for a particular cache level or
+  // memory type. skipsize returns the number of bytes to allocate but
+  // not use, so that e.g. the node-local memory can be
+  // skipped. size returns the number of bytes to use for the
+  // benchmark.
   void calc_sizes(int cache, ptrdiff_t& skipsize, ptrdiff_t& size)
   {
     if (cache_info[cache].type==1) {
+      // Memory
       if (cache>0 && cache_info[cache-1].type==1) {
         // Global memory, and there is also local memory
         skipsize = cache_info[cache-1].size;
@@ -205,7 +244,9 @@
     DECLARE_CCTK_PARAMETERS;
     
     printf("  Read latency:\n");
+    // Loop over all cache levels and memory types
     for (int cache=0; cache<int(cache_info.size()); ++cache) {
+      // Determine size
       ptrdiff_t skipsize, size;
       calc_sizes(cache, skipsize, size);
       assert(size>0);
@@ -218,9 +259,12 @@
       } else {
         printf("    %s read latency:", cache_info[cache].name.c_str());
       }
+      // Allocate skipped memory, filling it with 1 so that it is
+      // actually allocated by the operating system
       vector<char> skiparray(skipsize, 1);
       const ptrdiff_t offset = 0xa1d2d5ff; // a random number
       const ptrdiff_t nmax = size / sizeof(void*);
+      // Linked list (see latex)
       vector<void*> array(nmax);
       {
         ptrdiff_t i = 0;
@@ -233,6 +277,8 @@
         }
         assert(i == 0);
       }
+      // The basic benchmark harness is the same as above, no comments
+      // here
       double min_elapsed = 1.0;
       ptrdiff_t max_count = 1000;
       double elapsed = 0.0;
@@ -243,6 +289,7 @@
         }
         const double t0 = omp_get_wtime();
         void* ptr = &array[0];
+        // Chase linked list (see latex)
         for (ptrdiff_t count=0; count<max_count; ++count) {
 #define REPEAT10(x) x x x x x x x x x x
           REPEAT10(REPEAT10(ptr = *(void**)ptr;));
@@ -272,6 +319,8 @@
     DECLARE_CCTK_PARAMETERS;
     
     printf("  Read bandwidth:\n");
+    // The basic benchmark harness is the same as above, no comments
+    // here
     for (int cache=0; cache<int(cache_info.size()); ++cache) {
       ptrdiff_t skipsize, size;
       calc_sizes(cache, skipsize, size);
@@ -285,7 +334,9 @@
       }
       vector<char> skiparray(skipsize, 1);
       const ptrdiff_t nmax = size / sizeof(CCTK_REAL);
+      // Allocate array, set all elements to 1.0
       vector<CCTK_REAL> raw_array(nmax + CCTK_REAL_VEC_SIZE-1, 1.0);
+      // Align array
       CCTK_REAL* restrict array = &raw_array[CCTK_REAL_VEC_SIZE-1];
       array = (CCTK_REAL*)(ptrdiff_t(array) & -sizeof(CCTK_REAL_VEC));
       double min_elapsed = 1.0;
@@ -301,6 +352,8 @@
           CCTK_REAL_VEC s0, s1, s2, s3, s4, s5, s6, s7;
           s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = vec_set1(0.0);
           const ptrdiff_t dn = CCTK_REAL_VEC_SIZE;
+          // Access memory with unit stride, consuming data via
+          // multiply and add operations (see latex)
           for (ptrdiff_t n=0; n<nmax;) {
             s0 = kmadd(vec_load(array[n]), s0, vec_load(array[n+dn]));
             n += 2*dn;
@@ -348,11 +401,15 @@
     DECLARE_CCTK_PARAMETERS;
     
     printf("  Write latency:\n");
+    // The basic benchmark harness is the same as above, no comments
+    // here
     for (int cache=0; cache<int(cache_info.size()); ++cache) {
       ptrdiff_t skipsize, size;
       calc_sizes(cache, skipsize, size);
       assert(size>0);
+      // Round down size to next power of two
       size = ptrdiff_t(1) << ilogb(double(size));
+      // Define a mask for efficient modulo operations
       const ptrdiff_t size_mask = size - 1;
       const ptrdiff_t offset = 0xa1d2d5ff; // a random number
       assert(size>0);
@@ -376,6 +433,8 @@
         }
         const double t0 = omp_get_wtime();
         ptrdiff_t n = 0;
+        // March through the array with large, pseudo-random steps
+        // (see latex)
         for (ptrdiff_t count=0; count<max_count; ++count) {
           array[n & size_mask] = 2;
           n += offset;
@@ -417,6 +476,8 @@
     DECLARE_CCTK_PARAMETERS;
     
     printf("  Write bandwidth:\n");
+    // The basic benchmark harness is the same as above, no comments
+    // here
     for (int cache=0; cache<int(cache_info.size()); ++cache) {
       ptrdiff_t skipsize, size;
       calc_sizes(cache, skipsize, size);
@@ -439,6 +500,7 @@
           fflush(stdout);
         }
         const double t0 = omp_get_wtime();
+        // Use memset for writing (see latex)
         for (ptrdiff_t count=0; count<max_count; ++count) {
           memset(&array[0], count % 256, size);
           volatile char use_array CCTK_ATTRIBUTE_UNUSED = array[count % size];


