[Commits] [svn:einsteintoolkit] incoming/MemSpeed/ (Rev. 89)
schnetter at cct.lsu.edu
Fri Jun 21 23:54:26 CDT 2013
User: eschnett
Date: 2013/06/21 11:54 PM
Modified:
/MemSpeed/
README, configuration.ccl, interface.ccl, schedule.ccl
/MemSpeed/doc/
documentation.tex
/MemSpeed/src/
memspeed.cc
Log:
Add documentation
File Changes:
Directory: /MemSpeed/
=====================
File [modified]: README
Delta lines: +6 -4
===================================================================
--- MemSpeed/README 2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/README 2013-06-22 04:54:25 UTC (rev 89)
@@ -1,9 +1,11 @@
Cactus Code Thorn MemSpeed
-Author(s) : Erik Schnetter <schnetter at gmail.com>
-Maintainer(s): Erik Schnetter <schnetter at gmail.com>
-Licence : n/a
+Author(s) : Erik Schnetter <eschnetter at perimeterinstitute.ca>
+Maintainer(s): Erik Schnetter <eschnetter at perimeterinstitute.ca>
+Licence : LGPL
--------------------------------------------------------------------------
1. Purpose
-Determine the latencies and bandwidths of caches and main memory.
+Determine the speed of the CPU, as well as latencies and bandwidths of
+caches and main memory. These provide ideal, yet real-world, values
+against which the performance of other routines can be compared.
File [modified]: configuration.ccl
Delta lines: +1 -0
===================================================================
--- MemSpeed/configuration.ccl 2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/configuration.ccl 2013-06-22 04:54:25 UTC (rev 89)
@@ -1,3 +1,4 @@
# Configuration definitions for thorn MemSpeed
+# Floating point benchmarks use explicit vectorization
REQUIRES Vectors
File [modified]: interface.ccl
Delta lines: +1 -0
===================================================================
--- MemSpeed/interface.ccl 2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/interface.ccl 2013-06-22 04:54:25 UTC (rev 89)
@@ -6,6 +6,7 @@
+# Obtain information about caches and memory sizes from thorn hwloc
CCTK_INT FUNCTION GetCacheInfo1 \
(CCTK_POINTER_TO_CONST ARRAY OUT names, \
CCTK_INT ARRAY OUT types, \
File [modified]: schedule.ccl
Delta lines: +1 -1
===================================================================
--- MemSpeed/schedule.ccl 2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/schedule.ccl 2013-06-22 04:54:25 UTC (rev 89)
@@ -4,4 +4,4 @@
{
LANG: C
OPTIONS: meta
-} "Measure memory and cache speeds"
+} "Measure CPU, memory, cache speeds"
Directory: /MemSpeed/doc/
=========================
File [modified]: documentation.tex
Delta lines: +349 -26
===================================================================
--- MemSpeed/doc/documentation.tex 2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/doc/documentation.tex 2013-06-22 04:54:25 UTC (rev 89)
@@ -79,63 +79,386 @@
\begin{document}
% The author of the documentation
-\author{Erik Schnetter \textless schnetter at gmail.com\textgreater}
+\author{Erik Schnetter \textless eschnetter at perimeterinstitute.ca\textgreater}
% The title of the document (not necessarily the name of the Thorn)
\title{MemSpeed}
% the date your document was last changed, if your document is in CVS,
-% please use:
-% \date{$ $Date: 2004-01-07 14:12:39 -0600 (Wed, 07 Jan 2004) $ $}
-\date{June 17 2013}
+\date{June 22, 2013}
\maketitle
% Do not delete next line
% START CACTUS THORNGUIDE
-% Add all definitions used in this documentation here
-% \def\mydef etc
-
-% Add an abstract for this thorn's documentation
\begin{abstract}
-
+ Determine the speed of the CPU, as well as latencies and bandwidths
+ of caches and main memory. These provide ideal, yet real-world
+ values against which the performance of other routines can be
+ compared.
\end{abstract}
-% The following sections are suggestive only.
-% Remove them or add your own.
+\section{Measuring Maximum Speeds}
-\section{Introduction}
+This thorn measures the maximum practical speed that can be attained
+on a particular system. This speed will be somewhat lower than the
+theoretical peak performance listed in a system's hardware
+description.
-\section{Physical System}
+This thorn measures
+\begin{itemize}
+\item CPU floating-point performance (GFlop/s),
+\item CPU integer performance (GIop/s),
+\item Cache/memory read latency (ns),
+\item Cache/memory read bandwidth (GByte/s),
+\item Cache/memory write latency (ns),
+\item Cache/memory write bandwidth (GByte/s).
+\end{itemize}
-\section{Numerical Implementation}
+Theoretical performance values for memory access are often quoted in
+slightly different units. For example, bandwidth is often measured in
+GT/s (Giga-Transactions per second), where a transaction transfers a
+certain number of bytes, usually a cache line (e.g. 64 bytes).
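+
+As a worked example (with illustrative numbers, not measurements): a
+memory link performing $0.1$ GT/s, where each transaction transfers
+one $64$-byte cache line, delivers $0.1 \cdot 64 = 6.4$ GByte/s.
+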
-\section{Using This Thorn}
+A detailed understanding of the results requires some knowledge of how
+CPUs, caches, and memories operate. \cite{lmbench-usenix} provides a
+good introduction to this as well as to benchmark design.
+\cite{mhz-usenix} is also a good read (by the same authors), and their
+(somewhat dated) software \emph{lmbench} is available online
+\cite{lmbench}.
-\subsection{Obtaining This Thorn}
-\subsection{Basic Usage}
-\subsection{Special Behaviour}
+\section{Algorithms}
-\subsection{Interaction With Other Thorns}
+We use the following algorithms to determine the maximum speeds. We
+believe these algorithms and their implementations are adequate for
+current architectures, but this may need to change in the future.
-\subsection{Examples}
+Each benchmark is run $N$ times, where $N$ is automatically chosen
+such that the total run time is larger than $1$ second. (If a
+benchmark finishes too quickly, then $N$ is increased and the
+benchmark is repeated.)
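+
+A minimal sketch of this self-calibrating harness, mirroring the
+logic in \texttt{src/memspeed.cc} (\texttt{run\_kernel} stands in for
+the actual benchmark kernel):
+\begin{verbatim}
+  const double min_elapsed = 1.0; // minimum total run time (seconds)
+  ptrdiff_t max_count = 1000000;  // initial iteration count
+  double elapsed;
+  for (;;) {
+    const double t0 = omp_get_wtime();
+    run_kernel(max_count);
+    elapsed = omp_get_wtime() - t0;
+    if (elapsed >= min_elapsed) break;
+    // Aim 1.1 times past the target; grow by a factor in [2,10]
+    max_count *= llrint(max(2.0, min(10.0, 1.1 * min_elapsed / elapsed)));
+  }
+\end{verbatim}
+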
-\subsection{Support and Feedback}
+\subsection{CPU floating-point performance}
+\label{sec:flop}
-\section{History}
+CPUs for HPC systems are typically tuned for dot-product-like
+operations, where multiplications and additions alternate. We measure
+the floating-point performance with the following calculation:
+\begin{verbatim}
+ for (int i=0; i<N; ++i) {
+ s := c_1 * s + c_2
+ }
+\end{verbatim}
+where $s$ is suitably initialized and $c_1$ and $c_2$ are suitably
+chosen to avoid overflow, e.g. $s=1.0$, $c_1=1.1$, $c_2=-0.1$.
-\subsection{Thorn Source Code}
+$s$ is a double precision variable. The loop over $i$ is explicitly
+unrolled $8$ times, explicitly vectorized using LSUThorns/Vectors,
+and uses fma (fused multiply-add) instructions where available. This
+should ensure that the loop runs very close to the maximum possible
+speed. As usual, each (scalar) multiplication and addition is counted
+as one Flop (floating point operation).
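+
+For illustration, a scalar (non-vectorized) sketch of the unrolled
+kernel; the actual implementation uses the \texttt{kmadd} vector
+operation from LSUThorns/Vectors:
+\begin{verbatim}
+  double s0 = 1.0, s1 = 1.0, s2 = 1.0, s3 = 1.0;
+  for (ptrdiff_t i = 0; i < N; ++i) {
+    // independent multiply-adds keep the FPU pipelines full
+    s0 = 1.1 * s0 + (-0.1);
+    s1 = 1.1 * s1 + (-0.1);
+    s2 = 1.1 * s2 + (-0.1);
+    s3 = 1.1 * s3 + (-0.1);
+    // ... unrolled 8 times in the implementation ...
+  }
+  volatile double use_s = s0 + s1 + s2 + s3; // defeat optimization
+\end{verbatim}
+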
-\subsection{Thorn Documentation}
+\subsection{CPU integer performance}
-\subsection{Acknowledgements}
+Many modern CPUs can handle integers in two different ways, treating
+integers either as data, or as pointers and array indices. For
+example, integers may be stored in two different sets of registers
+depending on their use. We are here interested in the performance of
+pointers and array indices. Most modern CPUs cannot vectorize these
+operations (some GPUs can), and we therefore do not employ
+vectorization in this benchmark.
+In general, array index calculations require addition and
+multiplication. For example, accessing the element $A(i,j)$ of a
+two-dimensional array requires calculating $i + n_i \cdot j$, where
+$n_i$ is the number of elements allocated in the $i$ direction.
+However, general integer multiplications are expensive, and are not
+necessary if the array is accessed in a loop, since a running index
+$p$ can instead be kept. Accessing neighbouring elements (e.g. in
+stencil calculations) requires only addition and multiplication with
+small constants. In the example above, assuming that $p$ is the linear
+index corresponding to $A(i,j)$, accessing $A(i+1,j)$ requires
+calculating $p+1$, and accessing $A(i,j+2)$ requires calculating $p +
+2 \cdot n_i$. We thus base our benchmark on integer additions and
+integer multiplications with small constants.
+
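+For illustration (hypothetical array \texttt{A} with \texttt{ni}
+allocated elements per row), the running-index pattern looks like:
+\begin{verbatim}
+  ptrdiff_t p = i + ni*j;      // linear index of A(i,j)
+  double right = A[p + 1];     // A(i+1,j): add a small constant
+  double up2   = A[p + 2*ni];  // A(i,j+2): multiply by a small constant
+\end{verbatim}
+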
+We measure the integer performance with the following
+calculation:
+\begin{verbatim}
+ for (int i=0; i<N; ++i) {
+ s := b + c * s
+ }
+\end{verbatim}
+where $b$ is a constant defined at run time, and $c$ is a small
+integer constant ($c = 1 \ldots 8$) known at compile time.
+
+$s$ is an integer variable of the same size as a pointer, i.e. 64 bits
+on a 64-bit system. The loop over $i$ is explicitly unrolled $8$
+times, each time with a different value for $c$. Each addition and
+multiplication is counted as one Iop (integer operation).
+
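+As an illustration taken from the implementation, the unrolled kernel
+realizes the multiply-add $b + c \cdot s$ through array address
+arithmetic:
+\begin{verbatim}
+  vector<CCTK_REAL> base(1000);
+  ptrdiff_t s0 = 0, s1 = 0;        // ... through s7
+  for (ptrdiff_t i = 0; i < N; ++i) {
+    s0 = ptrdiff_t(&base[  s0]);   // c = 1
+    s1 = ptrdiff_t(&base[2*s1]);   // c = 2
+    // ... s2 through s7 with c = 3 ... 8 ...
+  }
+\end{verbatim}
+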
+\subsection{Cache/memory read latency}
+
+Memory read access latency is measured by reading small amounts of
+data from random locations in memory. This random access pattern
+defeats caches, because a random access pattern offers no locality
+that a cache could exploit. To ensure that the read operations are
+executed sequentially, each read operation needs to depend on the
+previous one. The
+idea for the algorithm below was taken from \cite{lmbench-usenix}.
+
+To implement this, we set up a large linked list where the elements
+are randomly ordered. Traversing this linked list then measures the
+memory read latency. This is done as in the following pseudo-code:
+\begin{verbatim}
+ struct L { L* next; };
+ ... set up large circular list ...
+ L* ptr = head;
+ for (int i=0; i<N; ++i) {
+ ptr = ptr->next;
+ }
+\end{verbatim}
+
+To reduce the overhead of the for loop, we explicitly unroll the loop
+100 times.
+
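+The pseudo-code above elides the construction of the list. A sketch
+of one possible construction (an assumption on our part; the actual
+code also steps through the array with a large pseudo-random stride,
+which visits every element when the stride and the list length are
+coprime):
+\begin{verbatim}
+  const ptrdiff_t nmax = size / sizeof(void*);
+  vector<void*> array(nmax);
+  const ptrdiff_t offset = 0xa1d2d5ff; // a random number
+  ptrdiff_t i = 0;
+  do {
+    const ptrdiff_t next = (i + offset) % nmax;
+    array[i] = &array[next];  // each element points to its successor
+    i = next;
+  } while (i != 0);           // the walk returns to the start
+\end{verbatim}
+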
+\label{sec:sizes}
+We use the \emph{hwloc} library to determine the sizes of the
+available data caches, as well as the amounts of NUMA-node-local and
+global memory. We perform this benchmark once for each cache level,
+and once each for the local and global memory:
+\begin{itemize}
+\item for a cache, the list occupies 3/4 of the cache;
+\item for the local memory, the list occupies 1/2 of the memory;
+\item for the global memory, the list skips the local memory, and
+ occupies 1/4 of the remaining global memory.
+\end{itemize}
+
+To skip the local memory, we allocate an array of the size of the
+local memory. Assuming that the operating system prefers to allocate
+local memory, this should ensure that all further allocations use
+non-local memory. We do not test this assumption.
+
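+In the implementation this is a single allocation; the array is
+filled with a nonzero value so that the operating system actually
+backs it with physical pages:
+\begin{verbatim}
+  // Allocate (and touch) skipsize bytes of presumably local memory
+  vector<char> skiparray(skipsize, 1);
+\end{verbatim}
+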
+\subsection{Cache/memory read bandwidth}
+
+Memory read access bandwidth is measured by reading a large,
+contiguous amount of data from memory. This access pattern benefits
+from caches (if the amount is less than the cache size), and also
+benefits from prefetching (that may be performed either by the
+compiler or by the hardware). This thus presents an ideal case where
+memory is read as fast as possible.
+
+To ensure that data are actually read from memory, it is necessary to
+consume the data, i.e. to perform some operations on them. We assume
+that a floating-point dot-product is among the fastest operations, and
+thus use the following algorithm:
+\begin{verbatim}
+ for (int i=0; i<N; i+=2) {
+ s := m[i] * s + m[i+1]
+ }
+\end{verbatim}
+
+As in section \ref{sec:flop} above, $s$ is a double precision
+variable. $m[i]$ denotes the memory accesses. The loop over $i$ is
+explicitly unrolled $8$ times, is explicitly vectorized using
+LSUThorns/Vectors, and uses fma (fused multiply-add) instructions
+where available. This should ensure that the loop runs very close to
+the maximum possible speed.
+
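+A scalar sketch of this consuming read kernel; as in the
+implementation, the accumulator is finally stored into a
+\texttt{volatile} variable so that the compiler cannot optimize the
+reads away:
+\begin{verbatim}
+  double s = 0.0;
+  for (ptrdiff_t i = 0; i < N; i += 2) {
+    s = m[i] * s + m[i+1];    // consume the data read from memory
+  }
+  volatile double use_s = s;  // keep the result alive
+\end{verbatim}
+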
+To measure the bandwidth of each cache level as well as the local and
+global memory, the same array sizes as in section \ref{sec:sizes} are
+used.
+
+\subsection{Cache/memory write latency}
+\label{sec:write-latency}
+
+The notion of a ``write latency'' does not really make sense, as write
+operations to different memory locations do not depend on each other.
+This benchmark thus instead measures the speed at which independent
+write requests can be handled. However, since writing partial cache
+lines also requires reading them, this benchmark is also influenced
+by read performance.
+
+To measure the write latency, we use the following algorithm, writing
+a single byte to random locations in memory:
+\begin{verbatim}
+ char array[N];
+ char* ptr = ...;
+ for (int i=0; i<N; ++i) {
+ *ptr = 1;
+ ptr += ...;
+ }
+\end{verbatim}
+
+In the loop, the pointer is increased by a pseudo-random amount,
+while ensuring that it stays within the bounds of the array.
+
+To measure the bandwidth of each cache level as well as the local and
+global memory, the same array sizes as in section \ref{sec:sizes} are
+used as starting points. For efficiency reasons, these sizes are then
+rounded down to the nearest power of two.
+
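+Rounding to a power of two allows the bounds check to be a cheap
+bitwise mask; this is what the implementation does:
+\begin{verbatim}
+  size = ptrdiff_t(1) << ilogb(double(size)); // round down to 2^k
+  const ptrdiff_t size_mask = size - 1;
+  const ptrdiff_t offset = 0xa1d2d5ff;        // a random number
+  ptrdiff_t n = 0;
+  for (ptrdiff_t count = 0; count < max_count; ++count) {
+    array[n & size_mask] = 2; // write a single byte in bounds
+    n += offset;              // large pseudo-random step
+  }
+\end{verbatim}
+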
+\subsection{Cache/memory write bandwidth}
+
+Memory write access bandwidth is measured by writing a large,
+contiguous amount of data to memory, in a manner very similar to
+measuring read bandwidth. The major difference is that the written
+data do not need to be consumed by the CPU, which simplifies the
+implementation.
+
+We use \emph{memset} to write data into an array, assuming that the
+memset function is already heavily optimized.
+
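+The kernel is correspondingly simple (from the implementation; the
+\texttt{volatile} read afterwards keeps the compiler from removing
+the writes):
+\begin{verbatim}
+  for (ptrdiff_t count = 0; count < max_count; ++count) {
+    memset(&array[0], count % 256, size);
+    volatile char use_array = array[count % size];
+  }
+\end{verbatim}
+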
+To measure the bandwidth of each cache level as well as the local and
+global memory, the same array sizes as in section \ref{sec:sizes} are
+used.
+
+
+
+\section{Caveats}
+
+This benchmark should work out of the box on all systems.
+
+The only major caveat is that it does allocate more than half of the
+system's memory for its benchmarks, and this can severely degrade
+system performance if run on an interactive system (laptop or
+workstation). If run with MPI, then only the root process will run the
+benchmark.
+
+Typical memory bandwidth numbers are in the range of multiple GByte/s.
+Given today's memory sizes of many GByte, this means that this
+benchmark will run for tens of seconds. In addition to benchmarking
+memory access, the operating system also needs to allocate the memory,
+which is surprisingly slow. A typical total execution time is several
+minutes.
+
+
+
+\section{Example Results}
+
+The XSEDE system Kraken at NICS reports the following performance
+numbers (measured on June 21, 2013):
+\begin{verbatim}
+INFO (MemSpeed): Measuring CPU, cache, and memory speeds:
+ CPU floating point performance: 10.396 Gflop/sec for each PU
+ CPU integer performance: 6.23736 Giop/sec for each PU
+ Read latency:
+ D1 cache read latency: 1.15434 nsec
+ L2 cache read latency: 5.82695 nsec
+ L3 cache read latency: 29.4962 nsec
+ local memory read latency: 135.264 nsec
+ global memory read latency: 154.1 nsec
+ Read bandwidth:
+ D1 cache read bandwidth: 72.3597 GByte/sec for 1 PUs
+ L2 cache read bandwidth: 20.7431 GByte/sec for 1 PUs
+ L3 cache read bandwidth: 9.51587 GByte/sec for 6 PUs
+ local memory read bandwidth: 5.19518 GByte/sec for 6 PUs
+ global memory read bandwidth: 4.03817 GByte/sec for 12 PUs
+ Write latency:
+ D1 cache write latency: 0.24048 nsec
+ L2 cache write latency: 2.8294 nsec
+ L3 cache write latency: 9.32924 nsec
+ local memory write latency: 47.5912 nsec
+ global memory write latency: 58.1591 nsec
+ Write bandwidth:
+ D1 cache write bandwidth: 39.3172 GByte/sec for 1 PUs
+ L2 cache write bandwidth: 12.9614 GByte/sec for 1 PUs
+ L3 cache write bandwidth: 5.5553 GByte/sec for 6 PUs
+ local memory write bandwidth: 4.48227 GByte/sec for 6 PUs
+ global memory write bandwidth: 3.16998 GByte/sec for 12 PUs
+\end{verbatim}
+
+The XSEDE system Kraken at NICS also reports the following system
+configuration via thorn hwloc (reported on June 21, 2013):
+\begin{verbatim}
+INFO (hwloc): Extracting CPU/cache/memory properties:
+ There are 1 PUs per core (aka hardware SMT threads)
+ There are 1 threads per core (aka SMT threads used)
+ Cache (unknown name) has type "data" depth 1
+ size 65536 linesize 64 associativity 2 stride 32768, for 1 PUs
+ Cache (unknown name) has type "unified" depth 2
+ size 524288 linesize 64 associativity 16 stride 32768, for 1 PUs
+ Cache (unknown name) has type "unified" depth 3
+ size 6291456 linesize 64 associativity 48 stride 131072, for 6 PUs
+ Memory has type "local" depth 1
+ size 8589541376 pagesize 4096, for 6 PUs
+ Memory has type "global" depth 1
+ size 17179082752 pagesize 4096, for 12 PUs
+\end{verbatim}
+
+Kraken's CPUs identify themselves as
+\verb+6-Core AMD Opteron(tm) Processor 23 (D0)+.
+
+Let us examine and partially interpret these numbers. (While the
+particular results will be different for other systems, the general
+behaviour will often be similar.)
+
+Kraken's compute nodes run at 2.6~GHz and execute 4 Flop/cycle,
+leading to a theoretical peak performance of 10.4~GFlop/s. Our
+measured number of 10.396~GFlop/s is surprisingly close.
+
+For integer performance, we expect half of the floating point
+performance since we cannot make use of vectorization, which yields a
+factor of two on this architecture. The reported number of 6.2~GIop/s
+is somewhat larger. We assume that the compiler found some way to
+optimize the code that we did not foresee, i.e. that this benchmark is
+not optimally designed. Still, the results are close.
+
+The read latency of the D1 cache is difficult to measure exactly
+here, since it is so fast: the cycle time, and thus the natural
+measurement uncertainty, is about 0.38~ns. (We assume this could be
+measured accurately with sufficient effort, but we do not completely trust our
+benchmark algorithm.) We thus conclude that the D1 cache has a latency
+of about 1~ns or less. A similar argument holds for the D1 read
+bandwidth -- we conclude that the true bandwidth is 72~GB/s or higher.
+
+The L2 cache has a higher read latency and a lower read bandwidth (it
+is also significantly larger than the D1 cache). We now consider
+these performance numbers to be trustworthy.
+
+The L3 cache has again a slightly slower read performance than the L2
+cache. The major difference is that the L3 cache is shared between six
+cores, so that the bandwidth will be shared if several cores access it
+simultaneously.
+
+The local memory has a slightly lower read bandwidth, and a
+significantly higher read latency than the L3 cache. The global memory
+is measurably slower than the local memory, but not by a large margin.
+
+The write latencies are, as expected, lower than the read latencies
+(see section \ref{sec:write-latency}).
+
+The write bandwidths are, surprisingly, only about half as large as
+the read bandwidths. This could either be a true property of the
+system architecture, or may be caused by write-allocating cache lines.
+The latter means that, as soon as a cache line is partially written,
+the cache fills it by reading from memory or the next higher cache
+level, although this is not actually necessary as the whole cache line
+will eventually be written. This additional read from memory
+effectively halves the observed write bandwidth. In principle, the
+memset function should use appropriate write instructions to avoid
+these unnecessary reads, but this may either not be the case, or the
+hardware may not be offering such write instructions.
+
+
+
\begin{thebibliography}{9}
-
+
+\bibitem{lmbench-usenix}{Larry McVoy, Carl Staelin, \emph{lmbench:
+ Portable tools for performance analysis}, 1996, Usenix,
+ \url{http://www.bitmover.com/lmbench/lmbench-usenix.pdf}}
+
+\bibitem{mhz-usenix}{Carl Staelin, Larry McVoy, \emph{mhz: Anatomy of
+ a micro-benchmark}, 1998, Usenix,
+ \url{http://www.bitmover.com/lmbench/mhz-usenix.pdf}}
+
+\bibitem{lmbench}{\emph{LMbench -- Tools for Performance Analysis},
+ \url{http://www.bitmover.com/lmbench}}
+
\end{thebibliography}
% Do not delete next line
Directory: /MemSpeed/src/
=========================
File [modified]: memspeed.cc
Delta lines: +64 -2
===================================================================
--- MemSpeed/src/memspeed.cc 2013-06-22 01:07:31 UTC (rev 88)
+++ MemSpeed/src/memspeed.cc 2013-06-22 04:54:25 UTC (rev 89)
@@ -14,11 +14,14 @@
+// OpenMP is only used to provide an easy-to-use low-latency timer
+
#ifdef _OPENMP
# include <omp.h>
#else
# include <sys/time.h>
namespace {
+ // Fall back to gettimeofday if OpenMP is not available
double omp_get_wtime()
{
timeval tv;
@@ -32,13 +35,17 @@
namespace {
+ // Information about the CPU, as determined by this thorn
struct cpu_info_t {
double flop_speed;
double iop_speed;
};
cpu_info_t cpu_info;
+ // Information about each cache level and the memory, as obtained
+ // from hwloc and determined by this routine
struct cache_info_t {
+ // Information obtained from hwloc
string name;
int type;
ptrdiff_t size;
@@ -46,6 +53,7 @@
int stride;
int num_pus;
+ // Information determined by this thorn
double read_latency;
double read_bandwidth;
double write_latency;
@@ -55,6 +63,7 @@
+ // Query hwloc about each cache level and the memory
void load_cache_info()
{
const int num_cache_levels = GetCacheInfo1(0, 0, 0, 0, 0, 0, 0);
@@ -88,17 +97,27 @@
if (verbose) {
printf("\n");
}
- double min_elapsed = 1.0;
+ // Run the benchmark for at least this long
+ double min_elapsed = 1.0; // seconds
+ // Run the benchmark initially for this many iterations
ptrdiff_t max_count = 1000000;
- double elapsed = 0.0;
+ // The last benchmark run took that long
+ double elapsed = 0.0; // seconds
+ // Loop until the run time of the benchmark is longer than the
+ // minimum run time
for (;;) {
if (verbose) {
printf(" iterations=%td...", max_count);
fflush(stdout);
}
+ // Start timing
const double t0 = omp_get_wtime();
CCTK_REAL_VEC s0, s1, s2, s3, s4, s5, s6, s7;
s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = vec_set1(1.0);
+ // Explicitly unrolled loop, performing multiply-add operations.
+ // See latex file for a more detailed description. Note: The
+ // constants have been chosen so that the results don't over- or
+ // underflow
for (ptrdiff_t count=0; count<max_count; ++count) {
s0 = kmadd(vec_set1(1.1), s0, vec_set1(-0.1));
s1 = kmadd(vec_set1(1.1), s1, vec_set1(-0.1));
@@ -109,17 +128,27 @@
s6 = kmadd(vec_set1(1.1), s6, vec_set1(-0.1));
s7 = kmadd(vec_set1(1.1), s7, vec_set1(-0.1));
}
+ // Store sum of results into a volatile variable, so that the
+ // compiler does not optimize away the calculation
volatile CCTK_REAL_VEC use_s CCTK_ATTRIBUTE_UNUSED =
kadd(kadd(kadd(s0, s1), kadd(s2, s3)),
kadd(kadd(s4, s5), kadd(s6, s7)));
+ // End timing
const double t1 = omp_get_wtime();
elapsed = t1 - t0;
if (verbose) {
printf(" time=%g sec\n", elapsed);
}
+ // Are we done?
if (elapsed >= min_elapsed) break;
+ // Estimate how many iterations we need. Run 1.1 times longer to
+ // ensure we don't fall short by a tiny bit. Increase the number
+ // of iterations by a factor of at least 2, at most 10.
max_count *= llrint(max(2.0, min(10.0, 1.1 * min_elapsed / elapsed)));
}
+ // Calculate CPU performance: max_count is the number of
+ // iterations, 8 is the unroll factor, CCTK_REAL_VEC_SIZE is the
+ // vector size, and there are 2 operations in each kmadd.
cpu_info.flop_speed = max_count * 8 * CCTK_REAL_VEC_SIZE * 2 / elapsed;
if (verbose) {
printf(" result:");
@@ -137,6 +166,8 @@
if (verbose) {
printf("\n");
}
+ // The basic benchmark harness is the same as above, no comments
+ // here
double min_elapsed = 1.0;
ptrdiff_t max_count = 1000000;
double elapsed = 0.0;
@@ -149,6 +180,8 @@
vector<CCTK_REAL> base(1000);
ptrdiff_t s0, s1, s2, s3, s4, s5, s6, s7;
s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = 0;
+ // Explicitly unrolled loop, performing integer multiply and add
+ // operations. See latex file for a more detailed description.
for (ptrdiff_t count=0; count<max_count; ++count) {
s0 = ptrdiff_t(&base[ s0]);
s1 = ptrdiff_t(&base[2*s1]);
@@ -178,9 +211,15 @@
+ // Determine the size (in bytes) for a particular cache level or
+ // memory type. skipsize returns the number of bytes to allocate
+ // but not use, so that e.g. the node-local memory can be
+ // skipped. size returns the number of bytes to use for the
+ // benchmark.
void calc_sizes(int cache, ptrdiff_t& skipsize, ptrdiff_t& size)
{
if (cache_info[cache].type==1) {
+ // Memory
if (cache>0 && cache_info[cache-1].type==1) {
// Global memory, and there is also local memory
skipsize = cache_info[cache-1].size;
@@ -205,7 +244,9 @@
DECLARE_CCTK_PARAMETERS;
printf(" Read latency:\n");
+ // Loop over all cache levels and memory types
for (int cache=0; cache<int(cache_info.size()); ++cache) {
+ // Determine size
ptrdiff_t skipsize, size;
calc_sizes(cache, skipsize, size);
assert(size>0);
@@ -218,9 +259,12 @@
} else {
printf(" %s read latency:", cache_info[cache].name.c_str());
}
+ // Allocate skipped memory, filling it with 1 so that it is
+ // actually allocated by the operating system
vector<char> skiparray(skipsize, 1);
const ptrdiff_t offset = 0xa1d2d5ff; // a random number
const ptrdiff_t nmax = size / sizeof(void*);
+ // Linked list (see latex)
vector<void*> array(nmax);
{
ptrdiff_t i = 0;
@@ -233,6 +277,8 @@
}
assert(i == 0);
}
+ // The basic benchmark harness is the same as above, no comments
+ // here
double min_elapsed = 1.0;
ptrdiff_t max_count = 1000;
double elapsed = 0.0;
@@ -243,6 +289,7 @@
}
const double t0 = omp_get_wtime();
void* ptr = &array[0];
+ // Chase linked list (see latex)
for (ptrdiff_t count=0; count<max_count; ++count) {
#define REPEAT10(x) x x x x x x x x x x
REPEAT10(REPEAT10(ptr = *(void**)ptr;));
@@ -272,6 +319,8 @@
DECLARE_CCTK_PARAMETERS;
printf(" Read bandwidth:\n");
+ // The basic benchmark harness is the same as above, no comments
+ // here
for (int cache=0; cache<int(cache_info.size()); ++cache) {
ptrdiff_t skipsize, size;
calc_sizes(cache, skipsize, size);
@@ -285,7 +334,9 @@
}
vector<char> skiparray(skipsize, 1);
const ptrdiff_t nmax = size / sizeof(CCTK_REAL);
+ // Allocate array, set all elements to 1.0
vector<CCTK_REAL> raw_array(nmax + CCTK_REAL_VEC_SIZE-1, 1.0);
+ // Align array
CCTK_REAL* restrict array = &raw_array[CCTK_REAL_VEC_SIZE-1];
array = (CCTK_REAL*)(ptrdiff_t(array) & -sizeof(CCTK_REAL_VEC));
double min_elapsed = 1.0;
@@ -301,6 +352,8 @@
CCTK_REAL_VEC s0, s1, s2, s3, s4, s5, s6, s7;
s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = vec_set1(0.0);
const ptrdiff_t dn = CCTK_REAL_VEC_SIZE;
+ // Access memory with unit stride, consuming data via
+ // multiply and add operations (see latex)
for (ptrdiff_t n=0; n<nmax;) {
s0 = kmadd(vec_load(array[n]), s0, vec_load(array[n+dn]));
n += 2*dn;
@@ -348,11 +401,15 @@
DECLARE_CCTK_PARAMETERS;
printf(" Write latency:\n");
+ // The basic benchmark harness is the same as above, no comments
+ // here
for (int cache=0; cache<int(cache_info.size()); ++cache) {
ptrdiff_t skipsize, size;
calc_sizes(cache, skipsize, size);
assert(size>0);
+ // Round down size to next power of two
size = ptrdiff_t(1) << ilogb(double(size));
+ // Define a mask for efficient modulo operations
const ptrdiff_t size_mask = size - 1;
const ptrdiff_t offset = 0xa1d2d5ff; // a random number
assert(size>0);
@@ -376,6 +433,8 @@
}
const double t0 = omp_get_wtime();
ptrdiff_t n = 0;
+ // March through the array with large, pseudo-random steps
+ // (see latex)
for (ptrdiff_t count=0; count<max_count; ++count) {
array[n & size_mask] = 2;
n += offset;
@@ -417,6 +476,8 @@
DECLARE_CCTK_PARAMETERS;
printf(" Write bandwidth:\n");
+ // The basic benchmark harness is the same as above, no comments
+ // here
for (int cache=0; cache<int(cache_info.size()); ++cache) {
ptrdiff_t skipsize, size;
calc_sizes(cache, skipsize, size);
@@ -439,6 +500,7 @@
fflush(stdout);
}
const double t0 = omp_get_wtime();
+ // Use memset for writing (see latex)
for (ptrdiff_t count=0; count<max_count; ++count) {
memset(&array[0], count % 256, size);
volatile char use_array CCTK_ATTRIBUTE_UNUSED = array[count % size];