I want to get the cache hit rate for a specific function (foo) of a C/C++ program running on a Linux machine. I am using gcc with no compiler optimization. With perf I can get hit rates for the entire program using the following command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out
But I am interested in the kernel foo only.
Is there a way to get hit rates only for foo using perf or any other tool?
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#define NI 192
#define NJ NI
#ifndef DATA_TYPE
#define DATA_TYPE float
#endif
static
void* xmalloc(size_t num)
{
  void * nnew = NULL;
  int ret = posix_memalign (&nnew, 32, num);
  if(!nnew || ret)
  {
    fprintf(stderr, "Can not allocate Memory\n");
    exit(1);
  }
  return nnew;
}
void* alloc_data(unsigned long long int n, int elt_size)
{
  size_t val = n;
  val *= elt_size;
  void* ret = xmalloc(val);
  return ret;
}
/* Array initialization. */
static
void init_array(int ni, int nj,
                DATA_TYPE A[NI][NJ],
                DATA_TYPE R[NJ][NJ],
                DATA_TYPE Q[NI][NJ])
{
  int i, j;
  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      A[i][j] = ((DATA_TYPE) i*j) / ni;
      Q[i][j] = ((DATA_TYPE) i*(j+1)) / nj;
    }
  for (i = 0; i < nj; i++)
    for (j = 0; j < nj; j++)
      R[i][j] = ((DATA_TYPE) i*(j+2)) / nj;
}
/* Main computational kernel. */
static
void foo(int ni, int nj,
         DATA_TYPE A[NI][NJ],
         DATA_TYPE R[NJ][NJ],
         DATA_TYPE Q[NI][NJ])
{
  int i, j, k;
  DATA_TYPE nrm;
  for (k = 0; k < nj; k++)
  {
    nrm = 0;
    for (i = 0; i < ni; i++)
      nrm += A[i][k] * A[i][k];
    R[k][k] = sqrt(nrm);
    for (i = 0; i < ni; i++)
      Q[i][k] = A[i][k] / R[k][k];
    for (j = k + 1; j < nj; j++)
    {
      R[k][j] = 0;
      for (i = 0; i < ni; i++)
        R[k][j] += Q[i][k] * A[i][j];
      for (i = 0; i < ni; i++)
        A[i][j] = A[i][j] - Q[i][k] * R[k][j];
    }
  }
}
int main(int argc, char** argv)
{
  /* Retrieve problem size. */
  int ni = NI;
  int nj = NJ;
  /* Variable declaration/allocation. */
  DATA_TYPE (*A)[NI][NJ];
  DATA_TYPE (*R)[NI][NJ];
  DATA_TYPE (*Q)[NI][NJ];
  A = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  R = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  Q = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  /* Initialize array(s). */
  init_array (ni, nj,
              (*A),
              (*R),
              (*Q));
  /* Run kernel. */
  foo (ni, nj, *A, *R, *Q);
  /* Be clean. */
  free((void *)A);
  free((void *)R);
  free((void *)Q);
  return 0;
}
Output of lscpu command is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
Stepping: 2
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-15
You might be interested in gprof(1). It won't measure cache hit rate (this makes no sense, since some calls to foo could be inlined once GCC is invoked with optimizations enabled).
You could use libbacktrace in your code. See also time(7) and signal(7).
You might compile your code with
gcc -Wall -Wextra -O2 -g -pg
then use libbacktrace (like GCC or RefPerSys are doing) inside it, and later gprof(1) with gdb(1).
With effort (so read Advanced Linux Programming, then syscalls(2) and signal-safety(7)), you might use setitimer(2) with sigaction(2) and/or profil(3).
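The setitimer(2) plus sigaction(2) idea can be sketched roughly as follows. This is only an illustration under stated assumptions: burn_cpu is a hypothetical stand-in for foo, the names in_region, region_ticks and sample_region are made up here, and the tick count estimates CPU time spent in the region, not a cache hit rate. The handler touches nothing but sig_atomic_t variables, which keeps it within the rules described in signal-safety(7).

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Set to 1 while the region of interest is running (hypothetical name). */
static volatile sig_atomic_t in_region = 0;
/* Number of SIGPROF ticks delivered while in_region was set. */
static volatile sig_atomic_t region_ticks = 0;

/* Async-signal-safe handler: it only reads and writes sig_atomic_t. */
static void on_sigprof(int signo)
{
    (void)signo;
    if (in_region)
        region_ticks++;
}

/* Hypothetical stand-in for foo: burns CPU time in a simple loop. */
void burn_cpu(long iterations)
{
    volatile double acc = 0.0;
    for (long i = 0; i < iterations; i++)
        acc += (double)i;
}

/* Arm a profiling timer, run the region, disarm the timer.
   interval_usec must be between 1 and 999999.
   Returns the number of ticks seen inside the region, or -1 on error. */
long sample_region(long interval_usec, long iterations)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    struct itimerval tv;
    memset(&tv, 0, sizeof tv);
    tv.it_interval.tv_usec = interval_usec;
    tv.it_value.tv_usec = interval_usec;
    if (setitimer(ITIMER_PROF, &tv, NULL) != 0)
        return -1;

    region_ticks = 0;
    in_region = 1;
    burn_cpu(iterations);
    in_region = 0;

    /* A zeroed itimerval disarms the timer. */
    memset(&tv, 0, sizeof tv);
    setitimer(ITIMER_PROF, &tv, NULL);
    return (long)region_ticks;
}
```

Calling sample_region(1000, 200000000L) from main requests a tick roughly every millisecond of consumed CPU time; the kernel's timer granularity may make the real interval coarser, so the count is an approximation.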
Consider also generating some C code (e.g. using GPP and/or GNU bison in your own C code generator) and see this answer. J.Pitrat's book Artificial Beings: the Conscience of a Conscious Machine (ISBN-13: 978-1848211018) could be inspirational. You may want to generate some C code for extra instrumentation.
You might generate some code in a plugin (e.g. with libgccjit or GNU lightning...) at runtime, then dlopen(3) and dlsym(3) it. Read more about partial evaluation and see my manydl.c example, and more seriously the source code of Ocaml or of SBCL.
You could write your own GCC plugin to automatically generate some measurements, in a more clever way than what the -pg option of GCC is doing. Your GCC plugin would transform (at the GIMPLE level) most function calls into something more complex that does some benchmarking (this is how -pg works inside GCC, and you might study the source code of GCC). Try compiling your foo.c as
gcc -Wall -Wextra -O2 -pg -S -fverbose-asm foo.c
and look into the generated foo.s, perhaps adding more optimizations, or static analysis or instrumentation options.
You could be interested in recent papers of ACM SIGPLAN.
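For the dlopen(3) and dlsym(3) step mentioned above, a minimal sketch might look like this. It does not generate any code: as an assumption made only for illustration, the already existing libm.so.6 plays the role of the generated plugin, and call_from_shared_object and unary_fn are hypothetical names. A real setup would pass the path of the freshly generated shared object instead.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Signature assumed for the loaded function (hypothetical typedef). */
typedef double (*unary_fn)(double);

/* Load lib_path, look up symbol, call it with arg.
   Returns 0 and stores the value in *result on success, -1 on failure. */
int call_from_shared_object(const char *lib_path, const char *symbol,
                            double arg, double *result)
{
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    dlerror(); /* clear any stale error before calling dlsym */
    void *sym = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || sym == NULL) {
        fprintf(stderr, "dlsym: %s\n", err ? err : "symbol is NULL");
        dlclose(handle);
        return -1;
    }

    /* Conversion idiom shown in the dlopen(3) manual page example. */
    unary_fn fn;
    *(void **)(&fn) = sym;

    *result = fn(arg);
    dlclose(handle);
    return 0;
}
```

On older glibc versions the program has to be linked with -ldl; the library name libm.so.6 is specific to glibc-based Linux systems.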
Finally, benchmarking a C program compiled without optimizations makes no sense. Consider instead compiling and linking your program with at least
gcc -flto -O2 -Wall
Within your foo, you might cleverly use clock_gettime(2) to measure CPU time.
And if performance is very important and you are allowed to spend weeks of work to improve it, you might consider using OpenCL (or perhaps CUDA) to compute your kernel on a powerful GPGPU. Of course, you need dedicated hardware. Otherwise, consider using OpenMP or OpenACC (or perhaps MPI). Some recent GCC compilers (at least GCC 10 in October 2020) could support these. Of course, read the documentation on Invoking GCC.
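A minimal sketch of the clock_gettime(2) suggestion, assuming CLOCK_THREAD_CPUTIME_ID is the clock of interest. The function busy_kernel is a hypothetical stand-in for foo, and thread_cpu_seconds and time_kernel are names invented for this example. Note that this measures CPU time around the call, not a cache hit rate.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* CPU time consumed so far by the calling thread, in seconds.
   Returns -1.0 if the clock cannot be read. */
double thread_cpu_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for foo: some floating-point work. */
double busy_kernel(long n)
{
    volatile double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += (double)i * 0.5;
    return acc;
}

/* Run the kernel once and return the CPU seconds it consumed.
   The kernel's result is stored in *result so the work is not discarded. */
double time_kernel(long n, double *result)
{
    double t0 = thread_cpu_seconds();
    *result = busy_kernel(n);
    double t1 = thread_cpu_seconds();
    return t1 - t0;
}
```

In the program from the question, the two thread_cpu_seconds calls would go directly before and after the call to foo in main. For a kernel this small, repeating the call many times and averaging would probably give a more stable figure.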