How does the CPU cache work (the whole description of the problem inside)?


Recently, I became interested in data-oriented design. I have read some articles and publications on the subject so far. I understand, in general and abstract terms, how the cache works and how it is built (L1, L2, L3), what a cache line is, what N-way associativity means, why the cache is so efficient, and so on. Unfortunately, I understand all this only in theory; when I look at examples, I cannot make much sense of them. In addition, I can't combine it all into one picture, hence my question.

I came across an example on the internet (http://igoro.com/archive/gallery-of-processor-cache-effects/) (Example 3 -> code below).

My question:

I don't understand why the performance decreases as the size of the array increases (see the image at the link).

After all, a cache line is 64 bytes (I read somewhere that this is the size), so it should always fit in L1. I do not understand the relationship between the data size and a given cache level, because a cache line is 64 bytes plus an additional 256 bytes from the prefetcher (I read somewhere that this is the size), so everything should fit in L1. (If a cache line is occupied, we can evict it and bring in the new line we need.)

int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++; // (x & lengthMod) == (x % arr.Length) because arr.Length is a power of two
}

[Image: performance results]

You can see distinct drops after 32kB and 4MB – the sizes of L1 and L2 caches on my machine.

In addition: I don't understand how loading data into the cache works. Assume the needed data is not in any cache level and all the caches are empty.

Will the data be loaded first into L3, then to L2 and to L1 (if they fit)? Or just up to L1?

I guess the answer is not clear and depends on the CPU and cache design but if anyone can describe it in some general way I would be very grateful.

Answer by wxz:

Your first question:

Why [does] the performance decrease with the increased size of the array?

The author states that his cache sizes are "a 32kB L1 data cache, a 32kB L1 instruction cache, and a 4MB L2 data cache." So even if the individual cache lines are 64 bytes like you said, each cache has multiple cache lines.

That being said, there's no reason why the array "should always fit in L1." The author reruns the code snippet with a different-sized array each time (the array size declaration is not shown in the code), with sizes ranging from 1 KB to 1 GB.

Theoretically*, based on the author's cache sizes, an array smaller than 32KB will fit into both his L1 and L2 caches and he'll only have cold misses, when the whole array needs to be loaded into the caches for the first time.

Once the array is bigger than 32KB, he'll have the original cold misses plus extra L1 misses from data having to be evicted and reloaded into the L1 cache.

Any larger than 4MB, and now he'll have additional L2 misses on top of that.

For your other question:

Will the data be loaded first into L3, then to L2 and to L1 (if they fit)? Or just up to L1?

This is microarchitecture dependent. See here.


* Other processes will also be competing for cache lines while this array program is going on, increasing the actual cache misses.