Memory Hierarchy and Cache

CS 441/641 Lecture, Dr. Lawlor

DRAM Hardware

DRAM chips are 2D grids of little FET-and-capacitor cells.  Each cell stores a 1 or 0 as charge on the capacitor; the FET is normally off, insulating the capacitor, and only turns on for a read or a write.  The evolutionary history of DRAM cells shown in Figure 5 here is quite fascinating: over time, as cells have gotten denser they have actually been simplified (fewer parts to fabricate), and are slowly pushing into 3D.

The capacitor's charge does slowly bleed off over a timescale of milliseconds; internally DRAM chips need to be "refreshed" (read off and re-written) every millisecond or two, which special dedicated circuitry performs.

There's more detail than you'd ever want to know about DRAM chip interfacing (RAS/CAS, EDO, FPM, DDR, etc.) at Ars Technica.  Modern memory uses a synchronous command bus, but essentially the same wires.

Charged DRAM cells are slightly light sensitive, likely due to photoelectric electrons bleeding off the capacitor's charge.  This means you can actually build a camera using a RAM chip as a sensor (this was much easier back in the day, when chips were often packaged in a little metal "can" instead of being semi-permanently cast into plastic or ceramic).  The downside is the sensors have a native 1-bit resolution (real black-and-white), so even to get grayscale you've got to take several exposures and average.  The upside is that the chips in a $10 1GB (8 gigabit) DIMM could be used to build an 8 *gigapixel* camera (in literal on-or-off black and white, however), and the entire thing could be read out at dozens of frames per second.  If you were willing to restrict the focal area and resolution, you could easily get several million frames per second, although you'd probably need a very bright light source.

Over time, DRAM capacity has improved enormously (my first computer, in 1987, was a "Fat Mac" with a whopping 512KB of RAM!).  DRAM access bandwidth has also improved dramatically, mostly by increasing the number of transfers per memory clock cycle (SDRAM to DDR3), to the point where memory busses are approaching a hundred gigabytes per second.  But latency has only improved by about twofold since the 1980s!

Levels of Cache

Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask".  Like any loop, each iteration takes some amount of time; but what's surprising is that there's a very strong dependence of the speed on the value of "mask", which establishes the size of the array we're jumping around in.

(Executable NetRun Link)
for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
	sum+=buf[loc&mask]++;
	loc+=1234567;
}
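
If you want to play with this outside NetRun, here's a minimal self-contained sketch of the same loop; the buffer size, mask, iteration count, and clock()-based timing below are my own choices, not the setup used for the table that follows.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
	enum {max=10*1000*1000};          /* iterations per measurement */
	long mask=64*1024-1;              /* array size minus 1, in bytes (must be 2^k-1) */
	char *buf=calloc(16*1024*1024,1); /* big enough for any mask we might try */
	long sum=0, loc=0;
	clock_t start=clock();
	for (long i=0;i<max;i++) {        /* jump around in buffer, incrementing as we go */
		sum+=buf[loc&mask]++;
		loc+=1234567;
	}
	double ns=(clock()-start)*1.0e9/CLOCKS_PER_SEC/max;
	printf("mask=%ld bytes: %.2f ns/iteration (sum=%ld)\n",mask,ns,sum);
	free(buf);
	return 0;
}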
Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").
Size (KB)  3.1GHz Sandy  2.4GHz Q6600  2.8GHz P4  1.6GHz P-M  2.2GHz Athlon64  2.0GHz PPC G5  900MHz P3  500MHz ARM  300MHz PPC
1.02       0.89          1.54          4.92       4.88        2.27             5.53           17.47      24.5        16.03
2.05       0.89          1.54          4.6        4.87        2.28             5.53           17.47      24.5        16.03
4.1        0.89          1.54          4.72       4.9         2.28             5.18           17.48      24.5        16.03
8.19       0.89          1.54          4.83       4.91        2.28             3.6            17.49      24.5        16.03
16.38      0.89          1.54          4.75       4.91        2.28             3.6            17.52      24.5        16.03
32.77      0.89          1.54          6.59       4.84        2.28             3.63           21.57      196.0       16.03
65.54      1.90          2.96          6.84       10.1        2.29             5.31           21.58      229         16.64
131.07     1.91          2.66          6.92       10.11       5.26             5.31           21.97      314         40.31
262.14     2.01          2.66          7.13       10.11       6.92             5.31           98.28      320         40.34
524.29     2.09          2.66          8.48       10.07       10.13            23.98          144.04     388         52.33
1048.58    2.09          2.66          19.33      10.43       38.95            44.59          153.2      521         49.86
2097.15    2.59          4.68          54.33      28.87       76.15            99.11          156.86     552         144.76
4194.3     7.34          11.00         76.31      85.3        78.05            112.55         157.32     551         256.09
8388.61    8.50          29.37         75.33      111.43      78.81            210.04         159.43     551         342.73
16777.22   9.71          30.42         77.49      120.39      81.77            214.19         166.52     558         166.52
33554.43   10.52         31.16         77.93      126.73      81.56            208.21         168.58     551         168.58

I claim each performance plateau corresponds to a chunk of hardware.  Note that there are three jumps in the timings:
- The first jump hits when the array no longer fits in the small, fast L1 cache (a few tens of KB on most of these machines).
- The second jump hits when the array overflows the larger L2 (or last-level) cache, anywhere from a few hundred KB to a few MB depending on the CPU.
- The third jump hits when the array doesn't fit in any cache at all, so nearly every access has to go all the way out to DRAM.

Memory Speed

In general, memory accesses have performance that's:
- Fast, if you access only a small amount of data (it stays in cache), or if you access data consecutively (streaming).
- Slow, if you jump around randomly inside a big buffer.

So if you're getting bad performance, you can either:
- Make your data smaller, so it fits in cache.
- Access your data in a more regular, consecutive order, so it streams.

These are actually two aspects of "locality": the values you access should be similar.  The cache lets you reuse stuff you've used recently in time (temporal locality); streaming access is about touching stuff nearby in space (spatial locality).
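
A classic place spatial locality shows up is 2D array traversal order.  This little sketch (the array size and timing harness are arbitrary choices of mine) sums the same data twice, once row-by-row and once column-by-column; both touch exactly the same 64MB, but only the first one touches it consecutively.

#include <stdio.h>
#include <time.h>

enum {N=4096};
static float img[N][N];

/* Row-major traversal: consecutive accesses are 4 bytes apart (good spatial locality). */
double sum_by_rows(void) {
	double sum=0;
	for (int r=0;r<N;r++)
		for (int c=0;c<N;c++)
			sum+=img[r][c];
	return sum;
}

/* Column-major traversal: consecutive accesses are N*4 bytes apart (poor spatial locality). */
double sum_by_cols(void) {
	double sum=0;
	for (int c=0;c<N;c++)
		for (int r=0;r<N;r++)
			sum+=img[r][c];
	return sum;
}

int main(void) {
	clock_t t0=clock();
	double a=sum_by_rows();
	clock_t t1=clock();
	double b=sum_by_cols();
	clock_t t2=clock();
	printf("rows: %.0f ms   cols: %.0f ms   (sums %.1f %.1f)\n",
		(t1-t0)*1000.0/CLOCKS_PER_SEC,(t2-t1)*1000.0/CLOCKS_PER_SEC,a,b);
	return 0;
}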

There are lots of different ways to improve locality: shrink your working set so it fits in cache, reorder your loops so they walk through memory consecutively, block/tile big computations so each chunk of data gets reused while it's still in cache, or pack the fields you actually use together so they share cache lines.
Here's some actual size-vs-stride memory performance I measured on a Core2 Duo CPU using this NetRun program.

Size  \  Stride       2        11       43       155      546
1.0 KB             2.96ns   2.97ns   2.98ns   2.96ns   2.96ns
4.0 KB             3.03ns   3.03ns   3.10ns   3.10ns   3.06ns
16.0 KB            3.04ns   3.05ns   3.10ns   3.10ns   3.06ns
64.0 KB            3.03ns   3.10ns   3.84ns   3.47ns   3.24ns   < Data barely fits in L1 cache
256.0 KB           3.03ns   3.10ns   3.82ns   3.96ns   3.96ns
1024.0 KB          3.03ns   3.18ns   4.57ns   4.01ns   4.53ns
4096.0 KB          3.05ns   3.21ns   22.19ns  31.05ns  35.02ns  < Data no longer fits in L2 cache
16384.0 KB         3.03ns   3.27ns   22.90ns  42.74ns  43.54ns

Or to summarize:

              Local   Nonlocal
Small Data    Good    Good
Big Data      Good    Bad

Note that up-to-256KB access is always fast (good temporal locality; those 256 Kbytes just get hit over and over).  Similarly, stride-2 access is always fast (good spatial locality; the next access is just 2 bytes from the previous access).  The only time memory access is slow is when jumping around in a big buffer--and it's 10x slower when you do that!
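
If you'd like to reproduce this sort of size-vs-stride measurement yourself, here's a rough self-contained sketch; the loop structure, wrap-around logic, and timing here are my own, not the linked NetRun program.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

volatile long sink; /* keeps the compiler from optimizing the loops away */

/* Walk a 'size'-byte buffer with a fixed stride; return ns per access. */
double time_strided(char *buf,long size,long stride,long accesses) {
	long loc=0, sum=0;
	clock_t start=clock();
	for (long i=0;i<accesses;i++) {
		sum+=buf[loc]++;
		loc+=stride;
		if (loc>=size) loc-=size; /* wrap around inside the buffer */
	}
	sink=sum;
	return (clock()-start)*1.0e9/CLOCKS_PER_SEC/accesses;
}

int main(void) {
	enum {accesses=10*1000*1000};
	long strides[]={2,11,43,155,546};
	char *buf=calloc(16*1024*1024,1);
	for (long size=1024;size<=16*1024*1024;size*=4) {
		printf("%9.1f KB:",size/1024.0);
		for (int s=0;s<5;s++)
			printf("  %6.2f ns",time_strided(buf,size,strides[s],accesses));
		printf("\n");
	}
	free(buf);
	return 0;
}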

Why you care: Set-associative Cache Mapping

Anytime you have a cache, of any kind, you need to figure out what to do when the cache gets full.  Generally, you face this problem when you've got a new element X to load into the cache--which cache slot do you place  X into?

The simplest approach is a "direct mapped cache", where element X goes into cache slot X%N (where N is the number of slots in the cache).  Direct mapping is cheap to build, and adjacent elements like 1 and 2 land in different (adjacent) slots, so you can support many elements; the catch is that any two elements exactly N apart compete for the same slot.
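
As a concrete sketch of the mapping (the 64KB cache size and 64-byte line size below are just example parameters, not a description of any particular chip), here's how a direct-mapped cache picks a slot, and why two addresses exactly one cache-size apart always collide:

#include <stdio.h>

enum {LINE_SIZE=64};                   /* bytes per cache line (example value) */
enum {CACHE_SIZE=64*1024};             /* total cache size (example value) */
enum {NUM_LINES=CACHE_SIZE/LINE_SIZE}; /* 1024 slots */

/* Direct mapping: each address has exactly one slot it's allowed to live in. */
unsigned slot_for(unsigned address) {
	return (address/LINE_SIZE)%NUM_LINES;
}

int main(void) {
	printf("0x0ABCD -> slot %u\n",slot_for(0x0ABCD));
	printf("0x1ABCD -> slot %u\n",slot_for(0x1ABCD)); /* 64KB later: same slot, so they fight */
	printf("0x0ABCE -> slot %u\n",slot_for(0x0ABCE)); /* 1 byte later: same line, no conflict */
	return 0;
}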

For example, the Pentium 4's L1 cache behaves as if it were direct-mapped over a 64KB range: address 0x0ABCD and address 0x1ABCD (which are 64KB apart) both get mapped to the same place in the cache.  So this program is very fast (5.2ns/call):
enum {n=1024*1024};
char arr[n];

int foo(void) {
	arr[0]++;
	arr[12345]++;
	return 0;
}

(Try this in NetRun now!)

By contrast this very similar-looking program is very slow (20+ns/call), because both array elements (exactly 64KB apart) map to the same line of the cache, so the CPU keeps overwriting one with the other (called "cache thrashing"), and the cache is totally useless:
enum {n=1024*1024};
char arr[n];

int foo(void) {
	arr[0]++;
	arr[65536]++;
	return 0;
}

(Try this in NetRun now!)

In general, power-of-two jumps in memory can be very slow on direct-mapped machines.  This is one of the only cases on computers where powers of two are not ideal!

Some machines avoid the thrashing of a direct-mapped cache by allowing a given memory address to sit in one of two or four slots, called two- or four-way "set associative caching" (described here, or in the Hennessy & Patterson book, Chapter 7).  On such machines, you can still get cache thrashing with power of two jumps, but you need more and longer jumps to do so.
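
Here's the same mapping sketch extended to a set-associative cache (again with made-up parameters): with 2 ways per set, two addresses that map to the same set can both stay resident, and you only start thrashing when a third conflicting address shows up.

#include <stdio.h>

enum {LINE_SIZE=64}; /* example parameters: 64KB total, 2-way set associative */
enum {WAYS=2};
enum {NUM_SETS=64*1024/(LINE_SIZE*WAYS)}; /* = 512 sets */

/* Set-associative mapping: an address picks a set, but can sit in any of the
   WAYS slots within that set. */
unsigned set_for(unsigned address) {
	return (address/LINE_SIZE)%NUM_SETS;
}

int main(void) {
	unsigned a=0x0ABCD;
	unsigned b=a+NUM_SETS*LINE_SIZE;   /* 32KB later: same set */
	unsigned c=a+2*NUM_SETS*LINE_SIZE; /* 64KB later: same set again */
	printf("sets: %u %u %u\n",set_for(a),set_for(b),set_for(c));
	/* With WAYS=2, a and b can both stay cached at once; only when c shows up
	   does one of them have to be evicted. */
	return 0;
}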

In general, there are many smarter cache replacement algorithms (like least-recently-used), although most are too complicated to implement directly in hardware.  The OS treats RAM as a cache of data on disk, and since disks are so slow, it can pay to be smart about choosing which page to write to disk.
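
For example, here's a minimal sketch of a least-recently-used (LRU) policy in software; this is a generic illustration, not how any particular OS or CPU actually implements replacement.

#include <stdio.h>

enum {SLOTS=4};
typedef struct {
	int tag;        /* which block is stored here (-1 = empty) */
	long last_used; /* timestamp of the most recent access */
} cache_slot;

cache_slot cache[SLOTS];
long now=0;

/* Access block 'tag': return 1 on a hit, 0 on a miss (evicting the LRU slot). */
int access_block(int tag) {
	int lru=0;
	for (int i=0;i<SLOTS;i++) {
		if (cache[i].tag==tag) {          /* hit: just refresh the timestamp */
			cache[i].last_used=++now;
			return 1;
		}
		if (cache[i].last_used<cache[lru].last_used) lru=i;
	}
	cache[lru].tag=tag;                   /* miss: evict the least recently used slot */
	cache[lru].last_used=++now;
	return 0;
}

int main(void) {
	for (int i=0;i<SLOTS;i++) cache[i].tag=-1;
	int refs[]={1,2,3,4,1,2,5,1,2,3};     /* a little reference string */
	for (int i=0;i<10;i++)
		printf("block %d: %s\n",refs[i],access_block(refs[i])?"hit":"miss");
	return 0;
}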

Why you care: Multicore Cache Coherence and Cache Thrashing

If two threads try to write to the *same* variable at the same time, you can get the wrong answer.

But surprisingly, if two threads try to write to different but nearby variables at the same time, you get the right answer, but a big performance hit!
enum {n=2}; /* number of threads */
enum {m=1000*1000}; /* number of times to access the int */
int offset=1; /* distance, in ints, between the threads' accesses */
volatile int arr[1025];
void hammer_on_int(volatile int *ptr) {
	for (int i=0;i<m;i++)
		(*ptr)++;
}
int multi_hammer(void) {
#pragma omp parallel for
	for (int thread=0;thread<n;thread++) {
		hammer_on_int(&arr[thread*offset]);
	}
	return 0;
}
int foo(void) {
	for (offset=1;offset<=1024;offset*=2)
		printf("%d-byte offset took %.3f ns/++\n",
			(int)(sizeof(int)*offset),time_function(multi_hammer)*1.0e9/m);
	return 0;
}

(Try this in NetRun now!)

This program prints out:
4-byte offset took 19.437 ns/++
8-byte offset took 19.304 ns/++
16-byte offset took 20.442 ns/++
32-byte offset took 22.939 ns/++
64-byte offset took 2.615 ns/++
128-byte offset took 2.601 ns/++
256-byte offset took 2.598 ns/++
512-byte offset took 2.572 ns/++
1024-byte offset took 2.857 ns/++
2048-byte offset took 2.664 ns/++
4096-byte offset took 2.571 ns/++
Program complete. Return 0 (0x0)
For this machine, when two threads are working on data within the same 64-byte cache line, there's a substantial (10x!) slowdown, often called "false sharing" (not the same data, only nearby).

In software, how do you fix this?  You pad out data so adjacent threads are working in different cache lines.  Usually this happens automatically, but sometimes you need to reorganize the data to make it so.  Generally, you can avoid multithread cache coherence performance problems by *reducing* across-core locality, which is the opposite of the single-core cache case!
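
Here's a sketch of what that padding fix might look like; the 64-byte line size is an assumption about the hardware, and the struct layout is just one way to do it.

#include <stdio.h>

enum {CACHE_LINE=64};   /* assumed cache line size, in bytes */
enum {m=1000*1000};

/* Each thread gets its own counter, padded out to a full cache line, so no
   two threads ever write into the same line (no false sharing). */
typedef struct {
	volatile int value;
	char pad[CACHE_LINE-sizeof(int)];
} padded_counter;

padded_counter counters[8];

int main(void) {
	/* compile with OpenMP enabled (e.g. -fopenmp) to actually run in parallel */
	#pragma omp parallel for
	for (int thread=0;thread<8;thread++)
		for (int i=0;i<m;i++)
			counters[thread].value++;
	long total=0;
	for (int thread=0;thread<8;thread++) total+=counters[thread].value;
	printf("total increments: %ld\n",total);
	return 0;
}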