RAM and Cache Performance

DRAM Hardware

DRAM cells are basically little capacitors, and their charge does slowly bleed off over a timescale of milliseconds; internally they need to be "refreshed" (read off and re-written), which special circuitry performs.

There's more detail than you'd ever want to know about DRAM chip interfacing: RAS/CAS, EDO, FPM, DDR, etc at Ars Technica.

Charged DRAM cells are slightly light sensitive. This means you can actually build a camera using a RAM chip as a sensor (this was much easier back in the day, when chips were often packaged in a little metal "can" instead of being semi-permanently cast into plastic or ceramic). The downside is the sensors have a native 1-bit resolution (real black-and-white), so even to get grayscale you've got to take several exposures and average. The upside is that the chips in a $50 1GB (8 gigabit) DIMM could be used to build a 8 *gigapixel* camera (in literal on-or-off black and white, however), and the thing could be read out at several frames per second. If you were willing to restrict the focal area and resolution, you could easily get hundreds of thousands of frames per second, although you'd probably need a very bright lightsource.

Levels of Cache

Here's an inner loop that does something funny--it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask". Like any loop, each iteration takes some amount of time; but what's suprising is that there's a very strong dependence of the speed on the value of "mask", which establishes the size of the array we're jumping around in.

(Executable NetRun Link)

	for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
		sum+=buf[loc&mask]++;
		loc+=1234567;
	}

Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").

Size (KB)	2.8GHz P4	1.6GHz P-M	2.2Ghz Athlon64	2.0GHz PPC G5	900MHz P3	300MHz PPC
1.02	4.92	4.88	2.27	5.53	17.47	16.03
2.05	4.6	4.87	2.28	5.53	17.47	16.03
4.1	4.72	4.9	2.28	5.18	17.48	16.03
8.19	4.83	4.91	2.28	3.6	17.49	16.03
16.38	4.75	4.91	2.28	3.6	17.52	16.03
32.77	6.59	4.84	2.28	3.63	21.57	16.03
65.54	6.84	10.1	2.29	5.31	21.58	16.64
131.07	6.92	10.11	5.26	5.31	21.97	40.31
262.14	7.13	10.11	6.92	5.31	98.28	40.34
524.29	8.48	10.07	10.13	23.98	144.04	52.33
1048.58	19.33	10.43	38.95	44.59	153.2	49.86
2097.15	54.33	28.87	76.15	99.11	156.86	144.76
4194.3	76.31	85.3	78.05	112.55	157.32	256.09
8388.61	75.33	111.43	78.81	210.04	159.43	342.73
16777.22	77.49	120.39	81.77	214.19	166.52	166.52
33554.43	77.93	126.73	81.56	208.21	168.58	168.58

I claim each performance plateau corresponds to a chunk of hardware. Note that there are three jumps in the timings:

"L1" cache is the fastest (like 5ns) but tiny (100KB or less).
"L2" cache is slower (7-10ns) but much bigger (up to a meg)
RAM is painfully slow (100ns or more) but huge (gigs)

Note that machines have been getting faster and faster (see the slow machines on the right), but RAM isn't much faster nowadays!

Memory Speed

In general, memory accesses have performance that's:

Good *if* you're accessing a small amount of memory--small enough to stay in cache. The pieces of memory you use often automatically get copied into fast cache memory. For example, the top of the stack, recent stuff from the heap, and commonly used globals are almost always in cache, and hence really fast. Machines only have on the order of 1 meg of cache nowdays.
Good *if* you're accessing memory sequentially--if you access a[i], you then access a[i+1]. This "streaming" access is fast because memory chips don't have to change rows. See Ars Technica for piles of details on the mechanics of RAM access.
Terrible *if* you're accessing a big buffer (too big to fit in cache) in a nonsequential way. Cached accesses are on the order of 1ns. Streaming access is also a few ns. Random access that isn't in the cache is on the order of 100ns!

So if you're getting bad performance, you can either:

Make things smaller, or re-use them more often, to take advantage of the cache.
Make your accesses more regular, to get streaming access.

These are actually two aspects of "locality": the values you access should be similar. The cache lets you reuse stuff you've used recently in time (temporal locality); streaming access is about touching stuff nearby in space (spatial locality).

There are lots of different ways to improve locality:

Change your algorithm to improve locality. Think about how your loops access memory. Often it's trivial to reorganize your loops so you touch the same data 5 times before hitting the next piece of data (temporal locality); or so you access arrays sequentially (spatial locality). Sometimes an algorithm that's theoretically more efficient will be slower than another algorithm with more instructions but better locality!
Change your data structure to improve locality. For example, you might change linked lists into arrays. The trouble with linked lists is that links are allocated at random separate locations; but an array will be contiguous in memory (spatial locality).
Change your input data to improve locality. For example, processing data in small pieces (1KB) will usually have better temporal locality than huge chunks (1GB). Of course, if your pieces are too small, sometimes your program will slow down because of the per-piece overhead!

Here's some actual size-vs-stride memory performance I measured on a Core2 Duo CPU using this NetRun program.

Size \ Stride	2	11	43	155	546
1.0 KB	2.96ns	2.97ns	2.98ns	2.96ns	2.96ns
4.0 KB	3.03ns	3.03ns	3.10ns	3.10ns	3.06ns
16.0 KB	3.04ns	3.05ns	3.10ns	3.10ns	3.06ns
64.0 KB	3.03ns	3.10ns	3.84ns	3.47ns	3.24ns
256.0 KB	3.03ns	3.10ns	3.82ns	3.96ns	3.96ns
1024.0 KB	3.03ns	3.18ns	4.57ns	4.01ns	4.53ns
4096.0 KB	3.05ns	3.21ns	22.19ns	31.05ns	35.02ns
16384.0 KB	3.03ns	3.27ns	22.90ns	42.74ns	43.54ns

Note that up-to-256KB access is always fast (good temporal locality; those 256 Kbytes just get hit over and over). Similarly, stride-2 access is always fast (good spatial locality; the next access is just 2 bytes from the previous access). The only time memory access is slow is when jumping around in a big buffer--and it's 10x slower when you do that!

2D Arrays implemented as 1D Arrays

C and C++ don't actually support "real" 2D arrays.

For example, here's some 2D array code--check the disassembly

int arr[4][3];
int foo(void) {
	arr[0][0]=0xA0B0;
	arr[0][1]=0xA0B1;
	arr[0][2]=0xA0B2;
	arr[1][0]=0xA1B0;
	arr[2][0]=0xA2B0;
	arr[3][0]=0xA3B0;
	return 0;
}

(executable NetRun link)

Here, arr[i][j] is at a 1D address like arr[i*3+j]. Here's a picture of the array, and the 1D addresses of each element:

i==	0	1	2	3
j==0	[0]	[3]	[6]	[9]
j==1	[1]	[4]	[7]	[10]
j==2	[2]	[5]	[8]	[11]

Note that adjacent j indices are actually adjacent in memory; adjacent i indices are separated by 3 ints. There are lots of ways to say this same thing:

"j is the fast index", since as you walk through the 1D array, the 2D value of j changes fast (every step), but i changes more slowly.
"the array is column-major" (if i is the column number, like I've drawn the table above, a whole row is contiguous in memory), *or* "the array is row-major" (if i is the row number, like the transpose of my figure, a whole row is contiguous in memory). Note that whether you call an array row- or column-major depends on what you *define* as the rows or columns!

In general, when you write
    int arr[sx][sy];
    arr[x][y]++;

The compiler turns this into a 1D array like so:
    int arr[sx*sy];
    arr[x*sy+y]++;

Here y is the fast index. Note that the x coordinate is multiplied by the y size. This is because between x and x+1, there's one whole set of y's. Between y and y+1, there's nothing! So you want your *innermost* loop to be y, since then that loop will access contiguous array elements.

We could also have written
    int arr[sy][sx];
    arr[y][x]++;

Which in 1D notation is:
    int arr[sy*sx];
    arr[x+y*sx]++;

Now x is the fast index, and the y coordinate is multiplied by the x size. You now want your innermost loop to be x, since adjacent x values are contiguous in memory.

For example, this program has terrible performance-- 75ns/pixel:

enum {sx=512};
enum {sy=512};
double img[sx*sy];

int inc_img(void) {
        for (int x=0;x<sx;x++)
	for (int y=0;y<sy;y++)
		img[x+y*sx]++;
	return 0;
}

int foo(void) {
	double t=time_function(inc_img);
	printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
	return 0;
}
(Executable NetRun Link)

We can speed the program up by either:

Avoiding the power-of-two-jumps that screw up a set-associative cache (see below), by changing "sx" to 512+4. This alone speeds the program up to 15ns/pixel.

Making "img" smaller, by changing it to a "char" (1 byte) instead of "double" (8 bytes). This speeds the program up to 6ns/pixel (try it!).
Making the access more regular, by either interchanging the x and y loops, or by changing how the array is indexed (so it's "img[x*sy+y]"). This speeds the program up to 4ns/pixel.
Doing both--small data with regular access. This speeds the program up to about 1ns/pixel--a 70x improvement!

Set-associative Cache Mapping

Anytime you have a cache, of any kind, you need to figure out what to do when the cache gets full. Generally, you face this problem when you've got a new element X to load into the cache--which cache slot do you place X into?

The simplest approach is a "direct mapped cache", where element X goes into cache slot X%N (where N is the size of the cache). Direct mapping means elements 1 and 2 will go into different adjacent slots, but you can support many elements.

For example, the Pentium 4's L1 cache is 64KB in size and direct-mapped. This means address 0x0ABCD and address 0x1ABCD (which are 64KB apart) both get mapped to the same place in the cache. So even though this program is very fast (5.2ns/call):

enum {n=1024*1024};
char arr[n]; 

int foo(void) {
	arr[0]++;
	arr[12345]++;
	return 0;
}

(Try this in NetRun now!)

By contrast this very similar-looking program is very slow (20+ns/call), because both array elements (exactly 64KB apart) map to the same line of the cache, so the CPU keeps overwriting one with the other (called "cache thrashing"), and the cache is totally useless:

enum {n=1024*1024};
char arr[n]; 

int foo(void) {
	arr[0]++;
	arr[65536]++;
	return 0;
}

(Try this in NetRun now!)

In general, power-of-two jumps in memory can be very slow on direct-mapped machines. This is one of the only cases on computers where powers of two are not ideal!

Some machines avoid the thrashing of a direct-mapped cache by allowing a given memory address to sit in one of two or four slots, called two- or four-way "set associative caching" (described here, or in the Hennessy & Patterson book, Chapter 7). On such machines, you can still get cache thrashing with power of two jumps, but you need more and longer jumps to do so.

In general, there are many more complicated cache replacement algorithms, although most of them are too complicated to implement in hardware. The OS treats RAM as a cache of data on disk, and since disks are so slow, it can pay to be smart about choosing which page to write to disk.

Memory-Carried Superscalar Dependencies

Especially on x86, which has very few registers, the CPU has to work hard to make memory accesses fast.

First, memory dependencies are actually tracked at runtime, so this code where each loop iteration touches a different element runs fairly fast:

const int n=1000;
int arr[n];
int i;
for (i=0;i<n;i++)
	arr[i]*=i;
return arr[0];

(Try this in NetRun now!)

But this code, where each loop iteration touches the same element zero, runs over 2x slower, because the different loop iterations cannot be overlapped.

const int n=1000;
int arr[n];
int i;
for (i=0;i<n;i++)
	arr[0]*=i;
return arr[0];

(Try this in NetRun now!)

How can the CPU possibly do this? Well, to make superscalar execution work, the CPU has to track *pending* loads and stores anyway, just in case that instruction has to be rolled back due to an exception (hardware interrupt, divide-by-zero, page fault). So whenever the CPU issues a load, it just check to see if there's a pending store to that location, and if needed delays the load until the corresponding store is finished.

This "load-store buffer" is a pretty common trick on superscalar CPUs.