RAM and Cache Layout
CS 441 Lecture, Dr. Lawlor
Modern virtual memory is quite flexible: you can ask the OS to
change your page table and put memory at (almost) any address you like.
This program makes up an arbitrary pointer in the 48-bit user address range, and maps data there:
#include <sys/mman.h>
#include <iostream>
int foo(void) {
    void *ptr=(void *)0x3cafef00d000; /* made-up address in the 48-bit range */
    void *retptr=mmap(ptr,4096, PROT_WRITE|PROT_READ,
        MAP_PRIVATE|MAP_ANONYMOUS,-1,0); /* returns MAP_FAILED on error */
    std::cout<<"Did mmap, got back ptr "<<retptr<<"\n";
    int *ip=(int *)ptr;
    std::cout<<"Writing data...\n";
    *ip=7;
    std::cout<<"Reading data...\n";
    return *ip;
}
(Try this in NetRun now!)
Physical memory is typically much more restrictive; for example, on the old Mac Plus you had to cut a resistor off the motherboard to route memory to the right place if you added RAM.
Modern Memory Hardware
- The Load/Store buffer matches pending loads to pending stores, bypassing everything else entirely.
- The Translation Lookaside Buffer (TLB)
caches the page table, accelerating virtual to physical address
translation. Some caches are indexed by virtual address, which reduces
the pressure on the TLB, but then you need to figure out what to do
about context switches.
- The level 1 cache (L1) is tiny, typically a few dozen kilobytes, but fast,
typically a few nanoseconds. It's made fast by being simple (i.e., stupid),
typically direct mapped. Each core typically has its own L1.
- The level 2 cache (L2) is typically a few megabytes, and only a bit slower
than L1. L2 is typically shared between cores, although some machines push
this off to an L3 cache. (You can query these sizes at runtime; see the
sketch after this list.)
- The memory controller figures out how to talk across the memory access bus.
- The Dynamic Random Access Memory (DRAM) chips are huge, measured in gigabytes, but impossibly slow,
taking 50+ nanoseconds to do a random access. They're actually much
better at streaming accesses, which is ironic considering their name!
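The exact sizes above vary from machine to machine. On Linux with glibc you can ask sysconf for the cache geometry; here's a hedged sketch (the _SC_LEVEL* names are glibc extensions, and may report 0 or -1 on systems that don't expose this information):
#include <unistd.h>
#include <cstdio>
int main(void) {
    /* glibc-specific sysconf names; not portable to every libc */
    printf("L1 data cache: %ld bytes, %ld-byte lines\n",
        sysconf(_SC_LEVEL1_DCACHE_SIZE),
        sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 cache: %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache: %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}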
Note that currently, every part of this is happening without any
programmer intervention, although there has been a push in recent years
to (1) raise programmers' awareness of cache behavior, writing more
cache-friendly code, and (2) give programmers more control, including
hints like "do not cache this", and even fully software-managed on-chip
scratchpad memory.
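On x86, the "do not cache this" hint is exposed through compiler intrinsics. Here's a hedged sketch (assuming GCC or Clang on an SSE2-capable x86 machine) using _mm_prefetch to pull upcoming reads into cache and _mm_stream_si32 to stream writes past it:
#include <xmmintrin.h>  /* _mm_prefetch, _mm_sfence (SSE) */
#include <emmintrin.h>  /* _mm_stream_si32 (SSE2) */

enum {n=1024*1024};
int src[n], dst[n];

void copy_with_hints(void) {
    for (int i=0;i<n;i++) {
        int ahead=(i+16<n)?i+16:n-1;
        /* Hint: start pulling data we'll need soon into the cache. */
        _mm_prefetch((const char *)&src[ahead],_MM_HINT_T0);
        /* Hint: write dst[i] straight toward DRAM, bypassing the cache,
           since we won't be reading it again soon. */
        _mm_stream_si32(&dst[i],src[i]);
    }
    _mm_sfence(); /* make the streaming stores visible before anyone reads dst */
}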
DRAM Hardware
DRAM chips are 2D grids of little FET-and-capacitor cells. Each cell stores
a 1 or 0 as charge on its capacitor; the FET is normally off, insulating the
capacitor, and turns on only for a read or a write. The evolutionary history
of DRAM cells shown in Figure 5 here is quite fascinating: over time, as
cells have gotten denser they have actually been simplified (fewer parts to
fabricate), and are slowly pushing into 3D.
The capacitor's charge does slowly bleed off over a timescale of
milliseconds, so internally DRAM chips need to be "refreshed" (read out and
re-written) every millisecond or two, a job handled by dedicated refresh
circuitry.
There's more detail than you'd ever want to know about DRAM chip interfacing (RAS/CAS, EDO, FPM, DDR, etc.) at Ars Technica.
Charged DRAM cells are slightly light sensitive, likely due to
photoelectric electrons bleeding off the capacitor's charge. This means you can actually build a camera using a RAM chip as a sensor
(this was much easier back in the day, when chips were often packaged
in a little metal "can" instead of being semi-permanently cast into plastic or
ceramic). The downside is the sensors have a native 1-bit
resolution (real black-and-white), so even to get grayscale you've got
to take several exposures and average. The upside is that the
chips in a $50 1GB (8 gigabit) DIMM could be used to build an 8
*gigapixel* camera (in literal on-or-off black and white, however), and the
thing could be read out at several frames per second. If you were willing
to restrict the focal area and resolution, you could easily get hundreds of
thousands of frames per second, although you'd probably need a very bright
light source.
Levels of Cache
Here's an inner loop that does something funny--it jumps around (using
"loc") inside an array called "buf", but only within the bounds
established by "mask". Like any loop, each iteration takes some
amount of time; but what's surprising is that there's a very strong
dependence of the speed on the value of "mask", which establishes the
size of the array we're jumping around in.
(Executable NetRun Link)
for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
    sum+=buf[loc&mask]++;
    loc+=1234567;
}
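If you want to try this outside NetRun, here's a hedged standalone version of the same loop (the buffer size and iteration count are my choices; "mask" should be one less than a power-of-two count of ints):
enum {max=1000000};       /* iterations per measurement (my choice) */
int buf[8*1024*1024];     /* 32MB of ints: big enough for every size in the table below */

int run(int mask) {       /* mask = (number of ints in the region) - 1 */
    int loc=0,sum=0;
    for (int i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
        sum+=buf[loc&mask]++;
        loc+=1234567;
    }
    return sum;
}
Time run((1<<k)-1) for increasing k, and divide by max to get nanoseconds per iteration.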
Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").
Size (KB)  | 2.8GHz P4 | 1.6GHz P-M | 2.2GHz Athlon64 | 2.0GHz PPC G5 | 900MHz P3 | 300MHz PPC
1.02       | 4.92      | 4.88       | 2.27            | 5.53          | 17.47     | 16.03
2.05       | 4.6       | 4.87       | 2.28            | 5.53          | 17.47     | 16.03
4.1        | 4.72      | 4.9        | 2.28            | 5.18          | 17.48     | 16.03
8.19       | 4.83      | 4.91       | 2.28            | 3.6           | 17.49     | 16.03
16.38      | 4.75      | 4.91       | 2.28            | 3.6           | 17.52     | 16.03
32.77      | 6.59      | 4.84       | 2.28            | 3.63          | 21.57     | 16.03
65.54      | 6.84      | 10.1       | 2.29            | 5.31          | 21.58     | 16.64
131.07     | 6.92      | 10.11      | 5.26            | 5.31          | 21.97     | 40.31
262.14     | 7.13      | 10.11      | 6.92            | 5.31          | 98.28     | 40.34
524.29     | 8.48      | 10.07      | 10.13           | 23.98         | 144.04    | 52.33
1048.58    | 19.33     | 10.43      | 38.95           | 44.59         | 153.2     | 49.86
2097.15    | 54.33     | 28.87      | 76.15           | 99.11         | 156.86    | 144.76
4194.3     | 76.31     | 85.3       | 78.05           | 112.55        | 157.32    | 256.09
8388.61    | 75.33     | 111.43     | 78.81           | 210.04        | 159.43    | 342.73
16777.22   | 77.49     | 120.39     | 81.77           | 214.19        | 166.52    | 166.52
33554.43   | 77.93     | 126.73     | 81.56           | 208.21        | 168.58    | 168.58
I claim each performance plateau corresponds to a chunk of hardware. Note that there are three jumps in the timings:
- "L1" cache is the fastest (like 5ns) but tiny (100KB or less).
- "L2" cache is slower (7-10ns) but much bigger (up to a meg)
- RAM is painfully slow (100ns or more) but huge (gigs)
Note that machines have been getting faster and faster (compare the older machines in the rightmost columns), but RAM latency isn't much better nowadays!
Set-associative Cache Mapping
Cache is designed to speed up your reads and writes to values in
DRAM.
The typical way you do this is to break up recently accessed DRAM data
into "cache lines" of around 64 bytes, each stored alongside a "tag"
recording its original address in DRAM. Typically, cache lines are stored
in some predictable order--DRAM address X ends up at some predictable
location in the cache.
The simplest approach is a "direct mapped cache",
where element X goes into cache slot X%N (where N is the size of the
cache). For example, the Pentium 4's L1 cache is 64KB in size and
direct-mapped. This means address 0x0ABCD and address 0x1ABCD
(which are 64KB apart) both get mapped to the same place in the
cache. So while this program is very fast (5.2ns/call):
enum {n=1024*1024};
char arr[n];
int foo(void) {
    arr[0]++;
    arr[12345]++;
    return 0;
}
(Try this in NetRun now!)
By contrast this very similar-looking program is very slow
(20+ns/call), because both array elements (exactly 64KB apart) map to
the same line of the cache, so the CPU keeps overwriting one with the
other (called "cache thrashing"), and the cache is totally useless:
enum {n=1024*1024};
char arr[n];
int foo(void) {
    arr[0]++;
    arr[65536]++;
    return 0;
}
(Try this in NetRun now!)
In general, power-of-two jumps in memory can be very slow on
direct-mapped machines. This is one of the only cases on
computers where powers of two are not ideal!
Some machines avoid the thrashing of a direct-mapped cache by allowing
a given memory address to sit in one of two or four slots, called two-
or four-way "set associative caching" (described here,
or in the Hennessy & Patterson book, Chapter 7). On such machines,
you can still get cache thrashing with power of two jumps, but you need
more and longer jumps to do so.
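To make the mapping concrete, here's a hedged sketch that computes which set an address lands in (the 64-byte lines, 64KB capacity, and 4-way associativity are made-up illustration parameters, not any particular CPU's; set WAYS to 1 and you get a direct-mapped cache):
#include <cstdio>

enum {LINE=64};                       /* bytes per cache line (assumed) */
enum {CACHE_BYTES=64*1024};           /* total capacity (assumed) */
enum {WAYS=4};                        /* 1 == direct mapped */
enum {SETS=CACHE_BYTES/(LINE*WAYS)};  /* number of sets */

int main(void) {
    unsigned long addrs[2]={0x0ABCD,0x1ABCD}; /* 64KB apart */
    for (int i=0;i<2;i++) {
        unsigned long line_addr=addrs[i]/LINE; /* which line of DRAM */
        unsigned long set=line_addr%SETS;      /* where it must live in cache */
        unsigned long tag=line_addr/SETS;      /* stored to tell lines apart */
        printf("address 0x%lX -> set %lu, tag 0x%lX\n",addrs[i],set,tag);
    }
    return 0;
}
Both addresses land in the same set; a direct-mapped cache (WAYS=1) can only keep one of them at a time, while a 4-way cache has room for both.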
In general, there are many more complicated cache replacement algorithms,
although most of them are too complicated to implement in
hardware. The OS treats RAM as a cache of data on disk, and since
disks are so slow, it can pay to be smart about choosing which page to
write to disk.
Memory-Carried Superscalar Dependencies
Especially on x86, which has very few registers, the CPU has to work hard to make memory accesses fast.
First, memory dependencies are actually tracked at runtime, so this
code where each loop iteration touches a different element runs fairly
fast:
const int n=1000;
int arr[n];
int i;
for (i=0;i<n;i++)
    arr[i]*=i;
return arr[0];
(Try this in NetRun now!)
But this code, where each loop iteration touches the same element zero,
runs over 2x slower, because the different loop iterations cannot be
overlapped.
const int n=1000;
int arr[n];
int i;
for (i=0;i<n;i++)
    arr[0]*=i;
return arr[0];
(Try this in NetRun now!)
How can the CPU possibly do this? Well, to make superscalar
execution work, the CPU has to track *pending* loads and stores anyway,
just in case that instruction has to be rolled back due to an exception
(hardware interrupt, divide-by-zero, page fault). So whenever the
CPU issues a load, it just checks to see if there's a pending store to
that location, and if needed delays the load until the corresponding
store is finished.
This "load-store buffer" is a pretty common trick on superscalar CPUs.
Empirical Memory Speed
In general, memory accesses have performance that's:
- Good *if* you're accessing a small amount of memory--small
enough to stay in cache. The pieces of memory you use often
automatically get copied into fast cache memory. For example, the
top of the stack, recent stuff from the heap, and commonly used globals
are almost always in cache, and hence really fast. Machines only
have on the order of 1 meg of cache nowadays.
- Good *if* you're accessing memory sequentially--if you
access a[i], you then access a[i+1]. This "streaming" access is
fast because memory chips don't have to change rows.
- Terrible *if* you're accessing a big buffer (too big to fit
in cache) in a nonsequential way. Cached accesses are on the
order of 1ns. Streaming access is also a few ns. Random
access that isn't in the cache is on the order of 100ns!
So if you're getting bad performance, you can either:
- Make things smaller, or re-use them more often, to take advantage of the cache.
- Make your accesses more regular, to get streaming access.
These are actually two aspects of "locality": the memory you touch should
be close to memory you've touched before, either in time or in space. The
cache lets you reuse stuff you've used recently in time (temporal locality);
streaming access is about touching stuff nearby in space (spatial locality).
There are lots of different ways to improve locality:
- Change your algorithm to improve locality. Think about how
your loops access memory. Often it's trivial to reorganize your
loops so you touch the same data 5 times before hitting the next piece
of data (temporal locality); or so you access arrays sequentially
(spatial locality). Sometimes an algorithm that's theoretically
more efficient will be slower than another algorithm with more
instructions but better locality!
- Change your data structure to improve locality. For
example, you might change linked lists into arrays (see the sketch after
this list). The trouble with linked lists is that the nodes are allocated
at scattered, separate locations; an array is contiguous in memory (spatial locality).
- Change your input data to improve locality. For example,
processing data in small pieces (1KB) will usually have better temporal
locality than huge chunks (1GB). Of course, if your pieces are
too small, sometimes your program will slow down because of the
per-piece overhead!
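Here's a hedged sketch of the linked-list-to-array point above (the node type and sizes are made up for illustration):
enum {n=1000000};

/* Linked version: nodes are separate heap allocations, so following
   "next" pointers jumps around in memory (poor spatial locality). */
struct node { int value; node *next; };
int sum_list(node *head) {
    int sum=0;
    for (node *p=head;p!=0;p=p->next) sum+=p->value;
    return sum;
}

/* Array version: the values are contiguous, so the cache line that
   holds arr[i] usually already holds arr[i+1] (good spatial locality). */
int sum_array(const int *arr) {
    int sum=0;
    for (int i=0;i<n;i++) sum+=arr[i];
    return sum;
}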
Here's some actual size-vs-stride memory performance I measured on a Core2 Duo CPU
using this NetRun program.
Size \ Stride (bytes) | 2      | 11     | 43      | 155     | 546
1.0 KB                | 2.96ns | 2.97ns | 2.98ns  | 2.96ns  | 2.96ns
4.0 KB                | 3.03ns | 3.03ns | 3.10ns  | 3.10ns  | 3.06ns
16.0 KB               | 3.04ns | 3.05ns | 3.10ns  | 3.10ns  | 3.06ns
64.0 KB               | 3.03ns | 3.10ns | 3.84ns  | 3.47ns  | 3.24ns
256.0 KB              | 3.03ns | 3.10ns | 3.82ns  | 3.96ns  | 3.96ns
1024.0 KB             | 3.03ns | 3.18ns | 4.57ns  | 4.01ns  | 4.53ns
4096.0 KB             | 3.05ns | 3.21ns | 22.19ns | 31.05ns | 35.02ns
16384.0 KB            | 3.03ns | 3.27ns | 22.90ns | 42.74ns | 43.54ns
Note that up-to-256KB access is always fast (good temporal locality; those
256 Kbytes just get hit over and over). Similarly, stride-2 access
is always fast (good spatial locality; the next access is just 2 bytes
from the previous access). The only time memory access is slow is
when jumping around in a big buffer--and it's 10x slower when you do
that!
2D Arrays implemented as 1D Arrays
C and C++ don't actually support "real" 2D arrays--a 2D array is just a
contiguous 1D block of memory with the index arithmetic done for you.
For example, here's some 2D array code--check the disassembly:
int arr[4][3];
int foo(void) {
    arr[0][0]=0xA0B0;
    arr[0][1]=0xA0B1;
    arr[0][2]=0xA0B2;
    arr[1][0]=0xA1B0;
    arr[2][0]=0xA2B0;
    arr[3][0]=0xA3B0;
    return 0;
}
(executable NetRun link)
Here, arr[i][j] is at a 1D address like arr[i*3+j]. Here's a picture of the array, and the 1D addresses of each element:
     | i==0 | i==1 | i==2 | i==3
j==0 | [0]  | [3]  | [6]  | [9]
j==1 | [1]  | [4]  | [7]  | [10]
j==2 | [2]  | [5]  | [8]  | [11]
Note that adjacent j indices are actually adjacent in memory; adjacent
i indices are separated by 3 ints. There are lots of ways to say
this same thing:
- "j is the fast index", since as you walk through the 1D array,
the 2D value of j changes fast (every step), but i changes more slowly.
- "the array is column-major" (if i is the column number, like I've
drawn the table above, a whole row is contiguous in memory), *or* "the
array is row-major" (if i is the row number, like the transpose of my
figure, a whole row is contiguous in memory). Note that whether
you call an array row- or column-major depends on what you *define* as
the rows or columns!
In general, when you write
int arr[sx][sy];
arr[x][y]++;
The compiler turns this into a 1D array like so:
int arr[sx*sy];
arr[x*sy+y]++;
Here y is the fast index. Note that the x coordinate is multiplied by the y size.
This is because between x and x+1, there's one whole set of y's.
Between y and y+1, there's nothing--adjacent y values sit right next to each
other in memory. So you want your *innermost* loop to be y, since then that
loop will access contiguous array elements (see the sketch below).
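For example, with the arr[x*sy+y] layout the cache-friendly loop nest looks like this:
for (int x=0;x<sx;x++)       /* outer loop: x, the slow index */
    for (int y=0;y<sy;y++)   /* inner loop: y, the fast index */
        arr[x*sy+y]++;       /* adjacent y values are adjacent in memory */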
We could also have written
int arr[sy][sx];
arr[y][x]++;
Which in 1D notation is:
int arr[sy*sx];
arr[x+y*sx]++;
Now x is the fast index, and the y coordinate is multiplied by the x
size. You now want your innermost loop to be x, since adjacent x
values are contiguous in memory.
For example, this program has terrible performance-- 75ns/pixel:
enum {sx=512};
enum {sy=512};
double img[sx*sy];
int inc_img(void) {
    for (int x=0;x<sx;x++)
        for (int y=0;y<sy;y++)
            img[x+y*sx]++;
    return 0;
}
int foo(void) {
    double t=time_function(inc_img);
    printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
    return 0;
}
(Executable NetRun Link)
We can speed the program up by either:
- Avoiding the power-of-two jumps that screw up a set-associative
cache (see the set-associative mapping section above), by changing "sx" to
512+4. This alone speeds the program up to 15ns/pixel.
- Making "img" smaller, by changing it to a "char" (1 byte) instead
of "double" (8 bytes). This speeds the program up to 6ns/pixel
(try it!).
- Making the access more regular, by either interchanging the x and
y loops, or by changing how the array is indexed (so it's
"img[x*sy+y]"). This speeds the program up to 4ns/pixel.
- Doing both--small data with regular access. This speeds the program up to about 1ns/pixel--a 70x improvement! (A sketch combining these fixes follows.)
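Here's a hedged sketch combining the last two fixes above (a 1-byte "char" image, with x as the innermost loop so the accesses are contiguous), using the same NetRun time_function helper as the program above; the exact timings will of course vary by machine:
enum {sx=512};
enum {sy=512};
char img[sx*sy];                 /* 1 byte per pixel instead of 8 */
int inc_img(void) {
    for (int y=0;y<sy;y++)       /* outer loop: y, the slow index */
        for (int x=0;x<sx;x++)   /* inner loop: x, the fast index */
            img[x+y*sx]++;       /* contiguous, cache-friendly access */
    return 0;
}
int foo(void) {
    double t=time_function(inc_img);
    printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
    return 0;
}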
Caching In Parallel: Cache Coherence and Cache Thrashing
If two threads try to write to the *same* variable at the same time, you can get the wrong answer.
But surprisingly, if two threads try to write to different but nearby
variables at the same time, you get the right answer, but take a big
performance hit!
enum {n=2}; /* number of threads */
enum {m=1000*1000}; /* number of times to access the int */
int offset=1; /* distance, in ints, between threads' accesses */
volatile int arr[1025];
void hammer_on_int(volatile int *ptr) {
    for (int i=0;i<m;i++)
        (*ptr)++;
}
int multi_hammer(void) {
#pragma omp parallel for
    for (int thread=0;thread<n;thread++) {
        hammer_on_int(&arr[thread*offset]);
    }
    return 0;
}
int foo(void) {
    for (offset=1;offset<=1024;offset*=2)
        printf("%d-byte offset took %.3f ns/++\n",
            (int)(sizeof(int)*offset),time_function(multi_hammer)*1.0e9/m);
    return 0;
}
(Try this in NetRun now!)
This program prints out:
4-byte offset took 19.437 ns/++
8-byte offset took 19.304 ns/++
16-byte offset took 20.442 ns/++
32-byte offset took 22.939 ns/++
64-byte offset took 2.615 ns/++
128-byte offset took 2.601 ns/++
256-byte offset took 2.598 ns/++
512-byte offset took 2.572 ns/++
1024-byte offset took 2.857 ns/++
2048-byte offset took 2.664 ns/++
4096-byte offset took 2.571 ns/++
Program complete. Return 0 (0x0)
When two threads are working on data within the same 64-byte cache line, there's a substantial (10x!) slowdown.
Snoopy Cache Coherence Protocol
The problem is the cache coherence protocol the two CPUs use to ensure that writes to different locations will combine properly.
In particular, most modern machines use the MESI variant of MSI
(go read 'em!). These protocols have line granularity, so if two
CPUs are accessing data near each other (within one cache line), you'll
get "cache thrashing" as the CPUs have to hand the line back and forth.
The solution to this "false sharing" problem is to just separate data
accessed by multiple threads--they've got to be at least one cache line
apart for good performance.
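Here's a hedged sketch of that fix applied to the multi_hammer example above: give each thread's counter a whole (assumed 64-byte) cache line to itself.
enum {CACHE_LINE=64};  /* assumption: 64-byte lines, typical for current x86 */
struct padded_counter {
    volatile int value;
    char pad[CACHE_LINE-sizeof(int)];  /* keep neighbors off this line */
};
padded_counter counters[2];            /* one per thread, one per cache line */

int multi_hammer_padded(void) {
#pragma omp parallel for
    for (int thread=0;thread<2;thread++)
        for (int i=0;i<1000*1000;i++)
            counters[thread].value++;  /* no cache line ping-pongs between CPUs */
    return 0;
}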
Non-Uniform Memory Access (NUMA): Scalable Cache Coherence
Larger machines, where there is no shared bus, usually have non-uniform memory access:
local memory is faster to access than remote memory. To keep
track of the caches, NUMA machines usually use some form of directory protocol,
where there's a tiny stub associated with each potential cache line in
memory, telling you the CPU responsible for keeping track of that cache
line, or -1 if it's uncached.
Software Distributed Shared Memory (SDSM)
You can even write your own distributed shared memory cache coherence
protocol, using local RAM as a cache for remote RAM. Typically
the "cache line" size is the same as the size of a hardware page, say
4KB for x86. This is pretty big, so to avoid false sharing on
SDSM systems, you need to make sure different threads' data is separated by
many kilobytes!
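Here's a minimal Linux-only sketch of the core SDSM mechanism: map the "shared" region with no access permissions, then treat each page fault as a 4KB cache miss. (In a real system the handler would fetch the page's current contents from the owning node over the network; mprotect isn't formally async-signal-safe, but this is the standard trick.)
#include <sys/mman.h>
#include <signal.h>
#include <cstdio>
#include <cstring>
#include <cstdint>

enum {PAGE=4096};   /* hardware page size == SDSM "cache line" size */
static char *shared_base=0;

/* SIGSEGV handler: treat the fault like a cold cache miss on one page. */
static void on_fault(int sig,siginfo_t *info,void *ctx) {
    (void)sig; (void)ctx;
    char *page=(char *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE-1));
    /* A real SDSM system would fetch this page from its owning node here. */
    mprotect(page,PAGE,PROT_READ|PROT_WRITE); /* now "cached" locally */
}

int main(void) {
    /* Reserve 16 pages with no access: every first touch will fault. */
    shared_base=(char *)mmap(0,16*PAGE,PROT_NONE,
        MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
    struct sigaction sa;
    memset(&sa,0,sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction=on_fault;
    sa.sa_flags=SA_SIGINFO;
    sigaction(SIGSEGV,&sa,0);

    shared_base[0]=7;   /* faults once, handler "fetches" the page, retry succeeds */
    printf("read back %d\n",shared_base[0]);
    return 0;
}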