Memory Speed and 2D Arrays
CS 301 Lecture, Dr. Lawlor
Memory Speed
Memory accesses have performance that's:
- Good *if* you're accessing a small amount of memory--small
enough to stay in cache. The pieces of memory you use often
automatically get copied into fast cache memory. For example, the
top of the stack, recent stuff from the heap, and commonly used globals
are almost always in cache, and hence really fast. Machines only
have on the order of 1 meg of cache nowadays.
- Good *if* you're accessing memory sequentially--if you
access a[i], you then access a[i+1]. This "streaming" access is
fast because memory chips don't have to change rows. See Ars Technica for piles of details on the mechanics of RAM access.
- Terrible *if* you're accessing a big buffer (too big to fit
in cache) in a nonsequential way. Cached accesses are on the
order of 1ns. Streaming access is also a few ns. Random
access that isn't in the cache is on the order of 100ns!
So if you're getting bad performance, you can either:
- Make things smaller, or re-use them more often, to take advantage of the cache.
- Make your accesses more regular, to get streaming access.
These are actually two aspects of "locality": the values you access
should be close together, in time or in space. The cache lets you
re-use stuff you've touched recently in time (temporal locality);
streaming access is about touching stuff nearby in space (spatial locality).
There are lots of different ways to improve locality:
- Change your algorithm to improve locality. Think about how
your loops access memory. Often it's trivial to reorganize your
loops so you touch the same data 5 times before hitting the next piece
of data (temporal locality); or so you access arrays sequentially
(spatial locality). Sometimes an algorithm that's theoretically
more efficient will be slower than another algorithm with more
instructions but better locality!
- Change your data structure to improve locality. For
example, you might change linked lists into arrays. The trouble
with linked lists is that links are allocated at random separate
locations; but an array will be contiguous in memory (spatial locality).
- Change your input data to improve locality. For example,
processing data in small pieces (1KB) will usually have better temporal
locality than huge chunks (1GB). Of course, if your pieces are
too small, sometimes your program will slow down because of the
per-piece overhead!
Here's the actual memory performance I measured on a Core2 Duo CPU
using this NetRun program.
Size \ Stride |      2 |     11 |     43 |    155 |    546
   1.0 KB     | 2.96ns | 2.97ns | 2.98ns | 2.96ns | 2.96ns
   4.0 KB     | 3.03ns | 3.03ns | 3.10ns | 3.10ns | 3.06ns
  16.0 KB     | 3.04ns | 3.05ns | 3.10ns | 3.10ns | 3.06ns
  64.0 KB     | 3.03ns | 3.10ns | 3.84ns | 3.47ns | 3.24ns
 256.0 KB     | 3.03ns | 3.10ns | 3.82ns | 3.96ns | 3.96ns
1024.0 KB     | 3.03ns | 3.18ns | 4.57ns | 4.01ns | 4.53ns
4096.0 KB     | 3.05ns | 3.21ns | 22.19ns | 31.05ns | 35.02ns
16384.0 KB    | 3.03ns | 3.27ns | 22.90ns | 42.74ns | 43.54ns
Note that any buffer up to 256KB is always fast (good temporal
locality: the whole buffer fits in cache, so those 256KB just get hit
over and over). Similarly, stride-2 access is always fast even for
huge buffers (good spatial locality: each access is just 2 bytes past
the previous one). Memory access is only slow when jumping around in a
big buffer--and then it's over 10x slower!
2D Arrays implemented as 1D Arrays
C and C++ don't actually support "real" 2D arrays.
For example, here's some 2D array code--check the disassembly
int arr[4][3];
int foo(void) {
	arr[0][0]=0xA0B0;
	arr[0][1]=0xA0B1;
	arr[0][2]=0xA0B2;
	arr[1][0]=0xA1B0;
	arr[2][0]=0xA2B0;
	arr[3][0]=0xA3B0;
	return 0;
}
Here, arr[i][j] is at a 1D address like arr[i*3+j]. Here's a picture of the array, and the 1D addresses of each element:
 i== |  0  |  1  |  2  |  3
j==0 | [0] | [3] | [6] | [9]
j==1 | [1] | [4] | [7] | [10]
j==2 | [2] | [5] | [8] | [11]
Note that adjacent j indices are actually adjacent in memory; adjacent
i indices are separated by 3 ints. There are lots of ways to say
this same thing:
- "j is the fast index", since as you walk through the 1D array,
the 2D value of j changes fast (every step), but i changes more slowly.
- "the array is column-major" (if i is the column number, like I've
drawn the table above, a whole row is contiguous in memory), *or* "the
array is row-major" (if i is the row number, like the transpose of my
figure, a whole row is contiguous in memory). Note that whether
you call an array row- or column-major depends on what you *define* as
the rows or columns!
In general, when you write
int arr[sx][sy];
arr[x][y]++;
The compiler turns this into a 1D array like so:
int arr[sx*sy];
arr[x*sy+y]++;
Here y is the fast index. Note that the x coordinate is multiplied by
the y size: between x and x+1 lies one whole set of y's, while between
y and y+1 there's nothing--they're adjacent in memory. So you want
your *innermost* loop to be y, since then that loop will access
contiguous array elements.
We could also have written
int arr[sy][sx];
arr[y][x]++;
Which in 1D notation is:
int arr[sy*sx];
arr[x+y*sx]++;
Now x is the fast index, and the y coordinate is multiplied by the x
size. You now want your innermost loop to be x, since adjacent x
values are contiguous in memory.
2D Reconstructed from 1D
There's a cool trick that can be used to reconstruct the 2D array indices of a 1D array index. If
i = x*sy+y;
then
y = i % sy;
x = i / sy;
(Of course, sadly integer divide and modulo are among the slowest
instructions on your CPU! Hopefully sy is a power of two, so the
compiler can use a shift and a mask instead!)
And if
i = y*sx + x;
then
x = i % sx;
y = i / sx;
This can be handy in a bunch of ways--you can write a single loop to
initialize all the elements in a 2D array, or send a 2D array index in
a single "int".