Here's an inner loop that does something funny: it jumps around (using "loc") inside an array called "buf", but only within the bounds established by "mask". Like any loop, each iteration takes some amount of time; what's surprising is that the speed depends very strongly on the value of "mask", which sets the size of the array we're jumping around in.
```c
for (i=0;i<max;i++) { /* jump around in buffer, incrementing as we go */
	sum+=buf[loc&mask]++;
	loc+=del;
	del=del+sum;
}
```
Here's the performance of this loop, in nanoseconds per iteration, as a function of the array size (as determined by "mask").
Size (KB) | 4GHz Skylake | 2.4GHz Q6600 | 2.8GHz P4 | 2.2GHz Athlon64 | 2.0GHz PPC G5 | 900MHz P3 | 900MHz ARM | 300MHz PPC |
---|---|---|---|---|---|---|---|---|
1 | 1.29 | 2.31 | 4.05 | 2.27 | 5.5 | 17.5 | 15.5 | 16.0 |
2 | 1.29 | 2.29 | 4.39 | 2.28 | 5.5 | 17.5 | 13.4 | 16.0 |
4 | 1.29 | 2.29 | 4.63 | 2.28 | 5.2 | 17.5 | 13.4 | 16.0 |
8 | 1.29 | 2.29 | 4.71 | 2.28 | 3.6 | 17.5 | 13.4 | 16.0 |
16 | 1.29 | 2.29 | 4.76 | 2.28 | 3.6 | 17.5 | 13.4 | 16.0 |
33 | 1.29 | 2.29 | 7.74 | 2.28 | 3.6 | 21.6 | 14.7 | 16.0 |
66 | 1.8 | 3.91 | 8.67 | 2.29 | 5.3 | 21.6 | 20.4 | 16.6 |
131 | 2.34 | 4.66 | 9.07 | 5.26 | 5.3 | 22.0 | 24.5 | 40.3 |
262 | 3.05 | 4.91 | 9.54 | 6.92 | 5.3 | 98.3 | 26.8 | 40.3 |
524 | 4.74 | 4.98 | 12.57 | 10.13 | 24.0 | 144.0 | 43.5 | 52.3 |
1049 | 5.7 | 5.02 | 33.5 | 38.95 | 44.6 | 153.2 | 117.0 | 49.9 |
2097 | 6.12 | 6.52 | 61.49 | 76.15 | 99.1 | 156.9 | 160.9 | 144.8 |
4194 | 5.91 | 15.72 | 76.95 | 78.05 | 112.6 | 157.3 | 186.3 | 256.1 |
8389 | 8.82 | 40.3 | 85.36 | 78.81 | 210.0 | 159.4 | 201.7 | 342.7 |
16777 | 25.22 | 50.19 | 88.55 | 81.77 | 214.2 | 166.5 | 215.2 | 166.5 |
33554 | 31.04 | 53.35 | 90.81 | 81.56 | 208.2 | 168.6 | 226.9 | 168.6 |
I claim each performance plateau corresponds to a chunk of hardware. Note that there are three jumps in the timings:
- The first jump, at a few tens of KB, is the L1 data cache overflowing: the Skylake holds a flat 1.29 ns/iteration up through 33 KB, matching its 32 KB L1.
- The next jump, from a few hundred KB to a few MB, is the L2 (and, on chips that have one, L3) cache overflowing.
- The last jump is the largest cache overflowing, after which most accesses go all the way out to main RAM: the Skylake goes from 8.82 ns at 8 MB (its L3 size) to 25.22 ns at 16 MB.
In general, memory accesses have performance that's:
- fast, when you keep re-accessing a small amount of memory (everything stays in cache), or when you step through memory sequentially (whole cache lines get used, and the hardware can prefetch ahead); and
- slow, when you jump around a large area of memory (most accesses miss the cache and have to wait on RAM).
So if you're getting bad performance, you can either:
- shrink your working set until it fits in cache, reusing data while it's still there; or
- rearrange your accesses into a streaming, sequential pattern.
These are actually the two aspects of "locality", a measure of how close together your memory accesses are. The cache lets you cheaply reuse data you've touched recently in time (temporal locality); streaming access is about touching data nearby in space (spatial locality).
There are lots of different ways to improve locality: