Load Balancing vs False Sharing / Cache Thrashing
CS 441 Lecture, Dr. Lawlor
Here's an application that sums up some data repeatedly. With Nouter=Ninner=10,000, that's 10^8 inner-loop iterations, and it takes about 0.1s to run--roughly 1ns per iteration.
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
	int sum=0;
	for (int j=0;j<Ninner;j++)
		sum+=i^j;
	return sum;
}
int foo(void) {
	arr[0]=0;
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
(Try this in NetRun now!)
Where do we put the #pragma omp? Putting it inside runloop
actually slows the program down, to 0.2s! The problem is that a
parallel loop takes a certain amount of time to start its threads, and
an inner-loop pragma pays that startup cost on every one of the 10,000
outer iterations--so you want the pragma on the outermost loop you can.
Putting the #pragma omp on foo's loop speeds us up by a little over 3x on
four cores, which is pretty good.
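Concretely, the ~3x version keeps runloop serial and puts the pragma on foo's outer loop:
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for /* threads start once, right here */
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i); /* each thread runs many complete inner loops */
	}
	return arr[0];
}
The threads start only once, and each thread then runs many whole inner loops, so the startup cost is paid just one time.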
Here's a similar problem, but now the inner loop runs to i instead of
Ninner. In serial, this gives a 2x speedup, since on average the inner
loop now does half as much work. But OpenMP now only gives about a 2x
speedup on four cores--not very good!
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
	int sum=0;
	for (int j=0;j<i;j++) /* note: runs to i, not Ninner */
		sum+=i^j;
	return sum;
}
int foo(void) {
	arr[0]=0;
	//#pragma omp parallel for   (uncomment to time the OpenMP version)
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
(Try this in NetRun now!)
The problem is even worse with this program. Now OpenMP hardly helps at all!
#include <cmath> /* for pow */
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
	int sum=0;
	/* n grows like i^4: tiny for small i, huge near Nouter */
	int n=(int)(pow(i,4.0)/pow(Nouter,3.0));
	for (int j=0;j<n;j++)
		sum+=i^j;
	return sum;
}
int foo(void) {
	arr[0]=0;
	//#pragma omp parallel for   (uncomment to time the OpenMP version)
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
(Try this in NetRun now!)
The problem here is that the different iterations of the loop have
radically different costs: n grows like i^4, so when i is small, n is
tiny, and near the end of the loop n gets huge. With the default
schedule each thread gets one contiguous block of i values, so the
thread holding the last block does nearly all the work--there's basically
only one thread doing useful work at a time! One fix is to interleave the
work across threads, statically or dynamically, for example with a
"schedule(static,1)" clause (sketched after the timing rule below). In general,
time to run an OpenMP loop = time to start threads + maximum time for any one thread
The time to start threads depends on the platform, but it's typically a few dozen microseconds on modern implementations.
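Here's a sketch of that interleaving fix applied to foo above. A "schedule(dynamic)" clause would instead hand out iterations to threads on demand, which also balances the load:
int foo(void) {
	arr[0]=0;
	/* round-robin: thread t runs i = t, t+nthreads, t+2*nthreads, ... */
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
With the iterations dealt out one at a time, every thread gets its share of the expensive high-i iterations, so the maximum time for any one thread drops.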
False Sharing / Cache Thrashing
OK, similar problem, but just 1D now. An integer ++ takes about 1ns/int, mostly for the load and store.
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	for (int i=0;i<n;i++) {
		arr[i]++;
	}
	return arr[0];
}
(Try this in NetRun now!)
Switch to OpenMP, and I get about 0.5ns/int, about a 2x speedup on a quad-core machine.
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for
	for (int i=0;i<n;i++) {
		arr[i]++;
	}
	return arr[0];
}
(Try this in NetRun now!)
Switch to incrementing only four values in the array. In serial,
things get a little faster due to cache effects (0.9ns/int). But
with multiple threads, suddenly I get a fairly massive slowdown, to
8ns/int!
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for
	for (unsigned int i=0;i<n;i++) {
		arr[i%4]++; /* all threads hammer the same four ints */
	}
	return arr[0];
}
(Try this in NetRun now!)
I'm also getting the wrong answer--each counter should reach a quarter
million (n/4), but it's way less than that, because the processors'
unsynchronized read-increment-write cycles keep overwriting each
other's work. Switching the thread scheduling so that each thread
increments a separate variable at least gives the right answer:
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<n;i++) {
		arr[i%4]++;
	}
	return arr[0];
}
(Try this in NetRun now!)
Rationale: on a quad-core machine, schedule(static,1) deals the iterations out round-robin, so with four threads, thread t always runs the iterations where i%4==t, and hence always increments its own unique array element.
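An aside, not timed in the original experiment: this fix quietly assumes exactly four threads. A thread-count-independent sketch for getting correct counts is OpenMP's atomic update, though it does nothing for the cache-line contention described next:
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for
	for (unsigned int i=0;i<n;i++) {
		#pragma omp atomic /* make the read-increment-write indivisible */
		arr[i%4]++;
	}
	return arr[0];
}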
But my performance is still quite bad--2ns/int, still worse than the serial code! Why? Cache coherence.
Instead of doing useful work, the CPUs are busy fighting over the one
cache line holding all four counters. You can fix this with a very strange
technique: move the CPUs' data further apart in memory than one cache
line. For example, here we move the integers accessed by each
thread 20 ints apart--that's 80 bytes, more than a typical 64-byte cache line.
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<n;i++) {
		arr[(i%4)*20]++; /* counters 20 ints = 80 bytes apart */
	}
	return arr[0];
}
(Try this in NetRun now!)
Just moving the data apart in RAM improves our performance substantially, to 0.7ns/int.
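A less magic-number way to get the same padding (a sketch, assuming a C++11 compiler, 64-byte cache lines, and four threads as above; not part of the original timing runs) is to align each counter to its own cache line:
enum {n=1000*1000};
struct padded_counter {
	alignas(64) int value; /* sizeof(padded_counter) rounds up to 64 bytes */
};
padded_counter counts[4];
int foo(void) {
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<n;i++) {
		counts[i%4].value++;
	}
	return counts[0].value;
}
Since each counter owns a whole cache line, no two threads ever invalidate each other's cached copy, and the padding survives if someone later changes the spacing constant or the array layout.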
This cache thrashing business is annoying, because:
- For good performance, processors need to trade work from heavily loaded to more lightly loaded processors.
- But moving data across processors itself has a cost, especially
when cache lines are shared between processors, so they have to keep
exchanging them.
It's doubly annoying because it happens secretly, behind the
scenes--the programmer doesn't know anything is wrong unless they get
the wrong answer or bad performance, and then it's a mystery why. One solution, which we'll explore in the
coming weeks, is explicit data movement, such as sending data across a network.