Load Balancing vs False Sharing / Cache Thrashing
CS 441 Lecture, Dr. Lawlor
Here's an application that sums up some data repeatedly. With Nouter=Ninner=10,000, that's 10^8 inner-loop iterations, and it takes about 0.1s to run--roughly 1ns per iteration.
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
	int sum=0;
	for (int j=0;j<Ninner;j++)
		sum+=i^j;
	return sum;
}
int foo(void) {
	arr[0]=0;
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
(Try this in NetRun now!)
Where do we put the #pragma omp? Putting it inside runloop
actually slows the program down, to 0.2s! The problem is that a
parallel loop takes a certain amount of time to start its threads, and
an inner-loop pragma pays that startup cost on every one of the 10,000
outer iterations--so you want the pragma on the outermost loop you can.
Putting the #pragma omp on foo's loop speeds us up by a little over 3x on
four cores, which is pretty good.
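Concretely, the ~3x version keeps runloop serial and puts the pragma on foo's outer loop:
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for /* threads start once, right here */
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i); /* each thread runs many complete inner loops */
	}
	return arr[0];
}
The threads start only once, and each thread then runs many whole inner loops, so the startup cost is paid just one time.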
Here's a similar problem, but now the inner loop runs to i instead of
Ninner. In serial, this gives a 2x speedup, since on average the inner
loop now does half as much work. But OpenMP now only gives about a 2x
speedup on four cores--not very good!
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
	int sum=0;
	for (int j=0;j<i;j++) /* note: runs to i, not Ninner */
		sum+=i^j;
	return sum;
}
int foo(void) {
	arr[0]=0;
	//#pragma omp parallel for   (uncomment to time the OpenMP version)
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
(Try this in NetRun now!)
The problem is even worse with this program. Now OpenMP hardly helps at all!
#include <cmath> /* for pow */
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
	int sum=0;
	/* n grows like i^4: tiny for small i, huge near Nouter */
	int n=(int)(pow(i,4.0)/pow(Nouter,3.0));
	for (int j=0;j<n;j++)
		sum+=i^j;
	return sum;
}
int foo(void) {
	arr[0]=0;
	//#pragma omp parallel for   (uncomment to time the OpenMP version)
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
(Try this in NetRun now!)
The problem here is that the different iterations of the loop have
radically different costs: n grows like i^4, so when i is small, n is
tiny, and near the end of the loop n gets huge. With the default
schedule each thread gets one contiguous block of i values, so the
thread holding the last block does nearly all the work--there's basically
only one thread doing useful work at a time! One fix is to interleave the
work across threads, statically or dynamically, for example with a
"schedule(static,1)" clause (sketched after the timing rule below). In general,
time to run an OpenMP loop = time to start threads + maximum time for any one thread
The time to start threads depends on the platform, but it's typically a few dozen microseconds on modern implementations.
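Here's a sketch of that interleaving fix applied to foo above. A "schedule(dynamic)" clause would instead hand out iterations to threads on demand, which also balances the load:
int foo(void) {
	arr[0]=0;
	/* round-robin: thread t runs i = t, t+nthreads, t+2*nthreads, ... */
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<Nouter;i++) {
		arr[i]=runloop(i);
	}
	return arr[0];
}
With the iterations dealt out one at a time, every thread gets its share of the expensive high-i iterations, so the maximum time for any one thread drops.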
False Sharing / Cache Thrashing
OK, similar problem, but just 1D now. An integer ++ takes about 1ns/int, mostly for the load and store.
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	for (int i=0;i<n;i++) {
		arr[i]++;
	}
	return arr[0];
}
(Try this in NetRun now!)
Switch to OpenMP, and I get about 0.5ns/int, about a 2x speedup on a quad-core machine.
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for
	for (int i=0;i<n;i++) {
		arr[i]++;
	}
	return arr[0];
}
(Try this in NetRun now!)
Switch to incrementing only four values in the array. In serial,
things get a little faster due to cache effects (0.9ns/int). But
with multiple threads, suddenly I get a fairly massive slowdown, to
8ns/int!
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for
	for (unsigned int i=0;i<n;i++) {
		arr[i%4]++; /* all threads hammer the same four ints */
	}
	return arr[0];
}
(Try this in NetRun now!)
I'm also getting the wrong answer--each counter should reach a quarter
million (n/4), but it's way less than that, because the processors'
unsynchronized read-increment-write cycles keep overwriting each
other's work. Switching the thread scheduling so that each thread
increments a separate variable at least gives the right answer:
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<n;i++) {
		arr[i%4]++;
	}
	return arr[0];
}
(Try this in NetRun now!)
Rationale: on a quad-core machine, schedule(static,1) deals the iterations out round-robin, so with four threads, thread t always runs the iterations where i%4==t, and hence always increments its own unique array element.
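An aside, not timed in the original experiment: this fix quietly assumes exactly four threads. A thread-count-independent sketch for getting correct counts is OpenMP's atomic update, though it does nothing for the cache-line contention described next:
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for
	for (unsigned int i=0;i<n;i++) {
		#pragma omp atomic /* make the read-increment-write indivisible */
		arr[i%4]++;
	}
	return arr[0];
}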
But my performance is still quite bad--2ns/int, still worse than the serial code! Why? Cache coherence.
Instead of doing useful work, the CPUs are busy fighting over the one
cache line holding all four counters. You can fix this with a very strange
technique: move the CPUs' data further apart in memory than one cache
line. For example, here we move the integers accessed by each
thread 20 ints apart--that's 80 bytes, more than a typical 64-byte cache line.
enum {n=1000*1000};
int arr[n];
int foo(void) {
	arr[0]=0;
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<n;i++) {
		arr[(i%4)*20]++; /* counters 20 ints = 80 bytes apart */
	}
	return arr[0];
}
(Try this in NetRun now!)
Just moving the data apart in RAM improves our performance substantially, to 0.7ns/int.
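A less magic-number way to get the same padding (a sketch, assuming a C++11 compiler, 64-byte cache lines, and four threads as above; not part of the original timing runs) is to align each counter to its own cache line:
enum {n=1000*1000};
struct padded_counter {
	alignas(64) int value; /* sizeof(padded_counter) rounds up to 64 bytes */
};
padded_counter counts[4];
int foo(void) {
	#pragma omp parallel for schedule(static,1)
	for (unsigned int i=0;i<n;i++) {
		counts[i%4].value++;
	}
	return counts[0].value;
}
Since each counter owns a whole cache line, no two threads ever invalidate each other's cached copy, and the padding survives if someone later changes the spacing constant or the array layout.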
This cache thrashing business is annoying, because:
- For good performance, processors need to trade work from heavily loaded to more lightly loaded processors.
- But moving data across processors itself has a cost, especially
when cache lines are shared between processors, so they have to keep
exchanging them.
It's doubly annoying because it happens secretly, behind the
scenes--the programmer doesn't know anything is wrong unless they get
the wrong answer or bad performance, and then it's a mystery why. One solution, which we'll explore in the
coming weeks, is explicit data movement, such as sending data across a network.