Load Balancing in OpenMP
CS 641 Lecture, Dr. Lawlor
Here's an application that sums up some data repeatedly. It takes about 0.1s to run.
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) { /* do Ninner iterations of work for outer index i */
    int sum=0;
    for (int j=0;j<Ninner;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    for (unsigned int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}
(Try this in NetRun now!)
Where do we put the #pragma omp? Well, putting it inside runloop
actually slows the program down, to 0.2s! The problem is that a
parallel loop takes a certain amount of time just to start its threads,
so you want the parallelism in the outermost loop if you can. Putting
the #pragma omp on foo's loop speeds us up by a little over 3x on
four cores, which is pretty good.
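Concretely, that placement looks like this (a sketch; the only change from the code above is the pragma on foo's loop, plus a signed loop index, since old OpenMP versions only accept signed loop variables):
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    for (int j=0;j<Ninner;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for /* start the threads once, at the outermost loop */
    for (int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}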
Here's a similar problem, but now the inner loop runs up to i instead
of Ninner. In serial, that's roughly a 2x speedup, since the inner loop
now does half as many iterations on average. But OpenMP only gives
about a 2x speedup on four cores--not very good!
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    for (int j=0;j<i;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    //#pragma omp parallel for
    for (unsigned int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}
(Try this in NetRun now!)
The problem is even worse with this program. Now OpenMP hardly helps at all!
#include <math.h> /* for pow() */
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    int n=(int)(pow(i,4.0)/pow(Nouter,3.0)); /* iteration count grows steeply with i */
    for (int j=0;j<n;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    //#pragma omp parallel for
    for (unsigned int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}
(Try this in NetRun now!)
The problem here is that the different iterations of the loop have
radically different costs: when i is small, n is tiny; when i is
almost at the end, n gets huge. With the default schedule, the thread
that gets the last chunk of i values does nearly all the work, so
there's basically only one thread doing useful work at a time! One fix
is to interleave the work (statically or dynamically), for example with a
"schedule(static,1)". In general,
time to run an OpenMP loop = time to start threads + maximum time for any one thread
The time to start the threads depends on the platform, but it's typically a few dozen microseconds on modern implementations.
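As a concrete sketch, here's the imbalanced loop with the interleaved schedule mentioned above (schedule(dynamic) would be another reasonable choice):
#include <math.h>
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    int n=(int)(pow(i,4.0)/pow(Nouter,3.0)); /* work grows steeply with i */
    for (int j=0;j<n;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    /* Deal out iterations round-robin, one at a time, so every thread
       gets a mix of cheap (small i) and expensive (large i) iterations. */
    #pragma omp parallel for schedule(static,1)
    for (int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}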
False Sharing / Cache Thrashing
OK, similar problem, but now just a single flat loop. An integer ++ takes about 1ns/int, mostly for the load and store.
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return arr[0];
}
(Try this in NetRun now!)
Switch to OpenMP, and I get about 0.5ns/int, about a 2x speedup on a quad-core machine.
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return arr[0];
}
(Try this in NetRun now!)
Switch to incrementing only four values in the array. In serial,
things get a little faster due to cache effects (0.9ns/int). But
with multiple threads, I suddenly get a fairly massive slowdown, to
8ns/int!
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for
    for (unsigned int i=0;i<n;i++) {
        arr[i%4]++; /* every thread hammers the same four adjacent ints */
    }
    return arr[0];
}
(Try this in NetRun now!)
I'm also getting the wrong answer--each of the four counters should
reach a quarter million, but the result is way less than that, because
the processors are overwriting each other's work (the unprotected ++
operations race with one another). Switching the thread scheduling so
that each thread increments a separate element at least gives the right answer:
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for schedule(static,1)
    for (unsigned int i=0;i<n;i++) {
        arr[i%4]++;
    }
    return arr[0];
}
(Try this in NetRun now!)
Rationale: on a quad-core machine with four threads, schedule(static,1) deals out iterations round-robin, so thread t gets exactly the iterations with i%4==t and hence always increments its own element arr[t].
But my performance is still quite bad--2ns/int, still worse than serial code! Why? Cache coherence.
Instead of doing useful work, the CPUs are busy fighting over the one
cache line everybody is using. You can fix this with a rather strange
technique: move each CPU's data farther apart in memory than one cache
line. For example, here we move the integers accessed by each
thread 20 ints apart, which is 80 bytes--bigger than a typical 64-byte cache line.
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for schedule(static,1)
    for (unsigned int i=0;i<n;i++) {
        arr[(i%4)*20]++; /* each thread's counter is 20 ints (80 bytes) from the next */
    }
    return arr[0];
}
(Try this in NetRun now!)
Just moving the data apart in RAM improves our performance substantially, to 0.7ns/int.
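A less magic-number way to express the same idea is to pad each thread's counter out to its own cache line. Here's a sketch, assuming 64-byte cache lines, the same four threads as above, and C++11 alignas; the struct and array names are just for illustration:
enum {n=1000*1000, nthreads=4};
struct padded_counter {
    alignas(64) int value; /* each counter sits in its own 64-byte cache line */
};
padded_counter counters[nthreads];
int foo(void) {
    #pragma omp parallel for schedule(static,1)
    for (int i=0;i<n;i++) {
        counters[i%nthreads].value++; /* with 4 threads, thread t only touches counters[t] */
    }
    return counters[0].value;
}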
This cache thrashing business is annoying, because:
- For good load balance, work needs to move from heavily loaded processors to more lightly loaded ones.
- But moving data between processors has a cost of its own, especially
when cache lines are shared between processors, so they have to keep
exchanging them.
It's doubly annoying because it happens secretly, behind the
scenes--the programmer doesn't know anything is wrong unless they get
the wrong answer or bad performance, and then it's a mystery why it's
happening. One solution to this, which we'll explore in the next
few weeks: explicit data movement, such as across a network.
Parallelism Generally
Here's a survey of data structures used by scientific simulation
software, which I prepared while in Illinois working on a variety of
simulation codes.
Some applications are quite easy to parallelize. Others... aren't.
- "Naturally Parallel" applications don't need any
communication. For example, Mandelbrot set rendering is naturally
parallel, because each Mandelbrot pixel is a stand-alone problem you
can solve independently of all the other Mandelbrot pixels. GPUs
can handle naturally parallel applications in a single pass. Other
common terms for this are "Trivially Parallel" or "Embarrassingly
Parallel", which make it sound like a bad thing--but parallelism is both natural and good!
- "Neighbor" applications communicate with their immediate
neighbors, and that's it. For example, heat flow in a plate can
be computed based solely on one's immediate neighbors (new value at
arr[i] is a function of the old value of arr[i-1], arr[i], and
arr[i+1]). Neighbor applications naturally fit into the "ghost
exchange" communication pattern, and for large problem sizes usually
can be tweaked to get good performance (see the sketch after this list).
- "Other" applications have a weirder, often collective
communication pattern. Depending on the structure of the problem,
such applications can sometimes have relatively good performance, but
are often network bound.
- "Sequential" applications might have multiple threads or
processes, but they don't have any parallelism--only one thread/process
can do useful work at a time. Many I/O limited applications are
basically sequential, since ten CPUs can wait on one disk no faster
than one CPU! (Solution: add more disks!)
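To illustrate the "neighbor" pattern mentioned above, here's a minimal 1D heat-flow sketch of my own (the names and coefficients are just illustrative): each new value depends only on the old values at i-1, i, and i+1, so one timestep parallelizes cleanly, and in a distributed-memory version the elements at the ends of each processor's piece are the "ghosts" that would have to be exchanged.
enum {N=1000000, steps=100};
float cur[N], nxt[N];
void timestep(void) {
    /* New value at i depends only on old values at i-1, i, i+1,
       so all the interior points can be updated in parallel. */
    #pragma omp parallel for
    for (int i=1;i<N-1;i++) {
        nxt[i]=0.25f*cur[i-1]+0.5f*cur[i]+0.25f*cur[i+1];
    }
    nxt[0]=cur[0]; nxt[N-1]=cur[N-1]; /* fixed boundary values */
}
int foo(void) {
    for (int i=0;i<N;i++) cur[i]=(i==N/2)?1000.0f:0.0f; /* a hot spot in the middle */
    for (int s=0;s<steps;s++) {
        timestep();
        #pragma omp parallel for
        for (int i=0;i<N;i++) cur[i]=nxt[i]; /* copy new values back for the next step */
    }
    return (int)cur[N/2];
}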