Programming Multicore with OpenMP

CS 301 Lecture, Dr. Lawlor

Basically, a thread is just a fancy way to run some of your code simultaneously with your other code.  Each thread gets a separate set of registers and its own stack, so it can safely run its own functions.

Yet from a thread you can still call any function, access any global variable, or use data that belongs to other threads.  As with multi-person projects, having multiple threads working on the same data is annoyingly error-prone: it's common for threads to overwrite each other's data, causing weird timing-dependent crashes.  This sort of "data race condition" is by far the hardest part of multithreaded programming.

Generally speaking, parallel race conditions can only happen when:
More than one thread writes to the same data at the same time.

This statement basically tells you how to eliminate data race conditions:
More than one thread...
Only one thread?  No problem.  This is the solution taken by most modern software, and the one you've been using so far.
... writes to ...
No writes?  No problem.  So read-only data structures are fine.  As soon as anybody does any writes, you're in trouble.
... the same data ...
Separate data?  No problem.  One common fix is to "privatize" all your data--make a private copy for each thread to use.  This is probably the highest performance fix too.
... at the same time.
There are several cool ways to separate different threads' accesses in time.  Many architectures support "atomic" instructions that the hardware guarantees will get the right answer, typically by excluding other cores' accesses.   All thread libraries support a mutual exclusion primitive or "lock", typically built in software from atomic instructions.
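For example, here's a minimal sketch of the atomic approach, written in modern C++ (C++11's std::atomic; the names "hits" and "count_hit" are just made up for illustration).  Incrementing an ordinary shared int from two threads can lose updates, but an atomic increment is performed by the hardware as one indivisible read-modify-write:
#include <atomic>

std::atomic<int> hits(0);   // shared counter, safe to update from any thread

void count_hit(void) {
    hits++;   // one indivisible increment (a lock-prefixed instruction on x86)
}

A lock does the same kind of thing at a coarser grain: a thread grabs the lock, does its writes, and releases the lock, so only one thread touches the shared data at a time.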

Unfortunately, there's a false syllogism in people's minds, where:
   multicore => threads => locks
That is, people assume the only way to write multicore code is using threads, and the only way to avoid problems with threads is to add locks.  But there are lots of other options for both!

Basic OpenMP

There's a newish (mainstream in 2007) language extension out there called OpenMP, designed to make it very easy to write multithreaded code.

The basic idea is you take what looks like an ordinary sequential loop, like:
    for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel for loop, so if you've got six CPUs, the iterations should be spread across the CPUs.  The particular syntax they chose is a "#pragma" directive like so:
#pragma omp parallel for
    for (int i=0;i<n;i++) do_fn(i);
This is a deceptively powerful little directive.  Each iteration of the loop can now potentially run on a separate core, in its own thread.  After the for loop completes, all the threads go back to waiting, except the master thread, which continues with the (otherwise serial) program.  But there are downsides.
Unlike bare threads, OpenMP leans heavily on the compiler--notice how much the compiler is involved?  This means you need compiler support, or that pragma won't do anything.  Here's how you enable OpenMP in various compilers.  For market segmentation reasons, Microsoft doesn't make it easy to use OpenMP in the Express versions of Visual Studio, but here's how to hack in support for OpenMP in Visual C++ 2008 Express.  gcc versions >=4.2 all support OpenMP, so most Linux or Mac desktops have it.
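For example, with gcc you turn OpenMP on with a single flag at compile and link time, and the Visual C++ command-line compiler has an equivalent switch ("main.cpp" below is just a placeholder for your own source file):

    g++ -fopenmp main.cpp -o main
    cl /openmp main.cpp

Leave the flag off, and the compiler just ignores the pragma, so you quietly get an ordinary sequential program.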

OpenMP Examples

Here's a non-OpenMP loop.  Notice that &i, the address of local variable i, is constant (of course!).
#include <iostream>

int foo(void) {
    for (int i=0;i<10;i++) {
        std::cout<<"i="<<i<<" and &i="<<&i<<"\n";
    }
    return 0;
}

(Try this in NetRun now!)

Here's the same loop running in parallel with OpenMP.  Notice that (1) there are 4 separate threads on this quad-core machine, (2) each thread gets its *own* copy of i (thanks to the compiler), and (3) every time you run the program, the printouts arrive in a slightly different order.
#include <iostream>

int foo(void) {
#pragma omp parallel for
    for (int i=0;i<10;i++) {
        std::cout<<"i="<<i<<" and &i="<<&i<<"\n";
    }
    return 0;
}

(Try this in NetRun now!)

Here's an example where OpenMP really shines.  Each thread handles its own part of the array, working alone and getting things done.  The net result is about 3x faster than a single core.
#include <math.h>

enum {n=100};
float array[n];

int foo(void) {
#pragma omp parallel for
    for (int i=0;i<n;i++) {
        array[i]=exp(-array[i]);
    }
    return array[0];
}

(Try this in NetRun now!)

OpenMP Gotchas

Here's a trivial little program to sum a million integers:
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}

(Try this in NetRun now!)

This takes 2.5 ns/iteration on my quad-core machine.

Here's the obvious OpenMP version, which is about 13x *slower* (32.3 ns/iteration), and gets the wrong answer besides!
volatile int sum=0;
int foo(void) {
    int i,j;   /* j will be shared by all the threads--trouble! */
    sum=0;
#pragma omp parallel for
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;   /* sum is shared too: a data race */
        }
    return sum;
}

(Try this in NetRun now!)

There are two problems here:
The inner loop counter "j" is shared by all the threads, so the threads scramble each other's loop counts.
The counter "sum" is shared too, so the threads' increments collide and updates get lost.
In multithreaded programming, sharing is bad, both for performance (the cores fight over the shared variables) and for correctness.  We can fix "j" by asking the compiler to make a separate private copy for each thread.  This speeds us up to under 1ns/iteration, better than a 2x speedup over the sequential code.  We're still getting the wrong answer, but now we're getting it *fast*!
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
#pragma omp parallel for private(j)   /* each thread now gets its own j */
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;   /* still a race on the shared sum */
        }
    return sum;
}

(Try this in NetRun now!)
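By the way, an equivalent fix for "j" (just a sketch of the same idea): declare the loop counters inside the loops, like the earlier examples do.  Anything declared inside the parallel loop is automatically private to each thread, so you don't need the private(j) clause at all:
volatile int sum=0;
int foo(void) {
    sum=0;
#pragma omp parallel for
    for (int i=0;i<1000;i++)
        for (int j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}

Either way, "sum" is still shared, so the answer is still wrong.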

To get the right answer, we just ask OpenMP to give each thread its own private copy of "sum", and automatically total up the copies at the end, using a reduction clause:
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
#pragma omp parallel for private(j) reduction(+:sum)   /* private partial sums, totaled at the end */
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}

(Try this in NetRun now!)

This is now 0.44ns/iteration, better than a 5x speedup over the original sequential code, and it gets the right answer!
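If you're curious what that reduction clause is doing for you, it behaves roughly like the hand-written version below (a sketch of the idea, not OpenMP's actual implementation; "mysum" is just an illustrative name).  Each thread accumulates into its own private partial sum, then adds that partial sum into the shared total exactly once at the end:
volatile int sum=0;
int foo(void) {
    sum=0;
#pragma omp parallel
    {
        int mysum=0;   /* declared inside the parallel region, so each thread gets its own */
#pragma omp for
        for (int i=0;i<1000;i++)
            for (int j=0;j<1000;j++)
                mysum++;
#pragma omp atomic
        sum+=mysum;   /* one atomic add per thread, not one per iteration */
    }
    return sum;
}

The reduction clause does that bookkeeping for you automatically.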