Programming Multicore with OpenMP
CS 301 Lecture, Dr. Lawlor
Basically, a thread is just a fancy way to run some of your code
simultaneously with your other code. Each thread is given its own
set of registers and its own stack, so it can safely run its own
functions.
Yet from a thread you can
still call any function, access any global variable, or use data that
belongs
to other threads. As with multi-person projects, having multiple
threads working on the same data is annoyingly error-prone: it's common
for threads to overwrite each other's data, causing weird
timing-dependent crashes. This sort of "data race condition" is
by far the hardest part of multithreaded programming.
Generally speaking, parallel race conditions can only happen when:
More than one thread writes to the same data at the same time.
This statement basically tells you how to eliminate data race conditions:
- "More than one thread ...": Only one thread? No problem. This is the solution taken by most modern software, and the one you've been using so far.
- "... writes to ...": No writes? No problem. So read-only data structures are fine; as soon as anybody does any writes, you're in trouble.
- "... the same data ...": Separate data? No problem. One common fix is to "privatize" all your data: make a private copy for each thread to use. This is probably the highest-performance fix, too.
- "... at the same time.": There are several cool ways to separate different threads' accesses in time. Many architectures support "atomic" instructions that the hardware guarantees will get the right answer, typically by excluding other cores' accesses. All thread libraries support a mutual exclusion primitive or "lock", typically built in software from atomic instructions; see the sketch after this list.
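For example, here's a minimal sketch (mine, not from the lecture) of the "lock" fix using bare C++11 threads: four threads increment one shared counter, and a std::mutex keeps any two of them from writing it at the same time.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int counter=0;           // shared data: every thread writes it
std::mutex counter_lock; // the mutual exclusion primitive ("lock")

void work(int n) {
  for (int i=0;i<n;i++) {
    std::lock_guard<std::mutex> guard(counter_lock); // acquire the lock
    counter++; // only one thread at a time can be on this line
  } // lock released at the end of guard's scope
}

int main(void) {
  std::vector<std::thread> threads;
  for (int t=0;t<4;t++) threads.push_back(std::thread(work,1000));
  for (std::thread &th : threads) th.join();
  std::cout<<"counter="<<counter<<"\n"; // always 4000 with the lock
  return 0;
}
Delete the lock_guard line and the count comes out low and different on every run--exactly the weird timing-dependent behavior described above.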
Unfortunately, there's a false syllogism in people's minds, where:
multicore => threads => locks
That is, people assume the only way to write multicore code is using
threads, and the only way to avoid problems with threads is to add
locks. But there are lots of other options for both!
Basic OpenMP
There's a newish (mainstream in 2007) language extension out there called OpenMP, designed to make it very easy to write multithreaded code.
The basic idea is you take what looks like an ordinary sequential loop, like:
for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel
for loop, so if you've got six CPUs, the iterations should be spread
across the CPUs. The particular syntax they chose is a "#pragma"
statement like so:
#pragma omp parallel for
for (int i=0;i<n;i++) do_fn(i);
This is a deceptively powerful statement. Each iteration of the
loop can now potentially run on a separate core, in its own
thread. After the for loop
completes, all the threads go back to waiting, except the master
thread, which continues with the (otherwise serial) program. But
there are downsides:
- This statement creates threads, which takes 5-10us per thread, so it won't help short loops that take less than a few microseconds to run.
- If any code called in the loop writes to shared data, you've just introduced a subtle timing-dependent bug into the program.
Unlike bare threads, with OpenMP:
- The compiler decides how many threads to create (typically the number of CPUs)
- The compiler moves the guts of your loop off into a separate function.
- The compiler creates and deallocates the threads.
- The compiler can actually help you find and fix memory race conditions.
Notice how much the compiler is involved? This means you need
compiler support, or that pragma won't do anything. Here's how you enable OpenMP
in various compilers. For market-segmentation reasons, Microsoft
doesn't make it easy to use OpenMP in the Express versions of Visual
Studio, but here's how to hack in support for OpenMP in Visual C++ 2008 Express. gcc versions >=4.2 all support OpenMP (just compile with -fopenmp), so most Linux or Mac desktops have it.
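As a quick sanity check (my own standalone snippet, not a NetRun example), you can verify that the compiler really turned OpenMP on: the standard _OPENMP macro is defined only when the OpenMP flag is enabled, and omp_get_max_threads() from <omp.h> reports how many threads a parallel region will get.
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
  // _OPENMP is defined only when the compiler's OpenMP support is switched on
  std::cout<<"OpenMP is on, max threads = "<<omp_get_max_threads()<<"\n";
#else
  std::cout<<"OpenMP is OFF -- that #pragma will be silently ignored!\n";
#endif
  return 0;
}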
OpenMP Examples
Here's a non-OpenMP loop. Notice that &i, the address of local variable i, is constant (of course!).
int foo(void) {
  for (int i=0;i<10;i++) {
    std::cout<<"i="<<i<<" and &i="<<&i<<"\n";
  }
  return 0;
}
(Try this in NetRun now!)
Here's the same loop running in parallel with OpenMP. Notice that
(1) there are 4 separate threads on this quad-core machine, (2) each
thread gets its *own* copy of i (thanks to the compiler), and (3) every
time you run the program, the printouts arrive in a slightly different
order.
int foo(void) {
  #pragma omp parallel for
  for (int i=0;i<10;i++) {
    std::cout<<"i="<<i<<" and &i="<<&i<<"\n";
  }
  return 0;
}
(Try this in NetRun now!)
Here's an example where OpenMP really shines. Each thread handles
its own part of the array, working alone and getting things done.
The net result is about 3x faster than a single core.
enum {n=100};
float array[n];
int foo(void) {
  #pragma omp parallel for
  for (int i=0;i<n;i++) {
    array[i]=exp(-array[i]);
  }
  return array[0];
}
(Try this in NetRun now!)
OpenMP Gotchas
Here's a trivial little program to sum a million integers:
volatile int sum=0;
int foo(void) {
  int i,j;
  sum=0;
  for (i=0;i<1000;i++)
    for (j=0;j<1000;j++) {
      sum++;
    }
  return sum;
}
(Try this in NetRun now!)
This takes 2.5 ns/iteration on my quad-core machine.
Here's the obvious OpenMP version, which is 15x *slower* (32.3 ns/iteration), and gets the wrong answer besides!
volatile int sum=0;
int foo(void) {
  int i,j;
  sum=0;
  #pragma omp parallel for
  for (i=0;i<1000;i++)
    for (j=0;j<1000;j++) {
      sum++;
    }
  return sum;
}
(Try this in NetRun now!)
There are two problems here:
- "j" is shared between the threads.
- "sum" is shared between the threads.
In multithreaded programming, sharing is bad, both for performance and for correctness. We
can fix "j" by asking the compiler to make a separate private copy for
each thread. This speeds us up to under 1ns/iteration, about a 2x
speedup over the sequential code. We're still getting the wrong
answer, but now we're getting it *fast*!
volatile int sum=0;
int foo(void) {
  int i,j;
  sum=0;
  #pragma omp parallel for private(j)
  for (i=0;i<1000;i++)
    for (j=0;j<1000;j++) {
      sum++;
    }
  return sum;
}
(Try this in NetRun now!)
To get the right answer, we just ask OpenMP to automatically total up a private copy of "sum":
volatile int sum=0;
int foo(void) {
  int i,j;
  sum=0;
  #pragma omp parallel for private(j) reduction(+:sum)
  for (i=0;i<1000;i++)
    for (j=0;j<1000;j++) {
      sum++;
    }
  return sum;
}
(Try this in NetRun now!)
This is now 0.44ns/iteration, a solid 4x speedup over the original code, and it gets the right answer!
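Under the hood, reduction(+:sum) is just the "privatize" fix applied automatically: each thread adds into its own private total, and the per-thread totals get combined once at the end. Here's a hand-rolled sketch (mine, not from the lecture) of roughly what the compiler does for you:
#include <iostream>

volatile int sum=0;
int main(void) {
  sum=0;
  #pragma omp parallel
  {
    int my_sum=0; // private per-thread total: no race on the hot path
    #pragma omp for
    for (int i=0;i<1000*1000;i++)
      my_sum++;
    #pragma omp critical // one thread at a time folds its total into sum
    sum+=my_sum;
  }
  std::cout<<"sum="<<sum<<"\n"; // prints 1000000
  return 0;
}
The only serialized work is one add per thread inside the critical section, so the shared variable stops being a bottleneck.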