OpenMP: Simple Multicore Programming
CS 301 Lecture, Dr. Lawlor
Basic OpenMP
Because threaded programming is so ugly and tricky, there's a newish (mainstream in 2007) language extension out there called OpenMP, designed to make it substantially easier to write multithreaded code.
The basic idea is you start with what looks like an ordinary sequential loop, like:
for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel
forloop, so if you've got six CPUs, the iterations should be spread
across the CPUs. The particular syntax they chose is a "#pragma"
statement, with the "omp" prefix:
#pragma omp parallel for
for (int i=0;i<n;i++) do_fn(i);
Granted, this line has like a 10us/thread overhead, so it won't help
tiny loops, but it can really help long loops. After the for loop
completes, all the threads go back to waiting, except the master
thread, which continues with the (otherwise serial) program.
To enable OpenMP, you pass a flag to your compiler: /openmp for Visual C++ 2005 & later (but NOT Express!), -openmp for Intel C++ version 9.0, and -fopenmp for gcc version 4.2. All three support OpenMP, although earlier versions do not!
Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop. You're done!
Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs. Note the compiler "#pragma" statements!
Multicore Shared Memory Pitfalls
Any iteration of an OpenMP threaded parallel loop can still call other
functions, access global variables, or use anything it can find that
belongs to other threads.
This shared access to common variables
immediately introduces the many problems of "thread
safety". For example, consider a piece of code like this:
int shared=0;
void inc(void) {
	int i=shared;
	i++;
	shared=i;
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A:
	int i=shared; // i==0
	i++; // i==1
	// hit interrupt, switch to B
Thread B:
	int i=shared; // i==0 (again)
	i++; // i==1
	shared=i; // shared==1
	int i=shared; // i==1
	i++; // i==2
	shared=i; // shared==2
	int i=shared; // i==2
	i++; // i==3
	shared=i; // shared==3
	// hit interrupt, switch back to A
Thread A:
	shared=i; // i is still 1, so shared==1!
Uh oh! When we switch back to thread A, the value stored in "i"
isn't right anymore. It's an older copy of a shared global variable,
stored in thread A's stack or registers. So thread A will happily
overwrite the shared variable with its old version, causing all of B's
work to be lost!
Unfortunately, "inc" might only get called every billion
iterations. So your two threads might only call "inc"
simultaneously with probability one in a billion (or less). This
means an *incorrect* parallel program, with unsafe accesses to shared
variables, might just run perfectly most of the time. But every
billionth run, it might crash, or much worse, silently slip in a subtly
wrong answer!
OpenMP Features
Here's the official OpenMP spec.
The canonical OpenMP loop is actually pretty restricted--basically, the
number of iterations must be known beforehand. This means you
can't do "for (int i=0;str[i]!=0;i++)", can't do "break" in the middle
of the loop, etc.
You can also add a variety of interesting options to the "parallel for" line:
- "if(n>100)" only creates parallel threads if the number of
iterations is big enough; otherwise it's a normal sequential
loop. You can use any condition there.
- "private(n)" would give each CPU a separate read-write copy of
the variable "n" (or any other variable). By default, the new
copy is uninitialized. WARNING: there's only one shared copy of
all variables declared outside the parallel loop, unless you explicitly
declare them private!
- "default(none)" is for error-checking your loops: it verifies
that every variable used in the loop is either local, or explicitly
private or shared.
- "shared(n)" declares that we only want a single shared copy
of n. This is the default for variables declared outside the loop.
- "copyin(n)" initializes n from the master's version.
- "firstprivate(n)" is like private, but initializes the new copy
from the master. It's basically private and copyin together.
- "lastprivate(n)" is also like private, but initializes each copy from the last-used version.
- "reduction(+:k)" would total up each CPU's (private) copy of the
variable "k" (or any other variable) using the "+" operation (or any
other operation). Each thread accumulates into its own copy, and the
compiler safely combines the copies at the end of the loop (for
example with atomic "lock" prefix instructions).
- "schedule(static,10)" tells each thread to do blocks of ten
iterations of the loop, with blocks distributed beforehand in
round-robin fashion. The default scheduling is basically
"schedule(static,n/num_threads)".
- "schedule(dynamic,10)" tells each thread to do ten iterations
of the loop at a time, but each thread takes the next available set of
iterations.
- "schedule(guided,10)" takes at least ten iterations at a time, dynamically, but usually more.
- "schedule(runtime)" says to guess the scheduling at runtime.
In addition to "pragma omp parallel for", there are other pragma lines:
- Just "pragma omp parallel" around a random block creates separate threads; call omp_get_thread_num() to figure out who you are.
- "barrier" tells each thread not to continue until all threads have reached this point.
- "critical" only allows one thread at a time to execute the following block (wraps a lock around the block).
- "atomic" makes a simple ++ or += expression atomic (using atomic instructions).
- "master" makes a block of code only run on the master.
- "single" makes a block of code only run on a single (random) thread.
- "ordered" makes a part of the loop execute in sequential
order. This might be more cache efficient than just running a
second sequential loop from the master thread.
- "parallel sections" lets you statically create several parallel pieces of work.
By default you get the same number of threads as you have cores.
You can override the number of threads created with "num_threads(4)"
right inside the #pragma, calling omp_set_num_threads(4); in your
program (it's in omp.h), or setting the OMP_NUM_THREADS environment
variable. Often, I/O bound programs get better performance with a
few times more threads than cores.
By default, if you're running in a parallel loop, and you hit another
nested parallel loop, OpenMP won't make new threads to run the inner
loop (it's already parallel!). Of course, some implementations
are known to just crash instead.
OpenMP Problems
Due to its fork-join style, OpenMP could be called "AmdahlMP", after Amdahl's Law,
which quantifies the bad effect of serial code in a parallel
program. Basically, you've got to make 90% of the runtime of the
application parallel in order to even hope to get a 10x speedup (99%
for 100x, etc). In fact, Amdahl's law says you can only leave
some tiny fraction epsilon of your runtime in serial code to get big
speedups. Of course, in practice, users are perfectly happy getting a 2x or 3x
speedup, even on an 8-core machine, so OpenMP is still a useful
transition mechanism for speeding up mostly-serial code.
OpenMP Sorting
Here's a sort implementation in OpenMP. Basically, I start with a
"#pragma omp parallel" region that sorts per thread sub-pieces of the
array, then a sequential loop merges these sorted values.
#include <omp.h>
#include <cstdio>
#include <cassert>
#include <algorithm> /* for std::sort */

enum{n=8000};
float arr[n+1]; /* silly: farray_fill uses n+1 elements */
float darr[n+1];

int foo(void)
{
	farray_fill(arr,n,0.1); /* fill with random values (NetRun utility) */

	/* Parallel stage 1: sort sub-pieces */
	int nth=1; /* runtime thread count */
	enum{MAXTHREADS=1024}; /* STUPID maximum thread count */
	float *subpieces[MAXTHREADS]; /* subarray for each thread */
	int subn[MAXTHREADS]; /* number of elements in each subarray */
#pragma omp parallel /* each thread sorts a subrange of the array */
	{
		int tid=omp_get_thread_num();
		nth=omp_get_num_threads();
		assert(nth<=MAXTHREADS);
		float *marr=&arr[tid*n/nth];
		subpieces[tid]=marr;
		int mn=n/nth; /* FIXME: what if nth doesn't divide n evenly?! */
		subn[tid]=mn;
		printf("Thread %d: %d to %d more\n",tid,tid*n/nth,mn);
		std::sort(marr,marr+mn);
	}

	/* Stage 2: merge sub-pieces */
	int di=0;
	int subi[MAXTHREADS]; /* index into this thread's subarray */
	for (int i=0;i<nth;i++) subi[i]=0;
	for (di=0;di<n;di++) {
		int ts=-1; /* thread with smallest unconsumed number so far */
		float smallest=1.0e30;
		for (int t=0;t<nth;t++) {
			if (subi[t]<subn[t]) /* this thread still has values to offer */
				if (subpieces[t][subi[t]]<smallest)
				{ /* this thread has the new smallest value */
					ts=t;
					smallest=subpieces[t][subi[t]];
				}
		}
		subi[ts]++; /* consume that value */
		darr[di]=smallest; /* copy to output array */
	}
	farray_print(arr,n);  /* NetRun utility */
	farray_print(darr,n);
	for (int i=0;i<n-1;i++) if (darr[i]>darr[i+1]) {printf("Not sorted at place %d!\n",i); return i;}
	printf("Array sorted properly\n");
	return 0;
}
(Try this in NetRun now!)
On a quad-core machine, this sorts a 1M-entry float array in about 48ms; the corresponding sequential code takes 60ms.
Speedup could be improved by merging the subarrays in parallel, for
example by doing pairwise merging. Some Intel guys wrote a recent
paper on multicore mergesort, and they can sort 64M floats in "less than 0.5 seconds"!