Multicore Programming Models: Threads vs OpenMP
CS 641 Lecture, Dr. Lawlor
"There is always a well-known solution to every human problem — neat, plausible, and wrong."
- H.L. Mencken, 1917 (not actually discussing threads)
The Problems With Threads
(1) Threads are a pain to create. Just like creating a new process, creating a thread is an OS call.
(See all the example code as a Directory, Zip, or Tar-gzip)
Here's an example of creating threads on UNIX:
#include <pthread.h>
#include <iostream>
volatile int delay=0;
/* Print who's calling, then burn some time incrementing the shared counter. */
void doWork(const char *caller) {
    std::cout<<caller<<"\n";
    for (int i=0;i<1000*1000*1;i++) delay++;
}
void *fnA(void *arg) { /* thread A's work */
    for (int i=0;i<10;i++) doWork("A");
    return 0;
}
void *fnB(void *arg) { /* thread B's work */
    for (int i=0;i<10;i++) doWork("B");
    return 0;
}
int foo(void) {
    pthread_t B;
    pthread_create(&B,0,fnB,0); /* spawn thread B */
    fnA(0); /* do A's work in the calling thread */
    void *Bret;
    pthread_join(B,&Bret); /* wait for B to finish */
    return delay;
}
(Try this in NetRun now!)
Annoying things about this include:
- Moving the per-thread work into a separate function is a pain.
- You need to remember to both create *and* join threads.
- The A and B printouts come in some unpredictable order, depending on the speed of the machine, OS, etc.
- "delay" ends up with a weird random answer (read the full story here).
This stinks.
A New Hope: OpenMP
Because threaded programming is so ugly and tricky, there's a newish (mainstream in 2007) language extension out there called OpenMP, designed to make it substantially easier to write multithreaded code.
The basic idea is you take what looks like an ordinary sequential loop, like:
for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel
for loop, so if you've got six CPUs, the iterations should be spread
across the CPUs. The particular syntax they chose is a "#pragma"
statement, with the "omp" prefix:
#pragma omp parallel for
for (int i=0;i<n;i++) do_fn(i);
Granted, this line has like a 10us/thread overhead, so it won't help
tiny loops, but it can really help long loops. After the for loop
completes, all the threads go back to waiting, except the master
thread, which continues with the (otherwise serial) program. Note
that this is still shared-memory threaded programming, so global
variables are still (dangerously) shared by default!
Unlike bare threads, with OpenMP:
- The compiler decides how many threads to create (typically the number of CPUs)
- The compiler moves the guts of your loop off into a separate function.
- The compiler creates and deallocates the threads.
- The compiler can help you find and fix memory race conditions.
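To see what the second and third points mean in practice, here's a rough, hand-written sketch (mine, not actual compiler output, and missing OpenMP's scheduling and thread reuse) of what the compiler does behind the scenes for a "parallel for":
#include <pthread.h>
#include <cstdio>
enum {NTHREADS=4}; /* fixed thread count just for this sketch */
void do_fn(int i) { printf("iteration %d\n",i); } /* stand-in loop body */
struct chunk { int start,end; }; /* the range of iterations one thread handles */
void *run_chunk(void *arg) { /* the "outlined" loop body the compiler generates */
    chunk *c=(chunk *)arg;
    for (int i=c->start;i<c->end;i++) do_fn(i);
    return 0;
}
void parallel_for(int n) {
    pthread_t tid[NTHREADS];
    chunk c[NTHREADS];
    for (int t=0;t<NTHREADS;t++) { /* hand out contiguous chunks and spawn threads */
        c[t].start=t*n/NTHREADS;
        c[t].end=(t+1)*n/NTHREADS;
        pthread_create(&tid[t],0,run_chunk,&c[t]);
    }
    for (int t=0;t<NTHREADS;t++) /* wait for every thread to finish */
        pthread_join(tid[t],0);
}
int main(void) { parallel_for(10); return 0; }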
Here's how you enable OpenMP in
various compilers. Visual C++ 2005 & later (but NOT
express!), Intel C++ version 9.0, and
gcc version 4.2 all support OpenMP, although earlier versions do
not! Evidently, OpenMP can work in modern Visual C++ Express if
you just download the Windows SDK.
Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop. You're done!
Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs. Note the compiler "#pragma" statements!
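In case that link is broken, here's a minimal "who am I" sketch of my own (not the Livermore original):
#include <omp.h>
#include <cstdio>
int foo(void) {
    #pragma omp parallel /* every thread runs this block */
    {
        int tid=omp_get_thread_num();  /* which thread am I? */
        int nth=omp_get_num_threads(); /* how many of us are there? */
        printf("Hello from thread %d of %d\n",tid,nth);
    }
    return 0;
}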
On the powerwall, you can compile and run OpenMP code as follows:
g++-4.2 get_info.c -o get_info -fopenmp
./get_info
Chris Granade has a nice page on OpenMP on
the Cell Broadband Engine
OpenMP Gotchas
Here's a trivial little program to total up a million integers:
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}
(Try this in NetRun now!)
This takes 2.5 ns/iteration on my quad core machine.
Here's the obvious OpenMP version, which is 15x *slower* (32.3 ns/iteration), and gets the wrong answer besides!
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
    #pragma omp parallel for
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}
(Try this in NetRun now!)
There are two problems here:
- "j" is shared between the threads.
- "sum" is shared between the threads.
It is still multithreaded programming, so these things matter. We
can fix "j" by asking the compiler to make a separate private copy for
each thread. This speeds us up to under 1ns/iteration, about a 2x
speedup over the sequential code. We're still getting the wrong
answer, but now we're getting it *fast*!
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
    #pragma omp parallel for private(j)
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}
(Try this in NetRun now!)
To get the right answer, we just ask OpenMP to automatically total up a private copy of "sum":
volatile int sum=0;
int foo(void) {
    int i,j;
    sum=0;
    #pragma omp parallel for private(j) reduction(+:sum)
    for (i=0;i<1000;i++)
        for (j=0;j<1000;j++) {
            sum++;
        }
    return sum;
}
(Try this in NetRun now!)
This is now 0.44ns/iteration, a solid 4x speedup over the original code, and it gets the right answer!
Even in OpenMP, you can still get multithreaded race conditions,
especially when calling library code. The compiler can help you
fix these, but often you need to explicitly ask it for help.
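For example (a sketch of mine, not from the library's documentation), you can serialize a call into a non-thread-safe library with a "critical" section, at the cost of losing parallelism for that call:
#include <omp.h>
#include <cstdio>
/* Stand-in for some library call that keeps non-thread-safe internal state: */
void log_message(const char *msg) { static int count=0; printf("%d: %s\n",++count,msg); }
void process(int n) {
    #pragma omp parallel for
    for (int i=0;i<n;i++) {
        /* ... parallel work on element i ... */
        #pragma omp critical /* only one thread inside the library at a time */
        log_message("finished an element");
    }
}
int main(void) { process(10); return 0; }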
False Sharing
It's tricky to build your own version of OpenMP's "private". The danger is false sharing: even if each thread updates only its own counter, counters that sit next to each other in memory land on the same cache line, and that line ping-pongs between the cores on every write, destroying performance even though nothing is logically shared. The fix below is to pad each thread's counter apart by "space" ints (16 ints = 64 bytes, one cache line); change space to 1 and watch it slow down.
#include <omp.h>
const int space=16; /* pad counters 16 ints (64 bytes) apart--one cache line each */
volatile int delay[4*space]={0}; /* one padded counter per thread */
int foo(void) {
    int i;
    for (int core=0;core<4;core++) delay[core*space]=0;
    #pragma omp parallel for //reduction(+:delay)
    for (i=0;i<1000;i++)
    {
        int d=(i/250)*space; /* iterations 0-249 hit counter 0, 250-499 hit counter 1, ... */
        for (int j=0;j<1000;j++)
        {
            delay[d]++;
        }
    }
    return delay[0*space];
}
(Try this in NetRun now!)
OpenMP Features
Here's the official OpenMP spec.
The canonical OpenMP loop is actually pretty restricted--basically, the
number of iterations must be known beforehand. This means you
can't do "for (int i=0;str[i]!=0;i++)", can't do "break" in the middle
of the loop, etc.
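The usual workaround (a sketch of mine) is to compute the trip count first and index into the data:
#include <omp.h>
#include <cstring>
#include <cstdio>
/* "for (int i=0;str[i]!=0;i++)" isn't a legal OpenMP loop,
   but counting the iterations up front with strlen makes it one: */
int count_spaces(const char *str) {
    int n=(int)strlen(str); /* now the iteration count is known beforehand */
    int count=0;
    #pragma omp parallel for reduction(+:count)
    for (int i=0;i<n;i++)
        if (str[i]==' ') count++;
    return count;
}
int main(void) {
    printf("%d spaces\n",count_spaces("a b c d"));
    return 0;
}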
You can also add a variety of interesting options to the "parallel for" line:
- "if(n>100)" only creates parallel threads if the number of
iterations is big enough; otherwise it's a normal sequential
loop. You can use any condition there.
- "private(n)" would give each CPU a separate read-write copy of
the variable "n" (or any other variable). By default, the new
copy is uninitialized. WARNING: there's only one shared copy of
all variables declared outside the parallel loop, unless you explicitly
declare them private!
- "default(none)" is for error-checking your loops: it verifies
that every variable used in the loop is either local, or explicitly
private or shared.
- "shared(n)" declares that we only want a single shared copy
of n. This is the default for variables declared outside the loop.
- "copyin(n)" initializes n from the master's version.
- "firstprivate(n)" is like private, but initializes the new copy
from the master. It's basically private and copyin together.
- "lastprivate(n)" is also like private, but initializes each copy from the last-used version.
- "reduction(+:k)" would total up each CPU's (private) copy of the
variable "k" (or any other variable) using the "+" operation (or any
other operation). The compiler generates atomic ("lock" prefix) instructions to do this.
- "schedule(static,10)" tells each thread to do blocks of ten
iterations of the loop, with blocks distributed beforehand in
round-robin fashion. The default scheduling is basically
"schedule(static,n/num_threads)".
- "schedule(dynamic,10)" tells each thread to do ten iterations
of the loop at a time, but each thread takes the next available set of
iterations.
- "schedule(guided,10)" takes at least ten iterations at a time, dynamically, but usually more.
- "schedule(runtime)" says to guess the scheduling at runtime.
In addition to "pragma omp parallel for", there are other pragma lines:
- Just "pragma omp parallel" around a random block creates separate threads; call omp_get_thread_num() to figure out who you are.
- "barrier" tells each thread to not to continue until all threads have reached this point.
- "critical" only allows one thread at a time to execute the following block (wraps a lock around the block).
- "atomic" makes a simple ++ or += expression atomic (using atomic instructions).
- "master" makes a block of code only run on the master.
- "single" makes a block of code only run on a single (random) thread.
- "ordered" makes a part of the loop execute in sequential
order. This might be more cache efficient than just running a
second sequential loop from the master thread.
- "parallel sections" lets you statically create several parallel pieces of work.
By default you get the same number of threads as you have cores.
You can override the number of threads created with "num_threads(4)"
right inside the #pragma, calling omp_set_num_threads(4); in your
program (it's in omp.h), or setting the OMP_NUM_THREADS environment
variable. Often, I/O bound programs get better performance with a
few times more threads than cores.
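For example (a sketch of mine):
#include <omp.h>
#include <cstdio>
int foo(void) {
    omp_set_num_threads(4); /* request 4 threads for upcoming parallel regions */
    #pragma omp parallel num_threads(2) /* ...or override just this one region */
    {
        #pragma omp single
        printf("this region is running %d threads\n",omp_get_num_threads());
    }
    return 0;
}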
By default, if you're running in a parallel loop, and you hit another
nested parallel loop, OpenMP won't make new threads to run the inner
loop (it's already parallel!). Of course, some implementations
are known to just crash instead.
OpenMP Problems
Due to its fork-join style, OpenMP could be called "AmdahlMP", after Amdahl's Law,
which quantifies the bad effect of serial code in a parallel
program. Basically, you've got to make 90% of the runtime of the
application parallel in order to even hope to get a 10x speedup (99%
for 100x, etc). In fact, Amdahl's law says you can only leave
some tiny fraction epsilon of your runtime in serial code to get big
speedups. Of course, in practice, users are perfectly happy getting a 2x or 3x
speedup, even on an 8-core machine, so OpenMP is still a useful
transition mechanism for speeding up mostly-serial code.
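In formula form (standard Amdahl's law, stated here for reference): if a fraction P of the runtime parallelizes perfectly across N cores, the speedup is
$$\text{speedup}(N) = \frac{1}{(1-P)+P/N} \;\le\; \frac{1}{1-P}$$
so P=0.9 caps you at 10x no matter how many cores you add, and P=0.99 caps you at 100x.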
OpenMP Sorting
Here's a sort implementation in OpenMP. Basically, I start with a
"#pragma omp parallel" region that sorts per thread sub-pieces of the
array, then a sequential loop merges these sorted values.
#include <omp.h>
#include <cassert>
#include <cstdio>
#include <algorithm> /* for std::sort */
enum{n=8000};
float arr[n+1]; /* silly: farray_fill uses n+1 elements */
float darr[n+1];
int foo(void)
{
    farray_fill(arr,n,0.1); /* fill with random values (NetRun helper) */
    /* Parallel stage 1: sort sub-pieces */
    int nth=1; /* runtime thread count */
    enum{MAXTHREADS=1024}; /* STUPID maximum thread count */
    float *subpieces[MAXTHREADS]; /* subarray for each thread */
    int subn[MAXTHREADS]; /* number of elements in each subarray */
    #pragma omp parallel /* each thread sorts a subrange of the array */
    {
        int tid=omp_get_thread_num();
        nth=omp_get_num_threads();
        assert(nth<=MAXTHREADS);
        float *marr=&arr[tid*n/nth];
        subpieces[tid]=marr;
        int mn=n/nth; /* FIXME: what if nth doesn't divide n evenly?! */
        subn[tid]=mn;
        printf("Thread %d: %d to %d more\n",tid,tid*n/nth,mn);
        std::sort(marr,marr+mn);
    }
    /* Stage 2: merge sub-pieces */
    int di=0;
    int subi[MAXTHREADS]; /* index into this thread's subarray */
    for (int i=0;i<nth;i++) subi[i]=0;
    for (di=0;di<n;di++) {
        int ts=-1; /* thread with smallest unconsumed number so far */
        float smallest=1.0e30;
        for (int t=0;t<nth;t++) {
            if (subi[t]<subn[t]) /* this thread still has values to offer */
                if (subpieces[t][subi[t]]<smallest)
                { /* this thread has the new smallest value */
                    ts=t;
                    smallest=subpieces[t][subi[t]];
                }
        }
        subi[ts]++; /* consume that value */
        darr[di]=smallest; /* copy to output array */
    }
    farray_print(arr,n); /* NetRun helper: print the arrays */
    farray_print(darr,n);
    for (int i=0;i<n-1;i++) if (darr[i]>darr[i+1]) {printf("Not sorted at place %d!\n",i); return i;}
    printf("Array sorted properly\n");
    return 0;
}
(Try this in NetRun now!)
On a quad-core machine, this sorts a 1m-entry float array in about 48ms; the corresponding sequential code takes 60ms.
Speedup could be improved by merging the subarrays in parallel, for
example by doing pairwise merging. Some Intel guys wrote a recent
paper on multicore mergesort, and they can sort 64M floats in "less than 0.5 seconds"!
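Here's one hedged sketch (mine, untuned, using std::merge) of that pairwise-merge idea: each pass merges adjacent pairs of sorted runs in parallel, halving the number of runs until one remains:
#include <omp.h>
#include <algorithm> /* for std::merge */
#include <vector>
#include <cstdio>
std::vector<float> pairwise_merge(std::vector<std::vector<float> > runs) {
    while (runs.size()>1) {
        int npairs=(int)runs.size()/2;
        std::vector<std::vector<float> > next((runs.size()+1)/2);
        #pragma omp parallel for /* each pair of runs merges independently */
        for (int p=0;p<npairs;p++) {
            std::vector<float> &A=runs[2*p], &B=runs[2*p+1];
            next[p].resize(A.size()+B.size());
            std::merge(A.begin(),A.end(),B.begin(),B.end(),next[p].begin());
        }
        if (runs.size()%2==1) next.back()=runs.back(); /* odd run out carries over */
        runs=next;
    }
    return runs.empty()?std::vector<float>():runs[0];
}
int main(void) { /* four sorted runs of five values each */
    std::vector<std::vector<float> > runs(4);
    for (int r=0;r<4;r++) for (int i=0;i<5;i++) runs[r].push_back(r+4.0f*i);
    std::vector<float> merged=pairwise_merge(runs);
    for (unsigned int i=0;i<merged.size();i++) printf("%.0f ",merged[i]);
    printf("\n");
    return 0;
}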