Different Roads to Multicore Histogramming

CS 441/641 Lecture, Dr. Lawlor

Serial Histogram

This is the base case; it takes about 0.7 nanoseconds per data item on my quad-core Sandy Bridge machine.
const int ndata=1000*1000;
int data[ndata];

const int ncores=4;
const int nhist=1000;
volatile int hist[nhist];

int build_histo(void) {
    // count how many times each value appears in data[]
    for (int d=0;d<ndata;d++) {
        hist[data[d]]++;
    }
    return 0;
}

(Try this in NetRun now!)

Naive Parallelization Fail

Just slapping in an OpenMP pragma has two effects: the program gets much slower, at 4.6 ns/item, and it also gets the wrong answer, because the threads overwrite each other's work.  A sketch of how the increments get lost appears after the NetRun link below.
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        hist[data[d]]++; // data race: threads overwrite each other's increments
    }
    return 0;
}

(Try this in NetRun now!)
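
To see where the counts go missing, here's a sketch (not the lecture's NetRun code) with the increment written out as its separate read and write halves--this is essentially what the compiled hist[data[d]]++ does, so two threads can interleave between the steps:
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int i=data[d];
        int old=hist[i]; /* threads A and B can both read the same old count here... */
        hist[i]=old+1;   /* ...and both write back old+1, so one increment is lost */
    }
    return 0;
}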

Adding Critical Section (Lock)

You can at least get the right answer by adding a critical section around the increment.  The only problem is that we've destroyed all the parallelism: the cores now have to fight for the critical section's lock, so this takes 96 nanoseconds per element--over a 100x slowdown!

OpenMP's critical section is frustratingly coarse: by default, every #pragma omp critical shares one big global lock.  Most thread libraries support a "lock" or "mutex" (mutual exclusion) data structure (in OpenMP that's the omp_lock_t API), so you could make an array of them to reduce contention on the one big lock.  This 'lock splitting' technique can help reduce lock contention, but doesn't change the per-lock overhead, which is typically dozens of nanoseconds; a lock-splitting sketch appears after the example below.
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        #pragma omp critical /* only one thread at a time here */
        hist[data[d]]++;
    }
    return 0;
}

(Try this in NetRun now!)
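
Here's a minimal lock-splitting sketch using OpenMP's omp_lock_t API and the globals from the snippets above; the lock count nlocks=64 and the bin%nlocks mapping are arbitrary choices for illustration, not anything from the lecture's timings:
#include <omp.h>
const int nlocks=64;      // tuning choice: more locks, less contention
omp_lock_t locks[nlocks]; // locks[i] guards every bin with bin%nlocks==i

int build_histo(void) {
    for (int i=0;i<nlocks;i++) omp_init_lock(&locks[i]);
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int bin=data[d];
        omp_set_lock(&locks[bin%nlocks]);   // grab only this bin's lock
        hist[bin]++;
        omp_unset_lock(&locks[bin%nlocks]); // release it immediately
    }
    for (int i=0;i<nlocks;i++) omp_destroy_lock(&locks[i]);
    return 0;
}

Contention is now spread across 64 locks, but every increment still pays two library calls of lock overhead.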

Adding Atomic Access

The hardware actually supports 'atomic' multithread-safe versions of a few instructions, like integer addition.  The CPU holds the affected cache line exclusively for the duration of the read-modify-write to guarantee exclusive access.  It's much finer-grained than a critical section, which locks out every other core no matter which bin it's touching, so it's quite a bit faster, down to 6.1 ns per element.

Confusingly, 'atomic' operations are implemented using the x86 'lock' prefix, but today a 'lock' has come to mean a much slower library-supported critical section.  Atomics have been getting much faster on recent hardware, and GPUs recently added 'zero-penalty atomics' that somehow run at the same speed as normal arithmetic.  They still use more bus traffic, or else I think the designers would just make all operations atomic and eliminate a huge class of multithreaded problems!  A compiler-builtin version of the atomic increment appears after the NetRun link below.
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        #pragma omp atomic /* perform this operation 'atomically' */
        hist[data[d]]++;
    }
    return 0;
}

(Try this in NetRun now!)
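
Outside OpenMP, the same hardware atomic is exposed directly by compiler builtins.  Here's a sketch (GCC/Clang-specific, not part of the lecture) using __sync_fetch_and_add, which the compiler turns into a lock-prefixed add on x86:
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        // atomically add 1 to this bin; becomes a 'lock add' instruction on x86
        __sync_fetch_and_add(&hist[data[d]],1);
    }
    return 0;
}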


Privatize Data

Sharing is bad.  Shared data needs to be accessed in some funky way, or you run the risk of another thread overwriting your work. 

Separate data is good.  Separate data is always fast (no cache thrashing), always correct (no multicore timing issues), and just generally works like the good old days when there was only one thread.

So we can give each core a separate area in memory to build its own histogram.  Done poorly, indexed like hist[nhist][ncores], the counters that different cores update sit next to each other and share cache lines, so you get poor performance from false sharing (cache coherence thrashing).  Done properly, with each core's histogram contiguous as in hist[ncores][nhist] below, this is very fast, about 0.3 ns/item.  I'm not counting the time to merge the per-core histograms afterward (a sketch of that merge appears after the NetRun link below), but if the data is big enough, that time is small.
#include <omp.h>
const int ndata=1000*1000;
int data[ndata];

const int ncores=4;
const int nhist=1000;
volatile int hist[ncores][nhist]; // per-core histograms

int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int c=omp_get_thread_num(); // which thread am I?
        hist[c][data[d]]++; // touch only my own private histogram
    }
    return 0;
}

(Try this in NetRun now!)
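
The merge step that isn't being timed above might look something like this; final_hist is a new array introduced just for this sketch:
int final_hist[nhist]; // combined result (not in the code above)

void merge_histo(void) {
    for (int h=0;h<nhist;h++) {
        int sum=0;
        for (int c=0;c<ncores;c++) sum+=hist[c][h]; // fold in each core's private count
        final_hist[h]=sum;
    }
}

With nhist=1000 and ncores=4 that's only 4,000 additions, which is why it's negligible next to a million data items.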

Generally speaking, the highest-performance solution will always be to make separate copies.  The problem is then merging the copies.  For example, in a database program, keeping track of 4-32 separate copies of the database would be a coherence nightmare--essentially the underlying multicore cache coherence problem pushed up one more level!