Different Roads to Multicore Histogramming
CS 441/641 Lecture, Dr. Lawlor
Serial Histogram
This is the base case, and takes about 0.7 nanoseconds per data item on my Sandy Bridge quad core machine.
const int ndata=1000*1000;
int data[ndata];
const int ncores=4;
const int nhist=1000;
volatile int hist[nhist];
int build_histo(void) {
for (int d=0;d<ndata;d++) {
hist[data[d]]++;
}
return 0;
}
(Try this in NetRun now!)
Naive Parallelization Fail
Just slapping in an OpenMP pragma has two effects: the program gets way
slower, 4.6ns/item, and it also gets the wrong answer due to the
threads overwriting each others' work.
int build_histo(void) {
#pragma omp parallel for
for (int d=0;d<ndata;d++) {
hist[data[d]]++;
}
return 0;
}
(Try this in NetRun now!)
Adding Critical Section (Lock)
You can at least get the right answer by adding a critical section to
the increment. The only problem is we've actually destroyed all
the parallelism, and the cores now have to fight for the critical
section lock, so this takes 96 nanoseconds per element--over a 100x
slowdown!
OpenMP is amazingly lacking in a finer-grained lock primitive.
Most other thread libraries support a "lock" or "mutex" (mutual
exclusion) data structure, so you could make an array of them to reduce
contention on the one big lock. This 'lock splitting' technique
can help reduce lock contention, but doesn't change the lock overhead,
which is typically dozens of nanoseconds.
int build_histo(void) {
#pragma omp parallel for
for (int d=0;d<ndata;d++) {
#pragma omp critical /* only one thread at a time here */
hist[data[d]]++;
}
return 0;
}
(Try this in NetRun now!)
Adding Atomic Access
The hardware actually supports 'atomic' multithread-safe versions of a
few instructions, like integer addition. This does some magic at
the cache line level to guarantee exclusive access. It's much
finer grained than a critical section, which excludes all processors,
so it's quite a bit faster, down to 6.1ns per element.
Confusingly, 'atomic' operations are implemented using the x86 'lock'
prefix, but today a 'lock' has come to mean a much slower
library-supported critical section. Atomics have been getting
much faster on recent hardware, and GPUs recently added a 'zero penalty
atomic' that somehow runs at the same speed as normal arithmetic.
It still uses more bus traffic, or else I think they'd just make all
operations atomic, and eliminate a huge class of multithreaded problems!
int build_histo(void) {
#pragma omp parallel for
for (int d=0;d<ndata;d++) {
#pragma omp atomic /* perform this operation 'atomically' */
hist[data[d]]++;
}
return 0;
}
(Try this in NetRun now!)
Privatize Data
Sharing is bad. Shared data needs to be accessed in some funky
way, or you run the risk of another thread overwriting your work.
Separate data is good. Separate data is always fast (no cache
thrashing), always correct (no multicore timing issues), and just
generally works like the good old days when there was only one thread.
So we can give each core a separate area in memory to build its own
histogram. Done poorly, indexed like hist[nhist][ncores], the
different cores fight for cache lines, and you get poor performance due
to false sharing (cache coherence thrashing). Done properly, where each
core's data is contiguous, this is very fast, about 0.3ns/item.
I'm not counting the time to merge the histograms afterward, but if the
data is big enough, this time is small.
#include <omp.h>
const int ndata=1000*1000;
int data[ndata];
const int ncores=4;
const int nhist=1000;
volatile int hist[ncores][nhist]; // per-core histograms
int build_histo(void) {
#pragma omp parallel for
for (int d=0;d<ndata;d++) {
int c=omp_get_thread_num();
hist[c][data[d]]++;
}
return 0;
}
(Try this in NetRun now!)
Generally speaking, the highest performance solution will always be to
make separate copies. However, the problem is then merging the
copies. For example, in a database program, keeping track of 4-32
separate copies of the database would be a coherence
nightmare--essentially the underlying multicore cache coherence problem
pushed up one more level!