The whole point of having multiple cores is to improve performance. But for some problems it's surprisingly hard to get any performance benefit from multicore. One big reason is this:
If the program is CPU bound, adding CPUs can help. If not, it won't.
Many other things can limit the performance of real programs: waiting on disk or network I/O, and waiting on memory for pointer-chasing data structures.
Generally, if you want to avoid all these things and get good performance, you need a big array (or vector) of data sitting in memory in one contiguous chunk, that you can walk straight through and compute on. Files and network connections tend to be limited by the OS or hardware, and fancy trees or linked data structures tend to be memory-limited.
The other big factor limiting performance is the difference between bandwidth (wholesale) and latency (retail).
System | Bandwidth | Latency
---|---|---
CPU | Billions of instructions/sec | a few nanoseconds per branch
Multicore | CPU bandwidth times core count (assuming perfect speedup) | 1000 ns or more to create threads
GPU | Trillions of instructions/sec | 4000 ns or more to start a kernel
Memory | 10-20 GB/sec (streaming access) | 20 ns (random access)
Disk | Spinning disk: 50-100 MB/s; SSD: 50-500 MB/s | Spinning disk: 5 ms seek time; SSD: 0.1 ms seek time (WAY better!)
Network | Wifi: 54 Mbit/sec; Gig E: 1000 Mbit/sec | Wifi: 2 ms; Gig E: 0.2 ms
For a program that mostly reads memory (such as a search for a winning checkers move, or a faster solution to a scheduling problem), just dividing the work across cores is enough. For a program that both reads and writes memory, like the histogram example below, it can be hard to beat a single-threaded solution, but it is possible!
Our example builds a histogram: for each of a million data values, increment the corresponding bucket's counter. This is the serial base case, and it takes about 0.7 nanoseconds per data item on my Sandy Bridge quad-core machine.
const int ndata=1000*1000;
int data[ndata];
const int ncores=4;
const int nhist=1000;
volatile int hist[nhist];

int build_histo(void) {
    for (int d=0;d<ndata;d++) {
        hist[data[d]]++;
    }
    return 0;
}
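A minimal timing harness for these measurements might look like the sketch below; it reuses the declarations above, fills data[] with random bucket indices so the histogram indexing stays in bounds, and uses omp_get_wtime() for wall-clock time. The details here are illustrative, not the exact benchmark behind the numbers quoted.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>   /* for omp_get_wtime() */

int main(void) {
    for (int d=0;d<ndata;d++) data[d]=rand()%nhist; /* keep indices in 0..nhist-1 */

    double start=omp_get_wtime();
    build_histo();
    double elapsed=omp_get_wtime()-start;
    printf("%.2f ns/item\n",elapsed*1.0e9/ndata);
    return 0;
}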
Just slapping in an OpenMP pragma has two effects: the program gets way slower, about 4.6 ns/item, because the cores now fight over the cache lines holding the shared hist array, and it also gets the wrong answer, because the threads overwrite each other's increments.
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        hist[data[d]]++;
    }
    return 0;
}
You can at least get the right answer by adding a critical section around the increment. The only problem is we've actually destroyed all the parallelism: the cores now have to fight for the single critical-section lock, so this takes 96 nanoseconds per element--over a 100x slowdown from the serial version!
OpenMP's critical pragma is one big global lock, but OpenMP (like most other thread libraries) does provide a "lock" or "mutex" (mutual exclusion) data structure, omp_lock_t, so you could make an array of them to reduce contention on the one big lock. This "lock splitting" technique can help reduce lock contention, but doesn't change the lock overhead itself, which is typically dozens of nanoseconds per acquire (see the sketch after the critical-section code below).
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        #pragma omp critical /* only one thread at a time here */
        hist[data[d]]++;
    }
    return 0;
}
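Here's the lock-splitting idea as a rough sketch, using OpenMP's runtime locks (omp_lock_t) and reusing the data[] and hist[] declarations from the serial version. The one-lock-per-bucket layout is just one possible split; every increment still pays the lock overhead, only the contention goes down.
#include <omp.h>

omp_lock_t histlock[nhist]; /* one lock per histogram bucket */

int build_histo(void) {
    for (int i=0;i<nhist;i++) omp_init_lock(&histlock[i]);

    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int bucket=data[d];
        omp_set_lock(&histlock[bucket]);   /* only blocks threads hitting the same bucket */
        hist[bucket]++;
        omp_unset_lock(&histlock[bucket]);
    }

    for (int i=0;i<nhist;i++) omp_destroy_lock(&histlock[i]);
    return 0;
}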
The hardware actually supports 'atomic' multithread-safe versions of a few instructions, like integer addition. This does some magic at the cache line level to guarantee exclusive access. It's much finer grained than a critical section, which excludes all processors, so it's quite a bit faster, down to 6.1ns per element.
Confusingly, 'atomic' operations are implemented using the x86 'lock' prefix, but today a 'lock' has come to mean a much slower library-supported critical section. Atomics have been getting much faster on recent hardware, and GPUs recently added a 'zero penalty atomic' that somehow runs at the same speed as normal arithmetic. It still uses more bus traffic, or else I think they'd just make all operations atomic, and eliminate a huge class of multithreaded problems!
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        #pragma omp atomic /* perform this operation 'atomically' */
        hist[data[d]]++;
    }
    return 0;
}
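Outside OpenMP, the same hardware feature is exposed directly; for example, C++11's std::atomic turns the increment into the same lock-prefixed add on x86. A minimal sketch, assuming the same data[]/nhist setup as above (ahist and build_histo_atomic are just illustrative names):
#include <atomic>

std::atomic<int> ahist[nhist]; /* global, so zero-initialized; each counter is individually atomic */

int build_histo_atomic(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        ahist[data[d]]++;   /* atomic fetch-and-add; no critical section needed */
    }
    return 0;
}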
Sharing is bad. Shared data needs to be accessed in some funky way, or you run the risk of another thread overwriting your work.
Separate data is good. Separate data is always fast (no cache thrashing), always correct (no multicore timing issues), and just generally works like the good old days when there was only one thread.
So we can give each core a separate area in memory to build its own histogram. Done poorly, indexed like hist[nhist][ncores], adjacent cores' counters sit in the same cache line, so the cores still fight over those lines and you get poor performance due to false sharing (cache coherence thrashing). Done properly, indexed like hist[ncores][nhist] so each core's data is contiguous, this is very fast, about 0.3 ns/item, even including the time to merge the copies afterwards.
#include <omp.h>
const int ndata=1000*1000;
int data[ndata];
const int ncores=4; // FIXME: use omp_get_max_threads()
const int nhist=1000;
volatile int hist[ncores][nhist]; // per-core histograms

int build_histo(void) {
    // Sum into local arrays
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int c=omp_get_thread_num();
        hist[c][data[d]]++;
    }
    // Sum across cores, to core 0
    for (int i=0;i<nhist;i++)
        for (int c=1;c<ncores;c++)
            hist[0][i]+=hist[c][i];
    return 0;
}
I hardcoded "ncores" to 4 above, for clarity. In a real program, you need to dynamically allocate the "hist" array based on how many cores actually exist, like this:
int ncores=omp_get_max_threads();
const int nhist=1000;
typedef int histarray[nhist];
volatile histarray *hist=new histarray[ncores];
Also, to prevent cache thrashing at the boundaries between threads, you should pad histarray by at least 16 integers (one 64-byte cache line), like "typedef int histarray[nhist+32];", which yields about a 10% performance boost.
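Putting the dynamic allocation and the padding together, a sketch of the whole per-thread version might look like this (final_hist is just an illustrative name for the merged output):
#include <omp.h>

const int ndata=1000*1000;
int data[ndata];
const int nhist=1000;
typedef int histarray[nhist+32]; /* pad past a 64-byte cache line to avoid false sharing */
int final_hist[nhist];           /* merged result ends up here */

int build_histo(void) {
    int ncores=omp_get_max_threads();
    histarray *hist=new histarray[ncores]; /* one private histogram per thread */
    for (int c=0;c<ncores;c++)
        for (int i=0;i<nhist;i++) hist[c][i]=0;

    // Each thread sums into its own private copy
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        hist[omp_get_thread_num()][data[d]]++;
    }

    // Merge the per-thread copies into final_hist
    for (int i=0;i<nhist;i++) {
        final_hist[i]=0;
        for (int c=0;c<ncores;c++) final_hist[i]+=hist[c][i];
    }
    delete[] hist;
    return 0;
}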
Generally speaking, the highest performance solution will always be to make separate copies. However, the problem is then merging the copies. For example, in a database program, keeping track of 4-32 separate copies of the database would be a coherence nightmare--essentially the underlying multicore cache coherence problem pushed up one more level!
CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.