The whole point of having multiple cores is to improve performance. But for some problems it's surprisingly hard to get any performance benefit from multicore. One big reason is this:
If the program is CPU bound, adding CPUs can help. If not, it won't.
Many other things can limit the performance of real programs: waiting on disk or network I/O, and waiting on memory for pointer-chasing data structures.
Generally, if you want to avoid all these things and get good performance, you need a big array (or vector) of data sitting in memory in one contiguous chunk, that you can walk straight through and compute on. Files and network connections tend to be limited by the OS or hardware, and fancy trees or linked data structures tend to be memory-limited.
The other big factor limiting performance is the difference between bandwidth (wholesale) and latency (retail).
System | Bandwidth | Latency
---|---|---
CPU | Billions of instructions/sec | a few nanoseconds per branch
Multicore | CPU bandwidth times core count (assuming perfect speedup) | 1000 ns or more to create threads
GPU | Trillions of instructions/sec | 4000 ns or more to start a kernel
Memory | 10-20 GB/sec (streaming access) | 20 ns (random access)
Disk | Spinning disk: 50-100 MB/s; SSD: 50-500 MB/s | Spinning disk: 5 ms seek time; SSD: 0.1 ms seek time (WAY better!)
Network | Wifi: 54 Mbit/sec; Gig E: 1000 Mbit/sec | Wifi: 2 ms; Gig E: 0.2 ms
For a program that mostly reads memory (such as a search for a winning checkers move, or a faster solution to a scheduling problem), just dividing the work across cores is enough. For a program that both reads and writes memory, like the histogram example below, it can be hard to beat a single-threaded solution, but it is possible!
Our example builds a histogram: for each of a million data values, increment the corresponding bucket's counter. This is the serial base case, and it takes about 0.7 nanoseconds per data item on my Sandy Bridge quad-core machine.
const int ndata=1000*1000;
int data[ndata];
const int ncores=4;
const int nhist=1000;
volatile int hist[nhist];

int build_histo(void) {
    for (int d=0;d<ndata;d++) {
        hist[data[d]]++;
    }
    return 0;
}
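A minimal timing harness for these measurements might look like the sketch below; it reuses the declarations above, fills data[] with random bucket indices so the histogram indexing stays in bounds, and uses omp_get_wtime() for wall-clock time. The details here are illustrative, not the exact benchmark behind the numbers quoted.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>   /* for omp_get_wtime() */

int main(void) {
    for (int d=0;d<ndata;d++) data[d]=rand()%nhist; /* keep indices in 0..nhist-1 */

    double start=omp_get_wtime();
    build_histo();
    double elapsed=omp_get_wtime()-start;
    printf("%.2f ns/item\n",elapsed*1.0e9/ndata);
    return 0;
}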
Just slapping in an OpenMP pragma has two effects: the program gets way slower, about 4.6 ns/item, because the cores now fight over the cache lines holding the shared hist array, and it also gets the wrong answer, because the threads overwrite each other's increments.
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        hist[data[d]]++;
    }
    return 0;
}
You can at least get the right answer by adding a critical section around the increment. The only problem is we've actually destroyed all the parallelism: the cores now have to fight for the single critical-section lock, so this takes 96 nanoseconds per element--over a 100x slowdown from the serial version!
OpenMP's critical pragma is one big global lock, but OpenMP (like most other thread libraries) does provide a "lock" or "mutex" (mutual exclusion) data structure, omp_lock_t, so you could make an array of them to reduce contention on the one big lock. This "lock splitting" technique can help reduce lock contention, but doesn't change the lock overhead itself, which is typically dozens of nanoseconds per acquire (see the sketch after the critical-section code below).
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        #pragma omp critical /* only one thread at a time here */
        hist[data[d]]++;
    }
    return 0;
}
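Here's the lock-splitting idea as a rough sketch, using OpenMP's runtime locks (omp_lock_t) and reusing the data[] and hist[] declarations from the serial version. The one-lock-per-bucket layout is just one possible split; every increment still pays the lock overhead, only the contention goes down.
#include <omp.h>

omp_lock_t histlock[nhist]; /* one lock per histogram bucket */

int build_histo(void) {
    for (int i=0;i<nhist;i++) omp_init_lock(&histlock[i]);

    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int bucket=data[d];
        omp_set_lock(&histlock[bucket]);   /* only blocks threads hitting the same bucket */
        hist[bucket]++;
        omp_unset_lock(&histlock[bucket]);
    }

    for (int i=0;i<nhist;i++) omp_destroy_lock(&histlock[i]);
    return 0;
}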
The hardware actually supports 'atomic' multithread-safe versions of a few instructions, like integer addition. This does some magic at the cache line level to guarantee exclusive access. It's much finer grained than a critical section, which excludes all processors, so it's quite a bit faster, down to 6.1ns per element.
Confusingly, 'atomic' operations are implemented using the x86 'lock' prefix, but today a 'lock' has come to mean a much slower library-supported critical section. Atomics have been getting much faster on recent hardware, and GPUs recently added a 'zero penalty atomic' that somehow runs at the same speed as normal arithmetic. It still uses more bus traffic, or else I think they'd just make all operations atomic, and eliminate a huge class of multithreaded problems!
int build_histo(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        #pragma omp atomic /* perform this operation 'atomically' */
        hist[data[d]]++;
    }
    return 0;
}
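Outside OpenMP, the same hardware feature is exposed directly; for example, C++11's std::atomic turns the increment into the same lock-prefixed add on x86. A minimal sketch, assuming the same data[]/nhist setup as above (ahist and build_histo_atomic are just illustrative names):
#include <atomic>

std::atomic<int> ahist[nhist]; /* global, so zero-initialized; each counter is individually atomic */

int build_histo_atomic(void) {
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        ahist[data[d]]++;   /* atomic fetch-and-add; no critical section needed */
    }
    return 0;
}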
Sharing is bad. Shared data needs to be accessed in some funky way, or you run the risk of another thread overwriting your work.
Separate data is good. Separate data is always fast (no cache thrashing), always correct (no multicore timing issues), and just generally works like the good old days when there was only one thread.
So we can give each core a separate area in memory to build its own histogram. Done poorly, indexed like hist[nhist][ncores], adjacent cores' counters sit in the same cache line, so the cores still fight over those lines and you get poor performance due to false sharing (cache coherence thrashing). Done properly, indexed like hist[ncores][nhist] so each core's data is contiguous, this is very fast, about 0.3 ns/item, even including the time to merge the copies afterwards.
#include <omp.h>
const int ndata=1000*1000;
int data[ndata];
const int ncores=4; // FIXME: use omp_get_max_threads()
const int nhist=1000;
volatile int hist[ncores][nhist]; // per-core histograms

int build_histo(void) {
    // Sum into local arrays
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        int c=omp_get_thread_num();
        hist[c][data[d]]++;
    }
    // Sum across cores, to core 0
    for (int i=0;i<nhist;i++)
        for (int c=1;c<ncores;c++)
            hist[0][i]+=hist[c][i];
    return 0;
}
I hardcoded "ncores" to 4 above, for clarity. In a real program, you need to dynamically allocate the "hist" array based on how many cores actually exist, like this:
int ncores=omp_get_max_threads();
const int nhist=1000;
typedef int histarray[nhist];
volatile histarray *hist=new histarray[ncores];
Also, to prevent cache thrashing at the boundaries between threads, you should pad histarray by at least 16 integers (one 64-byte cache line), like "typedef int histarray[nhist+32];", which yields about a 10% performance boost.
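Putting the dynamic allocation and the padding together, a sketch of the whole per-thread version might look like this (final_hist is just an illustrative name for the merged output):
#include <omp.h>

const int ndata=1000*1000;
int data[ndata];
const int nhist=1000;
typedef int histarray[nhist+32]; /* pad past a 64-byte cache line to avoid false sharing */
int final_hist[nhist];           /* merged result ends up here */

int build_histo(void) {
    int ncores=omp_get_max_threads();
    histarray *hist=new histarray[ncores]; /* one private histogram per thread */
    for (int c=0;c<ncores;c++)
        for (int i=0;i<nhist;i++) hist[c][i]=0;

    // Each thread sums into its own private copy
    #pragma omp parallel for
    for (int d=0;d<ndata;d++) {
        hist[omp_get_thread_num()][data[d]]++;
    }

    // Merge the per-thread copies into final_hist
    for (int i=0;i<nhist;i++) {
        final_hist[i]=0;
        for (int c=0;c<ncores;c++) final_hist[i]+=hist[c][i];
    }
    delete[] hist;
    return 0;
}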
Generally speaking, the highest performance solution will always be to make separate copies. However, the problem is then merging the copies. For example, in a database program, keeping track of 4-32 separate copies of the database would be a coherence nightmare--essentially the underlying multicore cache coherence problem pushed up one more level!
CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.