Multithreaded & Multicore
CS 441 Lecture, Dr. Lawlor
First, to get into the proper revolutionary mindset, read:
The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
written by Herb Sutter, a smart Microsoft guy on the C++ standards committee.
Notable quotes:
- "Andy giveth, and Bill
taketh away." That is, Andrew Grove, Intel, keeps building faster
and faster CPUs. Bill Gates, Microsoft, keeps building slower and
slower (yet better and better!) software.
- "Concurrency is the next
major revolution in how we write software." That is, by 2015
purely sequential code will have the same retro feeling that COBOL has
today.
- "Probably
the greatest cost of
concurrency is that concurrency really is hard: The programming model,
meaning the model in the programmer’s head that he needs to reason
reliably about his program, is much harder than it is for sequential
control flow." That is, human reasoning is sequential. Execution is
parallel. Horrible things can crawl into our universe through this gap!
- "For example, Intel is
talking
about someday producing 100-core chips; a single-threaded application
can exploit at most 1/100 of such a chip’s potential throughput."
90% of commercial applications today are single-threaded. They
will either adapt or die out.
- "just because it takes one
woman nine months to produce a baby doesn’t imply that nine women could
produce one baby in one month." So if the problem is "make one
baby", parallelism is useless. But if you change the problem to
"make 1 million babies", suddenly you have million-way parallelism.
- "A few rare classes of
applications are naturally parallelizable, but most aren’t." In
other words, occasionally it's obvious how to run in parallel, and
actually running in parallel is easy. Usually, though, you have
to rethink and reformulate the
basic algorithm used to solve the problem. I'd add that "A few
rare applications are impossible to parallelize, but given enough
effort, most aren't."
Kinds of Parallel Hardware
- One ALU, one set of registers: serial computer (single processor).
- Multiple ALUs, one decoder & one set of registers: Single Instruction Multiple Data (SIMD), like SSE.
- One ALU, shared by several sets of decoders & registers: Simultaneous Multi-Threading (SMT) or HyperThreading. This costs only a tiny amount of additional chip area, but it increases ALU utilization far more than wider superscalar execution does. ALU contention limits the parallelism to perhaps 4-way on modern CPUs, but it can reach 64-fold on GPUs.
- Several complete CPUs, one shared RAM area: Symmetric Multi-Processing (SMP) or Multi-Core. 2 cores is entirely standard today, 4 cores are on the way, and commercial 16- and 32-way SMP nodes exist. SMP can also combine with SMT for absurd numbers of threads: e.g., a typical GPU is 128-way SMP, with a total of 8192-way SMT. (You can ask the OS how many cores you have; see the sketch after this list.)
- Several CPUs, each with its own local RAM, plus an interconnect to reach nonlocal RAM: Non-Uniform Memory Access (NUMA). Unlike SMP, where all RAM is equally fast, on a NUMA machine it's faster to reach your own RAM (60ns) than some other CPU's RAM (1000ns).
- Several complete nodes, connected by a network: Cluster or Massively Parallel Processor (MPP). This is the only architecture known to scale to hundreds of thousands of CPUs--nothing is shared except the network. If all the computers are running the same program, as is typical when solving tightly coupled problems, it's called Single Program Multiple Data (SPMD). The Internet is a huge and very, very loosely coupled MPP, because all the computers on the Internet are running different programs at different times for different reasons.
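Here's a quick way to ask the OS how many cores are online. The sysconf name _SC_NPROCESSORS_ONLN is a widely supported extension (glibc, the BSDs, macOS) rather than strict standard POSIX, so treat this as a sketch:
#include <unistd.h>
int foo(void) {
	return (int)sysconf(_SC_NPROCESSORS_ONLN); /* number of CPU cores currently online */
}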
Simple Kernel Threads
You probably want to let the OS kernel do the hard work of creating threads, which are then called "kernel threads". On UNIX systems, you create a new kernel thread by calling "pthread_create", which is declared in "#include <pthread.h>" and linked in with the "-lpthread" library. You just pass pthread_create the function to run in the thread, and the kernel will then run that function, switching back and forth between threads at essentially random times (usually 100 to 1000 times per second).
#include <iostream> /* for std::cout */
#include <pthread.h>
void doWork(const char *caller) {
	std::cout<<caller<<"\n";
}
void *fnA(void *arg) {
	for (int i=0;i<3;i++) doWork("A");
	return 0;
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork("B");
	return 0;
}
int foo(void) {
	pthread_t B;
	pthread_create(&B,0,fnB,0); /* make a thread B, to run fnB */
	fnA(0); /* meanwhile, this thread runs fnA itself */
	void *Bret;
	pthread_join(B,&Bret); /* wait until B finishes */
	return 0;
}
(executable NetRun link)
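(NetRun supplies its own main and calls your "foo" function. To try this on your own machine, rename "foo" to "main" and build with something like "g++ threads.cpp -o threads -lpthread", where "threads.cpp" is whatever you named the file.)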
To illustrate the way the kernel randomly switches between these functions, run
this thing several times--here's what I got on several different runs:
Run 1:  A A A B B B
Run 2:  AB AB AB
Run 3:  B B B A A A
Run 4:  AB A A B B
Run 5:  AB B B A A
That is, the kernel runs A for a while, then B for a while, then goes back to A. You can't tell when a switch is going to happen. Note that the kernel sometimes switches to B between when A prints the letter 'A' and when it prints the newline immediately after it--that's where the "AB" lines come from!
The Danger of Threads
So basically a kernel thread is just a fancy way to run one of your functions simultaneously with all the other functions. A kernel thread can still call other functions, access your global variables, and use anything it can find that belongs to other threads.
This shared access to common variables immediately introduces the many problems of "thread safety". For example, consider a piece of code like this:
int shared=0;
void inc(void) {
	int i=shared; /* read the shared value */
	i++;          /* increment our private copy */
	shared=i;     /* write it back */
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A                              Thread B
int i=shared; // i==0
i++;          // i==1
// hit interrupt; kernel switches to B
                                      int i=shared; // i==0 (again)
                                      i++;          // i==1
                                      shared=i;     // shared==1
                                      int i=shared; // i==1
                                      i++;          // i==2
                                      shared=i;     // shared==2
                                      int i=shared; // i==2
                                      i++;          // i==3
                                      shared=i;     // shared==3
                                      // hit interrupt; kernel switches back to A
shared=i; // i is still 1, so shared==1!
Uh oh! When we switch back to thread A, the value stored in "i"
isn't right anymore. It's an older copy of a shared global variable,
stored in thread A's stack or registers. So thread A will happily
overwrite the shared variable with its old version, causing all of B's
work to be lost!
Here's an executable example of this problem. Both threads are trying to increment "sum". Each thread increments it 3000 times, so "sum" should be 6000 at the end. But with optimization on, they both store a copy of "sum" in a register, so one thread overwrites the other's work when it writes the modified value back to "sum", and you (usually!) end up with the totally wrong value 3000:
#include <pthread.h>
int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000;i++)
		sum++;
}
void *fnA(void *arg) {
	for (int i=0;i<3;i++) doWork();
	return 0;
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork();
	return 0;
}
int foo(void) {
	sum=0;
	pthread_t B;
	pthread_create(&B,0,fnB,0);
	fnA(0);
	void *Bret;
	pthread_join(B,&Bret);
	return sum; /* should be 6000, but usually isn't! */
}
(executable NetRun link)
An easy solution to this race condition is to make a separate copy of "sum" for each thread. The only trouble is that you then need to combine those separate "sum" values across threads, which itself involves a second parallel summation! Still, privatization is an excellent trick for reducing race conditions, and sometimes it can even eliminate the problem completely (see the last section below).
Thread Safety with "volatile"
Sometimes, the only problem is the compiler's over-optimization of
access to global variables. You can scare the compiler away from
a variable by putting the keyword "volatile" in front of it. This
makes the compiler fear the variable, and do exactly what you ask with
it:
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000;i++)
		sum++; /* now compiles to a single "inc dword [sum]" instruction */
}
(executable NetRun link)
Adding "volatile" does indeed improve the thread safety of this
program--in fact, on a uniprocessor machine, it now gets the right
answer! This is because "sum++" becomes a single instruction
("inc dword [sum]"), and an instruction executes as one piece--if the
kernel switches to another thread, it will be either before or after
this instruction, and never "in the middle of an instruction".
However, on a multicore machine, this program STILL GIVES THE
WRONG ANSWER.
The reason is curious. On a real multiprocessor, two processors might simultaneously
execute the crucial "inc" instruction. Since the instruction
involves reading "sum" from memory, incrementing it, and writing it
back out, it can still happen that one processor overwrites the result
of another processor.
Thread Safety with "lock" Instruction Prefix
One way to *really* fix this code on a multiprocessor would be to use
the assembly language "lock" prefix, which causes an instruction to
execute "atomically". This changes the memory access mode so that
no other processor can modify that instruction's data until the
instruction completes. Here's what our loop looks like using the
lock prefix. "incl (sum)" is the g++ inline assembly version of
"inc dword [sum]".
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000;i++)
		__asm__ ( "lock incl (sum)" ); /* atomically increment sum */
}
(executable NetRun link)
Now, finally, we get the right answer! However, our program is now 10x slower! The trouble is that the "lock" prefix must talk to the memory bus to actually guarantee the right answer, and the memory bus is way slower than the CPU. Also note that only certain read-modify-write instructions (add, inc, xchg, cmpxchg, and a few others) accept the "lock" prefix.
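If you'd rather not write inline assembly, gcc also provides atomic builtins that emit these same lock-prefixed instructions for you. Here's a sketch of our loop using __sync_fetch_and_add--note this builtin is a gcc extension (available since gcc 4.1), and it's my suggested alternative, not part of the original example:
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
	for (int i=0;i<1000;i++)
		__sync_fetch_and_add(&sum,1); /* an atomic sum++; gcc emits a lock-prefixed add */
}
This pays the same memory-bus cost as the hand-written lock prefix, but the compiler picks the right instruction for whatever CPU you're targeting.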
Thread Safety with Mutexes
There's another way to make this function threadsafe that works even if
you've got more than one assembly-instruction worth of work to
do. It's called a "mutex", which is just a special object with
"lock" and "unlock" operations--while you've got the mutex locked, no
other thread can lock the mutex (think of the lock in a bathroom
stall!). In pthreads, this looks like:
int sum=0; /*<- Careful! This variable is shared between threads! */
pthread_mutex_t sum_lock=PTHREAD_MUTEX_INITIALIZER;
void doWork(void) {
	pthread_mutex_lock(&sum_lock); /* take the lock: no other thread can get past this line until we unlock */
	for (int i=0;i<1000;i++)
		sum++;
	pthread_mutex_unlock(&sum_lock); /* release the lock */
}
This supports long stretches of protected code, is guaranteed to work
on all machines, and requires no ugly unportable assembly.
But there are a few problems:
- Mutex acquisition is kinda slow, like 60ns per lock. This means you want to lock and unlock as rarely as possible.
- Nobody else can do the stuff protected by the lock while you've
got it. This means you want to leave the lock locked as short a
time as possible.
- It's possible to try to re-lock a lock you've already locked. This means you're waiting for yourself to unlock it, which will never happen! (This is the mutex reentrancy problem.)
- It's possible to wait on one lock while you hold another lock. If the guy that holds that lock is waiting for your lock, you're both "deadlocked"--see the sketch below.
Thread Safety with Privatization
The ideal way to eliminate shared memory problems is to NOT share memory! The trick here is just giving every thread its own private copy of "sum", and then adding the copies together at the end. This is called "privatization", and it's sort of the gold standard.
The irony is that the best way to do shared memory programming is to avoid sharing.
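Here's a sketch of privatization applied to our running example (the names "sumA" and "sumB" are mine). Each thread increments only its own private counter, so there's no race at all; the one remaining shared-memory step, adding the two copies together, happens safely after the join:
#include <pthread.h>
int sumA=0, sumB=0; /* one private counter per thread--nothing is written by both */
void doWork(int *mySum) {
	for (int i=0;i<1000;i++)
		(*mySum)++; /* no mutex, no "lock" prefix needed */
}
void *fnA(void *arg) {
	for (int i=0;i<3;i++) doWork(&sumA);
	return 0;
}
void *fnB(void *arg) {
	for (int i=0;i<3;i++) doWork(&sumB);
	return 0;
}
int foo(void) {
	sumA=sumB=0;
	pthread_t B;
	pthread_create(&B,0,fnB,0);
	fnA(0);
	void *Bret;
	pthread_join(B,&Bret);
	return sumA+sumB; /* combine the private copies--safe, because B is done */
}
This reliably returns 6000, with no locking overhead at all.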
"There is always a well-known solution to every human problem — neat, plausible, and wrong."
- H.L. Mecken, 1917 (not actually discussing threads)