Multithreaded Execution
CS 441 Lecture, Dr. Lawlor
Pop quiz--so multicore parallel computing is finally going mainstream. What will this event be most like?
- The 2006 Democratic majority in the US Congress. (New names, lots of hype, but zero measurable effect.)
- The worst 1950's rumors about communist spies infiltrating honest
American institutions. (Reliable components slowly going bad
without anybody knowing... until it's too late!)
- The worst rumors about global warming coming true. (Plagues of locusts, drought, heat, and major lifestyle changes for everybody.)
- A meteorite similar to that which drove the dinosaurs extinct. (Slow suffocating starvation for 90%+ of earth's life.)
- The second coming. (Dead shall rise, seas shall boil, new
Jerusalem on a new Earth.)
Answer: all of the above!
- Whatever happens, there's still going to be C++-lookin' code, if
for no other reason than the huge quantity of the C++ that's out there
already! The names may change, and the way the code is called may change a lot, but deep down tons of sequential "variable=value;" code will not change. It's the code that calls your deep-down application code that's likely to change a lot.
- Multiple things happening at once means you have to worry about
"race conditions", where two (or more) pieces of code "race" to see
which one finishes first. You have to worry about three
(nondeterministic!) possibilities: A finishes first, B finishes first,
or a tie. Maybe two of these cases are really rare. But
any one of them might cause your code to blow up! (Crash, fail,
silently give the wrong answer to a crucial query...)
- Parallel programming is a new lifestyle. You've got to
worry about things you didn't have to worry about before.
Concurrency makes some formerly-moral stuff (like static variables)
immoral, and can make some formerly immoral stuff (like public members)
moral, because the situation has changed.
- Many perfectly innocent serial programs will slowly wither and die, because they don't take advantage of parallelism. They will have the embarrassingly limited, dated feeling of command-line DOS applications (8-character filename limit), FAT32 partitions (4GB file limit), or software-rendered 3D games (ugly and slow). Few users will mourn their passing.
- Parallel programming really has the potential to change
everything. Just the hardware ranges from staid stuff like SSE to
multicore to distributed to GPU to FPGA. FPGA's don't even run
C++ code, dangit. It's the end of the world! (As we know it... and I feel fine.)
To get into the proper revolutionary mindset, read:
The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
written by Herb Sutter, smart Microsoft guy on the C++ standards committee
Notable quotes:
- "Andy giveth, and Bill
taketh away." That is, Andrew Grove, Intel, keeps building faster
and faster CPUs. Bill Gates, Microsoft, keeps building slower and
slower (yet better and better!) software.
- "Concurrency is the next
major revolution in how we write software." That is, by 2015
purely sequential code will have the same retro feeling that COBOL has
today.
- "Probably
the greatest cost of
concurrency is that concurrency really is hard: The programming model,
meaning the model in the programmer’s head that he needs to reason
reliably about his program, is much harder than it is for sequential
control flow." That is, human reasoning is sequential. Execution is
parallel. Horrible things can crawl into our universe through this gap!
- "For example, Intel is
talking
about someday producing 100-core chips; a single-threaded application
can exploit at most 1/100 of such a chip’s potential throughput."
90% of commercial applications today are single-threaded. They
will either adapt or die out.
- "just because it takes one
woman nine months to produce a baby doesn’t imply that nine women could
produce one baby in one month." So if the problem is "make one
baby", parallelism is useless. But if you change the problem to
"make 1 million babies", suddenly you have million-way parallelism.
- "A few rare classes of
applications are naturally parallelizable, but most aren’t." In
other words, occasionally it's obvious how to run in parallel, and
actually running in parallel is easy. Usually, though, you have
to rethink and reformulate the
basic algorithm used to solve the problem. I'd add that "A few
rare applications are impossible to parallelize, but given enough
effort, most aren't."
Kinds of Parallel Computers
- One ALU, one set of registers: serial computer (single processor).
- Multiple ALUs, one decoder & one set of registers: Single Instruction Multiple Data (SIMD), like SSE (see the tiny SSE sketch after this list).
- One ALU, several sets of registers: Symmetric Multi-Threading
(SMT) or HyperThreading. This is only a tiny amount of additional
chip area, but increases ALU utilization way more than wider
superscalar execution. ALU contention limits the parallelism to
perhaps 4-way on modern CPUs, but can reach 64-fold on GPUs.
- Several different CPUs, one shared RAM area: Symmetric
Multi-Processing (SMP) or Multi-Core. 2 cores is entirely
standard today, 4 cores are on the way, and commercial SMP nodes with
16 or 32 way access exist. SMP can also combine with SMT for
absurd numbers of threads: e.g., a typical GPU is 128-way SMP, with a
total of 8192-way SMT.
- Several CPUs, each with local RAM, but with a nonlocal RAM
interconnect: Non-Uniform Memory Architecture (NUMA). Unlike SMP,
where all RAM is equal, in a NUMA, it's faster to go to your own RAM
(60ns) compared to some other CPU's RAM (1000ns).
- Several complete nodes, connected with a network: Cluster or Massively Parallel Processor (MPP). This is the only architecture known to scale to hundreds of thousands of CPUs--nothing is shared except the network. If all the computers are running the same program, as is typical to solve tightly coupled problems, it's called Single Program Multiple Data (SPMD). The Internet is a huge and very, very loosely coupled MPP, because all the computers on the internet are running different programs at different times for different reasons.
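To make the SIMD entry above concrete, here's a minimal sketch using the SSE intrinsics that gcc and the other mainstream x86 compilers ship in <xmmintrin.h>. The array names and values are made up for illustration; the point is that the single _mm_add_ps call becomes one addps instruction that adds four pairs of floats at once--one decoder, four ALU lanes.
#include <xmmintrin.h> /* SSE intrinsics */
#include <iostream>
int main(void) {
  float a[4]={1,2,3,4}, b[4]={10,20,30,40}, c[4];
  __m128 va=_mm_loadu_ps(a); /* load 4 floats into one 128-bit register */
  __m128 vb=_mm_loadu_ps(b);
  __m128 vc=_mm_add_ps(va,vb); /* ONE instruction adds all 4 pairs at once */
  _mm_storeu_ps(c,vc); /* store the 4 results back to memory */
  for (int i=0;i<4;i++) std::cout<<c[i]<<" ";
  std::cout<<"\n";
  return 0;
}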
Kinds of Program Control Flows
There are quite a few different levels of saving and restoring that you
can design. The existing levels nest, and from least to most
context they include:
- Functions, where things execute one at a time. Hopefully you're familiar with normal functions.
- Registers: by convention, some are saved (must restore value for caller), some are scratch (can freely overwrite)
- Stack: organized by convention into "stack frames"
- Memory: shared between all functions
- Files: shared between all functions
- User-level threads,
where you save and restore processor registers yourself, usually via an
explicit "swap" subroutine call. Also called "coroutines" or
"user/userspace threads", or (by Sun Java folks) "green threads".
Check out
my little "minithread" example code
(Directory, Zip, Tar-gzip).
The key routine is just the "mth_swap" subroutine, which saves an old
set of registers and loads up a new one, and the core of this is just:
mov [rax],rsp ; Save the old stack pointer
mov rsp, [rcx] ; Load up the new stack pointer
Typical examples include Netscape Threads, or any of a thousand tiny projects mostly relying on setjmp/longjmp or "makecontext/swapcontext" (a tiny swapcontext sketch appears after this list).
Each thread has its own stack, usually just allocated with "malloc",
but shares the rest of memory with other user-level threads.
- Registers: explicitly saved and restored by a "swap" routine.
- Stack: each user-level thread has a separate stack region in memory.
- Memory & Files: shared between all threads.
- Kernel-level threads,
or just "threads", are where the OS saves and restores processor
registers to switch threads at interrupt time. Also called
"native threads". The most common example is OpenMP or pthreads.
There's actually a hybrid of kernel and user-level threads where you
save and restore processor registers yourself (not in the OS), but you
do it inside a signal handler during an interrupt.
- Registers: implicitly saved and restored by the OS when it decides to switch threads.
- Stack: separate regions of memory for each thread.
- Memory & Files: usually shared between all threads.
- Processes, where
the OS swaps out all of memory and all I/O connections in addition to
the registers. This is what you usually think of as a "program".
- Registers: implicitly saved and restored by the OS when it decides to switch processes.
- Stack & memory: completely separate memory area for each process.
- Files: open files are local to a process (I open a file, it's
not open for you), but data in files is shared between all processes.
- Virtual Machines, where the
OS itself and everything in it is swapped out. QEMU is a pretty typical
example. A virtual machine is usually treated as an ordinary
process inside the outer ("host") OS.
- Registers, stack, memory, files, disks, and even CPUs are all private to a virtual machine.
All these levels actually nest. So a virtual machine might
contain several processes, with the emulated OS switching between
them.
Each process might contain several kernel threads, with the OS again
switching between them. Each kernel thread might contain several
user-level threads, which user code explicitly switches between.
Each user-level thread contains several functions, which call one
another.
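Here's the little swapcontext sketch promised in the user-level threads item above: the same idea as the mth_swap routine, but using the POSIX getcontext/makecontext/swapcontext calls instead of hand-written assembly. The names main_ctx, thread_ctx, and thread_body, plus the 64KB stack size, are all inventions for this sketch; it builds and runs on Linux, though some systems (notably recent Mac OS X) have deprecated the ucontext calls.
#include <ucontext.h> /* for getcontext, makecontext, swapcontext */
#include <iostream>
static ucontext_t main_ctx, thread_ctx; /* saved register sets */
static char thread_stack[64*1024]; /* the user-level thread needs its own stack */
void thread_body(void) {
  for (int i=0;i<3;i++) {
    std::cout<<"in user-level thread, pass "<<i<<"\n";
    swapcontext(&thread_ctx,&main_ctx); /* save my registers, load main's */
  }
}
int main(void) {
  getcontext(&thread_ctx); /* start from a copy of the current registers */
  thread_ctx.uc_stack.ss_sp=thread_stack; /* point it at its own stack */
  thread_ctx.uc_stack.ss_size=sizeof(thread_stack);
  thread_ctx.uc_link=&main_ctx; /* where to go if thread_body ever returns */
  makecontext(&thread_ctx,thread_body,0);
  for (int i=0;i<3;i++) {
    std::cout<<"in main\n";
    swapcontext(&main_ctx,&thread_ctx); /* explicit switch: the kernel is not involved */
  }
  return 0;
}
Note how nothing switches until somebody explicitly calls swapcontext--that explicit, voluntary handoff is exactly what separates user-level threads from the kernel threads described above, where the OS can switch at any interrupt.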
Creating Different Program Control Flows
See all the example code as a
(Directory, Zip, Tar-gzip).
You should become familiar with both Windows and UNIX flavors of thread and process creation.
Simple Kernel Threads
You probably want to make the OS
kernel do the hard work of creating threads, which are then called
"kernel threads". On UNIX systems, you create a new kernel thread
by calling "pthread_create", which is in "#include <pthread.h>"
and the "-lpthread" library. You just pass pthread_create the
function to run in the thread, and the kernel will then run that
function, switching back and forth between threads at
basically random times (100 to 1000 times per second, usually).
#include <pthread.h>
#include <iostream>
void doWork(const char *caller) {
std::cout<<caller<<"\n";
}
void *fnA(void *arg) {
for (int i=0;i<3;i++) doWork("A");
return 0;
}
void *fnB(void *arg) {
for (int i=0;i<3;i++) doWork("B");
return 0;
}
int foo(void) {
pthread_t B;
pthread_create(&B,0,fnB,0); /* make a thread B, to run fnB */
fnA(0);
void *Bret;
pthread_join(B,&Bret); /* wait until B finishes */
return 0;
}
(executable NetRun link)
To illustrate the way the kernel randomly switches between these functions, run
this thing several times--here's what I got on several different runs:
Run 1: A A A B B B
Run 2: AB AB AB
Run 3: B B B A A A
Run 4: AB A A B B
Run 5: AB B B A A
That is, the kernel runs A for a while, then B for a while, then back
to A. You can't tell when a switch is going to happen. Note that the kernel sometimes switches to B between when A
prints the letter 'A' and when it prints the newline immediately after
it in the string "A\n"!
The danger of threads
So basically a kernel thread is just a fancy way to run one of your subroutines
simultaneously with all the other subroutines. A kernel thread can
still call other functions, access your global variables, or use anything it can find that belongs
to other threads.
This shared access to common variables
immediately introduces the many problems of "thread
safety". For example, consider a piece of code like this:
int shared=0;
void inc(void) {
int i=shared;
i++;
shared=i;
}
If two threads try to call "inc" repeatedly, the two executions might interleave like this:
Thread A:
  int i=shared; // i==0
  i++; // i==1
  // hit interrupt. switch to B
Thread B:
  int i=shared; // i==0 (again)
  i++; //i==1
  shared=i; // shared==1
  int i=shared; // i==1
  i++; //i==2
  shared=i; // shared==2
  int i=shared; // i==2
  i++; //i==3
  shared=i; // shared==3
  // hit interrupt, switch back to A
Thread A (resuming):
  shared=i; // i is still 1, so shared==1!
Uh oh! When we switch back to thread A, the value stored in "i"
isn't right anymore. It's an older copy of a shared global variable,
stored in thread A's stack or registers. So thread A will happily
overwrite the shared variable with its old version, causing all of B's
work to be lost!
Here's an executable example of this problem. Both threads are
trying to increment "sum". They do this 6,000 times, so "sum"
should be 6000 at the end. But with optimization on, they both
store a copy of "sum" in a register, so one guy overwrites the other
guy's work when they write the modified value back to "sum", and you
(usually!) end up with the totally-wrong value 3000:
#include <pthread.h>
int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000;i++)
sum++;
}
void *fnA(void *arg) {
for (int i=0;i<3;i++) doWork();
return 0;
}
void *fnB(void *arg) {
for (int i=0;i<3;i++) doWork();
return 0;
}
int foo(void) {
sum=0;
pthread_t B;
pthread_create(&B,0,fnB,0);
fnA(0);
void *Bret;
pthread_join(B,&Bret);
return sum;
}
(executable NetRun link)
An easy solution to this race condition is to make a separate copy of
"sum" for each thread. The only trouble is that you then
need to combine these separate "sum" values across threads, which
itself involves a second parallel summation! Privatization is still an excellent trick for reducing race conditions, and sometimes it can even eliminate the problem completely.
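Here's a minimal sketch of that privatization trick applied to the sum example; the partial[] array and doWorkPrivate are made-up names for this sketch, not part of the NetRun examples above. Each thread increments only its own slot, so there's nothing to race on, and the pthread_join makes it safe for foo to read B's slot afterward:
#include <pthread.h>
int partial[2]={0,0}; /* one private slot per thread: nothing is shared during the loop */
void doWorkPrivate(int who) {
  for (int i=0;i<1000;i++)
    partial[who]++; /* each thread touches only its own slot */
}
void *fnA(void *arg) { for (int i=0;i<3;i++) doWorkPrivate(0); return 0; }
void *fnB(void *arg) { for (int i=0;i<3;i++) doWorkPrivate(1); return 0; }
int foo(void) {
  partial[0]=partial[1]=0;
  pthread_t B;
  pthread_create(&B,0,fnB,0);
  fnA(0);
  void *Bret;
  pthread_join(B,&Bret); /* after the join, B's slot is finished and safe to read */
  return partial[0]+partial[1]; /* the "second summation": combine the private copies */
}
This version reliably returns 6000 with no locks and no special instructions (though having the two counters share a cache line isn't great for speed).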
Thread Safety with "volatile"
Sometimes, the only problem is the compiler's over-optimization of
access to global variables. You can scare the compiler away from
a variable by putting the keyword "volatile" in front of it. This
makes the compiler fear the variable, and do exactly what you ask with
it:
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000;i++)
sum++;
}
(executable NetRun link)
Adding "volatile" does indeed improve the thread safety of this
program--in fact, on a uniprocessor machine, it now gets the right
answer! This is because "sum++" becomes a single instruction
("inc dword [sum]"), and an instruction executes as one piece--if the
kernel switches to another thread, it will be either before or after
this instruction, and never "in the middle of an instruction".
However, on a multicore machine, this program STILL GIVES THE
WRONG ANSWER.
The reason is curious. On a real multiprocessor, two processors might simultaneously
execute the crucial "inc" instruction. Since the instruction
involves reading "sum" from memory, incrementing it, and writing it
back out, it can still happen that one processor overwrites the result
of another processor.
Thread Safety with "lock" Instruction Prefix
One way to *really* fix this code on a multiprocessor would be to use
the assembly language "lock" prefix, which causes an instruction to
execute "atomically". This changes the memory access mode so that
no other processor can modify that instruction's data until the
instruction completes. Here's what our loop looks like using the
lock prefix. "incl (sum)" is the g++ inline assembly version of
"inc dword [sum]".
volatile int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
for (int i=0;i<1000;i++)
__asm__ ( "lock incl (sum)" );
}
(executable NetRun link)
Now, finally, we get the right answer! However, our program is
now 10x slower! The trouble is the "lock" prefix must talk to the
memory bus to actually guarantee the right answer, and the memory bus
is way slower than the CPU. Only certain read-modify-write instructions (add, inc, dec, xchg, cmpxchg, and the like) take the "lock" prefix.
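As an aside, if you'd rather not write inline assembly, recent versions of gcc (4.1 and up) provide atomic builtins like __sync_fetch_and_add, which emit the same sort of lock-prefixed instruction for you--and pay the same memory-bus cost. A sketch of the same loop using the builtin:
int sum=0; /*<- Careful! This variable is shared between threads! */
void doWork(void) {
  for (int i=0;i<1000;i++)
    __sync_fetch_and_add(&sum,1); /* gcc builtin: an atomic (lock-prefixed) add */
}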
Thread Safety With Mutexes
There's another way to make this function threadsafe that works even if
you've got more than one assembly-instruction worth of work to
do. It's called a "mutex", which is just a special object with
"lock" and "unlock" operations--while you've got the mutex locked, no
other thread can lock the mutex (think of the lock in a bathroom
stall!). In pthreads, this looks like:
int sum=0; /*<- Careful! This variable is shared between threads! */
pthread_mutex_t sum_lock=PTHREAD_MUTEX_INITIALIZER;
void doWork(void) {
pthread_mutex_lock(&sum_lock);
for (int i=0;i<1000;i++)
sum++;
pthread_mutex_unlock(&sum_lock);
}
This supports long stretches of protected code, is guaranteed to work
on all machines, and requires no ugly unportable assembly.
But there are a few problems:
- Mutex acquisition is kinda slow, like 60ns per lock. This means you want to lock and unlock as rarely as possible.
- Nobody else can do the stuff protected by the lock while you've
got it. This means you want to leave the lock locked as short a
time as possible.
- It's possible to try to re-lock a lock you've locked. This means you're waiting for yourself to unlock it! (mutex reentrancy)
- It's possible to wait on one lock while you hold another lock. If the guy that holds that lock is waiting for your lock, you're both "deadlocked" (see the sketch after this list).
Parallel interface: pthreads
So the whole point of making threads is to get some work done in each thread. This is pretty easy in theory:
// ... prepare program to be run in pieces ...
// Run the pieces as separate threads:
pthread_t *thr=new pthread_t[n];
for (int i=0;i<n;i++) {
worker *w=new ... i'th piece of work ...
pthread_create(&thr[i],0,work,w);
}
// Wait until all pieces are finished (until work routine returns)
for (int i=0;i<n;i++) {
void *ret;
pthread_join(thr[i],&ret);
}
You just have to be very careful that the workers don't
accidentally overwrite each other's work, since all memory is shared
between threads. Sometimes this requires locking, but usually it
requires making separate copies of variables. Keeping track of
the separate copies can be a pain. Separating the copies used
across a big complicated program can be a real pain.
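To make that skeleton concrete, here's a hedged sketch that splits a hypothetical array summation across n threads; the worker struct, its fields, and parallel_sum are all inventions for this sketch. Each worker writes only its own result field, so nothing needs locking until after the joins:
#include <pthread.h>
/* Hypothetical piece of work: sum data[start..end-1] into result. */
struct worker {
  const int *data;
  int start,end;
  long result; /* private to this worker: no race */
};
void *work(void *arg) {
  worker *w=(worker *)arg;
  long sum=0;
  for (int i=w->start;i<w->end;i++) sum+=w->data[i];
  w->result=sum;
  return 0;
}
long parallel_sum(const int *data,int len,int n) {
  pthread_t *thr=new pthread_t[n];
  worker *w=new worker[n];
  for (int i=0;i<n;i++) { /* hand each thread its own slice of the array */
    w[i].data=data;
    w[i].start=i*len/n;
    w[i].end=(i+1)*len/n;
    w[i].result=0;
    pthread_create(&thr[i],0,work,&w[i]);
  }
  long total=0;
  for (int i=0;i<n;i++) { /* wait for each piece, then combine the private results */
    void *ret;
    pthread_join(thr[i],&ret);
    total+=w[i].result;
  }
  delete[] thr; delete[] w;
  return total;
}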
Parallel interface: processes
Unlike threads, processes don't share memory by default. This
substantially reduces the number of places you have to worry about race
conditions and other parallel problems. However, there are times
when you want to share some memory between processes--for example, to
collect the pieces of the final result. You must allocate
these shared regions in a special way--on UNIX, you can make a shared
memory region really easily using mmap or with more code using shmget/shmat (see Beej's shm example). On Windows, you can create a shared-memory region with CreateFileMapping and then MapViewOfFile (see codeproject example).
Here's my UNIX example (works on Linux and Mac OS X, but not Windows):
#include <unistd.h> /* for fork */
#include <sys/mman.h> /* for mmap */
#include <sys/wait.h> /* for wait */
#include <stdio.h> /* for printf */
#include <stdlib.h> /* for exit */
// Return len bytes of RAM that will automagically be shared after a fork().
void *shared_malloc(int len) {
void *p=mmap(0,len, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1,0);
if (p==(void *)-1) printf("mmap failed!\n");
return p;
}
void shared_free(void *p,int len) {
munmap(p,len);
}
...
// ... prepare program to be run in pieces ...
// ... call shared_malloc for memory shared between pieces ...
// Run the pieces as separate processes:
int *pid=new int[n];
for (int i=0;i<n;i++) {
worker *w=new ... i'th piece of work ...
pid[i]=fork();
if (pid[i]==0) { /* I'm the child laborer */
work(w);
exit(0);
}
}
// Wait until all pieces are finished (until children exit)
for (int i=0;i<n;i++) {
int ret;
waitpid(pid[i],&ret,0);
}
Unlike the threaded version
above, the separate pieces of this program can't possibly mess up each
other's work--in fact, they have no way to communicate except by the
shared_malloc regions allocated by the parent! I claim that for
many real programs, this "private by default, shared only when
explicitly asked" is the correct approach. It means all your old
(buggy, global variable-laden) C++ code still works. It means you
don't get annoying race conditions (as often).
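And here's the same hypothetical array summation done the process way, reusing shared_malloc from above; parallel_sum_procs and the result[] layout are inventions for this sketch. Each child writes only its own slot of the one shared array, and everything else is private--exactly the "private by default, shared only when explicitly asked" style described above:
#include <unistd.h> /* for fork */
#include <sys/mman.h> /* for mmap (used by shared_malloc above) */
#include <sys/wait.h> /* for waitpid */
#include <stdlib.h> /* for exit */
/* assumes shared_malloc/shared_free from the example above */
long parallel_sum_procs(const int *data,int len,int n) {
  long *result=(long *)shared_malloc(n*sizeof(long)); /* one shared slot per child */
  int *pid=new int[n];
  for (int i=0;i<n;i++) {
    pid[i]=fork();
    if (pid[i]==0) { /* child: sum my slice, write my slot, and exit */
      long sum=0;
      for (int j=i*len/n;j<(i+1)*len/n;j++) sum+=data[j];
      result[i]=sum;
      exit(0);
    }
  }
  long total=0;
  for (int i=0;i<n;i++) { /* parent: wait for each child, then combine the slots */
    int ret;
    waitpid(pid[i],&ret,0);
    total+=result[i];
  }
  delete[] pid;
  shared_free(result,n*sizeof(long));
  return total;
}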