Parallelism in Practice
CS 321 2007 Lecture, Dr. Lawlor
I promise this is the last lecture you'll hear on parallelism--I think it's a really interesting and important topic, but there is a lot more material to cover this semester.
So you've got this application. The application is
sequential--it's a normal top-to-bottom C++ program. You want to
make it work on multiple CPUs.
Step one is to figure out what parts of the program are taking the
time--there's no point in parallelizing the idle loop. If the
program is slow because it's waiting for the disk, or network card, or
graphics card, STOP, because using multiple CPUs is probably not going to help.
If the program spends most of its time computing (doing arithmetic or
memory operations), then using multiple CPUs should be able to speed up
the program substantially. The first and hardest part is
splitting up the big glob-o-code into little pieces that can run
independently. Once you've split up the program into pieces, you then
just have to run each piece. Below are several ways to run pieces
of code simultaneously.
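But before splitting anything, double-check where the time actually goes. Here's a quick-and-dirty way to measure it yourself (a minimal sketch; wall_time is just a helper name I made up, and a real profiler like gprof will give you more detail):

#include <stdio.h>
#include <sys/time.h>
/* Return wall-clock time in seconds (UNIX only). */
double wall_time(void) {
  struct timeval tv;
  gettimeofday(&tv,0);
  return tv.tv_sec+1.0e-6*tv.tv_usec;
}
...
double start=wall_time();
// ... suspect chunk of the program ...
printf("That chunk took %.3f seconds\n",wall_time()-start);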
Splitting up Programs into Pieces
So you've got a big ball of C code. You want to run pieces of
that code on multiple CPUs. This is almost always possible, but
it's almost never easy.
- For programs that are computing a big array of stuff, you can
compute pieces of the array separately. Most graphics programs
can divide up the image (or 3D model) into pieces.
- For programs that search over a big area for stuff, you can search subareas separately.
- For programs that total up a bunch of stuff, you can usually
total up the pieces independently, and then come up with a grand total
at the end (see the sketch after this list).
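For example, here's a minimal sketch of the split-an-array and total-up-pieces ideas together (sum_piece and the chunk arithmetic are my own names, not any standard API): each piece sums one contiguous chunk, and the grand total just adds up the pieces.

/* Sum one contiguous piece [start,end) of the array. */
double sum_piece(const double *arr,int start,int end) {
  double t=0.0;
  for (int i=start;i<end;i++) t+=arr[i];
  return t;
}
...
/* Split n elements into n_pieces chunks (the last chunk may be short). */
int per=(n+n_pieces-1)/n_pieces;
double grand=0.0;
for (int p=0;p<n_pieces;p++) {
  int start=p*per, end=start+per;
  if (end>n) end=n;
  grand+=sum_piece(arr,start,end); /* each call could run on its own CPU */
}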
There's often a substantial amount of rethinking required to split a
program into pieces. It's easy, for example, to mess up the
division of work so some things get done twice, and others never get
done at all!
Stuff that's really hard to do in parallel includes sorting, compiling,
and talking to existing sequential code. It's also usually a bad
idea to try to split up file I/O operations
into pieces--even if you eventually make the file locking and seeking
work reliably,
you'll often still get bad performance. I always try to switch to
in-memory operations first, or have only one thread do most of the I/O.
One well-known problem with dividing a problem into pieces is "load
balance"--the pieces are almost never all the same size. For
example, if you divide a problem into two pieces for two processors,
but the left piece is way bigger, the right processor will spend most
of its time waiting for the left one to finish. If you have 100
processors (the 2017 Intel Centium), it's easy to screw up your work
division so that 99 of the processors are waiting for the last one to
finish. This can destroy the performance improvement you'd get
from parallelism, so load imbalance is a bad thing.
Luckily, there's a cool trick for fixing load balance problems--just
make way more pieces than you think you'll need
("overdecomposition"). For example, if you've got two processors,
make like ten threads. Then if one thread finishes early, there
are still nine more for your processor to choose from. In
general, extra threads will keep performance good until each thread is
doing less than a few dozen milliseconds of work (below that, the
overhead of creating and switching threads starts to dominate), or
until the OS can't create you any more threads!
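The usual way to cash in on overdecomposition is a shared "bag of tasks": each thread keeps grabbing the next unclaimed piece until they're all gone, so a fast thread naturally ends up doing more pieces than a slow one. Here's a minimal pthread sketch (grab_piece, do_piece, and bag_worker are names I made up):

#include <pthread.h>
int next_piece=0, n_pieces=0; /* set n_pieces to ~10x the CPU count */
pthread_mutex_t piece_lock=PTHREAD_MUTEX_INITIALIZER;
/* Hand out the next unclaimed piece, or -1 if they're all taken. */
int grab_piece(void) {
  pthread_mutex_lock(&piece_lock);
  int i=(next_piece<n_pieces) ? next_piece++ : -1;
  pthread_mutex_unlock(&piece_lock);
  return i;
}
/* Each thread runs this loop until the bag is empty. */
void *bag_worker(void *arg) {
  int i;
  while ((i=grab_piece())!=-1) {
    // do_piece(i); /* your code for the i'th piece goes here */
  }
  return 0;
}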
Load balance is usually impossible to fix if you've divided the problem into inherently
unequal pieces--for example, if you divide a compiler into three
threads (preprocessor, syntax parse, and code generation). The
problem with this "heterogenous" each-thread-does-a-different-thing
approach is that you've only got three threads, so you can't
overdecompose. You're also limited to how many CPUs you can
support. Doing a threaded compiler that has one thread per *input
file* would be a better approach, since almost anytime you're waiting
for the compiler, there are lots of different files to compile.
Parallelism Using Threads (UNIX)
#include <pthread.h>
// ... prepare program to be run in pieces ...
// Each piece of work is described by one of these (put your own fields here):
struct worker { int i; /* ... describes the i'th piece ... */ };
// Thread entry point: pthreads hands each new thread its worker as a void *.
void *work(void *arg) {
  worker *w=(worker *)arg;
  // ... do w's piece of the work ...
  return 0;
}
// Run the pieces as separate threads:
pthread_t *thr=new pthread_t[n];
for (int i=0;i<n;i++) {
  worker *w=new worker;
  w->i=i; /* the i'th piece of work */
  pthread_create(&thr[i],0,work,w);
}
// Wait until all pieces are finished (until work routine returns)
for (int i=0;i<n;i++) {
  void *ret;
  pthread_join(thr[i],&ret);
}
With threads, you do have to be very careful that the workers don't
accidentally overwrite each other's work, since all memory is shared
between threads. Sometimes this requires locking, but usually it
just requires making separate copies of variables. Keeping track of
those separate copies can be a pain; keeping them separate
across a big program can be a real pain.
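For example, if every thread naively does "total+=x;" on one shared total, updates get lost: += is really a read, an add, and a write, and two threads can interleave those steps. Here's a minimal sketch of the separate-copies fix for an array sum (sum_worker, partial, and n_threads are my own names):

enum {n_threads=4};
double partial[n_threads]; /* one separate copy per thread: no race */
struct sum_worker { const double *data; int start,end,thread; };
void *sum_work(void *arg) {
  sum_worker *w=(sum_worker *)arg;
  double t=0.0; /* local variable: private to this thread */
  for (int i=w->start;i<w->end;i++) t+=w->data[i];
  partial[w->thread]=t; /* each thread writes only its own slot */
  return 0;
}
...
/* After pthread_join'ing everybody, combining the copies is safe: */
double total=0.0;
for (int t=0;t<n_threads;t++) total+=partial[t];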
Parallelism Using Processes
Unlike threads, processes don't share memory by default. This
substantially reduces the number of places you have to worry about race
conditions and other parallel problems. However, there are times
when you want to share some memory between processes--for example, to
collect the pieces of the final result. You must allocate
these shared regions in a special way--on UNIX, you can make a shared
memory region really easily using mmap or with more code using shmget/shmat (see Beej's shm example). On Windows, you can create a shared-memory region with CreateFileMapping and then MapViewOfFile (see codeproject example).
Here's my UNIX example (works on Linux and Mac OS X, but not Windows):
#include <stdio.h> /* for printf */
#include <stdlib.h> /* for exit */
#include <unistd.h> /* for fork */
#include <sys/mman.h> /* for mmap */
#include <sys/wait.h> /* for wait */
// Return len bytes of RAM that will automagically be shared after a fork().
void *shared_malloc(int len) {
  void *p=mmap(0,len, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1,0);
  if (p==MAP_FAILED) printf("mmap failed!\n");
  return p;
}
void shared_free(void *p,int len) {
  munmap(p,len);
}
...
// ... prepare program to be run in pieces ...
// ... call shared_malloc for memory shared between pieces ...
// Run the pieces as separate processes:
int *pid=new int[n];
for (int i=0;i<n;i++) {
  worker *w=new worker;
  w->i=i; /* the i'th piece of work */
  pid[i]=fork();
  if (pid[i]==0) { /* I'm the child laborer */
    work(w);
    exit(0);
  }
}
// Wait until all pieces are finished (until children exit)
for (int i=0;i<n;i++) {
  int ret;
  waitpid(pid[i],&ret,0);
}
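To collect the pieces of the final result, allocate the output array with shared_malloc *before* forking, so the children's writes land in memory the parent can see (a sketch; result is a name I made up):

double *result=(double *)shared_malloc(n*sizeof(double));
...
// child i computes its piece, then does:
//   result[i]= ... answer for the i'th piece ... ;
//   exit(0);
// parent, after waitpid'ing for everybody:
double total=0.0;
for (int i=0;i<n;i++) total+=result[i];
shared_free(result,n*sizeof(double));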
Parallelism Using the Network
There are also several stranger ways to communicate between different
processes, including MPI calls and TCP sockets. We will talk
about the network later this semester.