Parallel Graphics
CS 481/681 2007 Lecture, Dr. Lawlor
Parallel in General
The basic idea is just to divide up your work into pieces. The
question of "which pieces?" depends on the problem. Usually
there's a natural answer where the pieces are almost totally
independent, and hence can be worked on in parallel. For images,
it's natural to divide up the image into separate regions. For
*any* program that outputs a big array, it's natural to divide up the
output array. For more complicated programs, sometimes it's
tricky to figure out how to make things parallel, but luckily most
graphics tasks are "embarrassingly parallel", or as I prefer to say,
"naturally parallel".
There is one quite common problem with doing parallel work--unequal
work distribution. If two processors are working on a problem,
but one processor ends up with 90% of the work (even if it's only 50%
of the image), then that processor isn't getting much help from the
other processor. In general, to get good performance, you've got
to keep all your CPUs busy doing useful stuff. One trivial but
surprisingly effective technique for this is called "overdecomposition",
where you just make way more pieces than processors. This lets
the OS switch between pieces dynamically, improving the "load balance"
and overall performance.
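For example, here's a minimal sketch of overdecomposition using pthreads (the image size, CPU count, and the fill_rows "work" are all made up for illustration--threads themselves are covered below):
#include <pthread.h>
#include <iostream>

enum {WID=512, HT=512, N_CPUS=2};
enum {N_PIECES=8*N_CPUS}; /* overdecomposition: way more pieces than CPUs */
float img[HT][WID]; /* shared output image */

/* Fill rows [lo,hi) of the image--this stands in for the real rendering work. */
void fill_rows(int lo,int hi) {
	for (int y=lo;y<hi;y++)
		for (int x=0;x<WID;x++)
			img[y][x]=(float)(x*y); /* made-up work */
}

/* pthread worker: each thread handles one piece (one strip of rows) */
void *piece_worker(void *arg) {
	int piece=(int)(long)arg;
	int rows_per=HT/N_PIECES;
	fill_rows(piece*rows_per,(piece+1)*rows_per);
	return 0;
}

int main(void) {
	pthread_t thr[N_PIECES];
	/* One thread per piece: the OS juggles the threads across the CPUs,
	   so a CPU that finishes its pieces early just picks up more. */
	for (int p=0;p<N_PIECES;p++)
		pthread_create(&thr[p],0,piece_worker,(void *)(long)p);
	for (int p=0;p<N_PIECES;p++)
		pthread_join(thr[p],0);
	std::cout<<"Center pixel is "<<img[HT/2][WID/2]<<"\n";
	return 0;
}
Compile with something like "g++ main.cpp -o main -lpthread". With 16 pieces on 2 CPUs, one slow strip doesn't leave the other CPU idle--the OS just hands it one of the remaining pieces.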
One technique I can't recommend is making one thread for each of your
program's major software units. For example, a game might consist
of an AI module, a physics module, and a rendering module. You
could make three threads out of these modules. But there are
several problems with this approach:
- Every interaction between modules becomes an interaction between
unrelated threads. This can get really ugly, for example, if the
render thread is trying to draw stuff while the physics thread is
moving it. So this usually isn't as easy as it first appears.
- Usually one of the modules will require way more CPU time than
the others (the bottleneck). If this module is only one thread,
the machine's other CPUs can't help speed the program up.
- If you've got three modules, you can't run on more than a 3-CPU machine. What are you going to do on a 64-CPU machine?
Instead, I try to split up the work done inside each module into
threads--data decomposition instead of task decomposition. This
lets you support way more CPUs, can give you acceptable load balance,
and is usually a better overall approach.
Threads
See my CS321 lecture
for how to split your program into threads or processes. This is
an easy-but-limiting approach: since threads and processes usually
communicate by sharing memory, the physical hardware you run on MUST
have shared memory. Shared memory is supposedly somewhat easier
to program (I'm not sure about that), but it definitely imposes some
additional hardware cost. In particular, with shared memory you
can't just have a big pile of separate boxes connected with a
network. For that, you need MPI.
MPI
MPI
is a very strange but powerful interface. It's designed to run on
separate machines that share only a network. So MPI actually
calls your "main" routine once on every processor, almost like "main"
is already the thread worker function.
Every MPI program must begin with a call to MPI_Init, and end with a call to MPI_Finalize:
#include <mpi.h>
int main(int argc,char **argv) {
	MPI_Init(&argc,&argv);
	... all work here ...
	MPI_Finalize();
	return 0;
}
If you forget to call MPI_Finalize, under MPICH you'll get weird errors
like this when your program exits--and this is extremely embarrassing if
it happens in class and you don't recognize it!
p17_20106: p4_error: net_recv read: probable EOF on socket: 1
rm_l_17_20107: (0.286554) net_send: could not write to fd=5, errno = 32
bm_list_30320: (2.274217) wakeup_slave: unable to interrupt slave 0 pid 30319
Once you've initialized MPI, you've got to figure out which processor you are, and how many processors exist:
int rank,size;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
std::cout<<"Hello-- I am processor "<<rank<<" of "<<size<<"\n";
You compile your program (say "main.cpp") using mpiCC:
mpiCC main.cpp -o main
Now you can run using mpirun. "-np" controls the number of processes to create.
mpirun -np 2 ./main
My version (on the powerwall in ~olawlor/481_demo/simple_mpi/) gives:
olawlor@powerwall0:~/481_demo/simple_mpi$ mpiCC main.cpp -o main
olawlor@powerwall0:~/481_demo/simple_mpi$ mpirun -np 2 ./main
Hello-- I am processor 0 of 2
Hello-- I am processor 1 of 2
olawlor@powerwall0:~/481_demo/simple_mpi$ mpirun -np 7 ./main
Hello-- I am processor 0 of 7
Hello-- I am processor 6 of 7
Hello-- I am processor 4 of 7
Hello-- I am processor 2 of 7
Hello-- I am processor 3 of 7
Hello-- I am processor 1 of 7
Hello-- I am processor 5 of 7
Based only on the rank and size information, you can usually figure out
what part of the image you should work on. So you do some
computing. You then want to forward
the result to some central processor to be stored in a file.
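For example, each processor might claim a band of image rows based only on its rank and size (a minimal sketch; the image height HT is made up):
int HT=512; /* image height (made up for this sketch) */
int rows_per=(HT+size-1)/size; /* round up so every row is covered */
int start=rank*rows_per;
int end=start+rows_per; if (end>HT) end=HT;
std::cout<<"Processor "<<rank<<" handles rows ["<<start<<","<<end<<")\n";
/* ... compute rows [start,end) here ... */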
Unlike with threads, you can't just stuff your results into a big
shared data structure, nor can the central processor just pull the data
out of your arrays. Instead, you usually communicate your results
using sends and receives...
MPI Sends and Receives
MPI_Send
takes six parameters: a pointer to the data to send, the number of
bytes to send, MPI_BYTE, the processor to send the data to, a "tag"
used to match sends and receives (make up a number here, like zero),
and MPI_COMM_WORLD.
MPI_Send(&myWork,sizeof(myWork),MPI_BYTE, bossRank, tag, MPI_COMM_WORLD);
MPI_Recv
is the other side of send. It takes the same six parameters,
except now the pointer is where to store the incoming data, the length
is the maximum number of bytes to accept, and the rank is the processor
to receive from. It's also got a
(silly) "MPI_Status" tacked on at the end, which can be used to figure
out the message's sender, tag, and length if you don't know them.
MPI_Status stat;
MPI_Recv(&hisWork,sizeof(hisWork),MPI_BYTE,
hisRank, tag, MPI_COMM_WORLD,&stat);
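For example, if you receive with MPI_ANY_SOURCE and MPI_ANY_TAG, the status is the only way to find out who the message came from and how long it is (a sketch; buf is assumed to be a char array big enough to hold the message):
MPI_Status stat;
MPI_Recv(buf,sizeof(buf),MPI_BYTE,
	MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,&stat);
int len; MPI_Get_count(&stat,MPI_BYTE,&len); /* actual bytes received */
std::cout<<"Got "<<len<<" bytes from processor "<<stat.MPI_SOURCE
	<<" with tag "<<stat.MPI_TAG<<"\n";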
You MUST make sure every MPI_Send has a matching MPI_Recv in your programs. Sometimes this is tricky!
For example, to have the boss receive one piece of work from each minion, the boss would have to loop over minions like so:
/* send all our data to the boss processor */
int tag=1234;
int boss=0;
if (rank==boss) { /* I'm the boss */
	for (int minion=0;minion<size;minion++)
		if (minion!=boss) {
			MPI_Status stat;
			char c;
			MPI_Recv(&c,sizeof(c),MPI_BYTE,
				minion, tag, MPI_COMM_WORLD,&stat);
			std::cout<<"Minion "<<minion<<" computed the value '"<<c<<"'.\n";
		}
} else { /* I'm a lowly minion */
	char myWork='I';
	MPI_Send(&myWork,sizeof(myWork),MPI_BYTE,
		boss, tag, MPI_COMM_WORLD);
}
(You can also use the slightly faster and cleaner MPI_Gather for this if all the minions' results are the same size.)
See the example on the powerwall in ~olawlor/cs481_demo/sendrecv/.
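If all the results really are the same size, a gather version of the boss/minion code above might look like this (a sketch; boss, rank, and size are as before, and you'll need #include <vector>):
char myWork='I'; /* every rank computes one char, including the boss */
std::vector<char> all(size); /* gathered results; only the boss reads these */
MPI_Gather(&myWork,1,MPI_BYTE,
	&all[0],1,MPI_BYTE, boss,MPI_COMM_WORLD);
if (rank==boss)
	for (int p=0;p<size;p++)
		std::cout<<"Processor "<<p<<" computed the value '"<<all[p]<<"'.\n";
Note that MPI_Gather is a collective: every processor, boss included, calls the same line, and the boss's own value lands in all[boss] along with everybody else's.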