Load Balancing in OpenMP
CS 641 Lecture, Dr. Lawlor
Here's an application that sums up some data repeatedly. It takes about 0.1s to run.
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) { /* do Ninner iterations of work for outer index i */
    int sum=0;
    for (int j=0;j<Ninner;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    for (unsigned int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}
(Try this in NetRun now!)
Where do we put the #pragma omp? Well, putting it inside runloop
actually slows the program down, to 0.2s! The problem is that a
parallel loop takes a certain amount of time just to start its threads,
so you want the parallelism in the outermost loop if you can. Putting
the #pragma omp on foo's loop speeds us up by a little over 3x on
four cores, which is pretty good.
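Concretely, that placement looks like this (a sketch; the only change from the code above is the pragma on foo's loop, plus a signed loop index, since old OpenMP versions only accept signed loop variables):
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    for (int j=0;j<Ninner;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for /* start the threads once, at the outermost loop */
    for (int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}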
Here's a similar problem, but now the inner loop runs up to i instead
of Ninner. In serial, that's roughly a 2x speedup, since the inner loop
now does half as many iterations on average. But OpenMP only gives
about a 2x speedup on four cores--not very good!
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    for (int j=0;j<i;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    //#pragma omp parallel for
    for (unsigned int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}
(Try this in NetRun now!)
The problem is even worse with this program. Now OpenMP hardly helps at all!
#include <math.h> /* for pow() */
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    int n=(int)(pow(i,4.0)/pow(Nouter,3.0)); /* iteration count grows steeply with i */
    for (int j=0;j<n;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    //#pragma omp parallel for
    for (unsigned int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}
(Try this in NetRun now!)
The problem here is that the different iterations of the loop have
radically different costs: when i is small, n is tiny; when i is
almost at the end, n gets huge. With the default schedule, the thread
that gets the last chunk of i values does nearly all the work, so
there's basically only one thread doing useful work at a time! One fix
is to interleave the work (statically or dynamically), for example with a
"schedule(static,1)". In general,
time to run an OpenMP loop = time to start threads + maximum time for any one thread
The time to start the threads depends on the platform, but it's typically a few dozen microseconds on modern implementations.
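As a concrete sketch, here's the imbalanced loop with the interleaved schedule mentioned above (schedule(dynamic) would be another reasonable choice):
#include <math.h>
enum {Nouter=10000,Ninner=10000};
int arr[Nouter];
int runloop(int i) {
    int sum=0;
    int n=(int)(pow(i,4.0)/pow(Nouter,3.0)); /* work grows steeply with i */
    for (int j=0;j<n;j++)
        sum+=i^j;
    return sum;
}
int foo(void) {
    arr[0]=0;
    /* Deal out iterations round-robin, one at a time, so every thread
       gets a mix of cheap (small i) and expensive (large i) iterations. */
    #pragma omp parallel for schedule(static,1)
    for (int i=0;i<Nouter;i++) {
        arr[i]=runloop(i);
    }
    return arr[0];
}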
False Sharing / Cache Thrashing
OK, similar problem, but now just a single flat loop. An integer ++ takes about 1ns/int, mostly for the load and store.
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return arr[0];
}
(Try this in NetRun now!)
Switch to OpenMP, and I get about 0.5ns/int, about a 2x speedup on a quad-core machine.
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return arr[0];
}
(Try this in NetRun now!)
Switch to incrementing only four values in the array. In serial,
things get a little faster due to cache effects (0.9ns/int). But
with multiple threads, I suddenly get a fairly massive slowdown, to
8ns/int!
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for
    for (unsigned int i=0;i<n;i++) {
        arr[i%4]++; /* every thread hammers the same four adjacent ints */
    }
    return arr[0];
}
(Try this in NetRun now!)
I'm also getting the wrong answer--each of the four counters should
reach a quarter million, but the result is way less than that, because
the processors are overwriting each other's work (the unprotected ++
operations race with one another). Switching the thread scheduling so
that each thread increments a separate element at least gives the right answer:
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for schedule(static,1)
    for (unsigned int i=0;i<n;i++) {
        arr[i%4]++;
    }
    return arr[0];
}
(Try this in NetRun now!)
Rationale: on a quad-core machine with four threads, schedule(static,1) deals out iterations round-robin, so thread t gets exactly the iterations with i%4==t and hence always increments its own element arr[t].
But my performance is still quite bad--2ns/int, still worse than serial code! Why? Cache coherence.
Instead of doing useful work, the CPUs are busy fighting over the one
cache line everybody is using. You can fix this with a rather strange
technique: move each CPU's data farther apart in memory than one cache
line. For example, here we move the integers accessed by each
thread 20 ints apart, which is 80 bytes--bigger than a typical 64-byte cache line.
enum {n=1000*1000};
int arr[n];
int foo(void) {
    arr[0]=0;
    #pragma omp parallel for schedule(static,1)
    for (unsigned int i=0;i<n;i++) {
        arr[(i%4)*20]++; /* each thread's counter is 20 ints (80 bytes) from the next */
    }
    return arr[0];
}
(Try this in NetRun now!)
Just moving the data apart in RAM improves our performance substantially, to 0.7ns/int.
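A less magic-number way to express the same idea is to pad each thread's counter out to its own cache line. Here's a sketch, assuming 64-byte cache lines, the same four threads as above, and C++11 alignas; the struct and array names are just for illustration:
enum {n=1000*1000, nthreads=4};
struct padded_counter {
    alignas(64) int value; /* each counter sits in its own 64-byte cache line */
};
padded_counter counters[nthreads];
int foo(void) {
    #pragma omp parallel for schedule(static,1)
    for (int i=0;i<n;i++) {
        counters[i%nthreads].value++; /* with 4 threads, thread t only touches counters[t] */
    }
    return counters[0].value;
}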
This cache thrashing business is annoying, because:
- For good load balance, work needs to move from heavily loaded processors to more lightly loaded ones.
- But moving data between processors has a cost of its own, especially
when cache lines are shared between processors, so they have to keep
exchanging them.
It's doubly annoying because it happens secretly, behind the
scenes--the programmer doesn't know anything is wrong unless they get
the wrong answer or bad performance, and then it's a mystery why it's
happening. One solution to this, which we'll explore in the next
few weeks: explicit data movement, such as across a network.
Parallelism Generally
Here's a survey of data structures used by scientific simulation
software, which I prepared while in Illinois working on a variety of
simulation codes.
Some applications are quite easy to parallelize. Others... aren't.
- "Naturally Parallel" applications don't need any
communication. For example, Mandelbrot set rendering is naturally
parallel, because each Mandelbrot pixel is a stand-alone problem you
can solve independently of all the other Mandelbrot pixels. GPUs
can handle naturally parallel applications in a single pass. Other
common terms for this are "Trivially Parallel" or "Embarrassingly
Parallel", which make it sound like a bad thing--but parallelism is both natural and good!
- "Neighbor" applications communicate with their immediate
neighbors, and that's it. For example, heat flow in a plate can
be computed based solely on one's immediate neighbors (new value at
arr[i] is a function of the old value of arr[i-1], arr[i], and
arr[i+1]). Neighbor applications naturally fit into the "ghost
exchange" communication pattern, and for large problem sizes usually
can be tweaked to get good performance (see the sketch after this list).
- "Other" applications have a weirder, often collective
communication pattern. Depending on the structure of the problem,
such applications can sometimes have relatively good performance, but
are often network bound.
- "Sequential" applications might have multiple threads or
processes, but they don't have any parallelism--only one thread/process
can do useful work at a time. Many I/O limited applications are
basically sequential, since ten CPUs can wait on one disk no faster
than one CPU! (Solution: add more disks!)
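To illustrate the "neighbor" pattern mentioned above, here's a minimal 1D heat-flow sketch of my own (the names and coefficients are just illustrative): each new value depends only on the old values at i-1, i, and i+1, so one timestep parallelizes cleanly, and in a distributed-memory version the elements at the ends of each processor's piece are the "ghosts" that would have to be exchanged.
enum {N=1000000, steps=100};
float cur[N], nxt[N];
void timestep(void) {
    /* New value at i depends only on old values at i-1, i, i+1,
       so all the interior points can be updated in parallel. */
    #pragma omp parallel for
    for (int i=1;i<N-1;i++) {
        nxt[i]=0.25f*cur[i-1]+0.5f*cur[i]+0.25f*cur[i+1];
    }
    nxt[0]=cur[0]; nxt[N-1]=cur[N-1]; /* fixed boundary values */
}
int foo(void) {
    for (int i=0;i<N;i++) cur[i]=(i==N/2)?1000.0f:0.0f; /* a hot spot in the middle */
    for (int s=0;s<steps;s++) {
        timestep();
        #pragma omp parallel for
        for (int i=0;i<N;i++) cur[i]=nxt[i]; /* copy new values back for the next step */
    }
    return (int)cur[N/2];
}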