Multicore Parallel Programming & Pitfalls
CS 441/641 Lecture, Dr. Lawlor
Let's say I want to print out 20 integers, and for some insane reason I care how fast this happens.
const int n=20;
double start=time_in_seconds();
for (int i=0;i<n;i++) {
std::cout<<"i="<<i<<"\n";
}
double elapsed=time_in_seconds()-start;
double per=elapsed/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";
(Try this in NetRun now!)
This takes 10us per print, which is pretty slow. But it's only using one core.
Right, then: let's put more processors on the job, using the very simple
OpenMP directives. These split the loop iterations across the cores;
on a four-core machine, core 0 gets iterations 0-4, core 1 iterations 5-9,
core 2 iterations 10-14, and core 3 iterations 15-19.
const int n=20;
double start=time_in_seconds();
#pragma omp parallel for /* make this loop multicore! */
for (int i=0;i<n;i++) {
std::cout<<"i="<<i<<"\n";
}
double elapsed=time_in_seconds()-start;
double per=elapsed/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";
(Try this in NetRun now!)
Sadly, this now takes about 28us per print--it got 180% slower!
Here's the other problem: it never quite runs the same way twice. Hmmm.
Six runs (one run per line):
i=i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9 15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
i=i=0 i=1 i=2 i=3 i=4 i=10 i=11 i=12 i=13 i=14 15 i=16 i=17 i=18 i=19 i=5 i=6 i=7 i=8 i=9
i=i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 5 i=6 i=7 i=8 i=9 i=10 i=11 i=12 i=13 i=14
i=i=0 i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9 15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
i=i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 5 i=6 i=7 i=8 i=9 i=10 i=11 i=12 i=13 i=14
i=i=i=0 i=1 i=2 i=3 i=4 5 i=6 i=7 i=8 i=9 15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
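Notice the orphaned "i=" at the front of each run: one thread printed
"i=" and got interrupted mid-line, leaving its bare number (like the
lone "15") to show up later. To see exactly which thread runs which
iterations, you can tag each print with the thread's ID. Here's a
minimal sketch (mine, not the lecture code) using the standard OpenMP
runtime calls omp_get_thread_num() and omp_get_num_threads(); the exact
iteration-to-thread mapping you see depends on your implementation's
default schedule.

#include <cstdio>
#include <omp.h>

int main(void) {
const int n=20;
#pragma omp parallel for
for (int i=0;i<n;i++) {
/* one printf call per line tears less than cout's chained << calls,
   since each << is a separate, separately interruptible operation */
printf("thread %d of %d ran i=%d\n",
       omp_get_thread_num(),omp_get_num_threads(),i);
}
return 0;
}

Outside NetRun, compile with -fopenmp (gcc/clang) to enable the pragmas.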
We can improve matters a little bit by making each entire output line a
"critical section", so it's done as a unit. This makes the lines
a little cleaner, and actually improves the speed a bit, to 24us per
print, only 140% slower than single core.
const int n=20;
double start=time_in_seconds();
#pragma omp parallel for /* make this loop multicore! */
for (int i=0;i<n;i++) {
#pragma omp critical /* do this stuff single-core */
{
std::cout<<"i="<<i<<"\n";
}
}
double elapsed=time_in_seconds()-start;
double per=elapsed/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";
(Try this in NetRun now!)
Five runs (one run per line):
i=5 i=6 i=7 i=8 i=9 i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
i=5 i=6 i=7 i=8 i=9 i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
i=10 i=11 i=12 i=13 i=14 i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 i=5 i=6 i=7 i=8 i=9
i=5 i=6 i=7 i=8 i=9 i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
i=5 i=6 i=7 i=8 i=9 i=0 i=1 i=2 i=3 i=4 i=15 i=16 i=17 i=18 i=19 i=10 i=11 i=12 i=13 i=14
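The critical section fixed the torn lines, but not the ordering--the
five-iteration chunks still finish in whatever order the threads happen
to run. One detail worth knowing: all unnamed critical sections in a
program share a single global lock, so two unrelated criticals can
still serialize each other. OpenMP lets you name them; here's a short
sketch (the names "output" and "stats" are my own) where the two
differently-named sections use independent locks.

#include <cstdio>

int main(void) {
int count=0;
#pragma omp parallel for
for (int i=0;i<20;i++) {
#pragma omp critical(output) /* excludes only other critical(output) */
{
printf("i=%d\n",i);
}
#pragma omp critical(stats) /* different name, so a different lock */
{
count++;
}
}
printf("count=%d\n",count);
return 0;
}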
PROBLEM: Parallel access to any single resource results in
contention. Contention leads to wrong answers, bad performance,
or both at once.
SOLUTION(I): Ignore this, and get the wrong answer, slowly.
SOLUTION(F): Forget parallelism. You get the right answer, but only using one core.
SOLUTION(C): Add a critical section (or a lock, or a mutex, etc) to
control access to the single resource. This gives you the right
answer, but costs performance--the whole point of the critical section
is to reduce parallelism.
SOLUTION(P): Parallelize (or "privatize") all resources. This is
the best solution, but making several copies of hardware or software is
expensive. This is the model that highly scalable software like
MPI recommends: even the main function is parallel! (For a small-scale
example, see the reduction sketch just after this list.)
SOLUTION(H): Hybrid: use any of the above where it's appropriate.
This is the model OpenMP recommends: you start serial, add parallelism
where it makes sense, and privatize or restrict access to shared things
to get the right answer.
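As a small-scale example of SOLUTION(P), OpenMP's reduction clause
privatizes a variable for you: each thread accumulates into its own
private copy, and the copies are combined when the loop ends. This
sketch (my example, not from the lecture) sums the integers 0..n-1
with no critical section and no contention:

#include <iostream>

int main(void) {
const int n=10000;
long sum=0;
/* reduction(+:sum): each thread gets a private sum initialized to 0;
   the private copies are added into the real sum after the loop */
#pragma omp parallel for reduction(+:sum)
for (int i=0;i<n;i++) {
sum+=i;
}
std::cout<<"sum="<<sum<<"\n"; /* prints 49995000 */
return 0;
}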
The same idea works by hand: we can eliminate contention on the single
"cout" stream by writing the results into an array of strings, one slot
per iteration. Generally, arrays or vectors are nice, because multicore
machines do a decent job at simultaneous access to different places in
memory, but do a bad job at simultaneous access to a single shared data
structure like a tree.
const int n=10000;
double start=time_in_seconds();
char results[n][100]; /* array of resulting strings (so each core can write its own)*/
#pragma omp parallel for /* make this loop multicore! */
for (int i=0;i<n;i++) {
sprintf(results[i],"i=%d\n",i);
}
/* concat & print all strings? */
double elapsed=time_in_seconds()-start;
double per=elapsed/n;
std::cout<<"Time: "<<per*1.0e6<<" microseconds per iteration\n";
(Try this in NetRun now!)
This version gets the right answer (assuming you concatenate all the
strings properly), and by running a big enough problem--here 10,000
strings--you can soundly beat the single-core performance, by about
200%.
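For completeness, here's one way to fill in the "concat & print all
strings?" step: a plain serial loop after the parallel one. This is a
sketch of my own, not the lecture's code; the point is that the output
pass touches cout from only one thread, so it needs no critical section.

#include <cstdio>
#include <iostream>

int main(void) {
const int n=10000;
static char results[n][100]; /* static keeps the ~1MB array off the stack */
#pragma omp parallel for /* each iteration writes only its own slot */
for (int i=0;i<n;i++) {
sprintf(results[i],"i=%d\n",i);
}
/* serial output pass: one thread, one stream, no contention */
for (int i=0;i<n;i++) {
std::cout<<results[i];
}
return 0;
}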