The combination of pipelining and superscalar execution is designed to extract maximum runtime parallelism from sequential machine code. Through a combination of branch prediction, register renaming, and operand forwarding, this is remarkably effective, able to finish up to a half-dozen instructions per clock cycle for real code. The problem is there just isn't much more parallelism available in sequential code.
One solution is to change the code.
For example, we could keep each CPU the same, and just stick several of them together onto the same chip. This is called multicore.
First, to get into the proper revolutionary mindset, read this now decade-old but prescient article:
The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software
written by Herb Sutter, a smart Microsoft guy on the C++ standards committee
Notable quotes:
C++11 added a thread library, in <thread>. This makes it easy to create threads in a portable way, without needing to call Windows kernel threads or UNIX pthreads.
#include <iostream>
#include <thread>

void do_work(char where) {
    for (int i=0;i<10;i++) {
        std::cout<<where<<i<<"\n";
    }
}

void worker_thread(void) {
    do_work('B');
}

void foo(void) {
    std::thread t(worker_thread); // start a second thread running B's work
    do_work('A');                 // meanwhile, this thread does A's work
    t.join();                     // wait for the B thread to finish
    std::cout<<"Done!\n";
}

int main(void) {
    foo();
    return 0;
}
Notice that in the above program, the prints from A and B run in an arbitrary order, and sometimes overlap.
PROBLEM: Parallel access to any single resource, like cout, results in resource contention. Contention leads to wrong answers, bad performance, or both at once.
SOLUTION(I): Ignore this, and get the wrong answer, slowly. You can also occasionally crash onstage.
SOLUTION(F): Forget parallelism. You get the right answer, but only using one core.
SOLUTION(A): Use an atomic operation to access the resource. This is typically only possible for single-int or single-float operations, and has some overhead on CPUs, but it's possible for the hardware to make this efficient, like on most GPUs.
SOLUTION(C): Add a mutex, a mutual exclusion device (AKA lock), to control access to the single resource. This gives you the right answer, but costs performance--the whole point of the critical section is to reduce parallelism. (Both this and the atomic approach are sketched in code after this list.)
SOLUTION(P): Parallelize (or "privatize") all resources--then there aren't any sharing problems because nothing is shared. This is the best solution, but making several copies of hardware or software can be expensive. This is the model that highly scalable software like MPI recommends: even the main function is parallel!
SOLUTION(H): Hybrid: use any of the above where it's appropriate. This is the model OpenMP recommends: you start serial, add parallelism where it makes sense, and privatize or restrict access to shared things to get the right answer.
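For example, here's a minimal sketch of SOLUTION(A) and SOLUTION(C) together, using C++11's std::atomic and std::mutex (the names total, print_lock, and safe_print are invented for this example):

#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>

std::atomic<int> total(0); // SOLUTION(A): atomic increments can't collide
std::mutex print_lock;     // SOLUTION(C): mutex serializes access to cout

void safe_print(char where,int i) {
    std::lock_guard<std::mutex> guard(print_lock); // lock released when guard goes out of scope
    std::cout<<where<<i<<"\n";
}

void do_work(char where) {
    for (int i=0;i<10;i++) {
        total++;             // atomic: no lock needed for this single int
        safe_print(where,i); // mutex: only one thread prints at a time
    }
}

int main(void) {
    std::thread t(do_work,'B');
    do_work('A');
    t.join();
    std::cout<<"total="<<total<<"\n"; // always 20
    return 0;
}

Built with something like g++ -std=c++11 -pthread, the atomic total comes out to 20 every time, and the prints no longer tear in the middle of a line--but every lock acquisition costs time.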
For example, we can eliminate contention on the single "cout" variable by writing the results into an array of strings. Generally, arrays or vectors are nice, because multicore machines do a decent job at simultaneous access to different places in memory, but do a bad job at simultaneous access to a single shared data structure like a file or network device.
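Here's a sketch of that privatized version (the results vector is invented for this example): each thread builds its output in its own private string, and only the serial code at the end touches cout.

#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

void do_work(char where,std::string *out) {
    std::ostringstream s; // private buffer: nothing shared, so no contention
    for (int i=0;i<10;i++) s<<where<<i<<"\n";
    *out=s.str();
}

int main(void) {
    std::vector<std::string> results(2); // one private slot per thread
    std::thread t(do_work,'B',&results[1]);
    do_work('A',&results[0]);
    t.join();
    std::cout<<results[0]<<results[1]<<"Done!\n"; // only serial code touches cout
    return 0;
}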
Because threaded programming is so ugly and tricky, there's a simple loop-oriented language extension out there called OpenMP, designed to make it substantially easier to write multithreaded code.
The basic idea is you take what looks like an ordinary sequential loop, like:
for (int i=0;i<n;i++) do_fn(i);
And you add a little note to the compiler saying it's a parallel for loop, so if you've got four CPUs, the iterations should be spread across the CPUs. The particular syntax they chose is a "#pragma" statement, with the "omp" prefix:
#pragma omp parallel for num_threads(4)
for (int i=0;i<n;i++) do_fn(i);
Granted, this line has like a 5,000ns/thread overhead, so it won't help tiny loops, but it can really help long loops. After the for loop completes, all the threads go back to waiting, except the master thread, which continues with the (otherwise serial) program. Note that this is still shared-memory threaded programming, so global variables are still (dangerously) shared by default!
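For example, a shared accumulator is exactly the kind of variable that goes wrong by default. One fix is OpenMP's reduction clause, sketched here with an invented sum loop: each thread gets a private copy of sum, and the copies are combined after the loop.

#include <iostream>

int main(void) {
    const int n=1000000;
    double sum=0.0;
    // Without reduction(+:sum), all threads would race on the shared "sum"
    // and the answer would (usually) come out wrong.
    #pragma omp parallel for reduction(+:sum)
    for (int i=0;i<n;i++) {
        sum+=1.0/(i+1);
    }
    std::cout<<"sum="<<sum<<"\n";
    return 0;
}

This also builds without OpenMP, since a compiler that doesn't recognize the pragma just ignores it--the loop then runs serially, which is still correct.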
If your compiler supports OpenMP 4.0 (e.g., gcc 4.9 or later), you can ask the compiler to use SIMD instructions in your loops like this, although at the moment it doesn't seem to combine with threads.
#pragma omp simd
for (int i=0;i<n;i++) arr[i]=3*src[i];
Unlike bare threads, with OpenMP:
Here's how you enable OpenMP in various compilers. Visual C++ 2005 & later (but NOT express!), Intel C++ version 9.0, and gcc version 4.2 all support OpenMP, although earlier versions do not!
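The exact switch varies by compiler; for example (file names here are placeholders):

g++ -fopenmp my_program.cpp -o my_program      (gcc/g++)
cl /openmp my_program.cpp                      (Visual C++, from a developer command prompt)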
Here's the idiomatic OpenMP program: slap "#pragma omp parallel for" in front of your main loop. You're done!
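A complete version might look something like this sketch, reusing the do_fn placeholder from above:

#include <iostream>
#include <vector>

// Stand-in for whatever per-iteration work you actually do:
void do_fn(int i,std::vector<double> &out) {
    out[i]=i*0.001;
}

int main(void) {
    int n=1000000;
    std::vector<double> out(n);
    #pragma omp parallel for
    for (int i=0;i<n;i++) do_fn(i,out);
    std::cout<<"out["<<n-1<<"]="<<out[n-1]<<"\n";
    return 0;
}

Each iteration writes only to its own out[i], so there's nothing shared for the threads to fight over.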
Here's a more complex "who am I"-style OpenMP program from Lawrence Livermore National Labs. Note the compiler "#pragma" statements!
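A minimal sketch in the same spirit (not the LLNL code itself) asks each thread who it is with omp_get_thread_num(), from <omp.h>:

#include <cstdio>
#include <omp.h>

int main(void) {
    #pragma omp parallel // fork a team of threads
    {
        int tid=omp_get_thread_num(); // "who am I?"
        printf("Hello from thread %d\n",tid);
        if (tid==0) { // only the master thread reports the team size
            printf("Number of threads = %d\n",omp_get_num_threads());
        }
    } // all threads join back up with the master thread here
    return 0;
}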
CS 441 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.