Modern Parallelism: Superscalar, SMP, SMT, and SIMD

CS 641 Lecture, Dr. Lawlor

Like lawn fertilizer, I'm going to adopt a four-number computer classification system.
(There are lots of similar taxonomies out there; the most common is Flynn's SISD/SIMD/MIMD.)

Each entry below lists its T-I-D-A numbers (Threads - Instructions - Data items - Arithmetic units), its name, and a description.

Sequential Programming Model

1-1-1-1: Sequential
A sequential CPU reads a single instruction, pulls its data, and executes it.  This is the model programmers have been using since 1948.
1-p-p-1: Pipelined
Here we've split the CPU's execution path into p pipeline stages.  You need at least p instructions, and at least p independent data items.  Only one arithmetic unit actually completes a result each clock cycle, though.  The point of pipelining a CPU is to do the fetching, decoding, and operand reads in parallel with actual execution.  Typical pipelines are 3 to 30 stages long, with about 10 stages being common today.
1-k-k-k: Superscalar
A superscalar CPU simultaneously fetches a set of k instructions, reads all their data, and executes them all at once.  This only works if all k instructions and data are utterly independent, which is rare, so k is typically 2 to 4 for real CPUs.  In the search for more independent instructions, superscalar machines typically need register renaming, out-of-order execution, and sophisticated branch prediction.
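To see the kind of independence superscalar hardware hunts for, here's a minimal C++ sketch (the variables and constants are just for illustration): the four adds in the first group have no dependencies on each other and can all issue at once, while the chained adds in the second group must each wait for the previous result.

#include <iostream>

int main(void) {
	float a=1.0f, b=2.0f, c=3.0f, d=4.0f;

	// Independent: no result feeds another, so several adds can be in flight per clock.
	float s0=a+1.0f, s1=b+1.0f, s2=c+1.0f, s3=d+1.0f;

	// Dependent chain: each add needs the one before it,
	// so the extra arithmetic units sit idle.
	float t=a;
	t=t+b;
	t=t+c;
	t=t+d;

	std::cout<<s0+s1+s2+s3<<" "<<t<<"\n";
	return 0;
}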
Multithreaded Programming Model

s-1-s-s: SMP (Multicore)
This is just s replicated copies of a single CPU.  Unlike superscalar, multicore requires the programmer to specify s independent execution threads, but the benefit is the CPU doesn't need to do dependency analysis, so s can reach into dozens or even hundreds.
h-1-h-1: SMT (Hyperthreading)
Here we have h replicated registers and decoders, but they all share a single set of arithmetic units.  This has the same programming model as multicore, but the hardware is cheaper.
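Both multicore and SMT present the same programming model: the programmer hands the machine s independent threads.  Here's a minimal sketch using C++11 std::thread; the work function and loop count are placeholders, and hardware_concurrency() typically reports cores times SMT ways.

#include <iostream>
#include <thread>
#include <vector>

// Each thread runs its own independent chunk of work.
void work(int id) {
	double sum=0.0;
	for (long i=0;i<1000000;i++) sum+=i*0.001;
	// Output lines from different threads may interleave.
	std::cout<<"thread "<<id<<" sum="<<sum<<"\n";
}

int main(void) {
	unsigned s=std::thread::hardware_concurrency(); // cores * SMT ways
	if (s==0) s=4; // fallback if the count isn't reported
	std::vector<std::thread> threads;
	for (unsigned i=0;i<s;i++) threads.emplace_back(work,i);
	for (std::thread &t:threads) t.join();
	return 0;
}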
SIMD Programming Model

1-1-m-m: SIMD
Single Instruction Multiple Data: the programmer issues a single instruction, like "addps", and it runs several data items through a set of arithmetic units (see the sketch after this section).  The advantage is fewer instructions, which means less work fetching and decoding.
1-1-v-1: Vector
Vector machines, such as the Cray Y-MP, could add two whole vector registers with a single instruction.  But unlike full SIMD, the hardware had only one arithmetic circuit, so the vector's values had to go through it one at a time.  Again, this is the same programming model as SIMD, but the hardware is cheaper.
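Here's the "addps" case from the SIMD entry above as a minimal C++ sketch using SSE intrinsics (this assumes an x86 machine and a compiler providing <xmmintrin.h>; the array values are arbitrary).  The single _mm_add_ps call compiles down to one addps instruction that performs four float additions at once.

#include <iostream>
#include <xmmintrin.h>

int main(void) {
	float a[4]={1.0f,2.0f,3.0f,4.0f};
	float b[4]={10.0f,20.0f,30.0f,40.0f};
	float c[4];

	__m128 va=_mm_loadu_ps(a);   // load 4 floats
	__m128 vb=_mm_loadu_ps(b);
	__m128 vc=_mm_add_ps(va,vb); // one instruction, four additions ("addps")
	_mm_storeu_ps(c,vc);

	for (int i=0;i<4;i++) std::cout<<c[i]<<" ";
	std::cout<<"\n";
	return 0;
}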

Any of the above techniques can be combined.  For example, IBM's recent Power7 chips are 8-way multicore, 4-way SMT, 4-way SIMD (VMX/AltiVec), and deeply pipelined.  This means hundreds of floating-point numbers can be in flight at once.
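As a rough illustration of how the techniques stack (a generic x86 sketch, not Power7/AltiVec code): the outer threads exploit multicore and SMT, each thread's SSE intrinsics exploit SIMD, and the pipelined, superscalar core overlaps the independent loop iterations.

#include <thread>
#include <vector>
#include <xmmintrin.h>

// Each thread adds 1.0 to its own slice of the array, 4 floats per instruction.
void add_slice(float *data,long start,long end) {
	__m128 one=_mm_set1_ps(1.0f);
	for (long i=start;i<end;i+=4) {
		__m128 v=_mm_loadu_ps(&data[i]);
		_mm_storeu_ps(&data[i],_mm_add_ps(v,one));
	}
}

int main(void) {
	const long n=1<<20; // 1M floats, a multiple of 4
	std::vector<float> data(n,3.0f);

	unsigned s=std::thread::hardware_concurrency();
	if (s==0) s=4;
	long per=n/s; per-=per%4; // keep each slice a multiple of 4 floats

	std::vector<std::thread> threads;
	for (unsigned t=0;t<s;t++) {
		long start=t*per;
		long end=(t==s-1)?n:start+per; // last thread takes the remainder
		threads.emplace_back(add_slice,data.data(),start,end);
	}
	for (std::thread &th:threads) th.join();
	return 0;
}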

As another example, a modern graphics processing unit (GPU) is just a set of well-pipelined SIMD vector processors (typically 16-32 floats per instruction), using hyperthreading (typically up to hundreds of threads per processor), replicated in multicore fashion (2 to 32 cores).  The net result is that thousands of arithmetic circuits can be working on hundreds of thousands of floats at once.


The single-threaded techniques above can also be arranged by instructions, data items, and arithmetic circuits:

One instruction at a time:
                   One Data                     Many Data Items
   One Circuit     Sequential CPU               Vector
   Many Circuits   Replicated Arithmetic        SIMD

Many instructions simultaneously:
                   One Data                     Many Data Items
   One Circuit     Bottleneck (e.g., branch)    Pipelined
   Many Circuits   ??                           Superscalar