Single Instruction, Multiple Data

CS 441/641 Lecture, Dr. Lawlor

"SIMD" stands for Single Instruction Multiple Data:
The basic idea: one instruction operates on multiple data items simultaniously.  This is a form of software parallelism.

You can do lots interesting SIMD work without using any special instructions--plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instructions will operate on all the bits at once.  This use of bitwise operations is often called "SIMD within a register (SWAR)" or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.

Back in the 1980's, "vector" machines were quite popular in supercomputing centers.  For example, the 1988 Cray Y-MP was a typical vector machine.  When I was an undergraduate, ARSC still had a Y-MP vector machine.  The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!).  A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* floating-point operations per second.  Vector machines have now almost completely died out; the NEC SV-1 and the Japanese "Earth Simulator" are the last of this breed.  Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today are the "multimedia" instruction set extensions in normal CPUs.  The usual arrangment for multimedia instructions is for single instructions to operate on four 32-bit floats.  These four-float instructions exist almost everywhere nowdays:
We'll look at the x86 version, SSE, in detail next week.

Branching in SIMD

One big problem in SIMD is branching.  If half the elements in a single SIMD register need one instruction, and half need a different instruction, you can't do them both in a single instruction. 

So the AND-OR trick is used to simulate branches.  The situation where these are useful is when you're trying to convert a loop like this to SIMD:
	for (int i=0;i<n;i++) { 
        if (vec[i]<7)
vec[i]=vec[i]*a+b;
else
vec[i]=c;
}
(Try this in NetRun now!)

You can implement this branch by setting a mask indicating where vals[i]<7, and then using the mask to pick the correct side of the branch to squash.  Note that this code is the *exact* software version of a 2-in mux circuit!
	for (int i=0;i<n;i++) { 
        unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
vec[i]=((vec[i]*a+b)&mask) | (c&~mask);
}
Written in ordinary sequential code, this is actually a slowdown, not a speedup!  There are lots of tricks to speed this up, like building the 0xffFFffFF 'mask' from the sign bit of "vec[i]-7" (negative number indicates vec[i]<7, positive the other way).  Also many SIMD instruction sets, including SSE, have a multi-compare instruction that returns a mask of all zeros or all ones--the implied intent being for you to use the bitwise trick above to avoid any branching!

SIMD Within a Register (SWAR)

SIMD execution can speed things up even without fancy instructions.  For example, the key operation in Conway's Game of Life is to collect up neighbor counts.  We can use SIMD by keeping the three-bit neighbor counts for 64 neighbors stored "vertically" inside corresponding bits of three 64-bit long integers.  The machine's native addition operation doesn't do the right thing, but we can manually build a half adder in software, like this:
/* Conway's Game of Life using SIMD bitwise operations 
Dr. Orion Lawlor, lawlor@alaska.edu, 2011-10-18 (Public Domain)
*/

/* This treats an integer as an array of bits: SIMD Within A Register */
typedef unsigned long swar;

/* This stores an array of 3-bit counters *vertically* */
class set_of_counters {
public:
swar L, M, H; // 3-bit saturating counters, one per *bit*
set_of_counters() {L=M=H=0;}

/* Add one bit from N to each of our counters */
void add(long N) {
// low bit half adder
swar Lcarry=L&N;
L=L^N;

// middle bit half adder
swar Mcarry=M&Lcarry;
M=M^Lcarry;

// last bit saturates
H=H|Mcarry;
}
};
/*
Run the rules for the game of life on these three rows.
prev is the row above you, next is the row below you,
cur is your current row.

A 1 bit indicates a living cell.
*/
swar run_game(swar prev,swar cur,swar next) {
// Each counter stores the number of neighbors around this cell
set_of_counters c;

// Add all eight neighbors to our counts
c.add(prev>>1); c.add(prev); c.add(prev<<1);
c.add(cur>>1); c.add(cur<<1);
c.add(next>>1); c.add(next); c.add(next<<1);

// Run the rules on the resulting counts
long are2=(~c.H)&c.M&(~c.L); // 2==010
long are3=(~c.H)&c.M&c.L; // 3==011
return (are2&cur)|are3; // if 2, unchanged. If 3, alive.
}

// A 2D grid containing game of life cells.
class grid {
public:
enum {ht=30};
swar data[ht];

// Create random initial conditions
void randomize(int seed=1) {
srand(seed);
swar mask=1; mask=~mask;
mask=mask<<1; mask=mask>>1;// knock off high bit
for (int y=0;y<ht;y++) data[y]=rand()&mask;
data[0]=data[ht-1]=0; // clear top and bottom rows
}

// Run one game of life update, writing new cells into dest.
void update(grid &dest) const {
for (int y=1;y<ht-1;y++) {
dest.data[y]=run_game(data[y-1],data[y],data[y+1]);
}
}

// Print the grid onscreen
void print(void) {
for (int y=1;y<ht-1;y++) {
for (unsigned int x=0;x<8*sizeof(swar);x++) {
int bit=(data[y]>>x)&1;
if (bit) std::cout<<"X"; else std::cout<<" ";
}
std::cout<<"\n";
}
}
};

int foo(void) {
grid a,b;
a.randomize(1);

// Run time loop
for (int step=0;step<50;step++) {
std::cout<<"iteration "<<step<<"\n";
a.print();

a.update(b);
std::swap(a,b);
}
return 0;
}
(Try this in NetRun now!)
Compared to a straightforward loop across x, this is much, much faster:
update_swar: 381.17 ns/call
update_bits: 15880.91 ns/call
Program complete. Return 0 (0x0)
(Try this in NetRun now!)
This is a factor of over 40x speedup; not quite the 64x you'd get with perfect SIMD execution on a 64-bit machine, but pretty close!

The downside, of course, is that the software is deeply bizarre; it feels much closer to a circuit design than code!  Unresolved question: how can we design a language/compiler that will allow a computer to make the sequential-to-SIMD code transformation?