Single Instruction, Multiple Data

CS 441/641 Lecture, Dr. Lawlor

"SIMD" stands for Single Instruction Multiple Data:

Single, meaning just one.
Instruction, as in a machine code instruction, executed by hardware.
Multiple, as in more than one--from 2 to a thousand or so.
Data, as in floats or ints.

The basic idea: one instruction operates on multiple data items simultaniously. This is a form of software parallelism.

You can do lots interesting SIMD work without using any special instructions--plain old C will do, if you treat an "int" as 32 completely independent bits, because any normal "int" instructions will operate on all the bits at once. This use of bitwise operations is often called "SIMD within a register (SWAR)" or "word-SIMD"; see Sean Anderson's "Bit Twiddling Hacks" for a variety of amazing examples.

Back in the 1980's, "vector" machines were quite popular in supercomputing centers. For example, the 1988 Cray Y-MP was a typical vector machine. When I was an undergraduate, ARSC still had a Y-MP vector machine. The Y-MP had eight "vector" registers, each of which held 64 doubles (that's a total of 4KB of registers!). A single Y-MP machine language instruction could add all 64 corresponding numbers in two vector registers, which enabled the Y-MP to achieve the (then) mind-blowing speed of *millions* floating-point operations per second. Vector machines have now almost completely died out; the NEC SV-1 and the Japanese "Earth Simulator" are the last of this breed. Vector machines are classic SIMD, because one instruction can modify 64 doubles.

But the most common form of SIMD today are the "multimedia" instruction set extensions in normal CPUs. The usual arrangment for multimedia instructions is for single instructions to operate on four 32-bit floats. These four-float instructions exist almost everywhere nowdays:

x86, where they are called "SSE" (Streaming SIMD Extensions) or "AVX" (Advanced Vector eXtensions)
PowerPC, where they are called "AltiVec"
Cell Broadband Engine, where they are part of the "Synergistic Processing Elements"
Graphics cards, where they are part of the pixel processors

We'll look at the x86 version, SSE, in detail next week.

Branching in SIMD

One big problem in SIMD is branching. If half the elements in a single SIMD register need one instruction, and half need a different instruction, you can't do them both in a single instruction.

So the AND-OR trick is used to simulate branches. The situation where these are useful is when you're trying to convert a loop like this to SIMD:

	for (int i=0;i<n;i++) { 
	        if (vec[i]<7) 
			vec[i]=vec[i]*a+b;
		else
			vec[i]=c;
	}

(Try this in NetRun now!)

You can implement this branch by setting a mask indicating where vals[i]<7, and then using the mask to pick the correct side of the branch to squash. Note that this code is the *exact* software version of a 2-in mux circuit!

	for (int i=0;i<n;i++) { 
	        unsigned int mask=(vec[i]<7)?0xffFFffFF:0;
		vec[i]=((vec[i]*a+b)&mask) | (c&~mask);
	}

Written in ordinary sequential code, this is actually a slowdown, not a speedup! There are lots of tricks to speed this up, like building the 0xffFFffFF 'mask' from the sign bit of "vec[i]-7" (negative number indicates vec[i]<7, positive the other way). Also many SIMD instruction sets, including SSE, have a multi-compare instruction that returns a mask of all zeros or all ones--the implied intent being for you to use the bitwise trick above to avoid any branching!

SIMD Within a Register (SWAR)

SIMD execution can speed things up even without fancy instructions. For example, the key operation in Conway's Game of Life is to collect up neighbor counts. We can use SIMD by keeping the three-bit neighbor counts for 64 neighbors stored "vertically" inside corresponding bits of three 64-bit long integers. The machine's native addition operation doesn't do the right thing, but we can manually build a half adder in software, like this:

/* Conway's Game of Life using SIMD bitwise operations 
Dr. Orion Lawlor, lawlor@alaska.edu, 2011-10-18 (Public Domain)
*/

/* This treats an integer as an array of bits: SIMD Within A Register */
typedef unsigned long swar; 

/* This stores an array of 3-bit counters *vertically* */
class set_of_counters {
public:
	swar L, M, H; // 3-bit saturating counters, one per *bit*
	set_of_counters() {L=M=H=0;}

	/* Add one bit from N to each of our counters */
	void add(long N) {
		// low bit half adder
		swar Lcarry=L&N;
		L=L^N;
		
		// middle bit half adder
		swar Mcarry=M&Lcarry;
		M=M^Lcarry;
		
		// last bit saturates
		H=H|Mcarry; 
	}
};
/*
 Run the rules for the game of life on these three rows.
 prev is the row above you, next is the row below you,
 cur is your current row.

 A 1 bit indicates a living cell.
*/
swar run_game(swar prev,swar cur,swar next) {
// Each counter stores the number of neighbors around this cell
	set_of_counters c;

// Add all eight neighbors to our counts
	c.add(prev>>1);	c.add(prev); c.add(prev<<1);
	c.add(cur>>1);               c.add(cur<<1);
	c.add(next>>1); c.add(next); c.add(next<<1);

// Run the rules on the resulting counts
	long are2=(~c.H)&c.M&(~c.L); // 2==010
	long are3=(~c.H)&c.M&c.L; // 3==011
	return (are2&cur)|are3; // if 2, unchanged.  If 3, alive.
}

// A 2D grid containing game of life cells.
class grid {
public:
	enum {ht=30};
	swar data[ht];
	
	// Create random initial conditions
	void randomize(int seed=1) {
		srand(seed);
		swar mask=1; mask=~mask; 
		mask=mask<<1; mask=mask>>1;// knock off high bit
		for (int y=0;y<ht;y++) data[y]=rand()&mask;
		data[0]=data[ht-1]=0; // clear top and bottom rows
	}
	
	// Run one game of life update, writing new cells into dest.
	void update(grid &dest) const {
		for (int y=1;y<ht-1;y++) {
			dest.data[y]=run_game(data[y-1],data[y],data[y+1]);
		}
	}

	// Print the grid onscreen
	void print(void) {
		for (int y=1;y<ht-1;y++) {
			for (unsigned int x=0;x<8*sizeof(swar);x++) {
				int bit=(data[y]>>x)&1;
				if (bit) std::cout<<"X"; else std::cout<<" ";
			}
			std::cout<<"\n";
		}
	}
};	

int foo(void) {
	grid a,b;
	a.randomize(1);
	
// Run time loop 
	for (int step=0;step<50;step++) {
		std::cout<<"iteration "<<step<<"\n";
		a.print();

		a.update(b);
		std::swap(a,b);
	}
	return 0;
}
(Try this in NetRun now!)

Compared to a straightforward loop across x, this is much, much faster:

update_swar: 381.17 ns/call
update_bits: 15880.91 ns/call
Program complete.  Return 0 (0x0)
(Try this in NetRun now!)

This is a factor of over 40x speedup; not quite the 64x you'd get with perfect SIMD execution on a 64-bit machine, but pretty close!

The downside, of course, is that the software is deeply bizarre; it feels much closer to a circuit design than code! Unresolved question: how can we design a language/compiler that will allow a computer to make the sequential-to-SIMD code transformation?