This simple C++ program loops over an array of floats, and does some work on each float:
enum {n=1024};
float arrIn[n];
float arrOut[n];

int foo(void) {
	for (int i=0;i<n;i++)
		arrOut[i]=arrIn[i]*2.1+1.2;
	return 3;
}
Currently, it takes over 2000 nanoseconds per run to process the 1024 floats. This is slower than you might expect. Examining the generated assembly, you see this inner loop:
1f:	f3 0f 10 14 02    	movss    xmm2,DWORD PTR [rdx+rax*1]
24:	0f 5a d2          	cvtps2pd xmm2,xmm2
27:	f2 0f 59 d1       	mulsd    xmm2,xmm1
2b:	f2 0f 58 d0       	addsd    xmm2,xmm0
2f:	f2 0f 12 d2       	movddup  xmm2,xmm2
33:	66 0f 5a d2       	cvtpd2ps xmm2,xmm2
37:	f3 0f 11 14 01    	movss    DWORD PTR [rcx+rax*1],xmm2
3c:	48 83 c0 04       	add      rax,0x4
40:	48 3d 00 10 00 00 	cmp      rax,0x1000
46:	75 d7             	jne      1f
Note the conversion from "ss" (scalar single-precision float) to "sd" (scalar double-precision). This is caused by the floating-point constants like 2.1, which have type "double" by default: each float gets converted up to double, the arithmetic is done in double, and the result gets converted back down to float. We can speed the program up by over twofold(!) by adding the "f" suffix to the constants, making them floats instead of doubles and eliminating the sd <-> ss conversions.
arrOut[i]=arrIn[i]*2.1f+1.2f;
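Putting the fix together, the whole function is just the original code with the two constants changed:

enum {n=1024};
float arrIn[n];
float arrOut[n];

int foo(void) {
	for (int i=0;i<n;i++)
		arrOut[i]=arrIn[i]*2.1f+1.2f; /* "f" suffix keeps the math in single precision */
	return 3;
}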
For this new version, the inner loop is now very clean, similar to what we might write by hand:
1f:	f3 0f 10 04 02    	movss  xmm0,DWORD PTR [rdx+rax*1]
24:	f3 0f 59 c2       	mulss  xmm0,xmm2
28:	f3 0f 58 c1       	addss  xmm0,xmm1
2c:	f3 0f 11 04 01    	movss  DWORD PTR [rcx+rax*1],xmm0
31:	48 83 c0 04       	add    rax,0x4
35:	48 3d 00 10 00 00 	cmp    rax,0x1000
3b:	75 e2             	jne    1f
It is possible to speed this up even further by using "ps" (packed single-precision) instructions to operate on four floats at once, as discussed last lecture. We can start to understand this process by writing the single-float loop in assembly language. This is exactly as fast as the C++ version above, 1100 nanoseconds per run.
mov rcx,0 ; i
movss xmm10,[ten]
movss xmm2,[mulConst]
start:
movss xmm0,[arrIn + 4*rcx]
addss xmm0,xmm10
mulss xmm0,xmm2
movss [arrOut + 4*rcx],xmm0
add rcx,1
cmp rcx,1024
jl start
ret

section .data
arrIn:
times 1024 dd 3.4
arrOut:
times 1024 dd 0.0
ten:
dd 10.0
mulConst:
dd 2.1234566
The "times 1024" prefix indicates the following instruction should be repeated that many times.
To convert this serial loop to operate on four floats at once, we need to:
- change the "ss" (scalar) instructions to their "ps" (packed) versions: movss becomes movaps, addss becomes addps, and mulss becomes mulps;
- advance the loop counter by 4 floats per iteration instead of 1;
- align the arrays to 16 bytes, since the fast movaps load requires an aligned address;
- store 4 copies of each constant, since the packed instructions read a full 16-byte register.
Here is the converted loop; the changes follow the list above. We don't need to change the register names in SSE. Because I'm using the fast aligned loads, this takes almost 1/4 the time, 290 ns per run!
mov rcx,0 ; i
movaps xmm10,[ten]      ; 4 copies of 10.0
movaps xmm2,[mulConst]  ; 4 copies of the multiplier
start:
movaps xmm0,[arrIn + 4*rcx]   ; load 4 aligned floats
addps xmm0,xmm10
mulps xmm0,xmm2
movaps [arrOut + 4*rcx],xmm0  ; store 4 aligned floats
add rcx,4   ; advance by 4 floats per iteration
cmp rcx,1024
jl start
ret

section .data
align 16    ; movaps requires 16-byte aligned addresses
arrIn:
times 1024 dd 3.4
arrOut:
times 1024 dd 0.0
ten:
times 4 dd 10.0        ; one copy per SSE lane
mulConst:
times 4 dd 2.1234566
It's easy to get these conversion steps wrong: forgetting the "align 16" (so movaps segfaults), leaving the loop increment at 1 instead of 4, or storing only one copy of a constant instead of four. I see these bugs year after year, sometimes even in my own code!
In other words, it's assembly: it's trickier than it looks!
Since nobody likes writing in assembly (even me!), the x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>. This header exists and works out-of-the-box with most modern compilers:
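For example, a minimal program using the header looks like this. This is a small sketch of my own (not from the original lecture), using only the intrinsics documented below, that just adds 10.0 to four floats:

#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_add_ps, ... */

float out[4];
int foo(void) {
	__m128 a=_mm_setr_ps(1.0f,2.0f,3.0f,4.0f); /* four floats in one register */
	float ten=10.0f;
	__m128 b=_mm_load1_ps(&ten);               /* 10.0 copied into all four lanes */
	_mm_storeu_ps(out,_mm_add_ps(a,b));        /* out = {11,12,13,14} */
	return 0;
}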
Documentation can be found by just reading the xmmintrin header. The underlying instructions have different names, which are listed in the x86 reference manual and summarized below (and in various online SSE intrinsics guides). Keep in mind that the names of these things are absurdly ugly; I take this as a hint that you should wrap them in a nicer interface, such as a "floats" or "vec4" class (see below!).
The C version of an SSE register is the type "__m128". gcc also provides built-in operator overloads for arithmetic on __m128's. All the intrinsics start with "_mm_", and the suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats). SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot. So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc. Major useful operations are:
Intrinsic | What it does
__m128 _mm_load_ps(float *src) | Load 4 floats from a 16-byte aligned address. WARNING: Segfaults if the address isn't a multiple of 16!
__m128 _mm_loadu_ps(float *src) | Load 4 floats from an unaligned address (4x slower on older machines, same speed on the latest machines *if* the data happens to be aligned)
__m128 _mm_load1_ps(float *src) | Load 1 individual float into all 4 fields of an __m128
__m128 _mm_setr_ps(float a,float b,float c,float d) | Load 4 separate floats from parameters into an __m128
void _mm_store_ps(float *dest,__m128 src) | Store 4 floats to a 16-byte aligned address
void _mm_storeu_ps(float *dest,__m128 src) | Store 4 floats to an unaligned address
__m128 _mm_add_ps(__m128 a,__m128 b) | Add corresponding floats (also "sub")
__m128 _mm_mul_ps(__m128 a,__m128 b) | Multiply corresponding floats (also "div", but it's slow)
__m128 _mm_min_ps(__m128 a,__m128 b) | Take corresponding minimum (also "max")
__m128 _mm_sqrt_ps(__m128 a) | Take square roots of 4 floats (12ns, slow like divide)
__m128 _mm_rcp_ps(__m128 a) | Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!)
__m128 _mm_rsqrt_ps(__m128 a) | Rough (12-bit) reciprocal-square-root of all 4 floats (fast)
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi,_MM_SHUFFLE(hi3,hi2,lo1,lo0)) | Interleave inputs into the low 2 floats and high 2 floats of the output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats. (Worked example below.)
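As a worked example of _mm_shuffle_ps (my own sketch, not code from the lecture), here is a "horizontal sum" that adds the four floats of a __m128 down to one float; _mm_cvtss_f32 just extracts the low float:

#include <xmmintrin.h>

/* Sum the 4 floats in v down to a single float using shuffles. */
float horizontal_sum(__m128 v) {
	/* swap neighbors: {v1,v0,v3,v2}, then add pairwise */
	__m128 sums=_mm_add_ps(v,_mm_shuffle_ps(v,v,_MM_SHUFFLE(2,3,0,1)));
	/* swap the two halves and add again: every lane now holds the total */
	sums=_mm_add_ps(sums,_mm_shuffle_ps(sums,sums,_MM_SHUFFLE(1,0,3,2)));
	return _mm_cvtss_f32(sums); /* extract the low float */
}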
There are also instructions for integer conversion, comparisons, various bitwise operations, and cache-friendly prefetches and streaming store operations.
So take your classic "loop over floats":
for (int i=0;i<n_vals;i++) {
vals[i]=vals[i]*a+b;
}
This takes about 4.5 ns per float.
Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:
for (int i=0;i<n_vals;i+=4) {
vals[i+0]=vals[i+0]*a+b;
vals[i+1]=vals[i+1]*a+b;
vals[i+2]=vals[i+2]*a+b;
vals[i+3]=vals[i+3]*a+b;
}
This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.
We can then replace the guts of the loop with SSE instructions:
__m128 SSEa=_mm_load1_ps(&a); /* a copied into all 4 lanes */
__m128 SSEb=_mm_load1_ps(&b); /* b copied into all 4 lanes */
__m128 v=_mm_load_ps(&vals[i]);         /* load 4 floats */
v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);  /* v = v*a + b */
_mm_store_ps(&vals[i],v);               /* store 4 floats */
This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version! Note that if we use the unaligned loads and stores (loadu and storeu), we lose almost all the performance benefit from using SSE in the first place.
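For reference, here's the whole thing assembled into one function. This is a sketch: the declarations of vals, a, b, and n_vals are my guesses at the surrounding NetRun code, and the two constant loads are hoisted out of the loop since they don't change between iterations:

#include <xmmintrin.h>

enum {n_vals=1024};
float vals[n_vals] __attribute__((aligned(16))); /* 16-byte aligned, so _mm_load_ps is safe */
float a=2.1f, b=1.2f; /* placeholder constants */

int foo(void) {
	__m128 SSEa=_mm_load1_ps(&a); /* broadcast a into all 4 lanes, once */
	__m128 SSEb=_mm_load1_ps(&b);
	for (int i=0;i<n_vals;i+=4) { /* 4 floats per iteration */
		__m128 v=_mm_load_ps(&vals[i]);
		v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb); /* v = v*a + b */
		_mm_store_ps(&vals[i],v);
	}
	return 0;
}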
The intrinsics above are much cleaner than assembly, but they're still quite ugly. You can make a very nice interface using C++ classes and operator overloading, such as the "floats" class we developed during today's lecture:
class floats {
	__m128 v; // our four floats
public:
	enum {n=4}; // how many floats?
	floats(const float *ptr) { v=_mm_load_ps(ptr); }
	floats(float c) { v=_mm_load1_ps(&c); }
	floats(__m128 m) { v=m; }
	void store(float *dest) { _mm_store_ps(dest,v); }
	friend floats operator+(const floats &a,const floats &b) { return _mm_add_ps(a.v,b.v); }
	friend floats operator*(const floats &a,const floats &b) { return _mm_mul_ps(a.v,b.v); }
};

int foo(void) {
	for (long i=0;i<n;i+=floats::n) {
		((floats(&arrIn[i])+10.0)*2.12345).store(&arrOut[i]);
	}
	return 3;
}
The thing I really like about C++ is that this kind of wrapper can have very nice syntax and a complete feature set, yet compile into no-overhead machine code just like I'd write by hand.
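It's also easy to keep growing the wrapper by adding one method per intrinsic from the table above. For example, this sketch (mine, not from the lecture) adds subtraction, square root, and minimum:

	// Add these inside the "floats" class body above; each maps to one SSE instruction.
	friend floats operator-(const floats &a,const floats &b) { return _mm_sub_ps(a.v,b.v); }
	friend floats sqrt(const floats &a) { return _mm_sqrt_ps(a.v); }
	friend floats min(const floats &a,const floats &b) { return _mm_min_ps(a.v,b.v); }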
If you want a real version of this wrapper, supporting all the math operations, integers, bools, and floats, you should try Agner Fog's Vector Class Library (VCL).
CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.