This simple C++ program loops over an array of floats, and does some work on each float:
enum {n=1024};
float arrIn[n];
float arrOut[n];

int foo(void) {
	for (int i=0;i<n;i++)
		arrOut[i]=arrIn[i]*2.1+1.2;
	return 3;
}
Currently, it takes over 2000 nanoseconds per run to process the 1024 floats. This is slower than you might expect. Examining the generated assembly, you see this inner loop:
1f:	f3 0f 10 14 02    	movss    xmm2,DWORD PTR [rdx+rax*1]
24:	0f 5a d2          	cvtps2pd xmm2,xmm2
27:	f2 0f 59 d1       	mulsd    xmm2,xmm1
2b:	f2 0f 58 d0       	addsd    xmm2,xmm0
2f:	f2 0f 12 d2       	movddup  xmm2,xmm2
33:	66 0f 5a d2       	cvtpd2ps xmm2,xmm2
37:	f3 0f 11 14 01    	movss    DWORD PTR [rcx+rax*1],xmm2
3c:	48 83 c0 04       	add      rax,0x4
40:	48 3d 00 10 00 00 	cmp      rax,0x1000
46:	75 d7             	jne      1f
Note the conversion from "ss" (scalar single-precision float) to "sd" (scalar double-precision). This is caused by the floating-point constants like 2.1, which have type "double" by default: each float gets converted up to double, the arithmetic is done in double, and the result gets converted back down to float. We can speed the program up by over twofold(!) by adding the "f" suffix to the constants, making them floats instead of doubles and eliminating the sd <-> ss conversions.
arrOut[i]=arrIn[i]*2.1f+1.2f;
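Putting the fix together, the whole function is just the original code with the two constants changed:

enum {n=1024};
float arrIn[n];
float arrOut[n];

int foo(void) {
	for (int i=0;i<n;i++)
		arrOut[i]=arrIn[i]*2.1f+1.2f; /* "f" suffix keeps the math in single precision */
	return 3;
}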
For this new version, the inner loop is now very clean, similar to what we might write by hand:
1f:	f3 0f 10 04 02    	movss  xmm0,DWORD PTR [rdx+rax*1]
24:	f3 0f 59 c2       	mulss  xmm0,xmm2
28:	f3 0f 58 c1       	addss  xmm0,xmm1
2c:	f3 0f 11 04 01    	movss  DWORD PTR [rcx+rax*1],xmm0
31:	48 83 c0 04       	add    rax,0x4
35:	48 3d 00 10 00 00 	cmp    rax,0x1000
3b:	75 e2             	jne    1f
It is possible to speed this up even further by using "ps" (packed single-precision) instructions to operate on four floats at once, as discussed last lecture. We can start to understand this process by writing the single-float loop in assembly language. This is exactly as fast as the C++ version above, 1100 nanoseconds per run.
mov rcx,0 ; i
movss xmm10,[ten]
movss xmm2,[mulConst]
start:
movss xmm0,[arrIn + 4*rcx]
addss xmm0,xmm10
mulss xmm0,xmm2
movss [arrOut + 4*rcx],xmm0
add rcx,1
cmp rcx,1024
jl start
ret

section .data
arrIn:
times 1024 dd 3.4
arrOut:
times 1024 dd 0.0
ten:
dd 10.0
mulConst:
dd 2.1234566
The "times 1024" prefix indicates the following instruction should be repeated that many times.
To convert this serial loop to operate on four floats at once, we need to:
- change the "ss" (scalar) instructions to their "ps" (packed) versions: movss becomes movaps, addss becomes addps, and mulss becomes mulps;
- advance the loop counter by 4 floats per iteration instead of 1;
- align the arrays to 16 bytes, since the fast movaps load requires an aligned address;
- store 4 copies of each constant, since the packed instructions read a full 16-byte register.
Here is the converted loop; the changes follow the list above. We don't need to change the register names in SSE. Because I'm using the fast aligned loads, this takes almost 1/4 the time, 290 ns per run!
mov rcx,0 ; i
movaps xmm10,[ten]      ; 4 copies of 10.0
movaps xmm2,[mulConst]  ; 4 copies of the multiplier
start:
movaps xmm0,[arrIn + 4*rcx]   ; load 4 aligned floats
addps xmm0,xmm10
mulps xmm0,xmm2
movaps [arrOut + 4*rcx],xmm0  ; store 4 aligned floats
add rcx,4   ; advance by 4 floats per iteration
cmp rcx,1024
jl start
ret

section .data
align 16    ; movaps requires 16-byte aligned addresses
arrIn:
times 1024 dd 3.4
arrOut:
times 1024 dd 0.0
ten:
times 4 dd 10.0        ; one copy per SSE lane
mulConst:
times 4 dd 2.1234566
It's easy to get these conversion steps wrong: forgetting the "align 16" (so movaps segfaults), leaving the loop increment at 1 instead of 4, or storing only one copy of a constant instead of four. I see these bugs year after year, sometimes even in my own code!
In other words, it's assembly: it's trickier than it looks!
Since nobody likes writing in assembly (even me!), the x86 SSE instructions can be accessed from C/C++ via the header <xmmintrin.h>. This header exists and works out-of-the-box with most modern compilers:
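For example, a minimal program using the header looks like this. This is a small sketch of my own (not from the original lecture), using only the intrinsics documented below, that just adds 10.0 to four floats:

#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_add_ps, ... */

float out[4];
int foo(void) {
	__m128 a=_mm_setr_ps(1.0f,2.0f,3.0f,4.0f); /* four floats in one register */
	float ten=10.0f;
	__m128 b=_mm_load1_ps(&ten);               /* 10.0 copied into all four lanes */
	_mm_storeu_ps(out,_mm_add_ps(a,b));        /* out = {11,12,13,14} */
	return 0;
}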
Documentation can be found by just reading the xmmintrin header. The underlying instructions have different names, which are listed in the x86 reference manual and summarized below (and in various online SSE intrinsics guides). Keep in mind that the names of these things are absurdly ugly; I take this as a hint that you should wrap them in a nicer interface, such as a "floats" or "vec4" class (see below!).
The C version of an SSE register is the type "__m128". gcc also provides built-in operator overloads for arithmetic on __m128's. All the intrinsics start with "_mm_", and the suffix indicates the data type; in these examples, we'll just be talking about 4 floats, which use the suffix "_ps" (Packed Single-precision floats). SSE supports other data types in the same 128 bits, but 4 floats seems to be the sweet spot. So, for example, "_mm_load_ps" loads up 4 floats into a __m128, "_mm_add_ps" adds 4 corresponding floats together, etc. Major useful operations are:
Intrinsic | What it does
__m128 _mm_load_ps(float *src) | Load 4 floats from a 16-byte aligned address. WARNING: Segfaults if the address isn't a multiple of 16!
__m128 _mm_loadu_ps(float *src) | Load 4 floats from an unaligned address (4x slower on older machines, same speed on the latest machines *if* the data happens to be aligned)
__m128 _mm_load1_ps(float *src) | Load 1 individual float into all 4 fields of an __m128
__m128 _mm_setr_ps(float a,float b,float c,float d) | Load 4 separate floats from parameters into an __m128
void _mm_store_ps(float *dest,__m128 src) | Store 4 floats to a 16-byte aligned address
void _mm_storeu_ps(float *dest,__m128 src) | Store 4 floats to an unaligned address
__m128 _mm_add_ps(__m128 a,__m128 b) | Add corresponding floats (also "sub")
__m128 _mm_mul_ps(__m128 a,__m128 b) | Multiply corresponding floats (also "div", but it's slow)
__m128 _mm_min_ps(__m128 a,__m128 b) | Take corresponding minimum (also "max")
__m128 _mm_sqrt_ps(__m128 a) | Take square roots of 4 floats (12ns, slow like divide)
__m128 _mm_rcp_ps(__m128 a) | Compute rough (12-bit accuracy) reciprocal of all 4 floats (as fast as an add!)
__m128 _mm_rsqrt_ps(__m128 a) | Rough (12-bit) reciprocal-square-root of all 4 floats (fast)
__m128 _mm_shuffle_ps(__m128 lo,__m128 hi,_MM_SHUFFLE(hi3,hi2,lo1,lo0)) | Interleave inputs into the low 2 floats and high 2 floats of the output. Basically out[0]=lo[lo0]; out[1]=lo[lo1]; out[2]=hi[hi2]; out[3]=hi[hi3]; For example, _mm_shuffle_ps(a,a,_MM_SHUFFLE(i,i,i,i)) copies the float a[i] into all 4 output floats. (Worked example below.)
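As a worked example of _mm_shuffle_ps (my own sketch, not code from the lecture), here is a "horizontal sum" that adds the four floats of a __m128 down to one float; _mm_cvtss_f32 just extracts the low float:

#include <xmmintrin.h>

/* Sum the 4 floats in v down to a single float using shuffles. */
float horizontal_sum(__m128 v) {
	/* swap neighbors: {v1,v0,v3,v2}, then add pairwise */
	__m128 sums=_mm_add_ps(v,_mm_shuffle_ps(v,v,_MM_SHUFFLE(2,3,0,1)));
	/* swap the two halves and add again: every lane now holds the total */
	sums=_mm_add_ps(sums,_mm_shuffle_ps(sums,sums,_MM_SHUFFLE(1,0,3,2)));
	return _mm_cvtss_f32(sums); /* extract the low float */
}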
There are also instructions for integer conversion, comparisons, various bitwise operations, and cache-friendly prefetches and streaming store operations.
So take your classic "loop over floats":
for (int i=0;i<n_vals;i++) {
vals[i]=vals[i]*a+b;
}
This takes about 4.5 ns per float.
Step 1 is to unroll this loop 4 times, since SSE works on blocks of 4 floats at a time:
for (int i=0;i<n_vals;i+=4) {
vals[i+0]=vals[i+0]*a+b;
vals[i+1]=vals[i+1]*a+b;
vals[i+2]=vals[i+2]*a+b;
vals[i+3]=vals[i+3]*a+b;
}
This alone speeds the code up by about 2x, because we don't have to check the loop counter as often.
We can then replace the guts of the loop with SSE instructions:
__m128 SSEa=_mm_load1_ps(&a); /* a copied into all 4 lanes */
__m128 SSEb=_mm_load1_ps(&b); /* b copied into all 4 lanes */
__m128 v=_mm_load_ps(&vals[i]);         /* load 4 floats */
v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb);  /* v = v*a + b */
_mm_store_ps(&vals[i],v);               /* store 4 floats */
This gives us about a 4x speedup over the original, and still a 2x speedup over the unrolled version! Note that if we use the unaligned loads and stores (loadu and storeu), we lose almost all the performance benefit from using SSE in the first place.
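For reference, here's the whole thing assembled into one function. This is a sketch: the declarations of vals, a, b, and n_vals are my guesses at the surrounding NetRun code, and the two constant loads are hoisted out of the loop since they don't change between iterations:

#include <xmmintrin.h>

enum {n_vals=1024};
float vals[n_vals] __attribute__((aligned(16))); /* 16-byte aligned, so _mm_load_ps is safe */
float a=2.1f, b=1.2f; /* placeholder constants */

int foo(void) {
	__m128 SSEa=_mm_load1_ps(&a); /* broadcast a into all 4 lanes, once */
	__m128 SSEb=_mm_load1_ps(&b);
	for (int i=0;i<n_vals;i+=4) { /* 4 floats per iteration */
		__m128 v=_mm_load_ps(&vals[i]);
		v=_mm_add_ps(_mm_mul_ps(v,SSEa),SSEb); /* v = v*a + b */
		_mm_store_ps(&vals[i],v);
	}
	return 0;
}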
The intrinsics above are much cleaner than assembly, but they're still quite ugly. You can make a very nice interface using C++ classes and operator overloading, such as the "floats" class we developed during today's lecture:
class floats {
	__m128 v; // our four floats
public:
	enum {n=4}; // how many floats?
	floats(const float *ptr) { v=_mm_load_ps(ptr); }
	floats(float c) { v=_mm_load1_ps(&c); }
	floats(__m128 m) { v=m; }
	void store(float *dest) { _mm_store_ps(dest,v); }
	friend floats operator+(const floats &a,const floats &b) { return _mm_add_ps(a.v,b.v); }
	friend floats operator*(const floats &a,const floats &b) { return _mm_mul_ps(a.v,b.v); }
};

int foo(void) {
	for (long i=0;i<n;i+=floats::n) {
		((floats(&arrIn[i])+10.0)*2.12345).store(&arrOut[i]);
	}
	return 3;
}
The thing I really like about C++ is that this kind of wrapper can have very nice syntax and a complete feature set, yet compile into no-overhead machine code just like I'd write by hand.
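It's also easy to keep growing the wrapper by adding one method per intrinsic from the table above. For example, this sketch (mine, not from the lecture) adds subtraction, square root, and minimum:

	// Add these inside the "floats" class body above; each maps to one SSE instruction.
	friend floats operator-(const floats &a,const floats &b) { return _mm_sub_ps(a.v,b.v); }
	friend floats sqrt(const floats &a) { return _mm_sqrt_ps(a.v); }
	friend floats min(const floats &a,const floats &b) { return _mm_min_ps(a.v,b.v); }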
If you want a real version of this wrapper, supporting all the math operations, integers, bools, and floats, you should try Agner Fog's Vector Class Library (VCL).
CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.