SIMD: x86 SSE and ARM VFP

CS 641 Lecture, Dr. Lawlor

Here's my writeup for x86 SSE instructions. There are both an assembly language interface and a set of C/C++ intrinsics in <xmmintrin.h>.
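As a rough illustration of the intrinsics interface, here is a minimal sketch that adds four floats in one SSE operation. The function name add4 and the HAVE_SSE macro are made up for this example; the intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) come from <xmmintrin.h>. On a machine without SSE the sketch falls back to a plain loop.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1   // hypothetical macro name, used only in this sketch
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3.
void add4(const float *a, const float *b, float *out) {
#ifdef HAVE_SSE
    __m128 va = _mm_loadu_ps(a);             // unaligned load of four floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  // one add covers all four lanes
#else
    for (int i = 0; i < 4; i++)              // portable scalar fallback
        out[i] = a[i] + b[i];
#endif
}
```

Calling add4(v, v, out) doubles each of the four floats in v, which is the same job the ARM vector example later in these notes does with fadds.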

ARM VFP

There's a pretty good summary of all ARM instructions, including VFP ones, over at HeyRick. Regarding registers, briefly: r0-r3 are for parameter passing and scratch, r4-r11 are saved registers, and higher register numbers have specialized functions (like sp, the stack pointer; or pc, the program counter). Like 64-bit x86, you need to align the stack *if* you're calling a function that uses floating point, but only to 8 bytes (not 16 bytes).

For floating point registers, ARM uses a fairly standard even-odd division to store single and double precision floats in the same storage. This means "D0" stores one double, or you can store two single precision floats in "S0" and "S1" using the same bits. Similarly, D1 overlaps S2 and S3. See the ARM floating point register diagram. Here's an ARM assembly example where we load up a constant, add it to itself, and store it back to memory for printing:

push {r4,lr}    @ (note: we push r4 too, just for 8-byte stack alignment)
sub sp,sp, 32   @ make plenty of space on the stack

adr r0,.myfloats @ makes r0 point to myfloats
flds s0,[r0] @ load single-precision float (from constant below)
fadds s0,s0,s0 @ add to itself
fsts s0,[sp] @ store out to the stack 

mov r0,sp @ location of floats to print
mov r1,1  @ number of floats to print
bl farray_print @ print some floats (FAILS if stack is not 8-byte aligned!)

add sp,sp,32 @ hand back stack space
pop {r4,pc} @ restore link register, and return

.myfloats: @ Note that this is read-only constant space (segfault on store!)
   .word 0x3F9E0419 @ floating point 1.2345
@ Generate constants above via C++: "float x=10.0; return *(int *)&x;"

(Try this in NetRun now!)
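To make the even-odd S/D register overlap described above concrete, here is a small C++ model of the register file. The VfpRegs struct and its method names are invented for this sketch, not any real API. It assumes D(n) is built from S(2n) in the low 32 bits and S(2n+1) in the high 32 bits.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical model of the VFP register file: 32 single-precision slots.
struct VfpRegs {
    uint32_t s[32] = {};  // raw bits of S0..S31

    void write_s(int n, float v) { std::memcpy(&s[n], &v, sizeof v); }

    float read_s(int n) const {
        float v;
        std::memcpy(&v, &s[n], sizeof v);
        return v;
    }

    // Dn shares storage with S(2n) (low half) and S(2n+1) (high half).
    uint64_t read_d_bits(int n) const {
        return (uint64_t)s[2 * n] | ((uint64_t)s[2 * n + 1] << 32);
    }

    void write_d(int n, double v) {
        uint64_t bits;
        std::memcpy(&bits, &v, sizeof v);
        s[2 * n]     = (uint32_t)bits;          // low half lands in S(2n)
        s[2 * n + 1] = (uint32_t)(bits >> 32);  // high half lands in S(2n+1)
    }
};
```

In this model, writing a double to D0 clobbers whatever floats were in S0 and S1, which is the practical thing to remember when mixing single and double precision code.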

(Note: I just added ".syntax unified" to NetRun's boilerplate code, so you no longer need # in front of constants.)

ARM offers a very interesting "rotating register banks" vector setup. Bank 0 (registers D0-D3, or S0-S7) always holds single scalar values, but if you set the funky FPSCR LEN field to a vector length greater than one, then Banks 1 through 3 can operate in vector mode. If you set the vector length to 4 (the LEN field stores the length minus one, so the field itself holds 3), for example, an operation like

fadds s8,s8,s16

actually adds four floats: S8+=S16; S9+=S17; S10+=S18; and S11+=S19;
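The effect of that vector fadds can be sketched in C++ as a loop over register numbers. This is only a model under stated assumptions: the names bank_step and fadds_model are made up, the stride is fixed at 1, register numbers are assumed to wrap around inside their bank of 8, and the model only covers the cases where the destination is in Bank 0 (scalar) or all operands are in Banks 1-3.

```cpp
// Hypothetical helper: step i registers along from reg,
// wrapping around inside reg's bank of 8 (assumed behavior).
static int bank_step(int reg, int i) {
    return (reg & ~7) | ((reg + i) & 7);
}

// Hypothetical model of "fadds Sdst,Ssrc1,Ssrc2" with vector length vec_len.
// s[] holds the 32 single-precision registers S0..S31.
void fadds_model(float s[32], int dst, int src1, int src2, int vec_len) {
    // A destination in Bank 0 (S0-S7) is always a scalar operation.
    int n = (dst < 8) ? 1 : vec_len;
    for (int i = 0; i < n; i++)
        s[bank_step(dst, i)] = s[bank_step(src1, i)] + s[bank_step(src2, i)];
}
```

With vec_len set to 4, fadds_model(s, 8, 8, 16, 4) performs the four additions listed above, while fadds_model(s, 0, 0, 1, 4) still does just one.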

This ability to mix and match vector operations (on Banks 1-3) and scalar operations (in Bank 0) is quite handy, although I don't like having to store the vector length in LEN. Loads and stores never go vector according to LEN, but FLDM/FSTM can already load and store multiple registers.
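Since the vector length lives in a bit field of FPSCR, it may help to see the bit twiddling in C++ first. This is a sketch, not a way to actually touch FPSCR from C++: the constant and function names are invented, and it assumes LEN occupies bits 18:16 (stored as length minus one) and STRIDE occupies bits 21:20 (zero meaning stride 1), matching the 0x00370000 mask used in the assembly that follows.

```cpp
#include <cstdint>

// Assumed field layout (names are hypothetical):
const uint32_t FPSCR_LEN_SHIFT = 16;           // LEN field: bits 18:16
const uint32_t FPSCR_VEC_MASK  = 0x00370000u;  // LEN plus STRIDE (bits 21:20)

// Return fpscr with STRIDE = 1 and the requested vector length (1..8).
uint32_t fpscr_set_vector_length(uint32_t fpscr, unsigned len) {
    fpscr &= ~FPSCR_VEC_MASK;                       // like BIC: clear both fields
    fpscr |= ((len - 1u) & 7u) << FPSCR_LEN_SHIFT;  // like ORR: LEN = len - 1
    return fpscr;
}

// Decode the vector length back out of an FPSCR value.
unsigned fpscr_vector_length(uint32_t fpscr) {
    return ((fpscr >> FPSCR_LEN_SHIFT) & 7u) + 1u;
}
```

For a vector length of 4 this produces the 0x00030000 pattern, and a length of 1 leaves both fields cleared, which is ordinary scalar mode.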

Here's an example of using LEN=4 vectors:

push {r4,lr}    @ (note: we push r4 too, just for 8-byte stack alignment)
sub sp,sp, 32   @ make plenty of space on the stack

@ Enter vector compute mode
    FMRX    r12,FPSCR           @ copy FPSCR into r12
    BIC     r12,r12,#0x00370000 @ clears STRIDE and LEN
    ORR     r12,r12,#0x00030000 @ sets LEN field = 3 (vector length 4); STRIDE stays 1
    FMXR    FPSCR,r12           @ copy r12 back into FPSCR

adr r0,.myfloats @ makes r0 point to myfloats
fldmias r0,{s8-s11} @ load four single-precision floats (from constants below)
fadds s8,s8,s8 @ add *four* floats (from LEN above)
fstmias sp,{s8-s11} @ store four single-precision floats (to the stack)

@ Leave vector compute mode
    BIC     r12,r12,#0x00370000 @ clears both fields: back to STRIDE = 1, LEN = 1 (scalar)
    FMXR    FPSCR,r12           @ copy r12 back into FPSCR

mov r0,sp @ location of floats to print
mov r1,4  @ number of floats to print
bl farray_print @ print some floats (FAILS if stack is not 8-byte aligned!)

add sp,sp,32 @ hand back stack space
pop {r4,pc} @ restore link register, and return

.myfloats: @ Note that this is read-only constant space (segfault on store!)
   .word 0x3F9E0419 @ floating point 1.2345
   .word 0x42C80000 @ floating point 100.0
   .word 0x41200000 @ floating point 10.0
   .word 0x4048F5C3 @ floating point 3.14
@ Generate constants above via C++: "float x=10.0; return *(int *)&x;"

(Try this in NetRun now!)
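The hex constants in .myfloats can be generated with a few lines of C++. This sketch uses memcpy rather than the pointer cast from the comment, since memcpy sidesteps any strict-aliasing questions. The function name float_bits is made up for illustration, and the sketch assumes float is a 32-bit IEEE-754 value.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the raw bit pattern of a single-precision float.
uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the bytes, no conversion
    return bits;
}
```

For example, float_bits(10.0f) gives 0x41200000 and float_bits(1.2345f) gives 0x3F9E0419, matching the .word values above. Printing with a format like "0x%08X" gives text you can paste straight into the assembly.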

Generally, the vector operations seem to be quite fast, taking only a little longer than the scalar versions. In addition, unlike many chip designers, ARM publishes detailed execution information, including cycle counts, pipeline hazards and scoreboarding, so you have something to start with during optimization!