Three Generations of x86 Floating-Point Numbers: FPU, SSE, and AVX

CS 301 Lecture, Dr. Lawlor

Intel has created three separate generations of floating-point hardware for the x86.  They started with a rather hideous stack-oriented FPU modeled after a pocket calculator in the early 1980's, started over with a register-based design called SSE in the late 1990's, and most recently extended SSE into a three-operand version called AVX.

Briefly, the generations are:

                   Register  Typical Instruction    Data Type              Year  Hardware
Generation 1: FPU  st0       faddp                  1x 80-bit long double  1981  8087
Generation 2: SSE  xmm0      addss xmm0,xmm1        4x 32-bit float or     1999  Pentium III
                                                    2x 64-bit double
                                                    (one __m128)
Generation 3: AVX  ymm0      vaddss ymm0,ymm3,ymm2  8x 32-bit float or     2011  Sandy Bridge
                                                    4x 64-bit double       Core CPU
                                                    (one __m256)

Generation 1: "FPU Register Stack"

Like an HP calculator using RPN arithmetic, to add two numbers with the original 1981 floating point instructions, you push both values onto a stack, then execute an add that pops them off and pushes back the sum.
Almost uniquely, you don't specify registers for most operations: instead, the values come from the top of the "floating point register stack".  So "faddp" (add and pop) takes two parameters off the stack, and pushes the resulting value.  A few instructions include enough bits to specify something other than the top two registers, but usually at least one operand must be on top of the stack.

Internally, there are 8 floating point registers (pushing too many values onto the stack makes floating point operations stop working).  The registers are all 80-bit "long doubles" (1 sign bit, 15-bit exponent, and 64-bit mantissa).

On old 32-bit machines, the typical way to return a floating point value from a function was to leave it on the floating-point register stack.
extern "C" double bar(void);
__asm__(
"bar:\n"
" fld1\n"    /* push 1.0 */
" fldpi\n"   /* push pi */
" faddp\n"   /* pop both, push 1.0+pi */
" ret\n"     /* result is returned on top of the FP stack */
);

int foo(void) {
double d=bar();
std::cout<<" Function returns "<<d<<"\n";
return 0;
}

(Try this in NetRun now!)

Here's a similar operation in assembly.  Because I move the resulting value to memory myself, this runs in 64-bit mode.
fldpi
fadd st0,st0 ; add register 0 to itself

fstp DWORD [a] ; store top of FP register stack to memory as a float, and pop
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack

ret ; Done with function

section .data
a: dd 1.234

(Try this in NetRun now!)

This implementation worked reasonably well for many years, but the restriction that you can only operate on the top of the stack makes it cumbersome for compilers to generate code for big arithmetic-intensive functions: lots of instructions are spent shuffling values around the stack rather than doing useful work.

Generation 2: SSE

Intel realized that a more conventional register-based implementation could deliver higher performance, so in the late 1990's they built an entirely separate replacement floating point unit called SSE.  This includes a new set of directly addressable 128-bit registers, xmm0 through xmm7 (and xmm8 through xmm15 in 64-bit mode), plus instructions that name their source and destination registers explicitly.
Here's a typical use:
movss xmm0,[a] ; load from memory
addss xmm0,xmm0 ; add to itself (double it)
movss [a],xmm0 ; store back to memory

mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack

ret ; Done with function

section .data
a: dd 1.234

(Try this in NetRun now!)

The full list of single-float instructions is below.  There are also double precision instructions, and some very interesting parallel instructions (we'll talk about these next week).


Category          Instruction  Comments
Arithmetic        addss        sub, mul, and div all work the same way
Compare           minss        max works the same way
Sqrt              sqrtss       Square root (sqrt), reciprocal (rcp), and
                               reciprocal square root (rsqrt) all work the
                               same way
Move              movss        Copy DWORD-sized data to and from memory.  One
                               annoyance is that the fast "aligned" parallel
                               version of this instruction will crash if the
                               destination isn't 16-byte aligned, so the 64-bit
                               calling conventions require you to carefully
                               align the stack.
Convert           cvtss2sd     Convert to ("2", get it?) Single Double (sd) or
                  cvtss2si     Single Integer (si, stored in a register like
                  cvttss2si    eax).  "cvtt" versions truncate (round toward
                               zero); "cvt" versions round to nearest.
Compare to flags  ucomiss      Sets CPU flags like the normal x86 "cmp"
                               instruction, but from SSE registers.  Use with
                               "jb", "jbe", "je", "jae", or "ja" for normal
                               comparisons.  Sets "pf", the parity flag, if
                               either input is a NaN.

Here's an example of cvtss2si to convert to integer:
movss xmm3,[pi]; load up constant
addss xmm3,xmm3 ; add pi to itself
cvtss2si eax,xmm3 ; round to integer
ret
section .data
pi: dd 3.14159265358979 ; constant

(Try this in NetRun now!)

Today, SSE is the typical way to do floating point work.  Some older compilers might still use the FPU (to work with very old pre-SSE hardware), and the very latest cutting edge machines can use AVX, but SSE is the mainstream version you should probably use for your homeworks.

Generation 3: AVX

Most other modern CPUs have "three-operand instructions": X=Y+Z is a single instruction.  SSE, by contrast, uses "two-operand instructions": X+=Y is a single instruction.  AVX, the 2011 addition to the x86 instruction set, changes that: you can now write three-operand instructions!
vmovss xmm1,[a] ; load from memory
vaddss xmm0,xmm1,xmm1 ; add to itself (double it), and store to xmm0
vmovss [a],xmm0 ; store back to memory

mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack

ret ; Done with function

section .data
a: dd 1.234

(Try this in NetRun now!)

There are a few other additions, such as the new set of wider "ymm" registers for vector parallelism, but for scalar code the big change is the three-operand instructions.

The only downside with AVX is that if your processor was built before 2011, it won't run those instructions!