Three Generations of x86 Floating-Point Numbers: FPU, SSE, and AVX
CS 301 Lecture, Dr. Lawlor
Intel has actually created three separate generations of floating-point
hardware for the x86. They started with a rather hideous
stack-oriented FPU modeled after a pocket calculator in the early
1980's, started over again with a register-based version called SSE in
the 1990's, and have just recently created a three-operand extension of SSE called AVX.
Briefly, the generations are:
                  | Register | Typical Instruction   | Data Type                                    | Year | Hardware
Generation 1: FPU | st0      | faddp                 | 1x 80-bit long double                        | 1981 | 8087
Generation 2: SSE | xmm0     | addss xmm0,xmm1       | 4x 32-bit float, 2x 64-bit double, or __m128 | 1999 | Pentium III
Generation 3: AVX | ymm0     | vaddps ymm0,ymm3,ymm2 | 8x 32-bit float, 4x 64-bit double, or __m256 | 2011 | Sandy Bridge Core CPU
Generation 1: "FPU Register Stack"
Like an HP calculator using RPN arithmetic, to add two numbers with the original 1981 floating point instructions, you:
- Load the first number (pushing it onto the "floating point register stack")
- Load the second number
- Add the two numbers (adds the top two numbers on the floating point register stack)
- Store the result
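In code, those four steps might look like this minimal sketch, written as GCC inline assembly in C++ (AT&T syntax; the variables x, y, and r are my own example names):
#include <iostream>

int main(void) {
	double x = 2.0, y = 3.0, r;
	__asm__(
		"fldl %1\n\t"  // load the first number (push x onto the FPU stack)
		"fldl %2\n\t"  // load the second number (push y)
		"faddp\n\t"    // add the top two stack entries, leaving the sum on top
		"fstpl %0\n\t" // store the result to memory, popping the stack
		: "=m"(r)
		: "m"(x), "m"(y)
	);
	std::cout << "Sum is " << r << "\n"; // prints 5
	return 0;
}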
Almost uniquely, you don't specify registers for most operations:
instead, the values come from the top of the "floating point register
stack". So "faddp" (add and pop) takes two parameters off the
stack, and pushes the resulting value. They did this because they
had to: they weren't willing to spend enough instruction bits to
encode two separate register numbers. A few instructions include
enough bits to specify something other than the top two registers, but
usually one value must be on top of the stack.
The FPU register stack was introduced with the 8087 math
coprocessor. It got pretty popular around the 386 era, when you
needed to add a separate 387 chip to the motherboard
to get floating point instructions. Floating point was integrated
into the main CPU around the 486 era, with the 486DX including on-chip
floating point.
Internally, there are 8 floating point registers (pushing more than 8
values onto the stack overflows it, and floating point operations stop
working). The registers are each an 80-bit "long double" (sign bit,
15-bit exponent, and 64-bit mantissa). 80 bits is a really weird
size, one of the only hardware-supported data types that's not a power
of two. I've seen machines that store those 80 bits using 10
bytes of memory (no padding), 12 bytes of memory (padding to a 4-byte
multiple), or even 16 bytes of memory (padding to a 16-byte multiple).
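If you're curious how your own machine stores a long double, here's a quick check you can compile anywhere: print its size and its mantissa width.
#include <iostream>
#include <limits>

int main(void) {
	// On x86 g++ this typically prints 16 bytes (or 12 with -m32),
	// and 64 mantissa bits--the x87 80-bit format plus padding.
	std::cout << "sizeof(long double) = " << sizeof(long double) << " bytes\n";
	std::cout << "mantissa bits = " << std::numeric_limits<long double>::digits << "\n";
	return 0;
}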
On old 32-bit machines, the typical way to return a floating point
value from a function was to leave it on the floating-point register
stack.
extern "C" double bar(void);
__asm__(
"bar:\n"
" fld1\n"
" fldpi\n"
" faddp\n"
" ret\n"
);
int foo(void) {
double d=bar();
std::cout<<" Function returns "<<d<<"\n";
return 0;
}
(Try this in NetRun now!)
Here's a similar operation in assembly. Because I move the resulting value to memory myself, this runs in 64-bit mode.
fldpi ; push pi onto the floating point register stack
fadd st0,st0 ; add the top register to itself (st0 = 2*pi)
fstp DWORD [a] ; store the top of the stack to memory as a 32-bit float, and pop
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack
ret ; Done with function
section .data
a: dd 1.234
(Try this in NetRun now!)
This implementation worked reasonably well for many years, but the
restriction that you can only operate on the top of the stack makes it
cumbersome for compilers to generate code for big arithmetic-intensive
functions--lots of instructions are spent shuffling values around the
stack rather than doing work.
Generation 2: SSE
Intel realized that a more conventional register-based implementation
could deliver higher performance, so in the late 1990's they built an
entirely separate replacement floating point unit called SSE. Every 64-bit x86 supports SSE. This includes:
- A brand new set of registers, xmm0 through xmm15 (the top eight, xmm8 through xmm15, are only accessible in 64-bit mode). They're all scratch registers.
- A brand new set of instructions, like "movss" and "addss".
Here's a typical use; see previous lectures for the gory details.
movss xmm0,[a] ; load from memory
addss xmm0,xmm0 ; add to itself (double it)
movss [a],xmm0 ; store back to memory
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack
ret ; Done with function
section .data
a: dd 1.234
(Try this in NetRun now!)
Today, SSE is the typical way to do floating point work. Some
older compilers might still use the FPU (to support very old pre-SSE
hardware), and the very latest cutting-edge machines can use AVX, but
SSE is the mainstream version you should probably use for your
homeworks.
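If you'd rather stay in C++ than write assembly, the same load/double/store idea can be expressed with the __m128 type from the table above. Here's a minimal sketch using the standard SSE intrinsics from <xmmintrin.h> (the array values are just examples):
#include <iostream>
#include <xmmintrin.h>

int main(void) {
	float a[4] = {1.234f, 2.0f, 3.0f, 4.0f};
	__m128 v = _mm_loadu_ps(a); // load 4 floats into an xmm register
	v = _mm_add_ps(v, v);       // add each float to itself, all at once
	_mm_storeu_ps(a, v);        // store all 4 floats back to memory
	std::cout << "a[0] is now " << a[0] << "\n"; // prints 2.468
	return 0;
}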
Generation 3: AVX
Most other modern CPUs have "three-operand instructions": X=Y+Z is a
single instruction. SSE only has "two-operand
instructions", so X+=Y is a single instruction, but X=Y+Z takes
two (a copy, then an add). The very latest 2011
x86 instruction set addition, AVX, changes that: you can now write three-operand instructions!
vmovss xmm1,[a] ; load from memory
vaddss xmm0,xmm1,xmm1 ; add to itself (double it), and store to xmm0
vmovss [a],xmm0 ; store back to memory
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack
ret ; Done with function
section .data
a: dd 1.234
(Try this in NetRun now!)
There are a few other additions, such as a new set of "ymm" registers
with wider vector parallelism, but for scalar code the big change is
three-operand instructions.
The only downside with AVX is that if your processor was built before 2011, it won't run those instructions!
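For completeness, here's the ymm side of AVX as a minimal C++ sketch, using the __m256 intrinsics from <immintrin.h>. It assumes a 2011-or-later CPU and compiling with -mavx; on older hardware it will die with an illegal instruction, per the warning above.
#include <iostream>
#include <immintrin.h>

int main(void) {
	float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
	__m256 v = _mm256_loadu_ps(a); // load 8 floats into a ymm register
	v = _mm256_add_ps(v, v);       // double all 8 floats in one instruction
	_mm256_storeu_ps(a, v);        // store all 8 back to memory
	for (int i = 0; i < 8; ++i) std::cout << a[i] << " ";
	std::cout << "\n"; // prints 2 4 6 8 10 12 14 16
	return 0;
}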