| | Register | Typical Instruction | Data Type | Year | Hardware |
|---|---|---|---|---|---|
| Generation 1: FPU | st0 | faddp | 1x 80-bit long double | 1981 | 8087 |
| Generation 2: SSE | xmm0 | addss xmm0,xmm1 | 4x 32-bit float, 2x 64-bit double (SSE2 and later), or __m128 | 1999 | Pentium III |
| Generation 3: AVX | ymm0 | vaddps ymm0,ymm3,ymm2 | 8x 32-bit float, 4x 64-bit double, or __m256 | 2011 | Sandy Bridge Core CPU |
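The element counts in the table are just the register width divided by the element width. Here's a tiny C++ sketch of that arithmetic. The helper name `lanes` is made up for illustration, and it assumes 8-bit bytes, a 32-bit float, and a 64-bit double, as on x86.

```cpp
#include <cstddef>

// Number of elements ("lanes") of type T that fit in a register of the
// given width in bits. Hypothetical helper, assuming 8-bit bytes.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (sizeof(T) * 8);
}

// Assumes float is 32 bits and double is 64 bits, as on x86.
static_assert(lanes<float>(128) == 4, "xmm register: 4 floats");
static_assert(lanes<double>(128) == 2, "xmm register: 2 doubles");
static_assert(lanes<float>(256) == 8, "ymm register: 8 floats");
static_assert(lanes<double>(256) == 4, "ymm register: 4 doubles");
```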
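#include <iostream>

extern "C" double bar(void);
__asm__(
"bar:\n"
" fld1\n"   // push 1.0 onto the register stack
" fldpi\n"  // push pi
" faddp\n"  // add the top two entries and pop, leaving 1.0+pi in st0
" ret\n"    // the result is returned in st0 (32-bit convention)
);
int foo(void) {
	double d=bar();
	std::cout<<" Function returns "<<d<<"\n";
	return 0;
}

Here's a similar operation in assembly. The version above returns its result in st0, which only matches the 32-bit calling convention (64-bit code returns doubles in xmm0). Because I move the resulting value to memory myself here, this runs in 64-bit mode.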
fldpi ; push pi onto the floating point register stack
fadd st0,st0 ; add register 0 to itself
fstp DWORD [a]; store top of floating point register stack to memory, and pop it
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack
ret ; Done with function
section .data
a: dd 1.234

This implementation worked reasonably well for many years, but the restriction that you can only operate on the top of the stack makes it cumbersome for compilers to generate code for big, arithmetic-intensive functions: lots of instructions are spent shuffling values around in the stack rather than doing work.
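To get a feel for the shuffling cost, here's a toy C++ model of a register stack where every arithmetic operation has to involve the top entry. This is only a sketch: `FpuStack`, `add_st1_into_st2`, and `demo` are made-up names, and the model ignores everything about the real x87 except the stack-top rule. Adding two values that sit below the top takes three instructions here, two of which are pure shuffles; with flat registers it would be a single add.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of an x87-style register stack (hypothetical, for illustration).
struct FpuStack {
    std::vector<double> regs;  // regs.back() plays the role of st0
    int instructions = 0;      // every operation counts as one instruction
    int shuffles = 0;          // how many of those were fxch swaps

    // Access st(i), counting down from the top of the stack.
    double& st(std::size_t i) { return regs[regs.size() - 1 - i]; }

    // Push a value, like fld.
    void fld(double v) { regs.push_back(v); ++instructions; }

    // Swap st0 with st(i), like fxch.
    void fxch(std::size_t i) {
        std::swap(st(0), st(i));
        ++instructions;
        ++shuffles;
    }

    // st(i) += st0, like "fadd st(i),st0". One operand is always st0.
    void fadd_to(std::size_t i) { st(i) += st(0); ++instructions; }
};

// Goal: st2 += st1 while leaving st0 untouched. Neither operand is on top,
// so we swap one into place, add, and swap back.
void add_st1_into_st2(FpuStack& fpu) {
    fpu.fxch(1);     // bring the old st1 to the top
    fpu.fadd_to(2);  // st2 += st0
    fpu.fxch(1);     // restore the original top
}

// Load three values and run the add; returns the final stack state.
FpuStack demo() {
    FpuStack fpu;
    fpu.fld(3.0);
    fpu.fld(2.0);
    fpu.fld(1.0);  // now st0=1, st1=2, st2=3
    add_st1_into_st2(fpu);
    return fpu;
}
```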
movss xmm0,[a] ; load from memory
addss xmm0,xmm0 ; add to itself (double it)
movss [a],xmm0 ; store back to memory
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack
ret ; Done with function
section .data
a: dd 1.234

The full list of single-float instructions is below. There are also double precision instructions, and some very interesting parallel instructions (we'll talk about these next week).
| | Instruction | Comments |
|---|---|---|
| Arithmetic | addss | sub, mul, div all work the same way |
| Compare | minss | max works the same way |
| Sqrt | sqrtss | Square root (sqrt), reciprocal (rcp), and reciprocal-square-root (rsqrt) all work the same way |
| Move | movss | Copy DWORD sized data to and from memory. One annoyance is that the fast "aligned" parallel version of this instruction will crash if the memory address isn't 16-byte aligned, so the 64-bit call conventions require you to carefully align the stack. |
| Convert | cvtss2sd cvtss2si cvttss2si | Convert a Single float (ss) to ("2", get it?) a Single Double (sd) or a Single Integer (si, stored in a register like eax). "cvtt" versions do truncation (round toward zero); "cvt" versions round to nearest. |
| Compare to flags | ucomiss | Sets CPU flags like the normal x86 "cmp" instruction, but from SSE registers. Use with "jb", "jbe", "je", "jae", or "ja" for normal comparisons. Sets "pf", the parity flag, if either input is a NaN. |
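The conversion and NaN rules in the table are easy to see from C++. This is a sketch, and the function names are my own. The link to specific instructions is an assumption about typical x86-64 code generation, not a guarantee: compilers commonly use cvttss2si for a float-to-int cast, and `std::lrint` follows the current rounding mode, which defaults to round-to-nearest-even, the same default cvtss2si uses.

```cpp
#include <cmath>

// Truncating conversion: the cast discards the fraction (rounds toward zero).
// Commonly compiled to cvttss2si on x86-64 (assumption about the compiler).
int to_int_truncate(float x) { return static_cast<int>(x); }

// Rounding conversion: uses the current rounding mode, which is
// round-to-nearest-even unless the program has changed it.
long to_int_nearest(float x) { return std::lrint(x); }

// True if either input is a NaN: the case ucomiss reports via the parity flag.
bool unordered(float a, float b) { return std::isunordered(a, b); }

// An ordinary comparison; it is false whenever either input is a NaN.
bool less_than(float a, float b) { return a < b; }
```

Note that truncation and rounding down differ for negative inputs: `to_int_truncate(-2.7f)` gives -2, while rounding down would give -3.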
movss xmm3,[pi]; load up constant
addss xmm3,xmm3 ; add pi to itself
cvtss2si eax,xmm3 ; round to nearest integer (returns 6)
ret
section .data
pi: dd 3.14159265358979 ; constant

Today, SSE is the typical way to do floating point work. Some older compilers might still use the FPU (to work with very old pre-SSE hardware), and the very latest cutting-edge machines can use AVX, but SSE is the mainstream version you should probably use for your homework.
vmovss xmm1,[a] ; load from memory
vaddss xmm0,xmm1,xmm1 ; add to itself (double it), and store to xmm0
vmovss [a],xmm0 ; store back to memory
mov rdi,a; address of our float
mov rsi,1; number of floats to print
sub rsp,8 ; align stack for farray_print
extern farray_print
call farray_print
add rsp,8 ; Clean up stack
ret ; Done with function
section .data
a: dd 1.234

There are a few other additions, such as a new set of "ymm" registers with wider vector parallelism, but for scalar code the big change is three-operand instructions: the destination register is separate from the two sources, so neither input gets overwritten.
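From C or C++ you rarely write these instructions by hand. Here's a sketch of the same "double a float" operation using the scalar SSE intrinsics from `<xmmintrin.h>`. The function name `double_it` is made up, and the non-x86 fallback exists only so the example builds anywhere. As an assumption about typical GCC or Clang behavior, building with `-mavx` makes the compiler emit the three-operand vaddss form for the same source; without it you get the two-operand addss.

```cpp
#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // scalar and packed SSE intrinsics

// Doubles a float with scalar SSE intrinsics: load, add, store.
float double_it(float a) {
    __m128 x = _mm_load_ss(&a);  // like movss xmm0,[a]
    x = _mm_add_ss(x, x);        // like addss xmm0,xmm0
    float out;
    _mm_store_ss(&out, x);       // like movss [out],xmm0
    return out;
}
#else
// Fallback for non-x86 targets: plain C++ arithmetic.
float double_it(float a) { return a + a; }
#endif
```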