Floating-Point Numbers

CS 641 Lecture, Dr. Lawlor
To represent fractional values before fast floating-point hardware existed, we used fixed-point arithmetic, where you keep track of the decimal point's location at compile time.  Floating point allows the decimal point to move at runtime, which makes it more flexible than fixed point.
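
For example, here's a minimal fixed-point sketch in C++.  The 8.8 format and the SCALE of 256 are just illustrative choices, not a standard: addition works directly on the scaled integers, but multiplication needs an explicit rescale.
#include <iostream>

/* Fixed point: store value*256, so the low 8 bits hold the fraction.
   The scale factor is chosen at compile time and never moves. */
typedef int fixed_8_8;
const int SCALE=256;

fixed_8_8 to_fixed(double d) { return (fixed_8_8)(d*SCALE); }
double from_fixed(fixed_8_8 f) { return f/(double)SCALE; }

int main(void) {
  fixed_8_8 a=to_fixed(3.25), b=to_fixed(1.5);
  fixed_8_8 sum=a+b;          /* addition works directly */
  fixed_8_8 prod=(a*b)/SCALE; /* multiplication needs a rescale */
  std::cout<<"sum="<<from_fixed(sum)<<"  prod="<<from_fixed(prod)<<"\n";
  return 0;
}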

If you need a review of floats, see the CS 301 lecture notes on floating point.
This patent provides a decent summary of a floating-point add circuit.  The basic idea is usually: compare the two exponents, shift the smaller number's mantissa right until the exponents match, add the aligned mantissas, then renormalize and round the result.
You could shift both input numbers into huge fixed-point values (for example, a 32-bit float can be shifted into a fx128.153 fixed-point number), but it's much more circuitry-efficient to shift only the smaller number so it matches up with the larger value, as we discussed in class.
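
Here's a rough software model of that alignment step, using a made-up toy format (an integer mantissa times a power of two), not real IEEE 754 fields:
#include <iostream>

/* Toy float: value = mantissa * 2^exponent.  Made-up 8-bit mantissa, not IEEE 754. */
struct toyfloat { int mantissa; int exponent; };

toyfloat toy_add(toyfloat a, toyfloat b) {
  if (a.exponent<b.exponent) { toyfloat t=a; a=b; b=t; } /* make a the bigger-exponent input */
  b.mantissa >>= (a.exponent-b.exponent); /* shift the smaller number to line up (low bits fall off) */
  toyfloat r; r.mantissa=a.mantissa+b.mantissa; r.exponent=a.exponent;
  while (r.mantissa>=256) { r.mantissa>>=1; r.exponent++; } /* renormalize back to 8 bits */
  return r;
}

int main(void) {
  toyfloat a={200,3}, b={128,0}; /* 200*8=1600 plus 128*1=128 */
  toyfloat r=toy_add(a,b);
  std::cout<<r.mantissa<<" * 2^"<<r.exponent<<" = "<<(r.mantissa<<r.exponent)<<"\n";
  return 0;
}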

Software Examples

x86 ancient (1980s) interface: floating-point register stack.
fldpi ; Push "pi" onto floating-point stack
fld DWORD[my_float] ; push constant
faddp ; add one and pi

sub esp,8 ; Make room on the stack for an 8-byte double
fstp QWORD [esp] ; Store the sum into that space as printf's double parameter
push my_string ; Push printf's string parameter (below)
extern printf
call printf ; Print string
add esp,12 ; Clean up stack

ret ; Done with function

my_string: db "Yo! Here's our float: %f",0xa,0
my_float: dd 1.0 ; floating-point DWORD

(Try this in NetRun now!)

x86 newer (1990s) interface: SSE registers
movups xmm0,[my_arr] ; load up array
addps xmm0,xmm0 ; add array to itself
movups [my_arr],xmm0 ; store back to memory

push 4 ; number of values to print
push my_arr ; array to print
extern farray_print
call farray_print ; Print the float array
add esp,8 ; Clean up stack

ret ; Done with function

section .data
my_arr: dd 1.0, 2.0, 3.0, 4.0 ; four floating-point DWORDs

(Try this in NetRun now!)
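
From C++, the usual way to reach these SSE registers is through compiler intrinsics; this sketch is roughly equivalent to the assembly above (farray_print is a NetRun helper, so this version just prints with cout):
#include <iostream>
#include <xmmintrin.h> /* SSE intrinsics */

int main(void) {
  float my_arr[4]={1.0,2.0,3.0,4.0};
  __m128 v=_mm_loadu_ps(my_arr); /* movups: load 4 floats */
  v=_mm_add_ps(v,v);             /* addps: add array to itself */
  _mm_storeu_ps(my_arr,v);       /* movups: store back to memory */
  for (int i=0;i<4;i++) std::cout<<my_arr[i]<<" ";
  std::cout<<"\n";
  return 0;
}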

Bits in Floating-point Numbers

We can pretty easily count the bits in a float by making it smaller and smaller until roundoff loses the "x":
float x=1.0+1.0e-9*(rand()%2); /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x+1.0f!=1.0f) {
  x=x*0.5;
  itcount++;
}
std::cout<<"itcount=="<<itcount<<"\n";

(Try this in NetRun now!)
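
On an ordinary x86 machine this prints itcount==24: the 23 stored mantissa bits plus the implicit leading 1.  You can cross-check that against the standard library:
#include <iostream>
#include <limits>

int main(void) {
  /* numeric_limits counts the implicit leading bit, so digits is 24 for IEEE single precision */
  std::cout<<"float mantissa digits = "<<std::numeric_limits<float>::digits<<"\n";
  std::cout<<"machine epsilon = "<<std::numeric_limits<float>::epsilon()<<"\n";
  return 0;
}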

We can't do the cout on the GPU, but we can output different colors instead.  Green means the iteration count is right on the money; blue means it's too small, red means it's too big.
float x=1.0+texcoords.x*1.0e-10; /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x+1.0!=1.0) {
  x=x*0.5;
  itcount++;
}

if (itcount>24) gl_FragColor=vec4(1,0,0,0); /* red */
else if (itcount==24) gl_FragColor=vec4(0,1,0,0); /* green */
else if (itcount<24) gl_FragColor=vec4(0,0,1,0); /* blue */

(Try this in NetRun now!)

This outputs green, indicating that a GPU "float" has exactly the same number of mantissa bits (23 stored, 24 counting the implicit leading bit) as on the CPU.

Similarly, we can keep shrinking an "x" until it gets rounded off to zero.  We step down through all the normal exponents (126 halvings, down to 2^-126), then through the 23 mantissa bits as denormals (down to 2^-149), and one final halving gives a total of 150 places: 2^-150 "underflows" to zero.
float x=1.0+1.0e-9*(rand()%2); /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x!=0.0f) {
  x=x*0.5;
  itcount++;
}
std::cout<<"itcount=="<<itcount<<"\n";

(Try this in NetRun now!)
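
To check that boundary another way, std::numeric_limits<float>::denorm_min() gives the smallest denormal, 2^-149; halving it once more rounds to exactly zero:
#include <iostream>
#include <limits>

int main(void) {
  float tiny=std::numeric_limits<float>::denorm_min(); /* 2^-149, the smallest denormal */
  std::cout<<"smallest denormal = "<<tiny<<"\n";
  std::cout<<"half of that      = "<<tiny*0.5f<<"\n"; /* 2^-150 rounds to exactly 0 */
  std::cout<<"underflowed to zero? "<<(tiny*0.5f==0.0f)<<"\n";
  return 0;
}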

Now on the GPU:
float x=1.0+texcoords.x*1.0e-10; /* FEAR ME, OPTIMIZER!!! */
int itcount=0;
while (x!=0.0) {
  x=x*0.5;
  itcount++;
}

int expected=150;
if (itcount>expected) gl_FragColor=vec4(1,0,0,0); /* red */
else if (itcount==expected) gl_FragColor=vec4(0,1,0,0); /* green */
else if (itcount<expected) gl_FragColor=vec4(0,0,1,0); /* blue */

(Try this in NetRun now!)

Wait: the GPU is returning blue, indicating fewer iterations than we expected!  Experimentally, we can find that the GPU hits underflow at 2^-127, just below the smallest normal number (2^-126), right where denormals should kick in.  So the GPU doesn't do denormals at all; it flushes them to zero, and indeed you can measure that there is zero performance penalty for using values in the denormal range on the GPU.
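
You can get the same flush-to-zero behavior on the CPU's SSE unit and watch the iteration count drop from 150 to 127.  This sketch uses the _MM_SET_FLUSH_ZERO_MODE macro from xmmintrin.h; whether it affects the scalar loop depends on the compiler using SSE math for ordinary floats (the default on x86-64):
#include <iostream>
#include <xmmintrin.h>

int count_halvings(void) {
  volatile float x=1.0f; /* volatile keeps the optimizer from folding the loop away */
  int itcount=0;
  while (x!=0.0f) { x=x*0.5f; itcount++; }
  return itcount;
}

int main(void) {
  std::cout<<"denormals on:  "<<count_halvings()<<" halvings\n"; /* typically 150 */
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); /* flush denormal results to zero, like the GPU */
  std::cout<<"flush-to-zero: "<<count_halvings()<<" halvings\n"; /* typically 127 */
  return 0;
}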