Floats!
CS 301 Lecture, Dr. Lawlor
Ordinary integers can only represent integral values.
"Floating-point numbers" can
represent fractions. This is useful for engineering,
science, statistics, graphics, and any time you need to represent
numbers from the real world, which are rarely integral!
Floats store numbers in an odd way--they're really storing the number
in scientific notation, like
x = + 3.785746 * 10^5
Note that:
- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an integer.
- The 3.785746 part can be stored as the integer 3785746, at least
as long as you can figure out where the decimal point goes.
Normalized Numbers
Scientific notation can represent the same number in several different
ways:
x = + 3.785746 * 10^5 = + 0.3785746 * 10^6
= + 0.03785746 * 10^7 = + 37.85746 * 10^4
It's common to "normalize" a number in scientific notation so that:
- There's exactly one digit to the left of the decimal point.
- And that digit ain't zero.
This means the 10^5 version is the "normal" way to write the
number above.
In binary, a "normalized" number *always* has a 1 at the left of the
decimal point. So there's no reason to even store the 1; you just
know it's there!
(Note that there are also "denormalized" numbers, like 0.0, that don't have
a leading 1; it's an implicit leading 0 if the exponent field is
zero...)
Float as Bits
Floats represent continuous values. But they do it using discrete
bits.
A "float" (as defined by IEEE Standard
754) consists of three bitfields:
| Sign | Exponent | Fraction (or "Mantissa") |
|------|----------|--------------------------|
| 1 bit: 0 for positive, 1 for negative | 8 unsigned bits: 127 means 2^0, 137 means 2^10 | 23 bits: a binary fraction. Don't forget the implicit leading 1! |
The sign is in the highest-order bit, the exponent in the next 8 bits,
and the fraction in the remaining bits.
The hardware interprets a float as having the value:
value = (-1)^sign * 2^(exponent-127) * 1.fraction
Note that the mantissa has an implicit leading binary 1 applied
(unless the exponent field is zero, when it's an implicit leading 0; a
"denormalized" number).
For example, the value "8" would be stored with sign bit 0, exponent
130 (==3+127), and mantissa 000... (without the leading 1), since:
8 = (-1)^0 * 2^(130-127) * 1.0000....
You can actually dissect the parts of a float using a "union" and a
bitfield like so:
#include <iostream>

/* IEEE floating-point number's bits: sign exponent mantissa.
   Field order assumes a little-endian machine like x86, where
   the first bitfield lands in the lowest-order bits. */
struct float_bits {
    unsigned int fraction:23; /**< Value is binary 1.fraction ("mantissa") */
    unsigned int exp:8; /**< Value is 2^(exp-127) */
    unsigned int sign:1; /**< 0 for positive, 1 for negative */
};
/* A union is a struct where all the fields *overlap* each other */
union float_dissector {
    float f;
    float_bits b;
};
int main() {
    float_dissector s;
    s.f=8.0;
    std::cout<<s.f<<"= sign "<<s.b.sign<<" exp "<<s.b.exp<<" fract "<<s.b.fraction<<"\n";
    return 0;
}
There are several different sizes of floating-point types:
| C Datatype | Size | Approx. Precision | Approx. Range | Exponent Bits | Fraction Bits | +-1 range |
|------------|------|-------------------|---------------|---------------|---------------|-----------|
| float | 4 bytes (everywhere) | 1.0x10^-7 | 10^38 | 8 | 23 | 2^24 |
| double | 8 bytes (everywhere) | 2.0x10^-15 | 10^308 | 11 | 52 | 2^53 |
| long double | 12-16 bytes (if it exists) | 2.0x10^-20 | 10^4932 | 15 | 64 | 2^65 |
Nowadays floats have roughly the same
performance as
integers:
addition takes a little over a nanosecond (slightly slower than integer
addition); multiplication takes a few nanoseconds; and division takes a
dozen or more nanoseconds. That is, floats are now cheap, and you
can consider using floats for all sorts of stuff--even when you don't
care about fractions.
Roundoff
They're funny old things, floats. The fraction part only stores
so much precision; further bits are lost. For example, in reality,
1.2347654 * 10^4 = 1.234 * 10^4 + 7.654 * 10^0
But if you keep only three decimal places, dropping the rest,
1.234 * 10^4 = 1.234 * 10^4 + 7.654 * 10^0
which is to say, adding a tiny value to a great big value might not
change the great big value at
all, because the tiny value gets lost when rounding off to 3
places. This "roundoff" has implications.
For example, adding one repeatedly will eventually stop doing anything:
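#include <cstdio>
#include <cmath>
int main() {
    float f=0.73;
    while (1) {
        volatile float g=f+1; /* volatile forces g to be rounded to float precision */
        if (g==f) {
            printf("f+1 == f at f=%.3f, or 2^%.3f\n",
                f,log(f)/log(2.0));
            return 0;
        }
        else f=g;
    }
}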
Recall that for integers, adding one repeatedly will *never* give you
the same value--eventually the integer will wrap around, but it won't
just stop moving like floats!
For another example, floating-point arithmetic isn't "associative"--if
you change the order
of operations, you change the result (up to roundoff):
1.2355308 * 10^4 = 1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0)
1.2355308 * 10^4 = (1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0
In other words, parentheses don't matter if you're computing the exact
result. But keeping only three decimal places after each operation,
1.235 * 10^4 = 1.234 * 10^4 + (7.654 * 10^0 + 7.654 * 10^0)
1.234 * 10^4 = (1.234 * 10^4 + 7.654 * 10^0) + 7.654 * 10^0
In the first line, the small values get added together, and
together they're enough to move the big value. But separately,
they splat like bugs against the windshield of the big value, and don't
affect it at all!
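#include <cstdio>
#include <cmath>
int main() {
    double lil=1.0;
    double big=pow(2.0,53); /* a double holds 53 significant bits, so big+lil rounds back to big */
    printf(" big+(lil+lil) -big = %.0f\n", big+(lil+lil) -big);
    printf("(big+lil)+lil -big = %.0f\n",(big+lil)+lil -big);
}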
#include <iostream>
int main() {
    float gnats=1.0;
    volatile float windshield=1<<24; /* 2^24: the float +-1 limit */
    float orig=windshield;
    for (int i=0;i<1000;i++)
        windshield += gnats; /* each gnat gets rounded away on impact */
    if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
    else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
    return 0;
}
In fact, if you've got a bunch of small values to add to a big value,
it's more roundoff-friendly to add all the small values together first,
then add them all to
the big value:
#include <iostream>
int main() {
    float gnats=1.0;
    volatile float windshield=1<<24; /* 2^24: the float +-1 limit */
    float orig=windshield;
    volatile float gnatcup=0.0;
    for (int i=0;i<1000;i++)
        gnatcup += gnats; /* small values add up exactly among themselves */
    windshield+=gnatcup; /* add all gnats to the windshield at once */
    if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
    else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
    return 0;
}
See page 80 of the textbook for piles of examples and details.