Floating Point Numbers and Roundoff
CS 301 Lecture, Dr. Lawlor
Normalized Numbers
In C++, "float" and "double" store numbers in an odd way--internally they're really storing the number
in scientific notation, like
x = + 3.785746 * 105
Note that:
- You only need one bit to represent the sign--plus or minus.
- The exponent's just an integer, so you can store it as an integer.
- The 3.785746 part, called the "mantissa" or just "fraction" part, can be stored as the integer 3785746 (at least
as long as you can figure out where the decimal point goes!). Of course, we store the fraction part as a binary number.
Scientific notation is designed to be compatible with slide rules (here's a circular slide rule demo);
slide rules are basically a log table starting at 1. This works
because log(1) = 0, and log(a) + log(b) = log(ab). But slide
rules only give you the mantissa; you need to figure out the exponent
yourself. The "order of magnitude" guess that engineers (and I)
like so much is just a calculation using zero significant digits--no
mantissa, all exponent.
One problem is scientific notation can represent the same number in several different
ways:
x = + 3.785746 * 105 = + 0.3785746
* 106 = + 0.03785746 * 107 = + 37.85746 * 104
It's common to "normalize" a number in scientific notation so that:
- There's exactly one digit to the left of the decimal point.
- And that digit ain't zero.
This means the 105 version above is the "normal" way to write the
number above.
In binary, a "normalized" number *always* has a 1 at the left of the
decimal point (if it ain't zero, it's gotta be one). So there's no reason to even store the 1; you just
know it's there!
Roundoff in Arithmetic
They're funny old things, floats. The fraction part (mantissa) only stores
so much precision; further bits are lost. For example, in reality,
1.234* 104 +
7.654* 100 = 1.2347654 * 104
But if we only keep three decimal places,
1.234* 104 +
7.654* 100 = 1.234 * 104
which is to say, adding a tiny value to a great big value might not
change the great big value at
all,
because the tiny value gets lost when rounding off to 3
places. To avoid this "roundoff error", when you're doing
arithmetic by hand, people recommend keeping lots of digits, and only
rounding once, at the end. But for a given value of "lots of
digits", did you keep enough?
For example, on a real computer adding one to a float repeatedly will eventually stop changing the float!
float f=0.73;
while (1) {
volatile float g=f+1;
if (g==f) {
std::cout<<"f+1 == f at f="<< f <<", or 2^"<< log(f)/log(2.0) <<std::endl;
return 0;
}
else f=g;
}
(Try this in NetRun now!)
For "double", you can add one more times, but eventually the double
will stop changing despite your additions. Recall that for
integers, adding one repeatedly will *never* give you
the same value--eventually the integer will wrap around, but it won't
just stop moving like floats!
This has really weird effects. For example, floating-point arithmetic isn't "associative"--if
you change the order
of operations, you change the result due to accumulated roundoff. In exact arithmetic:
1.2355308 * 104 = 1.234* 104 +
(7.654* 100 + 7.654* 100)
1.2355308 * 104 = (1.234* 104
+ 7.654* 100) + 7.654* 100
In other words, parenthesis don't matter if you're computing the exact
result. But to three decimal places,
1.235 * 104 = 1.234* 104 +
(7.654* 100 + 7.654* 100)
1.234 * 104 = (1.234* 104 +
7.654* 100) + 7.654* 100
In the first line, the small values get added together, and
together they're enough to move the big value. But separately,
they splat like bugs against the windshield of the big value, and don't
affect it at all!
double lil=1.0;
double big=pow(2.0,53); //<- carefully chosen for IEEE 64-bit float (52 bits of fraction + implicit 1)
std::cout<<" big+(lil+lil) -big = "<< big+(lil+lil) -big <<std::endl;
std::cout<<"(big+lil)+lil -big = "<< (big+lil)+lil -big <<std::endl;
(Try this in NetRun now!)
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
for (int i=0;i<1000;i++)
windshield += gnats;
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
(executable
NetRun link)
In fact, if you've got a bunch of small values to add to a big value,
it's more roundoff-friendly to add all the small values together first,
then add them all to
the big value:
float gnats=1.0;
volatile float windshield=1<<24;
float orig=windshield;
volatile float gnatcup=0.0;
for (int i=0;i<1000;i++)
gnatcup += gnats;
windshield+=gnatcup; /* add all gnats to the windshield at once */
if (windshield==orig) std::cout<<"You puny bugs can't harm me!\n";
else std::cout<<"Gnats added "<<windshield-orig<<" to the windshield\n";
(executable
NetRun link)
Roundoff can be very annoying. But it's not the end of the world
if you don't care
about exact answers, like in computer games, and even in many
simulations (where "exact" is unmeasureable anyway). You just
need to be able to estimate the amount of roundoff, and make sure it's
not too much.
However, the amount of roundoff depends on the precision you
keep in your numbers. This, in turn, depends on the size of the
numbers. For example, a "float" is just 4 bytes, so it's not
very precise. A "double" is 8 bytes, and so more precise.
A "long double" is 12 bytes (or more!), using more memory, but it's got tons of
precision. There's often a serious tradeoff between precision and
space (and time), so just using long double for everything isn't a good
idea: your program may get bigger and slower, and you still might not
have enough precision.
Roundoff in Representation
Sadly, 0.1 decimal is an infinitely repeating pattern in binary: 0.0(0011),
with 0011 repeating. This means multiplying by some *finite*
pattern to approximate 0.1 is only an approximation of really dividing
by the integer 10.0. The exact difference is proportional to the
precision of the numbers and the size of the input data:
for (int i=1;i<1000000000;i*=10) {
double mul01=i*0.1;
double div10=i/10.0;
double diff=mul01-div10;
std::cout<<"i="<<i<<" diff="<<diff<<"\n";
}
(executable NetRun link)
In a perfect world, multiplying by 0.1 and dividing by 10 would give
the exact same result. But in reality, 0.1 has to be approximated
by a finite series of binary digits, while the integer 10 can be stored
exactly, so on NetRun's Pentium4 CPU, this code gives:
i=1 diff=5.54976e-18
i=10 diff=5.55112e-17
i=100 diff=5.55112e-16
i=1000 diff=5.55112e-15
i=10000 diff=5.55112e-14
i=100000 diff=5.55112e-13
i=1000000 diff=5.54934e-12
i=10000000 diff=5.5536e-11
i=100000000 diff=5.54792e-10
Program complete. Return 0 (0x0)
That is, there's a factor of 10^-18 difference between double-precision 0.1 and the true 1/10! This can add up over time.
Roundoff Taking Over Control
One place roundoff is very annoying is in your control
structures. For example, this loop will execute *seven* times,
even though it looks like it should only execute *six* times:
std::cout.precision(20);
for (double k=0.0;k<1.0;k+=1.0/6.0) {
std::cout<<"k="<<k<<"\n";
}
(Try this in NetRun now!)
Notice that the very last iteration isn't quite 1.0, it's k=0.99999999999999988898. Sigh.
The trouble is of course that 1/6 can't be represented exactly in
floating-point, so if we add our approximation for 1/6 six times, we
haven't quite hit 1.0, so the loop executes one additional time.
There are several possible fixes for this:
- Don't use floating-point as your loop variable. Loop over an
integer i (without roundoff), and divide by six to get k. This is
the recommended approach if you care about the exact number of times
around the loop. Many coding standards mandate this approach.
- Or, only increment the loop by a value with a short binary
representation, like 0.125 (==0.001 in binary). This clearly
deserves a big obvious comment explaining what's subtle here.
- Or you could adjust the loop
termination condition so it's
"k<1.0-0.00001", where the "0.00001" provides some safety margin for
roundoff. This sort of "epsilon" value is common along
floating-point boundaries, although too small and you can still get
roundoff, and too big and you've screwed up the computation.
- Or you could use a lower-precision comparison, like
"(float)k<1.0f". This also provides roundoff margin, because
the comparison is taking place at the lower "float" precision.
Any of these fixes will work, but you do have to realize this is a
potential problem, and put the precision-compensation code in there!