Swapping Operations for Speed: Strength Reduction
CS 301 Lecture, Dr. Lawlor
Not every machine language instruction runs at the same speed.
"Heavy" operations, like memory accesses or divide instructions, run
substantially slower than "light" operations like additions or register
moves.
This program measures the time for various primitive operations.
#include <fstream>
#include <vector>  /* std::vector */
#include <cmath>   /* sqrt, cos */
#include <cstdlib> /* malloc, free */
/* print_time and time_in_seconds come from the NetRun timing support, not this listing */
int a=10,b=11;
double x=2.3, y=3.4;
int array[100];
int fn_constant(void) {return 0;}
int fn_mov(void) {return a;}
int fn_index(void) {return array[a];}
int fn_add(void) {return a+b;}
int fn_sub(void) {return a-b;}
int fn_not(void) {return ~a;}
int fn_and(void) {return a&b;}
int fn_or (void) {return a|b;}
int fn_xor(void) {return a^b;}
int fn_shl(void) {return a<<b;}
int fn_shr(void) {return a>>b;}
int fn_addshr(void) {return (a+b)>>b;}
int fn_mul(void) {return a*b;}
int fn_mad(void) {return a*b+a;}
int fn_div(void) {return a/b;}
int fn_call(void) {return fn_constant();}
int fn_fmov(void) {return (int)(x);}
int fn_fadd(void) {return (int)(x+y);}
int fn_fmul(void) {return (int)(x*y);}
int fn_fmad(void) {return (int)(x*y+x);}
int fn_fdiv(void) {return (int)(x/y);}
int fn_sqrt(void) {return (int)sqrt(x);}
int fn_cos(void) {return (int)cos(x);}
int fn_malloc(void) {free(malloc(4)); return 0;}
int fn_new(void) {delete new int; return 0;}
int fn_vector(void) {std::vector<int> v(a,3); return 0;}
int fn_timer(void) {return (int)time_in_seconds();}
int fn_ofstream(void) {std::ofstream f("foo.dat"); return 0;}
int foo(void) {
	print_time("constant",fn_constant);
	print_time("index",fn_index);
	print_time("mov",fn_mov);
	print_time("add",fn_add);
	print_time("sub",fn_sub);
	print_time("not",fn_not);
	print_time("and",fn_and);
	print_time("or",fn_or);
	print_time("xor",fn_xor);
	print_time("shl",fn_shl);
	print_time("shr",fn_shr);
	print_time("addshr",fn_addshr);
	print_time("mul",fn_mul);
	print_time("mad",fn_mad);
	print_time("div",fn_div);
	print_time("fmov",fn_fmov);
	print_time("fadd",fn_fadd);
	print_time("fmul",fn_fmul);
	print_time("fmad",fn_fmad);
	print_time("fdiv",fn_fdiv);
	print_time("call",fn_call);
	print_time("sqrt",fn_sqrt);
	print_time("cos",fn_cos);
	print_time("malloc",fn_malloc);
	print_time("new",fn_new);
	print_time("vector",fn_vector);
	print_time("timer",fn_timer);
	print_time("ofstream",fn_ofstream);
	return 0;
}
(executable NetRun link)
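The print_time and time_in_seconds helpers above aren't defined in the listing; they come from the NetRun environment these examples run in. To try the benchmark elsewhere, a minimal stand-in could look something like the sketch below (an assumption about what such a utility needs to do, not NetRun's actual implementation):

#include <cstdio>
#include <ctime>

/* CPU time in seconds: a simple stand-in for NetRun's time_in_seconds. */
double time_in_seconds(void) {
	return (double)std::clock()/CLOCKS_PER_SEC;
}

/* Call fn repeatedly until the run is long enough to measure reliably,
   then report the average time per call in nanoseconds. */
void print_time(const char *name,int (*fn)(void)) {
	long count=1;
	double elapsed=0.0;
	while (elapsed<0.01) { /* keep doubling until we hit roughly 10ms of work */
		count*=2;
		double start=time_in_seconds();
		for (long i=0;i<count;i++) fn();
		elapsed=time_in_seconds()-start;
	}
	std::printf("%s: %.2f ns/call\n",name,1.0e9*elapsed/count);
}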
Here's the output of this program on various machines. Times are
reported in nanoseconds per call, billionths of a second (1 ns = 10^-9 seconds).

| Operation | x86-64: Intel Q6600, 2.4GHz | x86-32: Intel Pentium 4, 2.8GHz | PowerPC: G5, 2GHz | ia64: Itanium, 800MHz | PowerPC: 604e, 200MHz |
|---|---|---|---|---|---|
| constant | 2.92 | 5.33 | 23.83 | 45.14 | 100.11 |
| index | 3.34 | 5.17 | 26.88 | 48.92 | 115.05 |
| mov | 3.33 | 5.09 | 16.89 | 46.42 | 100.11 |
| add | 3.34 | 5.16 | 19.02 | 45.16 | 105.12 |
| sub | 3.33 | 5.19 | 17.02 | 45.15 | 105.12 |
| not | 2.92 | 5.12 | 17.09 | 55.17 | 105.13 |
| and | 3.34 | 5.16 | 18.11 | 45.15 | 105.13 |
| or | 3.33 | 5.11 | 19.19 | 45.17 | 105.10 |
| xor | 3.33 | 5.16 | 18.35 | 45.14 | 105.12 |
| shl | 3.34 | 5.34 | 19.09 | 45.14 | 105.12 |
| shr | 3.33 | 5.12 | 18.08 | 48.90 | 105.12 |
| addshr | 3.34 | 5.05 | 19.08 | 55.18 | 105.10 |
| mul | 3.34 | 5.13 | 18.77 | 65.22 | 105.15 |
| mad | 3.33 | 5.56 | 17.52 | 66.47 | 110.05 |
| div | 5.00 | 16.78 | 20.85 | 133.09 | 195.22 |
| fmov | 2.92 | 5.14 | 27.22 | 61.46 | 155.09 |
| fadd | 3.33 | 8.99 | 17.61 | 76.61 | 175.23 |
| fmul | 3.34 | 9.33 | 23.83 | 68.96 | 175.11 |
| fmad | 3.33 | 10.92 | 31.10 | 68.99 | 175.20 |
| fdiv | 12.92 | 16.52 | 20.87 | 132.03 | 320.56 |
| call | 5.00 | 9.42 | 19.51 | 54.99 | 165.10 |
| sqrt | 24.18 | 17.28 | 35.22 | 141.82 | 781.37 |
| cos | 31.69 | 70.73 | 72.50 | 156.80 | 910.89 |
| malloc | 36.27 | 63.86 | 169.44 | 421.63 | 1120.97 |
| new | 41.69 | 83.28 | 186.45 | 441.96 | 1245.60 |
| vector | 54.81 | 161.34 | 267.62 | 371.20 | 617.67 |
| timer | 70.47 | 814.82 | 134.22 | 990.96 | 2903.34 |
| ofstream | 5760.81 | 13558.66 | 28818.38 | 27243.05 | 27621.16 |
Here's what we can see generally:
- Almost every integer arithmetic operation (+ - * & | ^ ~ << >>) takes
almost exactly the same amount of time--under half a nanosecond on
typical x86 machines! (The per-call times in the table include the
function call and return overhead; subtract the "constant" baseline to
estimate the cost of the arithmetic itself.) In fact, it's difficult to
measure the cost of a single arithmetic operation, because the CPU can
often overlap simple arithmetic with other instructions.
- The classic case of "strength reduction" was replacing "2*x"
with "x+x". When mul was slow and add was fast, this helped a
lot. Those days seem to be over! (A loop version of this
multiply-to-add replacement is sketched just after this list.)
- The one exception is that divides (as well as mod) are pretty
slow--over 10ns on some of the machines above! This is far more than
any other integer arithmetic operation.
- Array indexing is often very fast, under half a nanosecond.
- Calling a function is pretty slow, typically costing several nanoseconds.
- Especially on older machines, floating point is a little more expensive than integer, a few
nanoseconds per operation. Floating-point divides and square
roots are expensive, just like their integer counterparts. Floating-point
cosine is even more expensive--around 70ns on several of these machines!
- Memory allocation, whether via malloc/free, new/delete, or std::vector, is very expensive--tens to hundreds of nanoseconds!
- Timer (and other operating system) calls are yet more expensive--hundreds of nanoseconds or more!
- File operations are insanely expensive--tens of thousands of nanoseconds!
- With slower-clock-rate CPUs, times of most operations scale up
close to linearly (10x slower clock rate -> about 10x longer time).
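To make the multiply-to-add idea concrete, here is a hand-written example of the classic loop form of strength reduction (an illustration of the technique, not code from the timing program above; modern compilers usually perform this rewrite automatically):

/* Sum one column of a row-major n_rows x n_cols grid of ints. */
int sum_column(const int *grid,int n_rows,int n_cols,int col) {
	int sum=0;
	int index=col; /* running index replaces a per-iteration multiply */
	for (int row=0;row<n_rows;row++) {
		/* naive version: sum+=grid[row*n_cols+col]; (one multiply per pass) */
		sum+=grid[index];
		index+=n_cols; /* strength reduction: an add instead of a multiply */
	}
	return sum;
}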
In general, you can often replace expensive operations with cheaper
operations. For example, instead of doing a divide (10ns), you
can do an integer bit shift (<1ns) or a floating-point multiply by
reciprocal (x/y = x*(1.0/y)). Instead of making a function call,
use "inline" to have the compiler insert the machine code into the
caller. Instead of calling "cos", you might eliminate the operation
with an algebraic substitution (there are hundreds of useful trig identities),
or replace it with a table lookup (since array indexing is very
fast). Instead of calling new/delete every iteration, you might
reuse the allocated buffer and postpone the delete to the end of the
program. And anything you can do to avoid opening and closing
files repeatedly will dramatically improve performance!
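For instance, the cosine table lookup might look like the sketch below (the table size and the names build_cos_table and fast_cos are illustrative choices, not part of the lecture code; accuracy is limited by the table spacing):

#include <cmath>

const int COS_TABLE_SIZE=1024; /* power of two, so wrap-around is a single AND */
const double TWO_PI=6.283185307179586;
static float cos_table[COS_TABLE_SIZE];

/* Pay the expensive cos() calls once, up front. */
void build_cos_table(void) {
	for (int i=0;i<COS_TABLE_SIZE;i++)
		cos_table[i]=(float)std::cos(i*TWO_PI/COS_TABLE_SIZE);
}

/* Approximate cosine for non-negative angles: one multiply, one AND, one array index. */
inline float fast_cos(float radians) {
	int i=(int)(radians*(COS_TABLE_SIZE/TWO_PI));
	return cos_table[i&(COS_TABLE_SIZE-1)]; /* cheap wrap instead of fmod */
}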
The compiler is a master of this sort of operation replacement. For example, if you divide by a constant 8, the compiler will put in a bit shift:
unsigned int a=read_input(); /* read_input is a NetRun helper; a isn't known at compile time */
const unsigned int b=8; /* compile-time constant divisor */
int foo(void) {
	return a/b;
}
(Try this in NetRun now!)
Giving assembly for foo that boils down to a single logical shift right by 3, along these lines (exact register choices vary by compiler):
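mov eax,a      ; load the numerator
shr eax,0x3    ; unsigned divide by 8 is just a shift right by 3
ret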
If you divide by a constant 10, the compiler puts in an amazing fixed-point multiply-by-reciprocal, giving this assembly:
mov edx,0xcccccccd
mov eax,a
mul edx
mov eax,edx
shr eax,0x3
ret
What's happening here?
a/b = a*(1.0/b)
    = ( a*(2^35/b) ) / 2^35
    = ( a*(2^35/b) ) >> 35
    = ( a*0xcccccccd ) >> 32 >> 3
"mul" puts the high bits of the product into edx. By pulling out
only the high bits, we in effect right-shift by 32. We then need
to right-shift by 3 more bits to make up for the 235 scale factor.
Rounded to integer, (235/10) == 0xcccccccd. Doesn't
the roundoff cause problems? In general, yes, but here we just
want an integer output, and with the 235 scale factor, every integer has the right output from this "divide"!
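If you want to convince yourself the rounding really never matters, a brute-force check (my own test loop, not part of the lecture; it walks all four billion 32-bit inputs, which takes a few seconds) compares the shifted multiply against ordinary division:

#include <cstdint>
#include <cstdio>

int main(void) {
	const uint64_t magic=0xCCCCCCCDull; /* 2^35/10, rounded up to an integer */
	for (uint64_t a=0;a<=0xFFFFFFFFull;a++) {
		uint32_t fast=(uint32_t)((a*magic)>>35); /* multiply, then shift by 32+3 */
		if (fast!=(uint32_t)(a/10)) {
			std::printf("mismatch at %llu\n",(unsigned long long)a);
			return 1;
		}
	}
	std::printf("(a*0xCCCCCCCD)>>35 == a/10 for every 32-bit a\n");
	return 0;
}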
However, keep in mind that the compiler can only pull this trick *if*
it knows at compile time that you're dividing by 10. If you divide by a
value that isn't known until runtime, the compiler has no choice but to
emit a general "div" instruction.
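For comparison, reading the divisor at runtime (in the same NetRun style as the earlier snippet) leaves the compiler nothing to optimize; foo then compiles to an actual divide:

unsigned int a=read_input();
unsigned int b=read_input(); /* divisor unknown at compile time */
int foo(void) {
	return a/b; /* the compiler must emit a real "div" here */
}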
The other thing to keep in mind is that this technique depends on the
vagaries of the current circuit implementation of divide.
Different generations of CPUs have different implementations, and in
general everything has been getting faster, but the slow operations have
been catching up with the fast ones. Once, "mul" was considered slow,
but now it's nearly as fast as add for most purposes. Today,
"div" is slow, but that too may change in the future. Benchmark!