Swapping Operations for Speed: Strength Reduction

CS 301 Lecture, Dr. Lawlor

Not every machine language instruction runs at the same speed.  "Heavy" operations, like memory accesses or divide instructions, run substantially slower than "light" operations like an addition or a register move.
This program measures the time for various primitive operations.
#include <fstream>
#include <vector>
#include <cmath>
#include <cstdlib>
/* print_time and time_in_seconds are helper functions provided by NetRun */
int a=10,b=11;
double x=2.3, y=3.4;
int array[100];
int fn_constant(void) {return 0;}
int fn_mov(void) {return a;}
int fn_index(void) {return array[a];}
int fn_add(void) {return a+b;}
int fn_sub(void) {return a-b;}
int fn_not(void) {return ~a;}
int fn_and(void) {return a&b;}
int fn_or (void) {return a|b;}
int fn_xor(void) {return a^b;}
int fn_shl(void) {return a<<b;}
int fn_shr(void) {return a>>b;}
int fn_addshr(void) {return (a+b)>>b;}
int fn_mul(void) {return a*b;}
int fn_mad(void) {return a*b+a;}
int fn_div(void) {return a/b;}
int fn_call(void) {return fn_constant();}
int fn_fmov(void) {return (int)(x);}
int fn_fadd(void) {return (int)(x+y);}
int fn_fmul(void) {return (int)(x*y);}
int fn_fmad(void) {return (int)(x*y+x);}
int fn_fdiv(void) {return (int)(x/y);}
int fn_sqrt(void) {return (int)sqrt(x);}
int fn_cos(void) {return (int)cos(x);}
int fn_malloc(void) {free(malloc(4)); return 0;}
int fn_new(void) {delete new int; return 0;}
int fn_vector(void) {std::vector<int> v(a,3); return 0;}
int fn_timer(void) {return (int)time_in_seconds();}
int fn_ofstream(void) {std::ofstream f("foo.dat"); return 0;}

int foo(void) {
print_time("constant",fn_constant);
print_time("index",fn_index);
print_time("mov",fn_mov);
print_time("add",fn_add);
print_time("sub",fn_sub);
print_time("not",fn_not);
print_time("and",fn_and);
print_time("or",fn_or);
print_time("xor",fn_xor);
print_time("shl",fn_shl);
print_time("shr",fn_shr);
print_time("addshr",fn_addshr);
print_time("mul",fn_mul);
print_time("mad",fn_mad);
print_time("div",fn_div);
print_time("fmov",fn_fmov);
print_time("fadd",fn_fadd);
print_time("fmul",fn_fmul);
print_time("fmad",fn_fmad);
print_time("fdiv",fn_fdiv);
print_time("call",fn_call);
print_time("sqrt",fn_sqrt);
print_time("cos",fn_cos);
print_time("malloc",fn_malloc);
print_time("new",fn_new);
print_time("vector",fn_vector);
print_time("timer",fn_timer);
print_time("ofstream",fn_ofstream);
return 0;
}

(executable NetRun link)

Here's the output of this program on various machines, in the order listed.  Times are reported in nanoseconds, billionths of a second (1 ns = 10^-9 seconds).
x86-64: Intel Q6600, 2.4GHz
x86-32: Intel Pentium 4, 2.8GHz
PowerPC: G5, 2GHz
ia64: Itanium, 800MHz
PowerPC: 604e, 200MHz

x86-64: Intel Q6600, 2.4GHz:
constant: 2.92 ns/call
index: 3.34 ns/call
mov: 3.33 ns/call
add: 3.34 ns/call
sub: 3.33 ns/call
not: 2.92 ns/call
and: 3.34 ns/call
or: 3.33 ns/call
xor: 3.33 ns/call
shl: 3.34 ns/call
shr: 3.33 ns/call
addshr: 3.34 ns/call
mul: 3.34 ns/call
mad: 3.33 ns/call
div: 5.00 ns/call
fmov: 2.92 ns/call
fadd: 3.33 ns/call
fmul: 3.34 ns/call
fmad: 3.33 ns/call
fdiv: 12.92 ns/call
call: 5.00 ns/call
sqrt: 24.18 ns/call
cos: 31.69 ns/call
malloc: 36.27 ns/call
new: 41.69 ns/call
vector: 54.81 ns/call
timer: 70.47 ns/call
ofstream: 5760.81 ns/call

x86-32: Intel Pentium 4, 2.8GHz:
constant: 5.33 ns/call
index: 5.17 ns/call
mov: 5.09 ns/call
add: 5.16 ns/call
sub: 5.19 ns/call
not: 5.12 ns/call
and: 5.16 ns/call
or: 5.11 ns/call
xor: 5.16 ns/call
shl: 5.34 ns/call
shr: 5.12 ns/call
addshr: 5.05 ns/call
mul: 5.13 ns/call
mad: 5.56 ns/call
div: 16.78 ns/call
fmov: 5.14 ns/call
fadd: 8.99 ns/call
fmul: 9.33 ns/call
fmad: 10.92 ns/call
fdiv: 16.52 ns/call
call: 9.42 ns/call
sqrt: 17.28 ns/call
cos: 70.73 ns/call
malloc: 63.86 ns/call
new: 83.28 ns/call
vector: 161.34 ns/call
timer: 814.82 ns/call
ofstream: 13558.66 ns/call

PowerPC: G5, 2GHz:
constant: 23.83 ns/call
index: 26.88 ns/call
mov: 16.89 ns/call
add: 19.02 ns/call
sub: 17.02 ns/call
not: 17.09 ns/call
and: 18.11 ns/call
or: 19.19 ns/call
xor: 18.35 ns/call
shl: 19.09 ns/call
shr: 18.08 ns/call
addshr: 19.08 ns/call
mul: 18.77 ns/call
mad: 17.52 ns/call
div: 20.85 ns/call
fmov: 27.22 ns/call
fadd: 17.61 ns/call
fmul: 23.83 ns/call
fmad: 31.10 ns/call
fdiv: 20.87 ns/call
call: 19.51 ns/call
sqrt: 35.22 ns/call
cos: 72.50 ns/call
malloc: 169.44 ns/call
new: 186.45 ns/call
vector: 267.62 ns/call
timer: 134.22 ns/call
ofstream: 28818.38 ns/call

ia64: Itanium, 800MHz:
constant: 45.14 ns/call
index: 48.92 ns/call
mov: 46.42 ns/call
add: 45.16 ns/call
sub: 45.15 ns/call
not: 55.17 ns/call
and: 45.15 ns/call
or: 45.17 ns/call
xor: 45.14 ns/call
shl: 45.14 ns/call
shr: 48.90 ns/call
addshr: 55.18 ns/call
mul: 65.22 ns/call
mad: 66.47 ns/call
div: 133.09 ns/call
fmov: 61.46 ns/call
fadd: 76.61 ns/call
fmul: 68.96 ns/call
fmad: 68.99 ns/call
fdiv: 132.03 ns/call
call: 54.99 ns/call
sqrt: 141.82 ns/call
cos: 156.80 ns/call
malloc: 421.63 ns/call
new: 441.96 ns/call
vector: 371.20 ns/call
timer: 990.96 ns/call
ofstream: 27243.05 ns/call

PowerPC: 604e, 200MHz:
constant: 100.11 ns/call
index: 115.05 ns/call
mov: 100.11 ns/call
add: 105.12 ns/call
sub: 105.12 ns/call
not: 105.13 ns/call
and: 105.13 ns/call
or: 105.10 ns/call
xor: 105.12 ns/call
shl: 105.12 ns/call
shr: 105.12 ns/call
addshr: 105.10 ns/call
mul: 105.15 ns/call
mad: 110.05 ns/call
div: 195.22 ns/call
fmov: 155.09 ns/call
fadd: 175.23 ns/call
fmul: 175.11 ns/call
fmad: 175.20 ns/call
fdiv: 320.56 ns/call
call: 165.10 ns/call
sqrt: 781.37 ns/call
cos: 910.89 ns/call
malloc: 1120.97 ns/call
new: 1245.60 ns/call
vector: 617.67 ns/call
timer: 2903.34 ns/call
ofstream: 27621.16 ns/call
Here's the general pattern: you can often replace an expensive operation with a cheaper one.  For example, instead of doing a divide (10ns), you can do an integer bit shift (<1ns) or a floating-point multiply by the reciprocal (x/y == x*(1.0/y)).  Instead of making a function call, use "inline" so the compiler inserts the function's machine code directly into the caller.  Instead of calling "cos", you might eliminate the operation entirely with an algebraic substitution (there are hundreds of useful trig identities), or replace it with a table lookup (since array indexing is very fast).  Instead of calling new/delete every iteration, you might reuse the allocated buffer and postpone the delete until the end of the program.  And anything you can do to avoid repeatedly opening and closing files will dramatically improve performance!
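Two of these substitutions, written out as a sketch (the function and table names here are made up for illustration):

```cpp
#include <cmath>

/* Divide by a power of two -> bit shift.
   Safe for unsigned values; signed negatives round differently under >> than /. */
unsigned int div8(unsigned int x) { return x >> 3; } /* same result as x/8 */

/* cos -> table lookup: trades accuracy for speed.
   256 entries assumed; a power of 2 so we can wrap the index with &. */
const int TABLE_SIZE = 256;
const double PI = 3.14159265358979;
float cos_table[TABLE_SIZE];

void init_cos_table(void) {
	for (int i = 0; i < TABLE_SIZE; i++)
		cos_table[i] = (float)cos(i * 2.0 * PI / TABLE_SIZE);
}

float cos_lookup(float radians) {
	/* Map the angle to a table index; & wraps both large and negative angles. */
	int i = (int)(radians * (TABLE_SIZE / (2.0 * PI))) & (TABLE_SIZE - 1);
	return cos_table[i];
}
```

The lookup is only accurate to about one table step (2*PI/256 radians here), so this trade is only a win when you can tolerate the error.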

The compiler is a master of this sort of operation replacement.  For example, if you divide by a constant 8, the compiler will put in a bit shift:
unsigned int a=read_input();
const unsigned int b=8;
int foo(void) {
return a/b;
}

(Try this in NetRun now!)

Giving this assembly for foo:

mov    eax,a
shr    eax,0x3
ret
If you divide by a constant 10, the compiler puts in an amazing fixed-point multiply-by-reciprocal, giving this assembly:
mov    edx,0xcccccccd
mov    eax,a
mul    edx
mov    eax,edx
shr    eax,0x3
ret
What's happening here?
    a/b = a*(1.0/b)
        = ( a*(2^35/b) ) / 2^35
        = ( a*(2^35/b) ) >> 35
        = ( a*0xcccccccd ) >> 32 >> 3
"mul" puts the high bits of the product into edx.  By keeping only the high bits, we in effect right-shift by 32.  We then right-shift by 3 more bits to make up for the 2^35 scale factor.

Rounded to integer, (2^35/10) == 0xcccccccd.  Doesn't the roundoff cause problems?  In general, yes, but here we just want an integer output, and with the 2^35 scale factor, every integer input gets the right output from this "divide"!
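You can check this claim by writing the same trick out in C++, using a 64-bit product in place of mul's edx:eax register pair (a sketch; the function name is mine):

```cpp
#include <cstdint>

/* The compiler's divide-by-10: multiply by the rounded reciprocal
   0xCCCCCCCD (about 2^35/10), then shift right 35 bits. */
uint32_t div10(uint32_t a) {
	uint64_t product = (uint64_t)a * 0xCCCCCCCDu; /* full 64-bit product, like x86 "mul" */
	return (uint32_t)(product >> 35);             /* >>32 (keep the high word), then >>3 */
}
```

Looping this over a range of inputs (or all 2^32 of them, with some patience) and comparing against a/10 shows the answers really do agree.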

However, keep in mind that the compiler can only pull this trick *if* it knows that you're dividing by 10.  If you divide by some unknown value, the compiler can't help but put in a general "div" instruction.
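Even then, you can sometimes apply the substitution yourself.  For example, if a floating-point divisor is the same on every loop iteration, you can hoist the reciprocal out of the loop, a sketch:

```cpp
/* Divisor y is only known at run time, so the compiler emits a divide
   per iteration here: */
void scale_all_slow(float *arr, int n, float y) {
	for (int i = 0; i < n; i++) arr[i] = arr[i] / y;
}

/* One divide up front, then a cheap multiply per iteration: */
void scale_all_fast(float *arr, int n, float y) {
	float inv_y = 1.0f / y;
	for (int i = 0; i < n; i++) arr[i] = arr[i] * inv_y;
}
```

The compiler won't normally do this for you, because x*(1.0f/y) isn't bit-for-bit identical to x/y; with relaxed float rules (e.g. gcc's -ffast-math) it will.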

The other thing to keep in mind is that this technique depends on the vagaries of the current circuit implementation of divide.  Different generations of CPUs have different implementations; in general everything has been getting faster, but the slow operations have been catching up with the fast ones.  Once, "mul" was considered slow, but on most machines it's now nearly as fast as an add.  Today, "div" is slow, but this may change in the future.  Benchmark!