Swapping Operations for Speed: Strength Reduction
CS 301 Lecture, Dr. Lawlor
Not every machine language instruction runs at the same speed.
"Heavy" operations, like memory accesses or divide instructions, run
substantially slower than "light" operations like additions or register
moves.
This program measures the time for various primitive operations.
#include <fstream>
#include <vector>  /* std::vector */
#include <cmath>   /* sqrt, cos */
#include <cstdlib> /* malloc, free */
/* print_time and time_in_seconds come from the NetRun timing support, not this listing */
int a=10,b=11;
double x=2.3, y=3.4;
int array[100];
int fn_constant(void) {return 0;}
int fn_mov(void) {return a;}
int fn_index(void) {return array[a];}
int fn_add(void) {return a+b;}
int fn_sub(void) {return a-b;}
int fn_not(void) {return ~a;}
int fn_and(void) {return a&b;}
int fn_or (void) {return a|b;}
int fn_xor(void) {return a^b;}
int fn_shl(void) {return a<<b;}
int fn_shr(void) {return a>>b;}
int fn_addshr(void) {return (a+b)>>b;}
int fn_mul(void) {return a*b;}
int fn_mad(void) {return a*b+a;}
int fn_div(void) {return a/b;}
int fn_call(void) {return fn_constant();}
int fn_fmov(void) {return (int)(x);}
int fn_fadd(void) {return (int)(x+y);}
int fn_fmul(void) {return (int)(x*y);}
int fn_fmad(void) {return (int)(x*y+x);}
int fn_fdiv(void) {return (int)(x/y);}
int fn_sqrt(void) {return (int)sqrt(x);}
int fn_cos(void) {return (int)cos(x);}
int fn_malloc(void) {free(malloc(4)); return 0;}
int fn_new(void) {delete new int; return 0;}
int fn_vector(void) {std::vector<int> v(a,3); return 0;}
int fn_timer(void) {return (int)time_in_seconds();}
int fn_ofstream(void) {std::ofstream f("foo.dat"); return 0;}
int foo(void) {
	print_time("constant",fn_constant);
	print_time("index",fn_index);
	print_time("mov",fn_mov);
	print_time("add",fn_add);
	print_time("sub",fn_sub);
	print_time("not",fn_not);
	print_time("and",fn_and);
	print_time("or",fn_or);
	print_time("xor",fn_xor);
	print_time("shl",fn_shl);
	print_time("shr",fn_shr);
	print_time("addshr",fn_addshr);
	print_time("mul",fn_mul);
	print_time("mad",fn_mad);
	print_time("div",fn_div);
	print_time("fmov",fn_fmov);
	print_time("fadd",fn_fadd);
	print_time("fmul",fn_fmul);
	print_time("fmad",fn_fmad);
	print_time("fdiv",fn_fdiv);
	print_time("call",fn_call);
	print_time("sqrt",fn_sqrt);
	print_time("cos",fn_cos);
	print_time("malloc",fn_malloc);
	print_time("new",fn_new);
	print_time("vector",fn_vector);
	print_time("timer",fn_timer);
	print_time("ofstream",fn_ofstream);
	return 0;
}
(executable NetRun link)
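The print_time and time_in_seconds helpers above aren't defined in the listing; they come from the NetRun environment these examples run in. To try the benchmark elsewhere, a minimal stand-in could look something like the sketch below (an assumption about what such a utility needs to do, not NetRun's actual implementation):

#include <cstdio>
#include <ctime>

/* CPU time in seconds: a simple stand-in for NetRun's time_in_seconds. */
double time_in_seconds(void) {
	return (double)std::clock()/CLOCKS_PER_SEC;
}

/* Call fn repeatedly until the run is long enough to measure reliably,
   then report the average time per call in nanoseconds. */
void print_time(const char *name,int (*fn)(void)) {
	long count=1;
	double elapsed=0.0;
	while (elapsed<0.01) { /* keep doubling until we hit roughly 10ms of work */
		count*=2;
		double start=time_in_seconds();
		for (long i=0;i<count;i++) fn();
		elapsed=time_in_seconds()-start;
	}
	std::printf("%s: %.2f ns/call\n",name,1.0e9*elapsed/count);
}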
Here's the output of this program on various machines. Times are
reported in nanoseconds per call, billionths of a second (1 ns = 10^-9 seconds).

| Operation | x86-64: Intel Q6600, 2.4GHz | x86-32: Intel Pentium 4, 2.8GHz | PowerPC: G5, 2GHz | ia64: Itanium, 800MHz | PowerPC: 604e, 200MHz |
|---|---|---|---|---|---|
| constant | 2.92 | 5.33 | 23.83 | 45.14 | 100.11 |
| index | 3.34 | 5.17 | 26.88 | 48.92 | 115.05 |
| mov | 3.33 | 5.09 | 16.89 | 46.42 | 100.11 |
| add | 3.34 | 5.16 | 19.02 | 45.16 | 105.12 |
| sub | 3.33 | 5.19 | 17.02 | 45.15 | 105.12 |
| not | 2.92 | 5.12 | 17.09 | 55.17 | 105.13 |
| and | 3.34 | 5.16 | 18.11 | 45.15 | 105.13 |
| or | 3.33 | 5.11 | 19.19 | 45.17 | 105.10 |
| xor | 3.33 | 5.16 | 18.35 | 45.14 | 105.12 |
| shl | 3.34 | 5.34 | 19.09 | 45.14 | 105.12 |
| shr | 3.33 | 5.12 | 18.08 | 48.90 | 105.12 |
| addshr | 3.34 | 5.05 | 19.08 | 55.18 | 105.10 |
| mul | 3.34 | 5.13 | 18.77 | 65.22 | 105.15 |
| mad | 3.33 | 5.56 | 17.52 | 66.47 | 110.05 |
| div | 5.00 | 16.78 | 20.85 | 133.09 | 195.22 |
| fmov | 2.92 | 5.14 | 27.22 | 61.46 | 155.09 |
| fadd | 3.33 | 8.99 | 17.61 | 76.61 | 175.23 |
| fmul | 3.34 | 9.33 | 23.83 | 68.96 | 175.11 |
| fmad | 3.33 | 10.92 | 31.10 | 68.99 | 175.20 |
| fdiv | 12.92 | 16.52 | 20.87 | 132.03 | 320.56 |
| call | 5.00 | 9.42 | 19.51 | 54.99 | 165.10 |
| sqrt | 24.18 | 17.28 | 35.22 | 141.82 | 781.37 |
| cos | 31.69 | 70.73 | 72.50 | 156.80 | 910.89 |
| malloc | 36.27 | 63.86 | 169.44 | 421.63 | 1120.97 |
| new | 41.69 | 83.28 | 186.45 | 441.96 | 1245.60 |
| vector | 54.81 | 161.34 | 267.62 | 371.20 | 617.67 |
| timer | 70.47 | 814.82 | 134.22 | 990.96 | 2903.34 |
| ofstream | 5760.81 | 13558.66 | 28818.38 | 27243.05 | 27621.16 |
Here's what we can see generally:
- Almost every integer arithmetic operation (+ - * & | ^ ~ << >>) takes
almost exactly the same amount of time--under half a nanosecond on
typical x86 machines! (The per-call times in the table include the
function call and return overhead; subtract the "constant" baseline to
estimate the cost of the arithmetic itself.) In fact, it's difficult to
measure the cost of a single arithmetic operation, because the CPU can
often overlap simple arithmetic with other instructions.
- The classic case of "strength reduction" was replacing "2*x"
with "x+x". When mul was slow and add was fast, this helped a
lot. Those days seem to be over! (A loop version of this
multiply-to-add replacement is sketched just after this list.)
- The one exception is that divides (as well as mod) are pretty
slow--over 10ns on some of the machines above! This is far more than
any other integer arithmetic operation.
- Array indexing is often very fast, under half a nanosecond.
- Calling a function is pretty slow, typically costing several nanoseconds.
- Especially on older machines, floating point is a little more expensive than integer, a few
nanoseconds per operation. Floating-point divides and square
roots are expensive, just like their integer counterparts. Floating-point
cosine is even more expensive--around 70ns on several of these machines!
- Memory allocation, whether via malloc/free, new/delete, or std::vector, is very expensive--tens to hundreds of nanoseconds!
- Timer (and other operating system) calls are yet more expensive--hundreds of nanoseconds or more!
- File operations are insanely expensive--tens of thousands of nanoseconds!
- With slower-clock-rate CPUs, times of most operations scale up
close to linearly (10x slower clock rate -> about 10x longer time).
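To make the multiply-to-add idea concrete, here is a hand-written example of the classic loop form of strength reduction (an illustration of the technique, not code from the timing program above; modern compilers usually perform this rewrite automatically):

/* Sum one column of a row-major n_rows x n_cols grid of ints. */
int sum_column(const int *grid,int n_rows,int n_cols,int col) {
	int sum=0;
	int index=col; /* running index replaces a per-iteration multiply */
	for (int row=0;row<n_rows;row++) {
		/* naive version: sum+=grid[row*n_cols+col]; (one multiply per pass) */
		sum+=grid[index];
		index+=n_cols; /* strength reduction: an add instead of a multiply */
	}
	return sum;
}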
In general, you can often replace expensive operations with cheaper
operations. For example, instead of doing a divide (10ns), you
can do an integer bit shift (<1ns) or a floating-point multiply by
reciprocal (x/y = x*(1.0/y)). Instead of making a function call,
use "inline" to have the compiler insert the machine code into the
caller. Instead of calling "cos", you might eliminate the operation
with an algebraic substitution (there are hundreds of useful trig identities),
or replace it with a table lookup (since array indexing is very
fast). Instead of calling new/delete every iteration, you might
reuse the allocated buffer and postpone the delete to the end of the
program. And anything you can do to avoid opening and closing
files repeatedly will dramatically improve performance!
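For instance, the cosine table lookup might look like the sketch below (the table size and the names build_cos_table and fast_cos are illustrative choices, not part of the lecture code; accuracy is limited by the table spacing):

#include <cmath>

const int COS_TABLE_SIZE=1024; /* power of two, so wrap-around is a single AND */
const double TWO_PI=6.283185307179586;
static float cos_table[COS_TABLE_SIZE];

/* Pay the expensive cos() calls once, up front. */
void build_cos_table(void) {
	for (int i=0;i<COS_TABLE_SIZE;i++)
		cos_table[i]=(float)std::cos(i*TWO_PI/COS_TABLE_SIZE);
}

/* Approximate cosine for non-negative angles: one multiply, one AND, one array index. */
inline float fast_cos(float radians) {
	int i=(int)(radians*(COS_TABLE_SIZE/TWO_PI));
	return cos_table[i&(COS_TABLE_SIZE-1)]; /* cheap wrap instead of fmod */
}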
The compiler is a master of this sort of operation replacement. For example, if you divide by a constant 8, the compiler will put in a bit shift:
unsigned int a=read_input(); /* read_input is a NetRun helper; a isn't known at compile time */
const unsigned int b=8; /* compile-time constant divisor */
int foo(void) {
	return a/b;
}
(Try this in NetRun now!)
Giving assembly for foo that boils down to a single logical shift right by 3, along these lines (exact register choices vary by compiler):
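mov eax,a      ; load the numerator
shr eax,0x3    ; unsigned divide by 8 is just a shift right by 3
ret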
If you divide by a constant 10, the compiler puts in an amazing fixed-point multiply-by-reciprocal, giving this assembly:
mov edx,0xcccccccd
mov eax,a
mul edx
mov eax,edx
shr eax,0x3
ret
What's happening here?
a/b = a*(1.0/b)
    = ( a*(2^35/b) ) / 2^35
    = ( a*(2^35/b) ) >> 35
    = ( a*0xcccccccd ) >> 32 >> 3
"mul" puts the high bits of the product into edx. By pulling out
only the high bits, we in effect right-shift by 32. We then need
to right-shift by 3 more bits to make up for the 235 scale factor.
Rounded to integer, (235/10) == 0xcccccccd. Doesn't
the roundoff cause problems? In general, yes, but here we just
want an integer output, and with the 235 scale factor, every integer has the right output from this "divide"!
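If you want to convince yourself the rounding really never matters, a brute-force check (my own test loop, not part of the lecture; it walks all four billion 32-bit inputs, which takes a few seconds) compares the shifted multiply against ordinary division:

#include <cstdint>
#include <cstdio>

int main(void) {
	const uint64_t magic=0xCCCCCCCDull; /* 2^35/10, rounded up to an integer */
	for (uint64_t a=0;a<=0xFFFFFFFFull;a++) {
		uint32_t fast=(uint32_t)((a*magic)>>35); /* multiply, then shift by 32+3 */
		if (fast!=(uint32_t)(a/10)) {
			std::printf("mismatch at %llu\n",(unsigned long long)a);
			return 1;
		}
	}
	std::printf("(a*0xCCCCCCCD)>>35 == a/10 for every 32-bit a\n");
	return 0;
}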
However, keep in mind that the compiler can only pull this trick *if*
it knows at compile time that you're dividing by 10. If you divide by a
value that isn't known until runtime, the compiler has no choice but to
emit a general "div" instruction.
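For comparison, reading the divisor at runtime (in the same NetRun style as the earlier snippet) leaves the compiler nothing to optimize; foo then compiles to an actual divide:

unsigned int a=read_input();
unsigned int b=read_input(); /* divisor unknown at compile time */
int foo(void) {
	return a/b; /* the compiler must emit a real "div" here */
}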
The other thing to keep in mind is that this technique depends on the
vagaries of the current circuit implementation of divide.
Different generations of CPUs have different implementations, and in
general everything has been getting faster, but the slow operations have
been catching up with the fast ones. Once, "mul" was considered slow,
but now it's nearly as fast as add for most purposes. Today,
"div" is slow, but that too may change in the future. Benchmark!