Optimizing Code: Strength Reduction and Pipelining
CS 301 Lecture, Dr. Lawlor
Strength Reduction: Replacing Expensive Operations with Cheap Ones
So say you've got a slow operation, like a divide, that you've got to
do again and again. A common trick is to replace that slow divide
with something faster, like a right-shift (>>), that computes the
same value. This technique is called "strength reduction",
because you're replacing the "full strength" divide with something
simpler and usually more specific.
For example, say you've got a divide by 8 in your code somewhere:
unsigned int i=...;
return i/8;
(executable NetRun link)
If you look at the disassembly, the compiler actually implements this
not as a divide, but as a right-shift by 3. This works since 2^3 == 8, so (i>>3) == (i/8). It's better to do the bitshift because divide is so darn slow.
push ebp
mov ebp,esp
sub esp,0x8
call read_input
shr eax,0x3
leave
ret
(The compiler has to do a funny compare-and-add in order to make
rounding work out properly for negative values, so this is simpler with
unsigned values.)
In general, almost anything you want to do with a power of 2 can be replaced by some simpler bitwise operation:
Integer Strength Reduction

| Slow Version:     | i*2 | i*2  | i*4  | i*8  | i*12          | i/8  | i%8 |
|-------------------|-----|------|------|------|---------------|------|-----|
| "Faster" Version: | i+i | i<<1 | i<<2 | i<<3 | (i<<2)+(i<<3) | i>>3 | i&7 |
Note: "Faster" is in quotes because YOU MUST
TIME THESE TO SEE IF THEY'RE REALLY FASTER! Compilers and CPUs
are complicated beasts, and usually the obvious way is best.
For example, if you can use constants (an actual number, a #define, a
"const int", or an "enum") in your operations, the compiler knows all
these tricks and more, and the compiler will often automatically apply
them. This is good. But for values
computed at runtime (e.g., read from a file, or computed on demand),
you may have to apply strength reduction yourself.
Compilers are particularly silly about floating-point optimizations.
Floating-Point Strength Reduction (the compiler won't use these!)

| Slow Version:     | x/2.0 | x/3.7       | a+b*x+c*x*x+d*x*x*x | y=sqrt(x*x) | y=x*sqrt(1024.0) | if (sqrt(x)<y) ...             |
|-------------------|-------|-------------|---------------------|-------------|------------------|--------------------------------|
| "Faster" Version: | x*0.5 | x*(1.0/3.7) | a+x*(b+x*(c+x*d))   | y=abs(x)    | y=x*32.0         | if (x<y*y) ...                 |
| Be careful of:    |       | Roundoff!   | Dependency chain    | Sign change | Get sqrt right!  | Negative x, or overflow in y*y |
Often mathematical identities (e.g., trig identities like sin(x)^2 + cos(x)^2
== 1) can let you substantially simplify your code.
The compiler's usually terrified of calls like sin, cos, and sqrt, and
is so scared of roundoff that it won't even mess with silly stuff
like x/2.0, even though x*0.5 is exactly the same value!
Pipelining
Back in 1985, around the era of the Intel 80386, each CPU instruction
(like "mov") had a "cycle count", which was just the total number of
CPU clock cycles that the instruction took from start to finish.
You could look this up right in the Intel documentation. "mov" took one
clock cycle, "add" took one clock cycle, "mul" took 30 clock cycles,
etc.
This whole cycle-count scheme broke down around 1989 with the 486
processor. The problem is that the 486 was actually able to start
working on the next
instruction before the first instruction was finished. This idea
of working on several things at once is amazingly important and
powerful, but it started very small--the 486 could multiply *and* add
simultaneously.
The idea is for the CPU to treat its various arithmetic circuits the
same way you'd treat a washer and dryer. For the sake of
argument, assume you've got:
- Five loads of laundry A B C D E
- They all need to be washed, dried, and folded
- Load E's already clean, and just needs to be dried and folded.
- Each load takes an hour to wash, dry, or fold
Your control program reads:
Wash A
Dry A
Fold A
Wash B
Dry B
...
Fold D
// don't need to wash E
Dry E
Fold E
Here's how a 386 would execute this program:
| Hour   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|--------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| Washer | A |   |   | B |   |   | C |   |   | D  |    |    |    |    |
| Dryer  |   | A |   |   | B |   |   | C |   |    | D  |    | E  |    |
| Fold   |   |   | A |   |   | B |   |   | C |    |    | D  |    | E  |
It's an all-day job, which is clearly stupid, because only one thing is happening at once.
The sensible thing is to make a "pipeline", like an assembly line,
where you start washing load B as soon as you move load A over to the
dryer. So here's how a 486 would do the same work:
| Hour   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|
| Washer | A | B | C | D |   |   |   |
| Dryer  |   | A | B | C | D | E |   |
| Fold   |   |   | A | B | C | D | E |
That is, you start washing load D at the same time as you start folding
load B. This way, everything's busy, which speeds up the overall
program substantially.
Notice how long it takes to dry C--it really takes an hour, but imagine
you leave that instruction out. You can't start drying D, since
it's still in the washer. You could start folding C immediately,
but you're busy folding B. So if you leave out "dry C", the
overall program is the *same speed*. This means "dry C" is basically
free--zero clock cycles!
Back in the 1990's, it seemed like zero-clock instructions were really profound, in a zen kind of way.
Note that in hour 3, we're running the washer, dryer, and folder
simultaneously. That is, three instructions are executing at
once! This is called "superscalar" execution, because there's
more than a single scalar value executing at the same time.
Out-of-Order Execution
To get the best performance on an old 486 or Pentium, you actually had
to reorder the instructions in your program so they matched the
pipelined execution:
Wash A
Wash B
Dry A
Wash C
Dry B
Fold A
...
This is so annoying that CPU designers began figuring out ways to
extract a pipelined schedule from an ordinary program. This
mostly boils down to "do work when the input is ready". So Fold A
waits until Dry A is complete, but Wash B can go ahead before either
one.
In other words, the CPU might execute instructions in a totally
different order than you wrote them in the program. This
out-of-order execution makes for a really complicated CPU, but it helps
get a lot more out of pipelined execution. The 1995 Pentium Pro,
Pentium II, and higher are all out of order.
Out-of-order execution helps in several interesting ways. For
example, we can actually start drying load E before doing anything
else, so an out-of-order CPU might execute:
| Hour   | 1 | 2 | 3 | 4 | 5 | 6 |
|--------|---|---|---|---|---|---|
| Washer | A | B | C | D |   |   |
| Dryer  | E | A | B | C | D |   |
| Fold   |   | E | A | B | C | D |
Somehow, human beings can do this sort of scheduling really
easily. But out-of-order is a pretty recent innovation in
CPUs--1995 or so is the earliest. The Itanium II is actually
still in-order, but it's got really funky instructions.
Modern CPUs actually search between 50 and 200 instructions ahead to
find new instructions to execute. Yes, that actually helps!
Here's an Ars Technica article on the design of Intel's Core CPU.
There's a good summary of the "Architecture des Ordinateurs" at this Chinese copy of French coursenotes.