Optimizing Code: Strength Reduction and Pipelining
CS 301 Lecture, Dr. Lawlor
Strength Reduction: Replacing Expensive Operations with Cheap Ones
So say you've got a slow operation, like a divide, that you've got to
do again and again. A common trick is to replace that slow divide
with something faster, like a right-shift (>>), that computes the
same value. This technique is called "strength reduction",
because you're replacing the "full strength" divide with something
simpler and usually more specific.
For example, say you've got a divide by 8 in your code somewhere:
unsigned int i=...;
return i/8;
(executable NetRun link)
If you look at the disassembly, the compiler actually implements this
not as a divide, but as a right-shift by 3. This works since 2^3 == 8, so (i>>3) == (i/8). It's better to do the bitshift because divide is so darn slow.
push ebp
mov ebp,esp
sub esp,0x8
call read_input
shr eax,0x3
leave
ret
(The compiler has to do a funny compare-and-add in order to make
rounding work out properly for negative values, so this is simpler with
unsigned values.)
In general, almost anything you want to do with a power of 2 can be replaced by some simpler bitwise operation:
Integer Strength Reduction

| Slow Version:     | i*2 | i*2  | i*4  | i*8  | i*12          | i/8  | i%8 |
|-------------------|-----|------|------|------|---------------|------|-----|
| "Faster" Version: | i+i | i<<1 | i<<2 | i<<3 | (i<<2)+(i<<3) | i>>3 | i&7 |
Note: "Faster" is in quotes because YOU MUST
TIME THESE TO SEE IF THEY'RE REALLY FASTER! Compilers and CPUs
are complicated beasts, and usually the obvious way is best.
For example, if you can use constants (an actual number, a #define, a
"const int", or an "enum") in your operations, the compiler knows all
these tricks and more, and the compiler will often automatically apply
them. This is good. But for values
computed at runtime (e.g., read from a file, or computed on demand),
you may have to apply strength reduction yourself.
Compilers are particularly silly about floating-point optimizations.
Floating-Point Strength Reduction (the compiler won't use these!)

| Slow Version:     | x/2.0 | x/3.7       | a+b*x+c*x*x+d*x*x*x | y=sqrt(x*x) | y=x*sqrt(1024.0) | if (sqrt(x)<y) ...             |
|-------------------|-------|-------------|---------------------|-------------|------------------|--------------------------------|
| "Faster" Version: | x*0.5 | x*(1.0/3.7) | a+x*(b+x*(c+x*d))   | y=abs(x)    | y=x*32.0         | if (x<y*y) ...                 |
| Be careful of:    |       | Roundoff!   | Dependency chain    | Sign change | Get sqrt right!  | Negative x, or overflow in y*y |
Often mathematical identities (e.g., trig identities like sin(x)^2 + cos(x)^2
== 1) can let you substantially simplify your code.
The compiler's usually terrified of calls like sin, cos, and sqrt, and
is so scared of roundoff that it won't even mess with silly stuff
like x/2.0, even though x*0.5 is exactly the same value!
Pipelining
Back in 1985, around the era of the Intel 80386, each CPU instruction
(like "mov") had a "cycle count", which was just the total number of
CPU clock cycles that the instruction took from start to finish.
You could look this up right in the Intel documentation. "mov" took one
clock cycle, "add" took one clock cycle, "mul" took 30 clock cycles,
etc.
This whole cycle-count scheme broke down around 1989 with the 486
processor. The problem is that the 486 was actually able to start
working on the next
instruction before the first instruction was finished. This idea
of working on several things at once is amazingly important and
powerful, but it started very small--the 486 could multiply *and* add
simultaneously.
The idea is for the CPU to treat its various arithmetic circuits the
same way you'd treat a washer and dryer. For the sake of
argument, assume you've got:
- Five loads of laundry A B C D E
- They all need to be washed, dried, and folded
- Load E's already clean, and just needs to be dried and folded.
- Each load takes an hour to wash, dry, or fold
Your control program reads:
Wash A
Dry A
Fold A
Wash B
Dry B
...
Fold D
// don't need to wash E
Dry E
Fold E
Here's how a 386 would execute this program:
| Hour   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|--------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| Washer | A |   |   | B |   |   | C |   |   | D  |    |    |    |    |
| Dryer  |   | A |   |   | B |   |   | C |   |    | D  |    | E  |    |
| Fold   |   |   | A |   |   | B |   |   | C |    |    | D  |    | E  |
It's an all-day job, which is clearly stupid, because only one thing is happening at once.
The sensible thing is to make a "pipeline", like an assembly line,
where you start washing load B as soon as you move load A over to the
dryer. So here's how a 486 would do the same work:
| Hour   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|--------|---|---|---|---|---|---|---|
| Washer | A | B | C | D |   |   |   |
| Dryer  |   | A | B | C | D | E |   |
| Fold   |   |   | A | B | C | D | E |
That is, you start washing load D at the same time as you start folding
load B. This way, everything's busy, which speeds up the overall
program substantially.
Notice how long it takes to dry C--it really takes an hour, but imagine
you leave that instruction out. You can't start drying D, since
it's still in the washer. You could start folding C immediately,
but you're busy folding B. So if you leave out "dry C", the
overall program is the *same speed*. This means "dry C" is basically
free--zero clock cycles!
Back in the 1990's, it seemed like zero-clock instructions were really profound, in a zen kind of way.
Note that in hour 3, we're running the washer, dryer, and folder
simultaneously. That is, three instructions are executing at
once! This is called "superscalar" execution, because there's
more than a single scalar value executing at the same time.
Out-of-Order Execution
To get the best performance on an old 486 or Pentium, you actually had
to reorder the instructions in your program so they matched the
pipelined execution:
Wash A
Wash B
Dry A
Wash C
Dry B
Fold A
...
This is so annoying that CPU designers began figuring out ways to
extract a pipelined schedule from an ordinary program. This
mostly boils down to "do work when the input is ready". So Fold A
waits until Dry A is complete, but Wash B can go ahead before either
one.
In other words, the CPU might execute instructions in a totally
different order than you wrote them in the program. This
out-of-order execution makes for a really complicated CPU, but it helps
get a lot more out of pipelined execution. The 1995 Pentium Pro,
Pentium II, and higher are all out of order.
Out-of-order execution helps in several interesting ways. For
example, we can actually start drying load E before doing anything
else, so an out-of-order CPU might execute:
| Hour   | 1 | 2 | 3 | 4 | 5 | 6 |
|--------|---|---|---|---|---|---|
| Washer | A | B | C | D |   |   |
| Dryer  | E | A | B | C | D |   |
| Fold   |   | E | A | B | C | D |
Somehow, human beings can do this sort of scheduling really
easily. But out-of-order is a pretty recent innovation in
CPUs--1995 or so is the earliest. The Itanium II is actually
still in-order, but it's got really funky instructions.
Modern CPUs actually search between 50 and 200 instructions ahead to
find new instructions to execute. Yes, that actually helps!
Here's an Ars Technica article on the design of Intel's Core CPU.
There's a good summary of the "Architecture des Ordinateurs" at this Chinese copy of French coursenotes.