OOOE: How does it work?
Basically, the processor is able to execute instructions out of the order in which they are given, in order to save time. Wikipedia gives a good outline of the process (http://en.wikipedia.org/wiki/Out-of-order_execution); a toy C sketch of this flow follows the list.
1. Instruction fetch.
2. Instruction dispatch to an instruction queue (also called an instruction buffer or reservation stations).
3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
4. The instruction is issued to the appropriate functional unit and executed by that unit.
5. The results are queued.
6. Only after all older instructions have their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
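To make the queue-and-issue flow above concrete, here is a minimal sketch in C. The three-instruction program, the single issue port, and the fixed load latency are all our own toy assumptions, not any real processor's design:

/* A toy reservation-station loop: instructions wait until their input
 * operands are ready, then issue, regardless of program order. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 8

typedef struct {
    const char *text; /* for printing */
    int src1, src2;   /* input registers */
    int dest;         /* output register */
    bool done;
} Instr;

int main(void) {
    /* r1 is "slow" (say, a pending load), so instruction 0 stalls while a
     * later, independent instruction issues ahead of it. r4-r6 are not
     * ready because no instruction has produced them yet. */
    bool ready[NUM_REGS] = {true, false, true, true, false, false, false, false};

    Instr queue[] = {
        {"add r4 = r1 + r2", 1, 2, 4, false}, /* waits on r1 */
        {"add r5 = r2 + r3", 2, 3, 5, false}, /* issues first, out of order */
        {"add r6 = r4 + r5", 4, 5, 6, false}, /* waits on r4 */
    };
    int n = 3, completed = 0, cycle = 0;

    while (completed < n) {
        cycle++;
        for (int i = 0; i < n; i++) {
            Instr *in = &queue[i];
            if (!in->done && ready[in->src1] && ready[in->src2]) {
                printf("cycle %d: issue %s\n", cycle, in->text);
                ready[in->dest] = true; /* result written back */
                in->done = true;
                completed++;
                break; /* one issue per cycle in this toy model */
            }
        }
        if (cycle == 3) ready[1] = true; /* the slow load finally returns */
    }
    return 0;
}

Running it shows the second instruction issuing in cycle 1, before the first instruction (which issues in cycle 4, once r1 arrives).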
So with this type of execution, we save lots of time. But there are some drawbacks: "The dependencies of all instructions need to be determined, and instructions cannot execute before or at the same time as instructions on which they are dependent." (http://en.wikibooks.org/wiki/Microprocessor_Design/Out_Of_Order_Execution)
Side note: register renaming is used to avoid false dependencies that would otherwise make OOOE impossible. As we've talked about in class, register renaming means there are more physical registers than are defined by the architecture.
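A minimal sketch of the idea, assuming a toy machine with 4 architectural and 8 physical registers (our numbers, chosen only for illustration):

/* Register renaming: every write to an architectural register gets a
 * fresh physical register, so a later rewrite of r1 no longer conflicts
 * (WAW/WAR) with an earlier use of r1. */
#include <stdio.h>

#define ARCH_REGS 4
#define PHYS_REGS 8

int rename_table[ARCH_REGS] = {0, 1, 2, 3}; /* arch reg -> current phys reg */
int next_free = ARCH_REGS;                  /* toy bump allocator */

/* Rename one instruction "rd = rs op rt": sources read the CURRENT
 * mapping, the destination gets a brand-new physical register. */
void rename(const char *op, int rd, int rs, int rt) {
    int ps = rename_table[rs], pt = rename_table[rt];
    int pd = next_free++ % PHYS_REGS; /* real hardware uses a free list */
    rename_table[rd] = pd;
    printf("%s r%d = r%d, r%d   ->   %s p%d = p%d, p%d\n",
           op, rd, rs, rt, op, pd, ps, pt);
}

int main(void) {
    rename("add", 1, 2, 3); /* writes r1 */
    rename("sub", 0, 1, 2); /* reads the new r1 */
    rename("add", 1, 2, 2); /* rewrites r1: WAW with #1, WAR with #2 */
    /* After renaming, the two writes to r1 target different physical
     * registers, so the third instruction can execute out of order
     * without clobbering the value the second instruction reads. */
    return 0;
}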
Hyperthreading Example
Wikibooks gives a pretty understandable example:
"In a hyperthreaded system, it appears to the operating system that there are two separate processors, when there is only one processor. The OOOE engine feeds instructions from the two separate execution threads to the execution cores, to try and keep all of the cores simultaneously busy. In general hyperthreading increases performance, although in some instances this additional complexity actually decreases performance."
History
The first machine to use OOOE was the CDC 6600, which actually used a scoreboard. That was in 1964, and about three years later the IBM 360/91 supported full OOOE. Now we jump to 1990, when IBM made the first OOOE microprocessor, the POWER1; it only reordered floating-point instructions, though. The '90s were when OOOE really took hold. "Throughout the 1990s out-of-order execution became more common, and was featured in the IBM/Motorola PowerPC 601 (1993), Fujitsu/HAL SPARC64 (1995), Intel Pentium Pro (1995), MIPS R10000 (1996), HP PA-8000 (1996), AMD K5 (1996) and DEC Alpha 21264 (1998). Notable exceptions to this trend include the Sun UltraSPARC, HP/Intel Itanium, Transmeta Crusoe, Intel Atom, and the IBM POWER6." (Wikipedia)

Branch Prediction
History
Originally branch prediction wasn't needed, because the earliest computers took multiple cycles to perform any instruction. However, as computers evolved, pipelining became important to speeding up a program's performance, and this created a need for the computer to figure out how a program branches or jumps. Branches are points in a program where execution could follow two or more paths, such as if-then-else statements or for loops.
The basic idea of branch prediction started when Tom McWilliams and Curt Widdoes introduced two-bit predictors. A two-bit predictor uses a simple counter from 0 to 3 to predict whether or not a branch instruction is taken. If the counter is at 0 or 1, the predictor predicts that the branch is not taken. If it is at 2 or 3, it predicts that the branch will be taken and that the next instruction should come from the taken path. If the prediction is wrong, the correct instructions are fetched and the counter is moved by one toward the actual outcome. This method is accurate for predicting branches in loops; a sketch of it in C follows. (http://everything2.com/title/2-bit+predictor)
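Here is the scheme above as a minimal C sketch; the 20-iteration loop pattern is our own example input:

/* The 0-3 counter described above: predict taken when the counter is
 * 2 or 3, then nudge the counter toward the actual outcome,
 * saturating at 0 and 3. */
#include <stdbool.h>
#include <stdio.h>

int counter = 0; /* 0,1 = predict not taken; 2,3 = predict taken */

bool predict(void) { return counter >= 2; }

void update(bool taken) {
    if (taken && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    /* A loop branch: taken 9 times, then not taken once, repeated.
     * Once warmed up, the counter mispredicts only the final,
     * not-taken iteration of each pass, which is why this scheme
     * works so well for loops. */
    bool outcomes[20];
    for (int i = 0; i < 20; i++) outcomes[i] = (i % 10 != 9);

    int wrong = 0;
    for (int i = 0; i < 20; i++) {
        bool p = predict();
        if (p != outcomes[i]) wrong++;
        update(outcomes[i]);
    }
    printf("mispredictions: %d / 20\n", wrong);
    return 0;
}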
Earlier, IBM had built the IBM 7030, also known as the STRETCH. The STRETCH included many different organizational techniques, including precoding, memory operand prefetch, out-of-order execution, branch misprediction recovery, and precise interrupts. The STRETCH pre-executed all branches that used the index registers, which caused a significant number of branch mispredictions. These branch mispredictions caused the STRETCH to perform poorly. (http://www.computer-history.info/Page4.dir/pages/IBM.7030.Stretch.dir/index.html)
The Burroughs B4900 used branch prediction history state stored in the in-memory instructions during a program's execution. The scheme provided 4-state branch prediction, using different branch opcodes to represent the branch history states. This particular scheme predicted branches correctly 93% of the time.
Types of Branches | Conditional                               | Unconditional
Direct            | if-then-else, for loops (bnez, beqz, etc.) | procedural calls (jal), gotos (j)
Indirect          | (none)                                    | returns (jr), function pointers (jalr)
(http://www.cs.virginia.edu/~skadron/cs654/slides/bpred.ppt)
Early RISC processors such as MIPS and SPARC originally did only not-taken branch prediction, meaning that any time a program could branch, the processor assumed it did not. This didn't matter at first, because the original processors could only fetch one instruction per clock cycle. As the processors evolved they kept the old branch predictor; since it was a bad branch predictor, it slowed the processors down, because a mispredicted branch cost four clock cycles to resolve.
The Intel Pentium III and Pentium 4 both had a Branch Prediction Unit. The Pentium III uses local branch prediction, which keeps a history buffer per branch to predict its outcome, and has about a 90% accuracy rate. The Pentium 4 uses a more efficient branch prediction algorithm and has about a 94% accuracy rate. (http://www.slcentral.com/articles/01/8/p4/)
Today, Intel's Core i7 uses an array that stores the results of branches as the program runs, plus an algorithm that attempts to determine the next branch. The new algorithm uses a two-level predictor: the first level stores recent branch history, and the second level has slower access but stores more branch history. This is good for branch prediction over large volumes of code, like databases. (http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-4.html)
AMD processors use a two-level predictor like Intel's. AMD uses a mixture of a global branch predictor and saturating counters. The global branch predictor uses the history of all of the jumps in the program. The saturating counter, like the two-bit predictor above, keeps a count to decide whether or not the branch is taken. A sketch of this combination follows the reference below.
http://www.omnipelagos.com/entry?n=branch_predictor
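As a concrete stand-in for the global-predictor-plus-saturating-counter combination described above, here is a minimal "gshare"-style sketch in C. The table size, the XOR indexing, and the example branch pattern are our assumptions for illustration; AMD's actual tables and sizes differ:

/* A gshare-style predictor: the global history of recent branch
 * outcomes is XORed with the branch's address to pick one of many
 * 2-bit saturating counters. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t counters[TABLE_SIZE]; /* 2-bit counters, all start at 0 */
static uint32_t history;             /* global outcome history, newest bit last */

static uint32_t index_of(uint32_t pc) {
    return (pc ^ history) & (TABLE_SIZE - 1);
}

bool predict(uint32_t pc) { return counters[index_of(pc)] >= 2; }

void update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[index_of(pc)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    history = (history << 1) | (taken ? 1 : 0); /* record the outcome */
}

int main(void) {
    /* Two branches whose outcomes correlate: the second is taken exactly
     * when the first was. A global history can learn this pattern; a
     * lone per-branch counter cannot. */
    int wrong = 0, total = 0;
    for (int i = 0; i < 1000; i++) {
        bool first = (i % 2 == 0);
        bool p1 = predict(0x400100);
        if (p1 != first) wrong++;
        update(0x400100, first);
        bool p2 = predict(0x400200);
        if (p2 != first) wrong++;
        update(0x400200, first);
        total += 2;
    }
    printf("mispredictions: %d / %d\n", wrong, total);
    return 0;
}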
Rationale
The main idea behind branch prediction is to stop the program from coming to a halt every time a conditional jump could occur. Since about 15-25% of a program's instructions are ones that can potentially branch, it is important to find a way to speed the program up, because otherwise it would waste clock cycles doing very little. (http://www.cs.virginia.edu/~skadron/cs654/slides/bpred.ppt) So, to take advantage of the fact that most processors can do multiple instructions per clock cycle, the branch predictor tries to figure out whether or not a conditional jump is taken, which significantly increases the performance of the program.
There are two main types of branch prediction: static and dynamic. In static branch prediction, the prediction is made before runtime and never changes. In dynamic branch prediction, the predictions may change throughout the program's execution. A static example follows.
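As a concrete example of the static side, GCC and Clang provide the __builtin_expect builtin, which lets the programmer bias the compiler's static layout of a branch (the hardware may still predict it dynamically at runtime):

/* Static prediction in practice: __builtin_expect tells the compiler
 * which way a branch usually goes, so it can lay out the common path
 * as the fall-through; the prediction is fixed before runtime, unlike
 * the hardware's dynamic predictors. */
#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value) {
    if (unlikely(value < 0)) {      /* error path: statically "not taken" */
        fprintf(stderr, "bad value\n");
        return -1;
    }
    return value * 2;               /* common path, laid out fall-through */
}

int main(void) {
    printf("%d\n", process(21));
    printf("%d\n", process(-1));
    return 0;
}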
It is important to create an efficient and accurate branch predictor. A good branch predictor increases the number of instructions available to the scheduler. It increases the level of parallelism in the program, which allows more instructions to be spread over a shorter time. It also allows more instructions to be completed while waiting for a branch to resolve.
http://www.cs.virginia.edu/~skadron/cs654/slides/bpred.ppt
http://www.slcentral.com/articles/01/8/p4/
Advantages
The main advantage of branch prediction is that it keeps the pipeline full of instructions, which allows the instructions to be completed faster. It does this by allowing more instructions to be completed in a single clock cycle, in turn dramatically increasing the efficiency of the processor.
http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/
Disadvantages
One of the major problems with branch prediction is what happens when the system branches when it doesn't need to. This is called branch misprediction, and it affects the processor in two ways. First, it causes the processor to waste time executing instructions that are on the mispredicted path. Second, the mispredicted path's instructions must be flushed, inserting bubbles into the pipeline until the proper instructions overwrite it. The main penalty for mispredicting a branch is that a significant number of cycles are wasted: anywhere from 1 cycle (on very simple designs) to about 40 cycles (on deeply pipelined designs with very complicated branch predictors). A small experiment demonstrating the penalty follows the references.
http://www.cs.nyu.edu/goldberg/pubs/gchp02.pdf
http://www.slcentral.com/articles/01/8/p4/
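A small experiment that makes the misprediction penalty visible: summing only the large values of an array is much faster once the array is sorted, because the branch becomes predictable. The threshold of 128 and the repetition counts are arbitrary choices of ours; compile with a low optimization level (e.g. -O1), since at higher levels the compiler may replace the branch with branch-free code:

/* Misprediction cost demo: the same loop over the same values, once in
 * random order (the branch is a coin flip, so the predictor fails
 * often) and once sorted (all not-taken, then all taken). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static long sum_big(const int *data) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        if (data[i] >= 128)   /* this branch is what the predictor sees */
            sum += data[i];
    return sum;
}

int main(void) {
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = 0;
    for (int r = 0; r < 100; r++) s1 += sum_big(data);   /* random order */
    clock_t t1 = clock();

    qsort(data, N, sizeof data[0], cmp);                 /* sorted order */
    long s2 = 0;
    for (int r = 0; r < 100; r++) s2 += sum_big(data);
    clock_t t2 = clock();

    printf("unsorted: %.2fs  sorted: %.2fs  (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}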