Super-Scalar Execution: Multiple Instructions/Clock

CS 441 Lecture, Dr. Lawlor

Here's the obvious way to compute the factorial of 12:
int i, fact=1;
for (i=1;i<=12;i++) {
    fact*=i;
}
return fact;

(Try this in NetRun now!)

Here's a modified version where we separately compute even and odd factorials, then multiply them at the end:

int i, factO=1, factE=1;
for (i=1;i<=12;i+=2) {
    factO*=i;
    factE*=i+1;
}
return factO*factE;

(Try this in NetRun now!)
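Here's a quick sanity check that the two versions really compute the same thing. This is just a sketch: the loops are the ones above, wrapped in functions so they can be compared outside NetRun. The names fact_obvious and fact_even_odd are made up for this example.

```c
/* Sketch: the two loops from the text, wrapped as functions.
   The function names are hypothetical, chosen for this example. */

/* One accumulator: each multiply must wait for the previous one. */
int fact_obvious(void) {
    int i, fact = 1;
    for (i = 1; i <= 12; i++) {
        fact *= i;
    }
    return fact;
}

/* Two accumulators: the odd and even chains are independent. */
int fact_even_odd(void) {
    int i, factO = 1, factE = 1;
    for (i = 1; i <= 12; i += 2) {
        factO *= i;       /* 1*3*5*7*9*11  */
        factE *= i + 1;   /* 2*4*6*8*10*12 */
    }
    return factO * factE; /* one final multiply joins the chains */
}
```

Both functions should return 479001600, which is 12! and still fits in a 32-bit int.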

This modification makes the code "superscalar friendly": the two multiplies in each loop iteration no longer depend on each other, so a superscalar CPU can execute them simultaneously.  Note that this isn't simply loop unrolling, which by itself gives a net loss of performance; it's a higher-level transformation that exposes the parallelism in the problem.
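For contrast, here's roughly what plain loop unrolling would look like. This is a sketch (the name fact_unrolled is made up): the loop body is doubled, but there's still only one accumulator, so the second multiply can't start until the first one finishes.

```c
/* Sketch: plain 2x unrolling with a single accumulator.
   The function name is hypothetical, chosen for this example. */
int fact_unrolled(void) {
    int i, fact = 1;
    for (i = 1; i <= 12; i += 2) {
        fact *= i;       /* depends on fact from the previous iteration */
        fact *= i + 1;   /* depends on the multiply on the line above */
    }
    return fact;
}
```

The result is the same as before, but every multiply still sits on one long dependency chain. Splitting into factO and factE is what breaks the chain in two.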

| Hardware | Obvious | Superscalar | Savings | Discussion |
|---|---|---|---|---|
| Intel 486, 50MHz, 1991 | 5000ns | 5400ns | -10% | Classic non-pipelined CPU: many clocks/instruction.  The superscalar transform just makes the code slower, because the hardware isn't superscalar. |
| Intel Pentium III, 1133MHz, 2002 | 59.6ns | 50.1ns | +16% | Pipelined CPU: the integer unit is fully pipelined, so we get one instruction per clock cycle.  The P3 is also weakly superscalar, but the benefit is small. |
| Intel Pentium 4, 2.8GHz, 2005 | 22.6ns | 15.0ns | +33% | Virtually all of the improvement here is due to the P4's much higher clock rate. |
| Intel Q6600, 2.4GHz, 2008 | 16.7ns | 9.4ns | +43% | Lower clock rate, but fewer pipeline stages leads to better overall performance. |
| Intel Sandy Bridge i5-2400, 3.1GHz, 2011 | 11.8ns | 5.3ns | +55% | Higher clock rate and better tuned superscalar execution.  Superscalar transform gives a substantial benefit--with everything else getting faster, the remaining dependencies become more and more important. |
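
If you want to try timings like these on your own machine outside NetRun, here's a minimal sketch of a timing harness. Everything in it is an assumption made for illustration, not how NetRun measures: the names ns_per_call, limit, and sink are made up, and it uses the standard C clock() function, which is coarse, so you need many repetitions to get a stable number. The loop bound is read from a volatile variable so the compiler can't just precompute 12! at compile time.

```c
#include <time.h>

/* Hypothetical harness, not NetRun's timer. */
volatile int limit = 12;  /* volatile: stops the compiler precomputing the answer */
volatile int sink;        /* volatile: stops the compiler deleting the calls */

int fact_obvious(int n) {
    int i, fact = 1;
    for (i = 1; i <= n; i++) {
        fact *= i;
    }
    return fact;
}

/* Assumes n is even, as in the lecture's n=12 example. */
int fact_even_odd(int n) {
    int i, factO = 1, factE = 1;
    for (i = 1; i <= n; i += 2) {
        factO *= i;
        factE *= i + 1;
    }
    return factO * factE;
}

/* Average CPU time per call of fn(limit), in nanoseconds. */
double ns_per_call(int (*fn)(int), long reps) {
    long r;
    clock_t start = clock();
    for (r = 0; r < reps; r++) {
        sink = fn(limit);
    }
    clock_t end = clock();
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Calling ns_per_call(fact_obvious, 10000000) and ns_per_call(fact_even_odd, 10000000) gives two numbers you can compare. Expect them to vary with the compiler, optimization flags, and CPU, so they won't necessarily match the table.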