int i, fact=1;
for (i=1;i<=12;i++) {
fact*=i;
}
return fact;
Here's a modified version where we separately compute even and odd factorials, then multiply them at the end:
int i, factO=1, factE=1;
for (i=1;i<=12;i+=2) {
factO*=i;
factE*=i+1;
}
return factO*factE;
This modification makes the code "superscalar
friendly", so it's possible to execute the loop's multiply instructions
simultaniously. Note that this isn't simply a loop unrolling,
which gives a net loss of performance, it's a higher-level
transformation to expose parallelism in the problem.Hardware |
Obvious |
Superscalar |
Savings |
Discussion |
Intel 486, 50MHz, 1991 |
5000ns |
5400ns |
-10% |
Classic non-pipelined CPU: many clocks/instruction. The superscalar transform just makes the code slower, because the hardware isn't superscalar. |
Intel Pentium III, 1133MHz, 2002 |
59.6ns |
50.1ns |
+16% |
Pipelined CPU: the integer
unit is fully pipelined, so we get one instruction per clock
cycle. The P3 is also weakly superscalar, but the benefit is small. |
Intel Pentium 4, 2.8Ghz, 2005 |
22.6ns |
15.0ns |
+33% |
Virtually all of the improvement here is due to the P4's much higher clock rate. |
Intel Q6600, 2.4GHz, 2008 |
16.7ns |
9.4ns | +43% |
Lower clock rate, but fewer pipeline stages leads to better overall performance. |
Intel Sandy Bridge i5-2400 3.1Ghz 2011 |
11.8ns |
5.3ns |
+55% |
Higher clock rate and better tuned superscalar execution.
Superscalar transform gives a substantial benefit--with everything else
getting faster, the remaining dependencies become more and more
important. |