void add_to_floats(float *dest,int n) {
int i;
for (i=0;i<n;i++) dest[i]+=1.0;
}
void add_to_ints(int *dest,int n) {
int i;
for (i=0;i<n;i++) dest[i]+=1;
}
#define n 1000
float some_floats[n]; int some_ints[n];
int add_floats(void) { add_to_floats(some_floats,n); return 0;}
int add_ints(void) { add_to_ints(some_ints,n); return 0;}
int foo(void) {
int i; double per_call;
for (i=0;i<n;i++) some_floats[i]=0.0f;
per_call=time_function(add_ints);
printf("%.2f ns/int\n",1.0e9*per_call/n);
per_call=time_function(add_floats);
printf("%.2f ns/float\n",1.0e9*per_call/n);
return 0;
}
Here's the performance of this code on various hardware (all on NetRun):
Hardware |
ns/int |
ns/float |
ns/clock |
instructions per loop |
clocks per instruction | Discussion |
Intel 486, 50MHz, 1991 |
181ns/int |
763ns/float |
20ns/clock |
4 (int) 8 (float) |
3 (int) 5 (float) |
Classic non-pipelined CPU: many clocks/instruction. |
MIPS R5000, 180MHz, 1996 |
51ns/int |
114ns/float |
5.5ns/clock |
9 (int) 11 (float) |
1 (int) 2 (float) |
Classic fully pipelined CPU: one
instruction/clock. Note how there are more instructions than
CISC, but each instruction runs faster! |
PowerPC G4, 768MHz, 2001 |
8.3 ns/int |
10.2 ns/float |
1.3ns/clock |
8 (int) 9 (float) |
0.8 (int) 0.87 (float) |
Superscalar RISC CPU: multiple
instructions per clock cycle. The PowerPC wasn't amazingly good
at doing superscalar work yet, but it was superscalar. |
Intel Pentium III, 1133MHz, 2002 |
3.58 ns/int |
2.87 ns/float |
0.88ns/clock |
4 (int) 7 (float) |
1 (int) 0.5 (float) |
Superscalar x86 CPU: the integer
unit is fully pipelined, so we get one instruction per clock
cycle. But floating point runs *simultaneously* with the integer
stuff! |
Intel Pentium 4, 2.8Ghz, 2005 |
1.22 ns/int |
1.57 ns/float |
0.36ns/clock |
4 (int) 6 (float) |
0.84 (int) 0.73 (float) |
Modern CPUs are able to run even
integer code superscalar. Yet some code actually takes more clock
cycles per instruction on the Pentium 4, due to its deep pipeline. |
Intel Q6600, 2.4GHz, 2008 |
0.84 ns/int |
0.85 ns/float |
0.42ns/clock |
4 (int) 6 (float) |
0.5 (int) 0.33 (float) |
We're even more superscalar, executing 2 or 3 instructions per clock cycle. |