Programming for High Performance

One of the biggest reasons to even think about computer architecture or assembly language is performance: if you understand what the machine is doing, it's a lot easier to do it faster.

What is Fast?

Speed is in the eye of the beholder--in this case, the user. But there's a big disconnect between computer speed and human speed:

Units		Seconds	Examples
nanosecond	ns	10^-9	Execute a few instructions. Call a function. 1 clock cycle, at 1GHz clock rate. Light travels about 1 foot.
microsecond	us	10^-6	Print to the screen (cout). Call the operating system. Blink an LED.
millisecond	ms	10^-3	Send data across a fast network. Start seek on fast hard drive. Bullet travels about 1 foot.
second	s	1	Humans can respond to input.
hour	hr	3.6x10³	Sysadmin arrives to reboot server. Write trivial script or one-off program. Glacier travels up to 1 foot.
week	wk	0.6x10⁶	Write simple single-person program. Run typical supercomputer batch job.
year	yr	0.22x10⁹	Write typical multi-person program. Mean time between hardware failures. Continent drifts about 1 inch.

Because computers are so fast, and humans are so slow:

Often you can make things *seem* faster by starting them sooner. Don't wait until the user clicks the "Run" button, you can start a preemptive calculation as soon as the mouse is *over* the button, or even moving *toward* the button!
For slow programs, often batch processing is more user-friendly. In batch mode you first collect all the data you'll need, then compute it all; this lets the human wander away, rather than babysit an interactive program.
Sometimes it's faster to run a slow program right now, than spend the time to write a fast program. This means it might be rational to leave most of the program logic in Microsoft Excel.
Sometimes it's faster overall to write a program in a slow language (like Python), than to write a more complex program in a fast language (like C++ or assembly).

Measure First

The simplest tool for figuring out what's slow is a timer: a function that returns the current real time ("wall-clock" time). It's a really bad idea to try to optimize code without being able to measure its performance, because intuition just isn't reliable (at least, mine isn't). At least half of the "optimizations" I try either don't help measurably, or actually slow the program down!

NetRun has a builtin function called "time_in_seconds", which returns a double giving the number of seconds. Here's the implementation (from project/lib/inc.c):

/** Return the current time in seconds (since something or other). */
#if defined(WIN32)
#  include <sys/timeb.h>
double time_in_seconds(void) { /* This seems to give terrible resolution (60ms!) */
        struct _timeb t;
        _ftime(&t);
        return t.millitm*1.0e-3+t.time*1.0;
}
#else /* UNIX or other system */
#  include <sys/time.h> //For gettimeofday time implementation
double time_in_seconds(void) { /* On Linux, this is microsecond-accurate. */
        struct timeval tv;
        gettimeofday(&tv,NULL);
        return tv.tv_usec*1.0e-6+tv.tv_sec*1.0;
}
#endif

As usual, there's one version for Windows, and a different version for everything else. There are TONS of other ways to get some notion of time in your programs:

double seconds=time_in_seconds(); Only exists in NetRun. Return the number of seconds since *something*. This is a double, implemented as above.
long seconds=time(0); Returns the number of seconds since January 1, 1970 (the notional birthday of UNIX!). It's in <time.h>, and exists everywhere. Sadly, this is an integer: marginally OK for calling "srand", but almost useless for timing fast code unless you are *very* patient.
double seconds=clock()*(1.0/CLOCKS_PER_SEC); Also standard, and from <time.h>. CLOCKS_PER_SEC is like a million, but clock() seems to be quantized to milliseconds or worse on my machines. Sadly,on most machines "clock()" returns CPU time, not wall clock time, so you can't time disk or network or anything non-CPU bound. Also look out! On a 32-bit machine, clock() may overflow and wraparound as often as every hour (4000 seconds * 1 million CLOCKS_PER_SEC = 4 billion clocks...)

Use the C++11 std::chrono interface:

#include <chrono>

double get_time() {
	return 1.0e-9*std::chrono::duration_cast<std::chrono::nanoseconds>(
		std::chrono::high_resolution_clock::now().time_since_epoch()
	).count();
}

double foo() {
	double t1=get_time();
	double t2=get_time();
	return t2-t1;
}

(Try this in NetRun now!)

The "rdtsc" instruction returns the number of CPU clock cycles since the machine started up. The cycle count's low bits go into eax, and the high bits go into edx. The only problem with this is "clock throttling", where newer CPUs vary their clock speed (for energy efficiency when on battery power, or for thermal management, etc). This variation shows up in rdtsc, and may vary between cores on a multi-core machine!

The usual way to use a timer like this is to call the timer before your code, call it after your code, and subtract the two times to get the "elapsed" time:

#include <fstream>
int foo(void) {
	double t_before=time_in_seconds();
		std::ofstream f("foo.dat");  /* code to be timed goes here */
	double t_after=time_in_seconds();
	double elapsed=t_after - t_before;
	std::cout<<"That took "<<elapsed<<" seconds\n";
	return 0;
}

(executable NetRun link)

The problems are that:

time_in_seconds returns seconds, but it's only accurate to microseconds (not nanoseconds!)
time_in_seconds takes about 800ns to return (which is considered pretty fast!)
Your code might get interrupted by the OS, other processes, hardware events, etc

Problems 1 and 2 can be cured by running many iterations--"scaling up" many copies of the problem until it registers on your (noisy, quantized) timer. So instead of doing this, which incorrectly claims "x+=y" takes 700ns:

int x=3, y=2;
double t1=time_in_seconds();
x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

You'd do this, which shows the real time, 0.5ns!

int x=3, y=2, n=1000000;
double t1=time_in_seconds();
for (int i=0;i<n;i++) x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)/n*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

(If you get 0.0 nanoseconds, the compiler has converted the loop into x+=n*y. Add "volatile" to the variable declarations to scare it away.)

Problem 3 can be cured by running the above several times, and throwing out the anomalously high values (which were probably interrupted).

Your other main tool for performance analysis is a "profiler". This is a library that keeps track of which function is running, and totals up the number of function calls and time spent in each one. The "Profile" checkbox in NetRun will run a profiler, and show you the top functions it found in your code.

Both timing and profiling have some problems, though:

Timing and profiling both take time themselves, which skews your measurement of the code. This is often amusingly refered to as a Heisenberg effect--stuff behaves differently when you're watching it!
Timing and profiling both fail on very short pieces of code. Sometimes you'll have to repeat the code many times for it to take a measureable length of time.

NetRun has a nice little built-in function called print_time that takes a string and a function. It times the execution of (many iterations of) the function, then divides by the number of iterations to print out exactly how long each call to the function takes. The "Time" checkbox in NetRun just calls print_time on the default foo function.

To use the NetRun timer functionality on your own code, hit "Download this file as a .tar archive".

CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.