Performance Modeling: Amdahl, AMAT, and Alpha-Beta
CS 641 Lecture, Dr. Lawlor
Amdahl's Law
Users care about speed. We want the code to run faster, so define:
speedup = old runtime / new runtime
Higher speedup (lower new runtime) is better.
If we normalize the old runtime to be "1", and define the new runtime as a combination of serial and parallel:
new runtime = serial code + parallel code
We can then define:
N: number of CPUs running in parallel
P: fraction of program that runs in parallel
Then assuming perfect speedup on the parallel code:
new runtime = serial part + parallel part = (1-P) + P/N
speedup = 1 / ((1-P) + P/N) = N/((1-P)*N + P)
This is Amdahl's Law, which tells you several things:
- Even if 80% of the code is parallel (P=0.8), on a quad-core machine
(N=4), you still spend *half* your time waiting on the serial part of the
code!
- If N is large, P had better be very close to 1, or the (1-P) term
will dominate the new runtime. You can't leave very much serial
code at high processor counts.
- While serial code is running, N-1 processors are sitting there waiting, and only one is working.
- Often, N (hardware) is less important than P (software).
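To get a feel for these tradeoffs, here's a minimal C++ sketch (the P and N values are just illustrative) that tabulates the speedup formula:

// amdahl.cpp: tabulate speedup = 1/((1-P) + P/N) for a few P and N
#include <cstdio>

double amdahl_speedup(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void) {
    const double Ps[] = {0.5, 0.8, 0.95, 0.99};
    const int    Ns[] = {2, 4, 16, 256};
    for (double P : Ps) {
        printf("P=%.2f:", P);
        for (int N : Ns)
            printf("  N=%d -> %.2fx", N, amdahl_speedup(P, N));
        printf("\n");
    }
    return 0;
}

For P=0.8 and N=4 this prints 2.50x: four cores only run 2.5 times faster, because half of the new runtime is the serial part.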
Average Memory Access Time (AMAT)
A single-level cache is pretty easy to model mathematically. Each
access is either a hit or a miss, so average memory access time (AMAT) is:
AMAT = time spent in hits + time spent in misses = hit rate * hit time + miss rate * miss time
For example, if a hit takes 0.5ns and happens 90% of the time, and a
miss takes 10ns and happens 10% of the time, on average you spend 0.45ns
in hits and 1.0ns in misses, for a total of 1.45ns average access time.
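As a quick sanity check on that arithmetic, here's a minimal sketch with the same example numbers:

// single-level cache: AMAT = hit rate * hit time + miss rate * miss time
#include <cstdio>

int main(void) {
    double hit_rate = 0.90, hit_time_ns = 0.5;
    double miss_rate = 1.0 - hit_rate, miss_time_ns = 10.0;
    double amat = hit_rate * hit_time_ns + miss_rate * miss_time_ns;
    printf("AMAT = %.2f ns\n", amat);  // prints AMAT = 1.45 ns
    return 0;
}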
Multi-level caches are more complex, since there are actually two ways to define the hit rate at each level:
- "absolute hit rate" is the fraction of all memory accesses that hit in this level of the cache.
- "relative hit rate" is the fraction of memory accesses that missed all previous levels, but hit this one.
For example, if L1, L2, and RAM have absolute hit rates of 95%, 4%, and
1%, the L2 cache's relative hit rate is 80%: of the 5% of all
accesses that miss in L1, the L2 cache handles most (4% out of 5%), and only a
few go out to RAM.
I personally prefer absolute hit rates, because it makes the math easier:
AMAT = time spent in L1 + time spent in L2 + time spent in RAM
AMAT = absolute hit rate L1 * hit time L1 + absolute hit rate L2 * hit time L2 + absolute hit rate RAM * hit time RAM
For example, if L1, L2, and RAM hit rates are 95%, 4%, and 1%; and hit times are 1ns, 10ns, and 100ns,
AMAT = 0.95ns + 0.4ns + 1ns = 2.35ns
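The same sum, as a sketch in code with the example rates and times above:

// multi-level AMAT from absolute hit rates: sum of rate * hit time per level
#include <cstdio>

int main(void) {
    double rate[3]    = {0.95, 0.04, 0.01};   // L1, L2, RAM absolute hit rates
    double time_ns[3] = {1.0, 10.0, 100.0};   // hit time at each level
    double amat = 0.0;
    for (int i = 0; i < 3; i++) amat += rate[i] * time_ns[i];
    printf("AMAT = %.2f ns\n", amat);  // prints AMAT = 2.35 ns
    return 0;
}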
The other way to figure AMAT is using relative hit and miss rates.
AMAT = rel hit rate L1 * hit time L1 + rel miss rate L1 * (rel hit rate L2 * hit time L2 + rel miss rate L2 * (hit time RAM))
For our figures above,
AMAT = 95%*1ns + 5% * (80% * 10ns + 20% * (100ns) ) = 0.95ns + 5% * (8ns + 20ns) = 0.95ns + 1.4ns = 2.35ns
The relative rates do have the advantage of clearly showing you the
"miss penalty": for example, a miss in L1 above costs you 28ns on
average.
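Here's the relative-rate version of the same calculation as a sketch; it gives the same 2.35ns, and pulls out the 28ns L1 miss penalty explicitly:

// multi-level AMAT from relative hit rates
#include <cstdio>

int main(void) {
    double l1_hit = 0.95, l1_time_ns = 1.0;
    double l2_hit = 0.80, l2_time_ns = 10.0;   // relative to L1 misses
    double ram_time_ns = 100.0;
    double l1_miss_penalty = l2_hit * l2_time_ns + (1.0 - l2_hit) * ram_time_ns; // 28 ns
    double amat = l1_hit * l1_time_ns + (1.0 - l1_hit) * l1_miss_penalty;
    printf("L1 miss penalty = %.1f ns, AMAT = %.2f ns\n", l1_miss_penalty, amat);
    return 0;
}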
Alpha-Beta
Most network cards require some fixed amount of time per message (alpha, fairly large), plus some amount of time per byte of the message (beta, typically much smaller):
tnet = a + b * L;
Where
- tnet: network time. The total time, in seconds, spent on the network.
- a: network latency. The time, in seconds, to send a
zero-length message. For TCP running on gigabit Ethernet, this
is something like 50us/message, which is absurd (it's the time from the
CPU to the northbridge to the PCI controller to the network card to the
network to the switch to the other network card to the CPU interrupt
controller through the OS and finally to the application). Fancier, more
expensive networks such as Myrinet or Infiniband have latencies as low as 5us/message
(in 2005; Wikipedia now claims 1.3us/message). Opening a new TCP
connection might take hundreds of milliseconds(!), especially if you
need to resolve the DNS name. Network latency is often written as
alpha or α.
- b: 1/(network bandwidth). The time, in seconds, to send
each byte of the message. For gigabit Ethernet, 1000Mb/s in megabits
is roughly 100MB/s in megabytes, so this is 10ns/byte. For 4x Infiniband, which sends 1000MB/s, this is 1ns/byte. (Network 1/bandwidth is often written as beta or β.)
- L: number of bytes in message.
The bottom line is that shorter messages are always faster. *But*
due to the per-message overhead, they may not be *appreciably*
faster. For example, for Ethernet a zero-length message takes
50us. A 100-byte message takes 51us
(50us/message+100*10ns/byte). So you might as well send 100 bytes
if you're going to bother sending anything!
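A small sketch of tnet = a + b * L with the Ethernet-like figures above (a = 50us/message, b = 10ns/byte; rough numbers, not measurements):

// alpha-beta model: total network time for a message of L bytes
#include <cstdio>

int main(void) {
    double a = 50e-6;   // per-message latency, seconds
    double b = 10e-9;   // per-byte time, seconds (about 100MB/s)
    long lengths[] = {0, 100, 1000, 10000};
    for (long L : lengths) {
        double tnet = a + b * L;
        printf("L = %6ld bytes: tnet = %6.1f us\n", L, tnet * 1e6);
    }
    return 0;
}

The 0-byte and 100-byte messages both cost about 50us; you don't start paying noticeably for bytes until the message gets much longer.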
In general, you'll hit 50% of the network's achievable bandwidth when sending messages of the length where the per-message cost equals the per-byte cost:
a = b * L
or
L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which
is amazingly long! Shorter messages are "alpha dominated", which
means you're waiting for the network latency (speed of light,
acknowledgements, OS overhead). Longer messages are "beta
dominated", which means you're waiting for the network bandwidth
(signalling rate). Smaller messages don't help much if you're
alpha dominated!
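Here's a sketch of where that breakeven lands, and of the fraction of peak bandwidth you actually achieve at each message length (fraction = b*L / (a + b*L), which is 50% at L = a/b):

// fraction of peak bandwidth achieved = b*L / (a + b*L)
#include <cstdio>

int main(void) {
    double a = 50e-6, b = 10e-9;   // Ethernet-like alpha, beta
    printf("breakeven L = a/b = %.0f bytes\n", a / b);
    long lengths[] = {500, 5000, 50000, 500000};
    for (long L : lengths) {
        double fraction = (b * L) / (a + b * L);
        printf("L = %7ld bytes: %5.1f%% of peak bandwidth\n", L, 100.0 * fraction);
    }
    return 0;
}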
The large per-message cost of most networks means many applications cannot
get good performance with an "obvious" parallel strategy, like sending
individual floats when they're needed. Instead, you have to
figure out beforehand what data needs to be sent, and send everything at once.
The same alpha-beta performance model applies to GPU kernels (alpha is
the kernel startup time, 5 to 10 us; beta is the time per instruction
in the kernel, typically 0.01ns or less), OpenMP threads (alpha is the
time to create the threads, 10us or so; beta is the time per iteration,
typically a few ns), and many other hardware accesses.
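The a/b breakeven gives a rough feel for how much work it takes to amortize each kind of startup cost. Here's a sketch using the ballpark figures above (the OpenMP per-iteration time of 2ns is picked from the "few ns" range; all of these are order-of-magnitude assumptions, not measurements):

// breakeven work = alpha / beta: how many "units" it takes to amortize startup
#include <cstdio>

int main(void) {
    struct { const char *name; double a, b; } cases[] = {
        {"gigabit Ethernet message (bytes)",    50e-6, 10e-9},
        {"GPU kernel launch (instructions)",     5e-6, 0.01e-9},
        {"OpenMP thread startup (iterations)",  10e-6, 2e-9},
    };
    for (auto &c : cases)
        printf("%-36s breakeven ~ %.0f\n", c.name, c.a / c.b);
    return 0;
}

Below those sizes you're paying mostly alpha; well above them, mostly beta.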