Performance Modeling: Amdahl, AMAT, and Alpha-Beta

CS 641 Lecture, Dr. Lawlor

Amdahl's Law

Users care about speed.  We want the code to run faster, so define:
    speedup = old runtime / new runtime
Higher speedup (lower new runtime) is better.

If we normalize the old runtime to be "1", and define the new runtime as a combination of serial and parallel:
   new runtime = serial code + parallel code

We can then define:
    N: number of CPUs running in parallel
    P: fraction of program that runs in parallel

Then assuming perfect speedup on the parallel code:
  new runtime = serial part + parallel part = (1-P) + P/N
  speedup = 1 / ((1-P) + P/N) = N/((1-P)*N + P)

This is Amdahl's Law, which tells you two things:
    1. The serial fraction limits speedup: even with infinitely many CPUs, speedup can never exceed 1/(1-P).  If 10% of the program is serial (P=0.9), you'll never go more than 10x faster, no matter how many CPUs you add.
    2. Adding CPUs gives diminishing returns: each doubling of N helps less, because the leftover serial code takes up a bigger and bigger fraction of the runtime.
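
Here's a minimal C++ sketch of that formula (the function name amdahl_speedup is my own), tabulating the speedup for a program that's 90% parallel:

    #include <cstdio>
    
    // Amdahl's Law: speedup on N CPUs when fraction P of the code is parallel.
    double amdahl_speedup(double P, double N) {
        return 1.0 / ((1.0 - P) + P / N);
    }
    
    int main() {
        double P = 0.9; // 90% of the program runs in parallel
        for (double N = 1; N <= 1024; N *= 4)
            printf("N=%4.0f CPUs -> speedup %.2fx\n", N, amdahl_speedup(P, N));
        // The speedup creeps toward, but never exceeds, 1/(1-P) = 10x.
    }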

Average Memory Access Time (AMAT)

A single-level cache is pretty easy to model mathematically.  Each access is either a hit or a miss, so average memory access time (AMAT) is:
    AMAT = time spent in hits + time spent in misses = hit rate * hit time + miss rate * miss time

For example, if a hit takes 0.5ns and happens 90% of the time, and a miss takes 10ns and happens 10% of the time, on average you spend 0.45ns in hits and 1.0ns in misses, for a total of 1.45ns average access time.
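
The same arithmetic as a tiny C++ sketch, using the example's numbers:

    #include <cstdio>
    
    int main() {
        double hit_rate = 0.90,  hit_time = 0.5;   // ns
        double miss_rate = 0.10, miss_time = 10.0; // ns
        double amat = hit_rate * hit_time + miss_rate * miss_time;
        printf("AMAT = %.2f ns\n", amat); // prints AMAT = 1.45 ns
    }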

Multi-level caches are more complex, since there are actually two ways to define the hit rate at each level:
    Absolute hit rate: the fraction of all memory accesses that get serviced at this level.
    Relative hit rate: the fraction of accesses that reach this level (having missed every level above it) that hit here.
For example, if L1, L2, and RAM have absolute hit rates of 95%, 4%, and 1%, the L2 cache's relative hit rate is 80%: of the 5% of all accesses that miss in L1, the L2 cache handles 4%, and only 1% go out to RAM, so 4%/5% = 80%.

I personally prefer absolute hit rates, because they make the math easier:
    AMAT = time spent in L1 + time spent in L2 + time spent in RAM
    AMAT = absolute hit rate L1 * hit time L1 + absolute hit rate L2 * hit time L2 + absolute hit rate RAM * hit time RAM

For example, if L1, L2, and RAM hit rates are 95%, 4%, and 1%; and hit times are 1ns, 10ns, and 100ns,
    AMAT = 95% * 1ns + 4% * 10ns + 1% * 100ns = 0.95ns + 0.4ns + 1ns = 2.35ns
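
Here's that sum as a C++ sketch, with one array entry per level of the memory hierarchy:

    #include <cstdio>
    
    int main() {
        double abs_rate[3] = {0.95, 0.04, 0.01}; // L1, L2, RAM absolute hit rates
        double hit_ns[3]   = {1.0, 10.0, 100.0}; // L1, L2, RAM hit times
        double amat = 0;
        for (int i = 0; i < 3; i++)
            amat += abs_rate[i] * hit_ns[i]; // time spent at each level
        printf("AMAT = %.2f ns\n", amat);    // prints AMAT = 2.35 ns
    }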

The other way to figure AMAT is using relative hit and miss rates.
    AMAT = rel hit rate L1 * hit time L1 + rel miss rate L1 * (rel hit rate L2 * hit time L2 + rel miss rate L2 * (hit time RAM))
For our figures above,
    AMAT = 95%*1ns + 5% * (80% * 10ns + 20% * (100ns) ) = 0.95ns + 5% * (8ns + 20ns) = 0.95ns + 1.4ns = 2.35ns

The relative rates do have the advantage of clearly showing you the "miss penalty": for example, a miss in L1 above costs you 28ns on average.
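
The relative-rate version gives the same 2.35ns; here's a sketch that nests the miss penalties the same way the formula above does:

    #include <cstdio>
    
    int main() {
        double l1_hit = 0.95, l2_hit = 0.80; // relative hit rates
        double l1_ns = 1.0, l2_ns = 10.0, ram_ns = 100.0;
        // Average cost of an L1 miss: either resolved in L2, or out to RAM.
        double miss_penalty = l2_hit * l2_ns + (1.0 - l2_hit) * ram_ns; // 28 ns
        double amat = l1_hit * l1_ns + (1.0 - l1_hit) * miss_penalty;
        printf("L1 miss penalty = %.0f ns, AMAT = %.2f ns\n", miss_penalty, amat);
    }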

Alpha-Beta

Most network cards require some fixed amount of time per message (alpha, fairly large), plus some amount of time per byte of the message (beta, typically much smaller):
    tnet = a + b * L;

Where
    a: the per-message cost ("alpha", in seconds/message)
    b: the per-byte cost ("beta", in seconds/byte)
    L: the length of the message in bytes
The bottom line is that shorter messages are always faster.  *But* due to the per-message overhead, they may not be *appreciably* faster.  For example, for Ethernet a zero-length message takes 50us.  A 100-byte message takes 51us (50us/message + 100 bytes * 10ns/byte = 50us + 1us).  So you might as well send 100 bytes if you're going to bother sending anything!
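
Plugging the Ethernet numbers above (a = 50us, b = 10ns/byte) into a small C++ sketch:

    #include <cstdio>
    
    // Alpha-beta model: time to send one L-byte message, in microseconds.
    double tnet_us(double L_bytes) {
        double a = 50.0;  // per-message cost, us
        double b = 0.010; // per-byte cost, us (10 ns/byte)
        return a + b * L_bytes;
    }
    
    int main() {
        double sizes[] = {0, 100, 5000, 100000};
        for (double L : sizes)
            printf("%8.0f bytes -> %7.1f us\n", L, tnet_us(L));
        // 0 and 100 bytes cost nearly the same 50us; size only matters later.
    }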

In general, you'll hit 50% of the network's achievable bandwidth when sending messages just long enough that the per-message and per-byte costs are equal:
    a = b * L
or
    L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which is amazingly long!  Shorter messages are "alpha dominated", which means you're waiting for the network latency (speed of light, acknowledgements, OS overhead).  Longer messages are "beta dominated", which means you're waiting for the network bandwidth (signalling rate).  Smaller messages don't help much if you're alpha dominated!
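
To see the alpha/beta split numerically, here's a sketch that computes the fraction of peak bandwidth actually achieved at each message size (peak bandwidth is 1/b bytes per second, and you achieve L/tnet):

    #include <cstdio>
    
    int main() {
        double a = 50000; // per-message cost, ns
        double b = 10;    // per-byte cost, ns/byte
        for (double L = 100; L <= 1e6; L *= 10) {
            double tnet = a + b * L;      // ns to send L bytes
            double frac = (b * L) / tnet; // fraction of peak bandwidth
            printf("%8.0f bytes -> %5.1f%% of peak\n", L, 100.0 * frac);
        }
        // At the breakeven length L = a/b = 5000 bytes, this hits exactly 50%.
    }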

The large per-message cost of most networks means many applications cannot get good performance with an "obvious" parallel strategy, like sending individual floats when they're needed.  Instead, you have to figure out beforehand what data needs to be sent, and send everything at once.

The same alpha-beta performance model applies to GPU kernels (alpha is the kernel startup time, 5 to 10 us; beta is the time per instruction in the kernel, typically 0.01ns or less), OpenMP threads (alpha is the time to create the threads, 10us or so; beta is the time per iteration, typically a few ns), and many other hardware accesses.
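
The same breakeven arithmetic tells you how much work you need before any of these startup costs pays off.  For example, a rough sketch with the OpenMP constants above (alpha = 10us, and my own ballpark beta of 5ns/iteration):

    #include <cstdio>
    
    int main() {
        double alpha = 10000; // thread startup cost, ns (about 10 us)
        double beta = 5;      // cost per loop iteration, ns
        // Below roughly alpha/beta iterations, startup dominates the loop
        // body, so parallelizing the loop can't pay off.
        printf("Breakeven at about %.0f iterations\n", alpha / beta); // 2000
    }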