Performance Modeling: Amdahl, AMAT, and Alpha-Beta
CS 641 Lecture, Dr. Lawlor
Amdahl's Law
Users care about speed. We want the code to run faster, so define:
speedup = old runtime / new runtime
Higher speedup (lower new runtime) is better.
If we normalize the old runtime to be "1", and define the new runtime as a combination of serial and parallel:
new runtime = serial code + parallel code
We can then define:
N: number of CPUs running in parallel
P: fraction of program that runs in parallel
Then assuming perfect speedup on the parallel code:
new runtime = serial part + parallel part = (1-P) + P/N
speedup = 1 / ((1-P) + P/N) = N/((1-P)*N + P)
This is Amdahl's Law, which tells you several things:
- Even if 80% of the code is parallel (P=0.8), on a quad-core machine
(N=4), you still spend *half* your time waiting on the serial part of the
code!
- If N is large, P had better be very close to 1, or the (1-P) term
will dominate the new runtime. You can't leave very much serial
code at high processor counts.
- While serial code is running, N-1 processors are sitting there waiting, and only one is working.
- Often, N (hardware) is less important than P (software).
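To get a feel for these tradeoffs, here's a minimal C++ sketch (the P and N values are just illustrative) that tabulates the speedup formula:

// amdahl.cpp: tabulate speedup = 1/((1-P) + P/N) for a few P and N
#include <cstdio>

double amdahl_speedup(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void) {
    const double Ps[] = {0.5, 0.8, 0.95, 0.99};
    const int    Ns[] = {2, 4, 16, 256};
    for (double P : Ps) {
        printf("P=%.2f:", P);
        for (int N : Ns)
            printf("  N=%d -> %.2fx", N, amdahl_speedup(P, N));
        printf("\n");
    }
    return 0;
}

For P=0.8 and N=4 this prints 2.50x: four cores only run 2.5 times faster, because half of the new runtime is the serial part.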
Average Memory Access Time (AMAT)
A single-level cache is pretty easy to model mathematically. Each
access is either a hit or a miss, so average memory access time (AMAT) is:
AMAT = time spent in hits + time spent in misses = hit rate * hit time + miss rate * miss time
For example, if a hit takes 0.5ns and happens 90% of the time, and a
miss takes 10ns and happens 10% of the time, on average you spend 0.45ns
in hits and 1.0ns in misses, for a total of 1.45ns average access time.
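As a quick sanity check on that arithmetic, here's a minimal sketch with the same example numbers:

// single-level cache: AMAT = hit rate * hit time + miss rate * miss time
#include <cstdio>

int main(void) {
    double hit_rate = 0.90, hit_time_ns = 0.5;
    double miss_rate = 1.0 - hit_rate, miss_time_ns = 10.0;
    double amat = hit_rate * hit_time_ns + miss_rate * miss_time_ns;
    printf("AMAT = %.2f ns\n", amat);  // prints AMAT = 1.45 ns
    return 0;
}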
Multi-level caches are more complex, since there are actually two ways to define the hit rate at each level:
- "absolute hit rate" is the fraction of all memory accesses that hit in this level of the cache.
- "relative hit rate" is the fraction of memory accesses that missed all previous levels, but hit this one.
For example, if L1, L2, and RAM have absolute hit rates of 95%, 4%, and
1%, the L2 cache's relative hit rate is 80%: of the 5% of all
accesses that miss in L1, the L2 cache handles most (4% out of 5%), and only a
few go out to RAM.
I personally prefer absolute hit rates, because it makes the math easier:
AMAT = time spent in L1 + time spent in L2 + time spent in RAM
AMAT = absolute hit rate L1 * hit time L1 + absolute hit rate L2 * hit time L2 + absolute hit rate RAM * hit time RAM
For example, if L1, L2, and RAM hit rates are 95%, 4%, and 1%; and hit times are 1ns, 10ns, and 100ns,
AMAT = 0.95ns + 0.4ns + 1ns = 2.35ns
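The same sum, as a sketch in code with the example rates and times above:

// multi-level AMAT from absolute hit rates: sum of rate * hit time per level
#include <cstdio>

int main(void) {
    double rate[3]    = {0.95, 0.04, 0.01};   // L1, L2, RAM absolute hit rates
    double time_ns[3] = {1.0, 10.0, 100.0};   // hit time at each level
    double amat = 0.0;
    for (int i = 0; i < 3; i++) amat += rate[i] * time_ns[i];
    printf("AMAT = %.2f ns\n", amat);  // prints AMAT = 2.35 ns
    return 0;
}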
The other way to figure AMAT is using relative hit and miss rates.
AMAT = rel hit rate L1 * hit time L1 + rel miss rate L1 * (rel hit rate L2 * hit time L2 + rel miss rate L2 * (hit time RAM))
For our figures above,
AMAT = 95%*1ns + 5% * (80% * 10ns + 20% * (100ns) ) = 0.95ns + 5% * (8ns + 20ns) = 0.95ns + 1.4ns = 2.35ns
The relative rates do have the advantage of clearly showing you the
"miss penalty": for example, a miss in L1 above costs you 28ns on
average.
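Here's the relative-rate version of the same calculation as a sketch; it gives the same 2.35ns, and pulls out the 28ns L1 miss penalty explicitly:

// multi-level AMAT from relative hit rates
#include <cstdio>

int main(void) {
    double l1_hit = 0.95, l1_time_ns = 1.0;
    double l2_hit = 0.80, l2_time_ns = 10.0;   // relative to L1 misses
    double ram_time_ns = 100.0;
    double l1_miss_penalty = l2_hit * l2_time_ns + (1.0 - l2_hit) * ram_time_ns; // 28 ns
    double amat = l1_hit * l1_time_ns + (1.0 - l1_hit) * l1_miss_penalty;
    printf("L1 miss penalty = %.1f ns, AMAT = %.2f ns\n", l1_miss_penalty, amat);
    return 0;
}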
Alpha-Beta
Most network cards require some fixed amount of time per message (alpha, fairly large), plus some amount of time per byte of the message (beta, typically much smaller):
tnet = a + b * L;
Where
- tnet: network time. The total time, in seconds, spent on the network.
- a: network latency. The time, in seconds, to send a
zero-length message. For TCP running on gigabit Ethernet, this
is something like 50us/message, which is absurd (it's the time from the
CPU to the northbridge to the PCI controller to the network card to the
network to the switch to the other network card to the CPU interrupt
controller through the OS and finally to the application). Fancier, more
expensive networks such as Myrinet or Infiniband have latencies as low as 5us/message
(in 2005; Wikipedia now claims 1.3us/message). Opening a new TCP
connection might take hundreds of milliseconds(!), especially if you
need to resolve the DNS name. Network latency is often written as
alpha or α.
- b: 1/(network bandwidth). The time, in seconds, to send
each byte of the message. For gigabit Ethernet, 1000Mb/s in megabits
is roughly 100MB/s in megabytes, so this is 10ns/byte. For 4x Infiniband, which sends 1000MB/s, this is 1ns/byte. (Network 1/bandwidth is often written as beta or β.)
- L: number of bytes in message.
The bottom line is that shorter messages are always faster. *But*
due to the per-message overhead, they may not be *appreciably*
faster. For example, for Ethernet a zero-length message takes
50us. A 100-byte message takes 51us
(50us/message+100*10ns/byte). So you might as well send 100 bytes
if you're going to bother sending anything!
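A small sketch of tnet = a + b * L with the Ethernet-like figures above (a = 50us/message, b = 10ns/byte; rough numbers, not measurements):

// alpha-beta model: total network time for a message of L bytes
#include <cstdio>

int main(void) {
    double a = 50e-6;   // per-message latency, seconds
    double b = 10e-9;   // per-byte time, seconds (about 100MB/s)
    long lengths[] = {0, 100, 1000, 10000};
    for (long L : lengths) {
        double tnet = a + b * L;
        printf("L = %6ld bytes: tnet = %6.1f us\n", L, tnet * 1e6);
    }
    return 0;
}

The 0-byte and 100-byte messages both cost about 50us; you don't start paying noticeably for bytes until the message gets much longer.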
In general, you'll hit 50% of the network's achievable bandwidth when sending messages of the length where the per-message cost equals the per-byte cost:
a = b * L
or
L = a / b
For Ethernet, this breakeven point is at L = 5000 bytes/message, which
is amazingly long! Shorter messages are "alpha dominated", which
means you're waiting for the network latency (speed of light,
acknowledgements, OS overhead). Longer messages are "beta
dominated", which means you're waiting for the network bandwidth
(signalling rate). Smaller messages don't help much if you're
alpha dominated!
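Here's a sketch of where that breakeven lands, and of the fraction of peak bandwidth you actually achieve at each message length (fraction = b*L / (a + b*L), which is 50% at L = a/b):

// fraction of peak bandwidth achieved = b*L / (a + b*L)
#include <cstdio>

int main(void) {
    double a = 50e-6, b = 10e-9;   // Ethernet-like alpha, beta
    printf("breakeven L = a/b = %.0f bytes\n", a / b);
    long lengths[] = {500, 5000, 50000, 500000};
    for (long L : lengths) {
        double fraction = (b * L) / (a + b * L);
        printf("L = %7ld bytes: %5.1f%% of peak bandwidth\n", L, 100.0 * fraction);
    }
    return 0;
}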
The large per-message cost of most networks means many applications cannot
get good performance with an "obvious" parallel strategy, like sending
individual floats when they're needed. Instead, you have to
figure out beforehand what data needs to be sent, and send everything at once.
The same alpha-beta performance model applies to GPU kernels (alpha is
the kernel startup time, 5 to 10 us; beta is the time per instruction
in the kernel, typically 0.01ns or less), OpenMP threads (alpha is the
time to create the threads, 10us or so; beta is the time per iteration,
typically a few ns), and many other hardware accesses.
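The a/b breakeven gives a rough feel for how much work it takes to amortize each kind of startup cost. Here's a sketch using the ballpark figures above (the OpenMP per-iteration time of 2ns is picked from the "few ns" range; all of these are order-of-magnitude assumptions, not measurements):

// breakeven work = alpha / beta: how many "units" it takes to amortize startup
#include <cstdio>

int main(void) {
    struct { const char *name; double a, b; } cases[] = {
        {"gigabit Ethernet message (bytes)",    50e-6, 10e-9},
        {"GPU kernel launch (instructions)",     5e-6, 0.01e-9},
        {"OpenMP thread startup (iterations)",  10e-6, 2e-9},
    };
    for (auto &c : cases)
        printf("%-36s breakeven ~ %.0f\n", c.name, c.a / c.b);
    return 0;
}

Below those sizes you're paying mostly alpha; well above them, mostly beta.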