Fast Multiplication/Exponentiation Algorithm
CS 463
Lecture, Dr. Lawlor
There are quite a few different ways to do multiplication if the
hardware doesn't support what you want. For example, you might
want finite field multiplication (which has a different "add"
operation), big integer arithmetic (calling an "add big integers"
function), or modular arithmetic (reducing by the modulus each
time).
The simplest approach is just to loop over one of the operands and
add:
typedef unsigned long var; /* assuming a 64-bit unsigned type, as in the NetRun examples */

var multiply_slow(var A,var B) {
	var product=0;
	for (var i=0;i<B;i++) product+=A; /* B repeated additions */
	return product;
}
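Here's the same loop with a modular "add", a minimal sketch of the swap-in operations mentioned above (the name multiply_slow_mod and the modulus parameter are my additions):

var multiply_slow_mod(var A,var B,var modulus) {
	var product=0;
	for (var i=0;i<B;i++)
		product=(product+A)%modulus; /* the "add" now reduces by the modulus each time */
	return product;
}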
Either way, this has terrible performance, since it's O(B). You can do
much better with the famous "fast multiplication algorithm", also known as the
"Russian Peasant Algorithm" (it was evidently used by Russian peasants, who
doubled and halved instead of shifting).
var multiply_fast_IF(var A,var B) {
	var product=0;
	for (unsigned int bit=0; bit<8*sizeof(var) && B>=(1ul<<bit); bit++) /* guard avoids shifting past the top bit */
		if (B&(1ul<<bit)) /* note 1ul: a plain int 1<<bit overflows past bit 31 */
			product+=A<<bit; /* add in this power-of-two multiple of A */
	return product;
}
This is now O(log B), exponentially better than before. If you
really care about performance, you can swap A and B when A is
smaller, so the loop runs over the bits of the smaller operand.
Replacing the "add" with a multiply (and the doubling with squaring)
gives the fast exponentiation-by-squaring trick, which turns out to be
critical to getting good performance from RSA.
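Here's a minimal sketch of that exponentiation-by-squaring idea, with the modular reduction RSA needs (the name modpow is mine; real RSA uses big integers, since these 64-bit products would overflow at real key sizes):

var modpow(var base,var exponent,var modulus) {
	var result=1;
	base%=modulus;
	for (;exponent!=0;exponent>>=1) {
		if (exponent&1) /* this bit of the exponent is set: */
			result=(result*base)%modulus; /* include this repeatedly-squared power */
		base=(base*base)%modulus; /* squaring plays the role of A<<bit above */
	}
	return result;
}

This computes something like m^65537 mod n in a couple dozen multiplies instead of tens of thousands.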
On most modern computers, the "if" in multiply_fast_IF gives fairly poor
performance (the branch prediction is... unpredictable). We
can do about 30% better using a bit mask to zero out the
A<<bit part. Here, we use the old "shift up to the sign
bit, then shift down across all bits" trick to splat the mask from a
single bit across all the bits.
var multiply_fast_AND(var A,var B) {
	var product=0;
	var Bbit=B; // == B>>bit
	for (unsigned int bit=0; Bbit!=0; bit++)
	{
		long mask=(long)(Bbit<<(8*sizeof(var)-1)); // move low bit up to the high (sign) bit
		mask=mask>>(8*sizeof(var)-1); // signed shift: splat across all bits
		product+=(A<<bit)&mask; // mask is all-ones or all-zeros: optionally include in product
		Bbit=Bbit>>1; // move to next bit of B
	}
	return product;
}
Any reasonably big modern CPU has a hardware multiply instruction,
of course. It's typically very fast, only a few clock cycles of
latency, and the same speed regardless of the values being
multiplied.
var multiply_hardware(var A,var B) {
	return A*B;
}
(Try this in NetRun now!)
Here are the timings, all in nanoseconds per call, on a 64-bit Sandy Bridge desktop:

Value      slow       fast (if)  fast (and)  hardware
    1          2.36      3.54       2.36       1.77
   10         10.32      8.13       5.51       1.77
  100        168.24     13.26       9.15       1.77
 1000       1906.62     19.61      12.82       1.77
10000      19272.32     25.16      17.68       1.77
Note how the slow multiply grows linearly with the value (bad, and
getting worse), the fast algorithms grow only logarithmically, and the
hardware multiply is constant time regardless of the value.
In the official description of AES, we do Galois field
multiplication instead of normal multiplication in two places:
filling the S-box, and the matrix multiply in MixColumns. So
which multiplication algorithm does a typical high-performance
implementation, like PolarSSL's AES, use?
None! The S-box's affine transformation is just baked into the
S-box table's values, and the multiply portion of MixColumns is baked
into separate per-row copies of the substitution table, which are then
Galois added (XORed) together at runtime. The
"aes_gen_tables" function (used in the !POLARSSL_AES_ROM_TABLES case)
barely does multiplication either; instead it computes tables of
exponentials and logarithms, then computes exp[log[v1]+log[v2]].
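Here's a minimal sketch of that exp/log trick, my own reconstruction rather than PolarSSL's actual code. In GF(2^8) every nonzero value is a power of the generator 3, so a multiply becomes an addition of logarithms:

/* GF(2^8) multiply via log/exp tables, using the AES polynomial 0x11b. */
unsigned char exp_table[512]; /* doubled length: skips a %255 on the summed logs */
unsigned char log_table[256]; /* log_table[0] is never valid: 0 has no logarithm */

void gen_tables(void) {
	int x=1; /* walk through the powers of the generator 3 */
	for (int i=0;i<255;i++) {
		exp_table[i]=(unsigned char)x;
		log_table[x]=(unsigned char)i;
		int x2=x<<1; if (x2&0x100) x2^=0x11b; /* x*2, reduced by the AES polynomial */
		x=x2^x; /* x*3 == x*2 + x, where the Galois "add" is XOR */
	}
	for (int i=255;i<512;i++) exp_table[i]=exp_table[i-255]; /* wrap the cycle */
}

unsigned char gf_multiply(unsigned char a,unsigned char b) {
	if (a==0 || b==0) return 0; /* zero has no log */
	return exp_table[log_table[a]+log_table[b]];
}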
There are definite tradeoffs involved in using tables instead of
doing arithmetic:
- A table is often bigger, weirder, and harder to verify than
code.
- A table can be much faster than code, mostly because you can
precombine an arbitrary number of steps, all baked into the same
table. Table lookup has been getting more expensive in
recent years as memory gets relatively slower and farther away
from the processor, though caches hide this.
- Yet caches can introduce security leakage of their own; see
this cache timing attack on AES. One mitigation is to compute the
Galois multiply branchlessly with no tables at all, as sketched below.
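Tying the two halves of this lecture together, here's one such branchless sketch (my own, not from any library): the mask-splat trick from multiply_fast_AND, applied to a constant-time GF(2^8) multiply with no table lookups to leak timing:

/* Constant-time GF(2^8) multiply: shift-and-XOR with branchless masks. */
unsigned char gf_multiply_ct(unsigned char A,unsigned char B) {
	unsigned int a=A, product=0;
	for (int bit=0; bit<8; bit++) {
		unsigned int use=-(unsigned int)((B>>bit)&1); /* all-ones if this bit of B is set */
		product^=a&use; /* Galois "add" is XOR */
		unsigned int carry=-(unsigned int)((a>>7)&1); /* will doubling overflow 8 bits? */
		a=((a<<1)^(0x11b&carry))&0xff; /* a*2, reduced by the AES polynomial */
	}
	return (unsigned char)product;
}

Every iteration does exactly the same work regardless of the data, so there's nothing for a cache or branch-predictor side channel to observe.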