Probability & Statistics
CS 463
Lecture, Dr. Lawlor
One difficult aspect of deciphering encrypted data is recognizing
that you've actually got plaintext. For example, here are some
decryption results from my attempts to solve hw1_0:
key=0x13739 results in:
* UR]]UN KR[M ]XUM VN J UXLJU /JR[KJWT\ \]XUNW VRUURWP VJLQRWN `X^UM QJ_N KNNW NW]R[NUb \Q[NMMNM Kb JW RVYX[]NM ,[X]XKJU]R\UJ_XWRJW JW\`N[RWP VJLQRWN(
key=0x13795 results in:
A little bird told me a local Fairbanks stolen milling machine would have been entirely shredded by an imported Crotobaltislavonian answering machine?!
key=0x13799 results in:
B!mjuumf!cjse!upme!nf!b!mpdbm!Gbjscbolt!tupmfo!njmmjoh!nbdijof!xpvme!ibwf!cffo!foujsfmz!tisfeefe!cz!bo!jnqpsufe!Dspupcbmujtmbwpojbo!botxfsjoh!nbdijof@"
key=0x1379d results in:
C"nkvvng"dktf"vqnf"og"c"nqecn"Hcktdcpmu"uvqngp"oknnkpi"ocejkpg"yqwnf"jcxg"dggp"gpvktgn{"ujtgffgf"d{"cp"korqtvgf"Etqvqdcnvkuncxqpkcp"cpuygtkpi"ocejkpgA#
key=0x137a1 results in:
D#olwwoh#elug#wrog#ph#d#orfdo#Idluedqnv#vwrohq#ploolqj#pdfklqh#zrxog#kdyh#ehhq#hqwluho|#vkuhgghg#e|#dq#lpsruwhg#Furwredowlvodyrqldq#dqvzhulqj#pdfklqhB$
key=0x137a5 results in:
E$pmxxpi$fmvh$xsph$qi$e$psgep$Jemvferow$wxspir$qmppmrk$qeglmri${syph$lezi$fiir$irxmvip}$wlvihhih$f}$er$mqtsvxih$Gvsxsfepxmwpezsrmer$erw{ivmrk$qeglmriC%
Clearly, we want the program to automatically find the key=0x13795
result. There are several ways to do this:
- Permitted values. Decide which byte values could
be in the plaintext, and throw out decrypts that include any
other byte values. This is by far the fastest and simplest
approach, and above I'm actually filtering out the many
decryption keys that result in some binary output data (above
127, or non-whitespace below 32). This approach is very
fast, especially since you can throw out a possible key as soon
as it spits out bad data, which usually happens very early in
the decryption process. The downside? If we assume
the weird characters like ] ! " # $ never occur, we will
throw out any decrypt where the do occur. This can even be
gamed, by mixing blocks of binary garbage into the actual
plaintext.
- Permitted sets. This works as above, but we
reduce the chance of false negatives by checking several
bytes. For example, "!m" rarely occurs in English, but "!
" or "!!" are possible. This is just a generalization of
the above across multiple bytes, and suffers from the same
flaws. In particular, now we'll be thrown off by things
like proper names or foreign words.
- Likely values or sets. Since a binary choice is
error-prone and easy to subvert, we can instead compute the probability
that a message is actually English given, that it contains this
value or set of values. We can then save decrypts that
have a high enough probability, or rank decrypts by their
probability. This is essentially a generalization of the
histogram approach.
This last approach requires some statistical knowledge. For
example, something like 12%
of the letters in English words are 'e', so:
P(this letter is 'e' | message
is English) = 0.12
(The bar | means "given that", making this a conditional
probability.)
This is unfortunately not the probability we want. For
example, it tells us that the message "Hello" has less than a (12%)5
probability, or 0.003%, from the space of all possible 5-letter
English messages. Instead, we actually want the conditions
facing the other way:
P(message is English | this
letter is 'e') = ?
We know
conditional
probability says:
P(A and B) = P(A|B)P(B) = P(B|A)P(A)
So solving for the conditional probability:
P(A|B) = P(B|A) * P(A) / P(B)
Substituting A="message is English" and B="this letter is 'e'"
gives:
We can compute this from:
- P('e' | English) comes straight from the letter frequency
tables, here 12%. For binary data with uppercase,
lowercase, spaces, and punctuation, this is a smaller value.
- P('e') is 1/26, assuming all letters are equally likely in the
ciphertext. For binary data, this would be 1/256.
- P(English) is 1/(size of key space), assuming you only get
English with the right key.
Rearranging, note that this is:
P(English | 'e') = P(English) * P('e' | English)
/ P('e')
or
Chance of English *= chance of this letter in
English / chance of letter overall
This of course can and should be repeated with pairs of letters,
triplets, etc. You're only limited by the quantity of
plaintext and the care to which you'd like to do your statistical
frequency analysis.