MATH 368, Summer 2020 Dr. Andreas Alpers Project 2
Submit your solutions to andreas.alpers@liverpool.ac.uk no later than 8 May 2020. Use the subject line ‘MATH368 P3’ in your email.
Student IDs:
The general task
Last weekend, when I walked through Liverpool I found a letter which seems to contain a secret message. Assuming that the text is written in English and that a simple substitution cipher was used, I want you to decrypt this text for me using Markov Chain Monte Carlo.
QGWCQEOYXVOTOYQSYEMYVOWBOEYQXYQSYOWSDYQPYDMAYXTDYEMYVOUUYIOUMZYASYWIMBOYASYM
EUDYSLDYQGWCQEOYWUUYXVOYNOMNUOYUQBQECYPMTYXMRWDYQGWCQEOYXVOTOYQSYEMYJMAEXTQO
SYQXYQSYEMXYVWTRYXMYRMYEMXVQECYXMYLQUUYMTYRQOYPMTYWERYEMYTOUQCQMEYXMMYQGWCQE
OYWUUYXVOYNOMNUOYUQBQECYUQPOYQEYNOWJOYDMAYGWDYSWDYQYWGYWYRTOWGOTYIAXYQYWGYEM
XYXVOYMEUDYMEOYQYVMNOYSMGORWDYDMAYZQUUYKMQEYASYWERYXVOYZMTURYZQUUYIOYWSYMEO
The simple substitution cipher is a cipher that has been in use for many hundreds of years. It consists of substituting every plaintext character for a different ciphertext character. In our case, we can assume that the plain and ciphertext contain only 27 characters, namely the upper case letters A,B,C,. . . ,Z and ‘ ’ (space); there is no linebreak, punctuation, or anything else. Interpret the above message as a string of length 379 (there appears no ‘ ’ in the ciphertext, but it could well be that one of the characters stands for ‘ ’). In other words, we are looking for a permutation σ ofthealphabetalph:={A,B,C,…,Z, }suchthatσ(ciphertext)=plaintext.
The steps
Let’s break this project down into several steps.
S1: Download a large English text from Project Gutenberg (http://www.gutenberg.org) and obtain from it the following:
(a) freq(c), the relative frequencies of the characters c ∈ alph in English language (i.e., set freq(c) to be the number of occurrences of c in your English text and normalize this such that c∈alph freq(c) = 1).
(b) B(x,y), a matrix that records the number of character transitions from x ∈ alph to y ∈ alph in your English text (in other words, B(x,y) holds the value how often x is followed by y in the English text). By dividing each row B(x, ·) by y∈alph B(x, y) obtain the matrix TransB.
S2: For numerical stability we want to work with the logarithm version of the Metropolis- Hastings algorithm discussed in class. Therefore, compute logfreq(c) := log(freq(c)) and logTransB(x,y) := log(TransB(x,y)) for all c,x,y ∈ alph (with log denoting the natural logarithm).
Page 1 of 2
S3: To any string f of n characters from alph we can associate a plausibility
n−1
P (f ) := logfreq(f (1))+ logTransB(f (i), f (i + 1)),
i=1
with f(i) denoting the ith character in f. Implement in matlab the Metropolis-Hastings algorithm that starts with the secret message f and proposes in each iteration a new f∗, where from the old iterate f all characters c1 are replaced by c2 and, vice versa, all characters c2 are replaced by c1 (the c1,c2 ∈ alph are randomly chosen). The acceptance probability for accepting f∗ should be exp(min{0,P(f∗)−P(f)}) since we work here with logarithms.
S4: Run the Metropolis-Hastings algorithm several times (each run consisting of at least 106 Metropolis-Hastings iterations, i.e., new proposals of f∗). Return the most plausible f (the f with largest P(f) value) obtained over all iterations and runs of the algorithm. If your code contains no bug, you should obtain the decrypted message (or a good approximation to it).
What you should submit
For a successful completion of the project, submit (only) answers to the following Q1-Q4 and attach the matlab code requested in Q5.
Q1: What is the best decryption of the secret message obtained by your MCMC method and what P(f) value does it have? Can you guess the author of the secret message?
Q2: What are the five most frequent characters in War and Peace? (Give them in decreas- ing order.)
Q3: What are the five most frequent character transitions in War and Peace? (Give them in decreasing order.)
Q4: There are several ways how the author of the secret message could have prevented that the particular MCMC approach from this project successfully decrypts the text. Name one and explain why it would have failed.
Q5: Submit your complete, commented matlab code for decrypting the text (including your MCMC implementation and code for analyzing English language texts).
Hints for debugging (if you struggle hard to get the correct solution)
1. The English texts that definitely work are, e.g., L. Tolstoi’s War and Peace and H. Melville’s The Whale. You might want to analyze only the main body of the text (exclude the table of content and the technical comments inserted by Project Gutenberg). Convert all letters to upper case.
2. To improve convergence, it can help to replace −∞ values in logTransB by a small value, say, −12.
3. It may help to find a bug in your code if you try to decrypt a message that you encrypted yourself (hence, where you know the solution).
4. My, not very optimized implementation, needs about 14 seconds (on a standard laptop) to perform a run of the algorithm consisting of 106 iterations.
Page 2 of 2