In this assignment, we will use ideas from probability and information theory to carry out statistical analyses.
The assignment should be handed in as a report together with copies of all the code you used, including the code for generating the figures; make sure the code is commented with the appropriate question number.
A Jupyter notebook is the recommended way to carry out the assignment. Use
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
at the start of your notebook to import the modules needed and to have figures appearing in the notebook rather than in a separate window. Remember that you should press Shift and Return to evaluate a cell, and you can use Markdown cells to type text to answer the questions in the assignment (followed by Shift and Return). There are some useful keyboard shortcuts.
To print from Jupyter, you can create a Print Preview and then use your web browser to print this preview. Don’t worry if extra space is added to the printed version.
Remember that there are numerous examples of Python on the web if there is a coding task that you find difficult. Learning how to find the examples most useful for you is an important skill to develop.
You can work in pairs, but not in groups larger than two. Within a pair, you can have the same computer code, but you must have different write-ups. Please mark on your script the examination number of the person with whom you are collaborating. If more than two submitted assignments have sections of the same computer code, we will have to give zero marks to those assignments.
Marks for each section are given in bold.
Remember you must type plt.show() for the figures you have plotted to appear.
The data needed
All the data you need for the assignment should first be downloaded.
Probability
The probability of a particular outcome of an event can be estimated by the ratio of the number of times the outcome occurs to the total number of events.
The two files dna1.npy and dna2.npy contain DNA sequences. Load the files using, for example,
dna1 = str(np.load('dna1.npy'))
A .npy file always contains an array, and we use str here to convert that array into a string, which is a more natural object for a DNA sequence.
Question 1.1: We would like to estimate the probability of each nucleotide in each DNA sequence.
Write some code that uses a for loop to go systematically through a sequence, extracts a segment (a slice in Python jargon) of the sequence (here the segment is just one letter, but we will consider longer segments later), and then checks whether, say, “A” is in the segment. By recording the number of successful checks, you can count the total number of “A” nucleotides and so estimate the probability of a nucleotide in the sequence being an “A”. A sketch of one possible approach is given after this question.
Check your result using the count method available for all strings.
Repeat this estimation of the probability for the other nucleotides and for both DNA sequences. Print the results. [5]
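A minimal sketch of one possible approach, assuming dna1 has already been loaded as a string as above (the variable names are only illustrative):

count_A = 0
for i in range(len(dna1)):
    segment = dna1[i:i+1]              # a slice of length one
    if "A" in segment:
        count_A += 1
prob_A = count_A / len(dna1)
print("P(A) =", prob_A)
print("check:", dna1.count("A") / len(dna1))   # check with the count method

The same loop can be reused for “C”, “G”, and “T”, and for dna2.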
Question 1.2: One of the sequences is from the yeast genome and one is synthetic, generated assuming each letter in the sequence is independent of the previous letters. Adapt your code to extract segments of four letters and calculate the probability of finding “TTTT” in each sequence. Note that “TTTTT” contains two counts of “TTTT”. Refreshing your memory of the meaning of independence in probability theory, and considering the probability of occurrence of “TTTT” in each sequence, decide which sequence is real and which is synthetic. Explain your reasoning. [6]
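A sketch of one possible adaptation, again assuming dna1 is the loaded string; the window is moved along one letter at a time so that overlapping occurrences are counted (for example, “TTTTT” contributes two counts of “TTTT”):

target = "TTTT"
count = 0
n_segments = len(dna1) - len(target) + 1       # number of four-letter segments
for i in range(n_segments):
    segment = dna1[i:i+4]
    if segment == target:
        count += 1
prob_TTTT = count / n_segments
print("P(TTTT) =", prob_TTTT)

Note that the count method is not a suitable check here, because it does not count overlapping occurrences.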
Entropy
The entropy of a random variable is a measure of the uncertainty of the random variable. The entropy tells us, in some sense, the “width” of the probability distribution of the random variable. For a random variable, X, with probability distribution P(X), the entropy is defined as
H(X) = -\sum_{x} P(X=x) \log_2 P(X=x)
where the logarithm is to the base 2 and the entropy is then said to be measured in bits. The sum is over all values of X. For example, if X can take four different values, such as “A”, “C”, “G”, and “T”, then x runs over “A”, “C”, “G”, and “T”, and there are 4 terms in the sum, involving P(X=A), P(X=C), P(X=G), and P(X=T). The entropy does not depend on the actual values taken by X but only on their probabilities.
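As a quick check of the definition, a uniform distribution over the four nucleotides, with each probability equal to 1/4, gives

H(X) = -4 \times \tfrac{1}{4} \log_2\!\left(\tfrac{1}{4}\right) = 2 \text{ bits},

which is the largest entropy possible for a random variable with four outcomes, whereas a distribution that puts all of its probability on a single outcome has zero entropy.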
Taking the log2 of 0 will cause errors in Python, but x log2(x) → 0 as x → 0. In your code, check for zeros in the probability distribution and do not include these elements in calculations of entropies.
Question 2.1: Write a script or a function in Python to calculate the entropy of a random variable given its probability distribution. [2]
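A minimal sketch of such a function, assuming the probability distribution is passed as a numpy array or a list (the function name is only illustrative):

def entropy(p):
    # Entropy in bits of the probability distribution p.
    p = np.asarray(p)
    p = p[p > 0]                       # exclude zeros: x*log2(x) -> 0 as x -> 0
    return -np.sum(p * np.log2(p))

For example, entropy([0.25, 0.25, 0.25, 0.25]) should return 2.0.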
Question 2.2: In the file annualprobs.npy, six probability distributions are defined as an array of arrays. Load these distributions using
d = np.load('annualprobs.npy')
and type
d.shape
to see how the data is structured.
Each distribution represents the probability of an event occurring for each month of the year (and so has 12 numbers). The first element of the array, d[0], gives the first probability distribution of the event occurring, the second element of the array, d[1] gives the second probability distribution of the event, and so on.
Use plt.bar to plot these distributions, label the x-axis with the first letters of the months of the year, and title each figure with the number of the distribution and its entropy in bits; a sketch of one possible approach is given after this question. [2]
From what you have learned from your results, which type of probability distributions would you expect to have the highest entropy and which distributions to have the lowest entropy and why? [5]
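A sketch of one possible approach to the plotting, assuming d has been loaded as above and an entropy function like the one from Question 2.1 is available:

months = ["J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"]
for i in range(len(d)):
    plt.figure()
    plt.bar(range(12), d[i])
    plt.xticks(range(12), months)      # label the x-axis with month initials
    plt.title("Distribution " + str(i) + ": entropy = "
              + str(round(entropy(d[i]), 2)) + " bits")
    plt.show()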
Measuring the dependence between two random variables
We can use entropies to measure the degree of statistical dependence between two random variables. Consider two random variables X and Y. The conditional entropy H(Y|X) measures the uncertainty in Y given that we know X (the | symbol is read as “given that”). If knowing X tells us nothing about the value of Y, then H(Y|X)=H(Y): the uncertainty in Y is the same as if we did not know X. If knowing X tells us everything about Y (for example, if Y=2X, then knowing X allows us to predict Y with certainty), then H(Y|X)=0. Consequently, a measure of the degree of dependence between X and Y is:
H(Y)−H(Y|X)
which equals zero if knowing X tells us nothing about Y and equals H(Y) if knowing X means that we know Y.
This quantity is often referred to as the mutual information between X and Y: it measures how much the uncertainty in Y has decreased if we know X. To quantify the degree of dependence, it is useful to have a measure that has a minimum of zero and a maximum of one, and we therefore often use
\frac{H(Y) - H(Y|X)}{H(Y)}
as a normalized mutual information.
Four datasets are given as an array of arrays in XandY.npy. Each dataset consists of measurements of an X variable and simultaneous measurements of a Y variable. The first element of the array, d[0] if you use d = np.load('XandY.npy'), gives the measured values of X and the second element, d[1], gives the corresponding measured values of Y for the first dataset; the third, d[2], and fourth, d[3], elements give the values of X and Y for the second dataset, and so on.
Question 3.1: Write a general function to calculate the correlation coefficient between two arrays X and Y. Remember that the correlation coefficient is defined as
\frac{\sum_i (x_i - E[x])(y_i - E[y])}{\left[\sum_i (x_i - E[x])^2 \, \sum_i (y_i - E[y])^2\right]^{1/2}}
where E[x] is the mean of X and E[y] is the mean of Y. Here x_i is the i’th element of the array X, and i runs from 0 up to one less than the total number of elements in X. The sum therefore runs over each element in X and the corresponding element in Y. [2]
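A minimal sketch of such a function (the name is only illustrative); np.corrcoef(x, y)[0, 1] can be used as a check:

def correlation(x, y):
    # Correlation coefficient between the arrays x and y.
    dx = x - np.mean(x)                # deviations from the mean
    dy = y - np.mean(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))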
Question 3.2: Using np.histogram2d and plt.imshow, plot a two-dimensional histogram of one of the datasets with a 10×10 grid to bin the data. You should use np.rot90 to rotate the output of np.histogram2d so that the data is displayed appropriately (compare with a scatter plot). Setting aspect='auto' in plt.imshow lets the image fill the axes rather than forcing square pixels. [3]
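A sketch of one way to produce the plot, assuming d has been loaded from XandY.npy as described above, so that d[0] and d[1] hold the first dataset:

x, y = d[0], d[1]                      # first dataset
nxy, xedges, yedges = np.histogram2d(x, y, (10, 10))
plt.imshow(np.rot90(nxy), aspect='auto',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar()
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
plt.scatter(x, y)                      # compare with a scatter plot
plt.show()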
Question 3.3: Write a function to calculate the mutual information between two random variables X and Y normalized by the entropy of Y. You will need to calculate H(X), H(Y) and H(Y|X). To do so you will need to estimate P(X,Y), and then find P(X) and P(Y).
The conditional entropy can be shown to be
H(Y|X) = \sum_{x,y} P(X=x, Y=y) \log_2\!\left[\frac{P(X=x)}{P(X=x, Y=y)}\right]
where the sum runs over all values x of X and all values y of Y.
We must therefore estimate P(X=x,Y=y) if we wish to calculate the mutual information, and we will use np.histogram2d to do so. The np.histogram2d function bins data into a grid of bins. The height of the histogram at a particular grid point gives the number of counts in that bin and is approximately proportional to the probability that X and Y have the values associated with that bin. If a new data point were measured, it would be more likely to have the X and Y values of bins with high numbers of counts than the X and Y values of bins with low numbers of counts. By normalizing the histogram so that each bin represents the proportion of events that occurred in that bin, we can interpret the normalized value of a particular point on the grid as the probability that, when we make a new measurement, X and Y will have values lying in the bin associated with that point.
Use part of the output of np.histogram2d to estimate P(X=xi,Y=yj), which here is the probability that X lies in the i’th bin on the x-axis and Y lies in the j’th bin on the y-axis.
Given your estimate of P(X=xi,Y=yj), you can find P(X) (and P(Y) similarly). By definition
P(X=x_i) = \sum_j P(X=x_i, Y=y_j),
i.e. the probability of X lying in the i’th bin is equal to the probability that X lies in the i’th bin and Y lies in the zeroth bin, plus the probability that X lies in the i’th bin and Y lies in the first bin, plus the probability that X lies in the i’th bin and Y lies in the second bin, and so on. Note, however, that
nxy, xedges, yedges = np.histogram2d(x, y, (n, m))
divides the x-axis into n bins and the y-axis into m bins, but nxy is returned as an n×m matrix, so that x changes down the columns and y changes across the rows. The axis argument (the second argument) of np.sum allows addition over either the columns or the rows of a matrix.
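As a brief illustration of these sums, assuming nxy has been normalized into a joint probability array pxy (the variable names are only illustrative):

pxy = nxy / np.sum(nxy)                # joint probability P(X=x_i, Y=y_j)
px = np.sum(pxy, axis=1)               # sum over the y bins (axis=1) to give P(X=x_i)
py = np.sum(pxy, axis=0)               # sum over the x bins (axis=0) to give P(Y=y_j)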
Finally, remember to check that you are not including any zero probabilities when calculating entropies.
For the first data set show that the correlation coefficient is approximately 0.52 and that the normalized mutual information is approximately 0.09. [10]
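Putting the steps above together, a sketch of one possible implementation of the normalized mutual information, reusing an entropy function like the one from Question 2.1 (the function name and the default 10×10 grid are only illustrative choices):

def normalized_mutual_information(x, y, bins=10):
    # Estimate [H(Y) - H(Y|X)] / H(Y) by binning the data.
    nxy, xedges, yedges = np.histogram2d(x, y, (bins, bins))
    pxy = nxy / np.sum(nxy)            # joint probability P(X=x_i, Y=y_j)
    px = np.sum(pxy, axis=1)           # marginal P(X=x_i)
    py = np.sum(pxy, axis=0)           # marginal P(Y=y_j)
    # H(Y|X) = sum over x,y of P(x,y) log2[P(x)/P(x,y)], skipping zero-probability bins.
    h_y_given_x = 0.0
    for i in range(pxy.shape[0]):
        for j in range(pxy.shape[1]):
            if pxy[i, j] > 0:
                h_y_given_x += pxy[i, j] * np.log2(px[i] / pxy[i, j])
    h_y = entropy(py)                  # entropy function from Question 2.1
    return (h_y - h_y_given_x) / h_y

Applied to the first dataset, for example normalized_mutual_information(d[0], d[1]), this should give a value in the region of the 0.09 quoted above, although the exact number will depend on the choice of binning.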
Question 3.4: Plot each dataset as a scatterplot and title the figure with both the value of the correlation coefficient and the value of the normalized mutual information. Considering all datasets, explain why both measures of dependence change with each dataset in the way you observe. [7]