留学生辅导 In this lab sheet we will work through the foundations of statistical mode

In this lab sheet we will work through the foundations of statistical modelling. How we can combine prior knowledge with data to update our knowledge. We will see how we can formulate knowledge using probability distributions and perform updates using statistical inference. In specific we will look at two examples, Bernoulli trials and Gaussian linear regression. We will also talk about the concept of conjugate priors allowing us to perform inference in closed form. The aim of this sheet is to try and communicate a foundation that is needed to digest the material that comes later on. For many of you this will be things that you have seen before. Do not be scared of the sheer size of the lab sheet we have been very verbose with derivations and explanations.
This lab-sheet consists of two parts, in the first we introduce probabilistic modelling through a very simple example that you have most likely seen before namely figuring out if a coin is biased1. What we will use this example for is try to get the theory and the philosophy behind what we are trying to do into mathematically precise terms. We will then proceed and do exactly the same thing again but now with a slightly more interesting example namely that of fitting a line to a set of data-points. Your task is to see the similarity between both of these two examples, how they are actually exactly the same. The central concept comes directly from Laplace Demon Laplace, 1814 there exists infinitely many equally valid explanations of a finite sample set. Therefore as machine learners we do not deal in truth we deal in assumptions and beliefs. What we do is to use data to find support or rejections of our belief to reach an updated belief. In short, truth is not something that you learn, truth is something that you accept and in this unit we are interested in learning from data. We aim to make use of the knowledge that has already been discovered and from this create new knowledge by using data. Importantly we want to be able to quantify what we know and as you will see this you can do with probabilities by considering them as quantification of beliefs.
The second part of the lab-sheet consists of derivations of the Gaussian identities. Most of the models that we will use in this unit are in one way or another based on the Gaussian distribution. In the second part of the lab sheet we will work through and prove the identities that we are going to use. Now these things you will not have to be able to derive but if you are mathematically inclined and want to see what all the simple structures of Gaussian’s comes its quite nice to go through this but trust me after this unit you will know these formulas by heart.
1 Bernoulli Trial

Copyright By PowCoder代写 加微信 powcoder

We have been given the task to observe a system which has a binary outcome, you can think of the example of modelling a coin toss. We will parametrise the system using a single parameter that describes how often we get each outcome, with the coin analogy how biased the coin is. We will refer to a single outcome of the system as x and if we run the system for N iterations we will refer to all the data x. We will first create a
1Yes these coins do really exists https://izbicki.me/blog/how-to-create-an-unfair-coin-and-prove-it-with-math. html

function to generate the data by sampling from a distribution with known parameters, the machine learning task is then to recover the parameters of the distribution from only the data-points.
In order to phrase this as a learning problem we need to formulate the probabilistic objects that constitutes the model of how the data have been generated. We will formulate the likelihood function that will express our belief in specific types of data if we know the parametrisation of the system. Say that if I think that I have an unbiased coin then seeing 100 heads in a row would be a very unlikely outcome while seeing 50 heads and 50 tails would be highly likely. We will then parametrise the prior which will encode our belief in the parameter value of the system, i.e. how likely do we believe the coin to be biased before we see data. Having these two objects we have completely specified our model and can generate data. Now the inference procedure starts where we try to reach the posterior distribution. The posterior distribution is the distribution that encapsulate both our prior belief about the system with the evidence in the data.
1.1 Likelihood
The first thing we need to think of is what is the likelihood function. For a binary system it makes sense to use a Bernoulli Distribution as likelihood function,
p(x|μ)Bern(x|μ) = μx(1 − μ)1−x.
This distribution takes a single parameter μ which completely parametrises our likelihood. This means, we have a conditional distribution and the system is completely specified by μ. If we know μ we can generate output that is similar to what the actual system does, this means we can predict how the system behaves. Now we want to use examples of the systems predictions training data find μ. So far we have only set the likelihood for one data-point, what about when we see lots of them? Now we will make our first assumption, we will assume that each output of the system is independent. This means that we can factorise the distribution over several trials in a simple manner,
p(x|μ) = 􏰍 Bern(x|μ) = 􏰍 μx(1 − μ)1−x.
Now we have the likelihood for the whole data-set x. Think a bit about what this assumption implies. It
means that,
1. each data-point is independent as our likelihood function is invariant to any permutation of the data.
2. it assumes that each of the data-points are generated by the same distribution. For the coin example this means that by tossing the coin you do not change the actual coin by the act of tossing it2.
In order to say something about the system we need to have a prior belief about what we think the parameter μ is. Now our prior knowledge comes into play, what do I know about the system? If our system is the outcome of a coin toss then we have a lot of prior knowledge, most coins that I toss are unbiased so I have quite a good idea of what I think it should be. Another way of seeing this is that I would need to see a lot of coin tosses saying something else for me to believe that a coin is not biased. If it is not a coin toss but something that I have no experience in my prior might be different. Once we have specified the prior we can just do Bayes’ rule and get the posterior,
p(μ|x) = p(x|μ)p(μ). p(x)
Now comes the first tricky part, what should the distribution be for the prior? If you choose your prior or likelihood wrong3 then this computation might not even be analytically possible to perform. So this is when
2this is probably an OK assumption for metal coins while for chocolate coins this assumption is too simplistic
3its a bit more complicated than this but you have to believe me on this right now but we will see more of this later in the unit

we will use conjugacy. First lets note that the posterior distribution is proportional to the likelihood times the prior (or the joint distribution),
p(μ|x) ∝ p(x|μ)p(μ).
The second thing we will think of is that it does make sense that if I have a prior of a specific form (say a Gaussian) then I would like the posterior to be of the same form (i.e. Gaussian). So this means that we should choose a prior such that when multiplied with the likelihood it leads to a distribution that is of the same form of the prior. So why is this useful? This is very useful because it means we do not have to compute the denominator in Baye’s rule, the only thing we need to do is to multiply the prior with the likelihood and then identify the parameters that of the distribution, we can do this as we know its form.
So how do we know which distributions are conjugate to what? Well this is something that we leave to the mathematicians most of the time, we simply exploit their results. There is a list on Wikipedia of conjugate priors here URL. Its like a match making list for distributions. For the Bernoulli distribution the conjugate prior to its only parameter μ is the Beta distribution,
Beta(μ|a, b) = Γ(a + b) μa−1(1 − μ)b−1, Γ(a)Γ(b)
where Γ(·) is the gamma function. The role of the Gamma function is to normalise this to make sure it becomes a distribution not just any function. Now we have choosen our prior we are ready to derive the posterior.
1.3 Posterior
To get to the posterior we are going to multiply the likelihood and the prior together,
p(μ|x) ∝ =
p(x|μ)p(μ) N
􏰍 Bern(x|μ)Beta(μ|a, b) i=1
􏰍N Γ(a + b)
μa−1(1 − μ)b−1 (3) μ􏰊i xi (1 − μ)􏰊i(1−xi) Γ(a + b) μa−1(1 − μ)b−1 (4)
μxi (1 − μ)1−xi
Γ(a)Γ(b) Γ(a)Γ(b)
Γ(a + b) μ􏰊i xi (1 − μ)􏰊i(1−xi)μa−1(1 − μ)b−1 (5) Γ(a)Γ(b)
Γ(a + b) μ􏰊i xi+a−1(1 − μ)􏰊i(1−xi)+b−1. (6) Γ(a)Γ(b)
Now comes the trick with conjugacy, we know the form of the posterior. This means we can just identify the parameters of the posterior and in this case it is trivial,
􏰌xi +a−1 􏰌(1−xi)+b−1 ii
an −1 bn −1
p(μ|x) ∝ μ􏰑 􏰐􏰏 􏰒(1 − μ)􏰑 􏰐􏰏 􏰒.
Now what is left is to make sure that the expression is actually a probability distribution such that it integrates to one. This means that we need to solve the following,
1 = Z p(μ|x)dμ = Z μan−1(1 − μ)bn−1dμ.
In this case this is trivial as we know the normaliser of the beta distribution which means that,
Z = Γ(an + bn) Γ(an )Γ(bn )

This mean that my posterior is,
p(μ|x)=Beta(μ|an,bn)= Γ(􏰊ixi +a)Γ(􏰊i(1−xi)+b)μ i (1−μ) i
So thats it, we have the posterior and now we can fix our parameters for the prior a and b and then compute the posterior and get an and bn after seeing n data-points. So lets write code that simulates one of these experiments.
1.4 Implementation
Common practice if you want to test something is to generate data from your model with known parameters, throw away the parameters and then see if you can recover the parameter. What we first then want to do is to sample a large set of binary outcomes. We can do this by using the Binomial4 in numpy. So we start of with setting μ to 0.2 and then generate 200 values from this distribution, i.e. running the system 200 iterations or tossing a coin 200 times. Then we define our prior by setting the parameters a and b. Now we can compute our posterior, we know its form, we both derived it above, but in most cases you just write it down, that is what you will do for linear regression. Now we can plot the posterior when we see more and more examples and see what will happen. Below is the image that it generated for me,
4 https://en.wikipedia.org/wiki/Binomial_distribution
Γ(􏰊ixi+a+􏰊i(1−xi)+b) 􏰊 xi+a−1 􏰊(1−xi)+b−1

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
def posterior(a,b,X):
a_n = a + X.sum()
b_n = b + (X.shape[0]-X.sum())
return beta.pdf(mu_test,a_n,b_n)
# parameters to generate data
mu = 0.2 N = 100
# generate some data
X = np.random.binomial(1,mu,N)
mu_test = np.linspace(0,1,100)
# now lets define our prior
a = 10 b = 10
# p(mu) = Beta(alpha,beta)
prior_mu = beta.pdf(mu_test,a,b)
# create figure
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(111)
# plot prior
ax.plot(mu_test,prior_mu,’g’)
ax.fill_between(mu_test,prior_mu,color=’green’,alpha=0.3)
ax.set_xlabel(‘$\mu$’)
ax.set_ylabel(‘$p(\mu|\mathbf{x})$’)
# lets pick a random (uniform) point from the data # and update our assumption with this
index = np.random.permutation(X.shape[0])
for i in range(0,X.shape[0]):
y = posterior(a,b,X[:index[i]])
plt.plot(mu_test,y,’r’,alpha=0.3)
y = posterior(a,b,X)
plt.plot(mu_test,y,’b’,linewidth=4.0)
# ignore this
plt.tight_layout()
plt.savefig(path, transparent=True)
return path

Figure 1: The green distribution is the prior distribution over μ and the red distributions are the updated belief when we see more and more data-points and the blue is the final posterior when we have seen all the points.
Now when we have built our model we can try out lots of different things, implement the tasks below and evaluate what they mean
Reflections
1. what happens if you choose a prior that is very confident at a place far from the true value?
2. what happens if you choose a prior that is very confident at the true value?
3. in the plot above we cannot see the order of the red lines, to do so create a plot where the x-axis
is the number of data-points that you have used to compute the posterior and the y-axis is
the distance of the posterior mean to the prior mean.
4. How much does the order of the data-points matter? Redo the plot above but with many
different random permutations of the data. Now plot the mean and the variance of each iteration. What can you see?
1.5 Summary
The intention of this lab was to show the mechanism that allows us to learn from data, how we can take our prior beliefs and update them by learning from data. Even though a coin-toss might be a silly example5 it importantly contains most of the elements that we will use to build models over the next couple of weeks. The prior is the object that contains our belief about the system, this is what we can get from domain experts, the likelihood describes how strong the evidence is for a specific model setting and the posterior gives us the updated belief once we have taken into account how well all models describe our data weighted by our beliefs.
5not if you are playing cricket where if you win the toss you always choose to bat

2 Linear Regression
Now we will take the methodology that we used in the bernoulli trial one step further and apply it to a more interesting problem. We will perform regression by fitting a function to a set of observed data. The key thing to see is that we are using exactly the same structure to learn, we formulate our beliefs and we integrate it with observed data to get an updated belief.
We observe a data-set D = {x ̃i, yi}Ni=1 where we assume the following relationship between the variates,
yi =f(x ̃i). (7)
Our task is to infer the function f(·) from D. To simplify things we are going to limit the hypothesis space to be only of linear functions. This means that we can write Eq 7 as,
T 􏰔 w 1 􏰕 T 􏰔 x ̃ i 􏰕
yi=w1x ̃i+w0=wxi= w0 1 (8)
where we have rewritten the input variate by appending a one so that we can write everything on matrix form. The task that we will perform in this lab is to infer the function parametrised by w by observing D.
Now we will make our first assumption, we will assume that the data we observed is not instantiations of the “true” underlying function but rather have been corrupted by addative noise. This means that we get the following model,
yi = wTxi + ε. (9)
Now can we make an argument what form this noise will take? One assumption would be to say that the noise is independent of where in the input space we evaluate the function, this is called homogenous noise. Furthermore we could argue that we know the form of the noise, one idea would be to say that the noise is Gaussian,
ε ∼ N(0,β−1),
with precision β. Now this would mean is that if we could directly observe the noise we could formulate a
likelihood as,
Now we can use Eq.9 and rewrite,
p(ε) = N (ε|0, β−1). (10) yi = wTxi + ε (11)
yi − wTxi = ε (12) If we now combine this new expression of the noise with the assumption of the stochastic form in Eq.10 we
p(ε) = N (ε|0, β−1) (13) =N(yi −wTxi|0,β−1) (14) = N (yi|wTxi, β−1). (15)
The last step can be done because we can “translate” a Gaussian distribution Figure 2.
What we have just done is to formulate a likelihood function. It describes how likely an observed output location is to have come from the a specific parametrisation of the function. Or more precisely this is the function that quantifies how much evidence a specific data point provides for a specific model.

Figure 2: Figure showing how we can re-parametrise a Gaussian by translation. In the first case we have a Gaussian with mean zero over the difference between the observed output yi and the output of the function wTxi which is the same as a Gaussian over the output yi with mean wTxi where in this case the latter equates to 1.
Now the formulation is for a single data-point but if we assume that the noise is independent we can easily formulate the likelihood for a set of data,
p(y|w, X) = 􏰍 p(yi|w, xi), (16)
where X = [x1,…,xN]T and y = [y1,…,yN]T. Think back to the Bernoulli trial, its exactly the same thing. One important difference is how we motivated the construction of the likelihood, it came through making an assumption of the structure of the noise. Concretely we made three assumptions,
1. the noise is addative
2. the noise is Gaussian
3. the noise is independent of the input location.
These assumptions needs to be justified on a problem to problem basis. Say for example that we are measuring the vibrations in a car as a function of speed. It might be the case that the noise in the sensor measuring the vibrations also depends on the speed of the car then we should not use a homoscedastic noise model. Here to keep things simple we will only look at artificial data but later on in the module we will look at examples where we have prior knowledge.
So now we have our likelihood function and if we knew the parameters of the function w we would be able to generate data. However, we want to infer the these parameters from the data and to do so we need to formulate our beliefs over different parametrisations using a prior distribution.
In order to specify our prior distribution we will again use the concept of conjugacy. If we look at the likelihood in Eq.16 it is itself a Gaussian distribution6. We will also assume that the parameters of the noise
6check this for yourself by writing up the product of the individual terms

β is known. Now we can again look-up what the conjugate prior is for a Gaussian with known variance7] as it turns out this in itself is another Gaussian distribution. That Gaussians are conjugate to themselves is called self-conjugacy. Therefore to exploit conjugacy we will use a Gaussian prior for the parameters of the function such that,
p(w) = N (w0, S0), (17) S0 = λI. (18)
The structure of the prior covariance tells us that we assume that the two parameters w1 and w0 are independent with equal variance. Think about this assumption, does this make sense for a line? Well the great thing about distributions is that you can sample from it and generate the results that you can reason about. The code below will generate sample lines where w0 = 0 and S0 = I. The resulting samples can be seen in Figure 3.
import numpy as np
import matplotlib.pyplot as plt
def plot_line(ax, w): # input data
X = np.zeros((2,2))
X[0,0] = -5.0
X[1,0] = 5.0
X[:,1] = 1.0
# because of the concatenation we have to flip the transpose
y = w.dot(X.T)
ax.plot(X[:,0], y)
# create prior distribution
tau = 1.0*np.eye(2)
w_0 = np.zeros((2,1))
# sample from prior
n_samples = 100
w_samp = np.random.multivariate_normal(w_0.flatten(), tau, size=n_samples)
# create plot
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(111)
for i in range(0, w_samp.shape[0]):
plot_line(ax, w_samp[i,:])
# save fig
plt.tight_layout()
plt.savefig(path, transparent=True)
return path
7 https://en.wikipedia.org/wiki/Conjugate_prior

Figure 3: Samples from the prior with the prior covariance set to identitiy and prior mean of 0. Reflections
1. what is the most likely line according to your prior belief? 2. what is the least likely line according to your prior belief? 3. is there any lines that have zero probability in this belief?
2.3 Posterior
Now we have encoded our prior belief and we have formulated our likelihood function and its time to formulate the posterior distribution. We will do exactly the same thing as we did in the Bernoulli trial but the math will be a bit mo

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com