
COMP9417 – Machine Learning Tutorial: Regression II
Weekly Problem Set: Please submit questions 1a, 1b and 3 on Moodle by 11:55am Tuesday 1st March, 2022. Please only submit these requested questions and no others.
Question 1. Maximum Likelihood Estimation (MLE)
In this question we will first review and then work through a few examples of parameter estimation using the MLE technique. The following introduction can be skipped if you are comfortable with the MLE concept already.


The setting is as follows: we sample n observations (data), which we denote by X1 , X2 , . . . , Xn , and we assume that the data is independently drawn from some probability distribution P . The shorthand for this is:
X1, …, Xn ∼ P (i.i.d.),
where i.i.d. stands for independent and identically distributed. In practice, we never have access to P, we are just able to observe samples from P (namely X1,…,Xn), which we will use to learn something about P . In the simplest case, we assume that P belongs to a parametric family. For example, if we assume that P belongs to the family of normal distributions, then we are assuming that P has a probability density function (pdf) of the form
pθ(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),  θ = (μ, σ²),  μ ∈ ℝ, σ > 0,
where we will usually refer to μ as the mean and σ² as the variance, and we combine all unknown parameters into a single parameter vector θ that lives in some parameter space Θ. In this particular example, Θ = ℝ × (0, ∞). Under this assumption, if we knew θ, then we would know P, and so the learning problem reduces to learning the best possible parameter θ∗, hence the name parametric.
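As a quick sanity check on the density formula above, it can be evaluated directly in code; a minimal stdlib-only sketch (the function name and the choice θ = (1, 4) are ours, picked purely for illustration):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Evaluate the N(mu, sigma2) density at x, straight from the formula."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 1.0, 4.0  # an arbitrary choice of theta = (mu, sigma^2)

# At x = mu the exponential term is 1, so the density is 1 / sqrt(2*pi*sigma2).
assert abs(normal_pdf(mu, mu, sigma2) - 1 / math.sqrt(2 * math.pi * sigma2)) < 1e-12

# The density is symmetric about the mean.
assert normal_pdf(mu + 1.5, mu, sigma2) == normal_pdf(mu - 1.5, mu, sigma2)
```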
Continuing with this example, we need a way of quantifying how good a particular choice of θ is. To do this, we first recall that for independent events A, B, C, it holds that P(A and B and C) = P(A)P(B)P(C). Therefore, we have:

Prob. of observing X1, …, Xn = Prob. of observing X1 × ⋯ × Prob. of observing Xn
                             = pθ(X1) × ⋯ × pθ(Xn)
                             = ∏_{i=1}^n pθ(Xi)
                             =: L(θ).

We call L(θ) the likelihood, and it is a function of the parameter vector θ. We interpret this quantity as the probability of observing the data when using a particular choice of parameter. Obviously, we want to choose the parameter θ that gives us the highest possible likelihood, i.e. we wish to find the maximum likelihood estimator
θ̂MLE := argmax_{θ∈Θ} L(θ).
Since this is just an optimization problem, we can rely on what we know about calculus to solve for the MLE estimator.
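To make the optimisation view concrete, here is a small sketch (sample size, seed, and grid are our own invented choices) that maximises the Gaussian log-likelihood with known variance 1 by brute-force grid search over candidate means, and checks that the maximiser agrees with the sample mean, as derived below:

```python
import math
import random

random.seed(0)
n = 200
true_mu = 2.5
data = [random.gauss(true_mu, 1.0) for _ in range(n)]  # X_i ~ N(mu, 1), i.i.d.

def log_likelihood(mu, xs):
    """log L(mu) for the N(mu, 1) model."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in xs)

# Brute-force argmax over a grid of candidate means in [0, 5) with step 0.01.
grid = [i / 100 for i in range(500)]
mu_hat = max(grid, key=lambda m: log_likelihood(m, data))

sample_mean = sum(data) / n
# The grid maximiser should sit at the grid point nearest the sample mean.
assert abs(mu_hat - sample_mean) < 0.02
```

A grid search is only for illustration; the derivation below gives the maximiser in closed form.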
(a) Assume that X1 , . . . , Xn ∼ N (μ, 1), that is, we already know that the underlying distribution is
Normal with a population variance of 1, but the population mean is unknown. Compute μ̂MLE. Hint: it is often much easier to work with the log-likelihood, i.e. to solve the optimisation:
θ̂MLE := argmax_{θ∈Θ} log L(θ),
which gives exactly the same answer as solving the original problem (why?).
The log-likelihood here is
log L(m) = log ∏_{i=1}^n (1/√(2π)) exp(−(Xi − m)²/2)
         = −(n/2) log(2π) − (1/2) Σ_{i=1}^n (Xi − m)².

Differentiating with respect to m and setting equal to zero yields:

∂/∂m log L(m) = Σ_{i=1}^n (Xi − m) = 0  ⟹  μ̂MLE = (1/n) Σ_{i=1}^n Xi = X̄.

To see that this is indeed a maximum, we should perform a second derivative test, which yields:

∂²/∂m² log L(m) = −n < 0.

(b) Assume that X1, …, Xn ∼ Bernoulli(p); compute p̂MLE. Recall that the Bernoulli distribution is discrete and has probability mass function:

P(X = k) = pᵏ(1 − p)^{1−k},  k = 0, 1,  p ∈ [0, 1].

Note here that θ = p and the parameter space is Θ = [0, 1]. We construct the log-likelihood in the usual way:

log L(p) = log ∏_{i=1}^n p^{Xi}(1 − p)^{1−Xi} = nX̄ log p + n(1 − X̄) log(1 − p).

Then, differentiating and setting to zero yields:

∂/∂p log L(p) = nX̄/p − n(1 − X̄)/(1 − p) = 0  ⟹  p̂MLE = X̄ = (1/n) Σ_{i=1}^n Xi.

(c) optional: Assume that X1, …, Xn ∼ N(μ, σ²). Compute (μ̂MLE, σ̂²MLE).

The log-likelihood here is

log L(m, s²) = log ∏_{i=1}^n (1/√(2πs²)) exp(−(Xi − m)²/(2s²))
             = −(n/2) log(2π) − (n/2) log(s²) − (1/(2s²)) Σ_{i=1}^n (Xi − m)².

To solve for the two MLE estimates simultaneously, we differentiate the log-likelihood with respect to each of the two parameters and set each derivative to zero, which yields a system of two equations. Solving these equations simultaneously yields the solution. Differentiating with respect to m first and setting equal to zero yields:

∂/∂m log L(m, s²) = (1/s²) Σ_{i=1}^n (Xi − m) = 0  ⟹  μ̂MLE = X̄.

Note that in this case, the first equation does not depend on the second parameter, so we can solve it directly (this is not always the case). Next, differentiating with respect to s²:

∂/∂s² log L(m, s²) = −n/(2s²) + (1/(2s⁴)) Σ_{i=1}^n (Xi − m)² = 0  ⟹  σ̂²MLE = (1/n) Σ_{i=1}^n (Xi − m)².

To solve this, we refer to the first of the two equations, which tells us that m = X̄ is optimal, and so

(μ̂MLE, σ̂²MLE) = (X̄, (1/n) Σ_{i=1}^n (Xi − X̄)²).

Note, in order to be completely rigorous when finding a local maximum (t̂1, t̂2) of a function F(t1, t2), we need to check the following three conditions:
1. The first-order partial derivatives at (t̂1, t̂2) are zero.
2. At least one second-order partial derivative is negative.
3. The determinant of the Hessian matrix is positive.
We will not verify these conditions here, as this is beyond the scope of the course, but it is important to note that we are leaving out some details.

Question 2.
Bias and Variance of an Estimator

In the previous question, we discussed the MLE as a method of estimating a parameter. But there are infinitely many ways to estimate a parameter; for example, one could use the sample median instead of the MLE. It is useful to have a framework in which we can compare estimators systematically, which brings us to two central concepts in machine learning: bias and variance. Assume that the true parameter is θ, and we have an estimate θ̂. Note that an estimator is just a function of the observed (random) data (i.e. we can always write θ̂ = θ̂(X)) and so is itself a random variable! We can therefore define:

bias(θ̂) = E(θ̂) − θ,  var(θ̂) = E(θ̂ − E(θ̂))².

The lab this week explores these concepts as well, and you are encouraged to do the lab exercise as you complete this question to get a full picture. A short summary of the lab in words:
• bias: tells us how far the expected value of our estimator is from the truth. Recall that an estimator is a function of the data sample we observe. The expectation of an estimator can be thought of in the following manner: imagine that instead of having a single data sample, we have an infinite number of data samples. We compute the same estimator on each sample and then take an average; this is the expected value of the estimator.
• variance: tells us how variable our estimator is. Again, if we had an infinite number of data samples, we could compute the estimator on each one and check the variation of the estimator across all samples.
A good estimator should have low bias and low variance.

(a) Find the bias and variance of μ̂MLE where X1, X2, …, Xn ∼ N(μ, 1).

We have already found that μ̂MLE = X̄. Therefore

bias(μ̂MLE) = bias(X̄) = E((1/n) Σ_{i=1}^n Xi) − μ = (1/n) Σ_{i=1}^n E(Xi) − μ = (1/n) Σ_{i=1}^n μ − μ = 0,

and we say X̄ is an unbiased estimator for μ.
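This unbiasedness can also be seen empirically in the spirit of the "many data samples" picture above; a small Monte Carlo sketch (the true mean, sample size, trial count, and seed are all invented for illustration):

```python
import random

random.seed(1)
mu, n, trials = 3.0, 50, 20000

# Simulate many independent datasets and compute the MLE (sample mean) on each.
estimates = []
for _ in range(trials):
    sample = [random.gauss(mu, 1.0) for _ in range(n)]  # X_i ~ N(mu, 1)
    estimates.append(sum(sample) / n)

# Averaging the estimator over many datasets approximates its expectation,
# so this difference approximates the bias, which should be near 0.
empirical_bias = sum(estimates) / trials - mu
assert abs(empirical_bias) < 0.01
```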
Next, we have

var(μ̂MLE) = var(X̄) = var((1/n) Σ_{i=1}^n Xi) = (1/n²) Σ_{i=1}^n var(Xi) = (1/n²) · n = 1/n,

where in the third equality we have used the independence of the Xi's, and in the final equality we have used the fact that Xi ∼ N(μ, 1).

(b) Find the bias and variance of p̂MLE where X1, X2, …, Xn ∼ Bernoulli(p).

We have already found that p̂MLE = X̄, and it can easily be shown that var(Xi) = p(1 − p). We therefore get that

bias(p̂MLE) = 0,  var(p̂MLE) = p(1 − p)/n.

(c) The mean squared error (MSE) is a metric that is widely used in statistics and machine learning. For an estimator θ̂ of the true parameter θ, we define its MSE by:

MSE(θ̂) := E(θ̂ − θ)².

Show that the MSE obeys a bias–variance decomposition, i.e. we can write

MSE(θ̂) = bias(θ̂)² + var(θ̂).

Adding and subtracting E(θ̂) inside the square, we have:

MSE(θ̂) = E(θ̂ − θ)²
        = E[θ̂ − E(θ̂) + E(θ̂) − θ]²
        = E[(θ̂ − E(θ̂))² + 2(θ̂ − E(θ̂))(E(θ̂) − θ) + (E(θ̂) − θ)²]
        = E(θ̂ − E(θ̂))² + 2E[(θ̂ − E(θ̂))(E(θ̂) − θ)] + E[E(θ̂) − θ]²
        = E(θ̂ − E(θ̂))² + 2(E(θ̂) − E(θ̂))(E(θ̂) − θ) + [E(θ̂) − θ]²
        = E(θ̂ − E(θ̂))² + 0 + [E(θ̂) − θ]²
        = var(θ̂) + bias(θ̂)².

Question 3. Probabilistic View of Least-Squares Regression

In the tutorial last week, we viewed the least-squares problem purely from an optimisation point of view. We specified the model we wanted to fit, namely

ŷ = wᵀx,

as well as a loss function (MSE), and simply found the weight vector w that minimized the loss. We proved that when using MSE, the best possible weight vector was given by ŵ = (XᵀX)⁻¹Xᵀy. In this question, we will explore a different point of view, which we can call the statistical view. At the heart of the statistical view is the data generating process (DGP), which assumes that there is some true underlying function f that generates the data, but we only have access to noisy observations of f. That is, we observe

y = f(x) + ε,

where ε is some random noise. For example, assume your y's represent the daily temperature in Kensington.
Any thermometer, even the most expensive, is prone to measurement error, and so what we actually observe is the true temperature (f(x)) plus some random noise ε. Most commonly, we will assume that the noise is normally distributed with zero mean and variance σ². Now, consider the (strong) assumption that f(x) is linear, which means that there is some true β∗ such that f(x) = xᵀβ∗. Therefore, we have

y = xᵀβ∗ + ε,  ε ∼ N(0, σ²),

and therefore

y|x ∼ N(xᵀβ∗, σ²).

What this says is that our response (conditional on knowing the feature value x) follows a normal distribution with mean xᵀβ∗ and variance σ². We can therefore think of our data as a random sample of observations coming from this distribution, which in turn allows us to estimate unknown parameters via maximum likelihood, just as we did in the previous questions.

(a) You are given a dataset D = {(x1, y1), …, (xn, yn)} and you make the assumption that yi|xi = xᵢᵀβ∗ + εi for some unknown β∗ and εi ∼ N(0, σ²), where all the εi's are independent of each other. Write down the log-likelihood for this problem as well as the maximum likelihood estimation objective, and solve for the MLE estimator β̂MLE.

Under this assumption, we have that yi|xi ∼ N(xᵢᵀβ∗, σ²), i = 1, …, n, or we can write this in matrix notation as:

y|X ∼ N(Xβ∗, σ²I).

We can compute the likelihood of the data as L(β) = P(y|X, β). It is important to interpret this probability properly: it is the probability of seeing the responses y given that we have features X and we are assuming the underlying vector is β. L(β) will give us a different value of the likelihood for different choices of β, and we want to choose the best β, i.e. the one that maximizes L(β).
Now, let's write out what log L(β) is in detail:

log L(β) = log P(y|X, β)
         = log ∏_{i=1}^n P(yi|xi, β)
         = Σ_{i=1}^n log P(yi|xi, β)
         = Σ_{i=1}^n log( (1/√(2πσ²)) exp(−(yi − xᵢᵀβ)²/(2σ²)) )
         = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (yi − xᵢᵀβ)²
         = −(n/2) log(2πσ²) − (1/(2σ²)) ‖y − Xβ‖²₂.

We therefore get that the MLE estimator of β is

β̂MLE = argmax_β log L(β)
      = argmax_β { −(n/2) log(2πσ²) − (1/(2σ²)) ‖y − Xβ‖²₂ }
      = argmin_β { (n/2) log(2πσ²) + (1/(2σ²)) ‖y − Xβ‖²₂ }
      = argmin_β ‖y − Xβ‖²₂
      = (XᵀX)⁻¹Xᵀy.

The second equality holds because maximising an objective is the same as minimizing the negative of the same objective, and the third equality holds since the minimizer is unaffected by the first (constant) term. Note that we have shown that the MLE in this case is exactly identical to the least-squares estimator; this is not true in general. For example, had we used a different loss function (not MSE), or a different assumption about the noise, or a different distribution (other than the normal), then we would not get equivalence between LS and MLE. We can think of this as a probabilistic justification for doing least squares.
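The equivalence can be checked numerically on synthetic data; a sketch (dimensions, coefficients, and noise level are all invented) that computes the closed-form solution and verifies the first-order condition of the log-likelihood, Xᵀ(y − Xβ̂) = 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_star = np.array([1.0, -2.0, 0.5])             # "true" coefficients (invented)
y = X @ beta_star + rng.normal(scale=0.3, size=n)  # y = X beta* + eps, eps ~ N(0, 0.09)

# Closed-form least-squares / MLE solution via the normal equations:
# (X^T X) beta_hat = X^T y, i.e. beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of log L(beta) is proportional to X^T (y - X beta),
# so it should vanish at beta_hat.
grad = X.T @ (y - X @ beta_hat)
assert np.allclose(grad, 0.0, atol=1e-6)

# With modest noise, beta_hat should also land near beta_star.
assert np.allclose(beta_hat, beta_star, atol=0.1)
```

Solving the normal equations directly is fine for a small well-conditioned example like this; in practice `np.linalg.lstsq` is the numerically safer route.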