COMP9417 – Machine Learning Tutorial: Regression II
Weekly Problem Set: Please submit questions 1a, 1b and 3 on Moodle by 11:55am Tuesday 1st March, 2022. Please only submit these requested questions and no others.
Question 1. Maximum Likelihood Estimation (MLE)
In this question we will first review and then work through a few examples of parameter estimation using the MLE technique. The following introduction can be skipped if you are comfortable with the MLE concept already.
The setting is as follows: we sample n observations (data), which we denote by X1, X2, . . . , Xn, and we assume that the data is independently drawn from some probability distribution P. The shorthand for this is:
$$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P,$$
where i.i.d. stands for independent and identically distributed. In practice, we never have access to P; we are only able to observe samples from P (namely X1, . . . , Xn), which we will use to learn something about P. In the simplest case, we assume that P belongs to a parametric family. For example, if we assume that P belongs to the family of normal distributions, then we are assuming that P has a probability density function (pdf) of the form
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad \theta = (\mu, \sigma^2), \quad \mu \in \mathbb{R}, \ \sigma > 0,$$
where we will usually refer to μ as the mean and σ2 as the variance, and we combine all unknown parameters into a single parameter vector θ that lives in some parameter space Θ. In this particular example, Θ = R × (0, ∞). Under this assumption, if we knew θ, then we would know P, and so the learning problem reduces to learning the best possible parameter θ∗, hence the name parametric.
Continuing with this example, we need a way of quantifying how good a particular choice of θ is. To do this, we first recall that for independent events A, B, C, it holds that P(A and B and C) = P(A)P(B)P(C). Therefore, we have:
$$\begin{aligned}
\text{Prob of observing } X_1, \ldots, X_n &= \text{Prob of observing } X_1 \times \cdots \times \text{Prob of observing } X_n \\
&= p_\theta(X_1) \times \cdots \times p_\theta(X_n) \\
&= \prod_{i=1}^{n} p_\theta(X_i) =: L(\theta).
\end{aligned}$$
We call L(θ) the likelihood, and it is a function of the parameter vector θ. We interpret this quantity as the probability of observing the data when using a particular choice of parameter. Obviously, we want to choose the parameter θ that gives us the highest possible likelihood, i.e. we wish to find the maximum likelihood estimator
$$\hat{\theta}_{\text{MLE}} := \arg\max_{\theta \in \Theta} L(\theta).$$
Since this is just an optimization problem, we can rely on what we know about calculus to solve for the MLE estimator.
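To make the optimisation concrete before attempting the parts below, here is a minimal numerical sketch (not part of the tutorial questions): it simulates i.i.d. N(μ, 1) data and maximises the log-likelihood over a grid of candidate means. All names and constants (true_mu, the grid range, the sample size) are made up for illustration.

```python
# Minimal numerical sketch of MLE (illustrative): simulate i.i.d. N(mu, 1) data
# and maximise the log-likelihood over a grid of candidate means.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_mu = 2.5                                      # arbitrary "true" parameter
X = rng.normal(loc=true_mu, scale=1.0, size=200)   # observed sample

def log_likelihood(mu, data):
    # log L(mu) = sum_i log p_mu(X_i) for the N(mu, 1) model
    return norm.logpdf(data, loc=mu, scale=1.0).sum()

candidates = np.linspace(0.0, 5.0, 1001)           # grid over part of the parameter space
lls = np.array([log_likelihood(m, X) for m in candidates])
mu_hat = candidates[np.argmax(lls)]                # numerical argmax of the log-likelihood

print(f"numerical MLE: {mu_hat:.3f}")
```

A grid search is crude, but it keeps the argmax definition visible; in parts (a)–(c) you solve the same maximisation in closed form with calculus.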
(a) Assume that X1, . . . , Xn ∼ N(μ, 1), that is, we already know that the underlying distribution is Normal with a population variance of 1, but the population mean is unknown. Compute μˆMLE.
Hint: it is often much easier to work with the log-likelihood, i.e. to solve the optimisation:
$$\hat{\theta}_{\text{MLE}} := \arg\max_{\theta \in \Theta} \log L(\theta),$$
which gives exactly the same answer as solving the original problem (why?).
(b) Assume that X1, . . . , Xn ∼ Bernoulli(p) (i.i.d.), and compute pˆMLE. Recall that the Bernoulli distribution is discrete and has probability mass function:
$$P(X = k) = p^k (1 - p)^{1 - k}, \qquad k \in \{0, 1\}, \quad p \in [0, 1].$$
(c) optional: Assume that X1, . . . , Xn ∼ N(μ, σ2) (i.i.d.). Compute (μˆMLE, σˆ2MLE).
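If you want to sanity-check your closed-form answers, a numerical maximiser is handy. The sketch below (illustrative only; the sample size and true parameter are arbitrary) minimises the negative Bernoulli log-likelihood with scipy and prints the numerical maximiser, which you can compare against your derivation for part (b).

```python
# Numerical check for part (b) (illustrative): minimise the negative Bernoulli
# log-likelihood and compare the result with your closed-form derivation.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
X = rng.binomial(n=1, p=0.3, size=500)   # i.i.d. Bernoulli(0.3) sample (p chosen arbitrarily)

def neg_log_likelihood(p, data):
    # -log L(p) = -sum_i [X_i log p + (1 - X_i) log(1 - p)]
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                      args=(X,), method="bounded")
print(f"numerical p_hat: {res.x:.4f}")   # compare with your answer to (b)
```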
Question 2. Bias and Variance of an Estimator
In the previous question, we discussed the MLE as a method of estimating a parameter. But there are an infinite number of ways to estimate a parameter. For example, one could choose to use the sample median instead of the MLE. It is useful to have a framework in which we can compare estimators in a systematic fashion, which brings us to two central concepts in machine learning: bias and variance. Assume that the true parameter is θ, and we have an estimate θˆ. Note that an estimator is just a function of the observed (random) data (i.e. we can always write θˆ = θˆ(X)) and so is itself a random variable! We can therefore define:
$$\text{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta, \qquad \text{var}(\hat{\theta}) = E\left(\hat{\theta} - E(\hat{\theta})\right)^2.$$
The lab this week explores these concepts as well, and you are encouraged to do the lab exercise as you complete this question to get a full picture. A short summary of the lab in words:
• bias: tells us how far the expected value of our estimator is from the truth. Recall that an estimator is a function of the data sample we observe. The expectation of an estimator can be thought of in the following manner: imagine that instead of having a single data sample, we have an infinite number of data samples. We compute the same estimator on each sample, and then take an average. This is the expected value of the estimator.
• variance: tells us how variable our estimator is. Again, if we had an infinite number of data samples, we would be able to compute the estimator on each of them and check how much the estimator varies across all samples.
A good estimator should have low bias and low variance.
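The following Monte Carlo sketch mirrors the lab in code (it is illustrative, not required for submission): it repeatedly draws datasets from a N(μ, 1) population, computes the sample-mean estimator on each, and reports the empirical bias and variance across repetitions. The constants true_mu, n and n_repeats are arbitrary choices.

```python
# Monte Carlo illustration of bias and variance (illustrative constants):
# repeatedly sample from N(true_mu, 1), compute the sample mean on each dataset,
# then look at the empirical bias and variance of those estimates.
import numpy as np

rng = np.random.default_rng(42)
true_mu, n, n_repeats = 1.0, 30, 10_000

estimates = np.array([rng.normal(true_mu, 1.0, size=n).mean()
                      for _ in range(n_repeats)])   # one estimate per simulated dataset

empirical_bias = estimates.mean() - true_mu   # approximates E(theta_hat) - theta
empirical_var = estimates.var()               # approximates E(theta_hat - E(theta_hat))^2
print(f"bias: {empirical_bias:.4f}, variance: {empirical_var:.4f}")
```

Comparing these empirical values with your analytical answer to part (a) is a good consistency check.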
(a) Find the bias and variance of μˆMLE where X1,X2,…,Xn ∼ N(μ,1).
(b) Find the bias and variance of pˆMLE where X1, X2, . . . , Xn ∼ Bernoulli(p).
(c) The mean squared error (MSE) is a metric that is widely used in statistics and machine learning. For an estimator θˆ of the true parameter θ, we define its MSE by:
$$\text{MSE}(\hat{\theta}) := E(\hat{\theta} - \theta)^2.$$
Show that the MSE obeys a bias-variance decomposition, i.e. we can write
$$\text{MSE}(\hat{\theta}) = \text{bias}(\hat{\theta})^2 + \text{var}(\hat{\theta}).$$
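Before attempting the algebraic proof, it can be reassuring to check the identity numerically. The sketch below (illustrative; the added constant 0.1 is an arbitrary way to make the estimator biased) compares a Monte Carlo estimate of the MSE with bias squared plus variance.

```python
# Numerical check of MSE = bias^2 + variance (illustrative): use a deliberately
# biased estimator, theta_hat = sample mean + 0.1, on N(true_mu, 1) data.
import numpy as np

rng = np.random.default_rng(7)
true_mu, n, n_repeats = 0.0, 50, 20_000
estimates = np.array([rng.normal(true_mu, 1.0, size=n).mean() + 0.1
                      for _ in range(n_repeats)])

mse = np.mean((estimates - true_mu) ** 2)
bias = estimates.mean() - true_mu
var = estimates.var()
print(f"MSE = {mse:.5f},  bias^2 + var = {bias**2 + var:.5f}")  # should agree closely
```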
Question 3. Probabilistic View of Least-Squares Regression
In the tutorial last week, we viewed the least-squares problem purely from an optimisation point of view. We specified the model we wanted to fit, namely:
$$\hat{y} = w^T x$$
as well as a loss function (MSE), and simply found the weight vector w that minimized the loss. We
proved that when using MSE, the best possible weight vector was given by
$$\hat{w} = (X^T X)^{-1} X^T y.$$
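As a quick recap of last week's result, the sketch below (illustrative; the data are simulated and w_true is arbitrary) computes ŵ on a toy dataset, once via the normal equations and once with a numerically more stable least-squares solver.

```python
# Recomputing last week's least-squares solution on simulated data (illustrative).
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
X = rng.normal(size=(n, d))                      # design matrix
w_true = np.array([1.0, -2.0, 0.5])              # arbitrary "true" weights
y = X @ w_true + rng.normal(scale=0.1, size=n)   # noisy observations

w_hat = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations (X^T X) w = X^T y
w_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # equivalent, numerically more stable
print(w_hat, w_hat_lstsq)
```

Solving the linear system (or calling lstsq) is preferred in practice to forming the inverse of X^T X explicitly, but both implement the formula above.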
In this question, we will explore a different point of view, which we can call the statistical view. At the heart of the statistical view is the data generating process (DGP), which assumes that there is some true underlying function that generates the data, which we call f, but we only have access to noisy observations of f. That is, we observe
y = f(x) + ε, ε is some random noise.
For example, assume your y's represent the daily temperature in Kensington. Any thermometer – even the most expensive – is prone to measurement error, and so what we actually observe is the true temperature (f(x)) plus some random noise ε. Most commonly, we will assume that the noise is normally distributed with zero mean, and variance σ2. Now, consider the (strong) assumption that f(x) is linear, which means that there is some true β∗ such that f(x) = xT β∗. Therefore, we have that
$$y = x^T \beta^* + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
and therefore,
$$y \mid x \sim N(x^T \beta^*, \sigma^2).$$
What this says is that our response (conditional on knowing the feature value x) follows a normal distribution with mean xT β∗ and variance σ2. We can therefore think of our data as a random sample of observations coming from this distribution, which in turn allows us to estimate unknown parameters via maximum likelihood, just as we did in the previous questions.
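To make the data generating process concrete, the following sketch (illustrative; beta_star, sigma and the feature distribution are arbitrary choices) simulates a dataset exactly as assumed in part (a): fixed features, a true linear function, and additive Gaussian noise.

```python
# Simulating from the assumed data generating process (illustrative parameters):
# y_i = x_i^T beta_star + eps_i,  with eps_i ~ N(0, sigma^2) i.i.d.
import numpy as np

rng = np.random.default_rng(2022)
n, d = 200, 2
beta_star = np.array([0.7, -1.3])        # the (normally unknown) true parameter
sigma = 0.5                              # noise standard deviation

X = rng.uniform(-1, 1, size=(n, d))      # feature vectors x_1, ..., x_n
eps = rng.normal(0.0, sigma, size=n)     # measurement noise
y = X @ beta_star + eps                  # noisy observations of f(x) = x^T beta_star
# The pairs (x_i, y_i) form a dataset D of the kind assumed in part (a).
```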
(a) You are given a dataset D = {(x1, y1), . . . , (xn, yn)} and you make the assumption that yi | xi = xTi β∗ + εi for some unknown β∗ and εi ∼ N(0, σ2), where all the εi's are independent of each other. Write down the log-likelihood for this problem as well as the maximum likelihood estimation objective, and solve for the MLE estimator βˆMLE.