CS5487 Problem Set 2
Parameter Estimation – Maximum Likelihood
Antoni Chan
Department of Computer Science
City University of Hong Kong

Maximum Likelihood Estimation

Problem 2.1 The Poisson distribution and flying bombs
During World War II, the Germans fired V-1 and V-2 flying bombs (long-range missiles) at London. Some areas were hit more than others, and the British military wanted to know whether the multiple hits were due to the Germans targeting particular areas, or purely to chance.
To analyze this problem, the British statistician R.D. Clarke divided a 144 square kilometer area of London into a regular grid, forming 576 cells. Under the assumption that the flying bombs fell randomly, the chance of hitting any cell would be constant across all cells. Hence, the hit counts of the cells are i.i.d. samples from a common random variable x.
A natural distribution for modeling the number of events (bomb hits) occurring within a fixed region or time period is the Poisson distribution, given by
    p(x = k|λ) = (1/k!) e^{−λ} λ^k,                                  (2.1)
where k ∈ {0, 1, 2, 3, · · · } is a counting number. The parameter λ is the average number of events, and the mean and variance are both equal to λ: E[x] = var(x) = λ.
(a) Derive the maximum-likelihood estimate of λ, given a set of i.i.d. samples {k1, · · · , kN}.
(b) Show that the ML estimator is unbiased, and that the estimator variance is λ/N.
The following table lists the number of cells that were observed to have k hits (this is Clarke’s actual data!).
number of hits (k)             0    1    2   3   4   5 and over
number of cells with k hits    229  211  93  35  7   1
(c) Using the above data, calculate the ML estimate λ̂ for the Poisson distribution.
(d) Use the estimate λ̂ to predict the expected number of cells with k hits, for k ∈ {0, 1, 2, 3, 4, 5+}.
Compare the expected counts with the observed data. What conclusions can you draw?
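For parts (c) and (d), a minimal numerical sketch is given below (Python with NumPy/SciPy assumed). It treats the single "5 and over" cell as exactly 5 hits, which is an assumption, since the table does not give the exact count.

# Sketch for Problem 2.1(c)-(d): fit a Poisson by numerically maximizing the
# log-likelihood over Clarke's data, then compare expected and observed counts.
# Assumes the single "5 and over" cell had exactly 5 hits (the table does not say).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

hits = np.array([0, 1, 2, 3, 4, 5])           # number of hits k
cells = np.array([229, 211, 93, 35, 7, 1])    # cells observed with k hits
samples = np.repeat(hits, cells)              # 576 i.i.d. hit counts

# Negative log-likelihood of the Poisson model over all 576 cells.
def neg_log_lik(lam):
    return -np.sum(poisson.logpmf(samples, lam))

lam_hat = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded").x
print("ML estimate:", lam_hat)

# Expected number of cells with k hits (lump k >= 5 into the last bin).
expected = 576 * np.append(poisson.pmf(hits[:-1], lam_hat),
                           1 - poisson.cdf(4, lam_hat))
print(np.round(expected, 1), "vs observed", cells)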
………
Problem 2.2 MLE for the exponential density
Let x have an exponential density,
    p(x|θ) = { θ e^{−θx},   x ≥ 0                                    (2.2)
             { 0,           otherwise.
(a) Plot p(x|θ) versus x for θ = 1. Plot p(x|θ) versus θ for x = 2.
(b) For a set of samples {x1, · · · , xN} drawn i.i.d. from p(x|θ), show that the maximum-likelihood estimate for θ is
    θ̂ = 1 / ( (1/N) Σ_{i=1}^N x_i ).                                 (2.3)
(c) Now suppose that we reparameterize our exponential density using λ = 1/θ,
    p(x|λ) = { (1/λ) e^{−x/λ},   x ≥ 0                               (2.4)
             { 0,                otherwise.
Derive the maximum-likelihood estimate of λ. How is λ̂ related to θ̂?
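A small plotting sketch for part (a), assuming Python with NumPy and matplotlib:

# Sketch for Problem 2.2(a): plot the exponential density as a function of x
# (with θ = 1 fixed) and as a function of θ (with x = 2 fixed).
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 200)
theta = np.linspace(0.01, 3, 200)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, 1.0 * np.exp(-1.0 * x))          # p(x | θ = 1) as a density in x
ax1.set_xlabel("x"); ax1.set_title("p(x | θ = 1)")
ax2.plot(theta, theta * np.exp(-theta * 2))  # p(x = 2 | θ) as a function of θ
ax2.set_xlabel("θ"); ax2.set_title("p(x = 2 | θ)")
plt.show()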
………
Problem 2.3 Invariance property of MLE
Prove the invariance property of maximum-likelihood estimators – let θ be the parameter of a probability distribution p(x|θ), and λ = g(θ) be a reparameterization p(x|λ), where g(θ) is a differentiable function. Show that if θ̂ is the ML estimate of θ, then λ̂ = g(θ̂) is the ML estimate of λ.
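As a numerical illustration (not a proof), one can check the claim on the exponential density of Problem 2.2, where λ = 1/θ. A sketch assuming Python with NumPy/SciPy:

# Numerical sanity check of the invariance property, using the exponential
# density from Problem 2.2 with the reparameterization λ = 1/θ.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # i.i.d. samples, true θ = 0.5

nll_theta = lambda t: -(len(x) * np.log(t) - t * np.sum(x))
nll_lam = lambda lam: -(-len(x) * np.log(lam) - np.sum(x) / lam)

theta_hat = minimize_scalar(nll_theta, bounds=(1e-3, 10), method="bounded").x
lam_hat = minimize_scalar(nll_lam, bounds=(1e-3, 10), method="bounded").x
print(theta_hat, lam_hat, 1.0 / theta_hat)   # λ̂ should match 1/θ̂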
………
Problem 2.4 MLE for a Laplace distribution
Let x have a Laplace distribution (also called a double exponential)
    p(x|μ, λ) = (1/(2λ)) e^{−|x−μ|/λ}                                (2.5)
              = { (1/(2λ)) e^{(x−μ)/λ},    x < μ
                { (1/(2λ)) e^{−(x−μ)/λ},   x ≥ μ
(a) Derive the maximum-likelihood estimates for μ and λ, given a dataset of i.i.d. samples {x1, · · · , xn}.
Hint: When finding the estimate for μ, the following property might be useful: for any two numbers a, b ∈ R with a ≤ b, we have |a − μ| + |b − μ| ≥ b − a, with equality when a ≤ μ ≤ b. Furthermore, without loss of generality, we can assume that the samples are ordered such that x1 ≤ x2 ≤ · · · ≤ xn.
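A small numerical check for part (a), assuming Python with NumPy/SciPy. It minimizes Σ|x_i − μ| directly and compares the result with the sample median, which the hint points towards:

# Sketch for Problem 2.4(a): numerically minimize the Laplace negative
# log-likelihood in μ and compare with the sample median. Illustrative only.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.laplace(loc=3.0, scale=1.5, size=501)

# For fixed λ, maximizing the likelihood in μ amounts to minimizing Σ|x_i − μ|.
obj = lambda mu: np.sum(np.abs(x - mu))
mu_hat = minimize_scalar(obj, bounds=(x.min(), x.max()), method="bounded").x
print(mu_hat, np.median(x))   # the two should (nearly) coincide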
.........
Problem 2.5 MLE for a univariate Gaussian
Derive the ML estimate for a univariate Gaussian for samples {x1, · · · , xN}. In particular,
(a) Show that the ML estimate of the mean is μ̂ = (1/N) Σ_{i=1}^N x_i.
(b) Show that the ML estimate of the variance is σ̂² = (1/N) Σ_{i=1}^N (x_i − μ̂)².
.........
Problem 2.6 MLE for a multivariate Gaussian
In this problem you will derive the ML estimate for a multivariate Gaussian. Given samples {x1,··· ,xN},
(a) Derive the ML estimate of the mean μ of a multivariate Gaussian.
(b) Derive the ML estimate of the covariance Σ.
You may find the following vector and matrix derivatives helpful:
• ∂/∂x (aᵀx) = a, for vectors x, a ∈ R^d.
• ∂/∂x (xᵀAx) = Ax + Aᵀx, for vector x ∈ R^d and matrix A ∈ R^{d×d}.
• ∂/∂X log|X| = X^{−T}, for a square matrix X.
• ∂/∂X tr(AX^{−1}) = ∂/∂X tr(X^{−1}A) = −(X^{−T} Aᵀ X^{−T}), for matrices A, X.
Hint: remember Σ is symmetric!
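For reference, a numerical sketch in Python/NumPy. It assumes the ML estimates turn out to be the sample mean and the 1/N sample covariance, which is exactly what parts (a) and (b) ask you to verify:

# Sketch for Problem 2.6: compare the (to-be-derived) ML estimates of a
# multivariate Gaussian with a direct numerical check. Assumes the answers
# are the sample mean and the (biased, 1/N) sample covariance.
import numpy as np

rng = np.random.default_rng(2)
N, d = 2000, 3
true_mu = np.array([1.0, -2.0, 0.5])
true_Sigma = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=N)   # N x d samples

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N   # note 1/N, not 1/(N-1)
print(mu_hat)      # compare with true_mu
print(Sigma_hat)   # compare with true_Sigma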
.........
Problem 2.7 MLE for a multinomial distribution
In this problem we will consider the ML estimate of the parameters of a multinomial distribution. Consider a discrete random variable x such that p(x = j) = π_j, j ∈ {1, . . . , K}. Suppose we draw N independent observations from x and form a random vector c = [c1, . . . , cK]ᵀ, where cj is the number of times that the observed value is j (i.e., c is the histogram of the sample of observations). Then, c has a multinomial distribution
    p(c1, . . . , cK) = ( N! / Π_{i=1}^K c_i! ) Π_{j=1}^K π_j^{c_j}.   (2.6)
(a) Derive the ML estimator for the parameters π_j, j = 1, . . . , K. (Hint: notice that these parameters are probabilities, which makes this an optimization problem with a constraint. If you know about Lagrange multipliers feel free to use them. Otherwise, note that minimizing a function f(a, b) under the constraint a + b = 1 is the same as minimizing the function f(a, 1 − a).)
(b) Is the estimator derived in (a) unbiased? What is its variance? Is this a good estimator? Why?
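A small simulation sketch (Python/NumPy) for sanity-checking (b). It assumes the estimator from (a) turns out to be the empirical frequency c_j/N; treat that as a hypothesis to confirm with your own derivation:

# Sketch for Problem 2.7(b): empirically examine the bias and variance of the
# estimator from part (a). Assumes (a) yields the empirical frequency c_j / N.
import numpy as np

rng = np.random.default_rng(3)
pi_true = np.array([0.5, 0.3, 0.2])
N, trials = 50, 20000

c = rng.multinomial(N, pi_true, size=trials)   # trials x K count vectors
pi_hat = c / N
print("mean of estimates:", pi_hat.mean(axis=0))   # compare with pi_true
print("empirical variance:", pi_hat.var(axis=0))   # compare with your formula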
.........
Regression
Problem 2.8 Least-squares regression and MLE
In this problem we will consider the issue of linear regression and the connections between maximum likelihood and least squares solutions. Consider the polynomial function of x ∈ R,

    f(x, θ) = Σ_{k=0}^K x^k θ_k = φ(x)ᵀθ,                            (2.7)

where we define the feature transformation φ(x) and the parameter vector θ (both of dimension D = K + 1) as

    φ(x) = [1, x, x², · · · , x^K]ᵀ ∈ R^D,    θ = [θ_0, · · · , θ_K]ᵀ ∈ R^D.    (2.8)

Given an input x, instead of observing the actual function value f(x, θ), we observe a noisy version y,

    y = f(x, θ) + ε,                                                 (2.9)

where ε is a Gaussian random variable of zero mean and variance σ². Our goal is to obtain the best estimate of the function given i.i.d. samples D = {(x1, y1), . . . , (xn, yn)}.

(a) Formulate the problem as one of least squares, i.e. define

    y = [y1, · · · , yn]ᵀ,    Φ = [φ(x1), · · · , φ(xn)] = [ 1     · · ·  1
                                                             x1    · · ·  xn
                                                             ⋮             ⋮
                                                             x1^K  · · ·  xn^K ],    (2.10)

and find the value of θ that minimizes the sum-squared-error,

    Σ_{i=1}^n (y_i − φ(x_i)ᵀθ)² = ‖y − Φᵀθ‖².                        (2.11)
(b) Formulate the problem as one of ML estimation, i.e. write down the likelihood function p(y|x, θ), and compute the ML estimate, i.e. the value of θ that maximizes p(y1, · · · , yn|x1, · · · , xn, θ). Show that this is equivalent to (a).
Hint: the vector derivatives listed in Problem 2.6 might be helpful.
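A numerical sketch of part (a), assuming Python/NumPy. It builds the feature matrix Φ and solves the least-squares problem with a library routine; the problem itself asks for the closed-form derivation:

# Sketch for Problem 2.8(a): build the polynomial feature matrix Φ and solve
# the least-squares problem numerically (for illustration only).
import numpy as np

rng = np.random.default_rng(4)
K = 3                                    # polynomial order, D = K + 1
theta_true = np.array([1.0, -2.0, 0.5, 0.1])
x = rng.uniform(-2, 2, size=50)
y = np.polynomial.polynomial.polyval(x, theta_true) + rng.normal(0, 0.3, size=50)

Phi = np.vander(x, K + 1, increasing=True).T      # D x n, columns are φ(x_i)
theta_hat, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)
print(theta_hat)                                  # compare with theta_true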
.........
Problem 2.9 Weighted least-squares and MLE
The advantage of the statistical formulation of least-squares regression is that it makes the assumptions explicit. We will now challenge some of these assumptions.
Assume that instead of a fixed variance σ² we now have a variance that depends on the sample point, i.e.
    y_i = f(x_i, θ) + ε_i,                                           (2.12)
where ε_i ∼ N(0, σ_i²). This means that our sample is independent but no longer identically distributed. It also means that we have different degrees of confidence in the different measurements (y_i, x_i).
(a) Formulate the problem as one of ML estimation, and compute the ML estimate, i.e., the value of θ that maximizes p(y1, · · · , yn|x1, · · · , xn).
(b) Consider the weighted least squares problem where the goal is to minimize the weighted sum-squared error
    Σ_{i=1}^n w_i (y_i − φ(x_i)ᵀθ)² = (y − Φᵀθ)ᵀ W (y − Φᵀθ),        (2.13)
where W is a diagonal matrix with diagonal entries wi. Compute the optimal θ for this situation. What is the equivalent maximum likelihood problem? Rewrite the model (2.9), making explicit all the assumptions that lead to the new problem. What is the statistical interpretation of W and wi? What is an alternative interpretation of the weight wi (e.g., if wi is restricted to be a positive integer)?
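A numerical sketch for part (b), assuming Python/NumPy. The closed form θ̂ = (ΦWΦᵀ)^{−1}ΦWy used below is the standard weighted least-squares solution; verify it against your own derivation. The choice of weights as inverse variances anticipates the statistical interpretation asked about:

# Sketch for Problem 2.9(b): weighted least squares solved numerically,
# using θ̂ = (Φ W Φᵀ)^{-1} Φ W y as a working assumption to be checked.
import numpy as np

rng = np.random.default_rng(5)
n = 80
Phi = np.vstack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # D x n
theta_true = np.array([0.5, 2.0, -1.0])
sigma_i = rng.uniform(0.1, 2.0, size=n)               # per-sample noise levels
y = Phi.T @ theta_true + rng.normal(0, sigma_i)

W = np.diag(1.0 / sigma_i**2)                         # weights ~ inverse variance
theta_hat = np.linalg.solve(Phi @ W @ Phi.T, Phi @ W @ y)
print(theta_hat)                                      # compare with theta_true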
.........
Problem 2.10 Robust regression and MLE
The L2 norm is known to be prone to large estimation error if there are outliers in the training sample. These are training examples (yi, xi) for which, due to measurement errors or other extraneous causes, the residual error |y_i − φ(x_i)ᵀθ| is much larger than for the remaining examples (the inliers). In fact, it is known that a single outlier can completely derail the least squares solution, a highly undesirable behavior. It is also well known that other norms lead to much more robust estimators. One such distance metric is the L1 norm
    L1 = Σ_{i=1}^n |y_i − φ(x_i)ᵀθ| = ‖y − Φᵀθ‖_1.                   (2.14)
(a) In the maximum likelihood framework, what is the statistical assumption that leads to the L1 norm? Once again, rewrite the model (2.9), making explicit all the assumptions that lead to the new problem. Can you justify why this alternative formulation is more robust? In particular, provide a justification for i) why the L1 norm is more robust to outliers, and ii) why the associated statistical model copes better with them.
Unlike the previous least-squares formulations, the minimization of (2.14) does not have a closed-form solution. Instead, we must solve it numerically when we are given the data. To facilitate this, we need to cast our L1 minimization problem into the standard form of a linear program (an optimization problem with a linear objective function and linear inequality constraints). This will allow us to use a standard optimization toolbox to perform the minimization numerically (e.g., linprog in MATLAB). Our original optimization problem is
    min_θ Σ_{i=1}^n |y_i − φ(x_i)ᵀθ|.                                (2.15)
(b) Introduce an auxiliary variable t ∈ R^n. Verify that the optimization problem in (2.15) is equivalent to
    min_{θ,t}  Σ_{i=1}^n t_i
    s.t.       |y_i − φ(x_i)ᵀθ| ≤ t_i.                               (2.16)
What is the role of each auxiliary variable ti?
(c) Define x = [θᵀ, tᵀ]ᵀ ∈ R^{D+n}. Show that (2.16) can be turned into a standard linear program,

    min_x  fᵀx
    s.t.   Ax ≤ b,                                                   (2.17)

where

    f = [ 0_D ] ∈ R^{D+n},    A = [  Φᵀ  −I_n ] ∈ R^{2n×(D+n)},    b = [  y ] ∈ R^{2n},    (2.18)
        [ 1_n ]                   [ −Φᵀ  −I_n ]                        [ −y ]
and 0n is the vector of n zeros, 1n is the vector of n ones, and In is the n × n identity matrix. We now have a form of our original L1 optimization problem that can be solved numerically using linprog in the MATLAB optimization toolbox. Note: we could probably get faster performance if we developed a custom solver for our original problem in (2.15).
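For reference, a sketch of the same construction using scipy.optimize.linprog instead of MATLAB's linprog (Python with NumPy/SciPy assumed); f, A, and b follow (2.18):

# Sketch for Problem 2.10(c): solve the L1 regression problem as a linear
# program. The sign convention for A and b matches the reconstruction above.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
n, D = 60, 2
Phi = np.vstack([np.ones(n), rng.normal(size=n)])      # D x n feature matrix
theta_true = np.array([1.0, 3.0])
y = Phi.T @ theta_true + rng.standard_t(df=2, size=n)  # heavy-tailed noise / outliers

f = np.concatenate([np.zeros(D), np.ones(n)])
A = np.block([[ Phi.T, -np.eye(n)],
              [-Phi.T, -np.eye(n)]])
b = np.concatenate([y, -y])
res = linprog(f, A_ub=A, b_ub=b, bounds=[(None, None)] * (D + n))
theta_hat = res.x[:D]
print(theta_hat)                                       # compare with theta_true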
.........
Estimator Bias and Variance
Problem 2.11 Simple estimator
Let D = {x1, · · · , xN} be a set of i.i.d. samples from a Gaussian r.v. x. Suppose we define an estimator of the mean as the first point in the set,

    μ̂ = x1.                                                         (2.19)

(a) Show that the estimator μ̂ is unbiased.
(b) Derive the variance of the estimator.
(c) Why is this an undesirable estimator?

.........
Problem 2.12 Bias of the ML estimator for variance of a Gaussian
The ML estimator of the variance of a Gaussian is

    σ̂² = (1/N) Σ_{i=1}^N (x_i − μ̂)²,                                (2.20)
where μ̂ is the ML estimator of the mean.
(a) Suppose that the true distribution has mean μ and variance σ², show that
    E_{x1,···,xN}[σ̂²] = (1 − 1/N) σ²,                                (2.21)
and hence σ̂² is a biased estimator.
(b) Using (a), propose an unbiased estimator of the variance.
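A quick simulation (Python/NumPy) that illustrates the bias in (a): the average of σ̂² over many datasets should come out close to (1 − 1/N)σ²:

# Sketch for Problem 2.12(a): simulate the bias of the ML variance estimator.
import numpy as np

rng = np.random.default_rng(7)
N, trials, sigma2 = 5, 100000, 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
sigma2_hat = x.var(axis=1)          # NumPy's var uses 1/N by default (ddof=0)
print(sigma2_hat.mean(), (1 - 1/N) * sigma2)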
.........
Problem 2.13 Bias-Variance Tradeoff
In general, there is always a tradeoff between bias and variance. We can reduce the bias only by increasing the variance, and vice versa. In this problem, we will consider an example of this.
Suppose we want a faster decay of the variance of the mean estimator for a Gaussian. We could scale the original mean estimator to form a new estimator μ̃,
    μ̃ = (α/N) Σ_{i=1}^N x_i.                                        (2.22)
(a) Show that the mean value of the estimator is E[μ̃] = αμ, and hence the bias and variance are
    Bias(μ̃) = (1 − α)μ,      Var(μ̃) = α²σ²/N.                       (2.23)
Hence, selecting α < 1 will decrease the variance, but it also causes μ̃ to be biased!

.........
Problem 2.14 Weak law of large numbers
The weak law of large numbers states that the sample mean of a large number of i.i.d. random variables is very close to the true mean, with high probability. Let’s prove it!
Let {x1, · · · , xn} be a set of n i.i.d. random variables with mean μ_x and variance σ_x². Define the sample mean as
    M_n = (1/n)(x_1 + · · · + x_n).                                  (2.24)
(a) Show that the mean and variance of the sample mean are
    E[M_n] = μ_x,      var(M_n) = σ_x²/n.                            (2.25)
(b) The Chebyshev inequality states that for a random variable x with mean μ and variance σ²,

    P(|x − μ| ≥ c) ≤ σ²/c²,   for all c > 0.                         (2.26)

In other words, if a r.v. has small variance, then the probability it will take a value far from the mean is small.
Use the Chebyshev inequality and your result above to show that, for every ε > 0,
    P(|M_n − μ_x| ≥ ε) → 0,   as n → ∞.                              (2.27)
In other words, there is high probability that M_n will fall within the interval [μ_x − ε, μ_x + ε], for large n. That is, the bulk of the distribution of M_n is concentrated around μ_x, and converges to μ_x as n → ∞.
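A quick simulation (Python/NumPy) illustrating the statement (2.27) for samples from a uniform distribution on [0, 2] (an arbitrary choice with mean 1):

# Sketch for Problem 2.14: empirically watch P(|M_n - μ| >= ε) shrink as n grows.
import numpy as np

rng = np.random.default_rng(8)
mu, eps, trials = 1.0, 0.1, 2000
for n in [10, 100, 1000, 10000]:
    Mn = rng.uniform(0, 2, size=(trials, n)).mean(axis=1)
    print(n, np.mean(np.abs(Mn - mu) >= eps))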
………