COMP90051 StatML 2020S2 Q&A 01 – Solutions
July 30, 2021
Exercise 1: What’s the smallest probability event space including event {1}, for outcome space {1, 2, 3, 4, 5, 6}?
This is a fun thought experiment but, admittedly, not particularly practical: a more practical probability space
would have an event for every possible subset of the outcome space. What are event spaces for again? They define
the allowable inputs to a probability distribution function: if a subset of outcomes is not in your event
space, then it is invalid to ask for the probability of it occurring. Why all the rules for events? Because we
have rules for probabilities! E.g., for two events f, g ∈ F that are disjoint (meaning f ∩ g = ∅), any probability
satisfies Pr(f ∪ g) = Pr(f) + Pr(g); but for that sentence to even make sense, we need the union f ∪ g to be a
valid event, since we want to plug it into the probability!
Now back to the question: we have that the outcome space is Ω = {1, 2, 3, 4, 5, 6}. But we’re asked to
construct the event space F containing subsets of Ω. We’re told {1} ∈ F so far. What else?
We need Ω itself, as the axioms dictate this. We also need all complements:
• Complement of {1} is {2, 3, 4, 5, 6}.
• Complement of Ω is ∅ = {}.
So far we have F = {{}, {1}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}.
Lastly there are unions; fortunately, taking unions of any combination of events so far produces no new
events. Therefore the above F is the minimal event space needed.
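If you want to convince yourself of the closure claims, the following minimal Python sketch (my own illustration, not part of the original question) brute-forces the check that this F is closed under complement and under unions of any combination of events:

    # Brute-force check that F = {{}, {1}, {2,3,4,5,6}, Omega} is closed under
    # complement and union; an illustrative sketch only.
    from itertools import combinations

    omega = frozenset({1, 2, 3, 4, 5, 6})
    F = {frozenset(), frozenset({1}), frozenset({2, 3, 4, 5, 6}), omega}

    # Closure under complement: the complement of every event is also an event.
    assert all(omega - f in F for f in F)

    # Closure under union: the union of any combination of events is an event.
    for r in range(1, len(F) + 1):
        for events in combinations(F, r):
            assert frozenset().union(*events) in F

    print("F is closed under complement and union.")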
Exercise 2: Given vectors x_1, . . . , x_n that have already had their sample mean subtracted (i.e., the sample mean is zero),
show that the sample covariance matrix Σ̂ = (1/(n−1)) ∑_{i=1}^n x_i x_i′ is positive semi-definite.
For Σ̂ to be PSD, we need to show that u′Σ̂u ≥ 0 for any vector u of the same dimension as the data.
We can distribute/collect terms through the sum, and then notice that each term is a squared dot product.
Squaring always gives a non-negative value, and a sum of non-negative values is non-negative.
u′Σ̂u = u′( (1/(n−1)) ∑_{i=1}^n x_i x_i′ ) u
      = (1/(n−1)) ∑_{i=1}^n u′x_i x_i′ u
      = (1/(n−1)) ∑_{i=1}^n (u′x_i)²
      ≥ 0 .
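As a quick numerical sanity check (an illustrative sketch with synthetic data, not required by the exercise), we can centre some random vectors, form Σ̂, and confirm that u′Σ̂u is non-negative for randomly drawn u, or equivalently that the eigenvalues of the symmetric matrix Σ̂ are non-negative:

    # Sketch: numerically confirm the sample covariance of centred data is PSD.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))      # n = 100 observations in 3 dimensions
    X = X - X.mean(axis=0)             # subtract the sample mean (now zero)

    n = X.shape[0]
    Sigma_hat = (X.T @ X) / (n - 1)    # same as (1/(n-1)) * sum_i x_i x_i'

    # u' Sigma_hat u should be non-negative for any u (up to float error).
    for _ in range(1000):
        u = rng.normal(size=3)
        assert u @ Sigma_hat @ u >= -1e-12

    # Equivalently, all eigenvalues of Sigma_hat are non-negative.
    print(np.linalg.eigvalsh(Sigma_hat))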
Exercise 3: Derive an MLE for a Gaussian’s unknown variance, with known mean.
Let’s do the scalar data case, noting the vector case doesn’t change significantly. Where to start? We’re
asked to derive an MLE; that means we’re being frequentist, and there must be a likelihood function modelling
the data. Let’s write the unknown parameter σ² as θ to highlight that it is the quantity we want to estimate.
p(x_i) = (1/√(2πθ)) exp( −(x_i − µ)² / (2θ) )
p(x_1, . . . , x_n) = ∏_{i=1}^n p(x_i) ,
where the first line is the likelihood of a single observation, and the second uses an implicit assumption that
the data observations are i.i.d. to write the joint likelihood of all the data.
Now the MLE principle says that our estimator θ̂ should be found by maximising this expression,
θ̂ ∈ arg max_{θ>0} p(x_1, . . . , x_n) .
To emphasise a point: even though the likelihood involves lots of symbols (data, mean, variance) the only
‘decision variable’ for finding the MLE—the only unknown parameter—is the variance θ.
It’s much more convenient to work with the log-likelihood, since it turns the product into a sum:
L(θ) = −(n/2) log(2π) − (n/2) log(θ) − (1/(2θ)) ∑_{i=1}^n (x_i − µ)² .
The necessary condition that θ̂ is an MLE – that it maximises L(θ) – is that the (partial) derivative of the
log-likelihood with respect to this parameter is zero:
∇_θ L = −n/(2θ̂) + (1/(2θ̂²)) ∑_{i=1}^n (x_i − µ)²
      = (1/(2θ̂)) ( −n + (1/θ̂) ∑_{i=1}^n (x_i − µ)² )
      = 0 .
To solve for θ = θ̂, we can rule out θ̂ = 0, since variances should be positive (they can be zero only in the
degenerate case, and can never be negative). We may therefore multiply through by 2θ̂ to obtain
−n + (1/θ̂) ∑_{i=1}^n (x_i − µ)² = 0 .
Or in other words, θ̂ = (1/n) ∑_{i=1}^n (x_i − µ)². Now, we haven’t definitively proved that this is a maximiser of the
log-likelihood; to do that we’d check the sign of the second derivative at this critical/stationary point (to rule
out a minimiser or saddle point), but we won’t typically be that careful in this subject. Indeed, as
we’ll soon see, we can hardly ever go through this whole process of exact optimisation anyway!
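For readers who would like to double-check the calculus, here is a small sympy sketch (my own addition, with the data kept symbolic) that re-derives the stationary point and inspects the second derivative:

    # Sketch: re-derive the variance MLE symbolically with sympy.
    import sympy as sp

    theta = sp.Symbol('theta', positive=True)
    mu = sp.Symbol('mu', real=True)
    n = sp.Symbol('n', positive=True, integer=True)
    i = sp.Symbol('i', positive=True, integer=True)
    x = sp.IndexedBase('x')

    # Log-likelihood L(theta) as derived above; the sum is left unevaluated.
    S = sp.Sum((x[i] - mu) ** 2, (i, 1, n))
    L = -n / 2 * sp.log(2 * sp.pi) - n / 2 * sp.log(theta) - S / (2 * theta)

    # Stationary point: solving dL/dtheta = 0 recovers theta = (1/n) * S.
    theta_hat = sp.solve(sp.Eq(sp.diff(L, theta), 0), theta)[0]
    print(theta_hat)

    # The second derivative at theta_hat should come out negative,
    # confirming the stationary point is indeed a maximiser.
    print(sp.simplify(sp.diff(L, theta, 2).subs(theta, theta_hat)))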
(Aside: With µ known, this estimator is in fact unbiased, since E[(x_i − µ)²] = σ² exactly. The familiar bias,
and the n − 1 denominator used for the sample covariance in Exercise 2, arises when the unknown mean is replaced
by the sample mean x̄; the plug-in estimator with denominator n then underestimates the variance by a factor of (n − 1)/n.)
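A quick Monte Carlo sketch (my own illustration, with made-up values µ = 3, σ² = 4, n = 10) makes the distinction concrete:

    # Sketch: compare the known-mean estimator with the plug-in (sample-mean) ones.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2, n, reps = 3.0, 4.0, 10, 200_000
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

    known_mean = np.mean((x - mu) ** 2, axis=1)   # (1/n) * sum (x_i - mu)^2
    plug_in = np.var(x, axis=1)                   # (1/n) * sum (x_i - xbar)^2
    bessel = np.var(x, axis=1, ddof=1)            # (1/(n-1)) * sum (x_i - xbar)^2

    print(known_mean.mean())   # close to 4.0: unbiased when mu is known
    print(plug_in.mean())      # close to 3.6: biased low by the factor (n-1)/n
    print(bessel.mean())       # close to 4.0: the n-1 denominator removes the bias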