FOUNDATIONS OF ML: PROBABILITY EXERCISES
1. What is this exercise sheet for?
In the lectures and labs you will have encountered the following:
(1) discrete probability distributions such as the Binomial and Multinomial;
(2) continuous probability distributions like the exponential and the Gaussian;
(3) Bayes' rule, where you relate the probability of A when conditioning on B to the probability of B when conditioning on A.
You will get to work out what the expected value of a random variable is and how to quantify the variability in its observed values.

2. Exercises
1) For any random variable X, the mean is E[X] = ∑_x x P(X = x). For any real number a, what are (i) E(X − a) and (ii) E(aX) in terms of EX?

Solution. I will combine (i) and (ii) into a single solution by computing E[aX + b] for constants a and b:

E[aX + b] = ∑_x (ax + b) P(X = x)
          = a ∑_x x P(X = x) + b ∑_x P(X = x)
          = a E[X] + b · 1.

In particular, (i) E(X − a) = E[X] − a and (ii) E(aX) = a E[X].
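As a quick numerical sanity check (not part of the original exercise), the following Python sketch computes E[aX + b] directly from an arbitrary illustrative discrete distribution and compares it with a E[X] + b:

import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])      # values taken by X (illustrative choice)
ps = np.array([0.1, 0.2, 0.3, 0.4])      # P(X = x), sums to 1
a, b = 2.5, -1.0                         # arbitrary constants

EX = np.sum(xs * ps)                     # E[X] = sum_x x P(X = x)
E_aXb = np.sum((a * xs + b) * ps)        # E[aX + b] computed directly
print(E_aXb, a * EX + b)                 # the two numbers agree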
2) For any random variable X, the variance is the expectation of the square of the deviations from the mean. Show that

Var(X) ≜ E[(X − E[X])²] = E[X²] − (E[X])².

Express (i) Var(X − a) and (ii) Var(aX) in terms of Var(X).

Solution. We will use the idea that while X is a random variable, E[X] is a number.

E[(X − E[X])²] = E[X² − 2X E[X] + (E[X])²]
              = E[X²] − 2 E[X] E[X] + (E[X])², as 2 E[X] is a constant
              = E[X²] − (E[X])².

To express Var(aX + b) in terms of Var(X):

Var(aX + b) = E[(aX + b)²] − (E[aX + b])²
            = a² E[X²] + 2ab E[X] + b² − (a²(E[X])² + 2ab E[X] + b²)
            = a² (E[X²] − (E[X])²)
            = a² Var(X).

In particular, (i) Var(X − a) = Var(X) and (ii) Var(aX) = a² Var(X).
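A similar sketch (again with an arbitrary illustrative distribution, not taken from the text) checks Var(aX + b) = a² Var(X) numerically:

import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ps = np.array([0.1, 0.2, 0.3, 0.4])
a, b = 2.5, -1.0

EX = np.sum(xs * ps)
VarX = np.sum((xs - EX) ** 2 * ps)               # E[(X - E[X])^2]
ys = a * xs + b
VarY = np.sum((ys - np.sum(ys * ps)) ** 2 * ps)  # Var(aX + b) computed directly
print(VarY, a ** 2 * VarX)                       # both equal a^2 Var(X); b plays no role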
3) For two random variables X, Y, show that

E(X + Y) = EX + EY.

Solution. Since X, Y are independent, P(X, Y) = P(X)P(Y); more explicitly, P(X = x, Y = y) = P(X = x)P(Y = y). (In fact E(X + Y) = EX + EY holds even without independence, but the factorised joint keeps the calculation simple.) Thus, to show that expectation values (means) add,

E[X + Y] = ∑_x ∑_y P(X = x, Y = y)(x + y)
         = ∑_x ∑_y P(X = x, Y = y) x + ∑_x ∑_y P(X = x, Y = y) y
         = ∑_x ∑_y P(X = x) P(Y = y) x + ∑_x ∑_y P(X = x) P(Y = y) y
         = ∑_x P(X = x) x ∑_y P(Y = y) + ∑_x P(X = x) ∑_y P(Y = y) y
         = E[X] ∑_y P(Y = y) + E[Y] ∑_x P(X = x)
         = E[X] + E[Y],

where the passage from one equation to its successor is justified for the following reasons: opening out the parenthesis, independence of X, Y, noting the range of the summed variables, definition of means and, finally, normalisation of probabilities (sum to one).
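The double sum can also be checked numerically. The sketch below builds an independent joint P(X = x, Y = y) = P(X = x)P(Y = y) from two arbitrary illustrative marginals and verifies that the double sum for E[X + Y] equals E[X] + E[Y]:

import numpy as np

xs, px = np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3])
ys, py = np.array([-1.0, 1.0]), np.array([0.6, 0.4])

joint = np.outer(px, py)                 # P(X = x, Y = y), shape (3, 2)
sum_xy = xs[:, None] + ys[None, :]       # table of x + y values
E_sum = np.sum(joint * sum_xy)           # sum_x sum_y P(x, y)(x + y)
print(E_sum, np.sum(xs * px) + np.sum(ys * py))   # the two agree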
4) Introduce centred random variables by subtracting the means from variables X and Y, i.e., X → X̃ = X − EX and Y → Ỹ = Y − EY. If X and Y are statistically independent,

(a) show that their covariance

Cov(X, Y) = E[(X − EX)(Y − EY)] = E[X̃Ỹ] = 0.

Hint: Write out the definition of E[Z] for a random variable Z in terms of the probability-weighted sums (or integrals) over the values of z ∈ Z, where z = x̃ỹ. Use p(X = x, Y = y) = p(X = x)p(Y = y).

Solution. Write out the definition of Cov[X, Y]:

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = ∑_x ∑_y P(X = x, Y = y)(x − EX)(y − EY),

and substitute P(X = x, Y = y) = P(X = x)P(Y = y):

Cov[X, Y] = ∑_x ∑_y P(X = x)P(Y = y)(xy − x EY − y EX + EX EY)
          = ∑_x ∑_y P(X = x)P(Y = y) xy − ∑_x P(X = x) x ∑_y P(Y = y) EY
            − ∑_x P(X = x) EX ∑_y P(Y = y) y + ∑_x ∑_y P(X = x)P(Y = y) EX EY    (a)
          = ∑_x P(X = x) x ∑_y P(Y = y) y − EX EY − EX EY + EX EY                (b)
          = EX EY − EX EY
          = 0.

In going from (a) to (b), I have noted that EX and EY are numbers (not random variables) and I have used (for any constant a) ∑_i a P(i) = a ∑_i P(i) = a, since the probability distributions for i = x, y are normalised.

(b) Using this result, show

Var(X + Y) = E[(X̃ + Ỹ)²] = E[X̃²] + E[Ỹ²]

(which is reminiscent of Pythagoras' theorem).

(c) Hence, for independent random variables X, Y,

Var(X + Y) = Var(X) + Var(Y).

Solution. Expanding the square of the sum of centred variables,

Var[X + Y] = E[((X − E[X]) + (Y − E[Y]))²]
           = E[(X − E[X])² + (Y − E[Y])² + 2(X − E[X])(Y − E[Y])]
           = Var[X] + Var[Y] + 2 Cov[X, Y],

and since Cov[X, Y] = 0 for independent X and Y by part (a), Var[X + Y] = Var[X] + Var[Y].
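A Monte Carlo sketch of the same facts (the sample sizes and distributions below are arbitrary illustrative choices): for independently drawn samples the empirical covariance is close to zero and Var(X + Y) is close to Var(X) + Var(Y).

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)       # X with some mean and spread
y = rng.exponential(3.0, size=100_000)       # Y drawn independently of X

print(np.cov(x, y)[0, 1])                    # ~ 0, up to sampling noise
print(np.var(x + y), np.var(x) + np.var(y))  # the two values are close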
5) In the first lecture on Probability and Statistics (Week 4 and Intro_Probability_1) it was noted that the slope of a straight line fit (linear regression) can be re-written as

w_1 = Cov(X, Y)/Var(X) = σ_XY/σ_XX.

Verify that this is true. Also, show that the intercept w_0 of a straight line fit is given as

w_0 = E[Y] − (Cov(X, Y)/Var(X)) E[X].
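One way to convince yourself numerically (a sketch on synthetic data, not the lecture's dataset): the least-squares slope and intercept returned by np.polyfit agree with Cov(X, Y)/Var(X) and E[Y] − w_1 E[X].

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=500)   # noisy straight line (illustrative)

w1, w0 = np.polyfit(x, y, deg=1)                 # least-squares fit y ~ w1 x + w0
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
print(w1, slope)                                 # slopes agree
print(w0, intercept)                             # intercepts agree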
6) (Bishop, Ex 2.1) Verify that the Bernoulli distribution

Bern(x|θ) = θ^x (1 − θ)^(1−x),

for x ∈ {0, 1} with p(X = 1) = θ ∈ [0, 1], satisfies (a) the normalisation condition

∑_{x∈{0,1}} Bern(X = x|θ) = 1;

(b) has mean E[X] = θ and (c) variance Var[X] = θ(1 − θ).

Solution. For each part, the explicit use of the values 0 and 1 taken by the binary variable simplifies all expressions.

(a) Normalisation:

∑_{x∈{0,1}} Bern(X = x|θ) = Bern(X = 0|θ) + Bern(X = 1|θ)
                          = θ^0 (1 − θ)^(1−0) + θ^1 (1 − θ)^(1−1)
                          = (1 − θ) + θ = 1.

(b) Mean:

E[X] = ∑_{x∈{0,1}} x Bern(X = x|θ)
     = 0 · Bern(X = 0|θ) + 1 · Bern(X = 1|θ)
     = θ^1 (1 − θ)^(1−1) = θ.
(c) Variance:

Var[X] = E[(X − E[X])²]
       = ∑_{x∈{0,1}} (x − θ)² Bern(X = x|θ)
       = θ² Bern(X = 0|θ) + (1 − θ)² Bern(X = 1|θ)
       = θ²(1 − θ) + (1 − θ)² θ = θ(1 − θ).
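All three identities can be checked by explicit enumeration over x ∈ {0, 1}; θ = 0.3 below is an arbitrary test value (a sketch, not part of the exercise):

import numpy as np

theta = 0.3
xs = np.array([0, 1])
p = theta ** xs * (1 - theta) ** (1 - xs)        # Bern(x | theta)

print(p.sum())                                   # 1.0: normalisation
print(np.sum(xs * p), theta)                     # mean equals theta
print(np.sum((xs - theta) ** 2 * p), theta * (1 - theta))   # variance equals theta(1 - theta)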
7) (This problem setting can be applied to the lab exercise of tossing up a globe and registering W or L depending on where the index finger touches. This is a binary random variable.) Let us assume that each binary event in a length-N sequence of binary outcomes D := (x_1, x_2, x_3, …, x_N) is drawn independently from the same underlying distribution. The likelihood function p(D|θ) is a function of the model parameter θ:

p(D|θ) = ∏_{n=1}^N p(x_n|θ) = ∏_{n=1}^N θ^{x_n} (1 − θ)^{1−x_n}.

Maximising the likelihood to find the value of θ that best describes the data D is equivalent to maximising the log of the likelihood (since the logarithm is a monotonic function). Show that the maximum likelihood estimator (MLE) θ_ML, which is obtained by setting to zero the derivative with respect to θ of the log-likelihood function

ln p(D|θ) = ∑_{n=1}^N { x_n ln θ + (1 − x_n) ln(1 − θ) },

is

θ_ML = (1/N) ∑_{n=1}^N x_n.

Note that this is also the sample mean.

Solution. Hint: use (d/dθ) ln θ = 1/θ. Then

(d/dθ) ln p(D|θ) = ∑_{n=1}^N { x_n/θ − (1 − x_n)/(1 − θ) } = 0,

which rearranges to (1 − θ) ∑_n x_n = θ ∑_n (1 − x_n), i.e. θ_ML = (1/N) ∑_{n=1}^N x_n.

8) (Bishop 2.3) Verify that the Binomial distribution

Binom(m|θ, N) = (N choose m) θ^m (1 − θ)^{N−m}

is normalised. (Hint: Factor (1 − θ)^N out of the sum and use the binomial expansion theorem.)

Solution.

∑_{m=0}^N (N choose m) θ^m (1 − θ)^{N−m} = (1 − θ)^N ∑_{m=0}^N (N choose m) (θ/(1 − θ))^m
                                         = (1 − θ)^N (1 + θ/(1 − θ))^N = 1.
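The two results can be sanity-checked numerically. The sketch below (i) maximises the log-likelihood on simulated coin tosses and compares the result with the sample mean, and (ii) sums the Binomial probabilities over m = 0, …, N. The true θ, N and sample size are arbitrary test values.

import numpy as np
from scipy.special import comb
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=1000)              # simulated 0/1 outcomes

neg_log_lik = lambda t: -np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))
theta_ml = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(theta_ml, x.mean())                        # numerical MLE vs sample mean

N, theta = 10, 0.37
m = np.arange(N + 1)
print(np.sum(comb(N, m) * theta ** m * (1 - theta) ** (N - m)))   # 1.0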
9) (Binomial distribution.) Look up the derivation of the mean and variance of a Binomial distribution (say, from Wikipedia). Instead of performing such an explicit computation, you should extend from 2 variables to N what you proved in the problems above: that the mean of the sum of 2 independent random variables X_1 and X_2 is the sum of the means, and the variance of X_1 + X_2 is the sum of the variances of X_1 and X_2.

If N random variables X_1, X_2, …, X_N are independent,

E[∑_{n=1}^N X_n] = ∑_{n=1}^N E[X_n]   and   Var[∑_{n=1}^N X_n] = ∑_{n=1}^N Var[X_n].

Since the Binomial distribution is the distribution of the sum of N independent Bernoulli variables, each representing one binary variable (or a toss of a coin), you should find for X ∼ Binom(·|θ, N)

E[X] = Nθ,   and   Var[X] = Nθ(1 − θ).
Solution. If N random variables X_1, X_2, …, X_N are independent, then Cov(X_i, X_j) = 0 for all i ≠ j, and

E[∑_{n=1}^N X_n] = ∑_{n=1}^N E[X_n],

Var[∑_{n=1}^N X_n] = ∑_{n=1}^N Var[X_n] + ∑_{i≠j} Cov(X_i, X_j) = ∑_{n=1}^N Var[X_n].

Since the Binomial distribution is the distribution of the sum of N independent Bernoulli(θ) variables, each with mean θ and variance θ(1 − θ), it follows that for X ∼ Binom(·|θ, N)

E[X] = Nθ,   and   Var[X] = Nθ(1 − θ).
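A simulation sketch of the same result (N, θ and the sample size are arbitrary test values): summing rows of independent Bernoulli draws gives Binomial samples whose empirical mean and variance approach Nθ and Nθ(1 − θ).

import numpy as np

rng = np.random.default_rng(3)
N, theta = 20, 0.4
bern = rng.binomial(1, theta, size=(100_000, N))   # rows of N Bernoulli(theta) draws
X = bern.sum(axis=1)                               # each row sums to one Binomial draw

print(X.mean(), N * theta)                         # ~ 8.0
print(X.var(), N * theta * (1 - theta))            # ~ 4.8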
10) (The Bayesian section of the lab on catching a globe should be relevant here. What is estimated and plotted there is not a single numerical estimate of the fraction of water, but a distribution of values that are consistent with the data.) The conjugate prior of a Binomial distribution is the Beta distribution

Beta(θ|a, b) = θ^{a−1} (1 − θ)^{b−1} / ∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ.
Let us say we have observed data D = {n_H, n_T} of n_H heads and n_T tails in N = n_H + n_T tosses of a coin, and the probability of obtaining a head is θ. If the prior probability of the parameter θ is p(θ) then the posterior probability is

p(θ|n_H, n_T) = p(n_H, n_T|θ) p(θ) / p(n_H, n_T)
using Bayes’ rule. If the prior distribution (the probability of the proba- bility of heads) is taken to be Beta(a, b), show that the posterior distri- bution (now also indexed by the parameters of the Beta)
p(θ|n_H, n_T, a, b) ∝ θ^{n_H + a − 1} (1 − θ)^{n_T + b − 1}.
Interpret the role of a and b in terms of pseudocounts. Look up the integral in the denominator of the definition of the Beta distribution (Wikipedia will do) and relate the result to binomial coefficients.
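A small numerical illustration of the conjugate update (the prior pseudocounts and observed counts below are arbitrary illustrative values): the unnormalised product θ^{n_H+a−1}(1 − θ)^{n_T+b−1} is proportional to the Beta(a + n_H, b + n_T) density, so their ratio is constant in θ.

import numpy as np
from scipy.stats import beta

a, b = 2, 2                                  # prior pseudocounts (illustrative)
nH, nT = 7, 3                                # observed heads and tails (illustrative)

theta = np.linspace(0.01, 0.99, 99)
unnorm = theta ** (nH + a - 1) * (1 - theta) ** (nT + b - 1)   # prior times likelihood
ratio = beta.pdf(theta, a + nH, b + nT) / unnorm

print(ratio.min(), ratio.max())              # constant ratio: posterior is Beta(a + nH, b + nT)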
11) The probability density function (pdf) of a Gaussian random variable z of zero mean (E[z] = 0) and unit standard deviation (E[z²] = 1) is

p(z) = (1/√(2π)) exp(−z²/2).
If we define y = 3z − 4, what would the mean and standard deviation of y be? (Use the result from ex. 1). How will the pdf p(y) be expressed mathematically?
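A quick simulation sketch (the sample size is an arbitrary choice): drawing z ∼ N(0, 1) and forming y = 3z − 4 gives a sample mean near −4 and a sample standard deviation near 3, consistent with y ∼ N(−4, 9).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
z = rng.standard_normal(100_000)
y = 3 * z - 4

print(y.mean(), y.std())                     # ~ -4 and ~ 3
print(norm.pdf(0.0, loc=-4, scale=3))        # p(y) evaluated at y = 0, as a spot check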
12) The generic form of a d-dimensional multivariate normal distribution is

p(x; μ, Σ) = (2π)^{−d/2} det(Σ)^{−1/2} exp( −(1/2)(x − μ)⊤ Σ⁻¹ (x − μ) ).

For a specific d = 2 example with

μ = ( −1 ),    Σ = (  1  −2 ),
    (  1 )         ( −2   6 )

show that the pdf p(x_1, x_2) is

p(x_1, x_2) = (1/(2√2 π)) exp( −( 3x_1²/2 + x_2²/4 + x_1 x_2 + 2x_1 + x_2/2 + 3/4 ) ).
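As a spot check (the test point below is an arbitrary choice), scipy's multivariate normal density can be compared with the explicit formula for this μ and Σ:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([-1.0, 1.0])
Sigma = np.array([[1.0, -2.0], [-2.0, 6.0]])
x = np.array([0.3, -0.7])                    # arbitrary test point

lhs = multivariate_normal(mu, Sigma).pdf(x)
d = x - mu
rhs = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
    (2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma)))
print(lhs, rhs)                              # the two values agree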