UNIVERSITY COLLEGE LONDON

Faculty of Engineering Sciences

BENG0095: Assignment 2

Dr. NAME (dariush. .uk)

Tuesday 14th December 2021

Guidelines

• Release Date: Tuesday, 14th December 2021

• Due Date: Tuesday, 4th January 2022 at 4.00 p.m.

• Weighting: 25% of module total

• This paper consists of TWO questions. Answer BOTH questions.

• You must format your submission as a pdf using one of the following:

– LaTeX: A template, ‘Assignment2 BENG0095 SolutionTemplate.tex’, is provided
on the module’s Moodle page for this purpose.

– MS Word and its equation editor.

• You must submit your pdf via the module’s Moodle page using the ‘BENG0095 (2021/22)
Individual Coursework 2: Submission’ submission portal.

• You should preface your report with a single page containing, on two lines:

– The module code and assignment title: ‘BENG0095: Assignment 2’

– Your candidate number: ‘Candidate Number: [YOUR NUMBER]’

• Within your report you should begin each sub-question on a new page.

• Please be aware that not all questions carry equal marks.

• Marks for each part of each question are indicated in square brackets.

• Unless a question specifies otherwise, please use the Notation section as a guide to
the definition of objects.

• Please express your answers as succinctly as possible, remember to detail your working,
and state clearly any assumptions which you make.


Notation & Formulae

Inputs:

$\mathbf{x} = [1, x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m+1}$

Outputs:

y ∈ R for regression problems

y ∈ {0, 1} for binary classification problems

Training Data:

$S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$

Input Training Data:

The design matrix, X, is defined as:

$$\mathbf{X} =
\begin{bmatrix}
\mathbf{x}^{(1)T} \\
\mathbf{x}^{(2)T} \\
\vdots \\
\mathbf{x}^{(n)T}
\end{bmatrix}
=
\begin{bmatrix}
1 & x_1^{(1)} & \cdots & x_m^{(1)} \\
1 & x_1^{(2)} & \cdots & x_m^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \cdots & x_m^{(n)}
\end{bmatrix}$$

Output Training Data:

$$\mathbf{y} =
\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(n)}
\end{bmatrix}$$

Data-Generating Distribution:

S is drawn i.i.d. from a data-generating distribution, D
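
For concreteness, a minimal NumPy sketch (the raw-data array names are illustrative and not part of the notation) of how the design matrix X and output vector y above are assembled:

```python
import numpy as np

# Hypothetical raw data: n examples, each with m features (no bias column yet).
n, m = 5, 3
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(n, m))          # row i holds x_1^(i), ..., x_m^(i)
y = rng.normal(size=n)                   # outputs y^(1), ..., y^(n)

# Design matrix: prepend a column of ones so that row i is [1, x_1^(i), ..., x_m^(i)].
X = np.hstack([np.ones((n, 1)), X_raw])  # shape (n, m + 1)
```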


1. Each year n students of the Engineering Faculty are issued with a survey regarding their
favourite topics. Each survey contains two questions which take binary responses: the first
asks students whether they like machine learning or not, while the second asks students
whether they like maths or not. Not all students answer the questions. The results, in terms
of the number of respondents, are given below:

                                      Answers to 2nd Question:
                                  Like Maths   Dislike Maths   Not Answered
Answers to      Like Machine
1st Question:   Learning          $n_{11}$     $n_{10}$        $n_{1h}$
                Dislike Machine
                Learning          $n_{01}$     $n_{00}$        $n_{0h}$
                Not Answered      $n_{h1}$     $n_{h0}$        $n_{hh}$

Table 1: Number of survey responses

Let us characterise a random variable $X_1$ with outcomes associated with machine learning
preference given by $x_1 \in \{0, 1\}$ (denoting dislike and like respectively), and a random
variable $X_2$ with outcomes associated with maths preference given by $x_2 \in \{0, 1\}$ (denoting
dislike and like respectively).

We wish to learn the joint distribution, $p_X(x_1, x_2)$ (where $X = [X_1, X_2]^T$), characterised by:

$p_X(x_1 = 1, x_2 = 1) = \alpha_{11}$

$p_X(x_1 = 1, x_2 = 0) = \alpha_{10}$

$p_X(x_1 = 0, x_2 = 1) = \alpha_{01}$

$p_X(x_1 = 0, x_2 = 0) = \alpha_{00}$

Here $\alpha_{11}, \alpha_{10}, \alpha_{01}, \alpha_{00} \in [0, 1]$, and $\alpha_{11} + \alpha_{10} + \alpha_{01} + \alpha_{00} = 1$.

(a) [6 marks]
If we ignore all surveys which contain missing data, derive the maximum likelihood
estimators of $\alpha_{11}, \alpha_{10}, \alpha_{01}, \alpha_{00}$ in terms of the contents of Table 1.

[NB:

You should handle parameter constraints using the following re-parameterisation trick:
$\alpha_{kl} = \dfrac{e^{\eta_{kl}}}{\sum_{\tilde{k}=0}^{1}\sum_{\tilde{l}=0}^{1} e^{\eta_{\tilde{k}\tilde{l}}}}$, where $\eta_{kl} \in \mathbb{R}$.]
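
Purely as an illustration of this re-parameterisation (the derivation itself is what the question asks for), a minimal NumPy sketch of the softmax mapping from unconstrained $\eta_{kl}$ to valid $\alpha_{kl}$:

```python
import numpy as np

def alphas_from_etas(eta):
    """Softmax map from unconstrained eta_{kl} in R to alpha_{kl} in [0, 1] summing to 1."""
    eta = np.asarray(eta, dtype=float)
    w = np.exp(eta - eta.max())        # subtracting the max is for numerical stability only
    return w / w.sum()

# Example: eta arranged as a 2x2 array indexed [k, l]; the resulting alphas sum to 1.
alpha = alphas_from_etas([[0.0, 0.5], [1.0, -0.3]])
assert np.isclose(alpha.sum(), 1.0)
```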


Now, let us try to learn the parameterisation of the joint distribution using the infor-
mation gleaned from the missing data.

With this in mind we may write the log-likelihood function as:

$$\begin{aligned}
\mathcal{L} = \log\Bigg[\, & \alpha_{00}^{n_{00}}\,\alpha_{10}^{n_{10}}\,\alpha_{01}^{n_{01}}\,\alpha_{11}^{n_{11}} \\
& \times \Bigg(\sum_{x_2 \in \{0,1\}} p_X(x_1 = 0, x_2)\Bigg)^{n_{0h}}
  \times \Bigg(\sum_{x_2 \in \{0,1\}} p_X(x_1 = 1, x_2)\Bigg)^{n_{1h}} \\
& \times \Bigg(\sum_{x_1 \in \{0,1\}} p_X(x_1, x_2 = 0)\Bigg)^{n_{h0}}
  \times \Bigg(\sum_{x_1 \in \{0,1\}} p_X(x_1, x_2 = 1)\Bigg)^{n_{h1}} \Bigg]
\end{aligned}$$

Where: $\tilde{n} = n - n_{hh}$.
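
To make the structure of this likelihood concrete, a minimal NumPy sketch (the array layout and variable names are assumptions, not part of the question) that evaluates $\mathcal{L}$ for given parameters and counts:

```python
import numpy as np

def log_likelihood(alpha, n, n_h, h_n):
    """Observed-data log-likelihood L for the 2x2 model with partially missing answers.

    alpha: 2x2 array with alpha[x1, x2] = p(X1 = x1, X2 = x2).
    n:     2x2 array of fully observed counts, n[x1, x2] = n_{x1 x2}.
    n_h:   length-2 array, n_h[x1] = n_{x1 h} (maths answer missing).
    h_n:   length-2 array, h_n[x2] = n_{h x2} (machine learning answer missing).
    """
    L = np.sum(n * np.log(alpha))                    # fully observed surveys
    L += np.sum(n_h * np.log(alpha.sum(axis=1)))     # marginalise over the missing x2
    L += np.sum(h_n * np.log(alpha.sum(axis=0)))     # marginalise over the missing x1
    return L

# Dummy example: uniform alpha and made-up counts, purely to show the call.
print(log_likelihood(np.full((2, 2), 0.25),
                     np.array([[3, 5], [2, 4]]), np.array([1, 2]), np.array([2, 1])))
```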

In forming this log-likelihood function we have ‘summed over’ the missing, or hidden,
data, which we treat as the outcome of hidden variables.

Maximising this function looks like a job for the EM algorithm. Using similar reasoning
to that given in the lecture slides when we encountered this algorithm (albeit in a
different context), we can obtain the following lower bound on the function:

$$\begin{aligned}
\mathcal{L} \geq\ & n_{00}\log\alpha_{00} + n_{10}\log\alpha_{10} + n_{01}\log\alpha_{01} + n_{11}\log\alpha_{11} \\
& + n_{0h}\sum_{x_2 \in \{0,1\}} q_{0h}(x_2)\log\left(\frac{p_X(x_1 = 0, x_2)}{q_{0h}(x_2)}\right)
  + n_{1h}\sum_{x_2 \in \{0,1\}} q_{1h}(x_2)\log\left(\frac{p_X(x_1 = 1, x_2)}{q_{1h}(x_2)}\right) \\
& + n_{h0}\sum_{x_1 \in \{0,1\}} q_{h0}(x_1)\log\left(\frac{p_X(x_1, x_2 = 0)}{q_{h0}(x_1)}\right)
  + n_{h1}\sum_{x_1 \in \{0,1\}} q_{h1}(x_1)\log\left(\frac{p_X(x_1, x_2 = 1)}{q_{h1}(x_1)}\right)
\end{aligned}$$

Here $q_{0h}(\cdot), q_{1h}(\cdot)$ are probability distributions over $x_2$, and $q_{h0}(\cdot), q_{h1}(\cdot)$ are probability distributions over $x_1$.

The EM algorithm seeks to learn the parameters $\alpha_{00}, \alpha_{10}, \alpha_{01}, \alpha_{11}, q_{0h}(x_2), q_{1h}(x_2),
q_{h0}(x_1), q_{h1}(x_1)$ (for $x_2 \in \{0, 1\}$ and $x_1 \in \{0, 1\}$) iteratively. After the t-th iteration
of the algorithm the estimates of these parameters are denoted by $\alpha_{00}^{t}, \alpha_{10}^{t}, \alpha_{01}^{t}, \alpha_{11}^{t},
q_{0h}^{t}(x_2), q_{1h}^{t}(x_2), q_{h0}^{t}(x_1), q_{h1}^{t}(x_1)$.


(b) [6 marks]
Demonstrate that the result of the t-th E-step of the EM algorithm, based on the lower
bound given above, is given by the following expressions:

$$q_{ih}^{t}(x_2 = j) = \frac{\alpha_{ij}^{t-1}}{\alpha_{i0}^{t-1} + \alpha_{i1}^{t-1}} \qquad \text{for: } i, j = 0 \text{ or } 1$$

$$q_{hi}^{t}(x_1 = j) = \frac{\alpha_{ji}^{t-1}}{\alpha_{0i}^{t-1} + \alpha_{1i}^{t-1}} \qquad \text{for: } i, j = 0 \text{ or } 1$$
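
As a numerical sanity check of these expressions (not a substitute for the requested derivation), a minimal NumPy sketch of the E-step, assuming $\alpha$ is stored as a 2 × 2 array `alpha[i, j]`:

```python
import numpy as np

def e_step(alpha):
    """E-step for the 2x2 model, with alpha[i, j] = p(x1 = i, x2 = j)."""
    # q_ih(x2 = j): posterior over the missing x2 when x1 = i was observed.
    q_ih = alpha / alpha.sum(axis=1, keepdims=True)   # rows sum to 1
    # q_hj(x1 = i): posterior over the missing x1 when x2 = j was observed.
    q_hj = alpha / alpha.sum(axis=0, keepdims=True)   # columns sum to 1
    return q_ih, q_hj
```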

(c) [7 marks]
Demonstrate that the result of the t-th M-step of the EM algorithm is given by the
following expressions:

[NB:

Again, you should handle parameter constraints using the following re-parameterisation
trick: $\alpha_{kl} = \dfrac{e^{\eta_{kl}}}{\sum_{\tilde{k}=0}^{1}\sum_{\tilde{l}=0}^{1} e^{\eta_{\tilde{k}\tilde{l}}}}$, where $\eta_{kl} \in \mathbb{R}$.]

$$\alpha_{ij}^{t} = \frac{1}{\tilde{n}}\left(n_{ij} + n_{ih}\, q_{ih}^{t}(x_2 = j) + n_{hj}\, q_{hj}^{t}(x_1 = i)\right) \qquad \text{for: } i, j = 0 \text{ or } 1$$
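
And a matching sketch of the M-step update above, using the same illustrative 2 × 2 array layout as the E-step sketch:

```python
import numpy as np

def m_step(q_ih, q_hj, n, n_h, h_n):
    """M-step: alpha[i, j] = (n_ij + n_ih * q_ih(x2 = j) + n_hj * q_hj(x1 = i)) / n_tilde."""
    n_tilde = n.sum() + n_h.sum() + h_n.sum()                  # n_tilde = n - n_hh
    expected = n + n_h[:, None] * q_ih + h_n[None, :] * q_hj   # expected cell counts
    return expected / n_tilde
```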

(d) [6 marks]
You are presented with survey data from a particular year:

                                      Answers to 2nd Question:
                                  Like Maths   Dislike Maths   Not Answered
Answers to      Like Machine
1st Question:   Learning             150            25             78
                Dislike Machine
                Learning              32            13             62
                Not Answered          45            25             82

Given this data, and assuming an initialisation such that $\alpha_{11}^{0}, \alpha_{10}^{0}, \alpha_{01}^{0}, \alpha_{00}^{0}$ are set equal
to the estimates derived in part (a) of this question, calculate the parameter estimates
at convergence (to 2 decimal places) of the EM algorithm, i.e. calculate
$\alpha_{11}^{t}, \alpha_{10}^{t}, \alpha_{01}^{t}, \alpha_{00}^{t}$ to two decimal places as t becomes sufficiently large. In addition,
for this level of accuracy, state what value of t constitutes ‘large’ in this case.
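
Purely as a way of checking a hand calculation numerically, one could iterate the E- and M-steps until the estimates stop changing; the sketch below repeats the helper functions from parts (b) and (c) so that it runs on its own, uses the counts from the table, and makes no claim about the resulting numbers:

```python
import numpy as np

def e_step(alpha):
    # Posteriors over the missing x2 (rows normalised) and the missing x1 (columns normalised).
    return alpha / alpha.sum(axis=1, keepdims=True), alpha / alpha.sum(axis=0, keepdims=True)

def m_step(q_ih, q_hj, n, n_h, h_n):
    n_tilde = n.sum() + n_h.sum() + h_n.sum()
    return (n + n_h[:, None] * q_ih + h_n[None, :] * q_hj) / n_tilde

# Counts from the table, indexed [x1, x2] with 1 = "like" and 0 = "dislike".
n   = np.array([[13, 32],     # dislike ML: (dislike maths, like maths)
                [25, 150]])   # like ML:    (dislike maths, like maths)
n_h = np.array([62, 78])      # maths answer missing, for x1 = 0, 1
h_n = np.array([25, 45])      # ML answer missing, for x2 = 0, 1

alpha = n / n.sum()           # complete-case MLE from part (a) as the initialisation
for t in range(1, 101):
    new_alpha = m_step(*e_step(alpha), n, n_h, h_n)
    if np.allclose(new_alpha, alpha, atol=1e-6):
        break
    alpha = new_alpha

print(t, np.round(alpha, 2))
```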


2. Assume an unlabelled dataset, $\{\mathbf{x}^{(i)} \in \mathbb{R}^m\}_{i=1}^{n}$, with sample mean, $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(i)}$, and an
orthonormal basis set $\{\mathbf{u}^{[j]} \in \mathbb{R}^m\}_{j=1}^{d}$, where $d < m$.
, where d < m. We have investigated the ‘Projected Variance Maximisation’ approach to PCA in which we are interested in finding the d-dimensional subspace spanned by { u[j] ∈ Rm|u[i] · u[j] = δij ∀i, j }d j=1 for which the sum of the sample variance of the data projected onto this subspace is max- imised. This leads to a formulation of the PCA problem as: argmax {u[j]} 1 n d∑ j=1 n∑ i=1 ( u[j] · x[i] − u[j] · x )2 (1) subject to: u[i] · u[j] = δij ∀i, j We then solved this problem using the eigendecomposition of the sample covariance matrix, C. (a) [2 marks] Show that we can re-write the objective of (1) in terms of the sample covariance matrix, C: C = 1 n X̃T X̃ where X̃ is the centred data matrix: X̃ =   (x(1) − x)T (x(2) − x)T · · (x(n) − x)T   Now consider a non-linear feature mapping, φ, of the input vectors such that, x(i) → φ(x(i)), where φ(x(i)) ∈ Rk. Let us conduct PCA in this transformed space. In order to form the sample covariance matrix, C̃, in this new feature space it appears that we need to first calculate the vectors φ̃(x(i)), where: φ̃(x(i)) = φ(x(i))− 1 n n∑ j=1 φ(x(j)) before calculating C̃ directly as follows: C̃ = 1 n Φ̃ T Φ̃ where Φ̃ is the centred data matrix after feature mapping: Φ̃ =   φ̃(x(1)) φ̃(x(2)) · · φ̃(x(n))   Page 6 and then solving the following eigenvector equations: C̃ũ[j] = λ[j]ũ[j] for: j = 1, ..., k (2) subject to: ũ[j] · ũ[j] = 1 ∀j where ũ[j], λ[j] are the eigenvectors and associated eigenvalues of C̃ which characterise the feature-mapped principal components. However there is an indirect alternative, which involves the Gram matrix, K, whose elements are given by Kab = φ(x (a)) · φ(x(b)) which we will now investigate: (b) [5 marks] In order to acieve an indirect centring of the data, demonstrate that K̃, whose elements are given by K̃ab = φ̃(x (a)) · φ̃(x(b)), is equivalent to the following matrix: K− 1nK−K1n + 1nK1n (3) Where 1n is the n× n matrix whose entries are all 1n . (c) [5 marks] Now demonstrate that C̃ and K̃ share the same non-zero eigenvalues, and in so doing express the j-th eigenvector of K̃ (which we denote by ṽ[j]) in terms of the j-th eigen- vector of C̃ (denoted by ũ[j]) up to a constant of proportionality, as follows: (Assume an ordering of eigenvectors according to decreasing size of eigenvalue as is usual). ṽ[j] = αΦ̃ũ[j] where α ∈ R is some constant of proportionality. (d) [5 marks] Now express ũ[j], up to a constant of proportionality, β, in terms of ṽ[j] and Φ̃. (e) [4 marks] Set β = 1, then derive appropriate normalisation constraints on ṽ[j], such that the normalisation constraints on the eigenvectors of C̃ are respected, i.e.: ũ[j] · ũ[j] = 1 ∀j In so doing calculate an appropriate setting for α. (f) [4 marks] Our approach has involved using K̃ and ṽ[j] to indirectly perform our analysis, rather than working directly with C̃ and ũ[j]. Explain the value of this approach for large k, with particular reference to the projection of a feature-mapped test vector, φ̃(x∗) onto the feature-mapped principal components. Page 7