
UNIVERSITY COLLEGE LONDON
Faculty of Engineering Sciences
Department of Computer Science
COMP0142(L6): 24 hrs Open Book Test 2

Guidelines


Thursday, 15th April 2021
• Due Date & Time: Friday, 16th April 2021 at 12 noon
• Weighting: 30% of module total
• This paper consists of TWO questions. Answer BOTH questions.
• You must format your submission as a pdf and submit it via the module’s Moodle page, prepared using one of the following:
– LaTeX: A template, ‘COMP0142 L6 Solution Template.tex’, is provided on the module’s Moodle page for this purpose.
– MS Word and its equation editor.
• You should preface your report with a single page containing, on two lines:
– The module code and assignment title: ‘COMP0142(L6): Open Book Test 2’
– Your candidate number: ‘Candidate Number: [YOUR NUMBER]’
• Within your report you should begin each sub-question on a new page.
• Please be aware that not all questions carry equal marks.
• Marks for each part of each question are indicated in square brackets.
• Unless a question specifies otherwise, please make use of the Notation section as a guide to the definition of objects.
• Please express your answers as succinctly as possible, remember to detail your working, and state clearly any assumptions which you make.

Notation

Inputs:
x = [1, x1, x2, ..., xm]^T ∈ R^(m+1)

Outputs:
y ∈ R for regression problems
y ∈ {0, 1} for binary classification problems

Training Data:
S = {(x^(i), y^(i))}_{i=1}^n

Input Training Data:
The design matrix, X, is defined as:

    [ x^(1)T ]   [ 1  x1^(1)  ...  xm^(1) ]
    [ x^(2)T ]   [ 1  x1^(2)  ...  xm^(2) ]
X = [   ...  ] = [ ...                    ]
    [ x^(n)T ]   [ 1  x1^(n)  ...  xm^(n) ]

Output Training Data:
y = [y^(1), y^(2), ..., y^(n)]^T
Data-Generating Distribution:
S is drawn i.i.d. from a data-generating distribution, D

1. You are given a data set, S, which records instances of input attributes, x = [x1, x2]^T ∈ R^2, denoted as {x^(i) = [x1^(i), x2^(i)]^T}_{i=1}^n, and their labels y ∈ {0, 1}, denoted as {y^(i)}_{i=1}^n. The input attributes and labels are paired so that S = {(x^(i), y^(i))}_{i=1}^n, where n = 24:
 x1      x2     y
-0.01    5.19   0
 1.32    1.74   1
 3.36    4.55   0
 0.95    1.9    1
 2.4     2.74   0
 0.22    2.75   0
-0.32    5.28   0
 3.7     6.9    0
 1.63    1.67   1
 1.21    4.09   0
 1.73    4.42   0
 1.03    4.91   0
 0.9     7.37   0
 5.78    5.84   0
 1.2     5.52   0
 1.46    0.7    1
 1.25    2.33   1
 1.82    2.27   1
 2.18    1.42   1
-1.61    3.95   0
 5.53    4.43   0
 0.91    4.34   0
 1.83    0.81   1
 2.79    2.04   0
Table 1: S
For your convenience you have also been provided with a csv file of the data, which can be found in the briefing area for this problem set, named ‘OpenBookTest2 Q1 Data.csv’.
You are asked to build a binary classifier in order to classify the y label of a point, given its x attribute. You decide to proceed by constructing a Naive Bayes probabilistic classifier as follows:
First you assume that classes are distributed according to a Bernoulli random variable, Y, with outcomes y ∼ Bern(θ), with pmf characterised by pY (y = 1) = θ.
Then you consider the following Exponential model for the class conditional probability distributions for the random variables, X1, X2, the outcomes of which are given by instances of particular input attributes, x1, x2 respectively:

(Model 1)
x1 | (y = 0) ∼ Exp(λ01)
x2 | (y = 0) ∼ Exp(λ02)
x1 | (y = 1) ∼ Exp(λ11)
x2 | (y = 1) ∼ Exp(λ12)

Where: λ01, λ02, λ11, λ12 > 0.

[NB: If a random variable Z, with outcomes z ∈ [0, ∞), is Exponentially distributed, z ∼ Exp(λ), then the associated pdf is given by:

pZ(z; λ) = λe^(−λz)   if z ≥ 0
           0          if z < 0

for some λ > 0.]
As usual you then classify a particular point with input attribute vector, x, as belonging to
class 1 if pY (y = 1|x) ≥ 0.5, and as belonging to class 0 otherwise.
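For intuition only (this is not part of any required answer), the following is a minimal sketch of how Model 1 could be fitted and applied numerically. It assumes the provided csv file has a header row followed by columns x1, x2, y in that order; the Bernoulli parameter is estimated by the class-1 frequency and each exponential rate by the reciprocal of the corresponding class-conditional sample mean. The helper names (exp_rate, log_exp_pdf, predict) are purely illustrative.

```python
import csv
import math

# Load the data set; assumes a header row followed by columns x1, x2, y.
with open("OpenBookTest2 Q1 Data.csv") as f:
    rows = [(float(r[0]), float(r[1]), int(float(r[2]))) for r in list(csv.reader(f))[1:]]

# MLE for the Bernoulli class prior: the fraction of points with y = 1.
theta = sum(y for _, _, y in rows) / len(rows)

# MLE of an exponential rate parameter: the reciprocal of the sample mean.
def exp_rate(values):
    return len(values) / sum(values)

# lam[(c, j)] estimates the rate for attribute x_{j+1} conditional on class c.
lam = {(c, j): exp_rate([r[j] for r in rows if r[2] == c])
       for c in (0, 1) for j in (0, 1)}

def log_exp_pdf(z, rate):
    # Log of the exponential pdf; zero density (log = -inf) for z < 0.
    return math.log(rate) - rate * z if z >= 0 else -math.inf

def predict(x1, x2):
    # Naive Bayes decision rule: predict 1 iff p(y=1|x) >= 0.5, i.e. iff the
    # class-1 joint score is at least the class-0 joint score.
    s1 = math.log(theta) + log_exp_pdf(x1, lam[(1, 0)]) + log_exp_pdf(x2, lam[(1, 1)])
    s0 = math.log(1 - theta) + log_exp_pdf(x1, lam[(0, 0)]) + log_exp_pdf(x2, lam[(0, 1)])
    return 1 if s1 >= s0 else 0
```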
(a) i) Fit θ and the parameters of Model 1 using S and maximum likelihood estimation. Express your final results to 2 decimal places.
ii) Fully describe the resulting discriminant boundary, giving its functional form and providing a sketch (hand-drawn sketches are fine so long as they are inserted into the pdf of your solutions at the appropriate point).
Upon reflection you realise that Model 1 is not appropriate for this data set. A more appropriate model for the class conditional probability distributions is given by:
(Model 2)
x1 | (y = 0) ∼ N(μ01, σ01^2)
x2 | (y = 0) ∼ N(μ02, σ02^2)
x1 | (y = 1) ∼ N(μ11, σ11^2)
x2 | (y = 1) ∼ N(μ12, σ12^2)

Where: μ01, μ02, μ11, μ12 ∈ R and σ01, σ02, σ11, σ12 ∈ (0, ∞).
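As a companion sketch (again illustrative only), the parameters of Model 2 could be estimated in the same way: the MLE of each class-conditional mean is the class-conditional sample mean, and the MLE of each variance is the (1/n-normalised) class-conditional sample variance. The snippet below assumes the rows list loaded in the earlier sketch.

```python
def gaussian_mle(values):
    # MLE for a univariate Gaussian: sample mean and (1/n) sample variance.
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

# params[(c, j)] gives (mean, variance) for attribute x_{j+1} conditional on class c,
# e.g. params[(0, 0)] corresponds to (mu_01, sigma_01^2) in Model 2's notation.
params = {(c, j): gaussian_mle([r[j] for r in rows if r[2] == c])
          for c in (0, 1) for j in (0, 1)}
```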

(b) Briefly explain why Model 2 rather than Model 1 is a more appropriate choice to model this data set.
(c) i) Fit the parameters of Model 2 using S and maximum likelihood estimation. Express your final results to 2 decimal places.
ii) Fully describe the resulting discriminant boundary, giving its functional form and providing a sketch (hand-drawn sketches are fine so long as they are inserted into the pdf of your solutions at the appropriate point).
(d) Briefly explain how the discriminant boundary would change if Model 2 was changed such that: σ01 = σ11 and σ02 = σ12.
[Total for Question 1: 25 marks]

2. The n pupils of a school are issued with a survey regarding their school meals. Each survey contains two questions which take binary responses: The first asks pupils whether they like cabbage or not, while the second asks pupils whether they like walnuts or not. Not all pupils answer the questions. The results in terms of number of respondents are given below:
                                  Answers to 2nd Question:
                                  Like Walnuts   Dislike Walnuts   Not Answered
Answers to        Like Cabbage        n11              n10              n1h
1st Question:     Dislike Cabbage     n01              n00              n0h
                  Not Answered        nh1              nh0              nhh

Table 2: Number of survey responses
Let us characterise a random variable X1 with outcomes associated with cabbage preference given by x1 ∈ {0, 1}, (denoting dislike and like respectively) and a random variable X2 with outcomes associated with walnut preference given by x2 ∈ {0, 1}, (denoting dislike and like respectively).
We wish to learn the joint distribution, pX (x1, x2) (where X = [X1, X2]T ) characterised by:
pX(x1 = 1, x2 = 1) = α11
pX(x1 = 1, x2 = 0) = α10
pX(x1 = 0, x2 = 1) = α01
pX(x1 = 0, x2 = 0) = α00

Here α11, α10, α01, α00 ∈ [0, 1], and (α11 + α10 + α01 + α00) = 1.
(a) If we ignore all surveys which contain missing data, then derive the maximum likelihood estimators of α11, α10, α01, α00 in terms of the contents of Table 2.

[NB: You should handle parameter constraints using the following re-parameterisation trick: αkl = e^(ηkl) / Σ_{k̃=0, l̃=0}^{1,1} e^(ηk̃l̃), where ηkl ∈ R.]
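As a quick illustration of the re-parameterisation trick (taking it to be the usual softmax mapping written above), the snippet below shows that any unconstrained ηkl ∈ R produce valid αkl that lie in [0, 1] and sum to 1. The function name softmax_alphas and the example η values are illustrative only.

```python
import math

def softmax_alphas(etas):
    # Map unconstrained eta_kl in R to alpha_kl in [0, 1] that sum to 1.
    # etas: dict keyed by (k, l) with k, l in {0, 1}.
    z = sum(math.exp(e) for e in etas.values())
    return {kl: math.exp(e) / z for kl, e in etas.items()}

alphas = softmax_alphas({(0, 0): 0.3, (0, 1): -1.2, (1, 0): 2.0, (1, 1): 0.0})
assert abs(sum(alphas.values()) - 1.0) < 1e-12  # the constraint holds by construction
```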

Now, let us try to learn the parameterisation of the joint distribution using the information gleaned from the missing data.
With this in mind we may write the log-likelihood function as:

L = log ∏_{i=1}^{ñ} pX(x1^(i), x2^(i))
  = log[ α00^n00 α10^n10 α01^n01 α11^n11
         × ( Σ_{x2 ∈ {0,1}} pX(x1 = 0, x2) )^n0h ( Σ_{x2 ∈ {0,1}} pX(x1 = 1, x2) )^n1h
         × ( Σ_{x1 ∈ {0,1}} pX(x1, x2 = 0) )^nh0 ( Σ_{x1 ∈ {0,1}} pX(x1, x2 = 1) )^nh1 ]

Where: ñ = n − nhh.

In forming this log-likelihood function we have ‘summed over’ the missing, or hidden, data, which we treat as the outcome of hidden variables.

Maximising this function looks like a job for the EM algorithm. Using similar reasoning to that given in the lecture slides when we encountered this algorithm (albeit in a different context), we can obtain the following lower bound on the function:

L ≥ n00 log α00 + n10 log α10 + n01 log α01 + n11 log α11
    + n0h Σ_{x2 ∈ {0,1}} q0h(x2) log( pX(x1 = 0, x2) / q0h(x2) )
    + n1h Σ_{x2 ∈ {0,1}} q1h(x2) log( pX(x1 = 1, x2) / q1h(x2) )
    + nh0 Σ_{x1 ∈ {0,1}} qh0(x1) log( pX(x1, x2 = 0) / qh0(x1) )
    + nh1 Σ_{x1 ∈ {0,1}} qh1(x1) log( pX(x1, x2 = 1) / qh1(x1) )
Here q0h(·), q1h(·) are pdf’s over x2, and qh0(·), qh1(·) are pdf’s over x1.
The EM algorithm seeks to learn the parameters: α00, α10, α01, α11, q0h(x2), q1h(x2), qh0(x1), qh1(x1) (for x2 ∈ {0, 1} and x1 ∈ {0, 1}) iteratively. After the t-th iteration of the algorithm the estimates of these parameters are denoted by: α00^t, α10^t, α01^t, α11^t, q0h^t(x2), q1h^t(x2), qh0^t(x1), qh1^t(x1).
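For concreteness, here is a minimal sketch of one way such an iteration could be organised in code. It is illustrative only and is not a substitute for the derivations requested in parts (b) and (c) below; it assumes the standard EM updates for this kind of missing-data problem (E-step: each q is the conditional distribution of the hidden answer given the observed one, under the current α estimates; M-step: re-normalised expected counts), and the function name em_step is hypothetical.

```python
def em_step(alpha, counts):
    # alpha: current estimates {(x1, x2): alpha_x1x2}.
    # counts: observed numbers of surveys, keyed by (x1, x2) with 'h' marking
    # a missing (hidden) answer, e.g. (1, 'h') = like cabbage, walnuts unanswered.

    # E-step: conditional distribution of the hidden answer given the observed one.
    q_row = {x1: {x2: alpha[(x1, x2)] / (alpha[(x1, 0)] + alpha[(x1, 1)]) for x2 in (0, 1)}
             for x1 in (0, 1)}   # plays the role of q_0h, q_1h
    q_col = {x2: {x1: alpha[(x1, x2)] / (alpha[(0, x2)] + alpha[(1, x2)]) for x1 in (0, 1)}
             for x2 in (0, 1)}   # plays the role of q_h0, q_h1

    # M-step: expected counts (observed plus fractionally assigned hidden ones),
    # re-normalised so that the alpha estimates sum to 1.
    expected = {(x1, x2): counts[(x1, x2)]
                          + counts[(x1, 'h')] * q_row[x1][x2]
                          + counts[('h', x2)] * q_col[x2][x1]
                for x1 in (0, 1) for x2 in (0, 1)}
    total = sum(expected.values())
    return {kl: v / total for kl, v in expected.items()}
```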

(b) State the t-th E-step of the EM algorithm based on the lower bound given above. In other words, state expressions for: q0h^t(x2), q1h^t(x2), qh0^t(x1), qh1^t(x1) in terms of the parameter estimates for α11, α10, α01, α00 from the previous iteration of the algorithm, i.e. α11^(t−1), α10^(t−1), α01^(t−1), α00^(t−1).

(c) Derive the t-th M-step of the EM algorithm. In other words, derive expressions for α11^t, α10^t, α01^t, α00^t in terms of q0h^t(x2), q1h^t(x2), qh0^t(x1), qh1^t(x1).

[NB: Again, you should handle parameter constraints using the following re-parameterisation trick: αkl = e^(ηkl) / Σ_{k̃=0, l̃=0}^{1,1} e^(ηk̃l̃), where ηkl ∈ R.]
(d) You are presented with particular survey data from a school:

                                  Answers to 2nd Question:
                                  Like Walnuts   Dislike Walnuts   Not Answered
Answers to        Like Cabbage         50               72               18
1st Question:     Dislike Cabbage      35              106               43
                  Not Answered         90               25               10

Given this data and assuming an initialisation such that α11^1, α10^1, α01^1, α00^1 are set equal to the estimates derived in part (a) of this question, then calculate parameter estimates at convergence (to 2 decimal places) of the EM algorithm, i.e. calculate α11^t, α10^t, α01^t, α00^t to two decimal places as t becomes sufficiently large. In addition, for this level of accuracy, state what value of t constitutes ‘large’ in this case.
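Purely as an illustration of how such an iteration could be run on a table of counts like the one above (reusing the hypothetical em_step sketch from earlier, and assuming the part (a) estimates take the usual normalised complete-case-count form), one might write something like the following; the numerical results are deliberately not shown here.

```python
counts = {(1, 1): 50, (1, 0): 72, (1, 'h'): 18,
          (0, 1): 35, (0, 0): 106, (0, 'h'): 43,
          ('h', 1): 90, ('h', 0): 25, ('h', 'h'): 10}

# Initialise from the complete-case estimates of part (a), then iterate until
# successive estimates agree to well within 2 decimal places.
n_complete = sum(counts[kl] for kl in [(1, 1), (1, 0), (0, 1), (0, 0)])
alpha = {kl: counts[kl] / n_complete for kl in [(1, 1), (1, 0), (0, 1), (0, 0)]}
t = 0
while True:
    new_alpha = em_step(alpha, counts)   # em_step as sketched earlier
    t += 1
    converged = all(abs(new_alpha[kl] - alpha[kl]) < 1e-6 for kl in alpha)
    alpha = new_alpha
    if converged:
        break
# alpha now holds the converged estimates; t records the number of iterations used.
```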
[Total for Question 2: 25 marks]
END OF PAPER
