—— SOLUTIONS ——
May 2019 1.
a. Write down Bayes’ Rule, giving an expression for the posterior, P(ωj|x), in terms of the likelihood, prior and evidence, where ωj represents category j and x is a sample.
Marking scheme 3 marks.
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
b. Briefly describe what is meant by a “Minimum Error Rate classifier”. [2 marks]
A classifier that minimises the number of misclassified exemplars.
Marking scheme 2 marks.
c. Explain how Bayes’ Rule can be used to determine the class of an exemplar so as to minimise error rate.
[2 marks]
Calculate P(ωj|x) for each class ωj. Classify exemplar x as class ωi for which P(ωi|x) > P(ωj|x) for all j ≠ i.
Marking scheme 2 marks.
d. Show, mathematically, that the k-nearest-neighbour classifier provides an estimate of P(ωj|x).
Answer
From Bayes theorem:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Assume we have a dataset with n samples, of which nj come from class ωj. To classify an unknown sample x, place a volume V around x that captures k samples, kj of which are labelled ωj. This gives estimates of:
the likelihood: p(x|ωj) = kj / (nj V)
the prior: P(ωj) = nj / n
the evidence: p(x) = k / (n V)
Substituting these values into Bayes’ rule:
P(ωj|x) = p(x|ωj) P(ωj) / p(x) = [kj/(nj V)] [nj/n] / [k/(n V)] = kj / k
Hence, we can estimate the posterior probability that x is in class ωj
by counting the fraction of samples within a region centred on x that are labelled ωj.
Marking scheme 5 marks
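As a quick illustration of this result (a minimal Python sketch, not part of the model answer; the helper name is hypothetical), the posterior can be estimated as the fraction of the k nearest samples that carry label ωj:

    import math

    def knn_posterior(samples, labels, x, k):
        # Estimate P(w_j | x) as k_j / k, using the k training samples nearest to x.
        order = sorted(range(len(samples)), key=lambda i: math.dist(samples[i], x))
        nearest = [labels[i] for i in order[:k]]
        return {c: nearest.count(c) / k for c in sorted(set(labels))}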
Table 1, below, shows samples from two classes:
class   xT
ω1      (5, 1)
ω1      (5, -1)
ω2      (2, 1)
ω2      (4, 2)
ω2      (3, 0)
e. Apply the k-nearest-neighbour classifier to the data given in Table 1 to determine the class of a new sample with value xT = (4, 0). Use Euclidean distance to measure the distance between samples, and use:
i. k=1 ii. k=3
class   xT        Euclidean distance to (4, 0)
ω1      (5, 1)    √2 ≈ 1.414
ω1      (5, -1)   √2 ≈ 1.414
ω2      (2, 1)    √5 ≈ 2.236
ω2      (4, 2)    2
ω2      (3, 0)    1
i) For k=1. The nearest-neighbour is (3,0). This is in class 2, so the new sample is given the class label ω2.
ii) For k=3. The three nearest-neighbours are (3,0), (5,1), and (5,-1). These are in classes 2, 1, and 1. As the majority are in class 1 the new sample is given the class label ω1.
Marking scheme
2 marks for calculating the Euclidean Distances, 2 marks for correctly applying the k-NN to allocate the new sample to the correct class.
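A short Python check of these two steps, using the Table 1 samples (a sketch, not part of the marking scheme):

    import math
    from collections import Counter

    data = [((5, 1), 1), ((5, -1), 1), ((2, 1), 2), ((4, 2), 2), ((3, 0), 2)]
    x = (4, 0)

    # Rank the training samples by Euclidean distance to the query point.
    ranked = sorted(data, key=lambda s: math.dist(s[0], x))
    for point, label in ranked:
        print(point, label, round(math.dist(point, x), 3))

    for k in (1, 3):
        votes = Counter(label for _, label in ranked[:k])
        print("k =", k, "-> class", votes.most_common(1)[0][0])   # class 2, then class 1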
f. Apply kn nearest neighbours to the data shown in Table 1 in order to estimate the likelihoods, or class-conditional probability densities, at location (4,0) for each class, i.e., calculate p(x|ω1) and p(x|ω2). Use Euclidean distance to measure the distance between samples, and use k=1.
For p(x|ω1), the closest sample to (4,0) is at (5,1) (or (5,-1)). We need to make a circle of radius √2 (“volume” = πr² = 2π) around (4,0) to encompass this sample.
Therefore the density is:
p(x|ω1) = (k/n1) / V = (1/2) / (2π) = 0.0796
For p(x|ω2), the closest sample to (4,0) is at (3,0). We need to make a circle of radius 1 (“volume” = π) around (4,0) to encompass this sample.
Therefore density is:
p(x|ω2) = (k/n2) / V = (1/3) / π = 0.1061
Marking scheme
2 marks for correct method, 2 marks for correct application.
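The same calculation as a brief Python sketch; the radius is the distance to the nearest same-class sample, so the 2-D "volume" is V = πr² (values match the worked answer above):

    import math

    samples = {1: [(5, 1), (5, -1)], 2: [(2, 1), (4, 2), (3, 0)]}
    x, k = (4, 0), 1

    for cls, pts in samples.items():
        r = sorted(math.dist(p, x) for p in pts)[k - 1]   # radius enclosing k samples of this class
        volume = math.pi * r ** 2                         # area of the enclosing circle
        density = (k / len(pts)) / volume                 # p(x|w_cls) = (k / n_j) / V
        print("p(x|w%d) = %.4f" % (cls, density))         # 0.0796 and 0.1061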
g. Using Table 1 estimate the prior probabilities of each class, i.e., calculate
P(ω1) and P(ω2). Answer
P(ω1) = 2/5 = 0.4, P(ω2) = 3/5 = 0.6
Marking scheme
1 mark for each correct answer.
h. Using the answers to sub-questions 1.f and 1.g apply Bayes’ Rule to determine the posterior for each class for the sample with feature vector (4,0), i.e., calculate P (ω1|x) and P (ω2|x). If you have failed to answer sub-questions 1.f and 1.g use the following values: p(x|ω1) = 0.2, p(x|ω2) = 0.1, P (ω1) = 0.3, and P (ω2) = 0.7.
Using correct answers to sub-questions 1.f and 1.g:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
P(ω1|x) = 0.0796 × 0.4 / p(x) = 0.0318 / p(x)
P(ω2|x) = 0.1061 × 0.6 / p(x) = 0.0637 / p(x)
p(x) = Σj p(x|ωj) P(ωj) = 0.0318 + 0.0637 = 0.0955
P(ω1|x) = 0.0318 / 0.0955 = 0.3333
P(ω2|x) = 0.0637 / 0.0955 = 0.6667
Using given (incorrect) values:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
P(ω1|x) = 0.2 × 0.3 / p(x) = 0.06 / p(x)
P(ω2|x) = 0.1 × 0.7 / p(x) = 0.07 / p(x)
p(x) = Σj p(x|ωj) P(ωj) = 0.06 + 0.07 = 0.13
P(ω1|x) = 0.06 / 0.13 = 0.4615
P(ω2|x) = 0.07 / 0.13 = 0.5385
Marking scheme 3 marks.
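The same arithmetic as a Python sketch, using the estimates from sub-questions 1.f and 1.g:

    likelihood = {1: 0.0796, 2: 0.1061}
    prior = {1: 0.4, 2: 0.6}

    joint = {c: likelihood[c] * prior[c] for c in likelihood}      # p(x|w_j) P(w_j)
    evidence = sum(joint.values())                                 # p(x) = 0.0955
    posterior = {c: round(joint[c] / evidence, 3) for c in joint}
    print(posterior)                                               # {1: 0.333, 2: 0.667}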
May 2019 2.
a. Give a brief definition of each of the following terms: i. Feature Space
ii. Decision Boundary
i) Feature Space: the (multidimensional) space defined by the feature
vectors in the dataset.
ii) Decision boundary: the boundary that separates different regions of feature space which are classified into different classes.
Marking scheme
2 marks for each correct definition.
b. A classifier consists of three linear discriminant functions: g1(x), g2(x), and g3(x). The parameters of these three discriminant functions, in augmented notation, are: aT1 = (1, 0.5, 0.5), aT2 = (−1, 2, 2), aT3 = (2, −1, −1). Determine the class predicted by this classifier for a feature vector xT = (0, 1).
[3 marks] Writing the feature vector in augmented notation: yT = (1, 0, 1)
gj(x) = ajT y. Therefore,
g1(x) = (1, 0.5, 0.5)(1, 0, 1)T = 1.5
g2(x) = (−1, 2, 2)(1, 0, 1)T = 1
g3(x) = (2, −1, −1)(1, 0, 1)T = 1
The sample belongs to the class of the linear discriminant function that has the maximum response, so xT = (0, 1) is predicted to be in class 1.
Marking scheme 3 marks.
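A quick numerical check of this calculation (a sketch; y is the augmented vector (1, x1, x2)):

    a = [(1, 0.5, 0.5), (-1, 2, 2), (2, -1, -1)]      # a1, a2, a3
    y = (1, 0, 1)                                      # augmented form of x = (0, 1)

    g = [sum(ai * yi for ai, yi in zip(aj, y)) for aj in a]
    print(g)                                           # [1.5, 1, 1]
    print("predicted class:", g.index(max(g)) + 1)     # 1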
Table 2, below, shows a dataset consisting of five exemplars that come from three classes:
Class   Feature Vector, xT
1       (0, 1)
1       (1, 0)
2       (0.5, 0.5)
2       (1, 1)
3       (0, 0)
c. Do the linear discriminant functions defined in sub-question 2.b correctly
classify the data shown in Table 2? Justify your answer.
No. For justification, either:
(A) calculate the predicted class for each sample (until a mis-classified exemplar is found):

Feature Vector, xT   g1(x)   g2(x)   g3(x)   predicted class
(0, 1)               1.5     1       1       1
(1, 0)               1.5     1       1       1
(0.5, 0.5)           1.5     1       1       1
(1, 1)               2       3       0       2
(0, 0)               1       -1      2       3

and note that the third sample is mis-classified, so the linear discriminant functions do not correctly classify the data.
(B) plot the data in feature space, and note that, as linear discriminant functions define decision boundaries that are straight lines, and the data is not linearly separable (sample 3 cannot be separated from both samples 1 and 2 using a straight line), the linear discriminant functions cannot correctly classify the data.
Marking scheme
1 mark for correct answer, 3 marks for justification.
d. It is suggested that a quadratic discriminant function of the form g(x) = w0 + Σi wi xi + Σi,j wij xi xj could successfully classify the data. To learn such a quadratic discriminant function we can learn a linear discriminant function in an expanded feature space. Re-write Table 2 showing the new feature vectors in this expanded feature space.
The feature space needs to be expanded by adding an element equal to x1 × x2 to each feature vector:

Class   Feature Vector, xT
1       (0, 1, 0)
1       (1, 0, 0)
2       (0.5, 0.5, 0.25)
2       (1, 1, 1)
3       (0, 0, 0)

One could alternatively (or additionally) add elements equal to x1² or x2², but these do not make the data linearly separable.
Marking scheme 2 marks.
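A one-line mapping implements this expansion (a sketch, assuming the Table 2 feature vectors above):

    def expand(x1, x2):
        # Quadratic expansion used above: append the cross term x1*x2.
        return (x1, x2, x1 * x2)

    table2 = [(0, 1), (1, 0), (0.5, 0.5), (1, 1), (0, 0)]
    print([expand(*x) for x in table2])
    # [(0, 1, 0), (1, 0, 0), (0.5, 0.5, 0.25), (1, 1, 1), (0, 0, 0)]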
e. Apply the sequential multiclass perceptron learning algorithm to find the parameters for three linear discriminant functions that will correctly classify the data in the expanded feature space defined in answer to sub-question 2.d. Assume that the initial parameters of these three discriminant functions, in augmented notation, are:
aT1 = (1, 0.5, 0.5, −0.75), aT2 = (−1, 2, 2, 1), aT3 = (2, −1, −1, 1). Use a learning rate of 1. If more than one discriminant function produces the maximum output, choose the one with the lowest index (i.e., the one that represents the smallest class label).
Sequential Learning algorithm:
• Initialise aj for each class.
• For each exemplar (yk,ωk) in turn
– Find the predicted class j = arg max_j' (aj'T yk)
– If the predicted class is not the true class (i.e., j ≠ ωk), update the weights:
aωk = aωk + η yk
aj = aj − η yk
• repeat until weights stop changing
Applying the algorithm (current parameters a1T, a2T, a3T; exemplar presented, shown as the expanded feature vector; responses gi(x) = aiT y; true class ω):

a1T                    a2T                     a3T              yT                  g1       g2        g3      ω
(1, 0.5, 0.5, -0.75)   (-1, 2, 2, 1)           (2, -1, -1, 1)   (0, 1, 0)           1.5      1         1       1
(1, 0.5, 0.5, -0.75)   (-1, 2, 2, 1)           (2, -1, -1, 1)   (1, 0, 0)           1.5      1         1       1
(1, 0.5, 0.5, -0.75)   (-1, 2, 2, 1)           (2, -1, -1, 1)   (0.5, 0.5, 0.25)    1.3125   1.25      1.25    2
(0, 0, 0, -1)          (0, 2.5, 2.5, 1.25)     (2, -1, -1, 1)   (1, 1, 1)           -1       6.25      1       2
(0, 0, 0, -1)          (0, 2.5, 2.5, 1.25)     (2, -1, -1, 1)   (0, 0, 0)           0        0         2       3
(0, 0, 0, -1)          (0, 2.5, 2.5, 1.25)     (2, -1, -1, 1)   (0, 1, 0)           0        2.5       1       1
(1, 0, 1, -1)          (-1, 2.5, 1.5, 1.25)    (2, -1, -1, 1)   (1, 0, 0)           1        1.5       1       1
(2, 1, 1, -1)          (-2, 1.5, 1.5, 1.25)    (2, -1, -1, 1)   (0.5, 0.5, 0.25)    2.75     -0.1875   1.25    2
(1, 0.5, 0.5, -1.25)   (-1, 2, 2, 1.5)         (2, -1, -1, 1)   (1, 1, 1)           0.75     4.5       1       2
(1, 0.5, 0.5, -1.25)   (-1, 2, 2, 1.5)         (2, -1, -1, 1)   (0, 0, 0)           1        -1        2       3
(1, 0.5, 0.5, -1.25)   (-1, 2, 2, 1.5)         (2, -1, -1, 1)   (0, 1, 0)           1.5      1         1       1
(1, 0.5, 0.5, -1.25)   (-1, 2, 2, 1.5)         (2, -1, -1, 1)   (1, 0, 0)           1.5      1         1       1
(1, 0.5, 0.5, -1.25)   (-1, 2, 2, 1.5)         (2, -1, -1, 1)   (0.5, 0.5, 0.25)    1.1875   1.375     1.25    2
Learning has converged, so the required parameters are a1T = (1, 0.5, 0.5, −1.25), a2T = (−1, 2, 2, 1.5), a3T = (2, −1, −1, 1).
Marking scheme
4 marks for correct method, 3 marks for correct application.
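A compact Python sketch of the sequential multiclass perceptron applied to this data (expanded, augmented vectors; ties broken towards the lowest class index). It is a sketch of the method rather than part of the model answer, but it reproduces the converged parameters above:

    # Expanded, augmented feature vectors y = (1, x1, x2, x1*x2) and true class labels.
    Y = [(1, 0, 1, 0), (1, 1, 0, 0), (1, 0.5, 0.5, 0.25), (1, 1, 1, 1), (1, 0, 0, 0)]
    labels = [1, 1, 2, 2, 3]

    a = [[1, 0.5, 0.5, -0.75], [-1, 2, 2, 1], [2, -1, -1, 1]]   # initial a1, a2, a3
    eta = 1.0

    changed = True
    while changed:
        changed = False
        for y, true_cls in zip(Y, labels):
            g = [sum(ai * yi for ai, yi in zip(aj, y)) for aj in a]
            pred = g.index(max(g)) + 1                 # argmax; ties go to the lowest index
            if pred != true_cls:
                a[true_cls - 1] = [ai + eta * yi for ai, yi in zip(a[true_cls - 1], y)]
                a[pred - 1] = [ai - eta * yi for ai, yi in zip(a[pred - 1], y)]
                changed = True
    print(a)   # a1 = (1, 0.5, 0.5, -1.25), a2 = (-1, 2, 2, 1.5), a3 = (2, -1, -1, 1)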
f. Write pseudo-code for the sequential Widrow-Hoff learning algorithm
when applied to learning a discriminant function for a two-class problem.
• Initialise a to an arbitrary solution.
• For each exemplar (yk, ωk) in turn
– Update solution: a ← a + η (bk − aT yk) yk
• Repeat until Σk |(bk − aT yk) yk| < θ
Marking scheme 5 marks.
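A minimal Python sketch of this procedure; the margin vector b, the sample data, the learning rate and the threshold θ are illustrative assumptions rather than values from the question:

    import math

    def widrow_hoff(Y, b, eta=0.1, theta=1e-3, max_epochs=1000):
        # Sequential Widrow-Hoff (LMS): a <- a + eta * (b_k - a.y_k) * y_k
        a = [0.0] * len(Y[0])                     # arbitrary initial solution
        for _ in range(max_epochs):
            total = 0.0
            for yk, bk in zip(Y, b):
                err = bk - sum(ai * yi for ai, yi in zip(a, yk))
                a = [ai + eta * err * yi for ai, yi in zip(a, yk)]
                total += abs(err) * math.sqrt(sum(yi * yi for yi in yk))
            if total < theta:                     # stop when sum_k |(b_k - a.y_k) y_k| < theta
                break
        return a                                  # otherwise stop after max_epochs

    # Hypothetical two-class data in augmented notation, with class-2 samples negated.
    Y = [(1, 0, 1), (1, 1, 0), (-1, -0.5, -0.5), (-1, -1, -1)]
    b = [1.0, 1.0, 1.0, 1.0]
    print(widrow_hoff(Y, b))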
May 2019 3.
a. A linear threshold neuron has a threshold θ = −2 and weights w1 = −1 and w2 = 3. The activation function is the Heaviside function. Calculate the output of this neuron when the input is x = (2, 0.5)T .
[2 marks]
Using augmented notation, y = H(wx) where w = [−θ, w1, w2] and x = [1, x1, x2]T. Hence, y = H((2, −1, 3)(1, 2, 0.5)T) = H(1.5) = 1.
Marking scheme 2 marks.
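The same calculation as a short Python sketch:

    def heaviside(v):
        return 1 if v >= 0 else 0

    w = (2, -1, 3)      # augmented weights (-theta, w1, w2) with theta = -2
    x = (1, 2, 0.5)     # augmented input (1, x1, x2)
    print(heaviside(sum(wi * xi for wi, xi in zip(w, x))))   # 1, since wx = 1.5 >= 0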
b. Write pseudo-code for updating the parameters of a single linear threshold neuron using the sequential delta learning algorithm.
1. Initialise w
2. For each exemplar, x:
3. Calculate the neuron’s output: y = H(wx) where H is the heaviside function.
4. Update weights such that: w ← w + η(t − y)xT, where t is the desired output for the current exemplar.
5. Repeat from step 2 until w stops changing.
Marking scheme 4 marks
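A minimal Python sketch of this algorithm; the training set (an AND gate) and the learning rate are illustrative assumptions, not data from the question:

    def heaviside(v):
        return 1 if v >= 0 else 0

    # Hypothetical exemplars in augmented form (1, x1, x2) with AND-gate targets t.
    data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
    w = [0.0, 0.0, 0.0]
    eta = 1.0

    changed = True
    while changed:
        changed = False
        for x, t in data:
            y = heaviside(sum(wi * xi for wi, xi in zip(w, x)))          # step 3: neuron output
            if y != t:
                w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]    # step 4: delta rule update
                changed = True
    print(w)   # a separating weight vector, e.g. [-3.0, 2.0, 1.0]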
c. Briefly explain what is meant by a competitive neural network, and
describe two methods for implementing the competition.
In a competitive neural network the neurons compete for the right to respond to the input (in other words the activity of some neurons is suppressed by other neurons). This can be achieved using inhibitory lateral weights between the neurons, or by using a selection process that chooses the “winning” neuron(s).
Marking scheme
2 marks for correctly defining a competitive neural network, 1 mark for each method of implementing competition.
d. A negative feedback network has three inputs and two output neurons, that are connected with weights
W = [ 1 1 0
      1 1 1 ]
Determine the activation of the output neurons after 4 iterations when the input is x = (1, 1, 0)T, assuming that the output neurons are updated using parameter α = 0.25, and the activations of the output neurons are initialised to be all zero.
The activation of a negative feedback network is determined by iteratively evaluating the following equations:
e = x − WT y
y ← y + αWe

iteration   eT                        (We)T             yT               (WT y)T
1           (1, 1, 0)                 (2, 2)            (0.5, 0.5)       (1, 1, 0.5)
2           (0, 0, -0.5)              (0, -0.5)         (0.5, 0.375)     (0.875, 0.875, 0.375)
3           (0.125, 0.125, -0.375)    (0.25, -0.125)    (0.563, 0.344)   (0.906, 0.906, 0.344)
4           (0.094, 0.094, -0.344)    (0.188, -0.156)   (0.609, 0.305)   (0.914, 0.914, 0.305)

So the output after 4 iterations is y ≈ (0.609, 0.305)T.
Marking scheme
4 marks for correct equations, 2 marks for correct application.
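A short Python check of these equations, which reproduces the iteration values above:

    W = [[1, 1, 0],
         [1, 1, 1]]
    x = [1, 1, 0]
    alpha = 0.25
    y = [0.0, 0.0]

    for step in range(4):
        # e = x - W^T y
        e = [xi - sum(W[r][i] * y[r] for r in range(2)) for i, xi in enumerate(x)]
        # y <- y + alpha * W e
        y = [yr + alpha * sum(W[r][i] * e[i] for i in range(3)) for r, yr in enumerate(y)]
        print(step + 1, [round(v, 3) for v in y])
    # after 4 iterations, y is approximately (0.609, 0.305)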
e. Write down the objective function used in sparse coding, and explain the role played by each of the components of the function in representing data efficiently.
The sparse coding objective function is given as:
min_y ||x − VT y||2 + λ ||y||0
In this, the first component of the function is the L2-norm of the error between a data point x and its reconstruction using the dictionary (or weights) VT and sparse coefficients y. This component measures how well the dictionary and coefficients represent the original data. The second component measures the sparsity of the coefficient vector, i.e., the number of non-zero elements, with the objective function preferring sparser vectors; λ trades off reconstruction accuracy against sparsity.
Marking scheme
2 marks for correct objective function, 1 mark for explanation of each component.
f. Given a dictionary, VT , what is the best sparse code for the signal x out of the following two alternatives:
i) yT = (1,2,0,−1)
ii) yT = (0, 0.5, 1, 0)
where VT = [  1  1  2  1
             -4  3  2 -1 ]
and x = (2, 3)T. Assume that
sparsity is measured as the count of elements that are non-zero. Use a trade-off parameter of λ = 1 in order to calculate the costs.
[5 marks]
Computing the cost ||x − VT y||2 + λ||y||0 for each set of coefficients gives the following results:
For yT = (1, 2, 0, -1):
1. The reconstruction is VT y = (1×1 + 1×2 + 2×0 + 1×(-1), -4×1 + 3×2 + 2×0 + (-1)×(-1))T = (2, 3)T
2. The L2-norm of the reconstruction error is ||(2, 3)T − (2, 3)T||2 = √(0² + 0²) = 0
3. The sparsity is ||y||0 = 3
4. The cost is 0 + 1 × 3 = 3

For yT = (0, 0.5, 1, 0):
1. The reconstruction is VT y = (1×0 + 1×0.5 + 2×1 + 1×0, -4×0 + 3×0.5 + 2×1 + (-1)×0)T = (2.5, 3.5)T
2. The L2-norm of the reconstruction error is ||(2, 3)T − (2.5, 3.5)T||2 = √(0.5² + 0.5²) = 0.7071
3. The sparsity is ||y||0 = 2
4. The cost is 0.7071 + 1 × 2 = 2.7071
The cost is lower for the second set of coefficients, so yT = (0, 0.5, 1, 0) is the better sparse code.
Marking scheme 5 marks.
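The cost comparison as a Python sketch, using the dictionary and candidate codes from the question:

    import math

    Vt = [[1, 1, 2, 1],
          [-4, 3, 2, -1]]     # dictionary V^T
    x = [2, 3]
    lam = 1.0

    def cost(y):
        recon = [sum(Vt[r][c] * y[c] for c in range(4)) for r in range(2)]   # V^T y
        err = math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)))     # ||x - V^T y||_2
        return err + lam * sum(1 for v in y if v != 0)                       # + lambda * ||y||_0

    print(cost([1, 2, 0, -1]))    # 3.0
    print(cost([0, 0.5, 1, 0]))   # about 2.7071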
May 2019 4.
a. Figure 1 shows a decision tree for a binary classification problem where the classes are denoted as ω1 for class 1 and ω2 for class 2. The sample representation is x = (x1, x2)T.
Figure 1: A decision tree.
i. Given a sample x = (…, 7)T, predict its class.
Answer
The predicted class is ω1.
ii. Sketch the decision boundary given by the decision tree where the
values of x1 and x2 in the axes should be shown. Answer
[Sketch: decision regions in the (x1, x2) plane, axes from 0 to 10, with decision boundary segments at x1 = 6, x1 = 8 and x2 = 9.]
Marking scheme
1 mark for each side.
b. Figure 2 shows a 3-layer partially connected feed-forward neural network. A linear function is used as the activation function in all the input, hidden and output units. The training dataset is given in Table 3, where the sample representation is in the form x = (x1, x2)T, the output representation is in the form z = (z1, z2)T and its target representation is in the form t = (t1, t2)T. Batch backpropagation is employed to perform training using the cost J = Σ_{p=1}^{N} Jp, where Jp = ½||tp − zp||²; N denotes the number of training samples; zp is the network output due to the p-th sample and tp denotes its target. Assume that only w11 is updated. Given the learning rate η = 0.1, determine the updated value of w11 at the next iteration.
[15 marks]
Figure 2: A diagram of the neural network (input, hidden and output layers; the labelled weights include w11 = 1, m11 = 5 and m22 = 7).
x (input)   t (target output)
(-1, 2)T    (-6, 8)T
(-3, 4)T    (-2, 4)T
Table 3: Input samples x and their target outputs t.
When w11 is updated, only the paths in the following figure will be affected.
[Diagram: the affected path runs from input x1 through w11 = 1 to the hidden unit y1, and from y1 through m11 = 5 to output z1 and through m21 to output z2.]
Denote the net function in the hidden layer as nety1, associated with y1, and in the output layer as netz1 and netz2, associated with z1 and z2 respectively, where the activation function f(·) is a linear function.
y1 = f(nety1) = nety1 = w11x1 + w10
z1 = f(netz1) = netz1 = m11y1
z2 = f(netz2) = netz2 = m21y1 + m22y2 + m20
Cost: J = J1 + J2, where J1 = ½((t1(1) − z1(1))² + (t2(1) − z2(1))²) and J2 = ½((t1(2) − z1(2))² + (t2(2) − z2(2))²); ∂J/∂w11 = ∂J1/∂w11 + ∂J2/∂w11.
∂Jp/∂w11 = (∂Jp/∂y1)(∂y1/∂nety1)(∂nety1/∂w11)
= [Σ_{k=1}^{2} (∂Jp/∂zk)(∂zk/∂y1)] (∂y1/∂nety1)(∂nety1/∂w11)
= −[Σ_{k=1}^{2} (tk(p) − zk(p))(∂zk/∂y1)] (∂y1/∂nety1)(∂nety1/∂w11)
= −[(t1(p) − z1(p)) m11 + (t2(p) − z2(p)) m21] x1(p);  p = 1, 2.
Network Outputs:
y1 = w11x1 + w10 = x1 + 4
y2 = w21x1 + w22x2 = 2x1 + 3x2
z1 = m11y1
= 5(x1 + 4) = 5x1 + 20
z2 = m21y1 + m22y2 + m20
= 6y1 + 7y2 + 8 = 6(x1 + 4) + 7(2x1 + 3x2) + 8
= 20x1 + 21x2 + 32
Considering the first sample (p = 1): x(1) = (−1, 2)T and its target t(1) = (−6, 8)T, we have
z1(1) = 5x1 + 20 = 5 × (−1) + 20 = 15
z2(1) = 20x1 + 21x2 + 32 = 20 × (−1) + 21 × 2 + 32 = 54
∂J1/∂w11 = −[(t1(1) − z1(1)) m11 + (t2(1) − z2(1)) m21] x1(1)
= −[(−6 − 15) × 5 + (8 − 54) × 6] × (−1)
= −[(−21) × 5 + (−46) × 6] × (−1) = −381
(4 marks)
Considering the second sample (p = 2): x(2) = (−3, 4)T and its target t(2) = (−2, 4)T, we have
z1(2) = 5x1 + 20 = 5 × (−3) + 20 = 5
z2(2) = 20x1 + 21x2 + 32 = 20 × (−3) + 21 × 4 + 32 = 56
∂J2/∂w11 = −[(t1(2) − z1(2)) m11 + (t2(2) − z2(2)) m21] x1(2)
= −[(−2 − 5) × 5 + (4 − 56) × 6] × (−3)
= −[(−35) + (−312)] × (−3) = −1041
Update of w11: learning rate η = 0.1
w11 ← w11 − η ∂J/∂w11
= 1 − 0.1 × (∂J1/∂w11 + ∂J2/∂w11)
= 1 − 0.1 × (−381 − 1041) = 143.2
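A short numerical check of this update in Python; the weight values used are the ones implied by the working above (w10 = 4, w21 = 2, w22 = 3, m21 = 6, m20 = 8 follow from the expressions for y1, y2, z1 and z2):

    samples = [((-1, 2), (-6, 8)), ((-3, 4), (-2, 4))]   # (x, t) pairs from Table 3
    w11, w10, w21, w22 = 1.0, 4.0, 2.0, 3.0
    m11, m21, m22, m20 = 5.0, 6.0, 7.0, 8.0
    eta = 0.1

    grad = 0.0
    for (x1, x2), (t1, t2) in samples:
        y1 = w11 * x1 + w10               # hidden unit on the w11 path (linear activation)
        y2 = w21 * x1 + w22 * x2          # hidden unit not affected by w11
        z1 = m11 * y1
        z2 = m21 * y1 + m22 * y2 + m20
        grad += -((t1 - z1) * m11 + (t2 - z2) * m21) * x1   # dJp/dw11

    print(grad)               # -1422.0
    print(w11 - eta * grad)   # 143.2 (up to floating-point rounding)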
May 2019 5.
a. Draw the diagram of a Fuzzy Inference System with labels. Briefly describe each component.
[Diagram: Input (crisp) → Fuzzifier → fuzzy input → Fuzzy Inference Engine (consulting the Knowledge Base) → fuzzy output → Defuzzifier → Output (crisp).]
• Fuzzifier: It maps the crisp (real-valued) input into a fuzzy set defined in the universe of discourse X, characterised by membership functions. This process is called fuzzification. 1 mark
• Knowledge Base: It is a database consisting of linguistic rules in IF-THEN format. 1 mark
• Fuzzy Inference Engine: Using the IF-THEN rules in the Knowledge Base, it performs reasoning by producing a fuzzy output according to the fuzzy input given by the fuzzifier. 1 mark
• Defuzzifier: It converts the fuzzy output given by the fuzzy inference engine into a crisp (real-valued) output. This process is called defuzzification. 1 mark
b. A fuzzy inference system using the max-product inference method has 2 rules as shown below:
Rule 1: IF x is Zero THEN