May 2018 1.
—— SOLUTIONS ——
a. Explain how Bayes decision rule for minimum error is used to determine the class of an exemplar.
[4 marks]
Answer
Calculate P(ωj|x) for each class ωj. Classify exemplar x in the class ωi for which P(ωi|x) > P(ωj|x) for all j ≠ i.
Marking scheme
2 marks for knowing the need to calculate the posterior. 2 marks for knowing how to choose the class.
Table 1, below, shows samples that have been recorded from three univariate class-conditional probability distributions:
Table 1:
class    samples of x
ω1       3.54  4.83  0.74  3.86  3.32  1.69  2.57  3.34  6.58
ω2       15.85  4.75  17.17  5.63  1.68  5.57  0.98
ω3       3.75  4.00  4.53  5.34  1.59
b. Apply kn nearest neighbours to the data shown in Table 1 in order to estimate the density of p(x|ω1) at location x = 4, using k = 3.
Answer
The closest three samples to x = 4 are at 3.86, 3.54, and 3.34. We need to make a "volume" of radius |4 − 3.34| = 0.66 (diameter 1.32) around x = 4 to encompass these three samples.
Therefore the density is:
p(x = 4|ω1) = (k/n)/V = (3/9)/1.32 = 0.2525
Marking scheme
2 marks for correct method, 2 marks for correct application.
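For illustration, a minimal Python/NumPy sketch of this kn-nearest-neighbour density estimate; the helper name knn_density and the code layout are illustrative, not part of the question:

import numpy as np

def knn_density(samples, x, k):
    """k-NN density estimate: p(x) = (k/n) / V, where V is the length of the
    smallest interval centred on x that contains the k nearest samples."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    distances = np.sort(np.abs(samples - x))
    radius = distances[k - 1]        # distance to the k-th nearest sample
    volume = 2.0 * radius            # in 1-D the "volume" is the interval length
    return (k / n) / volume

omega1 = [3.54, 4.83, 0.74, 3.86, 3.32, 1.69, 2.57, 3.34, 6.58]
print(knn_density(omega1, x=4.0, k=3))    # -> 0.2525...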
c. The class-conditional probability distribution for class 2 in Table 1, p(x|ω2), is believed to be a Normal distribution. Calculate the parameters of this Normal distribution using Maximum-Likelihood Estimation; use the “unbiased” estimate of the covariance.
Answer
The Maximum-Likelihood estimate of the mean is the arithmetic mean of the sample values. Therefore,
μ̂ = (15.85 + 4.75 + 17.17 + 5.63 + 1.68 + 5.57 + 0.98)/7 = 7.3757
The Maximum-Likelihood "unbiased" estimate of the covariance for univariate data is:
σ̂² = (1/(n − 1)) Σ_{k=1..n} (xk − μ̂)²
Therefore,
σ̂² = (1/6)[(15.85 − 7.3757)² + (4.75 − 7.3757)² + (17.17 − 7.3757)² + (5.63 − 7.3757)² + (1.68 − 7.3757)² + (5.57 − 7.3757)² + (0.98 − 7.3757)²] = 42.3817
Marking scheme
2 marks for correct methods, 2 marks for correct application.
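For illustration, a short Python/NumPy check of these Maximum-Likelihood estimates; the variable names are illustrative, and ddof=1 gives the "unbiased" divide-by-(n − 1) estimate:

import numpy as np

omega2 = np.array([15.85, 4.75, 17.17, 5.63, 1.68, 5.57, 0.98])

mu_hat = omega2.mean()          # ML estimate of the mean: 7.3757...
var_hat = omega2.var(ddof=1)    # "unbiased" variance: divide by n - 1 -> 42.38...

print(mu_hat, var_hat)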
d. For the data in Table 1, use Parzen-window density estimation to calculate the density of p(x|ω3) at the location x = 4. Use a Gaussian window with width hn = 2. For a Gaussian Parzen window the estimate of the probability density, pn, at a value, x, is given by the formula:
pn(x) = (1/n) Σ_{i=1..n} (1/hn) φ((x − xi)/hn)
Answer
The estimate of the probability density is:
pn(x|ω3) = (1/n) Σ_{i=1..n} (1/hn) φ((x − xi)/hn)
For a Gaussian window: φ(u) = (1/√(2π)) exp(−0.5u²)
For this question, n = 5 and hn = 2, so:
p(x = 4|ω3) = (1/5) Σ_{i=1..5} (1/(2√(2π))) exp(−0.5((4 − xi)/2)²)
            = (1/(10√(2π))) [exp(−0.5((4 − 3.75)/2)²) + exp(−0.5((4 − 4)/2)²) + ··· + exp(−0.5((4 − 1.59)/2)²)]
            = (1/(10√(2π))) [0.9922 + 1.0000 + 0.9655 + 0.7990 + 0.4838] = 0.1692
Marking scheme
2 marks for correct method, 2 marks for correct application.
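For illustration, a minimal Python/NumPy sketch of this Gaussian Parzen-window estimate; the helper name parzen_gaussian is illustrative:

import numpy as np

def parzen_gaussian(samples, x, h):
    """Parzen-window estimate p(x) = (1/n) * sum_i (1/h) * phi((x - x_i)/h),
    with phi the standard normal density."""
    samples = np.asarray(samples, dtype=float)
    u = (x - samples) / h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return np.mean(phi / h)

omega3 = [3.75, 4.00, 4.53, 5.34, 1.59]
print(parzen_gaussian(omega3, x=4.0, h=2.0))    # -> 0.1692...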
e. Using the number of samples given in Table 1 estimate the prior probabilities of each class, i.e., calculate P(ω1), P(ω2) and P(ω3).
Marking scheme
1 mark for each correct answer.
P(ω1) = 9/21 = 0.4286
P(ω2) = 7/21 = 0.3333
P(ω3) = 5/21 = 0.2381
f. Use the answers to sub-questions 1.b, 1.c, 1.d and 1.e to determine the class of a new sample with value x = 4.
Answer
P(ωj|x) = p(x|ωj)P(ωj)/p(x)
From sub-question 1.c, p(x|ω2) is given by N(7.3757, 42.3817), so
p(x = 4|ω2) = (1/√(2π × 42.3817)) exp(−(4 − 7.3757)²/(2 × 42.3817)) = 0.0536
Ignoring p(x):
P(ω1|x = 4) ∝ p(x = 4|ω1)P(ω1) = 0.2525 × 0.4286 = 0.1082
P(ω2|x = 4) ∝ p(x = 4|ω2)P(ω2) = 0.0536 × 0.3333 = 0.0179
P(ω3|x = 4) ∝ p(x = 4|ω3)P(ω3) = 0.1692 × 0.2381 = 0.0403
Therefore the most likely class for the sample x = 4 is ω1.
Marking scheme
6 marks. 3 for correctly calculating the P(ωj|x) values, 2 for correctly calculating p(x = 4|ω2), 1 for choosing the correct class label.
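For illustration, a short Python sketch that combines the likelihoods from sub-questions 1.b–1.d with the priors from 1.e exactly as above; the dictionary keys w1, w2, w3 stand for ω1, ω2, ω3 and are an illustrative encoding:

# Class-conditional densities at x = 4 from 1.b-1.d and priors from 1.e.
likelihoods = {"w1": 0.2525, "w2": 0.0536, "w3": 0.1692}
priors      = {"w1": 9 / 21, "w2": 7 / 21, "w3": 5 / 21}

# Bayes rule: P(wj|x) is proportional to p(x|wj) * P(wj); the evidence p(x)
# is the same for every class and can be ignored when comparing.
scores = {c: likelihoods[c] * priors[c] for c in likelihoods}
print(scores)                          # {'w1': 0.1082..., 'w2': 0.0178..., 'w3': 0.0402...}
print(max(scores, key=scores.get))     # -> 'w1'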
May 2018 2.
a. Give a brief definition of each of the following terms in the context of classification:
i. Supervised Learning
ii. Generalisation
i) Supervised Learning: methods of finding patterns in data when the
dataset includes class information for each exemplar.
ii) Generalization: how well a model/classifier performs on new data (i.e. data not used for training).
Marking scheme
2 marks for each correct definition.
b. A dichotomizer is defined using the following linear discriminant function: g(x) = wT x + w0, where wT = (−1, −2) and w0 = 3. Using this linear discriminant function, determine the class predicted for the feature vector xT = (2.5, 1.5).
Answer
g(x) = wT x + w0 = (−1, −2)(2.5, 1.5)T + 3 = −2.5 − 3 + 3 = −2.5
Classification rule is that x is assigned to class 1 if g(x) > 0, and x is assigned to class 2 otherwise. So exemplar is predicted to be in class 2.
Marking scheme
2 marks for correct method, 2 marks for correct application.
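For illustration, a minimal Python/NumPy sketch of this dichotomizer; the variable names are illustrative:

import numpy as np

w  = np.array([-1.0, -2.0])     # wT
w0 = 3.0
x  = np.array([2.5, 1.5])

g = w @ x + w0                  # g(x) = wT x + w0 = -2.5 - 3 + 3 = -2.5
print(g)
print("class 1" if g > 0 else "class 2")    # g(x) <= 0, so class 2
# The decision boundary g(x) = 0 is the line x2 = 1.5 - 0.5*x1 (sub-question 2.c).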
c. Sketch the decision boundary defined by the linear discriminant function specified in sub-question 2.b.
Answer
[Sketch: the decision boundary g(x) = 0 is the line x2 = 1.5 − 0.5 x1, drawn on axes running from −8 to 8.]
Marking scheme
2 marks for a slope of approximately −0.5 (= −w1/w2 = −(−1)/(−2)), 2 marks for an intercept of approximately 1.5 (= −w0/w2 = −3/(−2)).
d. Table 2 shows a linearly separable dataset.

Table 2:
xT          (2,−1)  (−1,0)  (0,0)  (1,1)  (0,−1)
class (t)     0       1      1      0       1
i. Re-write the dataset shown in Table 2 using augmented vector notation.
(1,2,−1) (1,−1,0) (1,0,0) (1,1,1) (1,0,−1)
Marking scheme
ii. A linear threshold neuron has a transfer function which is a linear weighted sum of its inputs and an activation function that applies a threshold (θ) and the Heaviside function. For a linear threshold neuron with parameter values θ = −1, w1 = 3 and w2 = 0.5, calculate the output of the neuron in response to each of the samples in the dataset given in Table 2.
Answer
Using augmented vector notation, the output is y = H(wx), where w = [−θ, w1, w2] and x = [1, x1, x2]T. Here, w = [1, 3, 0.5].
For each sample in the dataset, the response is:

xT          wx     y = H(wx)
(1,2,−1)    6.5    1
(1,−1,0)   −2.0    0
(1,0,0)     1.0    1
(1,1,1)     4.5    1
(1,0,−1)    0.5    1
Marking scheme
2 marks for correct method, 2 marks for correct application
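For illustration, a minimal Python/NumPy sketch of this linear threshold neuron applied to the augmented samples; np.heaviside is used for H(·) with the value at 0 taken as 1, which does not affect these samples:

import numpy as np

theta, w1, w2 = -1.0, 3.0, 0.5
w = np.array([-theta, w1, w2])            # augmented weights [-theta, w1, w2] = [1, 3, 0.5]

X = np.array([[1,  2, -1],                # augmented samples [1, x1, x2] from Table 2
              [1, -1,  0],
              [1,  0,  0],
              [1,  1,  1],
              [1,  0, -1]], dtype=float)

wx = X @ w                                # weighted sums
y  = np.heaviside(wx, 1.0)                # Heaviside activation
print(wx)   # [ 6.5 -2.   1.   4.5  0.5]
print(y)    # [1. 0. 1. 1. 1.]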
iii. One method for learning the parameters of a linear threshold neuron is the sequential delta learning algorithm. Give the learning rule for this algorithm, and apply the learning algorithm to find the parameters of a linear threshold neuron that will correctly classify the data in Table 2. Assume initial values of θ = −1, w1 = 3 and w2 = 0.5, and a learning rate of 1.
The initial weight vector is w = [1, 3, 0.5]. In the sequential delta learning algorithm, the learning rule is w ← w + η(t − y)xT. Here η = 1.
Iteration   w             xT          wx     y   t   w ← w + η(t − y)xT
0           [1,3,0.5]     [1,2,−1]    6.5    1   0   [0,1,1.5]
1           [0,1,1.5]     [1,−1,0]   −1.0    0   1   [1,0,1.5]
2           [1,0,1.5]     [1,0,0]     1.0    1   1   [1,0,1.5]
3           [1,0,1.5]     [1,1,1]     2.5    1   0   [0,−1,0.5]
4           [0,−1,0.5]    [1,0,−1]   −0.5    0   1   [1,−1,−0.5]
5           [1,−1,−0.5]   [1,2,−1]   −0.5    0   0   [1,−1,−0.5]
6           [1,−1,−0.5]   [1,−1,0]    2.0    1   1   [1,−1,−0.5]
7           [1,−1,−0.5]   [1,0,0]     1.0    1   1   [1,−1,−0.5]
8           [1,−1,−0.5]   [1,1,1]    −0.5    0   0   [1,−1,−0.5]
9           [1,−1,−0.5]   [1,0,−1]    1.5    1   1   [1,−1,−0.5]

Learning has converged at weights w = [1, −1, −0.5].
Marking scheme
2 marks for correct learning rule, and 5 marks for correct method.
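For illustration, a minimal Python/NumPy sketch of the sequential delta learning algorithm on this dataset; the epoch loop and convergence test are an assumed implementation detail, and taking the Heaviside output as 0 at exactly 0 does not affect these samples:

import numpy as np

# Augmented samples [1, x1, x2] and targets t (the t column of the table above).
X = np.array([[1,  2, -1],
              [1, -1,  0],
              [1,  0,  0],
              [1,  1,  1],
              [1,  0, -1]], dtype=float)
t = np.array([0, 1, 1, 0, 1], dtype=float)

w   = np.array([1.0, 3.0, 0.5])           # initial weights [-theta, w1, w2]
eta = 1.0                                  # learning rate

for epoch in range(20):                    # repeated passes over the data
    changed = False
    for xi, ti in zip(X, t):
        y = 1.0 if w @ xi > 0 else 0.0     # Heaviside output of the neuron
        if y != ti:
            w = w + eta * (ti - y) * xi    # sequential delta rule: w <- w + eta*(t - y)*x
            changed = True
    if not changed:                        # a full pass with no updates: converged
        break

print(w)    # -> [ 1.  -1.  -0.5]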
May 2018 3.
a. Give brief definitions of the following terms:
i. Feature selection
ii. Feature extraction
1. Feature selection – choosing a subset of features from a given dataset to use for the learning problem.
2. Feature extraction – projecting the original features from a given dataset, into a new feature space.
Marking scheme
2 marks for each correct definition
b. Principal component analysis (PCA) is a commonly used method for feature extraction in pattern recognition problems. Briefly describe what principal components are, and explain why performing PCA on a dataset before learning may increase classifier accuracy.
The principal components of a dataset are a set of linearly uncorrelated orthogonal bases, where the first principal component is the basis with the greatest variance, and subsequent bases have decreasing variance. Performing PCA allows one to find these principal components and discard those with low variance. By then projecting the dataset onto the remaining principal components, learning can be performed in a lower-dimensional space that still retains the components accounting for the majority of the variance of the dataset; removing the low-variance, often noisy, dimensions reduces the risk of overfitting and so may increase classifier accuracy.
Marking scheme
2 marks for correct explanation of principal components, 2 marks for correct explanation of their use in learning
c. Write pseudo-code for the Karhunen-Loève Transform method for PCA. [5 marks]
Given a dataset x
1. Calculate the mean of x, and subtract from the dataset.
2. Calculate the covariance matrix of the mean-zeroed dataset.
3. Find the eigenvectors and eigenvalues of the covariance matrix
4. Order the eigenvalues from largest to smallest, and discard eigenvectors
corresponding to small eigenvalues
5. Form a matrix Vˆ of the remaining principal components.
6. Project the mean-zeroed dataset onto the PCA subspace by multiplying it by the transpose of V̂.
Marking scheme
5 marks for correct method. Partial marks for partially correct answers.
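For illustration, a minimal Python/NumPy sketch of this pseudo-code; the function name kl_transform and the small 2-D dataset are made up for the example:

import numpy as np

def kl_transform(X, n_components):
    """Karhunen-Loeve transform of X (shape: n_samples x n_features)."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # 1. subtract the mean
    cov = np.cov(Xc, rowvar=False)             # 2. covariance of the zero-mean data
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]          # 4. largest eigenvalues first
    V = eigvecs[:, order[:n_components]]       # 5. keep the leading principal components
    return Xc @ V, V                           # 6. project onto the PCA subspace

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
Y, V = kl_transform(X, n_components=1)
print(V)    # direction of maximum variance
print(Y)    # data projected onto the first principal component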
d. When performing PCA using neural networks, Oja’s update rule is often used in place of standard Hebbian learning. Give Oja’s update rule, and briefly explain why this rule is preferred.
Oja’s update rule is given as: w ← w + ηy(xT − yw).
In standard Hebbian learning, the learnt weights grow without bound as the number of iterations increases (for a positive learning rate). Oja's rule controls learning by causing the weight vector to approach unit length.
Marking scheme
2 marks for correct update rule, 3 marks for correct explanation
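For illustration, a minimal Python/NumPy sketch of Oja's rule on made-up zero-mean 2-D data; the data, learning rate and number of passes are arbitrary choices for the example:

import numpy as np

rng = np.random.default_rng(0)

# Made-up zero-mean 2-D data whose leading principal component lies roughly along [2, 1].
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X -= X.mean(axis=0)

w   = rng.normal(size=2)       # random initial weight vector
eta = 0.01

for _ in range(20):                      # a few passes over the data
    for x in X:
        y = w @ x                        # linear neuron output
        w = w + eta * y * (x - y * w)    # Oja's rule: w <- w + eta*y*(x - y*w)

print(w, np.linalg.norm(w))    # w approaches +/- the first principal component, with ||w|| near 1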
e. Linear Discriminant Analysis (LDA) is a supervised projection method for finding linear combinations of features in a dataset. Briefly explain Fisher’s LDA method for estimating the weight parameters of a decision boundary to separate two classes in a dataset, and give the objective function.
Fisher’s method for defining a decision boundary finds a weight vector
that maximises the between class scatter, while minimising the within
class scatter. This is given by the objective function J(w) = sb/sw, where sb is the between-class scatter and sw is the within-class scatter.
Marking scheme
2 marks for correct definition, 1 mark for correct objective function
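For illustration, a minimal Python/NumPy sketch of Fisher's method on made-up two-class data; it uses the standard closed-form maximiser of J(w), w ∝ Sw⁻¹(m1 − m2), which is not stated explicitly in the answer above:

import numpy as np

rng = np.random.default_rng(1)

# Made-up two-class 2-D data.
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter Sw = S1 + S2.
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2

# The w maximising J(w) = sb/sw is proportional to Sw^-1 (m1 - m2).
w = np.linalg.solve(Sw, m1 - m2)
w /= np.linalg.norm(w)
print(w)    # projection direction that best separates the two classes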
f. Figure. 1 shows a dataset containing two classes of two-dimensional data, with one class labelled with crosses, and the other class labeled with circles. An axis on which the data could be projected is shown with a black arrow.
Is this projection a suitable axis for maximally discriminating between the two classes of data? Justify your answer.
This projection is not a suitable axis for maximally discriminating between the classes. Projections made along this axis will result in a large amount of overlap between the two classes. As the decision boundary is orthogonal to the projection axis, the decision boundary would run through the centre of both clusters of data instead of separating them.
Marking scheme
1 mark for correct answer, 3 marks for justification
May 2018 4.
a. Draw a diagram to show each of the following activation functions.
• Symmetric hard limit (signum)
• Linear function
• Symmetric sigmoid function
• Logarithmic sigmoid function
• Radial basis function
Answer
[Sketches of the five activation functions f(n) plotted over n from −4 to 4: symmetric hard limit (signum), linear, symmetric sigmoid, logarithmic sigmoid, and radial basis.]
Marking scheme
1 mark for each figure
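For illustration, Python/NumPy definitions of the five activation functions; the exact parameterisations (tanh for the symmetric sigmoid, exp(−n²) for the radial basis, unit slope for the linear function) are assumptions, since only the shapes are asked for:

import numpy as np

def signum(n):       return np.sign(n)                  # symmetric hard limit
def linear(n):       return n                           # linear
def sym_sigmoid(n):  return np.tanh(n)                  # symmetric sigmoid
def log_sigmoid(n):  return 1.0 / (1.0 + np.exp(-n))    # logarithmic sigmoid
def radial_basis(n): return np.exp(-n ** 2)             # radial basis (Gaussian)

n = np.linspace(-4, 4, 9)
for f in (signum, linear, sym_sigmoid, log_sigmoid, radial_basis):
    print(f.__name__, np.round(f(n), 3))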
b. Referring to the activation functions listed in Question 4.a, which activation function is not suitable for training a feedforward neural network using the backpropagation algorithm? Explain your answer.
Answer
The symmetric hard limit (signum) function is not suitable. 2 marks.
In the backpropagation algorithm, the learning rule depends on the rate of change of the output error with respect to the node inputs, which requires the derivatives of the activation functions. The symmetric hard limit (signum) function is not differentiable: its derivative is undefined at the threshold and zero elsewhere, so no useful gradient can be propagated. 3 marks.
c. When the backpropagation algorithm is employed to train a neural network, the dataset is usually partitioned into three sub-datasets. Name these three sub-datasets.
Answer
• Training set (1 mark)
• Validation set (1 mark)
• Test set (1 mark)
d. Figure 2 shows a diagram of a feedforward neural network with two inputs and two outputs. All activation functions indicated by “/” are a linear function. In the feedforward operation, Table 3 shows the input-output relationship between inputs (x1 and x2) and outputs (z1 and z2). For example, referring to Table 3, when x1 = 2 and x2 = −0.5, the neural network’s outputs are z1 = 98 and z2 = 7.5. Determine the connection weights w11, w12, w21 and w22 which will produce the results in Table 3.
[12 marks]
[Figure 2: A diagram of a 3-layer neural network. Inputs x1 and x2 connect to a hidden layer through the weights w11, w12, w21 and w22; the hidden layer connects to the outputs z1 and z2 through the weights m11 = 8, m12 = −4, m21 = 6 and m22 = 9. All activation functions ("/") are linear.]

x1    x2     z1     z2
2     −0.5   98     7.5
−4    −6    −168   −246
Table 3: Inputs x1 and x2, and outputs z1 and z2.
Answer
The output of the neural network can be expressed as
z = Wkj Wji x
where
x = [x1(1) x1(2) ··· x1(N); x2(1) x2(2) ··· x2(N)],
z = [z1(1) z1(2) ··· z1(N); z2(1) z2(2) ··· z2(N)],
Wji = [w11 w12; w21 w22],
Wkj = [m11 m12; m21 m22] = [8 −4; 6 9]
(matrices are written row by row, rows separated by semicolons).
From z = Wkj Wji x, we have Wkj⁻¹ z = Wji x ⇒ Wji = Wkj⁻¹ z xT (x xT)⁻¹.
From the table, we have x = [2 −4; −0.5 −6] and z = [98 −168; 7.5 −246]. Therefore
Wji = Wkj⁻¹ z xT (x xT)⁻¹ = [8 −4; 6 9]⁻¹ × [98 −168; 7.5 −246] × [2 −4; −0.5 −6]T × ([2 −4; −0.5 −6] [2 −4; −0.5 −6]T)⁻¹ = [5 1; −2 3]
So, w11 = 5 (1 mark), w12 = 1 (1 mark), w21 = −2 (1 mark), and w22 = 3 (1 mark).
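For illustration, a short Python/NumPy check of this calculation using the same formula Wji = Wkj⁻¹ z xT (x xT)⁻¹:

import numpy as np

Wkj = np.array([[8.0, -4.0],
                [6.0,  9.0]])        # output-layer weights m11, m12, m21, m22
x = np.array([[2.0, -4.0],
              [-0.5, -6.0]])         # inputs from Table 3, one example per column
z = np.array([[98.0, -168.0],
              [7.5, -246.0]])        # corresponding outputs

# z = Wkj Wji x  =>  Wji = Wkj^-1 z x^T (x x^T)^-1
Wji = np.linalg.inv(Wkj) @ z @ x.T @ np.linalg.inv(x @ x.T)
print(np.round(Wji, 6))              # [[ 5.  1.] [-2.  3.]]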
May 2018 5.
a. Briefly explain what a “weak classifier” is.
A weak classifier is a classifier with recognition performance that is slightly better than choosing randomly/blindly.
Marking scheme 3 marks.
b. Explain briefly the idea of “ensemble methods” for classification.
The idea of ensemble methods is to generate a collection of weak classifiers from a given dataset and combine them into a final, stronger classifier, e.g., by averaging or voting.
Marking scheme 3 marks.
c. Name two methods used for designing ensemble classifiers.
Answer
Bagging (1 mark)
Boosting (1 mark)
d. The Adaboost algorithm is given in Table 4.
Algorithm: Adaboost
Given D = {(x1, y1), (x2, y2), ..., (xn, yn)}, yi ∈ {−1, 1}, i = 1, ..., n
begin initialise kmax, W1(i) = 1/n, i = 1, ..., n
  do k ← k + 1
     ĥk = arg min_{hj∈H} Ej, where Ej = Σ_{i=1, yi≠hj(xi)}^{n} Wk(i)  (weighted error rate)†
     εk = overall weighted error rate of classifier ĥk (i.e., the minimum Ej)
     if εk > 0.5
     αk = (1/2) ln((1 − εk)/εk)
     Wk+1(i) = Wk(i) exp(−αk yi ĥk(xi)) / Zk,  i = 1, ..., n  (update Wk(i))‡
  until k = kmax
end
† Finds the classifier hk ∈ H that minimises the error εk with respect to the distribution Wk(i).
‡ Zk is a normalization factor chosen so that Wk+1(i) is a normalised distribution.
Table 4: Adaboost Algorithm.
i. Consider a classification problem with 5 samples. The dataset is denoted as D = {(x1, −1), (x2, −1), (x3, +1), (x4, +1), (x5, +1)}. Three weak classifiers, h1(x), h2(x) and h3(x), are designed where the classification results are shown in Table 5. Referring to this table, taking h1(x) and x1 as an example, the classification given by the classifier h1(x) with the input sample x1 is +1. Determine α1 by running the first iteration of the Adaboost algorithm shown in Table 4 with kmax = 3.
[14 marks]
Classifier    x1    x2    x3    x4    x5
h1(x)         +1    −1    +1    +1    +1
h2(x)         −1    +1    −1    +1    +1
h3(x)         −1    −1    +1    −1    −1
Table 5: Classification Results.
In the first iteration, i.e., k = 1, W1(i) = 1/5 for i = 1, 2, 3, 4, 5.
The overall weighted error rate for each classifier is listed below:
Classifier h1(x): E1 = 1/5 = 0.2 (due to the misclassification of x1). 2 marks
Classifier h2(x): E2 = 1/5 + 1/5 = 0.4 (due to the misclassification of x2 and x3). 2 marks
Classifier h3(x): E3 = 1/5 + 1/5 = 0.4 (due to the misclassification of x4 and x5). 2 marks
Classifier h1(x) gives the minimum overall weighted error rate, so it is chosen as the classifier ĥ1(x), i.e., ĥ1(x) = h1(x). 2 marks
ε1 = E1 = 0.2. 2 marks
α1 = (1/2) ln((1 − ε1)/ε1) = (1/2) ln(0.8/0.2) = 0.6931. 2 marks
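For illustration, a short Python/NumPy sketch of this first Adaboost iteration; the dictionary of classifier outputs is just a convenient encoding of Table 5:

import numpy as np

y = np.array([-1, -1, +1, +1, +1])              # true labels of x1..x5
H = {"h1": np.array([+1, -1, +1, +1, +1]),      # classifier outputs from Table 5
     "h2": np.array([-1, +1, -1, +1, +1]),
     "h3": np.array([-1, -1, +1, -1, -1])}

W1 = np.full(5, 1 / 5)                          # initial weights W1(i) = 1/n

E = {name: W1[h != y].sum() for name, h in H.items()}   # weighted error rates
print(E)                                        # {'h1': 0.2, 'h2': 0.4, 'h3': 0.4}

best = min(E, key=E.get)                        # h1 has the minimum weighted error
eps1 = E[best]
alpha1 = 0.5 * np.log((1 - eps1) / eps1)
print(best, eps1, alpha1)                       # -> h1 0.2 0.6931...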
ii. When the Adaboost algorithm continues, it is assumed that hˆ2(x) = h3(x), hˆ3(x) = h2(x), α2 = 0.8 and α3 = 0.9 are obtained. With the results obtained in Question 5.d.i, give the formula for this ensemble classifier.
[3 marks]
Answer
H(x) = sgn(α1ĥ1(x) + α2ĥ2(x) + α3ĥ3(x)) = sgn(0.6931 h1(x) + 0.8 h3(x) + 0.9 h2(x))
Marking scheme
May 2018 6.
a. In the context of linear support vector machines (SVMs), considering the linearly separable case, the design of a linear SVM classifier is formulated as the following optimisation problem:
min_w J(w) = (1/2)∥w∥²
subject to yi(wT xi + w0) ≥ 1,  i = 1, 2, ..., N
where w and w0 are the parameters of the linear SVM classifier; yi ∈ {−1,+1} denotes the class label of the sample xi; N is the number of samples and ∥ · ∥ denotes the Euclidean norm operator.
i. Briefly explain why "J(w) = (1/2)∥w∥²" has to be minimised.
Answer
2/∥w∥ is the margin of the hyperplane, which is to be maximised. Minimising ∥w∥ is equivalent to maximising 2/∥w∥. The norm is squared to make the analysis easier when doing calculus.
Marking scheme
4 marks. Partial marks available for partially correct answers.
ii. Briefly explain why the constraint “yi(wT xi + w0) ≥ 1” has to be satisfied.
[4 marks]
Answer
"yi(wT xi + w0) ≥ 1" can be separated into two cases: "yi(wT xi + w0) = 1" and "yi(wT xi + w0) > 1".
Considering a sample xi with yi = +1, wT xi + w0 = 1 is achieved if it is a support vector and correctly classified, which leads to yi(wT xi + w0) = 1. Considering a sample xi with yi = −1, wT xi + w0 = −1 is achieved if it is a support vector and correctly classified, which leads to yi(wT xi + w0) = 1. As a result, it can be concluded that "yi(wT xi + w0) = 1" holds when the sample xi is a support vector, lying on the margin hyperplane and correctly classified.
Considering a sample xi with yi = +1, wT xi + w0 > 1 is achieved if it is correctly classified and not a support vector, which leads to yi(wT xi + w0) > 1. Considering a sample xi with yi = −1, wT xi + w0 < −1 is achieved if it is correctly classified and not a support vector, which leads to yi(wT xi + w0) > 1. As a result, "yi(wT xi + w0) > 1" holds when the sample xi is correctly classified and lies beyond the margin. The constraint therefore requires every training sample to be correctly classified and to lie on or outside the margin.