King's College London

This paper is part of an examination of the College counting towards the award of a degree. Examinations are governed by the College Regulations under the authority of the Academic Board.

Degree Programmes: BSc, MSci
Module Code: 6CCS3PRE
Module Title: Pattern Recognition
Examination Period: May 2018 (Period 2)

Time Allowed: Three hours

Rubric: ANSWER FOUR OF SIX QUESTIONS. All questions carry equal marks. If more than four questions are answered, the answers to the first four questions in exam paper order will count. ANSWER EACH QUESTION ON A NEW PAGE OF YOUR ANSWER BOOK AND WRITE ITS NUMBER IN THE SPACE PROVIDED.

Calculators: Calculators may be used. The following models are permitted: Casio fx83 / Casio fx85.

Notes: Books, notes or other written material may not be brought into this examination.

PLEASE DO NOT REMOVE THIS PAPER FROM THE EXAMINATION ROOM

TURN OVER WHEN INSTRUCTED

© 2018 King's College London
1.
a. Explain how Bayes' decision rule for minimum error is used to determine the class of an exemplar.
[4 marks]
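For reference, the minimum-error rule in its standard form:

```latex
% Bayes decision rule for minimum error: assign x to the class with
% the largest posterior probability.
\[
  \text{Decide } \omega_i \text{ if } P(\omega_i \mid x) \ge P(\omega_j \mid x)
  \ \text{for all } j \ne i,
  \quad\text{where}\quad
  P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)} .
\]
```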
Table 1, below, shows samples that have been recorded from three univariate class-conditional probability distributions:
class samples of x
ω1 3.54 4.83 0.74 3.86 3.32 1.69 2.57 3.34 6.58
ω2 15.85 4.75 17.17 5.63 1.68 5.57 0.98
ω3 3.75 4.00 4.53 5.34 1.59
Table 1
b. Apply k_n-nearest-neighbour density estimation to the data shown in Table 1 in order to estimate the density p(x|ω1) at location x = 4, using k = 3.
[4 marks]
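A minimal Python sketch of this estimator, assuming the common univariate convention that the "volume" is the interval of length twice the distance to the k-th nearest neighbour:

```python
# k_n-nearest-neighbour density estimate at a point x:
# p(x) ~= k / (n * V), where V is the length of the smallest
# interval centred on x that contains the k nearest samples.

def knn_density(samples, x, k):
    dists = sorted(abs(s - x) for s in samples)
    r_k = dists[k - 1]        # distance to the k-th nearest neighbour
    volume = 2 * r_k          # univariate "volume" is an interval length
    return k / (len(samples) * volume)

w1 = [3.54, 4.83, 0.74, 3.86, 3.32, 1.69, 2.57, 3.34, 6.58]
print(knn_density(w1, x=4.0, k=3))
```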
c. The class-conditional probability distribution for class 2 in Table 1, p(x|ω2), is believed to be a Normal distribution. Calculate the parameters of this Normal distribution using Maximum-Likelihood Estimation; use the "unbiased" estimate of the variance.
[4 marks]
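A sketch of the two estimates in Python (the "unbiased" variance divides by n − 1 rather than n):

```python
# Maximum-likelihood parameters for a univariate Normal:
# mean = sample mean; "unbiased" variance divides by (n - 1).

def normal_mle(samples):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)  # unbiased
    return mean, var

w2 = [15.85, 4.75, 17.17, 5.63, 1.68, 5.57, 0.98]
mu, sigma2 = normal_mle(w2)
print(mu, sigma2)
```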
d. For the data in Table 1, use Parzen-window density estimation to calculate the density p(x|ω3) at the location x = 4. Use a Gaussian window with width hn = 2. For a Gaussian Parzen window the estimate of the probability density, pn, at a value, x, is given by the formula:

pn(x) = (1/n) Σ_{i=1}^{n} (1/hn) φ((x − xi)/hn)

where

φ(u) = (1/√(2π)) exp(−0.5u²)

[4 marks]
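A direct Python transcription of the formula above, for illustration:

```python
import math

# Parzen-window density estimate with a Gaussian window phi(u),
# following the formula given in part d.

def parzen_density(samples, x, h):
    phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(phi((x - xi) / h) / h for xi in samples) / len(samples)

w3 = [3.75, 4.00, 4.53, 5.34, 1.59]
print(parzen_density(w3, x=4.0, h=2.0))
```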
e. Using the number of samples given in Table 1, estimate the prior probabilities of each class, i.e., calculate P(ω1), P(ω2) and P(ω3).
[3 marks]
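The standard frequency estimate, with ni read off from Table 1:

```latex
\[
  P(\omega_i) \approx \frac{n_i}{\sum_j n_j},
  \qquad n_i = \text{number of samples of class } \omega_i \text{ in Table 1}.
\]
```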
f. Use the answers to sub-questions 1.b, 1.c, 1.d and 1.e to determine the class of a new sample with value x = 4.
[6 marks]
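A sketch of how the pieces combine (the likelihood values below are placeholders standing in for the answers to 1.b, 1.c and 1.d; the evidence p(x) cancels in the comparison):

```python
# Bayes decision at x = 4: pick the class whose likelihood * prior
# product is largest; p(x) is common to all classes and can be ignored.
likelihoods = [0.25, 0.03, 0.17]   # placeholders for p(x=4 | w1..w3)
priors      = [9/21, 7/21, 5/21]   # from the sample counts in Table 1
scores = [L * P for L, P in zip(likelihoods, priors)]
decision = scores.index(max(scores)) + 1
print(f"decide class w{decision}")
```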
2.
a. Give a brief definition of each of the following terms in the context of classification:
i. Supervised Learning
ii. Generalisation

[4 marks]
b. A dichotomizer is defined using the following linear discriminant function: g(x) = wT x + w0, where wT = (−1, −2) and w0 = 3. Using this linear discriminant function, determine the class predicted for the feature vector xT = (2.5, 1.5).
[4 marks]
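A minimal check of the arithmetic, assuming the usual convention g(x) > 0 for class 1 and g(x) < 0 for class 2:

```python
# Evaluate the dichotomizer g(x) = w^T x + w0 and threshold at zero.
w = (-1.0, -2.0)
w0 = 3.0
x = (2.5, 1.5)
g = sum(wi * xi for wi, xi in zip(w, x)) + w0   # -2.5 - 3.0 + 3.0 = -2.5
print("class 1" if g > 0 else "class 2")        # g < 0 -> class 2
```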
c. Sketch the decision boundary defined by the linear discriminant function specified in sub-question 2.b.

[4 marks]

d. Table 2, below, shows a linearly separable dataset.

xT        Class
(2, −1)     0
(−1, 0)     1
(0, 0)      1
(1, 1)      0
(0, −1)     1

Table 2
i. Re-write the dataset shown in Table 2 using augmented vector notation.
[2 marks]
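For illustration, one common augmentation convention (prepending a constant 1 to each feature vector so the threshold can be absorbed as an extra weight):

```python
# Augmented notation: each x = (x1, x2) becomes y = (1, x1, x2).
data = [((2, -1), 0), ((-1, 0), 1), ((0, 0), 1), ((1, 1), 0), ((0, -1), 1)]
augmented = [((1,) + x, label) for x, label in data]
print(augmented)   # e.g. ((1, 2, -1), 0), ...
```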
ii. A linear threshold neuron has a transfer function which is a linear weighted sum of its inputs, and an activation function that subtracts a threshold (θ) and applies the Heaviside function. For a linear threshold neuron with parameter values θ = −1, w1 = 3 and w2 = 0.5, calculate the output of the neuron in response to each of the samples in the dataset given in Table 2.
[4 marks]
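A short sketch of this computation, assuming the convention H(a) = 1 for a ≥ 0 and 0 otherwise:

```python
# Linear threshold neuron: y = H(w1*x1 + w2*x2 - theta),
# where H is the Heaviside step function.
theta, w1, w2 = -1.0, 3.0, 0.5
samples = [(2, -1), (-1, 0), (0, 0), (1, 1), (0, -1)]
for x1, x2 in samples:
    a = w1 * x1 + w2 * x2 - theta
    y = 1 if a >= 0 else 0
    print((x1, x2), "->", y)
```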
iii. One method for learning the parameters of a linear threshold neuron is the sequential delta learning algorithm. Give the learning rule for this algorithm, and apply it to find the parameters of a linear threshold neuron that will correctly classify the data in Table 2. Assume initial values of θ = −1, w1 = 3 and w2 = 0.5, and a learning rate of 1.
[7 marks]
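A compact sketch of the sequential delta rule, a ← a + η(t − y)y_vec, applied to the augmented data from 2.d.i, assuming the augmented weight vector stores w0 = −θ:

```python
eta = 1.0
a = [1.0, 3.0, 0.5]          # (w0, w1, w2) with w0 = -theta and theta = -1
data = [((1, 2, -1), 0), ((1, -1, 0), 1), ((1, 0, 0), 1),
        ((1, 1, 1), 0), ((1, 0, -1), 1)]

converged = False
while not converged:          # cycle through samples until all are correct
    converged = True
    for y_vec, t in data:
        y = 1 if sum(ai * yi for ai, yi in zip(a, y_vec)) >= 0 else 0
        if y != t:
            a = [ai + eta * (t - y) * yi for ai, yi in zip(a, y_vec)]
            converged = False
print(a)                      # learned augmented weight vector
```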
3.
a. Give brief definitions of the following terms:
i. Feature selection
ii. Feature extraction

[4 marks]
b. Principal component analysis (PCA) is a commonly used method for feature extraction in pattern recognition problems. Briefly describe what principal components are, and explain why performing PCA on a dataset before learning may increase classifier accuracy.
[4 marks]
c. Write pseudo-code for the Karhunen-Loève Transform method for PCA.
[5 marks]
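An outline of the usual KLT steps, sketched in Python with numpy (an exam answer would express the same steps as pseudo-code):

```python
import numpy as np

def klt(X, m):
    """Karhunen-Loeve Transform: project n x d data X onto its
    top-m principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # 1. zero-mean the data
    C = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # 3. eigen-decomposition
    order = np.argsort(eigvals)[::-1]      # 4. sort by decreasing eigenvalue
    V = eigvecs[:, order[:m]]              # 5. keep the top-m eigenvectors
    return Xc @ V                          # 6. project the data
```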
d. When performing PCA using neural networks, Oja’s update rule is often used in place of standard Hebbian learning. Give Oja’s update rule, and briefly explain why this rule is preferred.
[5 marks]
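A sketch of a single Oja update step, assuming one linear neuron with weight vector w and learning rate η:

```python
import numpy as np

def oja_step(w, x, eta=0.01):
    """One step of Oja's rule: dw = eta * y * (x - y * w).
    The -y^2 w term keeps ||w|| bounded, unlike plain Hebbian learning."""
    y = w @ x
    return w + eta * y * (x - y * w)
```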
e. Linear Discriminant Analysis (LDA) is a supervised projection method for finding linear combinations of features in a dataset. Briefly explain Fisher’s LDA method for estimating the weight parameters of a decision boundary to separate two classes in a dataset, and give the objective function.
[3 marks]
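A sketch of Fisher's solution: the weight vector maximising the objective J(w) = (wT SB w)/(wT SW w) is proportional to SW⁻¹(m1 − m2). Function and variable names below are illustrative:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher's LDA direction for two classes: w ~ Sw^-1 (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    return np.linalg.solve(Sw, m1 - m2)
```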
f. Figure 1 shows a dataset containing two classes of two-dimensional data, with one class labelled with crosses and the other class labelled with circles. An axis onto which the data could be projected is shown with a black arrow.
[Figure 1: a scatter plot of the two classes in the (x1, x2) plane; the candidate projection axis is drawn as a black arrow.]
Is this projection a suitable axis for maximally discriminating between the two classes of data? Justify your answer.
[4 marks]
4.
a. Draw a diagram to show each of the following activation functions.
• Symmetric hard limit (signum)
• Linear function
• Symmetric sigmoid function
• Logarithmic sigmoid function
• Radial basis function
[5 marks]
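Reference definitions for these functions in one common parameterisation (the exact forms used in the module may differ slightly):

```python
import math

signum   = lambda a: 1.0 if a >= 0 else -1.0      # symmetric hard limit
linear   = lambda a: a                             # linear function
tanh_sig = lambda a: math.tanh(a)                  # symmetric sigmoid
log_sig  = lambda a: 1.0 / (1.0 + math.exp(-a))    # logarithmic sigmoid
rbf      = lambda a: math.exp(-a * a)              # radial basis function
```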
b. Referring to the activation functions listed in Question 4.a, which activation function is not suitable for training a feedforward neural network using the backpropagation algorithm? Explain your answer.
[5 marks]
c. When the backpropagation algorithm is employed to train a neural network, the dataset is usually partitioned into three sub-datasets. Name these three sub-datasets.
[3 marks]
d. Figure 2 shows a diagram of a feedforward neural network with two inputs and two outputs. All activation functions, indicated by "/", are linear. In the feedforward operation, Table 3 shows the input-output relationship between inputs (x1 and x2) and outputs (z1 and z2). For example, referring to Table 3, when x1 = 2 and x2 = −0.5, the neural network's outputs are z1 = 98 and z2 = 7.5. Determine the connection weights w11, w12, w21 and w22 which will produce the results in Table 3.
[12 marks]
[Figure 2: a diagram of a 3-layer neural network. Inputs x1 and x2 feed the hidden layer through connection weights w11, w12, w21 and w22; the two hidden units feed outputs z1 and z2 through connection weights m11 = 8, m12 = −4, m21 = 6 and m22 = 9. All activation functions are linear.]

x1    x2     z1     z2
2    −0.5    98    7.5
−4   −6    −168  −246

Table 3: Inputs x1 and x2, and outputs z1 and z2.
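A sketch of one way to set up the computation, assuming the standard topology h1 = w11·x1 + w12·x2, h2 = w21·x1 + w22·x2, z1 = m11·h1 + m12·h2, z2 = m21·h1 + m22·h2. With linear activations, the two input-output pairs give four linear equations in the four unknown weights:

```python
import numpy as np

# Hidden-to-output weights, read off Figure 2.
M = np.array([[8.0, -4.0],
              [6.0,  9.0]])

X = np.array([[2.0, -0.5], [-4.0, -6.0]])       # inputs, one row per pair
Z = np.array([[98.0, 7.5], [-168.0, -246.0]])   # outputs, one row per pair

# Linear activations: z = M h, so the hidden activations are h = M^-1 z.
H = np.linalg.solve(M, Z.T).T
# Likewise h = W x for each pair, i.e. H = X W^T; solve for the input weights.
W = np.linalg.solve(X, H).T
print(W)    # rows: (w11, w12) and (w21, w22)
```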
5.
a. Briefly explain what a “weak classifier” is.
[3 marks]
b. Explain briefly the idea of “ensemble methods” for classification.
[3 marks]
c. Name two methods used for designing ensemble classifiers.
[2 marks]
d. The Adaboost algorithm is given in Table 4.
Algorithm: Adaboost
Given D = {(x1, y1), (x2, y2), …, (xn, yn)}, yi ∈ {−1, 1}, i = 1, …, n
begin initialise kmax, W1(i) = 1/n, i = 1, …, n
    k ← 0
    do k ← k + 1
        ĥk = arg min_{hj ∈ H} Ej, where Ej = Σ_{i=1, yi ≠ hj(xi)}^{n} Wk(i)   (weighted error rate)†
        εk = overall weighted error rate of classifier ĥk (i.e., the minimum Ej)
        if εk > 0.5
            kmax = k − 1
            Stop
        end
        αk = (1/2) ln((1 − εk)/εk)
        Wk+1(i) = Wk(i) exp(−αk yi ĥk(xi)) / Zk,  i = 1, …, n   (update Wk(i))‡
    until k = kmax
end
† Finds the classifier ĥk ∈ H that minimises the error εk with respect to the distribution Wk(i).
‡ Zk is a normalisation factor chosen so that Wk+1(i) is a normalised distribution.

Table 4: Adaboost Algorithm.
i. Consider a classification problem with 5 samples. The dataset is denoted as D = {(x1, −1), (x2, −1), (x3, +1), (x4, +1), (x5, +1)}. Three weak classifiers, h1(x), h2(x) and h3(x), are designed, whose classification results are shown in Table 5. Referring to this table, taking h1(x) and x1 as an example, the classification given by the classifier h1(x) with the input sample x1 is +1. Determine α1 by running the first iteration of the Adaboost algorithm shown in Table 4 with kmax = 3.
[14 marks]
Classifier   x1   x2   x3   x4   x5
h1(x)        +1   −1   +1   +1   +1
h2(x)        −1   +1   −1   +1   +1
h3(x)        −1   −1   +1   −1   −1
Table 5: Classification Results.
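For illustration, a sketch of the first iteration of Table 4's algorithm on this data, with labels and predictions taken from the question:

```python
import math

y = [-1, -1, +1, +1, +1]                 # true labels for x1..x5
h = {"h1": [+1, -1, +1, +1, +1],         # predictions from Table 5
     "h2": [-1, +1, -1, +1, +1],
     "h3": [-1, -1, +1, -1, -1]}

W = [1 / 5] * 5                          # W1(i) = 1/n
# Weighted error Ej of each weak classifier under W1.
E = {name: sum(w for w, yi, hi in zip(W, y, preds) if yi != hi)
     for name, preds in h.items()}
best = min(E, key=E.get)                 # h-hat_1 = arg min Ej
eps1 = E[best]
alpha1 = 0.5 * math.log((1 - eps1) / eps1)
print(best, eps1, alpha1)
```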
ii. When the Adaboost algorithm continues, it is assumed that ĥ2(x) = h3(x), ĥ3(x) = h2(x), α2 = 0.8 and α3 = 0.9 are obtained. Using the results obtained in Question 5.d.i, give the formula for this ensemble classifier.
[3 marks]
6.
a. In the context of linear support vector machines (SVMs), considering the linearly separable case, the design of a linear SVM classifier is formulated as the following optimisation problem:
min_w J(w) = (1/2)∥w∥²

subject to yi(wT xi + w0) ≥ 1,  i = 1, 2, …, N

where w and w0 are the parameters of the linear SVM classifier; yi ∈ {−1, +1} denotes the class label of the sample xi; N is the number of samples and ∥·∥ denotes the Euclidean norm operator.
i. Briefly explain why "J(w) = (1/2)∥w∥²" has to be minimised.
[4 marks]
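For reference, the standard geometric link between J(w) and the margin:

```latex
\[
  \text{margin} = \frac{2}{\lVert w \rVert},
  \qquad\text{so minimising } J(w) = \tfrac{1}{2}\lVert w \rVert^{2}
  \text{ is equivalent to maximising the margin.}
\]
```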
ii. Briefly explain why the constraint “yi(wT xi + w0) ≥ 1” has to be satisfied.
[4 marks]
iii. If the samples are nonlinearly separable, what can be done to the samples so that the linear SVM classifier becomes a nonlinear one?

[2 marks]

b. A training dataset is given as follows:

Class 1: x1 = −3; x2 = −2; x3 = 6; x4 = 8.
Class 2: x5 = −1; x6 = 0; x7 = 1; x8 = 2.

i. Is the dataset linearly separable? Justify your answer.

[4 marks]
ii. A feature mapping function zi = Φ(xi) = (xi, xi²)T, i = 1, 2, …, 8, is considered. It is found that the samples z1 to z8 in the feature space after mapping are linearly separable. Using z2, z5 and z8 as support vectors, design an SVM classifier to correctly classify all the given samples x1 to x8.
[11 marks]
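A small sketch of the feature mapping, with the class assignments as given in part b; the separability claimed by the question can be checked by eye from the printed points:

```python
# Map each scalar sample to z = (x, x^2); in this 2-D feature space
# the two classes become linearly separable, as the question states.
class1 = [-3, -2, 6, 8]     # x1..x4
class2 = [-1, 0, 1, 2]      # x5..x8
phi = lambda x: (x, x * x)
print("class 1:", [phi(x) for x in class1])
print("class 2:", [phi(x) for x in class2])
```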