First Name:    Last Name:
Student Number:
ECE 421F — Introduction to Machine Learning MidTerm Examination
Oct 16th, 2019
Circle your tutorial section:
• TUT0101 Thu 1-3 • TUT0102 Thu 4-6
Wed Oct 16th, 2019 4:10-6:00 p.m.
Instructor: Ashish Khisti
Instructions
• Please read the following instructions carefully.
• You have one hour and fifty minutes (1:50) to complete the exam.
• Please make sure that you have a complete exam booklet.
• Please answer all questions. Read each question carefully.
• The value of each question is indicated. Allocate your time wisely!
• No additional pages will be collected beyond this answer book. You may use the reverse side of each page if needed to show additional work.
• This examination is closed book; one 8.5″ × 11″ aid sheet is permitted. A non-programmable calculator is also allowed.
• Good luck!
1.
(40 MARKS) Consider a multi-class linear classification problem where the data points are two-dimensional, i.e., x = (x1, x2) ∈ R², and the labels y ∈ {1, 2, 3}. Throughout this problem consider the data-set with the following five points:

D = {(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5)},

where the input data-vectors are given by

x1 = (−1, 0)T, x2 = (1, 0)T, x3 = (1, 1)T, x4 = (−1, 1)T, x5 = (0, 3)T,

and the associated labels are given by

y1 = 1, y2 = 2, y3 = 2, y4 = 1, y5 = 3.

Our aim is to find a linear classification rule that classifies this dataset.

10 marks
(a) Suppose we implement the perceptron learning algorithm for binary classification that finds a perfect classifier separating the data points between the two sets S1 = {(x1, y1), (x4, y4)} and S2 = {(x2, y2), (x3, y3)}.
Assume that the initial weight vector is w = (0, 0, 0)T, that each point that falls on the boundary is treated as a mis-classified point, and that the algorithm visits the points in the following order:

x1 → x2 → x3 → x4 → x1 → x2 → ⋯

until it terminates. Show the output of the perceptron algorithm at each step and sketch the final decision boundary when the algorithm terminates. [Important: When applying the perceptron update, recall that you have to transform the data vectors to include the constant term, i.e., x1 = (−1, 0)T must be transformed to x̃1 = (1, −1, 0)T, etc.]
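For readers who want to check a hand computation, here is a minimal NumPy sketch of the update rule described above, visiting the four points in the stated cyclic order and treating boundary points (wT x̃ = 0) as mis-classified. The ±1 encoding of the two sets and the variable names are assumptions made for this sketch only.

import numpy as np

# Augmented data points (constant term x0 = 1 prepended) for the binary
# sub-problem: S1 = {x1, x4} encoded as label -1, S2 = {x2, x3} as +1.
X = np.array([[1, -1, 0],   # x~1
              [1,  1, 0],   # x~2
              [1,  1, 1],   # x~3
              [1, -1, 1]])  # x~4
y = np.array([-1, +1, +1, -1])

w = np.zeros(3)                          # initial weight vector (0, 0, 0)^T
updated = True
while updated:                           # stop once a full pass is error-free
    updated = False
    for xn, yn in zip(X, y):             # visit x1 -> x2 -> x3 -> x4 -> ...
        if yn * np.dot(w, xn) <= 0:      # boundary points count as errors
            w = w + yn * xn              # perceptron update
            print("update ->", w)
            updated = True
print("final weights:", w)

Each printed line corresponds to one update step you would show in your answer; the final w determines the decision boundary to sketch.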
5 marks
(b) Find any linear classification rule that perfectly separates the data points between the two sets: S12 = {(x1, y1), (x2, y2), (x3, y3), (x4, y4)} and S3 = {(x5, y5)}. Draw your decision boundary and clearly mark the labels for all the decision regions. You need not use a perceptron algorithm to find the classification rule.
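Any candidate boundary can be checked numerically in a few lines. The sketch below tests one possible choice, the line x2 = 2, which is offered purely as an illustration and is not the only (or necessarily the intended) answer.

import numpy as np

# All five data points; class 3 should fall on one side of the boundary.
X = np.array([[-1, 0], [1, 0], [1, 1], [-1, 1], [0, 3]])
is_class_3 = np.array([False, False, False, False, True])

w = np.array([-2.0, 0.0, 1.0])               # candidate (w0, w1, w2): x2 = 2
scores = w[0] + X @ w[1:]                    # w0 + w1*x1 + w2*x2
print(scores)                                # [-2. -2. -1. -1.  1.]
print(np.all((scores > 0) == is_class_3))    # True -> perfect separation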
10 marks
(c) Explain how to combine parts (a) and (b) to develop a classification rule that, given any input x ∈ R², outputs a label ŷ ∈ {1, 2, 3}. Your classification rule must achieve perfect classification on the training set. Sketch your decision boundaries in R² and show the labels associated with each decision region.
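One natural way to combine the two binary rules is a two-stage cascade, sketched below with placeholder weight vectors w_a and w_b standing in for whatever you obtained in parts (a) and (b); both the cascade order and the numeric values are illustrative assumptions, not the unique correct answer.

import numpy as np

def predict(x, w_a, w_b):
    """First decide class 3 vs. {1, 2} using the rule from part (b),
    then split classes 1 and 2 using the perceptron rule from part (a)."""
    x_tilde = np.concatenate(([1.0], x))          # augment with constant term
    if np.dot(w_b, x_tilde) > 0:                  # part (b) boundary
        return 3
    return 2 if np.dot(w_a, x_tilde) > 0 else 1   # part (a) boundary

w_a = np.array([0.0, 2.0, 0.0])    # placeholder: boundary x1 = 0, class 2 on x1 > 0
w_b = np.array([-2.0, 0.0, 1.0])   # placeholder: boundary x2 = 2, class 3 above it
print([predict(np.array(x), w_a, w_b)
       for x in [(-1, 0), (1, 0), (1, 1), (-1, 1), (0, 3)]])   # [1, 2, 2, 1, 3]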
10 marks
(d) Suppose we wish to implement a multi-class logistic regression model for classifying the training set D. Let Ω = {w(1), w(2), w(3)} denote the model parameters of your logistic regression model, where w(i) ∈ R³ is the weight vector associated with class label y = i. Given an input data vector x = (x1, x2)T, the model output is a probability vector:

p̂Ω(i | x) = exp(w(i)T · x̃) / ∑_{j=1}^{3} exp(w(j)T · x̃),   i = 1, 2, 3,

where x̃ = (x0 = 1, x1, x2)T ∈ R³ is the augmented vector of x as discussed in class. We assume a standard log-loss function for the training error, i.e.,

Ein(Ω) = (1/5) ∑_{n=1}^{5} en(Ω),   en(Ω) = − log p̂Ω(yn | xn).

Assuming that we select w(1) = (1, 0, 0)T, w(2) = (0, 1, 0)T, and w(3) = (0, 0, 1)T, numerically evaluate ∇_{w(j)} e1(Ω) for j = 1, 2, 3.
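A finite-difference check is a convenient way to validate hand-computed gradients for this part. The sketch below perturbs each entry of each w(j) around the given parameter values and approximates ∇_{w(j)} e1(Ω) numerically; the perturbation size is an arbitrary choice, and the code assumes nothing about the analytical form of the answer.

import numpy as np

x1_tilde = np.array([1.0, -1.0, 0.0])    # augmented first data point x~1
y1 = 1                                   # its label
W = np.eye(3)                            # rows are w(1), w(2), w(3) as given

def e1(W):
    """Log-loss of the first training example under the softmax model."""
    scores = W @ x1_tilde
    p = np.exp(scores) / np.exp(scores).sum()
    return -np.log(p[y1 - 1])

eps = 1e-6                               # finite-difference step (arbitrary)
grad = np.zeros_like(W)
for j in range(3):                       # loop over w(1), w(2), w(3)
    for k in range(3):                   # loop over the entries of w(j)
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        grad[j, k] = (e1(Wp) - e1(Wm)) / (2 * eps)
print(grad)                              # row j approximates the gradient w.r.t. w(j)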
5 marks
(e) For the problem in part (d), find the output of a one-step update of the stochastic gradient descent (SGD) algorithm when the selected training example is n = 1 and ε is selected as the learning rate.
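As a minimal continuation of the part (d) sketch, a single SGD step moves each weight vector against its gradient. The gradient array G below is a zero placeholder to be replaced by your part (d) result, and the learning-rate value is illustrative only.

import numpy as np

W = np.eye(3)              # w(1), w(2), w(3) stacked as rows, as in part (d)
G = np.zeros((3, 3))       # placeholder: row j holds your gradient w.r.t. w(j)
epsilon = 0.1              # illustrative learning rate
W_new = W - epsilon * G    # one SGD step: w(j) <- w(j) - epsilon * grad_j
print(W_new)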
2.
(40 MARKS) Consider a linear regression model where the training set is specified by

D = {(x1, y1), (x2, y2), …, (xN, yN)},

with xi ∈ R^{d+1} and yi ∈ R. We assume that each data vector is in the augmented dimension, i.e., xi = (xi,0 = 1, xi,1, …, xi,d)T. We aim to find a weight vector w ∈ R^{d+1} that minimizes a weighted squared error loss function

E_in^Λ(w) = (1/N) ∑_{i=1}^{N} λi · (wT xi − yi)²,   (1)

where Λ = (λ1, …, λN)T is a pre-specified vector of non-negative constants that determine the importance of each sample. Suppose that w⋆ minimizes the weighted squared error loss, i.e.,

w⋆ = arg min_{w ∈ R^{d+1}} E_in^Λ(w).

10 marks
(a) The optimal solution w⋆ can be expressed as a solution to the following expression: M · w⋆ = U · y, where M and U are matrices of dimension (d + 1) × (d + 1) and (d + 1) × N respectively, and y = [y1, …, yN]T is the observation vector. Provide an expression for M and U in terms of the following matrices:

X = ⎡ x1T ⎤
    ⎢ x2T ⎥ ∈ R^{N×(d+1)},
    ⎢  ⋮  ⎥
    ⎣ xNT ⎦

L = ⎡ √λ1   0   …   0  ⎤
    ⎢  0   √λ2  …   0  ⎥ ∈ R^{N×N}.
    ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣  0    0   …  √λN ⎦

Please note that L is a diagonal matrix whose j-th diagonal entry is √λj.
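Whatever expressions you derive for M and U can be sanity-checked numerically: the weighted problem is equivalent to ordinary least squares on the re-scaled pair (L X, L y). The sketch below uses randomly generated data purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 2                                   # illustrative sizes
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # augmented data
y = rng.normal(size=N)
lam = rng.uniform(0.5, 2.0, size=N)           # non-negative sample weights
L = np.diag(np.sqrt(lam))

# Minimizing (1/N) * sum_i lam_i * (w^T x_i - y_i)^2 is equivalent to
# ordinary least squares on (L @ X, L @ y).
w_star, *_ = np.linalg.lstsq(L @ X, L @ y, rcond=None)
print(w_star)

# A correct pair (M, U) should satisfy M @ w_star ≈ U @ y for this data.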
10 marks
(b) Provide an expression for ∇w E_in^Λ(w) and use it to provide a (full) gradient descent algorithm for numerically computing the optimal solution w⋆. Assume that a constant learning rate of ε is used in the algorithm.
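As a template for checking your algorithm numerically, the loop below runs constant-step gradient descent on E_in^Λ(w), with the gradient approximated by central differences so that the sketch does not presuppose your answer for ∇w E_in^Λ(w); the data, step size, and iteration count are all illustrative.

import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
lam = rng.uniform(0.5, 2.0, size=N)

def E_in(w):
    """Weighted squared error loss E_in^Lambda(w)."""
    return np.mean(lam * (X @ w - y) ** 2)

def num_grad(w, h=1e-6):
    """Central-difference approximation of the gradient of E_in at w."""
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (E_in(w + e) - E_in(w - e)) / (2 * h)
    return g

w = np.zeros(d + 1)
epsilon = 0.02                        # constant learning rate (illustrative)
for _ in range(5000):                 # in practice, stop when the step is tiny
    w = w - epsilon * num_grad(w)     # full gradient descent update
print(w, E_in(w))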
5 marks
(c) Using gradient analysis in part (b), provide a stochastic gradient descent (SGD) algorithm for numerically computing the optimal solution w⋆. Assume that a constant learning rate of ε is used in the algorithm.
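In the same spirit, and again only as an illustration on synthetic data, the SGD variant replaces the full gradient with the gradient of the single term λi · (wT xi − yi)² for a randomly chosen index i; the per-sample gradient is approximated numerically here so that no particular closed form is assumed.

import numpy as np

rng = np.random.default_rng(2)
N, d = 6, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
lam = rng.uniform(0.5, 2.0, size=N)

def e_i(w, i):
    """Single-sample term lam_i * (w^T x_i - y_i)^2 of the weighted loss."""
    return lam[i] * (X[i] @ w - y[i]) ** 2

def num_grad_i(w, i, h=1e-6):
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (e_i(w + e, i) - e_i(w - e, i)) / (2 * h)
    return g

w = np.zeros(d + 1)
epsilon = 0.01                            # constant learning rate
for _ in range(20000):
    i = rng.integers(N)                   # pick one training example at random
    w = w - epsilon * num_grad_i(w, i)    # stochastic gradient step
print(w, np.mean(lam * (X @ w - y) ** 2))

With a uniformly sampled index, the expected per-sample gradient equals the full gradient of E_in^Λ(w), which is what justifies the update.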
5 marks
(d) List any two advantages of using the SGD algorithm over the solution in part (a).
10 marks
(e) Suppose we wish to find wβ⋆ that minimizes the following:

wβ⋆ = arg min_{w ∈ R^{d+1}} { E_in^Λ(w) + β · ||w||² },   (2)

where E_in^Λ(w) is the weighted squared error loss as in part (a), ||w||² is the squared Euclidean norm of w, and β > 0 is the regularization constant. Provide an analytical closed-form expression for wβ⋆ in terms of X, L, y, the identity matrix, and the constant β.
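A hand-derived closed form can be checked against a direct numerical minimization of the objective in (2). The sketch below uses SciPy's general-purpose BFGS optimizer purely as a reference; the data and the value of β are illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, d, beta = 6, 2, 0.5                        # illustrative sizes and beta
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
lam = rng.uniform(0.5, 2.0, size=N)
L = np.diag(np.sqrt(lam))

def J(w):
    """Regularized objective from equation (2)."""
    return np.mean(lam * (X @ w - y) ** 2) + beta * (w @ w)

w_numeric = minimize(J, np.zeros(d + 1), method="BFGS").x
print(w_numeric)

# Compare with your closed-form answer, e.g.
#   w_closed = <your expression in X, L, y, the identity matrix, and beta>
#   print(np.allclose(w_numeric, w_closed, atol=1e-5))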
20 marks
3. Consider a binary linear classification problem where x ∈ R² and y ∈ {−1, +1}. We illustrate the training dataset below. The '+' label refers to y = +1 and the 'o' label refers to y = −1. We would like to construct a classifier hw(x) = sign(w0 + w1 x1 + w2 x2), where sign(⋅) is the sign function as discussed in class.

[Figure: training data for this problem; the source caption reads "Figure 8.13 Data for logistic regression question."]

In the figure above, the adjacent vertical (and horizontal) lines are 1 unit apart from each other. Assume that the training points above are (x1, y1), …, (xN, yN) (with N = 13). We consider the classification loss

L(w) = (1/N) ∑_{n=1}^{N} I(yn ≠ hw(xn)),

where I denotes the indicator function.

2 marks
(a) Draw a decision boundary in the figure above that achieves zero training error.

6 marks
(b) Suppose that we attempt to minimize the following loss function over w: J(w) = L(w) + λ w0², where λ = 10⁷ is a huge constant. Sketch a possible decision boundary in the figure below. How many points are mis-classified?

[Figure: copy of the training data for sketching the part (b) boundary.]
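Since the 13 actual data points live only in the figure, the coordinates below are hypothetical placeholders; the sketch merely shows how the classification loss L(w) would be evaluated for a candidate weight vector once the points are read off the plot.

import numpy as np

# Placeholder points and labels (NOT the points from the figure).
X = np.array([[0.0, 1.0], [1.0, 2.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1])

def L(w):
    """Average 0-1 loss of h_w(x) = sign(w0 + w1*x1 + w2*x2)."""
    preds = np.sign(w[0] + X @ w[1:])
    return np.mean(preds != y)

print(L(np.array([0.0, 0.0, 1.0])))   # loss of the candidate boundary x2 = 0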
6 marks
(c) Suppose that we attempt to minimize the following loss function over w: J(w) = L(w) + λ w1², where λ = 10⁷ is a huge constant. Sketch a possible decision boundary in the figure below. How many points are mis-classified?

[Figure: copy of the training data for sketching the part (c) boundary.]

6 marks
(d) Suppose that we attempt to minimize the following loss function over w: J(w) = L(w) + λ w2², where λ = 10⁷ is a huge constant. Sketch a possible decision boundary in the figure below. How many points are mis-classified?

[Figure: copy of the training data for sketching the part (d) boundary.]