ECE 421S/ECE1513S — Introduction to Machine Learning
Makeup Final Examination
April 17th, 2019, 6:30 p.m. – 9:00 p.m.
Instructors: Ashish Khisti, Ben Liang, and Amir Ashouri

First Name: _______________  Last Name: _______________  Student Number: _______________
Instructions
• Please read the following instructions carefully.
• You have 2 hours and 30 minutes to complete the exam.
• Please make sure that you have a complete exam booklet.
• Please answer all questions. Read each question carefully.
• The value of each question is indicated. Allocate your time wisely!
• N(μ, σ²) denotes a Gaussian density function with mean μ and variance σ².
• No additional pages will be collected beyond this answer book.
• This examination is closed book; one 8.5 × 11 aid sheet is permitted. A non-programmable calculator is also allowed.
• Good luck!
6 marks
1. (20 MARKS) Consider a binary classification problem in two dimensions. Let H0 denote the hypothesis class of all linear classifiers in two dimensions passing through the origin. Thus any h ∈ H0 can be expressed as:
h(x) = sign(w1x1 + w2x2) where x = (x1,x2) is the data point and w = (w1,w2) ∈ R2.
Furthermore, let g1(x) and g2(x) be arbitrary classification functions such that gi : R2 → {−1, +1}, and assume that neither g1(⋅) nor g2(⋅) is in H0.
(a) Let mH0(N) denote the growth function of the hypothesis class H0. Compute the growth function for N = 1, 2, 3, 4. Provide a brief justification for each computation.
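(Aside, not part of the exam: the hand computation in part (a) can be sanity-checked numerically. The sketch below estimates, for random point sets, how many distinct labelings sign(w1x1 + w2x2) can be produced by varying w; the function names, direction count, and number of point configurations are arbitrary illustrative choices.)

```python
import numpy as np

def count_dichotomies(points, num_directions=20000, seed=0):
    """Count distinct labelings sign(w.x) achievable on the given points
    by origin-passing linear classifiers, using many random directions w."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2 * np.pi, size=num_directions)
    W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # random unit weight vectors
    signs = np.sign(W @ points.T)                            # shape (num_directions, N)
    signs[signs == 0] = 1                                     # break ties consistently
    return len({tuple(row) for row in signs.astype(int)})

def estimate_growth(N, num_configs=200, seed=0):
    """Estimate mH0(N) as the max dichotomy count over random point sets."""
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(num_configs):
        pts = rng.normal(size=(N, 2))
        best = max(best, count_dichotomies(pts, seed=rng.integers(1 << 30)))
    return best

for N in range(1, 5):
    print(N, estimate_growth(N))
```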
4 marks
(b) Let us define a new hypothesis class M = H0 ∪ {g1, g2}. Let mM(N) denote the growth function of M. What is the maximum possible value of mM(N) for N = 1, 2, 3, 4, over all choices of g1(⋅) and g2(⋅)?
4 marks
(c) Suppose we train the hypothesis class M defined in part (b) over a training set D = {(x1, y1), . . . , (xN, yN)} (where the xi are drawn i.i.d. from px(⋅) and yi = f(xi)), and produce an output hypothesis m(⋅) ∈ M that minimizes the training error. Let Ein(m) denote the average classification error of m(⋅) on the training set D and Eout(m) denote the test error for this hypothesis. Assume that g1(⋅) and g2(⋅) are fixed hypotheses selected before observing D. Find a relation between Ein(m) and Eout(m) in the following sense:
With probability at least 1 − δ, Eout(m) ≤ Ein(m) + Ω, where you should provide a bound on Ω using VC dimension theory.
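(Aside, not part of the exam: one standard form of the VC generalization bound is Eout ≤ Ein + sqrt((8/N) ln(4 mH(2N)/δ)). The snippet below merely evaluates a bound of that form for a user-supplied growth function; the example growth function passed in is a placeholder, not the answer to parts (a)-(b).)

```python
import numpy as np

def vc_bound(ein, N, growth_fn, delta=0.05):
    """Bound of the form Eout <= Ein + sqrt((8/N) * ln(4 * m(2N) / delta))."""
    omega = np.sqrt((8.0 / N) * np.log(4.0 * growth_fn(2 * N) / delta))
    return ein + omega

# Illustrative only: plug in whatever growth function m_M(N) you derived above.
example_growth = lambda N: 2 * N + 2   # placeholder, not the exam's answer
print(vc_bound(ein=0.1, N=1000, growth_fn=example_growth))
```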
6 marks
(d) In this part, suppose that the hypothesis g1(x) = 1 for all x ∈ R2. However, g2(⋅) is selected after observing the dataset D (stated in part (c)) as follows:
Let F denote the class of all hypotheses whose decision boundaries are horizontal lines in the x1-x2 plane, i.e.,
f(x) = sign(a ⋅ x2 − c)
where c ∈ R and a ∈ {−1, +1}. We select g2(⋅) by finding the f ∈ F that minimizes the average classification error over D, i.e., Ein(f) is minimized.
The hypothesis g2(⋅), thus selected, is included in M along with g1(⋅). We train the hypothesis class M on the same dataset D (used to select g2(⋅)), and output m ∈ M that minimizes the training error.
Find a relation between Ein(m) and Eout(m) in the following sense: with probability at least 1 − δ, Eout(m) ≤ Ein(m) + Ω, where you should provide a bound on Ω using VC dimension theory.
5 marks
2. (20 MARKS) Consider a target function f(x) = x², where the domain of the input x is the interval [−1, 1]. The training data set contains only one example: D = {(U, U²)}, where U is sampled uniformly from [−1, 1]. Our hypothesis set H consists of all lines of the form h(x) = ax, for a ∈ R. Let gD(x) be the output hypothesis, given training data set D, that minimizes the squared error.
(a) Show that the average hypothesis, i.e., the expectation of gD(x) over D, is ḡ(x) = 0.
5 marks
(b) Find bias(x) and var(x).
5 marks
(c) Find the expected out-of-sample error ED[Eout(gD)].
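(Aside, not part of the exam: parts (a)-(c) can be checked by simulation. Fitting h(x) = ax to a single example (U, U²) by least squares gives slope a = U·U²/U² = U, which the sketch below uses directly; the sample sizes are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
num_datasets = 20_000      # number of one-example training sets D = {(U, U^2)}
num_test = 500             # test points x for averaging over the input distribution

U = rng.uniform(-1.0, 1.0, size=num_datasets)
a = (U * U**2) / (U**2)    # least-squares slope for h(x) = a x fit to (U, U^2); equals U

x_test = rng.uniform(-1.0, 1.0, size=num_test)
preds = np.outer(a, x_test)                 # gD(x) for every dataset (rows) and test point (cols)

g_bar = preds.mean(axis=0)                  # estimate of the average hypothesis at each x
bias_x = (g_bar - x_test**2) ** 2           # bias(x) estimate
var_x = preds.var(axis=0)                   # var(x) estimate
eout = ((preds - x_test**2) ** 2).mean()    # estimate of E_D[Eout(gD)]

print("max |g_bar(x)| (should be near 0):", np.abs(g_bar).max())
print("E_x[bias(x)] =", bias_x.mean(), " E_x[var(x)] =", var_x.mean(), " E_D[Eout] =", eout)
```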
5 marks
(d) Suppose we actually have two data examples: D = {(x1, y1), (x2, y2)}, where x1 and x2 are independently and uniformly sampled from [−1, 1]. However, we use the leave-one-out cross-validation method, so that the training data set always contains only one example. What is the expectation of the cross-validation estimate ECV over D?
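(Aside, not part of the exam: the leave-one-out estimate in part (d) can be simulated the same way; each fold trains on one point and validates on the other. The trial count below is arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(1)
num_datasets = 500_000

x = rng.uniform(-1.0, 1.0, size=(num_datasets, 2))   # two training points per dataset
y = x**2

# Leave-one-out: the single-point least-squares slope equals that point's x-coordinate.
err_fold1 = (x[:, 1] * x[:, 0] - y[:, 0]) ** 2   # train on point 1, validate on point 0
err_fold2 = (x[:, 0] * x[:, 1] - y[:, 1]) ** 2   # train on point 0, validate on point 1
e_cv = 0.5 * (err_fold1 + err_fold2)

print("E_D[E_CV] estimate:", e_cv.mean())
```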
5 marks
3. (20 MARKS) In ancient times, there was a village surrounded by hundreds of lakes. Each lake was either poisonous (P) or healthy (H). Anyone who ate fish from a poisonous lake would die immediately, while anyone who ate fish from a healthy lake would survive. All fish looked identical and tasted identical, and there was no way of knowing whether a lake was poisonous or healthy.
Fortunately a famous chemist visited the village and was told of this dilemma. She suggested using the pH level of water to determine whether a given lake was poisonous or healthy. She hypothesized that lakes with poisonous fish would have higher pH value than healthy lakes. Accordingly she visited each lake and collected the pH value of the water in each lake. The data set is denoted by:
D = {l1,l2,…,lN}
where li denotes the pH level of lake i ∈ {1,2,…,N}. We assume that l1 ≤ l2 ≤ l3 ≤ … ≤ lN.
You are hired as a machine learning scientist to help determine the probability that a lake is poisonous given its pH value. In order to do this you propose to use a Gaussian Mixture Model (GMM) as follows:
• Pr(lake is poisonous) = p1
• Pr(lake is healthy) = p2
• f(pH = l ∣ lake is poisonous) ∼ N(μ1, σ1²)
• f(pH = l ∣ lake is healthy) ∼ N(μ2, σ2²)
• μ1 ≥ μ2, as the pH value for poisonous lakes will be higher on average.
Here f (⋅) denotes the conditional density function for the pH value. You decide to use the EM algorithm to train the above GMM on the dataset D.
(a) Using the above GMM, provide an expression of the probability that a lake is poisonous given that its pH level is measured to be l.
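(Aside, not part of the exam: the quantity in part (a) is a Bayes' rule computation under the GMM above. The helper below evaluates it for given parameter values, assuming scipy is available; the example numbers are made up.)

```python
from scipy.stats import norm

def prob_poisonous(l, p1, mu1, sigma1, p2, mu2, sigma2):
    """Posterior Pr(lake is poisonous | pH = l) under the two-component GMM.
    sigma1 and sigma2 are standard deviations."""
    f1 = norm.pdf(l, loc=mu1, scale=sigma1)   # density of the pH given poisonous
    f2 = norm.pdf(l, loc=mu2, scale=sigma2)   # density of the pH given healthy
    return p1 * f1 / (p1 * f1 + p2 * f2)

# Example with made-up parameter values:
print(prob_poisonous(7.5, p1=0.4, mu1=8.0, sigma1=1.0, p2=0.6, mu2=6.5, sigma2=0.8))
```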
5 marks
(b) Write down the pseudocode for the EM algorithm with hard decisions that finds the parameters of the GMM. Assume that we initialize the algorithm in such a way that B2 = {l1, l2, …, lK} denotes one cluster of lakes and B1 = {lK+1, …, lN} denotes its complement.
5 marks
(c) Write down the pseudocode for the EM algorithm with soft decisions that finds the parameters of the GMM. State the initialization of your algorithm explicitly.
5 marks
(d) Suppose that N = 5 and we observe D = {1, 2, 3, 6, 10}. Let B2 = {1, 2, 3} and B1 = {6, 10}. Execute the hard-decision EM algorithm and compute the parameters of the GMM.
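(Aside, not part of the exam: a minimal sketch of one way to implement the hard-decision EM loop of parts (b) and (d). The stopping rule, variable names, and the choice of the maximum-likelihood (1/|B|) variance are illustrative assumptions, not prescriptions from the question.)

```python
import numpy as np

def hard_em(data, k_init, max_iters=100):
    """Hard-decision EM for a two-component 1-D Gaussian mixture.

    data   : pH values, sorted in increasing order.
    k_init : size of the initial low-pH cluster B2 = {l1, ..., lK}; B1 is its complement.
    Returns (p, mu, var), each indexed as [cluster 1, cluster 2].
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    # assignment[i] = 1 means cluster 2 (B2), assignment[i] = 0 means cluster 1 (B1)
    assignment = (np.arange(n) < k_init).astype(int)

    for _ in range(max_iters):
        # M-step: proportions, means, variances from the current hard assignment
        # (a guard for empty clusters is omitted for brevity).
        p = np.array([np.mean(assignment == c) for c in (0, 1)])
        mu = np.array([data[assignment == c].mean() for c in (0, 1)])
        var = np.array([data[assignment == c].var() for c in (0, 1)])  # ML (1/|B|) variance

        # E-step with hard decisions: assign each point to the component with
        # the largest weighted Gaussian likelihood.
        lik = np.stack([
            p[c] / np.sqrt(2 * np.pi * var[c]) * np.exp(-(data - mu[c]) ** 2 / (2 * var[c]))
            for c in (0, 1)
        ])
        new_assignment = lik.argmax(axis=0)
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment

    return p, mu, var

# Part (d) numbers: D = {1, 2, 3, 6, 10}, B2 = {1, 2, 3}, B1 = {6, 10}.
print(hard_em([1, 2, 3, 6, 10], k_init=3))
```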
4. (20 MARKS) Consider a binary linear classification problem where the data vectors x ∈ R2 and the labels y ∈ {−1, +1}. Our dataset consists of the following labeled data points:
X1 = (−1, 0), y1 = −1
X2 = (2, 0), y2 = −1
X3 = (0, −1), y3 = −1
X4 = (0, 2), y4 = +1
X5 = (0, 3), y5 = +1
We consider linear classifiers of the form ŷ = sign(w1x1 + w2x2 + b), where w = (w1, w2) is the weight vector and b is the bias.
Consider the two classifiers shown in the figure below. The classification boundary is shown by the dotted line. In addition, we also show two lines parallel to the classification boundary passing through the nearest training points in each class.
[Figure: two candidate classifiers, (a) and (b), each plotted in the x1-x2 plane with the five data points, a dotted classification boundary, and two parallel lines through the nearest training point in each class.]

3 marks
(a) Which of the two classifiers has a larger margin?

5 marks
(b) Using the given data points, write down a quadratic programming problem to find the maximum margin classifier. You can express your optimization as minimizing an objective function f(w, b) subject to constraints of the form gn(w, b) ≥ 1 for n = 1, 2, …, 5, and specify the functions f(⋅) and gn(⋅).
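(Aside, not part of the exam: a maximum-margin problem of this form can be handed to a general-purpose solver. The sketch below minimizes (1/2)‖w‖² subject to yn(w·xn + b) ≥ 1 using scipy, with the labels as listed above; the solver and starting point are illustrative choices.)

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[-1.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 2.0], [0.0, 3.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0])

def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)            # f(w, b) = (1/2) ||w||^2

constraints = [
    {"type": "ineq", "fun": lambda v, n=n: y[n] * (X[n] @ v[:2] + v[2]) - 1.0}
    for n in range(len(y))
]                                         # g_n(w, b) = y_n (w . x_n + b) >= 1

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w_opt, b_opt = res.x[:2], res.x[2]
print("w =", w_opt, "b =", b_opt, "margin =", 1.0 / np.linalg.norm(w_opt))
```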
7 marks
(c) Show/argue that the classifier in figure (a) above achieves the optimal margin for the given points. Also specify the support vectors.
5 marks
(d) Provide an upper bound on the leave-one-out cross-validation error for the classification rule in part (c).
5. (20 MARKS) Consider linear regression with training data obtained from a noisy target function y = f(x) = cT x + ε, where y ∈ R, x ∈ Rd+1 (with bias term x0 = 1 as usual), c ∈ Rd+1, and the noise term ε is independently generated with zero mean and variance σ². Suppose we have a set of N such training examples, D = {(x1, y1), (x2, y2), …, (xN, yN)}, i.e., yn = cT xn + εn for n = 1, 2, …, N.
Let y = [y1, y2, …, yN]T, ε = [ε1, ε2, …, εN]T, and X be the N × (d+1) matrix whose n-th row is xnT. We assume the columns of X are linearly independent, so that XT X is invertible. Let wLS be the linear least-squares estimator, i.e., the in-sample error Ein(w) = (1/N) ∥Xw − y∥² is minimized when w = wLS.

5 marks
(a) Show that wLS = X†y, where X† is the pseudo-inverse of X, i.e., X† = (XT X)−1 XT. You may use without proof the fact that the gradient of Ein(w) with respect to w is (2/N) XT(Xw − y).
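(Aside, not part of the exam: a short numerical illustration of the pseudo-inverse solution in part (a) on randomly generated data; the sizes and noise level are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # first column is the bias term x0 = 1
c = rng.normal(size=d + 1)
y = X @ c + 0.1 * rng.normal(size=N)                         # noisy targets

w_ls = np.linalg.inv(X.T @ X) @ X.T @ y                      # X† y with X† = (XT X)^-1 XT
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]               # library least squares, for comparison
print(np.allclose(w_ls, w_lstsq))                            # both minimize Ein(w)
```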
5 marks
(b) Find an expression for Ein(wLS) as a function of only N, X, X†, and ε.
5 marks
(c) Observe the result in Part (b) for the extreme case of d = 0, i.e., X = [1, 1, …, 1]T, an N × 1 column of ones. Find the expected in-sample error Eε[Ein(wLS)] as a function of N and σ. (Side comment: you should see that Eε[Ein(wLS)] strictly increases in N and converges to σ². A similar conclusion holds for general d values, but you do not need to prove that.)

5 marks
(d) For any out-of-sample data point x, the expected out-of-sample error is Eε[(f(x) − wLST x)²]. Show that it is at least σ².
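(Aside, not part of the exam: a simulation of the d = 0 case in part (c). The noise is drawn as Gaussian purely for illustration; the question only assumes zero mean and variance σ².)

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
num_trials = 100_000

for N in (1, 2, 5, 20, 100):
    eps = sigma * rng.normal(size=(num_trials, N))   # noise; Gaussian chosen only for the simulation
    # With d = 0 and X a column of ones, wLS is the sample mean of y, so Ein(wLS)
    # is the average squared deviation of the noise from its sample mean.
    ein = ((eps - eps.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
    print(N, ein.mean())   # should approach sigma^2 from below as N grows
```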