1. (25 pts) Perceptron
(a) Write down the perceptron learning rule by filling in each blank below with the proper sign (+ or -).
i. Input x is falsely classified as negative:
   w_{t+1} = w_t ___ x
ii. Input x is falsely classified as positive:
   w_{t+1} = w_t ___ x
(b) Consider the perceptron algorithm learning a 3-dimensional weight vector $w = [w_0, w_1, w_2] \in \mathbb{R}^3$, with $w_0$ being the bias term. Suppose we have the following training set:

Sample #    x          y
1           [10, 10]   +1
2           [1, 0]     -1
3           [3, 3]     -1
4           [4, 8]     +1

Show the weights at each step of the perceptron learning algorithm. Loop through the training set once (i.e., MaxIter = 1) in the order given in the table above. Start the algorithm with initial weights w = [w0, w1, w2] = [0, 1, 1].
(c) Did the perceptron algorithm converge after the single iteration in (b)? Why or why not?
(d) If we run the algorithm for more iterations, will it converge? Why or why not?
(e) Suppose we get w = [w0, w1, w2] = [-10, 2, 1] when the algorithm terminates. What is the distance from Sample #3 to this learned hyperplane?
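For reference, a minimal NumPy sketch of the update rule in part (a); the helper names, the bias-absorbing feature vector [1, x1, x2], and the choice to update when w·x = 0 are illustrative assumptions rather than part of the problem statement.

```python
import numpy as np

def perceptron_step(w, x, y):
    """One perceptron update. x is the augmented input [1, x1, x2] (bias absorbed
    into w0), y is the label (+1 or -1). Misclassified points move w by +x or -x."""
    if y * np.dot(w, x) <= 0:   # falsely classified (treating w.x = 0 as a mistake)
        w = w + y * x           # y = +1: add x (false negative); y = -1: subtract x (false positive)
    return w

def one_pass(w, X, Y):
    """Loop through the training set once (MaxIter = 1) in the given order."""
    for x, y in zip(X, Y):
        w = perceptron_step(w, np.append(1.0, x), y)
    return w

def distance_to_hyperplane(w, x):
    """Distance from a point x = [x1, x2] to the hyperplane w0 + w1*x1 + w2*x2 = 0."""
    return abs(w[0] + np.dot(w[1:], x)) / np.linalg.norm(w[1:])
```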
2. (20 pts) Linear Regression
You are given the following three data points:
(x1, y1) = (0, 4), (x2, y2) = (2, 0), (x3, y3) = (4, 0).
You want to fit a line, i.e., $\hat{y} = w_1 x + w_0$, that minimizes the following sum of squared errors:
$$J(w) = \sum_{i=1}^{3} (w_1 x_i + w_0 - y_i)^2.$$
In matrix-vector form, the objective function is $J(w) = \|Xw - y\|^2$, for some X, y and $w = [w_0, w_1]^T$.
(a) What are X and y?
(b) What is the optimal w that minimizes the objective function?
(c) Draw the three data points and the fitted line.
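A small NumPy sketch of the setup asked for in (a)-(b), assuming the data points as reconstructed above; np.linalg.lstsq stands in for solving the normal equations, and the variable names are illustrative.

```python
import numpy as np

# Data points (x_i, y_i) and the design matrix with a leading column of ones
# so that X @ [w0, w1] produces w1*x + w0.
x = np.array([0.0, 2.0, 4.0])
y = np.array([4.0, 0.0, 0.0])
X = np.column_stack([np.ones_like(x), x])

# Minimize ||Xw - y||^2 (equivalently, solve the normal equations X^T X w = X^T y).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
w0, w1 = w
print(f"fitted line: yhat = {w1:.3f} * x + {w0:.3f}")
```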
3. (25 pts) Regularized Least Squares and Gradient Descent
Suppose you have n data points $\{(x_i, y_i)\}$ where $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$. You want to minimize the following loss function:
$$J(w) = \frac{1}{2}\sum_{i=1}^{n} (w^T x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2,$$
where $\lambda$ is a constant, $w \in \mathbb{R}^m$, and $w_j$ is the j-th element of w.
(a) Let
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$
Write the loss function in matrix-vector form using X, y and w.
(b) Derive the analytical solution of w that minimizes the loss function.
Note: you may assume the matrix $X^T X + \lambda I$ is invertible. You also do not have to show that the w you found minimizes the loss function rather than maximizing it.
(c) Denote the j-th element of xi as xi,j. Find the gradient of J(w) with respect to wj.
(d) With a learning rate $\eta$, derive the stochastic gradient descent (SGD) rule to minimize the loss function for parameter wj.
Note: You don’t have to normalize the gradient by its norm.
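A sketch of the closed-form minimizer from (b) and one SGD update from (d), under the loss written above; how the regularizer is apportioned across stochastic updates is a convention choice, and all names here are illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form minimizer w = (X^T X + lam*I)^(-1) X^T y of
    (1/2)||Xw - y||^2 + (lam/2)||w||^2, assuming X^T X + lam*I is invertible."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

def sgd_step(w, x_i, y_i, lam, eta):
    """One stochastic update based on a single sample (x_i, y_i).
    Per coordinate: w_j <- w_j - eta * ((w.x_i - y_i) * x_ij + lam * w_j),
    applying the full regularizer gradient at every step (one common convention)."""
    grad = (w @ x_i - y_i) * x_i + lam * w
    return w - eta * grad
```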
4. (10 pts) Logistic Regression
Show that the derivative of the sigmoid function $\sigma(x) = \frac{1}{1+\exp(-x)}$ satisfies $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
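A quick numerical sanity check of the identity (not a proof): compare a central finite-difference estimate of the derivative with $\sigma(x)(1 - \sigma(x))$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
h = 1e-6
finite_diff = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # numerical derivative
closed_form = sigmoid(x) * (1.0 - sigmoid(x))              # claimed identity
print(np.max(np.abs(finite_diff - closed_form)))           # should be tiny (~1e-10 or less)
```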
5. (20 pts) Decision Tree
Decision trees have been used for medical diagnosis because they can be easily interpreted by humans. Eight patients were asked to answer yes (1) or no (0) for several symptoms. Their answers, and whether they were diagnosed with XDisease, are summarized in the following table:
Patient #   Fatigue   Fever   Cough   Headache   XDisease
1           0         0       0       1          0
2           0         1       0       1          1
3           1         0       0       1          0
4           1         1       1       1          1
5           0         1       1       0          1
6           1         1       0       0          1
7           0         0       0       0          0
8           1         0       1       1          1
(a) What is the binary entropy of this data set, i.e., H(XDisease)?
(b) Calculate the conditional entropy H(XDisease | X), for X ∈ {Fatigue, Fever, Cough, Headache}, i.e., the conditional entropy of XDisease conditioned on each feature.
(c) Calculate the information gain
I(XDisease; X) = H(XDisease) - H(XDisease | X),
for X ∈ {Fatigue, Fever, Cough, Headache}.
(d) Based on the information gain, determine the first feature to split on.
(e) Suppose we get the following tree as our final decision tree. Determine whether patients 9 and 10 are positive for XDisease based on this decision tree.
Patient #   Fatigue   Fever   Cough   Headache   XDisease
9           1         0       1       0          ?
10          1         0       0       0          ?
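A sketch of the entropy and information-gain computations for (a)-(c), with the feature columns transcribed from the first table; the function names are illustrative.

```python
import numpy as np

def entropy(labels):
    """Binary entropy (in bits) of an array of 0/1 labels; empty or pure sets give 0."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def information_gain(feature, target):
    """I(target; feature) = H(target) - sum_v P(feature = v) * H(target | feature = v)."""
    feature, target = np.asarray(feature), np.asarray(target)
    h_cond = sum((feature == v).mean() * entropy(target[feature == v]) for v in (0, 1))
    return entropy(target) - h_cond

# Columns from the table of patients 1-8.
fatigue  = [0, 0, 1, 1, 0, 1, 0, 1]
fever    = [0, 1, 0, 1, 1, 1, 0, 0]
cough    = [0, 0, 0, 1, 1, 0, 0, 1]
headache = [1, 1, 1, 1, 0, 0, 0, 1]
xdisease = [0, 1, 0, 1, 1, 1, 0, 1]

for name, col in [("Fatigue", fatigue), ("Fever", fever), ("Cough", cough), ("Headache", headache)]:
    print(name, round(information_gain(col, xdisease), 4))
```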