ECE M146: Introduction to Machine Learning
Instructor: Lara Dolecek
TA: Zehui (Alex) Chen (chen1046@ucla.edu)
Homework 3
Assigned: Monday, April 12, 2021
Due: Monday, April 26, 2021
Please upload your homework to Gradescope by April 26, 4:00 pm, as a single PDF. You may type your homework or scan your handwritten version; make sure all the work is legible.
Note: All logs in the decision tree problems are base 2.
1. Consider the modified objective function of regularized logistic regression:
$$J(w) = -\sum_{n=1}^{N}\left[y^{(n)}\log h_w(x^{(n)}) + \left(1-y^{(n)}\right)\log\left(1-h_w(x^{(n)})\right)\right] + \frac{1}{2}\sum_i w_i^2 \qquad (1)$$

where $w \in \mathbb{R}^M$, $x^{(n)} \in \mathbb{R}^M$, $y^{(n)} \in \{0,1\}$, $h_w(x^{(n)}) = \sigma(w^T x^{(n)})$, and the sigmoid function is $\sigma(t) = \frac{1}{1+\exp(-t)}$.

(a) Find the partial derivative $\frac{\partial J}{\partial w_j}$ and derive the gradient descent update rules for the weights. In the above expression, $w_j$ is the $j$-th element of $w$.

(b) When the data is scalar, we have $M = 2$, $x^{(n)} = [x_1^{(n)}, x_2^{(n)}]^T$ where $x_1^{(n)} = 1$, and $w = [w_1, w_2]^T$. Find the Hessian matrix $\nabla_w^2 J(w)$.
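A quick way to sanity-check the gradient you derive in part (a) is to compare it against a finite-difference approximation of $J(w)$. The sketch below is illustrative only (the function and variable names are not part of the assignment) and assumes the objective in equation (1):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def J(w, X, y):
    # Regularized logistic loss from equation (1); X is N x M, y has entries in {0, 1}.
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) + 0.5 * np.sum(w ** 2)

def numerical_grad(w, X, y, eps=1e-6):
    # Central-difference approximation of dJ/dw_j for each component j.
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (J(w + e, X, y) - J(w - e, X, y)) / (2 * eps)
    return g

# Compare numerical_grad(w, X, y) against your analytic expression on small random
# data before using the update rule in gradient descent.
```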
2. In class we have seen the probabilistic interpretation of the logistic regression objective, and the derivation of the gradient descent rule for maximizing the conditional likelihood. We have this maximum likelihood (ML) formulation for $w \in \mathbb{R}^m$:

$$w^* = \arg\max_{w} \prod_{i=1}^{n} P(y_i \mid x_i, w). \qquad (2)$$
To prevent overfitting, we want the weights to be small. To achieve this, we consider some prior distribution of the weights $f(w)$ and use maximum a posteriori (MAP) estimation:

$$w^* = \arg\max_{w} \prod_{i=1}^{n} P(y_i \mid x_i, w)\, f(w). \qquad (3)$$
Assuming a standard Gaussian prior $N(0, I)$ on the weight vector, show that the MAP estimation is equivalent to minimizing the logistic regression objective with L2 regularization as shown in the previous question. Note: for $Z \sim N(0, I_m)$,

$$f_Z(z) = \frac{1}{(2\pi)^{m/2}} \exp\left(-\sum_{i=1}^{m} \frac{z_i^2}{2}\right).$$
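A standard first step (not specific to this problem): since the logarithm is strictly increasing, maximizing the MAP objective in (3) is equivalent to minimizing its negative logarithm:

$$\arg\max_{w} \prod_{i=1}^{n} P(y_i \mid x_i, w)\, f(w) \;=\; \arg\max_{w} \left[ \sum_{i=1}^{n} \log P(y_i \mid x_i, w) + \log f(w) \right] \;=\; \arg\min_{w} \left[ -\sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \log f(w) \right].$$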
3. You are trying to choose between two restaurants (sample 9 and sample 10) to eat at. To do this, you will use a decision tree trained on your past experiences (samples 1-8). The features of each restaurant and your judgment on the goodness of samples 1-8 are summarized in the following chart.
Sample #   HasOutdoorSeating   HasBar   IsClean   HasGoodAtmosphere   IsGoodRestaurant
1          0                   0        0         1                   1
2          1                   1        0         0                   0
3          0                   1        1         1                   1
4          1                   0        0         1                   1
5          1                   1        1         0                   0
6          1                   0        1         0                   1
7          1                   1        0         1                   1
8          0                   0        1         1                   1
9          0                   1        0         1                   ?
10         1                   1        1         1                   ?
(a) What is the entropy of IsGoodRestaurant, i.e., H(IsGoodRestaurant)?
(b) Calculate the conditional entropy of IsGoodRestaurant conditioning on HasOutdoorSeating. To do this, first compute H(IsGoodRestaurant | HasOutdoorSeating = 0) and H(IsGoodRestaurant | HasOutdoorSeating = 1), then weigh each term by the probabilities P(HasOutdoorSeating = 0) and P(HasOutdoorSeating = 1), respectively. Namely, calculate the following:

H(IsGoodRestaurant | HasOutdoorSeating)
  = P(HasOutdoorSeating = 0) H(IsGoodRestaurant | HasOutdoorSeating = 0)
  + P(HasOutdoorSeating = 1) H(IsGoodRestaurant | HasOutdoorSeating = 1).
(c) Similarly, calculate H(IsGoodRestaurant | X) for X ∈ {HasBar, IsClean, HasGoodAtmosphere}, i.e., the conditional entropy of IsGoodRestaurant conditioning on each of the other three features.
(d) Calculate the information gain I(IsGoodRestaurant; X) = H(IsGoodRestaurant) − H(IsGoodRestaurant | X) for X ∈ {HasOutdoorSeating, HasBar, IsClean, HasGoodAtmosphere}. (A short numerical check of these quantities is sketched after part (g).)
(e) Based on the information gain, determine the first attribute to split on.
(f) Make the full decision tree. Start at the root with the first splitting attribute you found in part (e). After each split, treat the sets of samples with X = 0 and X = 1 as two separate sets and redo (b), (c), (d), and (e) on each of them. X is the feature used for the previous split and is thus excluded from the available features that can be split on next. Terminate splitting if, after the previous split, the entropy of IsGoodRestaurant in the current set is 0. For example, if we choose HasGoodAtmosphere as our first feature to split on, we get H(IsGoodRestaurant | HasGoodAtmosphere = 1) = 0. We thus stop splitting the tree in this branch. Draw the tree and indicate the split at each node.
(g) Now, determine if restaurants 9 and 10 are good or not.
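The entropy and information-gain quantities in parts (a)-(e) can be checked numerically. A minimal sketch, assuming the chart above is stored as feature columns for samples 1-8 (the variable names are illustrative, not part of the assignment):

```python
import math
from collections import Counter

# Feature columns for training samples 1-8 from the chart above.
data = {
    "HasOutdoorSeating": [0, 1, 0, 1, 1, 1, 1, 0],
    "HasBar":            [0, 1, 1, 0, 1, 0, 1, 0],
    "IsClean":           [0, 0, 1, 0, 1, 1, 0, 1],
    "HasGoodAtmosphere": [1, 0, 1, 1, 0, 0, 1, 1],
}
labels = [1, 0, 1, 1, 0, 1, 1, 1]  # IsGoodRestaurant

def entropy(ys):
    # Entropy (base 2, per the note at the top of the assignment) of a list of labels.
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def conditional_entropy(xs, ys):
    # H(Y | X) = sum over values v of P(X = v) * H(Y | X = v).
    n = len(ys)
    h = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h += (len(subset) / n) * entropy(subset)
    return h

for feature, xs in data.items():
    gain = entropy(labels) - conditional_entropy(xs, labels)
    print(f"I(IsGoodRestaurant; {feature}) = {gain:.4f}")
```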
4. When training decision trees for classification, we generally use entropy or the Gini index to help determine good splits. These functions are examples of impurity measures. We can formally define impurity measures as follows.
Assume we are performing binary classification. Let $V$ be a set of data points and let $\{y_i \in \{+1, -1\}, \forall i \in V\}$ be the set of labels. Let $V_1 \subseteq V$. We define $p$ and $q$ as follows:

$$p(V_1, V) = \frac{|V_1|}{|V|}, \qquad q(V_1) = \frac{|\{i : i \in V_1, y_i = 1\}|}{|V_1|},$$
where | · | is the cardinality of a set.
Let $i(q(V))$ measure the impurity of a set $V$. Two desired properties of $i(q(V))$ are:

• $i(q(V)) = 0$ when the set has only one class;
• $i(q(V))$ reaches its maximum value for sets where the two classes are perfectly balanced.
When a split is performed on a set $V$, the result is two sets $V_1 \cup V_2 = V$ such that $V_1 \cap V_2 = \emptyset$. The information gain of this split is defined as

$$I(V_1, V_2, V) = i(q(V)) - \big(p(V_1, V)\, i(q(V_1)) + p(V_2, V)\, i(q(V_2))\big).$$

Using the previous notation, we define the binary Gini index and binary entropy as:

$$\mathrm{gini}(q(V)) = 2\, q(V)\, (1 - q(V)),$$
$$H(q(V)) = -\big(q(V)\log(q(V)) + (1 - q(V))\log(1 - q(V))\big).$$
(a) Like binary entropy, the Gini index is a commonly used impurity measure. Plot both the Gini index and the binary entropy function for q(V) from 0 to 1. Comment on their similarities. (A plotting sketch appears after part (d).)
(b) Show that if $i(q(V))$ is concave in $q(V)$, then $I(V_1, V_2, V) \ge 0$ for all $V_1 \cup V_2 = V$ with $V_1 \cap V_2 = \emptyset$. This means that no split loses information. Recall that a function $i$ is concave if $i(\lambda a_1 + (1-\lambda) a_2) \ge \lambda\, i(a_1) + (1-\lambda)\, i(a_2)$ for all $a_1, a_2$ and all $\lambda \in [0, 1]$.
(c) Show that binary entropy is concave. Hint: show that the second derivative is always non-positive.
(d) Show that the binary Gini index is concave.
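For part (a), a minimal plotting sketch (assuming matplotlib is available; logs are base 2 per the note at the top of the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot both impurity measures over q in (0, 1).
q = np.linspace(1e-6, 1 - 1e-6, 500)  # avoid log(0) at the endpoints

gini = 2 * q * (1 - q)
entropy = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

plt.plot(q, gini, label="Gini index")
plt.plot(q, entropy, label="binary entropy (base 2)")
plt.xlabel("q(V)")
plt.ylabel("impurity")
plt.legend()
plt.show()
```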
5. Consider the following plots:
[Four example plots: (1) Decision Tree Example 1, (2) Decision Tree Example 2, (3) Decision Tree Example 3, (4) Decision Tree Example 4.]
Assume that you are using a binary split decision tree to classify the points. At each node, you may ask questions like: "Is $f_1 \le 3$?" or "Is $f_2 \le 1.8$?". Here, $f_1$ and $f_2$ are the variables on the horizontal and vertical axes of the examples, respectively.
(a) Which examples can be fully separated using a depth 2 decision tree?
(b) Which example would have the most complex decision tree in terms of the number of splits? Explain why.
(c) If you used a depth 4 decision tree, is one more example now separable? If so, show how you can separate it using a depth 4 decision tree.
6. In this problem, you will implement the logistic regression algorithm and test it on our running datasets UCLA EE grad 2030.csv and UCLA EE grad 2031.csv. You can find a detailed description of the datasets in HW1. Note that the labels are -1/+1 in the dataset instead of 0/1, so a conversion may be needed.
In the logistic regression algorithm, we wish to find a vector $w = [w_1, w_2, b]^T \in \mathbb{R}^3$ such that the following logistic loss is minimized:

$$J(w) = -\sum_{n=1}^{N}\left[y^{(n)}\log h_w(x^{(n)}) + \left(1-y^{(n)}\right)\log\left(1-h_w(x^{(n)})\right)\right], \qquad (4)$$

where $x^{(n)} \in \mathbb{R}^3$, $y^{(n)} \in \{0,1\}$, $h_w(x^{(n)}) = \sigma(w^T x^{(n)})$, and the sigmoid function is $\sigma(t) = \frac{1}{1+\exp(-t)}$. Note that the two-dimensional feature vectors need to be augmented with a 1 to include the bias term.
(a) For each dataset, starting with the weights all zeros, train the logistic regression model with the gradient descent algorithm using learning rate 0.01 for 10000 iterations. Do not normalize the gradient by its norm. For iterations 5, 100, 500, 1000, 5000, and 10000, plot the data and the learned hyperplane. For these iterations, also report the weights, the loss J(w), and the classification accuracy of the hyperplane. (A minimal training-loop sketch appears after part (c).)
(b) Comment on the convergence of this algorithm on the linearly non-separable dataset. Does the algorithm converge and why?
(c) Comment on the convergence of this algorithm on the linearly separable dataset. Does the algorithm converge? If not, does the separating hyperplane change a lot after iteration 1000? Does the loss change a lot after iteration 1000? Give a brief explanation of your observations.
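A minimal sketch of the training loop in part (a), assuming the CSV files have already been loaded into an N x 2 feature matrix X2 and a label vector y converted from -1/+1 to 0/1 (the loading code and plotting are omitted; the gradient used is the standard expression for the loss in equation (4)):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic_regression(X2, y, lr=0.01, iters=10000):
    """X2: N x 2 raw features; y: labels already converted to {0, 1}."""
    N = X2.shape[0]
    X = np.hstack([X2, np.ones((N, 1))])        # augment each feature vector with a 1
    w = np.zeros(3)                             # w = [w1, w2, b], initialized to zero
    report_at = {5, 100, 500, 1000, 5000, 10000}
    for it in range(1, iters + 1):
        h = sigmoid(X @ w)
        grad = X.T @ (h - y)                    # gradient of the loss in equation (4)
        w -= lr * grad                          # plain (unnormalized) gradient step
        if it in report_at:
            h = sigmoid(X @ w)
            eps = 1e-12                         # guard against log(0) on separable data
            loss = -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
            acc = np.mean((h >= 0.5) == y)
            print(f"iter {it}: w = {w}, loss = {loss:.4f}, accuracy = {acc:.3f}")
    return w
```

The same loop can be run on both datasets; only the loaded file changes.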
7. In this problem, you will consider a k-nearest neighbor classifier with the Euclidean distance metric on a binary classification task. We assign to the test point the class of the majority of its k nearest neighbors.
Consider the following two datasets:
Table 1: KNN Example 1
Table 2: KNN Example 2
(a) For k 2 {1, 3, 5, 7}, what values of k minimize leave-one-out cross-validation error for each dataset? What is the resulting validation error?
(b) In general, what are the drawbacks of using too big a k? What about too small a k? To see this, calculate the leave-one-out cross-validation error for the dataset in Figure 1 using k = 1 and k = 13. (A sketch of the leave-one-out procedure appears below.)
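A minimal sketch of leave-one-out cross-validation with a majority-vote k-NN classifier, assuming the points from the tables are available as arrays of (f1, f2) coordinates and -1/+1 labels (the actual data from Tables 1 and 2 is not reproduced here, and the variable names in the usage comment are hypothetical):

```python
import numpy as np

def knn_predict(train_pts, train_labels, query, k):
    # Majority vote among the k nearest neighbors under Euclidean distance.
    d = np.linalg.norm(train_pts - query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = train_labels[nearest]
    return 1 if np.sum(votes == 1) > k / 2 else -1

def loocv_error(points, labels, k):
    # Leave each point out in turn, predict it from the rest, and count mistakes.
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    errors = 0
    for i in range(len(points)):
        mask = np.arange(len(points)) != i
        pred = knn_predict(points[mask], labels[mask], points[i], k)
        errors += int(pred != labels[i])
    return errors / len(points)

# Example usage with labels encoded as -1/+1:
# for k in (1, 3, 5, 7):
#     print(k, loocv_error(example1_points, example1_labels, k))
```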