
UNIVERSITY OF MICHIGAN
Department of Electrical Engineering and Computer Science
EECS 445 — Introduction to Machine Learning, Winter 2022
Homework 1 (50 pts)
Due: Wednesday, January 26th at 10:00pm

Submission: Please upload your completed assignment to Gradescope. Your submission can be either handwritten or typed. Assign all pages (including pages containing code) to their respective questions. Incorrectly assigned pages will not be graded.


1 Early Draft Upload (1 pt)
To ensure that you have Gradescope access and are familiar with the submission interface, please upload a (possibly incomplete) draft of your solution to Gradescope by Tuesday, January 25th at 10:00pm (local time).
2 Vectors & Planes Review (7 pts)
Classifiers like perceptron and SVM that we will study learn a decision boundary in order to classify points. In this problem, we will work through some of the underlying mathematics.
Notation: ||x ̄|| is used to denote the l2-norm, or magnitude, of x ̄.
(a) (2 pts) In machine learning, we often represent data as vectors. The cosine of the angle between vectors can be a useful measure of similarity. Give a formula for computing the angle between two d-dimensional vectors x ̄(1) and x ̄(2). Suppose x ̄(1) = [1, √3]T and x ̄(2) = [−1.5, 0]T , what is the magnitude of each vector and what is the angle between them?
Solution:
• The angle between x̄(1) and x̄(2) is α = cos⁻¹( (x̄(1) · x̄(2)) / (||x̄(1)|| ||x̄(2)||) ), where ||x̄|| = √(x1² + … + xd²) is the l2 vector norm of x̄.
• ||x̄(1)|| = √(1² + (√3)²) = 2
• ||x̄(2)|| = √((−1.5)² + 0²) = 1.5
• x̄(1) · x̄(2) = −1.5
• α = cos⁻¹( −1.5 / (2 · 1.5) ) = cos⁻¹(−1/2) = 120 degrees (or 2π/3 radians)
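The numbers above can be double-checked with a short numpy snippet (an optional illustration, not part of the required submission):

    import numpy as np

    x1 = np.array([1.0, np.sqrt(3.0)])
    x2 = np.array([-1.5, 0.0])
    cos_alpha = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    print(np.linalg.norm(x1), np.linalg.norm(x2))   # ≈ 2.0, 1.5
    print(np.degrees(np.arccos(cos_alpha)))         # ≈ 120.0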

(b) (1 pt) Fix a d-dimensional vector θ̄ = [θ1, …, θd]T. Let S be the hyperplane in the space Rd that includes all points x̄ such that θ1 x1 + θ2 x2 + … + θd xd = 0. Find all of the d-dimensional vectors that are orthogonal to S and have a length of C. Justify your answer.
Solution: For every x̄ ∈ S, by definition of S, x̄ · θ̄ = 0, so x̄ and θ̄ are orthogonal. Thus θ̄ = [θ1, …, θd]T is orthogonal to the hyperplane S, and any vector orthogonal to this plane must be a linear scaling of θ̄. Its length is the l2 norm ||θ̄||, so we can scale θ̄ to have the desired length by setting v̄ = C θ̄ / ||θ̄||.
There is one other vector with these properties: −v̄, which has the same length but points in the opposite direction, and is therefore also orthogonal to the hyperplane S.
(c) (4 pts) In this problem we will formulate how the distance of a point to a hyperplane is measured, a useful tool in constructing decision boundaries. Consider an arbitrary d-dimensional point x ̄ ∈ Rd and a hyperplane through the origin described by θ ̄ = [θ1, . . . , θd]T . The signed distance of x ̄ from the plane is the perpendicular distance between x ̄ and the hyperplane (distance between x ̄ and its projection on the hyperplane), multiplied by +1 if x ̄ lies on the same side of the plane as the vector θ ̄ points and by −1 if x ̄ lies on the opposite side.
(i) (2 pts) Derive the expression for the signed distance of x ̄ from the hyperplane.
Solution: Suppose x ̄0 is an arbitrary point on the plane and v ̄ = x ̄−x ̄0. Note that v ̄ represents a vector from an arbitrary point x0 on the hyperplane to x ̄. To find the signed distance of x ̄ from the hyperplane, we need to first determine the projection of v ̄ onto θ ̄.
proj_θ̄(v̄) = ( (θ̄ · v̄) / ||θ̄|| ) ( θ̄ / ||θ̄|| )
Note that the signed distance from the point x̄ to the hyperplane is the signed length of this projection, and θ̄ / ||θ̄|| is a unit vector, so the signed distance is
d = (θ̄ · v̄) / ||θ̄|| = (θ̄ · (x̄ − x̄0)) / ||θ̄|| = (θ̄ · x̄) / ||θ̄||
since θ̄ · x̄0 = 0 (x̄0 is a point on the hyperplane).
We could alternatively view this problem geometrically (the accompanying figure, showing x̄, its projection x̄0 onto the hyperplane, and θ̄, is omitted here).
We have x̄ = x̄0 + d (θ̄ / ||θ̄||) and θ̄ ⊥ x̄0 (θ̄ is orthogonal to x̄0), so the following must hold:
θ̄ · x̄ = θ̄ · ( x̄0 + d (θ̄ / ||θ̄||) ) = ||θ̄|| d
Rearranging this gives the same result:
d = (θ̄ · x̄) / ||θ̄||
(ii) (1 pt) Now consider a hyperplane with offset (i.e., θ̄ · x̄ + b = 0). How does the solution from part (i) change?
Solution: If an offset is allowed, the hyperplane can be represented as θ̄ · x̄0 + b = 0, where b is the offset and x̄0 is any point on the plane. Again, following the logic from part (i),
d = (θ̄ · v̄) / ||θ̄||
  = (θ̄ · (x̄ − x̄0)) / ||θ̄||
  = (θ̄ · x̄ − θ̄ · x̄0) / ||θ̄||
  = (θ̄ · x̄ + b) / ||θ̄||
The only difference is that θ̄ · x̄0 now equals −b instead of 0.
(iii) (1 pt) Let p be the plane consisting of the set of points for which 5x1 − 2x2 + 3 = 0.
i. What is the signed perpendicular distance of point x̄ = [1, 1]T from p?
ii. What is the signed perpendicular distance of the origin from p?
Solution: Using the formulation from part (ii), θ̄ = [5, −2]T and b = 3, so ||θ̄|| = √(5² + (−2)²) = √29.
• For x̄ = [1, 1]T:
(θ̄ · [1, 1]T + 3) / ||θ̄|| = ((5 · 1) + (−2 · 1) + 3) / √29 = 6 / √29
• For the origin:
(θ̄ · 0̄ + 3) / ||θ̄|| = 3 / √29
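These two distances can be sanity-checked numerically; the snippet below is an illustrative sketch, not part of the required submission:

    import numpy as np

    theta = np.array([5.0, -2.0])
    b = 3.0

    def signed_dist(x):
        # Signed distance formula from part (ii): (theta . x + b) / ||theta||
        return (theta @ x + b) / np.linalg.norm(theta)

    print(signed_dist(np.array([1.0, 1.0])))  # 6/sqrt(29) ≈ 1.114
    print(signed_dist(np.zeros(2)))           # 3/sqrt(29) ≈ 0.557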
3 Linear Binary Classifiers (5 pts)
Note: Throughout this course, we will use the convention that points that lie on the decision boundary are
misclassified.
In this problem we will build our intuition for constructing optimal decision boundaries.
Consider a natural language processing (NLP) task where we are trying to determine whether an Amazon book review is positive or not. We have checked each review to see if it contains the words “definitely”, “great” and “best”, which we encoded into a feature vector as x̄ = [x1, x2, x3]T, where xi is 1 if the corresponding word (in the order “definitely”, “great”, “best”) appears in the review and 0 otherwise. This is a form of text representation known as a bag of words model. Our training dataset consists of the following reviews and ratings (y(i) represents the label of review i: 1 for a positive review and −1 for a negative review):

Review                                                                              y(i)
“A great thriller that kept me reading late into the night.”                        1
“Didn’t like this book, I understood it but the story line was a let down.”        −1
“Best book ever!! I will definitely read more by this author.”                      1
“This author usually writes great books, but this one definitely disappointed.”    −1

Table 1: Dataset for Question 3
For example, the review “A great thriller that kept me reading late into the night.” would be encoded in the bag of words model as x ̄ = [0, 1, 0]T
We aim to find parameters such that for all x ̄(i) in our training set
sign(θ ̄ · x ̄(i) + b) = y(i)
(a) (1 pt) What is a mathematical representation of the feature space (i.e. all possible feature vectors)?
(b) (1 pt) What is a mathematical representation of the label space (i.e all possible output classes) for this classification problem?
(c) (1 pt) If b = 0 (i.e., no offset), would it be possible to learn such a θ ̄? Explain.
(d) (2 pts) How about if a non-zero offset is allowed? Provide an example of such θ ̄ and b, or explain why it’s not possible.
Solution: In this case we note that each feature has mapping {0, 1}. There are a total of 3 features, so the feature space is {0, 1}3
Solution: The label space is {−1, 1} indicating whether the review was positive or negative.
Solution: No, because the second review is represented by the vector x(2) = [0, 0, 0]T which al- ways lies on the decision boundary. Therefore, it is misclassified by any decision boundary through the origin (without offset), according to the above rule.
Solution: Yes. If we plot the dataset x̄(1) = [0, 1, 0]T, x̄(2) = [0, 0, 0]T, x̄(3) = [1, 0, 1]T, x̄(4) = [1, 1, 0]T in R3, we see that it is linearly separable by a plane. One hyperplane that correctly classifies this data is defined by θ̄ = [−1, 1, 2]T, b = −1/2.
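One way to verify this choice of θ̄ and b (an optional numpy check; the variable names are ours, not part of the assignment):

    import numpy as np

    X = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0]], dtype=float)
    y = np.array([1, -1, 1, -1])
    theta = np.array([-1.0, 1.0, 2.0])
    b = -0.5
    print(np.sign(X @ theta + b) == y)  # [ True  True  True  True]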

4 numpy Exercises (9 pts)
Note: Throughout this course we will heavily utilize numpy in coding problems on homeworks and projects. These questions are meant to help familiarize you with some useful features of numpy. You will apply numpy to a few isolated problems. That way you are not learning features (and resolving bugs) in the middle of more complex problems. Before you can do this problem, you must install numpy and jupyter notebook (see the references section at the end of this homework).
numpy is an extremely efficient library where many underlying operations are often written in C++. Thus it is often much faster (upwards of 100x in many cases) to perform operations on numpy matrices instead of individually on elements using python for loops. For all of these problems do not use loops or list comprehension of any kind. All inputs and outputs of the functions you will implement in this problem should be numpy matrices.
We’ve provided a starter jupyter notebook file (starter code/q4 starter.ipynb) on Canvas to help you work through the following problems. Please make sure to attach a PDF copy of the completed notebook to your write-up and include it in your page assignments.
Notation: ||x ̄|| is used to denote the l2-norm of x ̄.
(a) (1 pt) One use case of numpy is computing the norm of a set of vectors. In this question, given an N × M matrix A, compute a numpy array B of size N such that entry B[i] is the l1-norm of row i of matrix A. (maximum lines of code 1 – note that line limits do not include the function definition lines; no partial credit will be awarded if more lines are used)
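One possible one-line implementation (a sketch of how this could be done, not necessarily the official solution; the function name is our own):

    import numpy as np

    def row_l1_norms(A):
        # B[i] is the l1-norm of row i of A (one vectorized call, no loops).
        return np.linalg.norm(A, ord=1, axis=1)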
(b) (2 pts) Another useful feature in numpy is broadcasting. Sometimes, given a pair of vectors (or more), we wish to construct a matrix from them. In this case, given a numpy array A of size N and a numpy array B of size M, construct and return an N × M matrix C where C[i,j] = A[i] + B[j]. (maximum lines of code 1 – note that line limits do not include the function definition lines; no partial credit will be awarded if more lines are used)
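A possible one-line broadcasting solution (again a sketch with our own function name):

    import numpy as np

    def outer_sum(A, B):
        # C[i, j] = A[i] + B[j]: reshape A to a column so broadcasting gives an N x M result.
        return A[:, np.newaxis] + B[np.newaxis, :]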
(c) (3 pts) Another potential application is assigning points to groups. In this question we will consider a set of cell towers and a set of home addresses. The goal is to find the closest cell tower for each home address. You are given the set of cell towers in matrix A, an M × D matrix which gives the locations of M towers in a D-dimensional space. You are also given an N × D matrix B containing the locations of the home addresses. You must return a numpy array C of size N where C[i] is the cell tower that home i (the ith row of matrix B) should be assigned to.
For example, if the 9th home address B[9,:] is closest to the 6th cell tower A[6,:], then C[9] = 6. (maximum lines of code 4 – note that line limits do not include the function definition lines; no partial credit will be awarded if more lines are used)
HINT: ||x̄ − ȳ||² = ||x̄||² + ||ȳ||² − 2 x̄T ȳ
We’ve provided a buggy solution for this problem in the starter notebook. Please find and fix all the
bugs in the solution for full credit.
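For reference, here is one way the hinted identity could be used (a sketch with our own function name; the buggy notebook version may be organized differently):

    import numpy as np

    def assign_towers(A, B):
        # dist2[i, j] = ||B[i] - A[j]||^2 via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.T y (no loops).
        dist2 = (B ** 2).sum(axis=1)[:, None] + (A ** 2).sum(axis=1)[None, :] - 2 * B @ A.T
        return np.argmin(dist2, axis=1)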
(d) (3 pts) Given M D-dimensional vectors in matrix A and a vector B, find the vector in A with the smallest cos α (where α is the angle between that vector in A and B) and return it as a tuple (the index of this vector, cos α between them). In this case, matrix A is an M × D matrix of points in D-dimensional space and vector B is a D × 1 vector. For example, if it is the vector in row 6, return (6, cos(angle)). You may assume the angle for all vectors is between (−π/2, π/2) radians. (maximum lines of code 5 – note that line limits do not include the function definition lines; no partial credit will be awarded if more lines are used)
HINT: Note that ā · b̄ = ||ā|| ||b̄|| cos(α).
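A possible sketch using this hint (our own function name; not necessarily the official solution):

    import numpy as np

    def min_cosine(A, B):
        # cos(alpha) between each row of A and B, then take the smallest one.
        b = B.ravel()
        cosines = (A @ b) / (np.linalg.norm(A, axis=1) * np.linalg.norm(b))
        idx = int(np.argmin(cosines))
        return idx, cosines[idx]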
5 Perceptron Algorithm Without Offset (9 pts)
Consider a sequence of 2-dimensional data points, x ̄(1), x ̄(2), . . . , x ̄(n) and their corresponding labels,
y(1), y(2), . . . , y(n). Recall the perceptron algorithm updates the parameters whenever y(i) ̸= h(x ̄(i); θ ̄) where h(x ̄(i); θ ̄) = sign(θ ̄ · x ̄(i)), and points are randomly shuffled at the beginning of each iteration. Assume that the points are linearly separable, and that θ ̄ is initialized to zero. Let mi denote the number of times x ̄(i) is misclassified during training.
(a) (4 pts) Derive the final decision boundary for the perceptron in terms of mi, x ̄(i) and y(i). Show your work.
Solution: The perceptron algorithm initializes the parameters θ ̄ to zero. Given that we are per- forming a binary classification, y(i) ∈ {−1, 1} determines the direction of the update uniquely. The perceptron algorithm uses the data point value as the magnitude of the parameter update. By going over the algorithm, we can see that the updates for θ are a cumulative sum of the y(i)x ̄(i) product for the points that are misclassified. Therefore, the final value of θ ̄ is a summation of the product of the number of misses, the direction of each miss, and the value of each data point that was missed. This is represented by the equations
θ̄ = Σ_i m_i y(i) x̄(i)
Hence the final decision boundary is the set of points x̄ satisfying (Σ_i m_i y(i) x̄(i)) · x̄ = 0.
(b) (3 pts) Table 2 shows a dataset and the number of times each point is misclassified until convergence when one applies the perceptron algorithm (without offset).
Assuming θ̄ is initialized to zero, what is the value of θ̄ post training?

i    x̄(i)       y(i)    times misclassified
1    [−1, 1]    −1      4
2    [3, −2]     1      4
3    [−2, 3]     1      5
4    [1, −1]     1      2
5    [1, −4]    −1      1
6    [3, 3]      1      0

Table 2: Dataset and misclassification counts for 5(b)
Solution: θ̄ = 4(−1)[−1, 1] + 4(1)[3, −2] + 5(1)[−2, 3] + 2(1)[1, −1] + 1(−1)[1, −4] + 0(1)[3, 3] = [7, 5]
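This sum can be checked quickly with numpy, using the values from Table 2 above (an optional illustration):

    import numpy as np

    X = np.array([[-1, 1], [3, -2], [-2, 3], [1, -1], [1, -4], [3, 3]], dtype=float)
    y = np.array([-1, 1, 1, 1, -1, 1])
    m = np.array([4, 4, 5, 2, 1, 0])
    theta = (m * y) @ X
    print(theta)  # [7. 5.]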

(c) (1 pt) Given a set of linearly separable points, does the order in which the points are presented to the algorithm affect whether or not the algorithm will converge?
Solution: No. The convergence of the perceptron algorithm is determined solely by whether the data is linearly separable.
(d) (1 pt) In general, if the algorithm converges, could the order affect the uniqueness of the solution that the algorithm converges to?
Solution: The order of the data points has no effect on convergence, but it may affect the total number of mistakes made by the algorithm, thereby potentially changing the boundary that the algorithm outputs.
6 Perceptron Convergence Proof [12 pts]
In lecture, we stated that given linearly separable data, the Perceptron algorithm would always converge in a finite number of steps. We will now prove this statement!
Suppose the training dataset is Sn = {(x ̄(i), y(i))}ni=1 with x ̄(i) ∈ Rd and y(i) ∈ {−1, 1}. Let θ ̄(k) be the normal vector after the kth update.
Assume that ||x ̄(i)|| > 0, ∀i = 1, …, n and that θ ̄ is initialized as θ ̄(0) = ̄0
Recall that the Perceptron update step is
θ ̄(k+1) = θ ̄(k) + y(i)x ̄(i)
Let us assume that there exists a separating hyperplane parameterized by θ̄* ∈ Rd, i.e., y(i) θ̄* · x̄(i) > 0, ∀i.
In addition, we can define a margin γ as:
γ = min_i y(i) θ̄* · x̄(i)

First, we are going to prove some useful facts for our approach.
(a) (1 pt) Given a separating hyperplane parameterized by θ̄*′ ∈ Rd, ||θ̄*′|| ≠ 0, prove that we can parameterize this same hyperplane by θ̄* s.t. ||θ̄*|| = 1.
Solution: Given θ̄*′, our hyperplane S is the set of x̄ ∈ Rd such that θ̄*′ · x̄ = 0.
We can normalize θ̄*′ into θ̄* = θ̄*′ / ||θ̄*′||. This does not change the hyperplane, since
x̄ ∈ S ⟺ θ̄*′ · x̄ = 0
      ⟺ ||θ̄*′|| (θ̄* · x̄) = 0
      ⟺ θ̄* · x̄ = 0   (since ||θ̄*′|| ≠ 0)
Thus, ||θ ̄∗|| = 1, and θ ̄∗ captures the same decision boundary as θ ̄∗′ , as desired.
(b) (2 pts) Now, prove that we can rescale all of our data points so that ||x ̄(i)|| ≤ 1, ∀i = 1, …, n without changing the classification of any of the points with respect to the hyperplane defined by θ ̄∗.
Solution: Let us define
M = max_i ||x̄(i)||
Now we can divide all of our datapoints by this value and ensure that the maximum magnitude is 1. Let us show that, with the new points x̄(i)′ = x̄(i) / M, the classification of x̄(i)′ is the same as that of x̄(i):
sign(θ̄* · x̄(i)′) = sign(θ̄* · (x̄(i) / M)) = sign((1/M)(θ̄* · x̄(i))) = sign(θ̄* · x̄(i))
M does not change the sign in the last step because M > 0.
(c) (4 pts) Let us start with proving a lower bound on the magnitude of θ ̄(k).
(i) Prove that θ ̄(k+1) · θ ̄∗ ≥ θ ̄(k) · θ ̄∗ + γ. (HINT: The Perceptron update step might be helpful here.)

Solution: First we use the definition of the Perceptron update to rewrite θ̄(k+1) · θ̄*. Let m be the index of the (k + 1)th misclassified point.
θ̄(k+1) · θ̄* = (θ̄(k) + y(m) x̄(m)) · θ̄*
            = θ̄(k) · θ̄* + y(m) x̄(m) · θ̄*
            ≥ θ̄(k) · θ̄* + γ
where the last step follows from the definition of the margin γ.
(ii) Prove that ||θ ̄(k+1)|| ≥ (k + 1)γ. (HINT: First prove that θ ̄(k+1) · θ ̄∗ ≥ (k + 1)γ, then write ||θ ̄(k+1)|| = ||θ ̄(k+1)|| × ||θ ̄∗||.)
Solution: We will first show that θ̄(k+1) · θ̄* ≥ (k + 1)γ. By what we proved in (c)(i),
θ̄(k+1) · θ̄* ≥ θ̄(k) · θ̄* + γ ≥ θ̄(k−1) · θ̄* + 2γ ≥ … ≥ θ̄(k−k) · θ̄* + (k + 1)γ = θ̄(0) · θ̄* + (k + 1)γ = (k + 1)γ
since θ̄(0) = 0̄. Now, since ||θ̄*|| = 1, we have the following chain of reasoning:
||θ̄(k+1)|| = ||θ̄(k+1)|| × ||θ̄*|| = (θ̄(k+1) · θ̄*) / cos α ≥ θ̄(k+1) · θ̄* ≥ (k + 1)γ
where α is the angle between the two vectors, and the inequality follows because cos α ≤ 1 always (and θ̄(k+1) · θ̄* ≥ (k + 1)γ > 0, so cos α > 0).
(d) (4 pts) Now we will prove an upper bound on the magnitude of θ ̄(k).
(i) Prove that ||θ ̄(k+1)||2 ≤ ||θ ̄(k)||2 + 1. (HINT: remember that ||x ̄||2 = x ̄ · x ̄.)
Solution:
||θ̄(k+1)||² = θ̄(k+1) · θ̄(k+1)
           = (θ̄(k) + y(m) x̄(m)) · (θ̄(k) + y(m) x̄(m))
           = (θ̄(k) · θ̄(k)) + (y(m) x̄(m) · y(m) x̄(m)) + 2(θ̄(k) · y(m) x̄(m))
           = ||θ̄(k)||² + (y(m))² ||x̄(m)||² + 2(θ̄(k) · y(m) x̄(m))
We know (y(m))² = 1 since y(m) ∈ {−1, 1}, and ||x̄(m)||² ≤ 1 by the assumption proved in (b), so we have:
||θ̄(k+1)||² ≤ ||θ̄(k)||² + 1 + 2(θ̄(k) · y(m) x̄(m))
Finally, since x̄(m) is misclassified, θ̄(k) · y(m) x̄(m) ≤ 0 by definition, so we get:
||θ̄(k+1)||² ≤ ||θ̄(k)||² + 1
(ii) Now prove that ||θ ̄(k+1)||2 ≤ (k + 1).
(e) (1 pt) Relate inequalities in (c)(ii) and (d)(ii) to prove an upper bound on k solely in terms of γ.
Solution to (d)(ii): By what was proven in (d)(i),
||θ̄(k+1)||² ≤ ||θ̄(k)||² + 1 ≤ ||θ̄(k−1)||² + 2 ≤ … ≤ ||θ̄(k−k)||² + (k + 1) = ||θ̄(0)||² + (k + 1) = (k + 1)
so ||θ̄(k+1)||² ≤ (k + 1).
Solution to (e): Combining the inequalities from (c)(ii) and (d)(ii):
||θ̄(k+1)|| ≥ (k + 1)γ and ||θ̄(k+1)||² ≤ (k + 1)
⟹ (k + 1)²γ² ≤ ||θ̄(k+1)||² ≤ (k + 1)
⟹ (k + 1)γ² ≤ 1
⟹ k + 1 ≤ 1/γ²
Thus the total number of updates is at most 1/γ², so the perceptron converges after a finite number of updates.
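To make the bound concrete, here is a small, informal simulation (our own construction, not part of the assignment) of the perceptron on a toy separable dataset with points inside the unit ball; the observed number of mistakes stays below 1/γ²:

    import numpy as np

    rng = np.random.default_rng(0)
    theta_star = np.array([0.6, 0.8])                      # unit-norm separator
    X = rng.uniform(-1.0, 1.0, size=(300, 2))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x|| <= 1
    margins = X @ theta_star
    keep = np.abs(margins) > 0.1                           # enforce a margin of at least 0.1
    X, y = X[keep], np.sign(margins[keep])

    theta, mistakes = np.zeros(2), 0
    for _ in range(200):                                   # passes over the data
        for x_i, y_i in zip(X, y):
            if y_i * (theta @ x_i) <= 0:                   # boundary counts as a mistake
                theta += y_i * x_i
                mistakes += 1

    gamma = np.min(y * (X @ theta_star))
    print(mistakes, "<=", 1.0 / gamma ** 2)                # mistake bound from part (e)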
7 SGD with Logistic Loss (7 pts)
The logistic loss function is defined as
Loss_logistic(z) = ln(1 + e^(−z))
Given this loss function and a learning rate η,
(a) (3 pts) Find the expression for the SGD update of θ̄(k+1) in terms of θ̄(k), x̄(i), y(i) and η. Show your work.

Solution: For a fixed example (x̄(i), y(i)), the SGD update is:
θ̄(k+1) = θ̄(k) − η ∇_θ̄ Loss_log(y(i)(θ̄ · x̄(i)))
The gradient is
∇_θ̄ Loss_log(y(i)(θ̄ · x̄(i))) = ∇_θ̄ ln(1 + e^(−y(i)(θ̄ · x̄(i))))
  = [ e^(−y(i)(θ̄ · x̄(i))) / (1 + e^(−y(i)(θ̄ · x̄(i)))) ] (−y(i) x̄(i))
  = −y(i) x̄(i) / (1 + e^(y(i)(θ̄ · x̄(i))))
Therefore,
θ̄(k+1) = θ̄(k) + η y(i) x̄(i) / (1 + e^(y(i)(θ̄(k) · x̄(i))))
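As a sketch, this update could be implemented in numpy as follows (our own function and variable names):

    import numpy as np

    def sgd_logistic_update(theta, x_i, y_i, eta):
        # One SGD step on the logistic loss for a single example (x_i, y_i).
        return theta + eta * y_i * x_i / (1.0 + np.exp(y_i * (theta @ x_i)))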
(b) (4 pts) Find the expression for the GD update of θ̄(k+1) in terms of θ̄(k), x̄(i), y(i) and η. Show your work.
Solution: The empirical risk is:
J(θ̄) = (1/N) Σ_{i=1}^N Loss_log(y(i)(θ̄ · x̄(i))) = (1/N) Σ_{i=1}^N ln(1 + e^(−y(i)(θ̄ · x̄(i))))
Taking the partial derivative with respect to θj:
∂J(θ̄)/∂θj = (1/N) Σ_{i=1}^N [ 1 / (1 + e^(−y(i)(θ̄ · x̄(i)))) ] · e^(−y(i)(θ̄ · x̄(i))) · (−y(i) xj(i))
          = (1/N) Σ_{i=1}^N (−y(i) xj(i)) / (1 + e^(y(i)(θ̄ · x̄(i))))
So we can obtain ∇_θ̄ J(θ̄) as:
∇_θ̄ J(θ̄) = (1/N) Σ_{i=1}^N (−y(i) x̄(i)) / (1 + e^(y(i)(θ̄ · x̄(i))))
Therefore,
θ̄(k+1) = θ̄(k) + (η/N) Σ_{i=1}^N y(i) x̄(i) / (1 + e^(y(i)(θ̄(k) · x̄(i))))
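And a corresponding full-batch sketch (again with our own names; X is N × d with rows x̄(i), and y is the length-N label vector):

    import numpy as np

    def gd_logistic_update(theta, X, y, eta):
        # Full-batch gradient step: average the per-example terms from part (a).
        weights = y / (1.0 + np.exp(y * (X @ theta)))   # shape (N,)
        return theta + eta * (weights @ X) / X.shape[0]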

References
We are using Python 3.9 for this course (you can find references for the standard library here: https://docs. python.org/3/library/). To make your life easier, we require you to use virtual environments. You can create a virtual environment in your project directory with python3 -m venv env, activate with source env/bin/activate, then install required packages with pip.
This homework uses the following pac
