EECS 189 Introduction to Machine Learning Fall 2020
HW1
This homework is due Tuesday, September 8 at 11:59 p.m.

2 Study Group Icebreaker
Once you have received your group assignment, meet at least once with everyone in the group to introduce yourselves and discuss expectations for the semester. As an icebreaker, everyone should talk about what they want to get out of this machine learning class and why they decided to take it. Your written answer for this problem should be a summary of the group’s answers. This meeting is an excellent opportunity for your group to set up times to work on the homeworks together and discuss how the members prefer to study for this course.
Solution: Your solution should summarize what your group members hope to get out of this course.
3 Exam Policy and Practice
Please read through the entirety of the EECS189 exam proctoring policy (click here) carefully before proceeding. This question is designed to familiarize you with some of the things you will have to do during the exam.
(a) After reading through the Exam Policy carefully, please answer the following questions.
(i) Given you experience no disruptions during the exam, how many total minutes do you have for scanning and submission? Does it matter if you are using a tablet or whether you are using paper to write your answers?
(ii) Are you required to record locally during the exam? How much space should you have available on your computer for a local recording?
(iii) How should you contact the course staff for an emergency situation during the exam?
Solution:
(i) You have a total of 20 minutes for scanning and submission if you experience no disruption and are using paper, and 10 minutes if you are using a tablet. If you experience x minutes of disruption during the exam, you may work for min(x, 15) minutes past the end of the exam.
(ii) You are not required to record locally; you may do a Zoom cloud recording. You should have 5 GB available on your computer if you are doing a local recording.
(iii) You should contact the course staff by making a private post on Piazza. However, you should not be looking at Piazza during the exam other than to make a private post in the case of an emergency.
(b) Please configure your Zoom link.
(i) Fill out the following Google Form (click here) to submit the Zoom link you will be using. You must use this Zoom link for this assignment as well as for the exams. This can easily be done by submitting your Personal Meeting Room link and setting your Personal Meeting ID as your default on all devices you will be using for the final.
(ii) Ensure that anyone can join your Zoom link and that there is no waiting room for your Zoom meeting. You can try to do this by entering the meeting on one device that is logged in to your official Berkeley Zoom account and then entering the meeting on another device that is logged into some other Zoom account. If you are able to do that, then your Zoom link is joinable. If you are not put into a waiting room, then your Zoom meeting will not have a waiting room. (You might want to have a member of your study group try this out with you if you don't already have two Zoom accounts.)
Solution: Ensure your Zoom link is joinable and that the Zoom link in the form is the correct link which you will be using for the exam.
(c) You will now conduct a Zoom recording. You should use this recording to work through a homework problem or other study material to simulate the actual circumstances of the final exam.
(i) Start the Zoom call for the link you provided above. Turn on your microphone and recording device (webcam, phone camera). Turn off your speaker. Share your entire desktop (not just a particular window).
(ii) Start recording via Zoom. You may record locally or on the cloud (see the exam policy document for details).
(iii) Hold your CalID next to your face and record yourself saying your name into the webcam. Both your face and your entire CalID should be visible in the video. We should be able to read your name and SID. This step should take at least 3 seconds. See Figure 1. If you do not have a CalID for some reason, please hold up some document which has an image of you and proves your identity, such as a driver's license.
Figure 1: ID card demonstration. Do not actually black out your SID and name.
(iv) Turn your recording device (webcam, phone) around 360◦ slowly so that we can see your entire room clearly. There should be no uncovered screens anywhere in the room during your exam. Only admin TAs and instructors will be able to see your videos (both for this assignment and for the actual exams).
(v) Position your recording device in such a way that we can see your workspace and your hands. We don't care if we can see your face or not. If you are not using your phone to record your workspace, then it should be visible in the recording, face down. See Figure 2, and think about your setup during this test run. On the actual exam, you will want to use the desktop to see the exam itself. If you are using your laptop's built-in camera, the angle might be bad. In this case, think about how you might position the laptop, or consider using your phone or an external webcam on a stand to record your workspace. We want you to iron out these details ahead of the actual exam so that the exam itself has no additional stress due to proctoring logistics.
Figure 2: Demonstration of taking your exam. Your setup should look like this while you are taking the exam.
(vi) Your microphone should be on at all times. We should be able to see the time on your desktop at all times.
(vii) Record for a full two and a half hours. You should use this time to work through a homework problem or other study material for the course. The more realistic it is to actually taking an exam, the better practice it will be for you. (We also want to make sure that your computer can handle the video encoding task if you are doing a local recording.)
(viii) After two and a half hours, stop the recording. Check your recording to confirm that it contains your video as well as your desktop throughout its duration. Upload your video to Google drive and submit the link to the video using this Google Form (click here). You must make sure that the link sharing permissions are set so that we may view the video.
Solution: Ensure that you were able to submit your video to the Google form.
(d) A Midterm Google document with instructions for the midterm and the time at which you will start taking the exam should be shared with you a few days before the midterm. More details will be made available closer to the exam date.
Link for policy:
https://docs.google.com/document/d/14qO3xRt724j8Fs12i9d4E-pMWuyqk4h8b5gBbnd3hVo/
edit
Form to submit Zoom link:
https://forms.gle/P7CDYVpaqLBV4zez5
Form to submit video link:
https://forms.gle/F5V1J89NKf21Rj7Y6
4 Linear Regression and Adversarial Noise
In this question, we will investigate how the presence of noise in the data can adversely affect the model that we learn from it.
Suppose we obtain a training dataset consisting of n points (xi, yi) where n ≥ 2. In the case of no noise in the system, this set of points lies on a line given by y = w[1]x + w[0], i.e., for each i, yi = w[1]xi + w[0]. The variable x is commonly referred to as the covariate¹ and the y's are referred to as the observations. Suppose that all xi's are distinct and non-zero. Our task is to estimate the slope w[1] and the y-intercept w[0] from the training data. We call the pair (w[1], w[0]) the true model.
¹Besides covariate, some other names for x include feature, regressor, and predictor.
Suppose that an adversary modifies our data by corrupting the observations, so that we now have the training data (xi, ỹi) where ỹi = yi + εi and the noise εi is chosen by the adversary. Note that the adversary has access to the features xi but cannot modify them. Its goal is to trick us into learning a wrong model (ŵ[1], ŵ[0]) from the dataset {(xi, ỹi), i = 1, . . . , n}. We denote by (ŵ[1], ŵ[0]) the
model that we learn from this dataset {(xi, y ̃i), i = 1, . . . , n} using the standard ordinary least-squares regression.
(a) Suppose that the adversary wants us to learn a particular wrong model (w[1]∗, w[0]∗). If we use standard ordinary least-squares regression, can the adversary always (for any choice of w[1]∗ and w[0]∗) fool us by setting a particular value for exactly one εi (and leaving the other observations as they are, i.e., ỹj = yj for j ≠ i), so that we obtain ŵ[1] = w[1]∗ and ŵ[0] = w[0]∗? If yes, justify by providing a mathematical mechanism for the adversary to set the value of the noise term as a function of the dataset {(xi, yi), i = 1, . . . , n} and (w[1]∗, w[0]∗). If no, provide a counterexample.
Solution: The answer is no. Intuitively, the adversary is only in control of one degree of freedom. We now present a counterexample with just two data points.
Remark: Note that all xi’s are non-zero and distinct and any counterexample needs to be compatible with these two pieces of information. The reason behind assuming this was to avoid two corner cases: (1) vertical lines (infinite slope), (2) rank deficiency of the covariate matrix (discussed in Food for thought below).
Let the two pairs of observations be (x1, y1) = (1, 0) and (x2, y2) = (−1, 1). Let X be the feature matrix
$$X = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix},$$
and let y denote the observation vector $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$. The corrupted observation vector can take one of the forms $\tilde{y} = \begin{bmatrix} 0 + \varepsilon_1 \\ 1 \end{bmatrix}$ or $\tilde{y} = \begin{bmatrix} 0 \\ 1 + \varepsilon_2 \end{bmatrix}$. Standard OLS in this case simplifies to finding the solution to the equation $Xw = \tilde{y}$ and hence we have
$$Xw = \tilde{y} \;\Rightarrow\; \hat{w} = X^{-1}\tilde{y} = \frac{1}{2}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}\tilde{y} = \frac{1}{2}\begin{bmatrix} \tilde{y}_1 - \tilde{y}_2 \\ \tilde{y}_1 + \tilde{y}_2 \end{bmatrix}.$$
Note that we can take the inverse because X is a square (and here invertible) matrix. For a more general case, refer to part (b). Thus the OLS solutions are of the form
$$\hat{w} = \frac{1}{2}\begin{bmatrix} \varepsilon_1 - 1 \\ \varepsilon_1 + 1 \end{bmatrix} \quad\text{or}\quad \hat{w} = \frac{1}{2}\begin{bmatrix} -1 - \varepsilon_2 \\ 1 + \varepsilon_2 \end{bmatrix},$$
and it is immediately clear that this adversary CANNOT fool us into choosing, say, $\hat{w} = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$. In fact, there are infinitely many w∗ that the adversary cannot trick us into learning. The reason is that, in both cases, if the adversary uses a certain value of noise to trick us into hallucinating a particular w[1]∗, it can no longer choose the w[0]∗ of its choice, as the OLS solution determines it automatically.
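To make the counterexample concrete, here is a small numpy sketch (not part of the original assignment; the specific values of ε1 are arbitrary): no matter how ε1 is chosen, the learned intercept stays locked one unit above the learned slope, so a target such as (w[1]∗, w[0]∗) = (0, 2) can never be produced.

import numpy as np

X = np.array([[1.0, 1.0],
              [-1.0, 1.0]])        # rows are [x_i, 1]
y = np.array([0.0, 1.0])

def ols(X, y):
    # Ordinary least squares: w_hat = argmin ||Xw - y||^2
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Corrupt only the first observation and sweep over epsilon_1.
for eps1 in [-2.0, 0.0, 1.0, 5.0]:
    y_tilde = y + np.array([eps1, 0.0])
    w_hat = ols(X, y_tilde)                      # [w[1], w[0]]
    # The intercept always equals the slope plus one, so (0, 2) is unreachable.
    print(eps1, w_hat, w_hat[1] - w_hat[0])      # last column prints 1.0 every time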
(b) Repeat part (a) for the case when the adversary can corrupt two observations, i.e., for the case when the adversary can set at most two of the εi's to any non-zero values of its choice.
Solution: Yes. Intuitively, the adversary now has control over two degrees of freedom.
The adversary needs control of two points that have different x coordinates. Consider n points $\{(x_i, y_i)\}_{i=1}^n$, where linear regression recovers $(\hat{w}[1], \hat{w}[0])$. We can start by examining the closed-form solution for least squares. Without loss of generality, let us assume that the first two points are the ones that the adversary is going to corrupt. Let us construct the standard feature matrix
$$X = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}$$
so that we are minimizing $\min_w \|Xw - y\|^2$. At this point, it is immediately obvious that, since the first two rows of X are linearly independent (all xi's are distinct), X has full rank 2. The solution of linear least squares is
$$\hat{w} = \begin{bmatrix} \hat{w}[1] \\ \hat{w}[0] \end{bmatrix} = (X^\top X)^{-1}X^\top \tilde{y} = (X^\top X)^{-1}X^\top (y + \varepsilon),$$
where $\varepsilon = [\varepsilon_1, \varepsilon_2, 0, \ldots, 0]^\top$. Note that the adversary chooses $(\varepsilon_1, \varepsilon_2)$ and wants least squares to yield exactly the line it desires, so the adversary needs to solve the following equation:
$$\begin{bmatrix} w[1]^* \\ w[0]^* \end{bmatrix} - (X^\top X)^{-1}X^\top y = (X^\top X)^{-1}X^\top \varepsilon = (X^\top X)^{-1}\begin{bmatrix} x_1 & x_2 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}.$$
But we know that since $x_1 \neq x_2$, the matrix $\begin{bmatrix} x_1 & x_2 \\ 1 & 1 \end{bmatrix}$ is invertible and so this has the solution
$$\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} = \begin{bmatrix} x_1 & x_2 \\ 1 & 1 \end{bmatrix}^{-1}(X^\top X)\left(\begin{bmatrix} w[1]^* \\ w[0]^* \end{bmatrix} - (X^\top X)^{-1}X^\top y\right).$$
This explicit formula lets the adversary move the least-squares solution to wherever it wants, assuming it has control over the two "outliers" in the first two positions.
(a) The adversary solves for εi in terms of w[1]∗, w[0]∗.
(b) The adversary applies εi and gives the data to you.
(c) You run least squares using the corrupted data and retrieve (ŵ[1], ŵ[0]), which we have already determined to be equal to (w[1]∗, w[0]∗).
Food for thought: In this problem, we were asked to consider the case where all xi's were distinct. If there are not two distinct points with different x coordinates, then all the training points have the same x coordinate. At this point, it is clear that ordinary least squares will simply fail since it will not have enough rank in X itself. The adversary can't remedy this failure since, as stated, it can only perturb the y measurements, not the x ones.
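For concreteness, here is a short numpy sketch (not part of the original assignment; the data, seed, and target model are arbitrary choices) implementing the adversary's recipe above and checking that OLS on the corrupted data returns exactly the target model.

import numpy as np

rng = np.random.default_rng(0)
n = 10
x = rng.normal(size=n)
w_true = np.array([2.0, -1.0])                  # [w[1], w[0]]
X = np.column_stack([x, np.ones(n)])
y = X @ w_true                                  # clean observations

w_target = np.array([-5.0, 3.0])                # the model the adversary wants us to learn

A = np.array([[x[0], x[1]],
              [1.0,  1.0]])                     # X^T restricted to the first two points
XtX = X.T @ X
# eps = A^{-1} (X^T X) (w* - (X^T X)^{-1} X^T y)
eps12 = np.linalg.solve(A, XtX @ (w_target - np.linalg.solve(XtX, X.T @ y)))

y_tilde = y.copy()
y_tilde[:2] += eps12
w_hat = np.linalg.lstsq(X, y_tilde, rcond=None)[0]
print(np.allclose(w_hat, w_target))             # True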
(c) In the context of machine learning and applications, what lessons do you take away after working through this problem?
Solution: As you can see, the adversary can fool us into “learning” an arbitrary model by (substantially) tweaking a few data points. Machine learning is learning from data and thus, our accuracy depends on the noise in the data and the mechanism used to learn the model. As this problem illustrates, when subject to unbounded noise or outliers, ordinary linear least- squares regression can yield a line very far from the true model. Here we saw that if an adversary corrupts a certain number of data points (equal to the degree of freedom in the data generation mechanism), we can be forced to hallucinate an arbitrary model.
In certain applications, it is desirable to learn a model that is not affected too much by a small fraction of the data points. In this course, we will learn about many techniques for making the learning algorithm robust to different types of noise in the data.
In the next problem you are introduced to how Orthogonal Matching Pursuit (OMP) can help hunt for outliers and remove them from the data, producing a more robust version of least-squares. Another example is Reed-Solomon coding, which you may have seen in CS 70. These codes are robust to arbitrary amounts of noise in a few data points. The idea generally is to exploit redundancy in the data to make the model more robust to potentially big noise, instead of only trying to leverage the redundancy to best average away the small noises. (There is no big or small in the CS 70 finite-field perspective, but there is in the reals and for complex numbers.)
5 Outlier Removal via OMP
[This is another rerun problem from 16B that serves as a way to also force the review of a key approach from 16A, orthogonal matching pursuit. The ideas introduced in this problem are very important, as you will see later in this HW (and later HWs). ]
The problem of "outliers" (bad data points) is ubiquitous in the real world of experimentally collected data. This problem is about how we can leverage known techniques (from 16A) to do something about them in a way that doesn't require a human to manually look at points and subjectively decide which ones are good or bad.
Suppose we have a system where we believe that vector-inputs x lead to scalar outputs in a linear way pT x. However, the parameters p are unknown and must be learned from data. Our data collection process is imperfect and instead of directly seeing pT x, we get observations y = pT x + ε where the ε is some kind of disturbance or noise that nature introduces.
To allow us to learn the parameters, we have n experimentally obtained data points: input-output pairs (xi, yi) where the index i = 1, . . . , n. However, because we believe that our observed outputs might be a bit noisy, we only have an approximate system of equations. In particular, if we group the data points into a matrix and vector as follows:
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},$$
where clearly X is a matrix that has d columns and n rows, and y is an n-vector, then we can express the approximate system of equations that we want to solve as Xp ≈ y.
The classic least-squares problem views the goal as minimizing
$$\|y - Xp\|^2 = \sum_{i=1}^n (y_i - x_i^\top p)^2 \tag{1}$$
over the choice of p and thereby making the residual y − Xp have as small a Euclidean norm as possible. This is a reasonable thing to do when we believe that the individual residuals $y_i - x_i^\top p$ are all small on average, meaning that we believe that the true disturbance vector $\varepsilon = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^\top$ is small.
However, nature (or our observation process) isn't always this well behaved. When a small subset of observations don't line up well with the other data points, we call these outliers. For these, we don't always know what is wrong. It could be that there was a lot of disturbance or observation noise. It could also be that there was something wrong with how we conducted the experiment, and the x in reality was very different from what we think it was. Or it could simply be that the physical system was just behaving in a very unusual way that we cannot hope to predict or understand with our model.
An exaggerated example is when we have a set of observations that satisfy perfectly $y_i = \sum_{j=1}^d x_i[j]$ with $|y_i| < 1$ for all $i \neq c$, but there is one crazy $x_c$ such that $y_c = \sum_{j=1}^d x_c[j] + 10^{100}$. Then, as we can see, if we were to do a standard least-squares solution that attempts to minimize (1), this single crazy observation would be enough to shift our estimated $\hat{p}$ from the "true" $p^* = [1, \ldots, 1]^\top$ by a large amount. Why? Because $\hat{p} = (X^\top X)^{-1}X^\top y$ is a linear function of y and hence the crazy huge deviation in the c-th component of y is going to cause a huge multiple of the c-th column of $(X^\top X)^{-1}X^\top$ to be added to the estimate for p. The c-th column of $(X^\top X)^{-1}X^\top$ is just $(X^\top X)^{-1}x_c$ by our definition of the matrix X above. And so, from the perspective of being an outlier that corrupts our estimate, it really doesn't much matter whether the observation fault was in the scalar $y_c$ or in the vector $x_c$ — whichever side of the approximate equation representing the c-th data point is actually messed up, our estimate is going to be pretty badly messed up if we just blindly use least-squares. Consequently, it is evident that the least-squares-estimated $\hat{p}$ is not what we really always want. Is there a way that allows us to reliably remove outliers when we know only a small proportion of the data points are outliers? In this problem, we will demonstrate one of the simplest outlier removal methods that leverages the orthogonal matching pursuit (OMP) approach you learned in 16A.

(a) As we saw in our example, we could improve our solution if we could remove outliers. One way to do this is by augmenting our matrices in the following way. Let $e_i$ be the i-th standard basis vector; that is, $e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^\top$ where the solitary 1 is in the i-th row of the vector $e_i$. Consider the augmented data matrix and parameter vectors:
$$X_{\text{new}} = \begin{bmatrix} X & e_i \end{bmatrix}, \qquad y_{\text{new}} = y, \qquad p_{\text{new}} = \begin{bmatrix} p \\ f_i \end{bmatrix}. \tag{2}$$
Here the variable $f_i$ is effectively a "fakeness" variable that we associate with the i-th data point. Apply the standard understanding of least squares to find the solution to
$$\min_{p_{\text{new}}} \|y_{\text{new}} - X_{\text{new}}p_{\text{new}}\|^2 \tag{3}$$
and write out the formula for the least-squares estimate $\hat{p}_{\text{new}} = \begin{bmatrix} \hat{p} \\ \hat{f}_i \end{bmatrix}$.

Solution: By the standard least-squares formula you learned and proved in 16A, we have
$$\hat{p}_{\text{new}} = (X_{\text{new}}^\top X_{\text{new}})^{-1}X_{\text{new}}^\top y_{\text{new}}.$$

(b) In the previous part, the $(d+1)\times(d+1)$ matrix $X_{\text{new}}^\top X_{\text{new}}$ played a role. Let's look at this matrix in "block" form:
$$X_{\text{new}}^\top X_{\text{new}} = \begin{bmatrix} X & e_i \end{bmatrix}^\top \begin{bmatrix} X & e_i \end{bmatrix} = \begin{bmatrix} X^\top X & X^\top e_i \\ e_i^\top X & e_i^\top e_i \end{bmatrix}.$$
What are $X^\top e_i$, $e_i^\top X$, $e_i^\top e_i$? Simplify these as much as possible in terms of the individual data points $x_i$, etc.

Solution: Recall that $x_i^\top$ is the i-th row in X. Then, the i-th column of $X^\top$ must be $x_i$. Consequently, we have
$$X^\top e_i = x_i.$$
Notice that $e_i^\top X = (X^\top e_i)^\top$ and so
$$e_i^\top X = x_i^\top.$$
Finally, it is clear that
$$e_i^\top e_i = 1.$$

(c) Based on what you know from the previous part, revisit the formula you got in the part before that and prove that the value for $y_i$ in the i-th data point only impacts the estimate $\hat{f}_i$ and does not affect the least-squares estimate for $\hat{p}$ at all.
(HINT: $y_i$ sits in the i-th row of the vector y.
What does the nature of matrix multiplication imply for its influence on the least-squares estimate? Read the full problem text above for a reminder. What does the inverse of a matrix do to a particular column of that matrix?)

Solution: How do we prove something like this? Well, it is going to involve the least-squares estimate, so we might as well just expand it out and see what happens. As we do so, we know that we need to focus in on the impact of $y_i$, so we will have to express the matrix multiplication in a way that lets us do that. Note that
$$\hat{p}_{\text{new}} = \begin{bmatrix} X^\top X & x_i \\ x_i^\top & 1 \end{bmatrix}^{-1}\begin{bmatrix} x_1 & \cdots & x_{i-1} & x_{i+1} & \cdots & x_n \\ 0 & \cdots & 0 & 0 & \cdots & 0 \end{bmatrix}\begin{bmatrix} y_1 \\ \vdots \\ y_{i-1} \\ y_{i+1} \\ \vdots \\ y_n \end{bmatrix} + \begin{bmatrix} X^\top X & x_i \\ x_i^\top & 1 \end{bmatrix}^{-1}\begin{bmatrix} x_i \\ 1 \end{bmatrix} y_i. \tag{4}$$
However, notice that in the expression involving $y_i$, the inverse of a matrix is multiplying the last column of that same matrix. Because the inverse times the matrix is just the identity, this means that the answer here is the last column of the identity by the properties of matrix multiplication:
$$\begin{bmatrix} X^\top X & x_i \\ x_i^\top & 1 \end{bmatrix}^{-1}\begin{bmatrix} x_i \\ 1 \end{bmatrix} y_i = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} y_i.$$
This implies from Equation (4) that $y_i$ does not contribute to $\hat{p}$ — it only influences the fake-or-not variable estimate $\hat{f}_i$. Consequently, the parameter estimate $\hat{p}$ is just the first part of $\hat{p}_{\text{new}}$ and does not depend on $y_i$.

(d) Continuing the previous part, use what you know to prove that $y_i = \hat{p}^\top x_i + \hat{f}_i$. That is, regardless of what $y_i$ is, there is no residual left in the i-th position.
The key was simply being willing to move forward even though you weren’t sure that this would get you all the way to the end. 􏰓p 􏰏  f 􏰓 T2 j􏰁i (y j − x j p) . is just a compact way to write sum up over all indices j from 1 to n but (e) Given that 􏰓pnew =   minimizes (3), show that 􏰓p minimizes 􏰓 i j􏰁i (HINT: There are many ways to proceed. Let p′ be the vector that minimizes r = 􏰏 xTj p)2. (y − j􏰁i j HW1, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 11  ′ p  i possibly go any lower? Recall that a square of a real number is always ≥ 0.) Solution: In proving something, it is good to understand what we want to prove. Let p′ be the vector that minimizes r. We want to show that this is indeed the estimate 􏰓p. The Use this to construct an f ′ so that p′ =   achieves the same r in (3). Could the cost in (3) i new f′ least-squares expression that we have gives us both that and fi and so it is natural to construct 􏰓 what we think that fi should be. Fortunately, the previous part has given us a relationship that we can use to guess. So, let us set f ′ = yi − xT p′ . This is what we think that the least-squares estimate should be. ii Now, we just need to prove it. To do so, we need to give a name to our guessed estimate. Let ′  ′ p  p =  new f′ How can we show that indeed this is the estimate that least-squares will find? We basically have two choices at this point. We could try to plug this into the equation (5) that least-squares satisfies and shows that indeed this satisfies it. But that is going to be hard because we don’t have an explicit formula for p′. Instead, we have to use the fact that the least squares estimate is actually the estimate that minimizes the sum of squared deviations on our data. In other words, we need to go back to what the least squares estimate actually means, and not just to a formula that we might have to compute it. Let’s assume that the least-squares estimate pˆ does not minimize r. Then, this implies that 􏰑 ( y j − x Tj p ′ ) 2 < 􏰑 ( y j − x Tj pˆ ) 2 . j􏰁i j􏰁i How can we get a contradiction? We need to use the fact that the least squares estimate is optimal on its data. To exploit this, we just need to use p′new appropriately since this is some other estimate and thus ∥ynew − Xnewp′new∥2 ≥ ∥ynew − Xnew􏰓pnew∥2. Then, we just need to expand both sides. j􏰁i (7) j􏰁i Meanwhile, the optimal least-squares estimate must be at least as good as this on the entire data set 􏰑(yj − xTj p′)2 = ∥ynew − Xnewp′new∥2 (8) j􏰁i i ∥y −X p′ ∥2=􏰑(y−xTp′)2+(y−f′−xTp′)2 (6) new newnew j j i i i = 􏰑(yj − xTj p′)2. HW1, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 12 􏰓 (f) (g) ≥ ∥ynew − Xnew􏰓pnew∥2 (9) = 􏰑(yj − xTj pˆ)2 + 02 (10) j􏰁i = 􏰑(yj − xTj pˆ)2. (11) j􏰁i This contradicts our assumption that 􏰑 ( y j − x Tj p ′ ) 2 < 􏰑 ( y j − x Tj pˆ ) 2 . j􏰁i j􏰁i A quantity can’t be less than something it is greater than or equal to. Thus, we conclude that 􏰓p indeed must minimizes 􏰏 (y − xT p)2. j􏰁i j j In this proof, we had basically a single decision point where we had to make a choice of how to proceed. From that point forward, there was really only one way to move forward. But once again, we had to overcome the fear of not knowing how everything will work out. Only when we got to the end could we see that it all worked out. 
Argue why the above implies that the resulting solution is the same parameter estimate that we would have gotten had we simply removed the i-th data-point entirely from our data set. Solution: The resulting solution is equivalent to optimizing over r defined in the previous question, which is exactly least-squares over the data set with the i-th data point removed. Since we get a minimizer for the problem with the i-th data point removed, it is as though we solved that problem. From what you have seen above together with what you see from running the attached jupyter notebook, argue in words why augmenting the data matrix X with an identity and then running OMP is a reasonable approach to systematically eliminating outliers. In the next homework, you will see how we can find the right stopping condition for OMP. Solution: By augmenting the data matrix X with an identity and then running OMP, the algo- rithm is removing one point with a large residual (i.e., rj = (yj − x⊤j p)2) per iteration of OMP. It is reasonable to assume that the removed point with large residual is likely to be an out- lier. As described in the jupyter notebook, OMP can be terminated when the algorithm cannot distinguish between the remaining points, which can happen when all the outliers have been removed and all the remaining points are on a line. In the next homework you will look at a more principled way to terminate OMP, including when noise prevents your data from existing on a line. Geometry of Ridge Regression 6 You recently learned ridge regression and how it differs from ordinary least squares. In this ques- tion we will explore useful interpretations and reconceptualizations of ridge regression. HW1, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 13 (a) Recall that ridge regression can be understood as the unconstrained optimization problem min ∥y − Xw∥2 + λ∥w∥2, (12) w where X ∈ Rn×d is a design matrix (data), and y ∈ Rn is the target vector of measurement values. One way to interpret “ridge regression” is as the ordinary least squares for an augmented data set — i.e. adding a bunch of fake data points to our data. Consider the following augmented measurement vector yˆ and data matrix Xˆ :   y 0  λI ˆ√ yˆ=X= , problem itself and expand out the terms. Recallthat∥yˆ−Xˆw∥2=􏰏n+d(yˆi−xˆiw)2wherexˆiarerowsofXˆ:thesquarednormoftheerror 2 i=1 is the sum of squared errors in individual coordinates. Our augmentation adds d more terms to that sum, which exactly give the ridge regularization. To see that we can write nn 􏰑(yˆi − xˆiw)2 = 􏰑(yi − xiw)2 = ∥y − Xw∥2 i=1 i=1 dd√ 􏰑(yˆi−xˆiw)2=􏰑( λwi)2=λ∥w∥2 i=n+1 i=n+1 n+d 􏰑(yˆi − xˆiw)2 = ∥y − Xw∥2 + λ∥w∥2. i=1 Alternatively, we can look at the solution and simplify it. We know that the solution to ordinary least squares for the augmented data is just  ⊤  ⊤   X  d d where 0d is the zero vector in Rd and Id ∈ Rd×d is the identity matrix. Show that the optimiza- tionproblemminw∥yˆ−Xˆw∥2hasthesameminimizeras(12). Solution: There are two easy ways of seeing the answer. The first is to look at the optimization XX X y ˆ⊤ˆ−1ˆ⊤ √  √ −1√      ( X X ) X yˆ = (     )      λId  λId  λId 0d   ⊤ −1⊤  Notice that this is the same as the solution for ridge regression. Either way, we get the desired result. 􏰔 √􏰕X􏰔 √􏰕y √ =(X λI  ) X λI    d  λId d 0d = (X⊤X + λId)−1X⊤y HW1, ©UCB EECS 189, Fall 2020. All Rights Reserved. 
(b) Perhaps more surprisingly, one can achieve the same effect as in the previous part by adding fake features to each data point instead of adding fake data points. (Does this remind you of a previous problem?) Let's construct the augmented design matrix in the following way:
$$\hat{X} = \begin{bmatrix} X & \alpha I_n \end{bmatrix},$$
i.e. we stack X with $\alpha I_n$ horizontally. Here $\alpha$ is a scalar multiplier. Now our problem is underdetermined: the new dimension d + n is larger than the number of points n. Therefore, there are infinitely many values $\eta \in \mathbb{R}^{d+n}$ for which $\hat{X}\eta = y$. Consider the following problem:
$$\min_{\eta} \|\eta\|^2 \quad \text{s.t.} \quad \hat{X}\eta = y. \tag{13}$$
Find the $\alpha$ such that if $\eta^*$ is the minimizer of (13), then the first d coordinates of $\eta^*$ form the minimizer of (12). Can you interpret what the final n coordinates of $\eta^*$ represent?

Solution: Let's look inside the d + n dimensional vector $\eta$ by writing it as $\eta = \begin{bmatrix} w \\ \xi \end{bmatrix}$. Here, w is d-dimensional and $\xi$ is n-dimensional. Then (13) expands to
$$\min_{w,\xi} \|w\|^2 + \|\xi\|^2 \quad \text{s.t.} \quad Xw + \alpha\xi = y.$$
The constraint just says that $\alpha\xi = y - Xw$. In other words, $\alpha\xi$ is the classic residual. This yields $\xi = \frac{1}{\alpha}(y - Xw)$ and, plugging that into the first part, we get
$$\min_{w} \|w\|^2 + \frac{1}{\alpha^2}\|y - Xw\|^2.$$
When considering whether the optimization problem is equivalent, we need to think about the minimizer and not the minimum itself. For this, we simply notice that scaling the objective by a constant factor doesn't change the minimizers and so
$$\arg\min_{w} \|w\|^2 + \frac{1}{\alpha^2}\|y - Xw\|^2 = \arg\min_{w} \alpha^2\|w\|^2 + \|y - Xw\|^2,$$
which is equivalent to (12) for $\alpha = \sqrt{\lambda}$. The last n coordinates of $\eta^*$ therefore represent a scaled version of the classic OLS residual. Notice that by taking the limit $\alpha \to 0$, we would just get OLS.

(c) Suppose the SVD of X is $X = U\Sigma V^\top$. Let's make a change of coordinates in the feature space, such that V becomes the identity: $X_{\text{new}} = XV$ and $w_{\text{new}} = V^\top w$. Denote the solution to ridge regression (12) in these new coordinates as $\hat{w}_{\text{new}}$. Show that the i-th coordinate of $\hat{w}_{\text{new}}$ can be obtained from the corresponding coordinate of $U^\top y$ by multiplication by $\frac{\sigma_i}{\sigma_i^2 + \lambda}$, where $\sigma_i$ is the i-th singular value of X (or zero if i is greater than the rank of X).

Solution: We know that $\hat{w}_{\text{new}}$ is the solution of the minimization procedure
$$\min_{w_{\text{new}}} \|X_{\text{new}}w_{\text{new}} - y\|^2 + \lambda\|w_{\text{new}}\|^2.$$
We would like to have $\Sigma$ instead of $X_{\text{new}}$ to make use of the singular values of X. Note that $X_{\text{new}} = XV = U\Sigma$, so we just need to multiply $X_{\text{new}}$ by $U^\top$ from the left to get $\Sigma$. The only question is what will happen to our optimization procedure. Recall that multiplication by an orthonormal (unitary) matrix does not change the Euclidean norm, thus
$$\|X_{\text{new}}w_{\text{new}} - y\|^2 = \|U^\top X_{\text{new}}w_{\text{new}} - U^\top y\|^2 = \|\Sigma w_{\text{new}} - U^\top y\|^2.$$
Therefore, our optimization may be rewritten as
$$\min_{w_{\text{new}}} \|\Sigma w_{\text{new}} - U^\top y\|^2 + \lambda\|w_{\text{new}}\|^2.$$
Now, we can rewrite that as
$$\min_{w_{\text{new}}} \left[\sum_{i=1}^{\operatorname{rank}(X)}\left(\sigma_i (w_{\text{new}})_i - (U^\top y)_i\right)^2 + \lambda\sum_{i=1}^{d}(w_{\text{new}})_i^2\right].$$
We see that the target function is a sum of functions, each of which only depends on one coordinate of $w_{\text{new}}$. Thus, we can minimize each of those terms separately in its own argument. We see that for $i > \operatorname{rank}(X)$
$$\arg\min_{(w_{\text{new}})_i} (w_{\text{new}})_i^2 = 0,$$
and for $i \leq \operatorname{rank}(X)$
$$\arg\min_{(w_{\text{new}})_i} \left[\left(\sigma_i(w_{\text{new}})_i - (U^\top y)_i\right)^2 + \lambda(w_{\text{new}})_i^2\right] = \frac{\sigma_i}{\sigma_i^2+\lambda}(U^\top y)_i.$$
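Here is a short numerical sketch (not from the assignment; the sizes, seed, and λ are arbitrary) checking this coordinate-wise shrinkage formula against the closed-form ridge solution.

import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 40, 5, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

w_new = Vt @ w_ridge                       # ridge solution in the rotated coordinates
shrunk = (s / (s**2 + lam)) * (U.T @ y)    # coordinate-wise shrinkage formula
print(np.allclose(w_new, shrunk))          # True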
(d) One reason why we might want to have small weights w has to do with the sensitivity of the predictor to its input. Let x be a d-dimensional list of features corresponding to a new test point. Our predictor is w⊤x. What is an upper bound on how much our prediction could change if we added noise ε ∈ Rd to a test point’s features x?
Solution: The change in prediction is given by
$$|w^\top(x + \varepsilon) - w^\top x| = |w^\top\varepsilon| \leq \|w\|_2\|\varepsilon\|_2, \tag{14}$$
where the second step is a result of the Cauchy-Schwarz inequality. To build intuition as to why this is true, you can divide both sides by $\|w\|_2\|\varepsilon\|_2$, which gives you
$$\left|\left\langle \frac{w}{\|w\|_2}, \frac{\varepsilon}{\|\varepsilon\|_2} \right\rangle\right| \leq 1. \tag{15}$$
In words, this inequality says that if you project a unit vector $\varepsilon/\|\varepsilon\|_2$ onto some direction $w/\|w\|_2$, the resulting length must be less than or equal to 1. The prediction will thus vary from the original prediction by at most $\|w\|_2\|\varepsilon\|_2$.
Trying to keep the learned weights w small is therefore a way to keep the learned function from having too strong of a dependence on small changes in the features.
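A tiny sketch (not from the assignment; the dimensions and seed are arbitrary) of this sensitivity bound: the prediction shift |w^T ε| never exceeds ||w||·||ε||.

import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=8)
x = rng.normal(size=8)
eps = 0.01 * rng.normal(size=8)

shift = abs(w @ (x + eps) - w @ x)
print(shift <= np.linalg.norm(w) * np.linalg.norm(eps) + 1e-12)   # True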
(e) We know that the solution to ridge regression (12) is given by wˆ r = (X⊤X + λI)−1X⊤y. What happens when λ → ∞? It is for this reason that sometimes ridge regularization is referred to as “shrinkage.”
Solution: As λ → ∞, the matrix (X⊤X + λI)−1 converges to the zero matrix, and so ŵr → 0.
(f) Another advantage of ridge regression can be seen for under-determined systems. Say we have the data drawn from a d = 5 parameter model, but only have n = 4 training samples of it, i.e. X ∈ R4×5. Now this is clearly an underdetermined system, since n < d. Show that ridge regression with λ > 0 results in a unique solution, whereas ordinary least squares can have an infinite number of solutions.
[Hint: To make this point, it may be helpful to consider w = w0 + w∗ where w0 is in the null space of X and w∗ is a solution.]
[Alternative Hint: You might want to consider (13) as the way to interpret ridge regression.] Solution: First, we will follow the first hint and show that ridge regression always leads to a
unique solution. We know that the minimizer is given by w = (X⊤X + λI)−1X⊤y.
We also know that the eigenvalues of the matrix X⊤X + λI are all at least λ, and so the matrix is invertible, thus leading to a unique solution.
The alternative hint lets us see uniqueness immediately since in (13) we have a full row rank matrix $\hat{X}$: the $\sqrt{\lambda}\, I_n$ block is clearly full rank and $\hat{X}$ is a wide matrix by construction. Consequently, the solution for $\eta$ is given by applying the classic Moore-Penrose pseudo-inverse $\hat{X}^\top(\hat{X}\hat{X}^\top)^{-1}$ to y, which is unique since $\hat{X}\hat{X}^\top$ is invertible.
For ordinary least squares, let us assume that $w'$ minimizes $\|Xw - y\|^2$, i.e., $\|Xw' - y\|^2 \leq \|Xw - y\|^2$ for all $w \in \mathbb{R}^d$.
Since d > n, the matrix X has a non-trivial nullspace. We can take any vector $w_0$ in the nullspace of X and consider the new vector $w''(\alpha) = w' + \alpha w_0$.
Notice that $\|Xw''(\alpha) - y\|^2 = \|Xw' - y\|^2 \leq \|Xw - y\|^2$ for all $w \in \mathbb{R}^d$ because $w_0$ is in the null space of X. Thus, the vector $w''(\alpha)$ is a minimizer for any choice of $\alpha$. We have thus shown that the least squares problem has an infinite number of solutions.
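A minimal numerical sketch (not from the assignment; the n = 4, d = 5 sizes match the problem statement, the seed is arbitrary): many least-squares minimizers fit the data equally well, while ridge with λ > 0 picks a unique one.

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 5))
y = rng.normal(size=4)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]        # one particular minimizer
# Any vector in the null space of X can be added without changing the residual.
w0 = np.linalg.svd(X)[2][-1]                          # a null-space direction of X
w_other = w_star + 3.0 * w0
print(np.allclose(X @ w_star, X @ w_other))           # True: same fit, different w

lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)   # unique solution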
(g) For the previous part, what will the answer be if you take the limit λ → 0 for ridge regres- sion?
[Hint: You might want to consider (13) as the way to interpret ridge regression.]
HW1, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 17

Solution: Plugging the singular value decomposition $X = U\Sigma V^\top$ into the solution of ridge regression, we get
$$w^*(\lambda) = (X^\top X + \lambda I)^{-1}X^\top y = V(\Sigma^\top\Sigma + \lambda I)^{-1}\Sigma^\top U^\top y,$$
and for $\lambda \to 0$ this converges to the unique solution $w^* = V\Sigma^{-1}U^\top y$, also called the minimum norm solution. (Notice here that $\Sigma^{-1}$ has the obvious definition of just involving the relevant square matrix of nonzero singular values.)
For those who remember from EE16B, this is also closely connected to the idea of the Moore-Penrose pseudo-inverse. This problem continues the trend of illustrating the utility of the SVD while reasoning about such problems. It really is a workhorse of machine learning reasoning.
If we wanted to use the augmented-features approach to ridge regularization, the answer is immediate: the constraint in (13) just becomes the constraint Xw = y when λ → 0, so we get the minimum norm solution to this system.
(h) Tikhonov regularization is a general version of ridge regression, where the implicit constraint set takes the form of an ellipsoid instead of a ball. In other words, we solve the optimization problem
$$w = \arg\min_w \frac{1}{2}\|y - Xw\|^2 + \lambda\|\Gamma w\|^2$$
for some full rank matrix $\Gamma \in \mathbb{R}^{d\times d}$. Derive a closed form solution to this problem.

Solution: The objective is
$$f(w) = \frac{1}{2}\|y - Xw\|^2 + \lambda\|\Gamma w\|^2 = w^\top\left(\frac{1}{2}X^\top X + \lambda\Gamma^\top\Gamma\right)w - y^\top Xw + \frac{1}{2}y^\top y.$$
Taking derivatives with respect to w and setting the result to zero, we obtain $0 = w^\top(X^\top X + 2\lambda\Gamma^\top\Gamma) - y^\top X$. Thus, the solution is given by $w = (X^\top X + 2\lambda\Gamma^\top\Gamma)^{-1}X^\top y$, and one can again verify that this is a minimizer by showing that the Hessian $X^\top X + 2\lambda\Gamma^\top\Gamma$ is positive definite: the matrix $\Gamma$ has full rank, implying that $\Gamma^\top\Gamma$ is positive definite, and $X^\top X$ is positive semi-definite by construction, so we are done.
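A quick numerical sketch (not from the assignment; the sizes, seed, λ, and choice of Γ are arbitrary) checking that this closed form zeroes the gradient of the Tikhonov objective.

import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 30, 4, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
Gamma = rng.normal(size=(d, d)) + 2 * np.eye(d)   # a (generically) full-rank Tikhonov matrix

w = np.linalg.solve(X.T @ X + 2 * lam * Gamma.T @ Gamma, X.T @ y)

# The gradient of 0.5*||y - Xw||^2 + lam*||Gamma w||^2 should vanish at w.
grad = X.T @ (X @ w - y) + 2 * lam * Gamma.T @ Gamma @ w
print(np.allclose(grad, 0))   # True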
7 Approximating a 1D function
It is really useful to be able to anchor your intuitions about machine learning in intuitions that you already have from calculus, etc. Universal function approximation can sometimes come across as mysterious, whereas in reality it is something that you are already very familiar with. This is a problem designed to remind you of things that you already essentially know.
(a) For a p-times differentiable function $f: \mathbb{R} \to \mathbb{R}$, the Taylor series expansion of order $m \leq p - 1$ about the point $x_0$ is given by
$$f(x) = \sum_{i=0}^{m} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i + \frac{1}{(m+1)!} f^{(m+1)}(a(x))(x - x_0)^{m+1}. \tag{16}$$
Here, $f^{(m)}$ denotes the m-th derivative of the function f, and $a(x)$ is some value between $x_0$ and x. By definition, $f^{(0)} = f$.
The last term $r_m(x) := \frac{1}{(m+1)!} f^{(m+1)}(a(x))(x - x_0)^{m+1}$ of this expansion is typically referred to as the remainder term when approximating f(x) by an m-th degree polynomial.
We denote by $\varphi_m$ the m-th degree Taylor polynomial (also called the Taylor approximation), which consists of the Taylor series expansion of order m without the remainder term and thus reads
$$f(x) \approx \varphi_m(x) = \sum_{i=0}^{m} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i,$$
where the sign ≈ indicates approximation of the left hand side by the right hand side.
For functions f whose derivatives are bounded on the neighborhood I around $x_0$ of interest, say $|f^{(m+1)}(x)| \leq T$ for $x \in I$, we know that for $x \in I$ the approximation error of the m-th order Taylor approximation $|f(x) - \varphi_m(x)| = |r_m(x)|$ is upper bounded by $|f(x) - \varphi_m(x)| \leq \frac{T|x - x_0|^{m+1}}{(m+1)!}$.
Compute the 1st, 2nd, 3rd, and 4th order Taylor approximation of the following functions around the point x0 = 0.
• $e^x$
• $\sin x$
Solution:
• The n-th (n > 0) derivative of $e^x$ is $e^x$, and $e^0 = 1$. We can write the expansions about $x_0 = 0$ as
$$\varphi_1(x) = 1 + x,$$
$$\varphi_2(x) = 1 + x + \tfrac{1}{2}x^2,$$
$$\varphi_3(x) = 1 + x + \tfrac{1}{2}x^2 + \tfrac{1}{6}x^3,$$
$$\varphi_4(x) = 1 + x + \tfrac{1}{2}x^2 + \tfrac{1}{6}x^3 + \tfrac{1}{24}x^4.$$
• The first four derivatives of $\sin x$ are
$$\text{first: } \cos x, \quad \text{second: } -\sin x, \quad \text{third: } -\cos x, \quad \text{fourth: } \sin x.$$
Putting this together in the expansion about $x_0 = 0$ gives us the 4th-order approximation (and the lower-order approximations can be read off immediately):
$$\varphi_4(x) = 0 + x + \tfrac{1}{2}\cdot 0\cdot x^2 + \tfrac{1}{6}(-1)x^3 + \tfrac{1}{24}\cdot 0\cdot x^4 = x - \tfrac{x^3}{6}.$$
(b) For f (x) = ex, plot the Taylor approximation from order 1 through 4 at x0 = 0 for x in the domain I := [−4, 3] using the provided jupyter notebook.
We denote the maximum approximation error on the domain I by $\|f - \varphi_m\|_\infty := \sup_{x\in I}|f(x) - \varphi_m(x)|$, where $\|\cdot\|_\infty$ is also called the sup-norm with respect to I. Compute $\|f - \varphi_m\|_\infty$ for m = 2. Compute the limit of $\|f - \varphi_m\|_\infty$ as $m \to \infty$.
Hint: Use Stirling's approximation for integers n, which is $n! \sim \sqrt{2\pi n}\,(n/e)^n$.
Now plot the Taylor approximation of f up to order 4 on the interval [−20, 8]. How does
the approximation error behave outside the bounded interval I?
Solution: The following plot shows the original function and its approximations
[Figure: $e^x$ together with its 1st through 4th order Taylor approximations on the interval $[-4, 3]$.]
import numpy as np
import matplotlib.pyplot as plt
import os
import math

plot_col = ['r', 'g', 'b', 'k', 'm']
plot_mark = ['o', '^', 'v', 'D', 'x', '+']

def plotmatnsave(ymat, xvec, ylabels, dirname, filename):
    no_lines = len(ymat)
    fig = plt.figure(0)
    if len(ylabels) > 1:
        for i in range(no_lines):
            xs = np.array(xvec)
            ys = np.array(ymat[i])
            plt.plot(xs, ys, color=plot_col[i % len(plot_col)], lw=1, label=ylabels[i])
        lgd = plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=3,
                         mode="expand", borderaxespad=0.)
    savepath = os.path.join(dirname, filename)
    plt.xlabel('$x$', labelpad=10)
    plt.ylabel('$f(x)$', labelpad=10)
    plt.savefig(savepath, bbox_extra_artists=(lgd,), bbox_inches='tight')
    plt.close()

# Get arrays
x_vec = np.linspace(-4, 3, 1000)
y_mat = np.zeros((5, 1000))

# Actual function and approximations applied on x
y_mat[0, :] = list(map(lambda x: np.exp(x), x_vec))
y_mat[1, :] = list(map(lambda x: 1 + x, x_vec))
y_mat[2, :] = list(map(lambda x: 1 + x + 0.5 * x**2, x_vec))
y_mat[3, :] = list(map(lambda x: 1 + x + 0.5 * x**2 + 1/6 * x**3, x_vec))
y_mat[4, :] = list(map(lambda x: 1 + x + 0.5 * x**2 + 1/6 * x**3 + 1/24 * x**4, x_vec))

# labels
labels = ['$e^x$', '1st order', '2nd order', '3rd order', '4th order']

# Plot and save
filename = 'approx_plot.pdf'
plotmatnsave(y_mat, x_vec, labels, '.', filename)
We know that $r_2(x) = 0$ at $x = 0$ by the definition of $\varphi_2$, therefore $x = 0$ is a minimum of $|r_2|$. Note that
$$|r_2(x)| = \begin{cases} e^x - \left(1 + x + \tfrac{1}{2}x^2\right) & \text{for } x\in[0,3], \\[4pt] 1 + x + \tfrac{1}{2}x^2 - e^x & \text{for } x\in[-4,0]. \end{cases}$$
We look at the intervals $I_1 = [0, 3]$ and $I_2 = [-4, 0]$ separately. (i) For $x \in I_1$, we have that $|r_2(x)|$ is monotonically increasing with x, since its derivative reads $e^x - x - 1$, which is positive for all $x > 0$ (think about why). Therefore, the maximum value is at the boundary $x = 3$ with $r_2(3) = 11.5855$. (ii) For $x \in I_2$, we see that $|r_2(x)| = -r_2(x)$ is monotonically decreasing and thus the maximum is again at the boundary $x = -4$ with $-r_2(-4) = 4.9817$. Comparing the two boundary values reveals that
$$\|f - \varphi_2\|_\infty = \sup_{x\in I = I_1\cup I_2}|r_2(x)| = r_2(3) = 11.5855.$$
As in the description, in order to upper bound $|r_m(x)|$ for all $x \in I$, we need to find an upper bound on the absolute value of the derivative on the interval, that is $\sup_{x\in I}|f^{(m+1)}(x)| = \sup_{x\in I}|e^x| = e^3$. Therefore, $\|f - \varphi_m\|_\infty \leq \frac{e^3\, 4^{m+1}}{(m+1)!}$.
Using Stirling's formula, we obtain that
$$\frac{e^3\, 4^{m+1}}{(m+1)!} \approx \frac{e^3}{\sqrt{2\pi(m+1)}}\left(\frac{4e}{m+1}\right)^{m+1}.$$
Both terms individually converge to zero (for the second term, note that $\frac{4e}{m+1} \to 0$, so there exists $m_0$ such that for all $m \geq m_0$ the ratio $\frac{4e}{m+1} \leq \frac{1}{2}$ and the term is at most $2^{-(m+1)}$), and thus their product also converges to zero.
The following figure demonstrates that, although the 4th-order approximation is reasonably close to the original function within the bounded interval $I = [-4, 3]$, the maximum approximation error quickly becomes very large for $x \notin I$.
[Figure: $e^x$ and its 1st through 4th order Taylor approximations on the interval $[-20, 8]$; outside $I = [-4, 3]$ the approximations diverge rapidly from $e^x$.]
Mathematically speaking, given a fixed interval I, even though the uniform approximation error (taking the supremum inside the interval I) is bounded by some ε, this does not guarantee anything about the error for $x \notin I$. Although for any interval $[-a, a]$ you may find a $D(\varepsilon, a)$ for an arbitrary ε such that $\sup_{x\in[-a,a]}|f(x) - \varphi_{D(\varepsilon,a)}(x)| < \varepsilon$, this degree $D(\varepsilon, a)$ is potentially different for all a. One way to see it is that the upper bound of the approximation error really reads
$$\frac{e^a}{\sqrt{2\pi(m+1)}}\left(\frac{a\cdot e}{m+1}\right)^{m+1}$$
as a function of a, i.e. the larger a is, the larger m we might need. Note here that the actual error $\frac{e^a a^{m+1}}{(m+1)!}$ is not in fact monotonically decreasing with m across all integers, but eventually it does decrease for large m. This is the reason the error at $x = -10$ in the plot above actually increases with the degree.

(c) Let's say we would like an accurate polynomial approximation of the functions in part (a) for all $x \in I$. Given the results of the previous parts, we can in fact find a Taylor polynomial of degree D such that $|f(x) - \varphi_D(x)| \leq \varepsilon$ for all $x \in I$. What is an upper bound on $\|f - \varphi_m\|_\infty$ for arbitrary non-zero integers m? Using this upper bound, show that if D is larger than $O(\log(1/\varepsilon))$, we can guarantee that $\|f - \varphi_D\|_\infty \leq \varepsilon$ for both choices of f(x) in part (a). Note that constant factors are not relevant here, and assume sufficiently small positive $\varepsilon \ll 1$.

Solution: Note: The upper bound question, originally asked for in question (b), is moved here because its relevance really enters here. You may give full credit to question (b) if you managed to find an upper bound which allowed you to show (c).
Using the argument and approximation from (b), it suffices to require
$$u(D) := \frac{e^3}{\sqrt{2\pi(D+1)}}\left(\frac{4e}{D+1}\right)^{D+1} \leq \varepsilon. \tag{17}$$
Ignoring constants for large D, we can directly upper bound u by $u(D) \leq C\left(\frac{4e}{D}\right)^D$, and thus it suffices to require $D\log D \geq \log\frac{1}{\varepsilon}$. Note that with $\varepsilon \ll 1$, we require D to be "large enough". This question does not require you to give precise bounds, but aims at learning to quickly grasp the scaling of model or sample complexity with respect to the error, ignoring constants. Plugging in $D = O(\log\frac{1}{\varepsilon})$ and ignoring constants, we obtain
$$-\log u(D) \approx D\log D = \log\frac{1}{\varepsilon}\cdot\log\log\frac{1}{\varepsilon} \geq \log\frac{1}{\varepsilon},$$
so that inequality (17) holds.

(d) Conclude that a univariate polynomial of high enough degree can approximate any function f on a closed interval I that is continuously differentiable infinitely many times and has bounded derivatives $|f^{(m)}(x)| \leq T$ for all $m \geq 1$ and $x \in I$. Mathematically speaking, we need to show that for any $\varepsilon > 0$, there exists a degree $D \geq 1$ such that $\|f - \varphi_D\|_\infty < \varepsilon$, where the sup-norm is taken with respect to the interval I. This universal approximation property illustrates the power of polynomial features, even when we don't know the underlying function f that is generating our data! (Later, we will see that neural networks are also universal function approximators.)

Solution: Given any function f that is i times continuously differentiable, we can bound the magnitude of the derivatives of the function on a compact domain I as $|f^{(i)}(x)| \leq T$ for some constant $T < \infty$ and all $i > 0$. The maximum approximation error is now bounded by
$$\|f(x) - \varphi_m(x)\|_\infty \leq T\,\frac{\left(\sup_{x\in I}|x - x_0|\right)^{m+1}}{(m+1)!}.$$
Because T and $\sup_{x\in I}|x - x_0|$ are constants independent of m, we again have $\|f(x) - \varphi_m(x)\|_\infty \to 0$ as in question (b). Thus, as m becomes sufficiently large, a polynomial of high enough degree can approximate any sufficiently smooth function.
(Note that here, we only deal with functions whose derivatives do not identically vanish to zero about the point $x_0$. This was an implicit assumption.)
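As a short numerical sketch (not from the assignment; the grid resolution is arbitrary), one can compare the actual sup-norm error of the degree-m Taylor polynomial of $e^x$ on $[-4, 3]$ against the bound $e^3 4^{m+1}/(m+1)!$ and watch both go to zero.

import numpy as np
from math import factorial

x = np.linspace(-4, 3, 2001)
f = np.exp(x)
for m in range(1, 11):
    phi_m = sum(x**i / factorial(i) for i in range(m + 1))   # Taylor polynomial about 0
    err = np.max(np.abs(f - phi_m))
    bound = np.exp(3) * 4**(m + 1) / factorial(m + 1)
    print(m, round(err, 4), round(bound, 4))   # err <= bound, and both shrink as m grows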

(e) Now we will approximate a function using periodic functions instead of polynomials. Let $f: [0, 2\pi] \to \mathbb{R}$ be a piecewise differentiable function. Suppose it were possible to find coefficients $c_k \in \mathbb{C}$ such that
$$f(x) = \sum_{k=-\infty}^{\infty} c_k \exp(jkx).$$
Give a formula to compute the coefficient $c_k$ given access to the function f(x).
[Hint: $\int_0^{2\pi} e^{jnx}\,dx = 0$ for a nonzero integer n. Think about the set of functions $\exp(jkx)$ on the interval $[0, 2\pi]$ as a set of functions that are orthogonal to each other using the natural complex inner product on these functions. What is that natural inner product? Do the natural replacement of a sum with an integral.]

Solution: Let l be an integer. Multiply both sides of the expression by $\exp(-jlx)$ and integrate from 0 to $2\pi$:
$$\int_0^{2\pi} f(x)\exp(-jlx)\,dx = \sum_{k=-\infty}^{\infty} c_k \int_0^{2\pi}\exp\left(j(k-l)x\right)dx.$$
Per the hint, the integral on the right is zero when $k \neq l$. Thus,
$$\int_0^{2\pi} f(x)\exp(-jlx)\,dx = c_l \int_0^{2\pi} dx = 2\pi c_l,$$
$$c_l = \frac{1}{2\pi}\int_0^{2\pi} f(x)\exp(-jlx)\,dx.$$
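Here is a minimal numerical sketch (not from the assignment; the test function and grid size are arbitrary choices) of this formula: approximate $c_l$ by a Riemann sum and check that the truncated series rebuilds a smooth periodic test function.

import numpy as np

x = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
f = np.cos(3 * x) + 0.5 * np.sin(x)            # a smooth 2*pi-periodic test function

def c(l):
    return np.mean(f * np.exp(-1j * l * x))    # Riemann-sum estimate of (1/2pi) * integral

m = 5
f_hat = sum(c(k) * np.exp(1j * k * x) for k in range(-m, m + 1))
print(np.max(np.abs(f - f_hat.real)) < 1e-10)  # True: the truncated series recovers f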
(f) Of course, it is not guaranteed that a given function f can be written in this form. We define the complex Fourier series of f as
$$\hat{f}(x) = \sum_{k=-\infty}^{\infty} c_k \exp(jkx),$$
where $c_k$ is what you just computed. The truncated Fourier series is given by
$$\hat{f}_m(x) = \sum_{k=-m}^{m} c_k \exp(jkx).$$
Give a formula for the Fourier coefficients (i.e., $c_k$) of
$$f(x) = \begin{cases} 1, & 0 \leq x < \pi, \\ -1, & \text{otherwise}, \end{cases}$$
and use the jupyter notebook to plot the Fourier series for this function for m = 1, 3, and 5.

Solution:
$$c_k = \frac{1}{2\pi}\int_0^{2\pi} f(x)\exp(-jkx)\,dx = \frac{1}{2\pi}\left(\int_0^{\pi}\exp(-jkx)\,dx - \int_{\pi}^{2\pi}\exp(-jkx)\,dx\right)$$
$$= \frac{1}{2\pi(-jk)}\left(\exp(-jkx)\Big|_0^{\pi} - \exp(-jkx)\Big|_{\pi}^{2\pi}\right)$$
$$= \frac{j}{2\pi k}\left(2\exp(-jk\pi) - 1 - \exp(-2\pi jk)\right)$$
$$= \frac{j}{\pi k}\left(\exp(-jk\pi) - 1\right) = \begin{cases} -\dfrac{2j}{k\pi}, & k \text{ odd}, \\[4pt] 0, & k \text{ even}. \end{cases}$$

(g) Now you'll show that the Fourier series nicely approximates an appropriately square-integrable function. Parseval's equality states that $\sum_{j=-\infty}^{\infty}|c_j|^2 = \frac{1}{2\pi}\int_0^{2\pi} f^2(x)\,dx$; in other words, $\lim_{m\to\infty}\sum_{j=-m}^{m}|c_j|^2 = \frac{1}{2\pi}\int_0^{2\pi} f^2(x)\,dx$. If you want to understand why this is true, you can take appropriate limits to prove it, but we don't ask you to do that here. Use this result to show that $\lim_{m\to\infty}\int_0^{2\pi}|f(x) - \hat{f}_m(x)|^2\,dx = 0$.

Solution: First we observe that $\hat{f}_m$ is a real function. Observe that $c_{-k} = \frac{1}{2\pi}\int_0^{2\pi} f(x)\exp(jkx)\,dx = c_k^*$. We have $\hat{f}_m(x) = c_0 + \sum_{k=1}^{m}\left(c_k e^{jkx} + c_{-k}e^{-jkx}\right) = c_0 + \sum_{k=1}^{m}\left(c_k e^{jkx} + (c_k e^{jkx})^*\right)$. Since $y + y^*$ is real, we know that $\hat{f}_m$ is the sum of real functions, which makes it real. Now,
$$\int_0^{2\pi}|f(x) - \hat{f}_m(x)|^2\,dx = \int_0^{2\pi}|f(x)|^2 - f^*(x)\hat{f}_m(x) - f(x)\hat{f}_m^*(x) + |\hat{f}_m(x)|^2\,dx.$$
Now we use the fact that both f and $\hat{f}_m$ are real:
$$= \int_0^{2\pi} f^2(x)\,dx - 2\int_0^{2\pi} f(x)\hat{f}_m(x)\,dx + \int_0^{2\pi}|\hat{f}_m(x)|^2\,dx$$
$$= \int_0^{2\pi} f^2(x)\,dx - 2\sum_{i=-m}^{m} c_i\int_0^{2\pi} f(x)\exp(jix)\,dx + \sum_{i,k=-m}^{m} c_i c_k^*\int_0^{2\pi}\exp\left(j(i-k)x\right)dx$$
$$= \int_0^{2\pi} f^2(x)\,dx - 2\sum_{i=-m}^{m} c_i\,(2\pi c_{-i}) + 2\pi\sum_{i=-m}^{m}|c_i|^2$$
$$= \int_0^{2\pi} f^2(x)\,dx - 2\pi\sum_{i=-m}^{m}|c_i|^2.$$
This goes to zero per Parseval's identity.

8 Learning 1-D functions

In this problem we explore the canonical example of learning one dimensional functions from samples. Read the questions in the associated notebook and fill in code in the required places.
Your answer might again vary due to randomness but you should be getting a value close to zero here indicating the goodness of fit. (f) Read the question mentioned in the jupyter notebook and write your solution here (ex- cluding code). Solution: We see that using Fourier features it is easy to learn the function y = cos(2πx) and using polynomial features it is easy to learn the function y = x. Both feature types can succeed in representing both functions provided we include sufficient features. (g) Read the question mentioned in the jupyter notebook and write your solution here. Solution: As d grows to n we see that uniform randomly spaced x leads to poor approxima- tions of the true function (with very large function values predicted by the learned function). However when using evenly spaced samples with Fourier features we are able to learn a good approximation to the true function even as d grows to n. In fact for d = n we are able to interpolate each training sample and achieve a training loss of 0 (almost 0 due to numerical issues). (h) Read the question mentioned in the jupyter notebook and write your solution here. Solution: We see that the eigenvalues using uniform randomly sampled x with polynomial features are the lowest and the minimum eigenvalue decreases in magnitude as d increases. Thus the matrix is very ill-conditioned for d close to n. Something similar happens for uni- form randomly sampled x with Fourier features. Now the eigenvalues are larger in magnitude than the previous case, but still the minimum eigenvalue is very close to 0 leading to poor conditioning. However, while using evenly spaced x with Fourier features all the eigenvalues have the same magnitude. The matrix is well conditioned even for d = n. Another quirk due to numerical issues is the kink in the blue eigenvalue curve close to d = 25. This is because the eigenvalues being computed numerically by numpy are no longer real/pos- itive even though we know they should be because the matrix φ⊤φ is symmetric. (i) Read the question mentioned in the jupyter notebook and write your solution here. Solution: We observe that by choosing the appropriate level of ridge regularation λ we are able to succed in learning a good approximation to the true function even when d is close to n. (j) Read the question mentioned in the jupyter notebook and write your solution here. Solution: Adding a ridge regularizationt term essentially sets a floor to the eigenvalues and improves conditioning of the matrix φ⊤φ + λI. The minimum eigenvalues are no longer very small. Increasing λ raises the floor. HW1, ©UCB EECS 189, Fall 2020. All Rights Reserved. This may not be publicly shared without explicit permission. 27 w, loss = solve_ls(phi_train, y_train) w, loss = solve_ridge(phi_train, y_train, lambda_ridge) 9 Fourier features for regularly spaced samples In this problem we want to show that Fourier features interact with measurements on a grid in a pleasing way. Suppose we measure our functions in points {xj = 2πm/(2n + 1)}nm=−n — a grid on (−π, π). If we introduce 2n + 1 features  √ n √ n       1   2cos(kx)  2sin(kx) √ ,√ and√ ,  2n+1  2n+1   2n+1  k=1 k=1 the resulting (2n + 1) × (2n + 1) matrix will be orthogonal. To prove that directly requires dealing with trigonometry, so we take another approach: look at the complex features Solution: Denote  􏰃 􏰄n   exp jkx  √,  2 n + 1  k = − n where j stands for imaginary unit. 
9 Fourier features for regularly spaced samples

In this problem we want to show that Fourier features interact with measurements on a grid in a pleasing way. Suppose we measure our functions at the points $\{x_m = 2\pi m/(2n+1)\}_{m=-n}^{n}$ — a grid on $(-\pi, \pi)$. If we introduce the $2n+1$ features
$$\frac{1}{\sqrt{2n+1}}, \qquad \left\{\frac{\sqrt{2}\cos(kx)}{\sqrt{2n+1}}\right\}_{k=1}^{n} \qquad\text{and}\qquad \left\{\frac{\sqrt{2}\sin(kx)}{\sqrt{2n+1}}\right\}_{k=1}^{n},$$
the resulting $(2n+1)\times(2n+1)$ matrix will be orthogonal. To prove that directly requires dealing with trigonometry, so we take another approach: look at the complex features
$$\left\{\frac{\exp(jkx)}{\sqrt{2n+1}}\right\}_{k=-n}^{n},$$
where $j$ stands for the imaginary unit.

(a) Show that for the complex features the matrix of data will be unitary.

Solution: Denote the data matrix in complex features as $Z$:
$$Z_{m,k} = \frac{\exp(jkx_m)}{\sqrt{2n+1}} = \frac{1}{\sqrt{2n+1}}\exp\left(\frac{2\pi jkm}{2n+1}\right).$$
Let's show that $Z$ is unitary, i.e. that $Z^*Z = I$. First, let's choose $k_1 \ne k_2$ and compute $(Z^*Z)_{k_1,k_2}$:
$$\begin{aligned}
(Z^*Z)_{k_1,k_2} &= \frac{1}{2n+1}\sum_{m=-n}^{n}\exp\left(-\frac{2\pi jk_1m}{2n+1}\right)\exp\left(\frac{2\pi jk_2m}{2n+1}\right) \\
&= \frac{1}{2n+1}\cdot\frac{\exp\left(\frac{2\pi j(n+1)(k_2-k_1)}{2n+1}\right) - \exp\left(-\frac{2\pi jn(k_2-k_1)}{2n+1}\right)}{\exp\left(\frac{2\pi j(k_2-k_1)}{2n+1}\right) - 1} \\
&= \frac{1}{2n+1}\cdot\frac{\exp\left(2\pi j(k_2-k_1)\right) - 1}{\exp\left(\frac{2\pi j(k_2-k_1)}{2n+1}\right) - 1}\exp\left(-\frac{2\pi jn(k_2-k_1)}{2n+1}\right) \\
&= 0,
\end{aligned}$$
since $\exp\left(2\pi j(k_2-k_1)\right) = 1$ for any integer $k_2 - k_1$. Now, if $k_1 = k_2$ we have
$$(Z^*Z)_{k_1,k_1} = \frac{1}{2n+1}\sum_{m=-n}^{n}\exp\left(-\frac{2\pi jk_1m}{2n+1}\right)\exp\left(\frac{2\pi jk_1m}{2n+1}\right) = \frac{1}{2n+1}\sum_{m=-n}^{n} 1 = 1.$$

(b) Use the expression $e^{j\varphi} = \cos(\varphi) + j\sin(\varphi)$ and the previous part to show that the data matrix in real features is orthogonal.
(HINT: you can express $2\cos(x) = e^{jx} + e^{-jx}$ and $2\sin(x) = -j(e^{jx} - e^{-jx})$, and you already know that columns with complex exponents are orthonormal.)

Solution: Let's denote the column of the data matrix in complex features that corresponds to $\frac{\exp(jkx)}{\sqrt{2n+1}}$ as $e_k$. Analogously, denote the column of the data matrix in real features that corresponds to $\frac{1}{\sqrt{2n+1}}$ as $\mathbf{1}$, the one that corresponds to $\frac{\sqrt{2}\cos(kx)}{\sqrt{2n+1}}$ as $c_k$, and the one that corresponds to $\frac{\sqrt{2}\sin(kx)}{\sqrt{2n+1}}$ as $s_k$. Note that
$$\mathbf{1} = e_0, \qquad c_k = \frac{1}{\sqrt{2}}(e_k + e_{-k}), \qquad s_k = \frac{1}{j\sqrt{2}}(e_k - e_{-k}).$$
We already know that the vectors $\{e_k\}_{k=-n}^{n}$ are orthogonal and have unit norm. We want to show the same for $\mathbf{1}$, $\{c_k\}_{k=1}^{n}$ and $\{s_k\}_{k=1}^{n}$.

We can immediately see that the vectors $\mathbf{1}$, $c_k$ and $s_k$ have unit norms. Indeed,
$$\|\mathbf{1}\|_2^2 = \|e_0\|_2^2 = 1, \qquad \|c_k\|_2^2 = \left\|\tfrac{1}{\sqrt{2}}(e_k + e_{-k})\right\|_2^2 = \left(\tfrac{1}{\sqrt{2}}\right)^2 + \left(\tfrac{1}{\sqrt{2}}\right)^2 = 1, \qquad \|s_k\|_2^2 = \left\|\tfrac{1}{j\sqrt{2}}(e_k - e_{-k})\right\|_2^2 = \left(\tfrac{1}{\sqrt{2}}\right)^2 + \left(\tfrac{1}{\sqrt{2}}\right)^2 = 1,$$
where we used that $\{e_k\}_{k=-n}^{n}$ is an orthonormal basis. Moreover, we can easily see that $\mathbf{1}$ is orthogonal to any $c_k$ or $s_k$, and for any $k_1 \ne k_2$ both $c_{k_1}$ and $s_{k_1}$ are orthogonal to both $c_{k_2}$ and $s_{k_2}$. Indeed, $\mathbf{1}$ is collinear with $e_0$, $c_{k_1}$ and $s_{k_1}$ lie in the span of $e_{k_1}$ and $e_{-k_1}$, and $c_{k_2}$ and $s_{k_2}$ lie in the span of $e_{k_2}$ and $e_{-k_2}$. Since the vectors $\{e_k\}_{k=-n}^{n}$ are orthogonal, the above-mentioned spans are also orthogonal.

Therefore, the only thing that we need to check is that $c_k$ and $s_k$ are orthogonal. Let's compute:
$$s_k^\top c_k = s_k^* c_k = \left(\frac{1}{j\sqrt{2}}(e_k - e_{-k})\right)^*\left(\frac{1}{\sqrt{2}}(e_k + e_{-k})\right) = \frac{1}{-2j}\left(e_k^*e_k + e_k^*e_{-k} - e_{-k}^*e_k - e_{-k}^*e_{-k}\right) = \frac{1}{-2j}(1 + 0 - 0 - 1) = 0.$$
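As a quick numerical sanity check of the claim proved above (a sketch, not part of the assigned solution), one can build the real feature matrix on the grid for a small n and confirm that it is orthogonal:

import numpy as np

n = 6
m = np.arange(-n, n + 1)
x = 2 * np.pi * m / (2 * n + 1)                           # the grid on (-pi, pi)

cols = [np.ones_like(x)]                                  # the constant feature
for k in range(1, n + 1):
    cols.append(np.sqrt(2) * np.cos(k * x))
    cols.append(np.sqrt(2) * np.sin(k * x))
F = np.stack(cols, axis=1) / np.sqrt(2 * n + 1)           # (2n+1) x (2n+1) real feature matrix

print(np.allclose(F.T @ F, np.eye(2 * n + 1)))            # True: the matrix is orthogonal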
10 A Simple Classification Approach

Make sure to submit the code you write in this problem to “HW1 Code” on Gradescope.

Classification is an important problem in applied machine learning and is used in many different applications like image classification, object detection, speech recognition, machine translation, and others. In classification, we assign each datapoint a class from a finite set (for example, the image of a digit could be assigned the value 0, 1, . . . , 9 of that digit). This is different from regression, where each datapoint is assigned a value from a continuous domain like $\mathbb{R}$ (for example, features of a house like location, number of bedrooms, age of the house, etc. could be assigned the price of the house). In this problem we consider the simplified setting of classification where we want to classify data points from $\mathbb{R}^d$ into two classes.

For a linear classifier, the space $\mathbb{R}^d$ is split into two parts by a hyperplane: all points on one side of the hyperplane are classified as one class and all points on the other side of the hyperplane are classified as the other class. The goal of this problem is to show that even a regression technique like linear regression can be used to solve a classification problem. This can be achieved by regressing the data points in the training set against $-1$ or $1$ depending on their class and then using the $0$ level set of the regression function as the classification hyperplane (i.e. we use $0$ as a threshold on the output to decide between the classes). Later in lecture we will learn why linear regression is not the optimal approach for classification, and we will study better approaches like logistic regression, SVMs and neural networks.

(a) The dataset used in this exercise is a subset of the MNIST dataset. The MNIST dataset assigns each image of a handwritten digit its value from 0 to 9 as a class. For this problem we only keep digits that are assigned a 0 or 1, so we simplify the problem to a two-class classification problem. Download and visualize the dataset (example code included). Include three images that are labeled as 0 and three images that are labeled as 1 in your submission.

Solution: We have digits labeled as zero at indices 2, 3, 5 in the dataset. We have digits labeled as one at indices 0, 1, 4 in the dataset. This can be visualized with the following script:

# zeros
visualize_digit(train_features[2,:], train_labels[2])
visualize_digit(train_features[3,:], train_labels[3])
visualize_digit(train_features[5,:], train_labels[5])
# ones
visualize_digit(train_features[0,:], train_labels[0])
visualize_digit(train_features[1,:], train_labels[1])
visualize_digit(train_features[4,:], train_labels[4])
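The helper visualize_digit is supplied with the provided example code. Purely as an illustration of what such a helper might look like, here is a hypothetical sketch; the actual helper and the image dimensions may differ (a 20 × 20 image is assumed here, inferred from the d = 400 mentioned in part (d) below):

import numpy as np
import matplotlib.pyplot as plt

def visualize_digit(features, label):
    # Hypothetical sketch of the provided helper: reshape the flattened image and display it.
    side = int(np.sqrt(features.shape[0]))                # 20 if the image has 400 pixels
    plt.imshow(features.reshape(side, side), cmap="gray")
    plt.title("label: {}".format(label))
    plt.show()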
(b) We now want to use linear regression for the problem, treating class labels as real values $y = -1$ for class “zero” and $y = 1$ for class “one”. In the dataset we provide, the images have already been flattened into one-dimensional vectors (by concatenating all pixel values of the two-dimensional image into a vector) and stacked as rows into a feature matrix $X$. We want to set up the regression problem $\min_w \|Xw - y\|_2^2$, where the entry $y_i$ is the value of the class ($-1$ or $1$) corresponding to the image in row $x_i^\top$ of the feature matrix. Solve this regression problem for the training set and report the value of $\|Xw - y\|_2^2$ as well as the weights $w$. For this problem you may only use pure Python and numpy (no machine learning libraries!).

Solution: With the same loading code as above:

# Linear regression
X = train_features
y = 2*train_labels.astype("int8") - 1
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
residuals = np.dot(X, w) - y
epsilon = np.linalg.norm(residuals)**2
print("regression residual squared error is {}".format(epsilon))
print("weights are {}".format(w))

The outputs are:

regression residual squared error is 422.750833193
weights are [ -3.30879480e-01   3.91613483e-01   1.48013666e-01  -1.60475999e-01
   1.03201739e-01  -1.96623709e-02  -1.27722740e-01   9.46968887e-03
  -1.71516072e-02  -5.67266904e-03  -4.68674535e-03  -1.12593472e-02
  -5.70893241e-03   4.59235208e-03   1.78789273e-02  -3.00950985e-02
   1.00359349e-02  -6.45869076e-02  -2.30248440e-02  -2.63121855e-02
  -1.63464829e-01   4.66997147e-01  -2.83771055e-03  -1.05607457e-01
  -1.80681601e-01   1.34385273e-01  -5.08473255e-03  -3.07385344e-02
  -6.59416839e-02   1.75730512e-02  -3.06562148e-02  -6.23832922e-03
   1.54059939e-02  -3.13759260e-02  -2.54694535e-03  -6.17963215e-03
  -1.02767663e-03   ...
   5.19377999e-02   3.95462736e-02   6.29393160e-02   9.63310339e-03
   2.76091605e-01  -7.88934063e-03  -1.78755876e-02   5.81122423e-03
  -1.39549663e-02  -2.94532962e-02  -7.18745068e-02  -6.40835539e-02
   2.70997570e-03  -2.76509989e-02   3.72376144e-02   1.01846397e-01
   5.16173951e-02  -2.26117996e-03   8.83652642e-02   5.59021682e-02
   5.63378446e-02  -4.26671766e-02   1.50067955e-02  -3.69484797e-02
   7.89522007e-03   7.59321675e-02  -8.43895786e-03   1.20719289e-02
   1.71253383e-01  -3.02490622e-01   2.85772234e-03  -2.39322335e-03
  -3.16695720e-02   3.74489347e-03  -3.21877114e-02   1.20041789e-02
  -4.41990234e-03   1.04306452e-02   3.75087769e-03   1.62195284e-02
  -3.37275560e-03  -4.06924039e-02  -1.61125399e-02  -6.15769811e-03
  -3.09251025e-02  -5.01930043e-02   1.58581156e-02  -1.27891421e-01
   1.68097302e-01  -2.77848482e-01   4.18617725e-02]

(c) Given a new flattened image $x$, one natural rule to classify it is the following one: it is a zero if $x^\top w \le 0$ and a one if $x^\top w > 0$. Report what percentage of the digits in the training set are correctly classified by this rule. Report what percentage of the digits in the test set are correctly classified by this rule.
Solution: Continuing the code from above:
def classify(w, features):
    # Predict class one exactly when x^T w > 0, matching the rule stated in part (c).
    return 1*(np.dot(features, w) > 0.0)
train_error = 1.0 * sum(classify(w, train_features) != train_labels) / n_train
print("classification error on the train dataset is {}".format(train_error))
# Load the test dataset
# It is good practice to do this after the training has been
# completed to make sure that no training happens on the test
# set!
test_error = 1.0 * sum(classify(w, test_features) != test_labels) / n_test
print("classification error on the test dataset is {}".format(test_error))
The output is:

classification error on the train dataset is 0.00240901660501
classification error on the test dataset is 0.00189125295508
So 99.76 percent of digits on the train set are correctly classified and 99.81 percent of digits on the test set are correctly classified.
(d) Why is the performance typically evaluated on a separate test set (instead of the training set) and why is the performance on the training and test set similar in our case? We will cover these questions in more detail later in the class.
Solution: If the model that is fit on the training set is very flexible (for example a linear function with very high dimension or a so-called neural network with many parameters), fitting the model often reduces the training error close to zero. This is called overfitting and can, for example, happen if the dimension of the model is similar to the number of observations. In this case, the classification on the training set is perfect and not indicative of the classification performance on inputs that are not part of the training set. Therefore, evaluating the performance on the (unseen) test set is a better indication of performance on future data.
In the setting of this problem, we are pretty safe against overfitting, since the dimension of the model d = 400 is much smaller than the number of observations n = 11623.
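To illustrate the point with synthetic data (this is not the MNIST subset from this problem; the sample sizes, noise level, and random design are arbitrary choices), the following sketch shows the training error collapsing while the test error stays large as the model dimension d approaches the number of training points:

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise_std = 50, 1000, 0.5

def train_test_mse(d):
    # Random design, true linear model plus noise, ordinary least-squares fit.
    w_true = rng.normal(size=d)
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ w_true + noise_std * rng.normal(size=n_train)
    y_te = X_te @ w_true + noise_std * rng.normal(size=n_test)
    w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
    return np.mean((X_tr @ w - y_tr) ** 2), np.mean((X_te @ w - y_te) ** 2)

for d in (5, 25, 50):                                     # d = 50 equals n_train
    tr, te = train_test_mse(d)
    print("d = {:2d}: train MSE {:.3f}, test MSE {:.3f}".format(d, tr, te))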
(e) Unsatisfied with the current performance of your classifier, you call your mother for advice, and she suggests using random features instead of raw pixel features. Specifically, she suggests the feature map
$$\phi(x) = \cos\left(G^\top x + b\right),$$
where each entry of $G \in \mathbb{R}^{d\times p}$ is drawn i.i.d. as $\mathcal{N}(0, 0.01)$ and each entry of the vector $b \in \mathbb{R}^{p}$ is drawn i.i.d. from the uniform distribution on $[0, 2\pi]$. Note that $\cos$ should be taken pointwise, i.e. $\phi(x)$ should output a vector in $\mathbb{R}^{p}$. With $p = 2500$, report what percentage of digits are correctly classified using this approach on the training set and on the test set. Make sure to apply the same classification rule, i.e. the same threshold on the outputs.
Solution: Please refer to the following script:
# Setting up the random variables
p = 2500
G = np.random.normal(loc=0.0, scale=0.1, size=(p, train_features.shape[-1]))
b = np.random.uniform(0, 2*np.pi, size=(p,1))
# Transform the training feature
train_features_transformed = (G@train_features.transpose() + b).transpose()
train_features_transformed = np.cos(train_features_transformed)
# Training
w = np.linalg.solve(np.dot(train_features_transformed.T, train_features_transformed), np.dot(train_features_transformed.T, y))
train_error = 1.0 * sum(classify(w, train_features_transformed) != train_labels) / n_train
print("classification error on the train dataset is {}".format(train_error))
# Testing
# Note that test features should also be transformed
test_features_transformed = (G@test_features.transpose() + b).transpose()
test_features_transformed = np.cos(test_features_transformed)
n_test = test_labels.shape[0]
test_error = 1.0 * sum(classify(w, test_features_transformed) != test_labels) / n_test
print("classification error on the test dataset is {}".format(test_error))
The output is:

classification error on the train dataset is 0.0
classification error on the test dataset is 0.00047281323877068556
11 System Identification by ordinary least squares regression
Making autonomous vehicles involves machine learning for different purposes, one of which is learning how cars actually behave based on their data.
Make sure to submit the code you write in this problem to “HW1 Code” on Gradescope.
(a) Consider the time sequence of scalars $x_t \in \mathbb{R}$ and $u_t \in \mathbb{R}$ in which $x_{t+1} \approx Ax_t + Bu_t$. In control theory, $x_t$ usually represents the state, and $u_t$ usually represents the control input. Find the numbers $A$ and $B$ so that $\sum_t (x_{t+1} - Ax_t - Bu_t)^2$ is minimized. The values of $x_t$ and $u_t$ are stored in a.mat.
Solution: We have $x_{t+1} = 0.977552135184\,x_t - 0.0877532187735\,u_t$. Putting the following piece of code in the space we left on the corresponding Jupyter notebook cell and running the cell will give you the desired result.
b_solve = x[1:]
A_solve = np.vstack((x[:-1], u[:-1])).T
x_solve = np.linalg.inv(A_solve.T.dot(A_solve)).dot(A_solve.T).dot(b_solve)
A, B = x_solve
print("A: {}, B: {}".format(A, B))
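As a hedged sanity check of this approach (using synthetic data rather than a.mat, with made-up values A_true, B_true, and T), one can simulate a scalar system with known parameters and recover them with np.linalg.lstsq, which solves the same least-squares problem without explicitly forming the normal equations:

import numpy as np

rng = np.random.default_rng(0)
A_true, B_true, T = 0.95, -0.1, 500                       # made-up ground truth
u = rng.normal(size=T)
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = A_true * x[t] + B_true * u[t] + 0.01 * rng.normal()

features = np.vstack((x[:-1], u[:-1])).T                  # same regression setup as above
A_hat, B_hat = np.linalg.lstsq(features, x[1:], rcond=None)[0]
print("A_hat = {:.3f}, B_hat = {:.3f}".format(A_hat, B_hat))   # close to 0.95 and -0.1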
(b) Consider the time sequences of vectors $x_t \in \mathbb{R}^3$ and $u_t \in \mathbb{R}^3$ in which $x_{t+1} \approx Ax_t + Bu_t$. Find the matrices $A \in \mathbb{R}^{3\times 3}$ and $B \in \mathbb{R}^{3\times 3}$ so that the sum of the squared $\ell_2$-norms of the error, $\sum_t \|x_{t+1} - Ax_t - Bu_t\|_2^2$, is minimized. The values of $x_t$ and $u_t$ are stored in b.mat.
Solution:

$$A = \begin{bmatrix} 0.15207406 & 0.93480864 & -0.00110243 \\ 0.03893567 & 0.30958727 & 0.87436511 \\ -0.52552959 & 0.0540906 & -0.47026217 \end{bmatrix} \qquad (18)$$

$$B = \begin{bmatrix} 0.20568264 & -0.92861546 & -0.47124981 \\ 0.04894161 & -0.37090438 & -0.04524735 \\ 0.91096923 & 0.12756569 & -0.84222314 \end{bmatrix} \qquad (19)$$
X_raw = X_raw.reshape(X_raw.shape[:-1])
U_raw = U_raw.reshape(U_raw.shape[:-1])
X_curr = X_raw[:-1]
U_curr = U_raw[:-1]
Y = X_raw[1:]
X = np.hstack((X_curr, U_curr))
soln = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y).T
A, B = soln[:, :3], soln[:, 3:]
print("A: \n{}, \nB: \n{}".format(A, B))
Y_hat = A.dot(X_curr.T) + B.dot(U_curr.T)
print("Error", np.linalg.norm(Y_hat - Y.T) / Y.shape[0])
(c) Consider a car-following model that models how cars line up on a straight 1-D highway at a given time. The acceleration of a car can be approximated by a linear function of the positions and velocities of its own and the car in front of it. Mathematically, we can formulate this as
$$\ddot{x}_i \approx a x_i + b\dot{x}_i + c x_{i-1} + d\dot{x}_{i-1} + e,$$
where $x_i$, $\dot{x}_i$, and $\ddot{x}_i$ are the position, velocity, and acceleration of the $i$th car in the line. Find $a$, $b$, $c$, $d$, and $e$ that minimize
$$\sum_i \left\| -\ddot{x}_i + a x_i + b\dot{x}_i + c x_{i-1} + d\dot{x}_{i-1} + e \right\|_2^2,$$
using the data file train.mat, which contains the status of 40,000 cars at a given point from the I-80 highway in California. The data were sampled from the Next Generation Simulation (NGSIM) dataset, so the indices $i$ may not be consecutive. For your convenience, we give you the profiles of each sampled car and the car that is in front of it.

Solution: The resulting model can be represented as $\ddot{x}_i = -0.012\,x_i - 0.318\,\dot{x}_i + 0.011\,x_{i-1} + 0.275\,\dot{x}_{i-1} - 0.883$.
# Assemble matrix
X = np.vstack([x, xd, xp, xpd, np.ones_like(x)])
# Solve
soln = xdd.dot(X.T).dot(np.linalg.pinv(X.dot(X.T)))
a, b, c, d, e = soln.squeeze()
print("Fitted dynamical system:")
print("xdd_i[t] = {:.3f} x_i[t] + {:.3f} xd_i[t] + {:.3f} x_i-1[t] + {:.3f} xd_i-1[t] + {:.3f}".format(a, b, c, d, e))
(d) Try to justify why your result in (c) is physically reasonable. Hint: You can reorganize your equation to be
$$\ddot{x}_i = h(x_{i-1} - x_i) + f(\dot{x}_{i-1} - \dot{x}_i) - g(\dot{x}_i - L) + w_i,$$
and try to explain the physical meaning of each term, with $L$ being the speed limit.

Solution: The quantity $x_{i-1} - x_i$ represents the spacing between car $i$ and car $i-1$, $\dot{x}_{i-1} - \dot{x}_i$ represents the relative velocity, and $g(\dot{x}_i - L)$ represents the deviation of the car's velocity from the speed limit. Thus, we can say that the car's acceleration is a linear combination of the headway, the speed difference with respect to the preceding vehicle, and the speed difference with respect to the speed limit. Expanding the hinted form and matching coefficients with the model from part (c) gives $a = -h$, $b = -(f+g)$, $c = h$, $d = f$, and $e = gL + w_i$, which is consistent with the fitted values: $c \approx -a$, the leader's speed enters with a positive coefficient, and the car's own speed enters with a negative one.

12 Your Own Question

Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous “Bloom's Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g., by making flashcards). But we don't want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.
Contributors:
• Alexander Tsigler
• Alvin Wan
• Anant Sahai
• Ashwin Pananjady
• Chawin Sitawarin
• Fanny Yang
• Inigo Incer
• Jane Yu
• Josh Sanz
• Kate Sanders
• Khalil Sarwari
• Kuan-Yun Lee
• Peter Wang
• Philipp Moritz
• Raaz Dwivedi
• Ross Bozcar
• Ruta Jawale
• Satish Rao
• Soroush Nasiriany
• Vignesh Subramanian
• Yichao Zhou