EECS 189 Introduction to Machine Learning Fall 2020
HW3
This homework is due Wednesday, September 23 at 11:59 p.m.

2 Probabilistic Model of Linear Regression
Both ordinary least squares and ridge regression have interpretations from a probabilistic stand- point. In particular, assuming a generative model for our data and a particular noise distribution, we will derive least squares and ridge regression as the maximum likelihood (ML) and maximum a-posteriori (MAP) parameter estimates, respectively. This problem will walk you through a few steps to do that. (Along with some side digressions to make sure you get a better intuition for ML and MAP estimation.)
(a) Assume that X and Y are both one-dimensional random variables, i.e. X, Y ∈ R. Assume an affine model between X and Y: Y = Xw1 + w0 + Z, where w1, w0 ∈ R, and Z ∼ N(0, 1) is a standard normal (Gaussian) random variable. Assume w1, w0 are fixed parameters (i.e., they are not random). What is the conditional distribution of Y given X?
Solution: When we condition on the event X = x, the expression Xw1 + w0 becomes xw1 + w0, so
Y|(X = x) ∼ xw1 + w0 + Z. Now xw1 + w0 + Z is a constant plus a standard normal, so
xw1 + w0 + Z ∼ N(xw1 + w0, 1). The conditional density becomes:
p(Y | X = x) = (1/√(2π)) exp(−(1/2)(Y − (xw1 + w0))²).   (1)
(b) Given n points of training data {(X1, Y1), (X2, Y2), . . . , (Xn, Yn)} generated in an iid fashion by the probabilistic setting in the previous part, derive the maximum likelihood estimator for w1, w0 from this training data.
Solution: The log likelihood function is given by
∑_{i=1}^n log p(Yi | X = Xi) = ∑_{i=1}^n log (1/√(2π)) exp{−(1/2)(Yi − (Xi w1 + w0))²}
= −(1/2) ∑_{i=1}^n (Yi − (Xi w1 + w0))² + Constant.   (2)
Differentiating the log likelihood with respect to w1 and w0, we obtain:
∂/∂w1 ∑_{i=1}^n log p(Yi | X = Xi) = −∑_{i=1}^n Xi(Xi w1 + w0 − Yi),
∂/∂w0 ∑_{i=1}^n log p(Yi | X = Xi) = −∑_{i=1}^n (Xi w1 + w0 − Yi).
Setting both of the partial derivatives to zero, we immediately get two equations with two unknowns. The terms that only have Yi and Yi Xi are pulled to the other side of the equality:
w1 ∑_{i=1}^n Xi² + w0 ∑_{i=1}^n Xi = ∑_{i=1}^n Xi Yi,
w1 ∑_{i=1}^n Xi + w0 n = ∑_{i=1}^n Yi.
Then, these can be solved directly:
w1 = (n ∑_{i=1}^n Xi Yi − ∑_{i=1}^n Xi ∑_{i=1}^n Yi) / (n ∑_{i=1}^n Xi² − (∑_{i=1}^n Xi)²),
w0 = (∑_{i=1}^n Yi ∑_{i=1}^n Xi² − ∑_{i=1}^n Xi ∑_{i=1}^n Xi Yi) / (n ∑_{i=1}^n Xi² − (∑_{i=1}^n Xi)²).
To observe the significance of these terms, we now introduce a few notations. Define the sample mean, sample variance (unadjusted) and sample cross-covariance as follows:
X̄ = (1/n) ∑_{i=1}^n Xi,   Ȳ = (1/n) ∑_{i=1}^n Yi,
ŝ²_X = (1/n) ∑_{i=1}^n (Xi − X̄)² = (1/n) ∑_{i=1}^n Xi² − X̄²,
ŝ_XY = (1/n) ∑_{i=1}^n (Xi − X̄)(Yi − Ȳ) = (1/n) ∑_{i=1}^n Xi Yi − X̄ Ȳ.
Then, we have
w1 = ŝ_XY / ŝ²_X   and   w0 = Ȳ − X̄ w1.
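As a quick sanity check (not part of the original solution), these closed-form estimates can be verified on simulated data; the true parameter values and the distribution of X below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 10000
w1_true, w0_true = 2.0, -1.0                          # arbitrary illustrative values
X = rng.uniform(-1, 1, n)
Y = w1_true * X + w0_true + rng.standard_normal(n)    # Z ~ N(0, 1)

X_bar, Y_bar = X.mean(), Y.mean()
s2_X = np.mean((X - X_bar) ** 2)                      # unadjusted sample variance
s_XY = np.mean((X - X_bar) * (Y - Y_bar))             # sample cross-covariance

w1_hat = s_XY / s2_X
w0_hat = Y_bar - X_bar * w1_hat
print(w1_hat, w0_hat)                                 # close to 2 and -1 for large n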
(c) Now, consider a different generative model. Let Y = Xw + Z, where Z ∼ U[−0.5, 0.5] is a continuous random variable uniformly distributed between −0.5 and 0.5. Again assume that w is a fixed parameter. What is the conditional distribution of Y given X?
Solution: As before, w is fixed, so conditioning on X = x we have
p(y | X = x) = p_Z(y − xw) = 1 if −0.5 < y − xw < 0.5, and 0 otherwise,
i.e., Y | (X = x) ∼ U[xw − 0.5, xw + 0.5].
(d) Given n points of training data {(X1, Y1), . . . , (Xn, Yn)} generated in an iid fashion by this model with Xi > 0, derive a maximum likelihood estimator for w.
Solution: Noting that Xi > 0, we find that the likelihood function is given by
∏_{i=1}^n p(Yi | Xi) = ∏_{i=1}^n 1{−0.5 + Xi w < Yi < 0.5 + Xi w} = ∏_{i=1}^n 1{(Yi − 0.5)/Xi < w < (Yi + 0.5)/Xi}.
For this case, maximizing the likelihood is equivalent to ensuring that all the indicators have value 1, which in turn requires that all those regions intersect:
max_i (Yi − 0.5)/Xi < w < min_i (Yi + 0.5)/Xi.
So any value of w satisfying these constraints is a valid MLE.
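A minimal sketch of this computation on simulated data (assuming all Xi > 0 as above; variable names are ours, not the notebook's):

import numpy as np

rng = np.random.default_rng(1)
n, w_true = 50, 3.0
X = rng.uniform(2, 4, n)                        # all X_i positive, as assumed
Y = w_true * X + rng.uniform(-0.5, 0.5, n)      # uniform noise on [-0.5, 0.5]

# every w in [lower, upper] attains the maximum possible likelihood
lower = np.max((Y - 0.5) / X)
upper = np.min((Y + 0.5) / X)
print(lower, upper)                             # a small interval containing w_true = 3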
(e) Take the model Y = Xw + Z, where Z ∼ U[−0.5, 0.5]. In Jupyter Notebook, simulate n training samples {(X1, Y1), (X2, Y2), · · · , (Xn, Yn)} and illustrate what the likelihood of the data looks like as a function of w after n = 5, 25, 125, 625 training samples. Qualitatively describe what is happening as n gets large.
(You may use the starter code provided in Jupyter Notebook. Note that you have considerable design freedom in this problem part. You get to choose how you draw the Xi as well as what true value w you want to illustrate. You have total freedom in using additional python libraries for this problem part. No restrictions.)
Solution: We generated samples assuming Y = 3X + Z. We assumed that the model parameter is between 0 and 4 (you can assume any range that includes the true parameter), and we calculated the likelihood for different values of w in this interval by using the formula obtained above. To do so, for a specific w we compute the likelihood ∏_{i=1}^n p(Yi | Xi; w), which is either zero or 1 because the probability density is uniform with height 1. In the figure below, the likelihood of the data can be seen as a function of w for different sizes of the data. Notice the y-axis
values in the different plots. We can observe from the plots that increasing the number of samples leads the likelihood to concentrate around the true value of w, which is 3 in this case.
The code here uses a simple approach of just giving each training sample a veto on whether a particular w is feasible or not.
Figure 1: Plot of the likelihood of the data as a function of w for different sample sizes (n = 5, 25, 125, 625)
First part of the missing code in the Jupyter Notebook:
n = sample_size[k]
X = 2 + 2 * np.random.random(n)       # generating n values of X from U[2, 4]
Z = np.random.uniform(-0.5, 0.5, n)   # generating i.i.d. uniform noise
Y = 3 * X + Z                         # computing y from x and z assuming w = 3
N = 1001
W = np.linspace(0, 4, N)
Second part of the code:
w = W[i1]
for i in range(n):
    # Yi has a uniform distribution on [w*Xi - 0.5, w*Xi + 0.5]
    if abs(Y[i] - w * X[i]) > 0.5:
        likelihood[i1] = 0
(f) Consider n training data points {(x1, Y1), (x2, Y2), · · · , (xn, Yn)} generated according to Yi = w⊤xi + Zi where Yi ∈ R, w, xi ∈ Rd with w fixed, and Zi ∼ N(0, 1) iid standard normal random variables. Argue why the maximum likelihood estimator for w is the solution to a least squares problem.
Solution: Since the logarithm is a monotonically increasing function, we can just look at the log likelihood function.
∑_i log p(Yi | xi) = −(1/2) ∑_{i=1}^n (Yi − xi⊤w)² + Constant
= −(1/2) ∥Xw − Y∥₂² + Constant,
with X ∈ R^{n×d} whose i-th row is xi⊤ and Y ∈ R^n whose i-th entry is Yi.
At this point, we are done. We have shown that maximizing the maximum-likelihood objective is the same as maximizing a negative squared error objective, which is the same as minimizing the squared error objective, the same as ordinary least squares.
However, if we wanted to, we could just keep solving this.
Differentiating the log likelihood with respect to w, we get
∇_w ∑_i log p(Yi | xi) = −X⊤(Xw − Y).
Setting the gradient to zero, we get
w = (X⊤X)⁻¹X⊤Y,
which is the same solution as the least squares problem.
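To connect this to code (an optional check, not part of the original solution), one can verify on synthetic data that the normal-equation solution matches a generic least-squares solver; the dimensions and seed below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
Y = X @ w_true + rng.standard_normal(n)               # Z_i ~ N(0, 1)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)       # (X^T X)^{-1} X^T Y
w_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]        # minimizes ||Xw - Y||^2
print(np.allclose(w_normal_eq, w_lstsq))              # True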
(g) (Multi-dimensional ridge regression) Consider the setup of the previous part: Yi = w⊤xi + Zi, where Yi ∈ R, w, xi ∈ Rd, and Zi ∼ N(0, 1) iid standard normal random variables. Now we treat w as a random vector and assume a prior knowledge about its distribution. In particular, we use the prior information that the random variables Wj are i.i.d. ∼ N(0, σ²) for j = 1, 2, . . . , d. Derive the posterior distribution of w given all the xi, Yi pairs. What is the mean of the posterior distribution of the random vector w?
Hint: Compute the posterior up to proportionality, i.e. you may discard terms that do not depend on w to simplify the algebra. After a few steps, you should be able to identify the family of the distribution of the posterior. Then, you can determine the mean and the variance by completing the square. We find that it is simpler to work in matrix and vector format: let X be the n × d matrix whose i-th row is xi⊤, and let Y = [Y1, . . . , Yn]⊤.
Solution: Note that we have the following prior distribution on w:
P(w) = ∏_{j=1}^d P(wj)
= ∏_{j=1}^d (1/√(2πσ²)) exp(−wj²/(2σ²))
= (2πσ²)^{−d/2} exp(−∥w∥₂²/(2σ²))
∝ exp(−w⊤w/(2σ²)).
Furthermore, to compute the likelihood of the data, we observe that Y | {X, w} ∼ N(Xw, I_n). Concretely, we have
p(Y | X, w) ∝ exp(−(1/2)∥Y − Xw∥₂²).   (4)
Now by Bayes' Theorem, we have
P(w | x1, Y1, · · · , xn, Yn) ∝ ∏_{i=1}^n P(Yi | xi, w) · ∏_{j=1}^d P(wj)
∝ exp(−(1/2)∥Y − Xw∥₂²) exp(−(1/(2σ²)) w⊤w)
∝ exp(−(1/2)(Y − Xw)⊤(Y − Xw) − (1/(2σ²)) w⊤w)   (5)
∝ exp(−(1/2)Y⊤Y − (1/2)w⊤X⊤Xw + w⊤X⊤Y − (1/(2σ²)) w⊤w)
∝ exp(−(1/2) w⊤(X⊤X + (1/σ²) I_d) w + w⊤X⊤Y).
The last line uses the fact that you can ignore the term −(1/2)Y⊤Y as it does not depend on w.
At this point, you should recognize that this is a Gaussian pdf (from the quadratic term inside the exponential). Now, we have to complete the square to put it in a standard form to obtain the mean and the variance. To make the notation easier, let's define the matrix
M = X⊤X + (1/σ²) I_d.
Then, noticing that M is symmetric, we see that:
P(w | x1, Y1, · · · , xn, Yn) ∝ exp{−(1/2)[w⊤Mw − 2(X⊤Y)⊤w]}   (6)
∝ exp{−(1/2)(w − M⁻¹X⊤Y)⊤ M (w − M⁻¹X⊤Y) + Constant}.
How did we know this was how we should complete the square? The w part on the sides is clear, as is the central M, since we knew we wanted the w⊤Mw term. Now, because we knew that we wanted to get an (X⊤Y)⊤w term as the linear term in the square, we had to cancel the M in the middle with an M⁻¹. Of course, as argued earlier, the constant here absorbs the compensation for the term that we added while completing the square.
Therefore, the posterior of w given x1, Y1, · · · , xn, Yn is a multivariate Gaussian. The mean of this multivariate Gaussian is
(X⊤X + (1/σ²) I_d)⁻¹ X⊤Y.   (7)
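For concreteness (our own illustration, not part of the released solution), the posterior mean can be computed directly; it is exactly the ridge-regression estimate with regularization parameter 1/σ². The variable names below are ours.

import numpy as np

rng = np.random.default_rng(3)
n, d, sigma2 = 100, 4, 2.0                       # sigma2 is the prior variance of each W_j
X = rng.standard_normal((n, d))
w = rng.normal(0, np.sqrt(sigma2), d)
Y = X @ w + rng.standard_normal(n)               # Z_i ~ N(0, 1)

M = X.T @ X + np.eye(d) / sigma2
posterior_mean = np.linalg.solve(M, X.T @ Y)     # (X^T X + I/sigma^2)^{-1} X^T Y
print(posterior_mean)                            # shrunk towards 0 relative to plain OLS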
(h) Consider d = 2 and the setting of the previous part. Use a computer to simulate and illustrate what the a-posteriori probability looks like for the W model parameter space after n = 5, 25, 125 training samples for different values of σ2.
(Again, you may use the starter code. And like problem (e), there are no restrictions for using additional python libraries for this part as well.)
Solution: First, we assumed a variance for the random variable W and generated d numbers from N(0, σ²). Then, we generated the samples using these parameters and computed a value proportional to the posterior probability using Equation 5 from above. We could have normalized to get the exact posterior probability; our results would have been equivalent as far as the plots go.
In the figures below, the a-posteriori probability of the model parameters given the data can be seen in the model space W = (W1, W2). As we can see in the plots, increasing the number of samples results in shrinking the zone of likely parameters.
Figure 2: Heatplots of the a-posteriori probability of the model given the training data, plotted in the model space (W1, W2) for different sample sizes (n = 5, 25, 125) with σ = 0.5, and the true W = (0.4, 0.5)
Figure 3: Heatplots of the a-posteriori probability of the model given the training data, plotted in the model space (W1, W2) for different sample sizes (n = 5, 25, 125) with σ = 1.5, and the true W = (1.2, 0.8)
First part of the missing code in the Jupyter Notebook:
# generating random samples for X = (X1, X2) and Z from normal dist
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)
Z = np.random.normal(0, 1, n)
Y = w1 * X1 + w2 * X2 + Z
Second part of the code:
L = np.exp(-0.5 * (Y - X1 * w1s - X2 * w2s) ** 2)
posterior[i1][i2] = L.prod() * np.exp(-0.5 * (w1s ** 2 + w2s ** 2) / sigma)
3 Estimation and approximation in linear regression
In typical applications, we are dealing with data generated by an unknown function (with some noise), and our goal is to estimate this function. So far we used linear and polynomial regressions. In this problem we will build towards being able to explore the quality of polynomial regressions when the true function is not actually polynomial. Notice that this is reflective of the typical case in machine learning applications — our feature models are rarely exactly correct for the actual underlying patterns that we are trying to learn.
Suppose we are given a full column rank feature matrix X ∈ Rn×d and an observation vector y ∈ Rn. Let the vector y represent the noisy measurement of a true signal y∗:
y = y∗ + z,   (8)
with z ∈ R^n representing the random noise in the observation y, where zj ∼ N(0, σ²) are i.i.d. We define the vectors w∗ and ŵ as follows:
w∗ = argmin_w ∥y∗ − Xw∥₂²   and   ŵ = argmin_w ∥y − Xw∥₂².
Observe that for a given true signal y∗ the vector w∗ is fixed, but the vector wˆ is a random variable
since it is a function of the random noise z that we are seeing in the observations.
Note that the vector Xw∗ is the best linear fit of the true signal y∗ in the column space of X. We will use this as the proxy for the best approximation we can hope for given our features. Similarly, the vector Xwˆ is the best linear fit of the observed noisy signal y in the column space of X.
Instead of using either the “parameter recovery” framework to see how well we are recovering the underlying parameters, or the “prediction error” perspective based on how well that we would do on a test point drawn according to a test distribution, we will take another more training data oriented point of view in this problem. After obtaining wˆ , we would like to bound the error ∥Xwˆ − y∗∥2, which is the denoising error incurred based on the specific n training samples. This treats the entire universe as being these n samples and our goal is to remove the noise from them as best we can using our model. If you want, you can think of this as a problem in which the test distribution is a uniform distribution over the observed training inputs with the test distribution for the outputs being the true noise-free outputs. In a sense, the denoising problem is as easy as it gets in machine learning since we don’t have to worry about any potential difference between the training samples and where we will be tested.
In this problem we will see how to get a good estimate of this denoising error. When using poly- nomial features, we will also learn how to decide the degree of the polynomial when trying to fit a noisy set of observations from a smooth function.
Remark:
You can use the closed form solution for OLS for all parts of this problem. For parts (a)-(c), assume that the feature matrix X and the true signal vector y∗ are fixed (and not random). Furthermore, in all parts the expectation is taken over the randomness in the noise vector z.
(a) Show that E[wˆ ] = w∗ and use this fact to show that
E[∥y∗ − Xŵ∥₂²] = ∥y∗ − E[Xŵ]∥₂² + E[∥Xŵ − E[Xŵ]∥₂²].
Note that the above decomposition of the squared error corresponds to the sum of bias-squared
and the variance of our estimator Xwˆ .
Hint 1: For the first part, you can start by writing out wˆ and w∗ in a closed form.
Hint 2: To show the second part, you might want to use the Pythagorean theorem, and you can use this fact without proof: y∗ − Xw∗ is orthogonal to Xw∗ − Xwˆ . You can also verify this yourself.
Solution: First we will show that E[ŵ] = w∗. We use the following closed form expressions for the vectors ŵ and w∗:
w∗ = (X⊤X)⁻¹X⊤y∗   and   ŵ = (X⊤X)⁻¹X⊤y.
We have
ŵ = (X⊤X)⁻¹X⊤y = (X⊤X)⁻¹X⊤(y∗ + z) = (X⊤X)⁻¹X⊤y∗ + (X⊤X)⁻¹X⊤z = w∗ + (X⊤X)⁻¹X⊤z.
Since w∗ is fixed we obtain that E[w∗] = w∗. Also notice that E[z] = 0. Using these two facts, we conclude
E[wˆ ] = E[w∗] + E[(X⊤X)−1X⊤z] = w∗ + (X⊤X)−1X⊤E[z] = w∗.
Now we will show the second part of the question. The hint is a good place to start with. Since we know that the two terms are orthogonal, we can try the Pythagorean theorem on it and see what we get:
∥y∗ − Xw∗∥₂² + ∥Xw∗ − Xŵ∥₂² = ∥y∗ − Xw∗ + Xw∗ − Xŵ∥₂² = ∥y∗ − Xŵ∥₂².
Taking expectations on both sides in the above equality, we find that
E[∥y∗ − Xŵ∥₂²] = E[∥y∗ − Xw∗∥₂²] + E[∥Xŵ − Xw∗∥₂²]
= ∥y∗ − Xw∗∥₂² + E[∥Xŵ − Xw∗∥₂²].
The last step makes use of the fact that ∥y∗ − Xw∗∥ is a fixed scalar. Using the previous result
E[wˆ ] = w∗, we obtain that E[Xwˆ ] = Xw∗ which yields the desired result.
(b) Recall that if v ∼ N(0, Σ), where Σ ∈ R^{d×d} is the covariance matrix, then for any matrix A ∈ R^{k×d}, we have Av ∼ N(0, AΣA⊤). Use this fact to show that the distribution of the vector ŵ is given by
ŵ ∼ N(w∗, σ²(X⊤X)⁻¹).
Solution: In the previous part, we saw that
ŵ = w∗ + (X⊤X)⁻¹X⊤z = w∗ + Az,   where A := (X⊤X)⁻¹X⊤.   (9)
Since z ∼ N(0, σ2I), we get that Az ∼ N(0, σ2AA⊤). Notice that
AA⊤ = (X⊤X)⁻¹X⊤((X⊤X)⁻¹X⊤)⊤ = (X⊤X)⁻¹X⊤X(X⊤X)⁻¹ = (X⊤X)⁻¹.
And thus we have Az ∼ N(0, σ2(X⊤X)−1). Finally, recall that if v ∼ N(0, Σ) and c is any fixed vector, then v + c ∼ N (c, Σ). Consequently, the result follows using equation (9) and the fact that w∗ is fixed.
(c) Show that
(1/n) E[∥Xŵ − Xw∗∥₂²] = σ² d/n.
(Hint: The trace trick: trace(AB) = trace(BA), might be useful in doing this proof if you choose one natural route. Another useful way to understand this might involve thinking about what coordinate system makes life easier. )
Notice that the above has a very intuitive interpretation: there are n samples, but only d degrees of freedom in what we are fitting. This allows only d/n of the noise energy to come through after we "filter" the noisy y in this way.
Solution: Let's first expand ∥Xŵ − Xw∗∥₂². We have
∥Xŵ − Xw∗∥₂² = ∥X(ŵ − w∗)∥₂² = (X(ŵ − w∗))⊤(Xŵ − Xw∗) = (ŵ − w∗)⊤X⊤X(ŵ − w∗).
The easier approach is to change coordinates. Look at the distribution for (ŵ − w∗) from the previous part — this tells us that there exists a matrix A and a standard Gaussian random variable U ∼ N(0, I_d) so that (ŵ − w∗) = σAU and AA⊤ = (X⊤X)⁻¹. In particular, we can choose the symmetric A = (X⊤X)^{−1/2} for this matrix A. Consequently, (A⁻¹)² = X⊤X as well.
And thus we have
E[∥Xŵ − Xw∗∥₂²] = E[(ŵ − w∗)⊤X⊤X(ŵ − w∗)].
At this point, there are two different ways to proceed. We include both of them so that you can
learn.
E[(ŵ − w∗)⊤X⊤X(ŵ − w∗)] = E[(σAU)⊤(X⊤X)(σAU)]
= σ² E[U⊤A(A⁻¹A⁻¹)AU]
= σ² E[U⊤U]
= dσ².
What we want to show follows by dividing both sides by n.
The longer alternative derivation uses properties of the trace. We include this just so that students who haven’t seen this sort of thing get to see it in action. For any scalar α, we can write α = tr(α). Using the identity tr(AB) = tr(BA), we find
E[tr((ŵ − w∗)⊤X⊤X(ŵ − w∗))] = E[tr(X⊤X(ŵ − w∗)(ŵ − w∗)⊤)].
Note that even for a random matrix A ∈ R^{d×d}, we have
E[tr(A)] = E[∑_{i=1}^d Aii] = ∑_{i=1}^d E[Aii] = tr(E[A]).
This identity is a powerful one and worth remembering. Using this identity, we have
E[tr(X⊤X(ŵ − w∗)(ŵ − w∗)⊤)] = tr(X⊤X E[(ŵ − w∗)(ŵ − w∗)⊤]).
Recall that E[wˆ ] = w∗. Hence, the matrix E (wˆ − w∗)(wˆ − w∗)⊤ is precisely the covariance matrix of the random vector wˆ and is equal to σ2(X⊤X)−1. Using this fact, we obtain that
tr(X⊤X E[(ŵ − w∗)(ŵ − w∗)⊤]) = tr(X⊤X σ²(X⊤X)⁻¹) = σ² tr(I_d) = σ²d.
Dividing by n proves the desired result.
To summarize, we have used the following sequence of equalities:
E[∥Xŵ − Xw∗∥₂²] = E[(ŵ − w∗)⊤X⊤X(ŵ − w∗)]
= E[tr((ŵ − w∗)⊤X⊤X(ŵ − w∗))]
= E[tr(X⊤X(ŵ − w∗)(ŵ − w∗)⊤)]
= tr(E[X⊤X(ŵ − w∗)(ŵ − w∗)⊤])
= tr(X⊤X E[(ŵ − w∗)(ŵ − w∗)⊤])
= tr(X⊤X σ²(X⊤X)⁻¹)
= σ² tr(I_d) = σ²d.
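As an optional numerical sanity check of this identity (not part of the original solution; the dimensions below are arbitrary), one can average ∥Xŵ − Xw∗∥₂² over many noise draws and compare with σ²d:

import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 500, 8, 0.7
X = rng.standard_normal((n, d))
y_star = X @ rng.standard_normal(d)                 # true signal in the column space of X
trials = 2000
errs = np.empty(trials)
for k in range(trials):
    y = y_star + sigma * rng.standard_normal(n)     # noisy observation
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    errs[k] = np.sum((X @ w_hat - y_star) ** 2)
print(errs.mean(), sigma ** 2 * d)                  # the two numbers are close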
(d) Assume the underlying model is a noisy linear model with scalar samples {(αi, yi)}_{i=1}^n, i.e. yi = w1 αi + w0 + zi. We construct the matrix X by using D + 1 polynomial features P_D(αi) = [1, αi, . . . , αi^D]⊤ of the distinct sampling points {αi}_{i=1}^n. For any D ≥ 1, compare with model (8) and compute w∗ for this case. Also compute the bias (∥y∗ − Xw∗∥₂²) for this case. Using the previous parts of this problem, compute the number of samples n required to ensure that the average expected denoising squared error is bounded by ε. Your answer should be expressed as a function of D, σ², and ε.
Conclude that as we increase model complexity, we require a proportionally larger num- ber of samples for an equally accurate denoising.
Solution: Notice that we can re-write the underlying model as follows: y = y∗ + z = X[w0, w1]⊤ + z, where
y = [y1 . . . yn]⊤,   z = [z1 . . . zn]⊤   and   X = [P1(α1) . . . P1(αn)]⊤.
Now observe that the true signal is indeed linear and hence we only need linear features. Consequently, we have that for D = 1: w∗ = [w0, w1]⊤. Extending for D ≥ 1, we conclude that w∗ = [w0, w1, 0⊤]⊤, where 0 ∈ R^{D−1}.
From the above, it is clear that ∥y∗ − Xw∗∥2 = 0, i.e. the bias is 0.
Now, let's compute the number of samples n required to ensure that the denoising error is bounded by ε. For a polynomial regression model, the matrix X has full column rank when the points are distinct (a fact that is clear since Lagrange interpolation works), hence d = D + 1. From part (a) we know that the error decomposes as E[∥Xŵ − y∗∥₂²] = ∥y∗ − Xw∗∥₂² + E[∥Xw∗ − Xŵ∥₂²], and using part (c) and the fact that the bias is 0, we get that:
(1/n) E[∥Xŵ − y∗∥₂²] = σ² (D + 1)/n,
and in order to upper bound the error with ε it suffices to have n > σ²(D + 1)/ε.
Clearly, the number of samples required for error ε increases linearly with the degree D and with the variance σ2.
(e) Simulate the problem from part (d) for yourself. Set w1 = 1, w0 = 1, and sample n points {αi}ni=1 uniformly from the interval [−1, 1]. Generate yi = w1αi + w0 + zi with zi representing standard Gaussian noise.
Fit a D degree polynomial to this data and show how the average error (1/n)∥Xŵ − y∗∥₂² scales as a function of both D and n.
You may show separate plots for the two scalings. It may also be helpful to average over
multiple realizations of the noise (or to plot point clouds) so that you obtain smooth curves.
(For this part, the libraries numpy.random and numpy.polyfit might be useful. You are free to use any and all python libraries. You may use the starter code provided in Jupyter Notebook.)
Solution: In Figure 4 below, we plot the log of the error versus the degree of the polynomial and versus the sample size for better understanding. As we can see, the error increases with D and decreases with n.
First part of the missing code in the Jupyter Notebook:
# generating alpha from uniform dist with sample size n_s + step
alpha = np.random.uniform(-1, 1, n_s + step)
y_s = alpha * w1 + w0
Second part of the code:
w_hat = np.polyfit(alpha[:n], y_s[:n] + z, d)
errors += (np.polyval(w_hat, alpha[:n]) - y_s[:n]) ** 2

Figure 4: Average error vs D and n (log of error plotted against the degree of the polynomial D for n = 120, and against the number of samples n for D = 12)
(f) Now, let us be more realistic. Assume that the true pattern we are trying to learn is the noisy exponential function with scalar samples {αi, yi}ni=1 where yi = eαi + zi with distinct sampling points {αi}ni=1, in the interval [−4, 3] and i.i.d. Gaussian noise zi ∼ N(0, 1). We again construct matrix X by using D + 1 polynomial features [1, αi, . . . , αiD]⊤ and use linear regression to fit the observations. Recall, the definitions of the bias and variance of the OLS estimator from part (a) and notice that in this case, for a fixed n, as the degree D of the polynomial increases: (1) the bias decreases and (2) the variance increases. Use bounds or approximations as needed and argue that to get a good denoising error, a reasonable choice of D is given by O(log n/ log log n).
You can directly use previous parts of this problem, as well as the problem on the earlier homework (Approximating a 1D function). You may assume that n and D are large for ap- proximation purposes. Feel free to use Stirling’s approximation and whatever else you need to make your life easier.
One useful heuristic for balancing a tradeoff where you have the sum of two terms — one increasing and one decreasing — is to pick the point at which the two terms are equal.
Solution: Using part (a), we split the denoising error in two terms: bias and variance. We first compute the bias (approximation error).
We use HW 1 Problem 7 (Approximating a 1D function) now. For the approximation error, we use the D-th order Taylor series approximation of the function f : α ↦ e^α about the point 0, which is given (for some α′ between 0 and α) by
e^α = ∑_{i=0}^{D} α^i/i! + e^{α′} α^{D+1}/(D + 1)!,
where we denote the polynomial part by φ_D(α) := ∑_{i=0}^{D} α^i/i!. We derived the following bound in the last homework:
∥f − φ_D∥_∞ = sup_{α ∈ [−4, 3]} |e^α − φ_D(α)| ≤ e³ 4^{D+1}/(D + 1)!.
Consequently, the squared approximation error for the sample (αi, y∗_i) is bounded by (e³ 4^{D+1}/(D + 1)!)². Summing over the samples and dividing through by n yields that
Bias² = Approximation error = (1/n) ∥y∗ − Xw∗∥₂² ≤ (e³ 4^{D+1}/(D + 1)!)².
Now using part (c) and the facts that the matrix X is of size n × (D + 1) and σ = 1, we obtain that
Variance = Estimation error := (1/n) E[∥Xŵ − Xw∗∥₂²] = (D + 1)/n.
To optimally trade off bias and variance, we set them to be equal. This provides the right scaling (up to constants). Think about why. Hint: the approximation x + y ≈ max{x, y} for x, y > 0 will be useful. (This is the same approximation used to understand Bode plots for first-order low-pass and high-pass filters.) Using Stirling's approximation, ignoring constants and making a few approximations, we find that
(e³ 4^{D+1}/(D + 1)!)² ≈ (D + 1)/n
(4e)^{D+1}/(D + 1)^{D+1} ≈ √(D/n)
(4e)^D/D^D ≈ √(D/n)
√n ≈ √D (D/(4e))^D ≈ (D/(4e))^D
log n ≈ D log(D/(4e)).
Now we verify that the approximation is valid for D = log n/log log n:
D log(D/(4e)) ≈ D log D
= (log n/log log n) · log(log n/log log n)
= (log n/log log n) · (log log n − log log log n)
≈ log n,
and hence we are done.
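To see this balance concretely, here is an optional numerical illustration (our own, using the exact bias bound and variance expression above rather than the Stirling approximation); the chosen values of n are arbitrary:

from math import e, factorial, log

def bias2_bound(D):
    # squared approximation-error bound: (e^3 * 4^(D+1) / (D+1)!)^2
    return (e ** 3 * 4 ** (D + 1) / factorial(D + 1)) ** 2

for n in [10 ** 2, 10 ** 4, 10 ** 6, 10 ** 8]:
    # balance bias^2 against the variance (D+1)/n by direct search
    D_star = min(range(1, 60), key=lambda D: bias2_bound(D) + (D + 1) / n)
    print(n, D_star, log(n) / log(log(n)))   # D_star grows slowly, on the order of log n / log log n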
(g) Simulate the problem in the previous part yourself. Sample n points {αi}ni=1 uniformly from
the interval [−4, 3]. Generate yi = eαi + zi with zi representing standard Gaussian noise. Fit a
D degree polynomial to this data and show how the average error (1/n)∥Xŵ − y∗∥₂² scales as a function of both D and n. You may show separate plots for the two scalings. For scaling with D, choose n = 120. It may also be helpful to average over multiple realizations of the noise (or to plot point clouds) so that you obtain smooth curves. (For this part, the libraries numpy.random and numpy.polyfit might be useful and you are free to use any and all python libraries.)
Solution: In the Figure 5 below, we plotted the log of error vs degree of the polynomial and log of error vs the sample size for better understanding.
First part of the missing code in the Jupyter Notebook:
# generating alpha from uniform dist with sample size n_s + step
alpha = np.random.uniform(-4, 3, n_s + step)
y_s = np.exp(alpha)
Second part of the code:
w_hat = np.polyfit(alpha[:n], y_s[:n] + z, d)
errors += (np.polyval(w_hat, alpha[:n]) - y_s[:n]) ** 2
(h) Comment on the differences in plots obtained in part (e) and part (g).
Solution: In part (e), the true model was a linear model and therefore the polynomial-feature-based OLS estimate was unbiased, i.e., we had no bias or approximation error. The variance or estimation error scales linearly with the degree of the polynomial, and hence we saw in the graph that the denoising error increases monotonically with D.
For part (g), the true signal was an exponential function which cannot be represented exactly by polynomials. As a result, we have an approximation error, which however decreases as the degree of the polynomial increases (as we had explored in the previous homework).
However, as the model complexity (which in our case is equal to the degree of the polynomial) increases, the variance of the estimator also increases. As a result we see a trade-off between the approximation error and the variance that comes from estimation error. For small D, the bias is large and for large D, the variance is too big. Consequently, we see a sweet spot D∗ at which the two terms are of the same order.
More generally, the approximation error measures how well our regression model fits the true data. Since we can better approximate the unknown function with higher order polynomials, this term decreases with D. However, it becomes harder to estimate the correct polynomial, since we now give ourselves extra freedom. This phenomenon is the trade-off between dif- ferent kinds of errors in learning in action! Because we focused on the denoising problem and used OLS, the only effects that are present is contamination due to training noise and the approximation error. There is no irreducible error (by construction), and OLS doesn’t have any systematic error in estimating the best parameter. OLS also doesn’t have any external contamination due to the learning algorithm.
As far as the scaling with respect to the number of samples n goes, we see that bias is unaffected by the number of samples. The variance of the estimator should only decrease with the number of samples and hence we see a decrease in prediction error with increase in number of samples.
Note that in the limit of infinite samples, the estimation error variance vanishes away and the prediction error will be equal to the bias term that comes from the approximation error.
Figure 5: Average error vs D and n (log of error plotted against the degree of the polynomial D for n = 120, and against the number of samples n for D = 12)

4 Orthonormalization of features
Note: This problem has a companion Jupyter notebook. Most of the parts have a corresponding demo/visualization in the Jupyter notebook but for parts (c) and (g) you are required to fill part of the code in the notebook.
Consider the problem setting of learning a 1-dimensional function f (x) from noisy samples (xi, yi)i=1,…,n like in Problem 8 of HW 1. We perform linear regression in feature space with the featurization
φ : R → Rd to learn coefficients w. On a given test point x, we predict
fˆ(x) = φ(x)⊤w.
To evaluate our learned predictor we care about its “test performance”, the average performance on data coming from our test distribution. Next we will formalize this.
Let X, Y denote the random variables corresponding to the test data. Further suppose that Y = f (X), i.e each randomly sampled test point comes along with its true function value (there is no additional randomness in the Y at test time due to noise). Let X have the probability density function p, i.e X ∼ p. Thus the test data is (X, f (X)) and its distribution is determined by p. On the random test point X, we make the prediction Ypred = fˆ(X) = φ(X)⊤w.
Note: The test data distribution can be different than the train data distribution. One such case is if the training points xi were not sampled from the same distribution p as the test point. Even if the training points were sampled from p, the training labels yi may not be the true function value at points xi due to the presence of noise and thus the test and train data distributions typically vary in practice.
For this question we assume that the training points and consequently the learned coefficients w are fixed and not random. We define the prediction error/test error as
Epred = E[(Ypred − Y)²] = E[(φ(X)⊤w − f(X))²] = ∫_x (φ(x)⊤w − f(x))² p(x) dx.
Note: The expectation is over the randomness due to the test data point X. Nothing else is random at test time here. We are using squared loss here because it is convenient and has nice geometry. By changing the loss function we can have other measures of prediction performance.
(a) Prediction error vs parameter estimation error.
Suppose the true function can be exactly represented in the feature space, f(X) = φ(X)⊤w∗ for some w∗ ∈ Rd. In this case we can define another metric of test performance, the parameter estimation error given by
Eest = E[∥w − w∗∥₂²] = ∥w − w∗∥₂²,
since neither w nor w∗ vary based on the test point X. Estimation error measures how well we are estimating the parameters of the function in feature space, while test error measures how close the values predicted by the learned function are to the true function values.
Show that if C = E[φ(X)φ(X)⊤] = Id, then Epred = Eest. In this case, the features are orthonormal with respect to the test distribution.
Solution: We have,
Epred = E[(φ(X)⊤w − φ(X)⊤w∗)²]
= E[(w − w∗)⊤φ(X)φ(X)⊤(w − w∗)]
= (w − w∗)⊤ E[φ(X)φ(X)⊤] (w − w∗)
= ∥w − w∗∥₂²
= Eest.
(b) The entries of the "covariance" matrix are given by
Cij = E[φi(X)φj(X)] = ∫_x φi(x)φj(x) p(x) dx,
where φ(x) = [φ1(x), φ2(x), . . . , φd(x)]⊤. Consider a 3-dimensional polynomial featurization φ(x) = [1, x, x²]⊤ and let p = Uniform[−1, 1]. Calculate C. Are the features orthonormal?
Solution: Since C is symmetric, it suffices to compute C11, C22, C33, C12, C13, C23. We have,
C11 = (1/2) ∫_{−1}^{1} 1 · 1 dx = 1,
C22 = (1/2) ∫_{−1}^{1} x² dx = 1/3,
C33 = (1/2) ∫_{−1}^{1} x⁴ dx = 1/5,
C12 = (1/2) ∫_{−1}^{1} x dx = 0,
C13 = (1/2) ∫_{−1}^{1} x² dx = 1/3,
C23 = (1/2) ∫_{−1}^{1} x³ dx = 0.
Since C ≠ I3, the features are not orthonormal.
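As an optional check (not required by the problem), the same matrix can be approximated numerically, for instance by Monte Carlo over p = Uniform[−1, 1]:

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 1_000_000)
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)    # phi(x) = [1, x, x^2]
C = Phi.T @ Phi / len(x)                                # approximates E[phi(X) phi(X)^T]
print(np.round(C, 3))   # approximately [[1, 0, 1/3], [0, 1/3, 0], [1/3, 0, 1/5]]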
(c) Orthonormalization of features: Polynomial example Consider the same polynomial featur- ization from the previous part φ(x) = [1, x, x2]⊤. Using the Gram-Schmidt procedure, or- thonormalize these features with respect to the test distribution p = Uniform[−1, 1]. Then complete the corresponding part in the Jupyter notebook.
(HINT: Denote the relevant inner product between features as ⟨φ1, φ2⟩ = (1/2) ∫_{−1}^{1} φ1(x)φ2(x) dx.)
Solution: We start with φ̃1(x) = 1. Check that ⟨φ̃1, φ̃1⟩ = 1. Next,
φ̃2 = φ2 − ⟨φ2, φ̃1⟩φ̃1
= x − (1/2) ∫_{−1}^{1} x · 1 dx
= x.
Since ⟨φ̃2, φ̃2⟩ = 1/3, we rescale to obtain φ̃2(x) = √3 x.
Similarly,
φ̃3 = φ3 − ⟨φ3, φ̃1⟩φ̃1 − ⟨φ3, φ̃2⟩φ̃2
= x² − (1/2) ∫_{−1}^{1} x² · 1 dx − ((1/2) ∫_{−1}^{1} x² · √3 x dx) √3 x
= x² − 1/3 − 0
= x² − 1/3.
Since
⟨φ̃3, φ̃3⟩ = (1/2) ∫_{−1}^{1} (x⁴ + 1/9 − (2/3)x²) dx = 1/5 + 1/9 − 2/9 = 4/45,
we normalize to get φ̃3(x) = √(45/4) (x² − 1/3) = (3√5/2)(x² − 1/3).
(d) Orthonormalization of features using the covariance. Now suppose X has density p and we have features φ(x) such that C = E[φ(X)φ(X)⊤] = K for a positive-definite matrix K. Show that φ̃(x) = K^{−1/2} φ(x) is a set of features orthonormal with respect to the test distribution p. If we follow this technique for the previous part do we end up with the same set of orthonormal features? Explain. It may be useful to look ahead in the Jupyter notebook, to see this technique applied numerically to the previous example.
Here, you can think about K^{1/2} as the natural symmetric matrix so that K^{1/2} K^{1/2} = K — basically the matrix you get by applying the spectral theorem for real symmetric matrices to K and then just taking the square roots of all the eigenvalues but keeping the same orthogonal eigenvectors.
Solution: Since K^{−1/2} is symmetric, we have
E[φ̃(X)φ̃(X)⊤] = E[K^{−1/2} φ(X)φ(X)⊤ K^{−1/2}]
= K^{−1/2} E[φ(X)φ(X)⊤] K^{−1/2}
= K^{−1/2} K^{1/2} K^{1/2} K^{−1/2}
= I.
If we use this technique for the set of polynomial features in the previous part then we end up with a different set of orthonormal features, since in this case the transform is via a symmetric matrix while in the previous part it was via an upper triangular matrix. Since there are infinitely many combinations of orthonormal features in a space (one example could be rotating the features), this does not contradict the solution from the previous part.
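Here is a minimal sketch of this construction applied to the polynomial features from parts (b) and (c), using an eigendecomposition for the symmetric square root; the variable names are ours and the notebook's implementation may differ:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 500_000)
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)    # phi(x) = [1, x, x^2]
K = Phi.T @ Phi / len(x)                                # empirical feature covariance

# symmetric square root via the spectral theorem
eigvals, V = np.linalg.eigh(K)
K_inv_sqrt = V @ np.diag(eigvals ** -0.5) @ V.T

Phi_tilde = Phi @ K_inv_sqrt                            # rows are K^{-1/2} phi(x)
print(np.round(Phi_tilde.T @ Phi_tilde / len(x), 3))    # approximately the identity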
(e) Effect of test distribution. Consider the 2-dimensional polynomial featurization φ(x) = [1, x]⊤ with test distribution
p(x) = 1/3 for −1 ≤ x ≤ 0,   2/3 for 0 ≤ x ≤ 1,   0 otherwise.
Are the features orthogonal with respect to p, i.e. is ∫_x x p(x) dx = 0? It will be useful to view the numerical calculation of the covariance of these features in the Jupyter notebook.
Solution: We have,
∫_x x p(x) dx = (1/3) ∫_{−1}^{0} x dx + (2/3) ∫_{0}^{1} x dx = −1/6 + 1/3 ≠ 0.
Thus the features 1 and x are not orthogonal with respect to this test distribution. Recall that under the uniform test distribution p = Uniform[−1, 1] these features were orthogonal. Thus the test distribution plays an important role in determining whether the features are orthogonal.
Next we will see why having an orthonormal set of features with respect to the test distribution is useful.
(f) Transforming bases. Consider an orthonormal featurization φ(x) with respect to test distribution p and let the true function be f(x) = φ(x)⊤w with ∥w∥₂ = α ≠ 0. Show we can find a new featurization φ̃(x) whose features are also orthonormal with respect to p such that f(x) = α φ̃(x)⊤e1, where e1 is the unit vector with one in the first entry and zero everywhere else.
Solution: First find an orthogonal matrix Q such that Qαe1 = w. To do this choose the first column of Q as w/α and complete the rest of the columns using Gram-Schmidt orthonormalization. (If you remember, this is the same kind of alignment operation you saw in 16B or wherever you recursively proved that square matrices could be upper-triangularized by means of a unitary change of basis — some of you might have seen the proof just for symmetric matrices.) Then choose φ̃(x) = Q⊤φ(x). Check that
E[φ̃(X)φ̃(X)⊤] = E[Q⊤φ(X)φ(X)⊤Q] = Q⊤E[φ(X)φ(X)⊤]Q = I,
and
α φ̃(x)⊤e1 = α φ(x)⊤Qe1 = φ(x)⊤w,
since by construction αQe1 = w.
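One concrete way to build such a Q (a sketch with our own variable names, not the notebook's code) is to start a matrix with w as its first column, orthonormalize with a QR factorization, and fix the sign of the first column:

import numpy as np

rng = np.random.default_rng(7)
d = 5
w = rng.standard_normal(d)
alpha = np.linalg.norm(w)

# columns: w followed by arbitrary vectors; QR orthonormalizes them in order
M = np.column_stack([w, rng.standard_normal((d, d - 1))])
Q, _ = np.linalg.qr(M)
Q[:, 0] *= np.sign(Q[:, 0] @ w)   # ensure the first column is +w/alpha, not -w/alpha
print(np.allclose(Q[:, 0], w / alpha), np.allclose(Q.T @ Q, np.eye(d)))   # True True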
(g) Decomposition of prediction error. Assume that we have a featurization φ(x) and test distribution p such that E[φ(X)] = 0, i.e. the features have zero mean. (If we wanted, we could also let one of the features be the constant 1 to pull the constant part out and everything would still be fine.) Let the features be orthonormal with respect to the test distribution. Let the true function be f(x) = φ(x)⊤w∗ for w∗ = w∗1 e1 = α e1 (using the notation of the previous part). We learn the predictor fˆ(X) = φ(X)⊤w where w = [w1, w2, . . . , wd]⊤. Thus,
fˆ(X) = w1 φ1(X) + ∑_{j=2}^d wj φj(X).
Let S and C be random variables defined as
S = (w1 − w∗1) φ1(X),
C = ∑_{j=2}^d wj φj(X).
Show that S and C are uncorrelated, i.e. E[SC] = E[S]E[C]. Further show that the prediction error can be decomposed as
Epred = E[(f(X) − fˆ(X))²] = (w∗1)²(1 − SU)² + CN²,
where
SU = w1/w∗1   and   CN² = ∑_{j=2}^d wj².
Implement this decomposition in the Jupyter notebook and verify that it matches the numerical observations.
Solution: We have,
E[SC] = E[(w1 − w∗1)φ1(X) ∑_{j=2}^d wj φj(X)]
= ∑_{j=2}^d (w1 − w∗1) wj E[φ1(X)φj(X)]
= 0,
since for j ≠ 1, E[φ1(X)φj(X)] = 0 because the features are orthonormal. Also E[S] = E[C] = 0 since E[φ(X)] = 0. Thus S and C are uncorrelated.
The prediction error can be expressed as,
Epred = E[(f(X) − fˆ(X))²]
= E[(∑_{j=1}^d w∗j φj(X) − ∑_{j=1}^d wj φj(X))²]
= E[(S + C)²]
= E[S²] + 0 + E[C²]
= (w∗1 − w1)² E[φ1(X)²] + ∑_{j=2}^d (w∗j − wj)² E[φj(X)²].
In the last step we made use of the fact that the features are orthonormal, so E[φj(X)φk(X)] = 0 for j ≠ k. Thus, since w∗j = 0 for j ≥ 2 and E[φj(X)²] = 1,
Epred = (w∗1 − w1)² + ∑_{j=2}^d wj²
= (w∗1)²(1 − SU)² + CN².
Since the features are orthonormal we are able to calculate the prediction error by just looking at the errors in the estimation of coefficients (i.e. the estimation error).
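As an optional numerical check of this decomposition (our own illustration, using the orthonormal features from part (c) and arbitrary coefficient values):

import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, 1_000_000)                       # test distribution p = Uniform[-1, 1]
# zero-mean, orthonormal features from part (c): sqrt(3) x and (3 sqrt(5)/2)(x^2 - 1/3)
Phi = np.stack([np.sqrt(3) * x, 1.5 * np.sqrt(5) * (x ** 2 - 1 / 3)], axis=1)

w_star = np.array([2.0, 0.0])                           # true function uses only the first feature
w_hat = np.array([1.7, 0.3])                            # some learned coefficients (arbitrary)
pred_err = np.mean((Phi @ w_star - Phi @ w_hat) ** 2)   # Monte Carlo estimate of E[(f - fhat)^2]

SU = w_hat[0] / w_star[0]
CN2 = np.sum(w_hat[1:] ** 2)
print(pred_err, w_star[0] ** 2 * (1 - SU) ** 2 + CN2)   # both are approximately 0.18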
(h) Approximation error. Finally consider the case where the true function cannot be expressed by the featurization that we train with. Concretely suppose that the true function is given by f(X) = ∑_{j=1}^D φj(X) w∗j and we learn a predictor fˆ(X) = ∑_{j=1}^d φj(X) wj where d < D. Assume that the D-dimensional featurization is orthonormal with respect to the test distribution p and let E[φ(X)] = 0.
The residual between the true function and the predicted function can be decomposed as
R = f(X) − fˆ(X) = ∑_{j=1}^d (w∗j − wj) φj(X) + ∑_{j=d+1}^D w∗j φj(X).
Using a similar approach to the previous part show that if
R = R1 + R2,
R1 = ∑_{j=1}^d (w∗j − wj) φj(X),
R2 = ∑_{j=d+1}^D w∗j φj(X),
then R1 and R2 are uncorrelated and Var[R2] = ∑_{j=d+1}^D (w∗j)².
Conclude that the prediction error in this case can be decomposed as
Epred = E[(f(X) − fˆ(X))²] = Var[R2] + ∑_{j=1}^d (w∗j − wj)²,
where Var[R2] is the component corresponding to the approximation error.
We have implemented this decomposition for you in the Jupyter notebook. Interact with the demo and see how the two components of the error vary with d and the noise level in the system.
Solution: Following a similar approach to the previous part we have,
E[R1 R2] = E[(∑_{j=1}^d (w∗j − wj) φj(X)) (∑_{j=d+1}^D w∗j φj(X))] = 0,
since E[φj(X)φk(X)] = 0 for j ≠ k. Also E[R1] = E[R2] = 0 since E[φ(X)] = 0. Thus R1 and R2 are uncorrelated, and
Var[R2] = E[R2²] = ∑_{j=d+1}^D (w∗j)².
The prediction error is given by,
E[(f(X) − fˆ(X))²] = E[(∑_{j=1}^d (w∗j − wj) φj(X) + ∑_{j=d+1}^D w∗j φj(X))²]
= ∑_{j=1}^d (w∗j − wj)² + Var[R2],
again using orthogonality of the features.

5 Isotropic gaussians
In this problem we are going to observe some peculiar behaviour of high-dimensional vectors. We are going to talk about the vectors from gaussian distribution, whose covariance is equal to identity matrix (they are also called isotropic). To obtain quantitative results, we will need some probabilistic bounds. First of all, we will need the following inequality for a standard normal random variable: for ξ ∼ N(0, 1) and any t > 0 it holds
P(ξ > t) ≤ exp(−t²/2).   (10)
Secondly, we will require analogous results for a χ2(k) distribution (recall that this is the distribu- tion of the sum of squares of k i.i.d. standard normals; in other words this is the distribution of the squared norm of an isotropic gaussian vector in Rk). The result that we will use is: if ξ ∼ χ2(k), then for any t > 0
P(ξ − k > 2√(kt) + 2t) ≤ exp(−t),
P(ξ − k < −2√(kt)) ≤ exp(−t).
We will use these so often, that it will be convenient to use the following notation: denote
l_{χ²}(k, t) := k − 2√(kt),   u_{χ²}(k, t) := k + 2√(kt) + 2t.
(a) Consider two independent random vectors u, v ∼ N(0, I_d) in R^d. Show that for any t > 0,
P(∥u∥₂² ≥ l_{χ²}(d, t)) ≥ 1 − exp(−t),
P(|u⊤v|/∥v∥₂ ≤ √(2t)) ≥ 1 − 2 exp(−t).
HINT: what is the distribution of the random variable u⊤v/∥v∥₂?
Solution: The first inequality is exactly the lower-tail concentration bound for χ²(d) random variables, as ∥u∥₂² has exactly that distribution.
The second inequality follows directly from the tails of the normal distribution since u⊤v/∥v∥₂ ∼ N(0, 1). Why? The vector v/∥v∥₂ has unit norm, and u ∼ N(0, I_d) is independent of it. Isotropic Gaussian random variables like u have the property that their projection onto any unit-norm vector is a standard normal N(0, 1) random variable. At that point, we just notice that (√(2t))² = 2t, and dividing by 2 while applying (10) to each tail gives us what we want.
(b) Use the result of the previous part to lower bound the angle between vectors v and u with probability at least 1 − 3 exp(−t) for any t ∈ (0, d/4). What result do you get for t = 4 and d = 400?
Solution: To bound the angle, we look at |u⊤v|/(∥u∥₂∥v∥₂), since this is the absolute value of the cosine of the angle between u and v. We want this to be a small number. We decide to ensure that the ∥u∥₂ in the denominator is big enough and that the normalized inner product u⊤v/∥v∥₂ is small enough.
Combining all the results from the previous part, we can use the union bound on the "bad events" that the inner product is too large or the norm of the vector u is too small. This gives us that with probability at least 1 − 3 exp(−t),
|u⊤v|/(∥u∥₂∥v∥₂) ≤ √(2t)/√(d − 2√(dt)) =: s(t, d).
Therefore, the angle between u and v lies in [arccos s(t, d), arccos(−s(t, d))] with probability at least 1 − 3 exp(−t). When we plug in d = 400 and t = 4 it becomes: with probability at least 94.5%, the angle between u and v is between 80.9° and 99.1°. That is a window of approximately ±10° around a right angle.
(c) Let’s check how sharp our theoretical guarantees are. Do the task from part A of the Jupyter notebook.
Solution: To compute the angle, you can do the following:
angle = 180 * np.arccos(np.sum(x * y) / np.linalg.norm(x) / np.linalg.norm(y)) / np.pi
The number of angles that are not between 80.9 and 99.1 can be computed by
print(np.sum((np.array(angles) < 80.9) | (np.array(angles) > 99.1)))
Usually there are from 0 to 3 elements that fall out of the region, so the true probability is more than 94.5%. However, 80.9 and 99.1 are good approximations of the edges of the histogram.
(d) We’ve just seen that two random gaussian vectors in high dimensions are almost orthogonal to each other with high probability. What if we sample more than 2? Intuitively, if the dimension is high enough, and if we don’t sample too many vectors, then the matrix comprised of those vectors should be close to orthogonal. For example, it seems reasonable to assume that the singular values of that matrix should be all close to each other. In this and the following parts we will give quantitative bounds that show exactly that.
Suppose X ∈ R^{n×d} is a matrix with i.i.d. standard normal entries. In the next part we will use the Gershgorin circle theorem (you can find its statement on this wikipedia page) to bound the singular values of X for n ≪ d. To set up for that, show that for any t > 0, each Gershgorin disc of the matrix XX⊤ has its center in the interval [l_{χ²}(d, t), u_{χ²}(d, t)] with probability at least 1 − 2 exp(−t), and its radius is not larger than √(n u_{χ²}(d, t) u_{χ²}(n, t)) with probability at least 1 − 2 exp(−t).
HINT: when you compute the radius of the i-th Gershgorin disc of the matrix XX⊤, pull the norm of the i-th row of the matrix X out of the brackets. Then use Cauchy-Schwarz to obtain a χ²(n − 1) random variable.
Solution: Let’s look at the matrix XX⊤. From the concentration inequality for χ2 distribution, we know that each diagonal element of this matrix lies in lχ2 (d, t), uχ2 (d, t) with probability at least 1 − 2 exp(−t), as those diagonal elements have χ2(d) distributions. Why? They are the sum of d squares of independent standard Gaussian random variables.
Now, let’s look at the sums of absolute values of non-diagonal elements in some row. Let’s denote the i-th row of X as Xi. We want to bound
| X i X ⊤j | . ji
First, the random variables ξi,j := XiX⊤j /∥Xi∥2 are just standard normal (why? They are the projection onto a unit vector of an Gaussian random variable with distribution N (0, Id )), inde- pendent of each other. Each ξi,j is also independent of ∥Xi∥2 since that length does not effect ξi, j at all. Next, we can use the Cauchy-Schwartz inequality to write
ji
The random variable ξ2 ji i,j
ji
|XX⊤|=∥X∥ |ξ |≤∥X∥ ij i2 i,j i2
(n−1)ξ2 . i,j
ji
has a χ2(n − 1) distribution, and the random variable ∥X ∥2 — i 2
χ2(d). Thus, we can use concentration inequality for chi-squared random variables to obtain
2
P∥X∥ ξ ≥ u 2(d,t)u 2(n,t) i2χχ
i,j
ji
22
≤P ∥X∥ ≥u2(d,t) +P ξ ≥u2(n,t) iχjχ
2 j i ≤2 exp(−t).
We see that for any t > 0, the radius of the i-th Gershgorin disc is not larger than √(n u_{χ²}(d, t) u_{χ²}(n, t)) with probability at least 1 − 2 exp(−t).
(e) Use the previous part to conclude that with probability at least 1 − 4n exp(−t) all the eigenvalues of XX⊤ lie between l_{χ²}(d, t) − √(n u_{χ²}(d, t) u_{χ²}(n, t)) and u_{χ²}(d, t) + √(n u_{χ²}(d, t) u_{χ²}(n, t)).
What does this bound become for d = 1000, n = 10 and t = 4 + log n?
HINT: you need the bound from the previous part to hold for every Gershgorin disc at the same
time, so you should adjust the probability correspondingly.
Solution: We see from the previous part that for each i, with probability at least 1 − 4 exp(−t), the center of the i-th Gershgorin disc is in the interval [l_{χ²}(d, t), u_{χ²}(d, t)] and its radius is not larger than √(n u_{χ²}(d, t) u_{χ²}(n, t)).
Therefore, by the union bound, the same holds for all discs simultaneously with probability at least 1 − 4n exp(−t). On this event it follows directly from the Gershgorin circle theorem that all the eigenvalues of XX⊤ lie between l_{χ²}(d, t) − √(n u_{χ²}(d, t) u_{χ²}(n, t)) and u_{χ²}(d, t) + √(n u_{χ²}(d, t) u_{χ²}(n, t)).
For d = 1000, n = 10 and t = 4 + log n the statement becomes: with probability at least ≈ 93%, all the eigenvalues of XX⊤ lie between 170 and 1843.
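As an optional check of these numbers (separate from the notebook task), the bound and the actual eigenvalues can be compared directly:

import numpy as np

def l_chi2(k, t):
    return k - 2 * np.sqrt(k * t)

def u_chi2(k, t):
    return k + 2 * np.sqrt(k * t) + 2 * t

d, n = 1000, 10
t = 4 + np.log(n)
radius = np.sqrt(n * u_chi2(d, t) * u_chi2(n, t))
print(l_chi2(d, t) - radius, u_chi2(d, t) + radius)   # roughly 170 and 1843

rng = np.random.default_rng(8)
X = rng.standard_normal((n, d))
evals = np.linalg.eigvalsh(X @ X.T)
print(evals.min(), evals.max())   # the actual extreme eigenvalues, well inside the bound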
(f) Once again, let’s check our bounds numerically. Do the task from part B of the Jupyter notebook.
Solution: To compute the minimum and maximum singular values of X you can do the following:
sigmas = np.linalg.svd(X)[1]
sigma_max = sigmas[0]
sigma_min = sigmas[-1]
The squared singular values of the simulated matrix are always from 750 to 1300. The bounds from the previous part are very conservative.
(g) The bounds in the previous parts showed that for n ≪ d the n × d matrix with i.i.d. standard normal entries is well-conditioned. In this part we will show that the opposite happens if n = d. Let’s denote the rows of X as X1, . . . , Xn. Those are n vectors in n-dimensional space. Consider also a vector u of unit length, which is orthogonal to X1, . . . , Xn−1.
Show that σmax(X) ≥ ∥Xn∥2 and σmin(X) ≤ ∥Xu∥2. Which upper bound on σmin(X) and lower bound on σmax(X) does that imply?
HINT: What is the distribution of the vector Xu?
Solution: The first n − 1 coordinates of Xu are zeros. The last coordinate is Xnu. Since Xn is
from N (0, In ) independent of u, and ∥u∥2 = 1, the random variable Xn u is standard normal. When it comes to the minimum singular value, we can write
σmin(X) = min_{∥v∥₂=1} ∥Xv∥₂ ≤ ∥Xu∥₂,   where ∥Xu∥₂² ∼ χ²(1).
So it implies an upper bound of order O(1) on the minimum singular value. We can do the same for the maximum singular value:
σmax(X) = max_{∥v∥₂=1} ∥v⊤X∥₂ ≥ ∥e_n⊤X∥₂ = ∥Xn∥₂,
where e_n is the vector whose last coordinate is 1 and the others are 0. Since ∥Xn∥₂² ∼ χ²(n), this implies a lower bound of order Ω(√n) on the maximum singular value.
(h) In the previous part we showed that the square of the minimum singular value σ2min of X can be upper bounded by a χ2(1) random variable. However, one may notice that separating the last row to construct u in the previous part was somewhat arbitrary — any other row is just as good. So in fact, we can construct n different χ2(1) random variables (one for each row), each of which gives an upper bound on σ2min, so σmin should perhaps be even smaller than the bound that we derived before. However, those χ2(1) random variables are not independent, so it is rather tricky to develop a rigorous treatment.
In this part we will answer the question ”what would our bound be if those χ2(1) random variables were actually independent”. Then in the last part we will check numerically if our result is correct.
Suppose ξ1, . . . , ξn are i.i.d. non-negative random variables with continuous density f. Suppose also that ∞ > f(0) > 0. Show that n min_{i≤n} ξi converges in distribution to some non-zero random variable. What intuition does this computation provide about σmin? What do you think: is this intuition correct, or is our assumption of independence far from being true?
HINT: a random variable with χ2(1) distribution has infinite density at zero, but its square root has finite density.
Solution: Let’s look at the distribution function of n mini≤n ξi: P(n min ξi ≥ x) =P(min ξi ≥ x/n)
i≤n i≤n
= P(ξ1 ≥ x/n)n
x/n n
=1− f(t)dt 0
x n = 1−n(f(0)+on(1))
→exp−f(0)x as n → ∞.
Thus, the min converges in distribution to an exponential random variable. This suggests that σmin(X) is of order at most O(1/n), as it’s upper bounded by min of square roots of n χ2(1) random variables, and min of i.i.d. random variables with finite non-zero density at zero has order 1/n.
We will see in the next part that this bound is wrong. The true order of magnitude of σmin(X) is 1/ √n.
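A small simulation of this limit (illustrative only, not the notebook task): take the ξi to be square roots of χ²(1) variables, i.e. |N(0, 1)|, which have finite density f(0) = √(2/π) at zero.

import numpy as np

rng = np.random.default_rng(9)
n, trials = 1000, 5000
xi = np.abs(rng.standard_normal((trials, n)))   # xi_i = |N(0,1)|, density sqrt(2/pi) at 0
scaled_min = n * xi.min(axis=1)
# the limit is Exponential with rate f(0) = sqrt(2/pi), i.e. mean sqrt(pi/2)
print(scaled_min.mean(), np.sqrt(np.pi / 2))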
(i) Finally, let’s check our intuition by numerical simulation. Do the task from part C of the Jupyter notebook.
Solution: To compute the minimum singular value one can do this:
sigma_min = np.linalg.svd(X)[1][-1]
When we fit a line it has slope −1/2. It means that the true scale of σmin(X) is 1/√n. Our intuition from the previous part was wrong and gave a bound which was too low — 1/n.
6 Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous “Bloom’s Taxonomy” that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top-level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself. (e.g. make yourself flashcards) But we don’t want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.
Contributors:
• Alexander Tsigler
• Anant Sahai
• Chawin Sitawarin • Katia Patkin
• Raaz Dwivedi
• Rahul Arya
• Vignesh Subramanian