EECS 189 Introduction to Machine Learning Fall 2020 HW5
This homework is due Wednesday, October 7 at 11:59 p.m.

2 Kernel PCA
You have seen how to use PCA to do dimensionality reduction by projecting the data to a subspace that captures most of the variability visible in the observed features. The underlying hope is that these directions of variation are also relevant for predicting the quantities of interest.
Standard PCA works well for data that is roughly Gaussian shaped, but many real-world high dimensional datasets have underlying low-dimensional structure that is not well captured by linear subspaces. However, when we lift the raw data into a higher-dimensional feature space by means of a nonlinear transformation, the underlying low-dimensional structure once again can manifest as an approximate subspace. Linear dimensionality reduction can then proceed. As we have seen in class so far, kernels are an alternate way to deal with these kinds of nonlinear patterns without having to explicitly deal with the augmented feature space. This problem asks you to discover how to apply the “kernel trick” to PCA.
Let X ∈ R^{n×l} be the data matrix, where n is the number of samples and l is the dimension of the raw data. Namely, the data matrix contains the data points x_j ∈ R^l as rows:

X = \begin{bmatrix} x_1^⊤ \\ x_2^⊤ \\ \vdots \\ x_n^⊤ \end{bmatrix} ∈ R^{n×l}.    (1)
(a) Compute XX⊤ in terms of the singular value decomposition X = UΣV⊤ where U ∈ R^{n×n}, Σ ∈ R^{n×l} and V ∈ R^{l×l}. Notice that XX⊤ is the matrix of pairwise Euclidean inner products for the data points. How would you get U if you only had access to XX⊤?

Solution: We have X⊤X = VΣ⊤U⊤UΣV⊤ = VΣ⊤ΣV⊤ and XX⊤ = UΣV⊤VΣ⊤U⊤ = UΣΣ⊤U⊤. Notice that the columns of U are the eigenvectors of XX⊤ (with eigenvalues given by the squared singular values), so U can be obtained from an eigendecomposition of XX⊤ alone.
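As a quick numerical sanity check (not part of the original solution), the following numpy sketch confirms that an eigendecomposition of XX⊤ recovers the columns of U up to sign; the data here is random and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, l = 6, 4
X = rng.standard_normal((n, l))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U @ np.diag(s) @ Vt
evals, evecs = np.linalg.eigh(X @ X.T)             # eigendecomposition of X X^T
order = np.argsort(evals)[::-1]                    # eigh returns eigenvalues in ascending order

# The nonzero eigenvalues of X X^T equal the squared singular values.
assert np.allclose(np.sort(evals)[::-1][:l], s**2)
# Each top eigenvector matches the corresponding column of U up to sign.
for j in range(l):
    u_hat = evecs[:, order[j]]
    assert min(np.linalg.norm(u_hat - U[:, j]), np.linalg.norm(u_hat + U[:, j])) < 1e-8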
(b) Given a new test point x_test ∈ R^l, one central use of PCA is to compute the projection of x_test onto the subspace spanned by the k top singular vectors v_1, v_2, . . . , v_k.

Express the scalar projection z_j = v_j^⊤ x_test onto the j-th principal component as a function of the inner products

X x_test = \begin{bmatrix} ⟨x_1, x_test⟩ \\ \vdots \\ ⟨x_n, x_test⟩ \end{bmatrix}.    (2)
Assume that all diagonal entries of Σ are nonzero and non-increasing, that is σ_1 ≥ σ_2 ≥ · · · > 0.

Hint: Express V^⊤ in terms of the singular values Σ, the left singular vectors U and the data matrix X. If you want to use the compact form of the SVD, feel free to do so.

Solution: Observe that V_cmp^⊤ = Σ_cmp^{−1} U_cmp^⊤ X. We have

z_j = v_j^⊤ x_test = σ_j^{−1} u_j^⊤ X x_test.

(c) How would you define kernelized PCA for a general kernel function k(x_i, x_j) (to replace the Euclidean inner product ⟨x_i, x_j⟩)? For example, the RBF kernel k(x_i, x_j) = exp(−‖x_i − x_j‖²/δ²).

Describe this in terms of a procedure which takes as inputs the training data points x_1, x_2, . . . , x_n ∈ R^l and the new test point x_test ∈ R^l, and outputs the analog of the previous part's z_j coordinate in the kernelized PCA setting. You should include how to compute U from the data, as well as how to compute the analog of X x_test from the previous part.

Invoking the SVD or computing eigenvalues/eigenvectors is fine in your procedure, as long as it is clear what matrix is having its SVD or eigenvalues/eigenvectors computed. The kernel k(·, ·) can be used as a black-box function in your procedure as long as it is clear what arguments it is being given.

Solution:

(a) Obtain the vectors u_j as eigenvectors from the eigendecomposition of K ∈ R^{n×n} with K_{ij} = k(x_i, x_j); the corresponding eigenvalues play the role of σ_j².

(b) Kernelize the inner products z_j = \frac{1}{σ_j} u_j^⊤ X x_test via

z_j = \frac{1}{σ_j} u_j^⊤ \begin{bmatrix} k(x_1, x_test) \\ k(x_2, x_test) \\ \vdots \\ k(x_n, x_test) \end{bmatrix}.    (3)
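The two-step procedure above translates directly into a few lines of numpy. The sketch below is purely illustrative and not part of the official solution; the kernel function k, the training matrix X_train, the test point x_test, and the index j are placeholder names.

import numpy as np

def kernel_pca_projection(X_train, x_test, k, j):
    """Return the kernelized PCA coordinate z_j of x_test (j is 0-indexed)."""
    n = X_train.shape[0]
    # Kernel (Gram) matrix K with K[i, l] = k(x_i, x_l); plays the role of X X^T.
    K = np.array([[k(X_train[i], X_train[l]) for l in range(n)] for i in range(n)])
    # Eigendecomposition of K: eigenvectors are the u_j, eigenvalues are sigma_j^2.
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1]          # sort eigenvalues in descending order
    u_j = evecs[:, order[j]]
    sigma_j = np.sqrt(evals[order[j]])
    # Kernelized analog of X x_test: the vector of k(x_i, x_test).
    k_test = np.array([k(X_train[i], x_test) for i in range(n)])
    return (1.0 / sigma_j) * u_j @ k_test

# Example usage with an RBF kernel (delta is a placeholder bandwidth).
rbf = lambda a, b, delta=1.0: np.exp(-np.sum((a - b) ** 2) / delta ** 2)
X_train = np.random.randn(20, 3)
z0 = kernel_pca_projection(X_train, np.random.randn(3), rbf, j=0)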
3 Total Least Squares
Make sure to submit the code you write in this problem to “HW5 Code” on Gradescope.
In most of the models we have looked at so far, we’ve accounted for noise in the observed y measurement and adjusted accordingly. However, in the real world it could easily be that our feature matrix X of data is also corrupted or noisy. Total least squares is a way to account for this. Whereas previously we were minimizing the y distance from the data point to our predicted line because we had assumed the features were definitively accurate, now we are minimizing the entire distance from the data point to our predicted line. In this problem we will explore the mathematical
intuition for the TLS formula. We will then apply the formula to adjusting the lighting of an image which contains noise in its feature matrix due to inaccurate assumptions we make about the image, such as the image being a perfect sphere.
Let X and y be the true measurements. Recall that in the least squares problem, we want to solve for w in minw ||Xw − y||. We measure the error as the difference between Xw and y, which can be viewed as adding an error term εy such that the equation Xw = y + εy has a solution:
\min_{ε_y, w} ‖ε_y‖_2, subject to Xw = y + ε_y    (4)
Although this optimization formulation allows for errors in the measurements of y, it does not allow for errors in the feature matrix X that is measured from the data. In this problem, we will explore a method called total least squares that allows for both error in the matrix X and the vector y, represented by ε_X and ε_y, respectively. For convenience, we absorb the negative sign into ε_y and ε_X and define the true measurements y_true and X_true like so:

y_true = y + ε_y    (5)
X_true = X + ε_X    (6)
Specifically, the total least squares problem is to find the solution for w in the following minimiza- tion problem:
\min_{ε_y, ε_X, w} ‖[ε_X, ε_y]‖²_F, subject to (X + ε_X) w = y + ε_y    (7)

where the matrix [ε_X, ε_y] is the concatenation of the columns of ε_X with the column vector ε_y. Notice that the minimization is over w because it's a free parameter, and it does not necessarily have to be in the objective function. Intuitively, this equation is finding the smallest perturbation to the matrix of data points X and the outputs y such that the linear model can be solved exactly. The constraint in the minimization problem can be rewritten as

[X + ε_X, y + ε_y] \begin{bmatrix} w \\ −1 \end{bmatrix} = 0    (8)
(a) Let the matrix X ∈ R^{n×d} and y ∈ R^n and note that ε_X ∈ R^{n×d} and ε_y ∈ R^n. Assuming that n > d and rank(X + ε_X) = d, explain why rank([X + ε_X, y + ε_y]) = d.

Solution:

Given that the constraint in the minimization problem from Equation (7) is satisfied, the last column of the matrix [X + ε_X, y + ε_y], which is y + ε_y, can be expressed as a linear combination of the remaining columns in the matrix. Therefore, the rank of the matrix is the rank of the remaining columns, which we previously assumed was d.
(b) For the solution w to be unique, the matrix [X + ε_X, y + ε_y] must have exactly d linearly independent columns. Since this matrix has d + 1 columns in total, it must be rank-deficient by 1. Recall that the Eckart-Young-Mirsky Theorem tells us that the closest lower-rank matrix in the Frobenius norm is obtained by discarding the smallest singular values. Therefore, the matrix [X + ε_X, y + ε_y] that minimizes

‖[ε_X, ε_y]‖²_F = ‖[X_true, y_true] − [X, y]‖²_F

is given by

[X + ε_X, y + ε_y] = U \begin{bmatrix} Σ_d & 0 \\ 0 & 0 \end{bmatrix} V^⊤,

where [X, y] = UΣV^⊤.

Suppose we express the SVD of [X, y] in terms of submatrices and vectors:

[X, y] = \begin{bmatrix} U_{xx} & u_{xy} \\ u_{yx}^⊤ & u_{yy} \end{bmatrix} \begin{bmatrix} Σ_d & 0 \\ 0 & σ_{d+1} \end{bmatrix} \begin{bmatrix} V_{xx} & v_{xy} \\ v_{yx}^⊤ & v_{yy} \end{bmatrix}^⊤,

where u_{xy} ∈ R^{n−1} is the first (n − 1) elements of the (d + 1)-th column of U, u_{yx}^⊤ ∈ R^d is the first d elements of the n-th row of U, u_{yy} is the n-th element of the (d + 1)-th column of U, and U_{xx} ∈ R^{(n−1)×d} is the (n − 1) × d top left submatrix of U.

Similarly, v_{xy} ∈ R^d is the first d elements of the (d + 1)-th column of V, v_{yx}^⊤ ∈ R^d is the first d elements of the (d + 1)-th row of V, v_{yy} is the (d + 1)-th element of the (d + 1)-th column of V, and V_{xx} ∈ R^{d×d} is the d × d top left submatrix of V. σ_{d+1} is the (d + 1)-th singular value of [X, y], and Σ_d is the diagonal matrix of the d largest singular values of [X, y].

Using this information show that

[ε_X, ε_y] = −\begin{bmatrix} u_{xy} \\ u_{yy} \end{bmatrix} σ_{d+1} \begin{bmatrix} v_{xy}^⊤ & v_{yy} \end{bmatrix}.

Solution:

We have

[X + ε_X, y + ε_y] = \begin{bmatrix} U_{xx} & u_{xy} \\ u_{yx}^⊤ & u_{yy} \end{bmatrix} \begin{bmatrix} Σ_d & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_{xx} & v_{xy} \\ v_{yx}^⊤ & v_{yy} \end{bmatrix}^⊤

and

[X, y] = \begin{bmatrix} U_{xx} & u_{xy} \\ u_{yx}^⊤ & u_{yy} \end{bmatrix} \begin{bmatrix} Σ_d & 0 \\ 0 & σ_{d+1} \end{bmatrix} \begin{bmatrix} V_{xx} & v_{xy} \\ v_{yx}^⊤ & v_{yy} \end{bmatrix}^⊤.

We can subtract the first equation from the second to solve for [ε_X, ε_y]:

−[ε_X, ε_y] = [X, y] − [X + ε_X, y + ε_y] = \begin{bmatrix} U_{xx} & u_{xy} \\ u_{yx}^⊤ & u_{yy} \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & σ_{d+1} \end{bmatrix} \begin{bmatrix} V_{xx} & v_{xy} \\ v_{yx}^⊤ & v_{yy} \end{bmatrix}^⊤ = \begin{bmatrix} u_{xy} \\ u_{yy} \end{bmatrix} σ_{d+1} \begin{bmatrix} v_{xy}^⊤ & v_{yy} \end{bmatrix},

and thus

[ε_X, ε_y] = −\begin{bmatrix} u_{xy} \\ u_{yy} \end{bmatrix} σ_{d+1} \begin{bmatrix} v_{xy}^⊤ & v_{yy} \end{bmatrix}.

(c) Using the result from the previous part and the fact that v_{yy} is not 0, find a nonzero solution to [X + ε_X, y + ε_y] \begin{bmatrix} w \\ −1 \end{bmatrix} = 0 and thus solve for w in Equation (8).

HINT: Looking at the last column of the product [X, y]V may or may not be useful for this problem, depending on how you solve it.

Solution:

Following the hint, we look at the last column of the product [X, y]V, which is given by

[X, y] \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} = σ_{d+1} \begin{bmatrix} u_{xy} \\ u_{yy} \end{bmatrix},

and we can substitute this into the equation above to get

[ε_X, ε_y] = −[X, y] \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} \begin{bmatrix} v_{xy}^⊤ & v_{yy} \end{bmatrix},

and thus

[X + ε_X, y + ε_y] = [X, y] + [ε_X, ε_y] = [X, y] − [X, y] \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} \begin{bmatrix} v_{xy}^⊤ & v_{yy} \end{bmatrix}.

Multiplying on the right by \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} and using the fact that this vector is a column of the orthogonal matrix V (so it has unit norm) gives

[X + ε_X, y + ε_y] \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} = [X, y] \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} − [X, y] \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} \left( \begin{bmatrix} v_{xy}^⊤ & v_{yy} \end{bmatrix} \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} \right) = 0.

This makes the nonzero vector \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix} be in the nullspace, and hence almost gives us what we want, \begin{bmatrix} w \\ −1 \end{bmatrix}, except for scaling and having v_{yy} instead of a −1. We know that v_{yy} is not 0 almost surely for continuous noise, so we can just rescale to find w = −v_{xy} v_{yy}^{−1} from Equation (8).
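To make this closed form concrete, here is a small numpy sketch (not part of the assigned solution) that computes the TLS weights as w = −v_xy v_yy^{−1} from the SVD of [X, y] on synthetic data; the variable names mirror the block notation above.

import numpy as np

# Toy data following the TLS model: both the features and the targets are noisy.
rng = np.random.default_rng(0)
n, d = 100, 3
w_star = np.array([1.0, -2.0, 0.5])
X_true = rng.standard_normal((n, d))
X = X_true + 0.05 * rng.standard_normal((n, d))       # observed (noisy) features
y = X_true @ w_star + 0.05 * rng.standard_normal(n)   # observed (noisy) targets

# Total least squares via the SVD of the augmented matrix [X, y].
Xy = np.column_stack([X, y])
_, s, Vt = np.linalg.svd(Xy, full_matrices=False)
V = Vt.T
v_xy, v_yy = V[:d, d], V[d, d]    # blocks of the last right singular vector
w_tls = -v_xy / v_yy              # w = -v_xy * v_yy^{-1}, as derived above
print("TLS estimate:", w_tls)     # should be close to w_star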
(d) From the previous part, you can see that \begin{bmatrix} w \\ −1 \end{bmatrix} is a right-singular vector of [X, y]. Show that

(X^⊤X − σ²_{d+1} I) w = X^⊤ y.    (14)

Notice that this result is like ridge regression, but with an appropriately chosen negative regularization constant! This can be hypothetically interpreted as coming from the belief that there is some ambient noise at the σ²_{d+1} level that is corrupting all measurements.
Solution:

Since \begin{bmatrix} w \\ −1 \end{bmatrix} is a right-singular vector of [X, y], it is a right-eigenvector of the matrix

[X, y]^⊤ [X, y] = \begin{bmatrix} X^⊤X & X^⊤y \\ y^⊤X & y^⊤y \end{bmatrix}.    (11)

So to find it we solve

\begin{bmatrix} X^⊤X & X^⊤y \\ y^⊤X & y^⊤y \end{bmatrix} \begin{bmatrix} w \\ −1 \end{bmatrix} = σ²_{d+1} \begin{bmatrix} w \\ −1 \end{bmatrix}.    (12)

Ignore the bottom equation and consider the solution of the top equation:

X^⊤X w − X^⊤y = σ²_{d+1} w,    (13)

which can be rewritten as

(X^⊤X − σ²_{d+1} I) w = X^⊤ y.    (14)
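Identity (14) is also easy to check numerically. The following self-contained sketch (again illustrative, not part of the assigned solution) draws random data, computes the TLS solution from the SVD, and verifies the negative-ridge identity.

import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

Xy = np.column_stack([X, y])
_, s, Vt = np.linalg.svd(Xy, full_matrices=False)
V = Vt.T
w_tls = -V[:d, d] / V[d, d]       # TLS solution w = -v_xy / v_yy
sigma_last = s[-1]                # sigma_{d+1}, the smallest singular value of [X, y]

lhs = (X.T @ X - sigma_last**2 * np.eye(d)) @ w_tls
rhs = X.T @ y
assert np.allclose(lhs, rhs)      # the ridge-like identity (14) holds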
(e) In this problem, we will use total least squares to approximately learn the lighting in a photograph, which we can then use to paste new objects into the image while still maintaining the realism of the image. You will be estimating the lighting coefficients for the interior of St. Peter's Basilica, and you will then use these coefficients to change the lighting of an image of a tennis ball so that it can be pasted into the image. In Figure 1, we show the result of pasting the tennis ball in the image without adjusting the lighting on the ball. The ball looks too bright for the scene and does not look like it would fit in with other objects in the image.
To convincingly add a tennis ball to an image, we need to apply the appropriate lighting from the environment onto the added ball. To start, we will represent environment lighting as a spherical function f(n) where n is a 3-dimensional unit vector (‖n‖_2 = 1), and f outputs a 3-dimensional color vector, one component each for red, green, and blue light intensities. Because f(n) is a spherical function, the input n must correspond to a point on a sphere. The function f(n) represents the total incoming light from the direction n in the scene. The lighting function of a spherical object f(n) can be approximated by the first 9 spherical harmonic basis functions.
The first 9 unnormalized spherical harmonic basis functions are given by:
Figure 1: Tennis ball pasted on top of image of St. Peter's Basilica without lighting adjustment (left) and with lighting adjustment (right)
Figure 2: Image of a spherical mirror inside of St. Peter's Basilica
L1 = 1
L2 = y
L3 = x
L4 = z
L5 = xy
L6 = yz
L7 = 3z² − 1
L8 = xz
L9 = x² − y²
where n = [x, y, z]^⊤. The lighting function can then be approximated as

f(n) ≈ \sum_{i=1}^{9} γ_i L_i(n),

or, stacking the n training samples, as the matrix equation

\underbrace{\begin{bmatrix} — f(n_1) — \\ — f(n_2) — \\ \vdots \\ — f(n_n) — \end{bmatrix}}_{n×3} = \underbrace{\begin{bmatrix} L_1(n_1) & L_2(n_1) & \cdots & L_9(n_1) \\ L_1(n_2) & L_2(n_2) & \cdots & L_9(n_2) \\ \vdots & \vdots & & \vdots \\ L_1(n_n) & L_2(n_n) & \cdots & L_9(n_n) \end{bmatrix}}_{n×9} \underbrace{\begin{bmatrix} — γ_1 — \\ — γ_2 — \\ \vdots \\ — γ_9 — \end{bmatrix}}_{9×3},

where L_i(n) is the i-th basis function from the list above.
The function of incoming light f(n) can be measured by photographing a spherical mirror placed in the scene of interest. In this case, we provide you with an image of the sphere as seen in Figure 2. In the code provided, there is a function extractNormals(img) that will extract the training pairs (ni, f(ni)) from the image. An example using this function is in the code.
Use the spherical harmonic basis functions to create a 9-dimensional feature vector for each sample. Use this to formulate an ordinary least squares problem and solve for the unknown coefficients γ_i. Report the estimated values for γ_i and include a visualization of the approximation using the provided code. We have provided the starter code in the iPython notebook. The code provided loads the images, extracts the training data, relights the tennis ball with incorrect coefficients, and saves the results. Your task is to compute the basis functions and solve the least squares problems to provide the code with the correct coefficients.
We mark the corresponding part to fill in as the "TODO". In the following parts, you will also implement other least squares variants. Therefore the code first loads and preprocesses the data, and then conducts each least squares solve one by one.
If you run the starter code locally, you will need the Python packages numpy and scipy. Because the resulting data set is large, we reduce it in the code by taking every 50th entry in the
data. This is done for you in the starter code, but you can try using the entire data set or reduce it by a different amount.
Solution:
Implementing the first 9 basis functions should be straightforward.
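# The rows of ns are the unit normals extracted from the mirror image; ns[:, 0], ns[:, 1],
# ns[:, 2] are their x, y, z components, and the L variables below evaluate the 9 basis
# functions at every normal.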
L00 = np.ones(len(ns))
L11 = ns[:,0]
L10 = ns[:,2]
L1_1 = ns[:,1]
L21 = ns[:,0]*ns[:,2]
L2_1 = ns[:,1]*ns[:,2]
L2_2 = ns[:,0]*ns[:,1]
L20 = 3*ns[:,2]*ns[:,2]-1
L22 = ns[:,0]*ns[:,0]-ns[:,1]*ns[:,1]
B = np.stack([L00,L1_1,L10,L11,L2_2,L2_1,L20,L21,L22],-1)
Then we implement ordinary least squares.
def leastSquaresSolve(B, vs):
    # B is n x 9 (basis function values), vs is n x 3 (RGB color values)
    # Solve the normal equations (B^T B) coeff = B^T vs
    XX = B.T.dot(B)
    Xy = B.T.dot(vs)
    coeff = np.linalg.solve(XX, Xy)
    return coeff

if __name__ == '__main__':
    data, tennis, target = loadImages()
    ns, vs = extractNormals(data)
    B = computeBasis(ns)
    # reduce the number of samples because computing the SVD on
    # the entire data set takes too long
    Bp = B[::50]
    vsp = vs[::50]
    coeffLSQ = leastSquaresSolve(Bp, vsp)
    imgLSQ = relightSphere(tennis, coeffLSQ)
    targetLSQ = compositeImages(imgLSQ, target)
    print('Least squares:\n' + str(coeffLSQ))
    plt.figure()
    plt.imshow(targetLSQ)
    plt.title('Least Squares')
    plt.show()
Output:
[ 202.31845431 162.41956802 149.07075034]
[ -27.66555164 -17.88905339 -12.92356688]
[ -1.08629293 0.42947012 1.15475569]
[ -5.15203925 -4.51375871 -4.24262639]
[ -3.14053107 -3.70269907 -3.74382934]
[ 23.67671768 23.15698002 21.94638397]
[ -3.82167171 0.57606634 1.81637483]
[ 4.7346737 1.4677692 -1.12253649]
[ -9.72739616 -5.75691108 -4.8395598 ]
(f) When we extract from the data the direction n to compute (n_i, f(n_i)), we make some approximations about how the light is captured on the image. We also assume that the spherical mirror is a perfect sphere, but in reality, there will always be small imperfections. Thus, our measurement for n contains some error, which makes this an ideal problem to apply total least squares. Solve this problem with total least squares by allowing perturbations in the matrix of basis functions. Report the estimated values for γ_i and include a visualization of the approximation. The output image will be visibly wrong, and we'll explore how to fix this problem in the next part. Your implementation may only use the SVD and the matrix inverse functions from the linear algebra library in numpy as well as np.linalg.solve.
Solution: Theoretical derivation for the matrix case: As opposed to the theoretical TLS part, here we now have target vectors (per sample) instead of scalars, corresponding to the rows of the matrix f(n) ∈ R^{n×3}, and the original weight vector w is now replaced by the matrix γ ∈ R^{9×3}. The feature matrix is as always X ∈ R^{n×9}.

In this case again, we want [X + ε_X, Y + ε_Y] (with ε_Y now a matrix) to be the best rank d = 9 approximation to the matrix [X, Y] ∈ R^{n×12}, which by the Eckart-Young-Mirsky Theorem is found by setting the three smallest singular values of [X, Y] to zero. Now by the same argumentation as in the scalar case, the matrix \begin{bmatrix} γ \\ −I_3 \end{bmatrix} must lie in the nullspace of [X + ε_X, Y + ε_Y]. Since it is three dimensional, there is no unique answer. However, we know that, denoting the last three columns of the singular matrix V by \begin{bmatrix} V_{xy} \\ V_{yy} \end{bmatrix}, the following holds when V_{yy} is invertible (which it is in this dataset):

0 = (X + ε_X) V_{xy} + (Y + ε_Y) V_{yy}
  = (X + ε_X) V_{xy} V_{yy}^{−1} + (Y + ε_Y) I_3.

Therefore, \begin{bmatrix} −V_{xy} V_{yy}^{−1} \\ −I_3 \end{bmatrix} is in the nullspace of [X + ε_X, Y + ε_Y], so that by matching matrices, one candidate solution for γ is γ = −V_{xy} V_{yy}^{−1}.
def totalLeastSquaresSolve(B, vs, scale=1):
    # Build the augmented matrix [X, Y], with the outputs divided by the scale factor
    XY = np.concatenate([B, vs / scale], -1)
    # SVD of the augmented matrix
    u, s, v = np.linalg.svd(XY)
    v = v.T
    Vxy = v[:B.shape[1], B.shape[1]:]  # 9 x 3
    Vyy = v[B.shape[1]:, B.shape[1]:]  # 3 x 3
    # gamma = -Vxy Vyy^{-1}, as derived above
    coeff = -Vxy.dot(np.linalg.inv(Vyy))
    print(str(coeff))
    # undo the output scaling
    return coeff * scale

if __name__ == '__main__':
    data, tennis, target = loadImages()
    ns, vs = extractNormals(data)
    B = computeBasis(ns)
    # reduce the number of samples because computing the SVD on
    # the entire data set takes too long
    Bp = B[::50]
    vsp = vs[::50]
    coeffTLS = totalLeastSquaresSolve(Bp, vsp)
    imgTLS = relightSphere(tennis, coeffTLS)
    targetTLS = compositeImages(imgTLS, target)
    print('Total least squares:\n' + str(coeffTLS))
    plt.figure()
    plt.imshow(targetTLS)
    plt.title('Total Least Squares')
    plt.show()
    imsave('tls.png', targetTLS)
[ 2.13318421e+02 1.70780299e+02 1.57126297e+02]
[ -3.23046362e+01 -2.02975310e+01 -1.45516114e+01]
[ -4.89811386e+00 -3.37684058e+00 -1.14207091e+00]
[ -4.31689131e+00 -3.80778081e+00 -4.83616306e+00]
[ -7.05901066e+03 -7.39934207e+03 -4.26448732e+03]
[ -3.05378224e+02 -1.56329401e+02 3.50285345e+02]
[ -9.76079364e+00 -5.33182216e+00 -1.55699782e+00]
[ 7.30792588e+02 3.52130316e+02 -6.11683200e+02]
[ -9.08887079e+00 -3.84309477e+00 -4.16456437e+00]
(g) In the previous part, you should have noticed that the visualization is drastically different from the one generated using least squares. Recall that in total least squares we are minimizing ‖[ε_X, ε_y]‖²_F. Intuitively, to minimize the Frobenius norm of components of both the inputs and outputs, the inputs and outputs should be on the same scale. However, this is not the case here. Color values in an image will typically be in [0, 255], but the original image had a much larger range. We compressed the range to a smaller scale using tone mapping, but the effect of the compression is that relatively bright areas of the image become less bright. As a compromise, we scaled the image colors down to a maximum color value of 384 instead of 255. Thus, the inputs here are all unit vectors, and the outputs are 3-dimensional vectors where each value is in [0, 384]. Propose a value by which to scale the outputs f(n_i) such that the values of the inputs and outputs are roughly on the same scale. Solve this scaled total least squares problem, report the computed spherical harmonic coefficients and provide a rendering of the relit sphere.
Solution:
Recall: total least squares assumes that the noise is the same in all directions. When we just have some data, it can be hard to think about what the noise model should be. The default is to take a “significant figures” mentality and assume that everything should be on the same scale, assuming that noise is roughly proportional to the size of the inputs. Because most of the basis functions lie within [−1, 1], we want to scale the image pixel values so that they lie in a similar range. For these results, we can scale the values by 1/384, but any reasonable value that scales the pixel values to a similar range is acceptable.
if __name__ == '__main__':
    data, tennis, target = loadImages()
    ns, vs = extractNormals(data)
    B = computeBasis(ns)
    # reduce the number of samples because computing the SVD on
    # the entire data set takes too long
    Bp = B[::50]
    vsp = vs[::50]
    coeffTLSScaled = totalLeastSquaresSolve(Bp, vsp, scale=255*1.5)
    imgTLSScaled = relightSphere(tennis, coeffTLSScaled)
    targetTLSScaled = compositeImages(imgTLSScaled, target)
    print('Total least squares scaled:\n' + str(coeffTLSScaled))
    plt.figure()
    plt.imshow(targetTLSScaled)
    plt.title('Total Least Squares Scaled')
    plt.show()
    imsave('tls_scaled.png', targetTLSScaled)
[ 209.41539449 169.06782937 155.39642541]
[ -30.28100667 -20.3163958 -15.21596685]
[ -1.05621451 0.46391495 1.19212042]
[ -5.7563859 -5.08161145 -4.78411908]
[ -7.95607504 -8.25078526 -8.0969764 ]
[ 55.29299419 52.93568994 50.39069818]
[ -3.84934062 0.5565465 1.80231959]
[ 7.35375998 3.85567243 1.0984583 ]
[ -10.91282516 -6.85792251 -5.88064457]
4 Linear Regression in Overparameterized Regime
Note: For this problem we 0-index every vector so the first entry of a vector x ∈ Rn is referred to as x0 and the last entry is xn−1.
In this problem we study linear regression in the overparameterized regime where the number of features d is greater than the number of training points n. The setup for this problem is highly stylized and simplified to reduce the problem to the minimal example required to demonstrate the kind of effects from overparameterization that we want you to see and understand. Real problems are unlikely to be, for example, 1-sparse, but as you will see in many cases more complex interactions can be viewed through an appropriate change of basis to create an equivalent simpler problem.
For the sake of simplicity and ease of visualization, assume that we are trying to learn a one- dimensional function f(x) for x ∈ [0,1]. For mathematical ease, consider a standard complex
Fourier featurization:

φ_k^u(x) = e^{j2πkx}

for k ∈ {0, 1, . . . , d − 1}. Notice that this featurization is orthonormal relative to a uniform test-distribution on [0, 1]. (Here, we have to remember that when dealing with complex vectors, we need to take a conjugate transpose when computing an inner product, and thus when dealing with the relevant complex inner product between two functions relative to this test-distribution, ⟨f, g⟩ = \int_0^1 \bar{g}(t) f(t) dt.)

Assume that the number of features d = (M + 1)n for a positive integer M. Consider the case where the true function we are trying to learn is 1-sparse and consists of a single "low frequency" feature, i.e.

y = φ_t^u(x)

for 0 ≤ t < n. Such a restriction is required if we wish to have any hope of linear regression to succeed, because the first n features together span the entire space of C^n and can perfectly fit any n training points.

Assume that the training points are evenly spaced, x_i = i/n for i ∈ {0, 1, . . . , n − 1}, in the interval [0, 1).

We have access to observations y ∈ C^n where

y = Φ_t^u + ε_c,

where Φ_t^u ∈ C^n is the column vector with entries φ_t^u(x_0), φ_t^u(x_1), . . . , φ_t^u(x_{n−1}). Here ε_c refers to the noise in the training data and we assume ε_c ∼ CN(0, σ²I_n).

Note: If ε_c ∼ CN(0, σ²I_n) then it can be expressed as ε_c = ε_Re + jε_Im where ε_Re and ε_Im are i.i.d. distributed as N(0, ½σ²I_n). Further, the k-th entry of ε_c is distributed as ε_k ∼ CN(0, σ²) and the ε_k are independent across k.
Instead of performing linear regression using this featurization, we first scale the features to obtain scaled features,

φ_k(x) = c_k φ_k^u(x),

where c_k ∈ R for k = 0, 1, . . . , d − 1. By scaling features appropriately we can favor some features (e.g. low frequency features) and then hope to preserve the true signal in these features. Note that these scalings c_k are chosen, not learned. Let Φ_t denote the column vector with entries φ_t(x_0), φ_t(x_1), . . . , φ_t(x_{n−1}), and similarly Φ_t^u the unscaled version of the same thing. To do learning, we solve the minimum norm interpolation problem using the scaled features:

β̂ = arg min_β ‖β‖_2   s.t.   y = \sum_{k=0}^{d−1} β_k Φ_k.

Having found β̂ we convert it into an equivalent set of coefficients α̂ to obtain

y = \sum_{k=0}^{d−1} α̂_k Φ_k^u,

where α̂_k = c_k β̂_k.
The test data x_test, y_test is drawn from the distribution p and is noiseless, i.e. y_test = φ_t^u(x_test). Recall that the features {φ_k^u}_{k=0,...,d−1} are orthonormal with respect to the test distribution.

We make a prediction at test time using the coefficients α̂,

y_pred = \sum_{k=0}^{d−1} α̂_k φ_k^u(x_test),

or equivalently,

y_pred = \sum_{k=0}^{d−1} β̂_k φ_k(x_test).
You may ask why we would think about predicting using α̂ and the normalized features φ_k^u(x_test) rather than the scaled φ_k(x_test) and directly learned β̂. In practice, we would always use β̂ since that's what we see when we train, but for the purpose of this problem the prediction using α̂ is identical, and error analysis using α̂ is easier. We measure how well we learn the function by computing

E_pred = E[|y_test − y_pred|²],

where the expectation is both over the randomness in the test point and the randomness in the training noise.
To obtain low prediction error and for linear regression to succeed in this regime, we must be able to preserve the coefficient for the true feature and dissipate/absorb noise harmlessly across the rest of the coefficients. In other words, we need the contamination to stay small.
In this problem we will see that by using an appropriate bi-level scaling model that favors lower frequency features, we can achieve both.
The most notable thing about this feature family and the particular choice of training points that we have chosen is that the features are arranged into families of perfect aliases as far as these training points are concerned. If t is the index of a particular feature in the first n features, then A(t) captures all the indices of the perfect aliases of that feature. And R(t) includes the original feature as well.
A(t) = {t + n, t + 2n, . . . , t + Mn},
R(t) = {t} ∪ A(t),

and notice that:

Φ_k^u = Φ_t^u,  k ∈ A(t),    (9)

⟨Φ_t^u, Φ_k^u⟩ = (Φ_k^u)^H Φ_t^u = \begin{cases} n, & k ∈ R(t) \\ 0, & k ∉ R(t) \end{cases}.    (10)

Here we use (Φ_k^u)^H to denote the conjugate-transpose of the vector Φ_k^u.
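These aliasing identities are easy to confirm numerically. The following short check is illustrative only (not part of the problem), with n, M, and t chosen arbitrarily.

import numpy as np

n, M, t = 8, 3, 2
d = (M + 1) * n
x = np.arange(n) / n                                     # training points x_i = i/n
Phi = np.exp(2j * np.pi * np.outer(x, np.arange(d)))     # Phi[:, k] = phi_k^u evaluated at the x_i

for k in range(t + n, d, n):                             # k in A(t)
    assert np.allclose(Phi[:, k], Phi[:, t])             # perfect aliases, Eq. (9)
assert np.allclose(Phi[:, t].conj() @ Phi[:, t], n)      # Eq. (10) for k in R(t)
assert np.allclose(Phi[:, t].conj() @ Phi[:, t + 1], 0)  # Eq. (10) for k not in R(t)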
(a) 1-sparse noiseless setting. First consider the case where the true function is y = f(x) = φ_t^u(x) and we have no noise in the training data, i.e. y = Φ_t^u. Show that

α̂_k = \begin{cases} \dfrac{c_k^2}{V}, & k ∈ R(t), \\ 0, & \text{otherwise}, \end{cases}

where

V = \sum_{k ∈ R(t)} c_k^2.

(Hint: The following steps might be useful. First show that it suffices to consider the problem restricted to the indices k ∈ R(t) and transform the problem to:

ξ̂ = arg min_ξ ‖ξ‖_2   s.t.   \sum_{j=0}^{M} ξ_j c̃_j = 1,

where c̃_k = c_{t+kn} for k = 0, 1, . . . , M.

Then either use Cauchy-Schwartz, solve the optimization problem using the Lagrangian, or recognize this as a particularly simple case where the Moore-Penrose Pseudoinverse applies and has a particularly trivial form.)
Solution:

We have

β̂ = arg min_β ‖β‖_2   s.t.   y = Φ_t^u = \sum_{k=0}^{d−1} β_k c_k Φ_k^u.

Taking the inner product with Φ_k^u on both sides of the constraint and using Equations (9), (10), we have

\sum_{k ∈ R(t)} β_k c_k = 1,

and for k ∉ R(t), β_k is a free parameter. Thus it is optimal to choose β̂_k = 0 for k ∉ R(t). Consequently α̂_k = 0 for k ∉ R(t).

Denoting c̃_k = c_{t+kn} and restricting the problem to indices in R(t), we have β̂_{t+kn} = ξ̂_k where

ξ̂ = arg min_ξ ‖ξ‖_2   s.t.   \sum_{j=0}^{M} ξ_j c̃_j = 1.

Method 1:

By Cauchy-Schwartz we have

‖ξ‖_2 ≥ \frac{|⟨c̃, ξ⟩|}{‖c̃‖_2},

with equality iff ξ = a c̃ for some scalar a. Substituting this into the constraint and solving for a we obtain a = 1/‖c̃‖_2^2, which gives us

ξ̂_k = \frac{c̃_k}{‖c̃‖_2^2},

and, thus,

β̂_k = \frac{c_k}{V}, k ∈ R(t),
α̂_k = \frac{c_k^2}{V}, k ∈ R(t).

Method 2:

We have

arg min_ξ ‖ξ‖_2   such that   \sum_{k=0}^{M} ξ_k c̃_k = 1.

We can note that we can change the objective to ½‖ξ‖_2^2, and since this is a convex problem, we can work with the Lagrangian

L(ξ, μ) = ½‖ξ‖_2^2 + μ\left(\sum_k ξ_k c̃_k − 1\right),
∇_ξ L = ξ + μ c̃.

The optimal solution satisfies the first order conditions, so we have that ξ = −μ c̃, and thus our constraint is −μ ‖c̃‖_2^2 = 1, which implies that μ = −1/‖c̃‖_2^2. Plugging back in shows us that

ξ̂_k = \frac{c̃_k}{‖c̃‖_2^2},

and, thus,

β̂_k = \frac{c_k}{V}, k ∈ R(t),
α̂_k = \frac{c_k^2}{V}, k ∈ R(t).
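As a sanity check on this closed form (not part of the assigned solution), the sketch below solves the minimum-norm interpolation problem with numpy's pseudoinverse for a small n and M and an arbitrary choice of scalings c_k, and compares α̂ against the c_k²/V formula.

import numpy as np

# Small instance of the 1-sparse noiseless setting.
n, M, t = 8, 3, 2
d = (M + 1) * n
x = np.arange(n) / n
Phi_u = np.exp(2j * np.pi * np.outer(x, np.arange(d)))   # unscaled features
c = 1.0 / (1.0 + np.arange(d))                           # arbitrary scalings favoring low frequencies
Phi = Phi_u * c                                          # scaled features, Phi[:, k] = c_k * phi_k^u

y = Phi_u[:, t]                                          # noiseless 1-sparse observations
beta_hat = np.linalg.pinv(Phi) @ y                       # minimum-norm interpolating solution
alpha_hat = c * beta_hat                                 # alpha_k = c_k * beta_k

R_t = np.arange(t, d, n)                                 # R(t) = {t, t+n, ..., t+Mn}
V = np.sum(c[R_t] ** 2)
expected = np.zeros(d)
expected[R_t] = c[R_t] ** 2 / V
assert np.allclose(alpha_hat, expected)                  # matches the closed form above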
(b) Pure noise. Next consider the case where the true function is y = f(x) = 0 but the observations are noisy. Thus we have pure noise, y = ε_c ∼ CN(0, σ²I_n). Show that we can represent ε_c using the first n features as

ε_c = Φ^u δ,

where

Φ^u = [Φ_0^u  Φ_1^u  . . .  Φ_{n−1}^u]

and δ ∼ CN(0, \frac{1}{n}σ²I_n).

(Hint: You may use without proof that if ε_c ∼ CN(0, σ²I_n) then Bε_c ∼ CN(0, σ²BB^H). Notice the connection between the property here and the orthogonality of the appropriate DFT matrix.)

Solution: One way to proceed is by using the least squares solution. We have

δ = ((Φ^u)^H Φ^u)^{−1} (Φ^u)^H ε_c = \frac{1}{n}(Φ^u)^H ε_c,

so that Φ^u δ = ε_c (Φ^u is a square invertible matrix, namely a scaled DFT matrix, so this solution interpolates ε_c exactly). Since \frac{1}{n^2}(Φ^u)^H Φ^u = \frac{1}{n} I_n, using the hint we have δ ∼ CN(0, \frac{1}{n}σ²I_n).
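A short numerical check of this reconstruction identity (again purely illustrative):

import numpy as np

n = 8
x = np.arange(n) / n
Phi_u = np.exp(2j * np.pi * np.outer(x, np.arange(n)))    # first n (unscaled) features: a DFT-like matrix
eps = np.random.randn(n) + 1j * np.random.randn(n)        # stand-in for complex noise
delta = Phi_u.conj().T @ eps / n                          # delta = (1/n) (Phi^u)^H eps
assert np.allclose(Phi_u @ delta, eps)                    # Phi^u delta reproduces the noise exactly
assert np.allclose(Phi_u.conj().T @ Phi_u / n, np.eye(n)) # (1/n) (Phi^u)^H Phi^u = I_n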
(c) 1-sparse noisy setting. Computing α̂.

Finally consider the case where the true function is y = f(x) = φ_t^u(x) and the observations are noisy, y = Φ_t^u + ε_c. Show that the entries of α̂ are given by

α̂_k = \frac{c_k^2}{V(q)}(1 + δ_t),  k ∈ R(q), q = t,
α̂_k = \frac{c_k^2}{V(q)} δ_q,  k ∈ R(q), q ≠ t,

where V(q) = \sum_{k ∈ R(q)} c_k^2 and q ranges from 0 to n − 1. Recall that 0 ≤ t < n.