Some Calculations from Bias Variance
Christopher Ré October 25, 2020
This note contains a reprise of the eigenvalue arguments to understand how variance is reduced by regularization. We also describe different ways regularization can occur, including from the algorithm or the initialization. This note contains some additional calculations from the lecture and Piazza, just so that we have typeset versions of them. They contain no new information over the lecture, but they do supplement the notes.
Recall we have a design matrix X ∈ R^{n×d} and labels y ∈ R^n. We are interested in the underdetermined case n < d, so that rank(X) ≤ n < d. We consider the following optimization problem for least squares with a regularization parameter λ ≥ 0:
min_{θ ∈ R^d} ℓ(θ; λ),   where   ℓ(θ; λ) = (1/2)∥Xθ − y∥^2 + (λ/2)∥θ∥^2
Normal Equations   Computing derivatives as we did for the normal equations, we see that:
∇_θ ℓ(θ; λ) = X^T(Xθ − y) + λθ = (X^T X + λI)θ − X^T y
By setting ∇_θ ℓ(θ; λ) = 0, we can solve for the θ̂ that minimizes the above problem. Explicitly, we have:
θ̂ = (X^T X + λI)^{-1} X^T y    (1)
To see that the inverse in Eq. 1 exists, we observe that X^T X is a symmetric, real d × d matrix, so it has d real eigenvalues (some may be 0). Moreover, it is positive semidefinite, and we capture this by writing eig(X^T X) = {σ_1^2, . . . , σ_d^2}. Now, inspired by the regularized problem, we examine:
eig(X^T X + λI) = {σ_1^2 + λ, . . . , σ_d^2 + λ}
Since σ_i^2 ≥ 0 for all i ∈ [d], if we set λ > 0, then X^T X + λI is full rank, and the inverse of (X^T X + λI) exists. In turn, this means there is a unique such θ̂.
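As a quick numerical sanity check, Eq. 1 is easy to compute directly. The following is a minimal sketch in Python, assuming hypothetical random data and an arbitrary choice of λ (n, d, and λ below are made-up values, not values from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1                    # underdetermined: n < d; lam > 0 is arbitrary
X = rng.standard_normal((n, d))            # hypothetical design matrix
y = rng.standard_normal(n)                 # hypothetical labels

A = X.T @ X + lam * np.eye(d)              # X^T X + lam*I
theta_hat = np.linalg.solve(A, X.T @ y)    # the closed form of Eq. 1

print(np.linalg.matrix_rank(X.T @ X))      # at most n = 20 < d = 50: singular on its own
print(np.linalg.matrix_rank(A))            # d = 50: adding lam*I makes it full rank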
Variance   Recall that in bias-variance, we are concerned with the variance of θ̂ as we sample the training set. We want to argue that as the regularization parameter λ increases, the variance in the fitted θ̂ decreases.
We won’t carry out the full formal argument, but it suffices to make one observation that is immediate from Eq. 1: the variance of θ̂ is proportional to the eigenvalues of (X^T X + λI)^{-1}. To see this, observe that the eigenvalues of the inverse are just the inverses of the eigenvalues:
eig((X^T X + λI)^{-1}) = {1/(σ_1^2 + λ), . . . , 1/(σ_d^2 + λ)}
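This identity is easy to verify numerically. Below is a small sketch with a hypothetical random design matrix (the dimensions and λ are arbitrary choices, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))          # hypothetical design, n < d
lam, d = 0.1, X.shape[1]

sigma_sq = np.linalg.eigvalsh(X.T @ X)     # the sigma_i^2 (many are ~0 here)
inv_spec = np.linalg.eigvalsh(np.linalg.inv(X.T @ X + lam * np.eye(d)))

# The spectrum of the inverse matches 1/(sigma_i^2 + lam).
print(np.allclose(np.sort(inv_spec), np.sort(1.0 / (sigma_sq + lam))))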
Now, condition on the points we draw, namely X. Then the randomness is in the label noise: recall the linear regression model y = Xθ^* + ε with ε ∼ N(0, τ^2 I), so that y ∼ N(Xθ^*, τ^2 I).
Recall a fact about the multivariate normal distribution: if y ∼ N(μ, Σ), then Ay ∼ N(Aμ, AΣA^T).
Using linearity, we can verify that the expectation of θ̂ is
E[θ̂] = E[(X^T X + λI)^{-1} X^T y]
= E[(X^T X + λI)^{-1} X^T (Xθ^* + N(0, τ^2 I))]
= E[(X^T X + λI)^{-1} X^T (Xθ^*)]    (the noise term has mean 0)
= (X^T X + λI)^{-1} (X^T X) θ^*    (essentially a “shrunk” θ^*)
The last line above suggests that the more regularization we add (the larger λ is), the more the estimated θ̂ is shrunk towards 0. In other words, regularization adds bias (towards zero in this case). Though we pay the cost of higher bias, we gain by reducing the variance of θ̂. To see this bias-variance tradeoff concretely, observe the covariance matrix of θ̂:
C := Cov[θ̂]
= ((X^T X + λI)^{-1} X^T) (τ^2 I) (X (X^T X + λI)^{-1})
= τ^2 (X^T X + λI)^{-1} (X^T X) (X^T X + λI)^{-1},
and therefore
eig(C) = {τ^2 σ_1^2 / (σ_1^2 + λ)^2, . . . , τ^2 σ_d^2 / (σ_d^2 + λ)^2}
Notice that the entire spectrum of the covariance is a decreasing function of λ. By decomposing in the eigenbasis of X^T X, we can see that the variance E[∥θ̂ − E[θ̂]∥^2] = tr(C) is a decreasing function of λ, as desired.
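A short Monte Carlo simulation makes both effects visible: the average θ̂ shrinks toward 0 as λ grows, and the spread of θ̂ across resampled label noise drops. This is only a sketch; n, d, τ, θ^*, and the λ grid are hypothetical choices, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 20, 50, 0.5
X = rng.standard_normal((n, d))            # condition on a fixed design X
theta_star = rng.standard_normal(d)        # hypothetical true parameter

for lam in [0.1, 1.0, 10.0]:
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    estimates = []
    for _ in range(500):                   # resample only the label noise
        y = X @ theta_star + tau * rng.standard_normal(n)
        estimates.append(A_inv @ X.T @ y)
    estimates = np.array(estimates)
    mean_norm = np.linalg.norm(estimates.mean(axis=0))   # norm of the average estimate
    total_var = estimates.var(axis=0).sum()              # empirical trace of Cov[theta_hat]
    print(lam, mean_norm, total_var)       # both numbers fall as lam grows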
Gradient Descent   We show that you can initialize gradient descent in a way that effectively regularizes underdetermined least squares, even with no regularization penalty (λ = 0). Our first observation is that any point x ∈ R^d can be decomposed into two orthogonal components x_0, x_1 such that
x = x_0 + x_1,   where   x_0 ∈ Null(X) and x_1 ∈ Range(X^T).
Recall that Null(X) and Range(X^T) are orthogonal subspaces by the fundamental theorem of linear algebra. We write P_0 for the projection onto the null space and P_1 for the projection onto the range; then x_0 = P_0(x) and x_1 = P_1(x).
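Concretely, P_1 can be written with the Moore-Penrose pseudoinverse as P_1 = X^+ X, and P_0 = I − X^+ X. Here is a minimal sketch of the decomposition with a hypothetical random X:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))          # hypothetical design, n < d
d = X.shape[1]

P1 = np.linalg.pinv(X) @ X                 # projection onto Range(X^T)
P0 = np.eye(d) - P1                        # projection onto Null(X)

v = rng.standard_normal(d)
v0, v1 = P0 @ v, P1 @ v
print(np.allclose(v, v0 + v1))             # v = v0 + v1
print(np.allclose(X @ v0, 0))              # v0 lies in Null(X)
print(abs(v0 @ v1) < 1e-8)                 # the two components are orthogonal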
If one initializes at a point θ, then we observe that the gradient is orthogonal to the null space. That is, if g(θ) = X^T(Xθ − y), then g(θ)^T P_0(v) = 0 for any v ∈ R^d. But then:
P_0(θ^{(t+1)}) = P_0(θ^{(t)} − α g(θ^{(t)})) = P_0(θ^{(t)}) − α P_0(g(θ^{(t)})) = P_0(θ^{(t)})
That is, no learning happens in the null space. Whatever component we place in the null space at initialization stays there, unchanged, throughout the execution.
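The following sketch checks this numerically with hypothetical data, an arbitrary initialization, and a step size chosen only for stability: after many gradient steps on the unregularized loss, the null-space component of the iterate is unchanged.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))          # hypothetical design, n < d
y = rng.standard_normal(20)
d = X.shape[1]
alpha = 1.0 / np.linalg.norm(X, 2) ** 2    # step size below 2 / lambda_max(X^T X)

P0 = np.eye(d) - np.linalg.pinv(X) @ X     # projection onto Null(X)
theta = rng.standard_normal(d)             # arbitrary initialization
null_part_at_start = P0 @ theta

for _ in range(1000):
    theta -= alpha * X.T @ (X @ theta - y) # gradient step with lambda = 0

print(np.allclose(P0 @ theta, null_part_at_start))   # True: the null component is frozen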
A key property of the Moore-Penrose pseudoinverse is that if θ̂ = (X^T X)^+ X^T y, then P_0(θ̂) = 0. Hence, the gradient descent solution initialized at θ_0 can be written as θ̂ + P_0(θ_0). Two immediate observations:
• Using the Moore-Penrose pseudoinverse acts as regularization, because it selects the solution θ̂, the minimum-norm solution with no component in the null space.
• So does gradient descent, provided that we initialize at θ_0 = 0 (see the sketch below). This is particularly interesting, as many modern machine learning techniques operate in these underdetermined regimes.
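Here is a minimal sketch of that last point, again with hypothetical data and a step size chosen only for stability: gradient descent started at θ_0 = 0 ends up at (numerically) the same solution as the pseudoinverse.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))          # hypothetical design, n < d
y = rng.standard_normal(20)
d = X.shape[1]
alpha = 1.0 / np.linalg.norm(X, 2) ** 2    # step size below 2 / lambda_max(X^T X)

theta_pinv = np.linalg.pinv(X) @ y         # equals (X^T X)^+ X^T y, has no Null(X) part
theta_gd = np.zeros(d)                     # initialize at theta_0 = 0
for _ in range(20000):
    theta_gd -= alpha * X.T @ (X @ theta_gd - y)

print(np.allclose(theta_gd, theta_pinv))   # the two procedures agree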
We’ve argued that there are many ways to arrive at equivalent solutions, and that this allows us to understand their effect on the model-fitting procedure as regularization. Many modern machine learning methods, including dropout and data augmentation, are not penalties, but their effect is also understood as regularization. One contrast with the methods above is that how much they effectively regularize often depends on some property of the data; in some sense, they adapt to the data. A final comment is that, in the same sense, adding more data regularizes the model as well!