Linear regression
Daniel Hsu (COMS 4771)
Maximum likelihood estimation
One of the simplest linear regression models is the following: $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$, and
\[
  Y \mid X = x \;\sim\; \mathrm{N}(x^\top w, \sigma^2), \qquad x \in \mathbb{R}^d.
\]
Here, the vector $w \in \mathbb{R}^d$ and the scalar $\sigma^2 > 0$ are the parameters of the model. (The marginal distribution of $X$ is unspecified.)
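As a concrete illustration, here is a minimal sketch (Python with NumPy; the values of $d$, $n$, $w$, and $\sigma$ are made up, and the standard normal marginal for $X$ is an arbitrary choice, since the model leaves it unspecified) of drawing a sample from this model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 500
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical parameter vector w
sigma = 0.7                           # hypothetical noise level (sigma^2 = 0.49)

# The marginal distribution of X is left unspecified by the model; here we
# arbitrarily take X ~ N(0, I_d) just to have something to sample from.
X = rng.standard_normal((n, d))       # rows are x_i^T

# Y | X = x  ~  N(x^T w, sigma^2)
Y = X @ w_true + sigma * rng.standard_normal(n)
```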
The log-likelihood of $(w, \sigma^2)$ given $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \dots, n$ is
\[
  \sum_{i=1}^{n} \left( -\ln\sqrt{2\pi\sigma^2} - \frac{(y_i - x_i^\top w)^2}{2\sigma^2} \right) + T,
\]
where $T$ is some quantity that does not depend on $(w, \sigma^2)$. Therefore, maximizing the log-likelihood over $w \in \mathbb{R}^d$ (for any $\sigma^2 > 0$) is the same as minimizing
\[
  \frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2.
\]
So, the maximum likelihood estimator (MLE) of $w$ in this model is
\[
  \hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2.
\]
(It is not necessarily uniquely determined.)
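Computationally, the MLE of $w$ is an ordinary least-squares problem. A minimal sketch (Python with NumPy, reusing the hypothetical `X` and `Y` from the sketch above) using NumPy's least-squares solver:

```python
import numpy as np

def mle_linear_regression(X, y):
    """MLE of w: minimize (1/n) * sum_i (x_i^T w - y_i)^2 over w.

    The 1/n factor does not affect the minimizer, so this is an ordinary
    least-squares problem; np.linalg.lstsq returns a minimizer (one of
    possibly many if the columns of X are linearly dependent).
    """
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

# Example: with the simulated data above, w_hat should be close to w_true.
# w_hat = mle_linear_regression(X, Y)
```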
Empirical risk minimization
Let $P_n$ be the empirical distribution on $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$, i.e., the probability distribution over $\mathbb{R}^d \times \mathbb{R}$ with probability mass function $p_n$ given by
\[
  p_n((x, y)) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{(x, y) = (x_i, y_i)\}, \qquad (x, y) \in \mathbb{R}^d \times \mathbb{R}.
\]
The distribution assigns probability mass $1/n$ to each $(x_i, y_i)$ for $i = 1, \dots, n$; no mass is assigned anywhere else. Now consider $(\tilde{X}, \tilde{Y}) \sim P_n$. The expected squared loss of the linear function $w \in \mathbb{R}^d$ on $(\tilde{X}, \tilde{Y})$ is
\[
  R(w) := \mathbb{E}\big[(\tilde{X}^\top w - \tilde{Y})^2\big] = \frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2;
\]
we call this the empirical risk of $w$ on the data $(x_1, y_1), \dots, (x_n, y_n)$.
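In code, the empirical risk is just the average squared loss over the data; a small sketch (Python with NumPy, hypothetical argument names) that makes the identity between the expectation under $P_n$ and the sample average explicit:

```python
import numpy as np

def empirical_risk(w, X, y):
    """Average squared loss of the linear function w on the data (X, y).

    Equals E[(X~^T w - Y~)^2] for (X~, Y~) ~ P_n, since that expectation is a
    sum over the n data points, each weighted by 1/n.
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2)
```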
Empirical risk minimization is the method of choosing a function (from some class of functions) by minimizing the empirical risk on the data. In the case of linear functions, the empirical risk minimizer (ERM) is
\[
  \hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} R(w) = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2.
\]
This is the same as the MLE from above. (It is not necessarily uniquely determined.)
Normal equations
Let
\[
  A := \frac{1}{\sqrt{n}}
  \begin{pmatrix}
    \leftarrow & x_1^\top & \rightarrow \\
    & \vdots & \\
    \leftarrow & x_n^\top & \rightarrow
  \end{pmatrix},
  \qquad
  b := \frac{1}{\sqrt{n}}
  \begin{pmatrix}
    y_1 \\ \vdots \\ y_n
  \end{pmatrix}.
\]
We can write the empirical risk as
\[
  R(w) = \|Aw - b\|^2, \qquad w \in \mathbb{R}^d.
\]
The gradient of $R$ is given by
\[
  \nabla R(w) = \nabla\big\{(Aw - b)^\top (Aw - b)\big\} = 2A^\top(Aw - b), \qquad w \in \mathbb{R}^d;
\]
it is equal to zero for $w \in \mathbb{R}^d$ satisfying
\[
  A^\top A w = A^\top b.
\]
These linear equations in $w$, which define the critical points of $R$, are collectively called the normal equations.
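A minimal sketch (Python with NumPy, hypothetical data matrix `X` with rows $x_i^\top$ and response vector `y`) of forming $A$ and $b$ and solving the normal equations directly. The `np.linalg.solve` call assumes $A^\top A$ is invertible; when it is not, the normal equations still have solutions, but one would use a least-squares or pseudoinverse routine instead.

```python
import numpy as np

def solve_normal_equations(X, y):
    n = X.shape[0]
    A = X / np.sqrt(n)        # rows are x_i^T / sqrt(n)
    b = y / np.sqrt(n)
    # Normal equations: A^T A w = A^T b.  (The 1/sqrt(n) factors cancel, so
    # this is equivalent to solving X^T X w = X^T y.)
    w_hat = np.linalg.solve(A.T @ A, A.T @ b)   # assumes A^T A is invertible
    return w_hat, A, b
```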
It turns out the normal equations in fact determine the minimizers of $R$. To see this, let $\hat{w}$ be any solution to the normal equations. Now consider any other $w \in \mathbb{R}^d$. We write the empirical risk of $w$ as follows:
\begin{align*}
  R(w) &= \|Aw - b\|^2 \\
       &= \|A(w - \hat{w}) + A\hat{w} - b\|^2 \\
       &= \|A(w - \hat{w})\|^2 + 2\big(A(w - \hat{w})\big)^\top (A\hat{w} - b) + \|A\hat{w} - b\|^2 \\
       &= \|A(w - \hat{w})\|^2 + 2(w - \hat{w})^\top (A^\top A\hat{w} - A^\top b) + \|A\hat{w} - b\|^2 \\
       &= \|A(w - \hat{w})\|^2 + \|A\hat{w} - b\|^2 \\
       &\geq R(\hat{w}).
\end{align*}
The second-to-last step above uses the fact that $\hat{w}$ is a solution to the normal equations. Therefore, we conclude that $R(w) \geq R(\hat{w})$ for all $w \in \mathbb{R}^d$ and all solutions $\hat{w}$ to the normal equations. So the solutions to the normal equations are the minimizers of $R$.
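The key identity established above, $R(w) = \|A(w - \hat{w})\|^2 + R(\hat{w})$, is easy to check numerically. A small sketch (Python with NumPy, reusing the hypothetical `A`, `b`, `w_hat` returned by `solve_normal_equations` above):

```python
import numpy as np

def check_minimizer(A, b, w_hat, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    risk = lambda w: np.sum((A @ w - b) ** 2)         # empirical risk ||Aw - b||^2
    for _ in range(trials):
        w = w_hat + rng.standard_normal(A.shape[1])   # arbitrary competitor
        # R(w) = ||A(w - w_hat)||^2 + R(w_hat), hence R(w) >= R(w_hat).
        decomposition = np.sum((A @ (w - w_hat)) ** 2) + risk(w_hat)
        assert np.isclose(risk(w), decomposition)
        assert risk(w) >= risk(w_hat) - 1e-12
```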
Statistical interpretation
Suppose $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$. The risk of a linear function $w \in \mathbb{R}^d$ is
\[
  R(w) := \mathbb{E}\big[(X^\top w - Y)^2\big],
\]
where the expectation is now with respect to the true distribution of $(X, Y)$ rather than the empirical distribution. Which linear functions have the smallest risk?
The gradient of $R$ is given by
\[
  \nabla R(w) = \mathbb{E}\big[\nabla\{(X^\top w - Y)^2\}\big] = 2\,\mathbb{E}\big[X(X^\top w - Y)\big], \qquad w \in \mathbb{R}^d;
\]
it is equal to zero for $w \in \mathbb{R}^d$ satisfying
\[
  \mathbb{E}[XX^\top]\, w = \mathbb{E}[YX].
\]
These linear equations in w, which define the critical points of R, are collectively called the population normal
equations.
It turns out the population normal equations in fact determine the minimizers of $R$. To see this, let $w^\star$ be any solution to the population normal equations. Now consider any other $w \in \mathbb{R}^d$. We write the risk of $w$ as follows:
\begin{align*}
  R(w) &= \mathbb{E}\big[(X^\top w - Y)^2\big] \\
       &= \mathbb{E}\big[(X^\top(w - w^\star) + X^\top w^\star - Y)^2\big] \\
       &= \mathbb{E}\big[(X^\top(w - w^\star))^2 + 2(X^\top(w - w^\star))(X^\top w^\star - Y) + (X^\top w^\star - Y)^2\big] \\
       &= \mathbb{E}\big[(X^\top(w - w^\star))^2\big] + 2(w - w^\star)^\top\big(\mathbb{E}[XX^\top]\, w^\star - \mathbb{E}[YX]\big) + \mathbb{E}\big[(X^\top w^\star - Y)^2\big] \\
       &= \mathbb{E}\big[(X^\top(w - w^\star))^2\big] + \mathbb{E}\big[(X^\top w^\star - Y)^2\big] \\
       &\geq R(w^\star).
\end{align*}
The second-to-last step above uses the fact that $w^\star$ is a solution to the population normal equations. Therefore, we conclude that $R(w) \geq R(w^\star)$ for all $w \in \mathbb{R}^d$ and all solutions $w^\star$ to the population normal equations. So the solutions to the population normal equations are the minimizers of $R$.
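When the distribution of $(X, Y)$ is known, $w^\star$ can be computed directly from the moments $\mathbb{E}[XX^\top]$ and $\mathbb{E}[YX]$. A sketch under the hypothetical simulation setup from earlier ($X \sim \mathrm{N}(0, I_d)$ and $Y \mid X \sim \mathrm{N}(X^\top w, \sigma^2)$, so $\mathbb{E}[XX^\top] = I_d$, $\mathbb{E}[YX] = w$, and $w^\star = w$), using large-sample Monte Carlo estimates of the moments:

```python
import numpy as np

rng = np.random.default_rng(1)
d, big_n = 3, 200_000
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical w, as before
sigma = 0.7

# Large sample used only to approximate the population moments.
X = rng.standard_normal((big_n, d))
Y = X @ w_true + sigma * rng.standard_normal(big_n)

second_moment = X.T @ X / big_n       # approximates E[X X^T] = I_d
cross_moment = X.T @ Y / big_n        # approximates E[Y X]   = w_true

# Population normal equations: E[X X^T] w = E[Y X].
w_star = np.linalg.solve(second_moment, cross_moment)
print(w_star)   # close to w_true, the minimizer of the risk
```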
The similarity to the previous section is no accident. The normal equations (based on $(X_1, Y_1), \dots, (X_n, Y_n)$) are precisely
\[
  \mathbb{E}[\tilde{X}\tilde{X}^\top]\, w = \mathbb{E}[\tilde{Y}\tilde{X}]
\]
for $(\tilde{X}, \tilde{Y}) \sim P_n$, where $P_n$ is the empirical distribution on $(X_1, Y_1), \dots, (X_n, Y_n)$. By the Law of Large Numbers, the left-hand side $\mathbb{E}[\tilde{X}\tilde{X}^\top]$ converges to $\mathbb{E}[XX^\top]$ and the right-hand side $\mathbb{E}[\tilde{Y}\tilde{X}]$ converges to $\mathbb{E}[YX]$ as $n \to \infty$. In other words, the normal equations converge to the population normal equations as $n \to \infty$. Thus, ERM can be regarded as a plug-in estimator for $w^\star$.
Using classical arguments from asymptotic statistics, one can prove that the distribution of $\sqrt{n}(\hat{w} - w^\star)$ converges (as $n \to \infty$) to a multivariate normal with mean zero and covariance $\mathbb{E}[XX^\top]^{-1} \operatorname{cov}(\varepsilon X)\, \mathbb{E}[XX^\top]^{-1}$, where $\varepsilon := Y - X^\top w^\star$. (This assumes, along with some standard moment conditions, that $\mathbb{E}[XX^\top]$ is invertible so that $w^\star$ is uniquely defined. But it does not require the conditional distribution of $Y \mid X$ to be normal.)
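This asymptotic statement can be examined by simulation. A sketch under the hypothetical Gaussian setup above, in which $\varepsilon$ is independent of $X$ and the sandwich covariance simplifies to $\sigma^2 \mathbb{E}[XX^\top]^{-1} = \sigma^2 I_d$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, reps = 3, 200, 2000
w_star = np.array([1.0, -2.0, 0.5])   # hypothetical w* (equals the model's w here)
sigma = 0.7

scaled_errors = np.empty((reps, d))
for r in range(reps):
    X = rng.standard_normal((n, d))
    Y = X @ w_star + sigma * rng.standard_normal(n)
    w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    scaled_errors[r] = np.sqrt(n) * (w_hat - w_star)

# The empirical covariance of sqrt(n) (w_hat - w*) across repetitions should be
# close to the sandwich covariance, which here is sigma^2 * I_d.
print(np.cov(scaled_errors, rowvar=False))
print(sigma ** 2 * np.eye(d))
```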
Geometric interpretation
Let $a_j \in \mathbb{R}^n$ be the vector in the $j$-th column of $A$, so
\[
  A = \begin{pmatrix}
    \uparrow & & \uparrow \\
    a_1 & \cdots & a_d \\
    \downarrow & & \downarrow
  \end{pmatrix}.
\]
Since $\operatorname{range}(A) = \{Aw : w \in \mathbb{R}^d\}$, minimizing $\|Aw - b\|^2$ is the same as finding the vector $\hat{b} \in \operatorname{range}(A)$ closest to $b$ (in Euclidean distance), and then specifying the linear combination of $a_1, \dots, a_d$ that is equal to $\hat{b}$, i.e., specifying $\hat{w} = (\hat{w}_1, \dots, \hat{w}_d)$ such that $\hat{w}_1 a_1 + \cdots + \hat{w}_d a_d = \hat{b}$. The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\operatorname{range}(A)$. This vector $\hat{b}$ is uniquely determined; however, the coefficients $\hat{w}$ are uniquely determined if and only if $a_1, \dots, a_d$ are linearly independent. The vectors $a_1, \dots, a_d$ are linearly independent exactly when the rank of $A$ is equal to $d$.
We conclude that the empirical risk has a unique minimizer exactly when A has rank d.
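A sketch of the geometric picture (Python with NumPy, made-up data): the fitted vector $\hat{b} = A\hat{w}$ is the orthogonal projection of $b$ onto $\operatorname{range}(A)$, so the residual $b - \hat{b}$ is orthogonal to every column of $A$; duplicating a column leaves $\operatorname{range}(A)$, and hence $\hat{b}$, unchanged, but makes the coefficients non-unique.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
A = rng.standard_normal((n, d)) / np.sqrt(n)
b = rng.standard_normal(n) / np.sqrt(n)

w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ w_hat                           # orthogonal projection of b onto range(A)
print(np.allclose(A.T @ (b - b_hat), 0))    # residual is orthogonal to each column a_j

# Duplicate a column: rank(A2) is less than its number of columns, so the
# coefficients are no longer unique, but the projection b_hat does not change.
A2 = np.column_stack([A, A[:, 0]])
w2, *_ = np.linalg.lstsq(A2, b, rcond=None)
print(np.linalg.matrix_rank(A2), A2.shape[1])   # rank d, but d + 1 columns
print(np.allclose(A2 @ w2, b_hat))
```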