Linear regression

Daniel Hsu (COMS 4771)

Maximum likelihood estimation
One of the simplest linear regression models is the following: $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$, and

$$Y \mid X = x \sim \mathrm{N}(x^{\mathsf{T}} w, \sigma^2), \quad x \in \mathbb{R}^d.$$

Here, the vector $w \in \mathbb{R}^d$ and scalar $\sigma^2 > 0$ are the parameters of the model. (The marginal distribution of $X$ is unspecified.)

The log-likelihood of $(w, \sigma^2)$ given $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \dots, n$ is

$$\sum_{i=1}^n \left\{ \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y_i - x_i^{\mathsf{T}} w)^2}{2\sigma^2} \right\} + T,$$

where $T$ is some quantity that does not depend on $(w, \sigma^2)$. Therefore, maximizing the log-likelihood over $w \in \mathbb{R}^d$ (for any $\sigma^2 > 0$) is the same as minimizing

$$\frac{1}{n} \sum_{i=1}^n (x_i^{\mathsf{T}} w - y_i)^2.$$

So, the maximum likelihood estimator (MLE) of $w$ in this model is

$$\hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \, \frac{1}{n} \sum_{i=1}^n (x_i^{\mathsf{T}} w - y_i)^2.$$

(It is not necessarily uniquely determined.)
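For concreteness, here is a minimal NumPy sketch of computing this MLE on synthetic data drawn from the model. The sample size, dimension, parameter values, and variable names are illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the model: Y | X = x ~ N(x'w, sigma^2).
n, d = 100, 3
w_true = np.array([1.0, -2.0, 0.5])          # illustrative choice of w
sigma = 0.3                                  # illustrative choice of sigma
X = rng.normal(size=(n, d))                  # rows are x_1, ..., x_n
y = X @ w_true + sigma * rng.normal(size=n)  # y_i = x_i'w + N(0, sigma^2) noise

# The MLE of w is a least-squares solution; lstsq returns one such
# solution even when it is not uniquely determined.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should be close to w_true
```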

Empirical risk minimization
Let $P_n$ be the empirical distribution on $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$, i.e., the probability distribution over $\mathbb{R}^d \times \mathbb{R}$ with probability mass function $p_n$ given by

$$p_n((x, y)) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{(x, y) = (x_i, y_i)\}, \quad (x, y) \in \mathbb{R}^d \times \mathbb{R}.$$

The distribution assigns probability mass $1/n$ to each $(x_i, y_i)$ for $i = 1, \dots, n$; no mass is assigned anywhere else. Now consider $(\tilde{X}, \tilde{Y}) \sim P_n$. The expected squared loss of the linear function $w \in \mathbb{R}^d$ on $(\tilde{X}, \tilde{Y})$ is

$$\hat{R}(w) := \mathbb{E}[(\tilde{X}^{\mathsf{T}} w - \tilde{Y})^2] = \frac{1}{n} \sum_{i=1}^n (x_i^{\mathsf{T}} w - y_i)^2;$$

we call this the empirical risk of $w$ on the data $(x_1, y_1), \dots, (x_n, y_n)$.


Empirical risk minimization is the method of choosing a function (from some class of functions) by minimizing the empirical risk on the data. In the case of linear functions, the empirical risk minimizer (ERM) is

$$\hat{w} \in \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \hat{R}(w) = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \, \frac{1}{n} \sum_{i=1}^n (x_i^{\mathsf{T}} w - y_i)^2.$$

This is the same as the MLE from above. (It is not necessarily uniquely determined.)
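As a quick sanity check, the empirical risk can be computed directly, and any least-squares solution should minimize it. A small sketch, again with synthetic data and illustrative names:

```python
import numpy as np

def empirical_risk(w, X, y):
    """Average squared loss of the linear function w on the data (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=50)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# Perturbing the minimizer can only increase the empirical risk.
assert empirical_risk(w_hat, X, y) <= empirical_risk(w_hat + 0.1, X, y)
```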

Normal equations
Let

$$A := \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow x_1^{\mathsf{T}} \rightarrow \\ \vdots \\ \leftarrow x_n^{\mathsf{T}} \rightarrow \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$$

We can write the empirical risk as

$$\hat{R}(w) = \|Aw - b\|_2^2, \quad w \in \mathbb{R}^d.$$

The gradient of $\hat{R}$ is given by

$$\nabla \hat{R}(w) = \nabla\{(Aw - b)^{\mathsf{T}}(Aw - b)\} = 2A^{\mathsf{T}}(Aw - b), \quad w \in \mathbb{R}^d;$$

it is equal to zero for $w \in \mathbb{R}^d$ satisfying

$$A^{\mathsf{T}} A w = A^{\mathsf{T}} b.$$

These linear equations in $w$, which define the critical points of $\hat{R}$, are collectively called the normal equations.
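A short sketch of this, assuming NumPy and synthetic data: form $A$ and $b$ as above, solve the normal equations, and check that the gradient vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.2 * rng.normal(size=n)

# A and b as in the text, scaled by 1/sqrt(n) so that ||Aw - b||^2
# equals the empirical risk.
A = X / np.sqrt(n)
b = y / np.sqrt(n)

# Solve A'Aw = A'b (here A'A is invertible with probability 1).
w_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient 2A'(Aw - b) should vanish at any solution.
grad = 2 * A.T @ (A @ w_hat - b)
print(np.max(np.abs(grad)))  # ~0, up to floating-point error
```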

It turns out the normal equations in fact determine the minimizers of $\hat{R}$. To see this, let $\hat{w}$ be any solution to the normal equations. Now consider any other $w \in \mathbb{R}^d$. We write the empirical risk of $w$ as follows:

$$\begin{aligned}
\hat{R}(w) &= \|Aw - b\|_2^2 \\
&= \|A(w - \hat{w}) + A\hat{w} - b\|_2^2 \\
&= \|A(w - \hat{w})\|_2^2 + 2(A(w - \hat{w}))^{\mathsf{T}}(A\hat{w} - b) + \|A\hat{w} - b\|_2^2 \\
&= \|A(w - \hat{w})\|_2^2 + 2(w - \hat{w})^{\mathsf{T}}(A^{\mathsf{T}} A \hat{w} - A^{\mathsf{T}} b) + \|A\hat{w} - b\|_2^2 \\
&= \|A(w - \hat{w})\|_2^2 + \|A\hat{w} - b\|_2^2 \\
&\ge \hat{R}(\hat{w}).
\end{aligned}$$

The second-to-last step above uses the fact that $\hat{w}$ is a solution to the normal equations. Therefore, we conclude that $\hat{R}(w) \ge \hat{R}(\hat{w})$ for all $w \in \mathbb{R}^d$ and all solutions $\hat{w}$ to the normal equations. So the solutions to the normal equations are the minimizers of $\hat{R}$.

Statistical interpretation
Suppose $(X_1, Y_1), \dots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$. The risk of a linear function $w \in \mathbb{R}^d$ is

$$R(w) := \mathbb{E}[(X^{\mathsf{T}} w - Y)^2].$$

Which linear functions have smallest risk?

The gradient of $R$ is given by

$$\nabla R(w) = \mathbb{E}\left[\nabla\{(X^{\mathsf{T}} w - Y)^2\}\right] = 2\,\mathbb{E}\left[X(X^{\mathsf{T}} w - Y)\right], \quad w \in \mathbb{R}^d;$$


it is equal to zero for $w \in \mathbb{R}^d$ satisfying

$$\mathbb{E}[XX^{\mathsf{T}}]\, w = \mathbb{E}[YX].$$

These linear equations in $w$, which define the critical points of $R$, are collectively called the population normal equations.

It turns out the population normal equations in fact determine the minimizers of $R$. To see this, let $w_\star$ be any solution to the population normal equations. Now consider any other $w \in \mathbb{R}^d$. We write the risk of $w$ as follows:

$$\begin{aligned}
R(w) &= \mathbb{E}[(X^{\mathsf{T}} w - Y)^2] \\
&= \mathbb{E}[(X^{\mathsf{T}}(w - w_\star) + X^{\mathsf{T}} w_\star - Y)^2] \\
&= \mathbb{E}[(X^{\mathsf{T}}(w - w_\star))^2 + 2(X^{\mathsf{T}}(w - w_\star))(X^{\mathsf{T}} w_\star - Y) + (X^{\mathsf{T}} w_\star - Y)^2] \\
&= \mathbb{E}[(X^{\mathsf{T}}(w - w_\star))^2] + 2(w - w_\star)^{\mathsf{T}}\left(\mathbb{E}[XX^{\mathsf{T}}]\, w_\star - \mathbb{E}[YX]\right) + \mathbb{E}[(X^{\mathsf{T}} w_\star - Y)^2] \\
&= \mathbb{E}[(X^{\mathsf{T}}(w - w_\star))^2] + \mathbb{E}[(X^{\mathsf{T}} w_\star - Y)^2] \\
&\ge R(w_\star).
\end{aligned}$$

The second-to-last step above uses the fact that $w_\star$ is a solution to the population normal equations. Therefore, we conclude that $R(w) \ge R(w_\star)$ for all $w \in \mathbb{R}^d$ and all solutions $w_\star$ to the population normal equations. So the solutions to the population normal equations are the minimizers of $R$.

The similarity to the previous section is no accident. The normal equations (based on $(X_1, Y_1), \dots, (X_n, Y_n)$) are precisely

$$\mathbb{E}[\tilde{X}\tilde{X}^{\mathsf{T}}]\, w = \mathbb{E}[\tilde{Y}\tilde{X}]$$

for $(\tilde{X}, \tilde{Y}) \sim P_n$, where $P_n$ is the empirical distribution on $(X_1, Y_1), \dots, (X_n, Y_n)$. By the Law of Large Numbers, the left-hand side $\mathbb{E}[\tilde{X}\tilde{X}^{\mathsf{T}}]$ converges to $\mathbb{E}[XX^{\mathsf{T}}]$ and the right-hand side $\mathbb{E}[\tilde{Y}\tilde{X}]$ converges to $\mathbb{E}[YX]$ as $n \to \infty$. In other words, the normal equations converge to the population normal equations as $n \to \infty$. Thus, ERM can be regarded as a plug-in estimator for $w_\star$.
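The plug-in view can be illustrated numerically. The sketch below uses a synthetic distribution in which $w_\star$ is known by construction ($X$ standard normal, noise independent of $X$); the solution of the empirical normal equations approaches $w_\star$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
w_star = np.array([1.0, 0.0, -1.0])  # illustrative choice

def erm(n):
    # Draw n iid pairs with X ~ N(0, I) and Y = X'w_star + noise, so that
    # w_star solves the population normal equations E[XX']w = E[YX].
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(size=n)
    # ERM = solution of the empirical normal equations.
    return np.linalg.solve(X.T @ X / n, X.T @ y / n)

for n in [10, 100, 10_000]:
    print(n, np.linalg.norm(erm(n) - w_star))  # error shrinks as n grows
```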

Using classical arguments from asymptotic statistics, one can prove that the distribution of

$$\sqrt{n}\,(\hat{w} - w_\star)$$

converges (as $n \to \infty$) to a multivariate normal with mean zero and covariance $\mathbb{E}[XX^{\mathsf{T}}]^{-1} \operatorname{cov}(\varepsilon X)\, \mathbb{E}[XX^{\mathsf{T}}]^{-1}$, where $\varepsilon := Y - X^{\mathsf{T}} w_\star$. (This assumes, along with some standard moment conditions, that $\mathbb{E}[XX^{\mathsf{T}}]$ is invertible so that $w_\star$ is uniquely defined. But it does not require the conditional distribution of $Y \mid X$ to be normal.)
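This asymptotic covariance can itself be estimated by plugging in empirical moments and residuals. The following is a rough sketch of such a "sandwich" plug-in estimate on synthetic data (with deliberately non-normal noise, to emphasize that normality of $Y \mid X$ is not required); it is illustrative, not a recipe for inference.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5_000, 2
w_star = np.array([0.5, -0.5])
X = rng.normal(size=(n, d))
y = X @ w_star + rng.standard_t(df=5, size=n)  # heavy-tailed, non-normal noise

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
eps = y - X @ w_hat                  # residuals, estimating epsilon

S = X.T @ X / n                      # plug-in for E[XX']
E = X * eps[:, None]                 # rows are eps_i * x_i
M = E.T @ E / n                      # plug-in for cov(eps X), since E[eps X] = 0
sandwich = np.linalg.inv(S) @ M @ np.linalg.inv(S)
print(sandwich)  # estimates the asymptotic covariance of sqrt(n) (w_hat - w_star)
```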

Geometric interpretation
Let $a_j \in \mathbb{R}^n$ be the vector in the $j$-th column of $A$, so

$$A = \begin{bmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{bmatrix}.$$

Since $\operatorname{range}(A) = \{Aw : w \in \mathbb{R}^d\}$, minimizing $\|Aw - b\|_2^2$ is the same as finding the vector $\hat{b} \in \operatorname{range}(A)$ closest to $b$ (in Euclidean distance), and then specifying the linear combination of $a_1, \dots, a_d$ that is equal to $\hat{b}$, i.e., specifying $\hat{w} = (\hat{w}_1, \dots, \hat{w}_d)$ such that $\hat{w}_1 a_1 + \cdots + \hat{w}_d a_d = \hat{b}$. The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\operatorname{range}(A)$. This vector $\hat{b}$ is uniquely determined; however, the coefficients $\hat{w}$ are uniquely determined if and only if $a_1, \dots, a_d$ are linearly independent. The vectors $a_1, \dots, a_d$ are linearly independent exactly when the rank of $A$ is equal to $d$.

We conclude that the empirical risk has a unique minimizer exactly when $A$ has rank $d$.
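The projection picture can be checked numerically: with random synthetic $A$ and $b$, the sketch below computes $\hat{b} = A\hat{w}$ and verifies that the residual $b - \hat{b}$ is orthogonal to every column of $A$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 3
A = rng.normal(size=(n, d))   # has rank d with probability 1
b = rng.normal(size=n)

w_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_proj = A @ w_hat            # orthogonal projection of b onto range(A)

# The residual is orthogonal to each column a_1, ..., a_d of A.
print(np.max(np.abs(A.T @ (b - b_proj))))  # ~0, up to floating-point error
```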
