Introduction
I In the last lecture, we reviewed linear regression and discussed model selection techniques.
I The idea was to find a "minimal" model that robustly predicts out of sample.
I We effectively tried out a lot of different models.
I In this section, we discuss methods for removing, or reducing the influence of, predictors that do not explain much of the variation in y, by "shrinking" their coefficients towards zero.
Ridge Regression
Ridge regression
I The reason for high variance in the test error may be the large variance of some coefficient estimates.
I We can "shrink" them towards zero, which can dramatically improve the performance for prediction.
I We first discuss ridge regression. The objective here is to find a vector $\beta$ such that the following expression is minimized:
\[
\hat{\beta}^{R} = \arg\min_{\beta} \; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{Residual Sum of Squares}} \;+\; \underbrace{\lambda \sum_{j=1}^{p}\beta_j^{2}}_{\text{Penalty Term}}
\]
Ridge regression [a bit mathy]
\[
\hat{\beta}^{R} = \arg\min_{\beta} \; \underbrace{\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{Residual Sum of Squares}} \;+\; \underbrace{\lambda \sum_{j=1}^{p}\beta_j^{2}}_{\text{Penalty Term}}
\]
(here the intercept is absorbed into $X$, so it is not written out separately)
We can rewrite the objective function in matrix notation as:
\[
\hat{\beta}^{R} = \arg\min_{\beta} \; (y - X\beta)'(y - X\beta) + \lambda \beta'\beta
\]
I This is very close to the objective function for plain OLS.
I The First Order Conditions are:
\[
-2X'(y - X\beta) + 2\lambda\beta
\]
I Setting this equal to zero and rearranging:
\[
X'y = (\lambda I + X'X)\beta
\]
I From this follows directly:
\[
\hat{\beta}^{R} = (\lambda I + X'X)^{-1}X'y
\]
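As a quick illustration (not the lecture's code), the closed-form solution above can be checked on simulated data; the variable names and data below are made up, and the predictors and outcome are mean zero so the intercept can be ignored.

```python
# Minimal sketch of the closed-form ridge estimator beta_R = (lambda*I + X'X)^{-1} X'y
# on simulated, mean-zero data (so no intercept is needed).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    """Solve (lam*I + X'X) beta = X'y; lam = 0 reproduces OLS."""
    p = X.shape[1]
    return np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ y)

print(ridge_closed_form(X, y, lam=0.0))   # OLS coefficients
print(ridge_closed_form(X, y, lam=10.0))  # shrunken ridge coefficients
```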
Ridge regression
I You can see directly that if $\lambda = 0$, then the solution is just OLS.
I As $\lambda \to \infty$, $\hat{\beta} \to 0$.
I For every value of $\lambda$ you get a different estimated coefficient vector (a small numerical check of these limits is sketched below).
I Note that we do not penalize the intercept $\beta_0$, which gives the mean value of the dependent variable $y$ when all the $x_{ij}$ are zero.
I But how to choose $\lambda$? We will explore this further below.
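A hedged sketch of the two limits mentioned above, using simulated data and scikit-learn's Ridge (whose alpha argument plays the role of $\lambda$ and which, by default, leaves the intercept unpenalized):

```python
# Check: alpha = 0 reproduces OLS, and the coefficients shrink towards zero as alpha grows.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 1.0 + X @ np.array([3.0, -2.0, 0.0, 0.5]) + rng.normal(size=300)

print("OLS :", np.round(LinearRegression().fit(X, y).coef_, 3))
for lam in [0.0, 1.0, 100.0, 1e5]:
    b = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lam={lam:>8}:", np.round(b, 3), " ||beta||_2 =", round(np.linalg.norm(b), 3))
```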
Another way to write down the Ridge regression problem
This problem
\[
\arg\min_{\beta} \; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{Residual Sum of Squares}} + \underbrace{\lambda \sum_{j=1}^{p}\beta_j^{2}}_{\text{Penalty Term}} \tag{1}
\]
is the same as
\[
\min_{\beta} \; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{Residual Sum of Squares}} \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^{2} \le s \tag{2}
\]
I This is a constrained optimization problem, which you should know from Microeconomics (maximize utility subject to a budget constraint); a small numerical check of the equivalence is sketched below.
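A hedged numerical sketch of this equivalence (not from the lecture): solve the penalized problem (1) for some $\lambda$, set the budget $s$ equal to the squared norm of that solution, and check that the constrained problem (2) returns (almost) the same coefficients. The data are simulated and the intercept is dropped for simplicity.

```python
# Penalized form (1) vs constrained form (2): for a given lambda, the constrained
# problem with budget s = ||beta_lambda||^2 has (approximately) the same solution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=150)

lam = 5.0
rss = lambda b: np.sum((y - X @ b) ** 2)

# Penalized problem (1)
beta_pen = minimize(lambda b: rss(b) + lam * np.sum(b ** 2), x0=np.zeros(3)).x

# Constrained problem (2) with s = ||beta_pen||^2
s = np.sum(beta_pen ** 2)
beta_con = minimize(rss, x0=np.zeros(3), method="SLSQP",
                    constraints=[{"type": "ineq", "fun": lambda b: s - np.sum(b ** 2)}]).x

print(np.round(beta_pen, 4))
print(np.round(beta_con, 4))  # should (nearly) coincide with the line above
```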
Some graphical intuition for p = 2
Suppose you only have p = 2, i.e. there are two features:
\[
\min_{\beta} \; \underbrace{\sum_{i=1}^{n}\big(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\big)^{2}}_{\text{Residual Sum of Squares}} \quad \text{subject to} \quad \beta_1^{2} + \beta_2^{2} \le s \tag{3}
\]
What shape do the objective function and the constraint have?
Some graphical intuition for p = 2
[Figures not reproduced: contour plots of the Residual Sum of Squares together with the circular constraint region $\beta_1^2 + \beta_2^2 \le s$.]
Ridge regression is not scale invariant
I Remember: the standard OLS coefficient estimates are scale equivariant: multiplying $x_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimate by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $x_j \hat{\beta}_j$ will remain the same.
I For ridge regression, this is not the case, since the $\beta_j$'s enter the penalty term directly (a small demonstration is sketched below).
I You do not want to penalize coefficients that are large merely because of the scaling of the underlying $x$ variable.
I We need to make the different $\beta_j$'s comparable, which we can do easily by standardizing the predictors.
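A small, hedged demonstration of this point on simulated data (the constant c = 1000 and the variable names are arbitrary): after rescaling one predictor, the OLS coefficient is exactly divided by c, while the ridge coefficients change in a non-trivial way because the rescaled coefficient enters the penalty.

```python
# OLS is scale equivariant; ridge (here via scikit-learn, alpha = lambda = 10) is not.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)

X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0   # e.g. measuring the first predictor in different units

for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    b_orig = model.fit(X, y).coef_
    b_resc = model.fit(X_scaled, y).coef_
    print(name, np.round(b_orig, 4), np.round(b_resc, 6))
# For OLS: b_resc[0] == b_orig[0] / 1000 and the other coefficients are unchanged.
# For ridge this simple relationship breaks down.
```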
Ridge regression is not scale invariant
I We can scale each $x_j$ by its standard deviation; then all the $\tilde{x}_{ij}$ values are expressed in terms of standard deviations.
I I.e. the common unit is a standard deviation.
I Divide each value by an estimate of the standard deviation:
\[
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^{2}}}
\]
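A minimal sketch of this standardization on simulated data; the division is spelled out here to match the formula (scikit-learn's StandardScaler with with_mean=False should give an equivalent rescaling).

```python
# Standardize each column of X by its (population) standard deviation so that
# every predictor is measured in standard-deviation units before the ridge fit.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01])   # very different scales

sd = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))      # sqrt of (1/n) sum_i (x_ij - xbar_j)^2
X_tilde = X / sd

print(X_tilde.std(axis=0))   # each column now has standard deviation 1
```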
Shrinking hedonic pricing coefficients?
Let's plot some of the coefficients as they evolve with increasing $\lambda$.
[Figure: ridge coefficient paths against $\log(\lambda)$ for baths, bedrooms, builtyryr, distgas2009, distgas2010, lat]
[Figure: ridge coefficient paths against $\log(\lambda)$ for baths, bedrooms, distgas2009, distgas2010, sqft]
The vertical axis presents the estimated coefficients $\hat{\beta}_j$, while the horizontal axis plots $\log(\lambda)$. A large value on the horizontal axis means a high value of $\lambda$, which implies significant shrinking of the coefficients. Due to the very large values that $\lambda$ takes in this example, I used a logarithmic transformation to condense the picture.
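A hedged sketch of how such a coefficient-path plot could be produced. The hedonic data are not reproduced here, so the feature names below are placeholders and the data are simulated; the lecture's actual figures may have been generated differently.

```python
# Ridge coefficient paths against log(lambda), in the spirit of the plot described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
feature_names = ["baths", "bedrooms", "sqft", "distgas2009", "distgas2010"]  # placeholders
X = rng.normal(size=(500, len(feature_names)))
y = X @ rng.normal(size=len(feature_names)) + rng.normal(size=500)

lambdas = np.logspace(-2, 6, 100)
paths = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

for j, name in enumerate(feature_names):
    plt.plot(np.log(lambdas), paths[:, j], label=name)
plt.xlabel("log(lambda)")
plt.ylabel("estimated coefficient")
plt.legend()
plt.show()
```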
Shrinking hedonic pricing coe cients?
An alternative way to plot these is to express the penalty term relative to the benchmark OLS, i.e. to normalize the horizontal axis to range from 0 to 1, where 1 corresponds to OLS:
\[
\frac{\lVert \hat{\beta}^{R}_{\lambda} \rVert_2}{\lVert \hat{\beta}^{OLS} \rVert_2},
\qquad \text{where} \qquad
\lVert \beta \rVert_2 = \sqrt{\sum_{j=1}^{p} \beta_j^{2}}
\]
This is called the $\ell_2$ norm, which measures the Euclidean distance from the origin. For $p = 2$, this is $\lVert \beta \rVert_2 = \sqrt{\beta_1^{2} + \beta_2^{2}}$.
So when $\lambda \to 0$, $\frac{\lVert \hat{\beta}^{R}_{\lambda} \rVert_2}{\lVert \hat{\beta}^{OLS} \rVert_2} \to 1$.
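A small sketch (on simulated data, not the lecture's hedonic dataset) of the ratio used as the horizontal axis in the next plots, i.e. the quantity labelled "l2betaratio" below:

```python
# ||beta_lambda||_2 / ||beta_OLS||_2: equal to 1 at lambda = 0, shrinking towards 0 as lambda grows.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=300)

ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    ratio = np.linalg.norm(Ridge(alpha=lam).fit(X, y).coef_) / ols_norm
    print(lam, round(ratio, 3))   # the "l2betaratio"
```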
Shrinking hedonic pricing coefficients?
An alternative way to plot these is to express the penalty term relative to the benchmark OLS, i.e. to compute the ratio above.
[Figure: ridge coefficient paths against $\lVert \hat{\beta}^{R}_{\lambda} \rVert_2 / \lVert \hat{\beta}^{OLS} \rVert_2$ (labelled "l2betaratio", ranging from 0 to 1) for baths, bedrooms, builtyryr, distgas2009, distgas2010, lat]
[Figure: the same plot for baths, bedrooms, distgas2009, distgas2010, sqft]
The vertical axis presents the estimated coefficients $\hat{\beta}_j$, while the horizontal axis plots $\frac{\lVert \hat{\beta}^{R}_{\lambda} \rVert_2}{\lVert \hat{\beta}^{OLS} \rVert_2}$. The closer the value is to 1 on the horizontal axis, the closer we get to the OLS solution.
What do we gain using Ridge regression?
I This course revolves around the bias-variance tradeoff.
I Ridge regression, by shrinking coefficients, helps reduce the variance of the prediction and thus reduces prediction error (a sketch of choosing $\lambda$ by cross-validation follows below).
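One common way to pick $\lambda$ in practice, hinted at earlier, is cross-validation. This is a hedged sketch using scikit-learn's RidgeCV on simulated data, not the lecture's own procedure for the hedonic example.

```python
# Choose lambda (alpha) over a grid by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = X[:, :3] @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=200)   # 7 irrelevant predictors

model = RidgeCV(alphas=np.logspace(-3, 4, 50), cv=5).fit(X, y)
print("chosen lambda:", model.alpha_)
print("coefficients :", np.round(model.coef_, 3))
```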
What do we dislike about Ridge regression?
I As opposed to subset selection, ridge regression does not actually result in simpler models.
I Some estimated coefficients are simply reduced in absolute value, but we do not get rid of irrelevant predictors completely, as we did with subset selection.
I Ideally, we want to allow coefficients to be exactly equal to zero.
I This can be achieved, if we make the optimization problem slightly more complicated.