
Introduction
▶ In the last lecture, we reviewed linear regression and discussed model selection techniques.
▶ The idea was to find a “minimal” model that robustly predicts out of sample.
▶ We effectively tried out a lot of different models.

▶ In this section, we discuss methods that get rid of, or reduce the influence of, predictors that do not explain much of the variation in y, by “shrinking” their coefficients towards zero.

Ridge Regression

Ridge regression
▶ The reason for high variance in the test error may be the large variance of some coefficient estimates.
▶ We can “shrink” them towards zero, which can dramatically improve the performance for prediction.
▶ We first discuss ridge regression. The objective is to find a vector β such that the following expression is minimized:
\[
\arg\min_{\beta}\; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2}_{\text{Residual Sum of Squares}} \;+\; \underbrace{\lambda\sum_{j=1}^{p}\beta_j^2}_{\text{Penalty Term}}
\]
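A minimal numerical sketch of the objective just defined: it evaluates RSS + λ Σ_j β_j² for a candidate coefficient vector. The simulated data, the candidate β, and the chosen λ are illustrative assumptions, not taken from the lecture.

import numpy as np

# Simulated data, purely for illustration.
rng = np.random.default_rng(42)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)

def ridge_objective(beta0, beta, lam):
    """RSS plus the ridge penalty; note that beta0 (the intercept) is not penalized."""
    residuals = y - beta0 - X @ beta            # y_i - beta_0 - sum_j x_ij * beta_j
    return np.sum(residuals**2) + lam * np.sum(beta**2)

print(ridge_objective(0.0, np.array([1.0, 0.0, -2.0]), lam=2.0))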

Ridge regression [a bit mathy]
\[
\arg\min_{\beta}\; \underbrace{\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2}_{\text{Residual Sum of Squares}} \;+\; \underbrace{\lambda\sum_{j=1}^{p}\beta_j^2}_{\text{Penalty Term}}
\]
We can rewrite the objective function in matrix notation as:
\[
\arg\min_{\beta}\; (y - X\beta)'(y - X\beta) + \lambda\beta'\beta
\]
▶ This is very close to the objective function for plain OLS.
▶ The first order conditions are:
\[
-2X'(y - X\beta) + 2\lambda\beta
\]
▶ Setting this equal to zero gives
\[
X'y = (\lambda I + X'X)\beta
\]
▶ From this it follows directly that
\[
\hat{\beta}^{R} = (\lambda I + X'X)^{-1}X'y
\]
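As a sanity check on the closed-form expression above, here is a short sketch that computes (λI + X'X)⁻¹X'y on simulated data and compares it with OLS (the λ = 0 case). The data and the value of λ are illustrative assumptions.

import numpy as np

# Simulated data, purely for illustration; no intercept is included for simplicity.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(size=n)

lam = 10.0
beta_ridge = np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ y)  # (lambda I + X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                      # the lambda = 0 case

print("OLS:  ", beta_ols.round(3))
print("Ridge:", beta_ridge.round(3))   # every entry is pulled towards zero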

Ridge regression
▶ You can see directly that if λ = 0, then the solution is just OLS.
▶ As λ → ∞, the estimated coefficients shrink towards 0.
▶ For every value of λ you get a different estimated coefficient vector (see the sketch below).
▶ Note that we do not have a penalty for the intercept β_0, which provides the mean value of the dependent variable y when all the x_i are zero.
▶ But how do we choose λ? We will explore this further below.
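To see these bullets numerically, the sketch below (again on simulated, illustrative data) computes the ridge estimates over a grid of λ values; each coefficient approaches its OLS value as λ → 0 and shrinks towards zero as λ grows.

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

lambdas = np.logspace(-2, 4, 7)   # from ~0 (OLS-like) to very heavy shrinkage
paths = np.array([np.linalg.solve(l * np.eye(p) + X.T @ X, X.T @ y) for l in lambdas])

# paths[k, j] is the estimate of beta_j at lambda = lambdas[k]
for l, b in zip(lambdas, paths):
    print(f"lambda = {l:10.2f}  beta = {b.round(3)}")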

Another way to write down the Ridge regression problem
This problem
\[
\arg\min_{\beta}\; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2}_{\text{Residual Sum of Squares}} \;+\; \underbrace{\lambda\sum_{j=1}^{p}\beta_j^2}_{\text{Penalty Term}} \tag{1}
\]
is the same as
\[
\min_{\beta}\; \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2}_{\text{Residual Sum of Squares}} \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \le s \tag{2}
\]
▶ This is a constrained optimization problem, which you should know from Microeconomics (maximize utility subject to a budget constraint); a numerical check of the equivalence is sketched below.
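The equivalence can be illustrated numerically: for a given λ, compute the ridge solution, set s equal to its squared ℓ2 norm, and check that minimizing the RSS subject to Σ_j β_j² ≤ s recovers essentially the same coefficients. The sketch below uses simulated data and a generic constrained optimizer; everything about the data and the chosen λ is an illustrative assumption.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.0]) + rng.normal(size=n)

lam = 5.0
beta_ridge = np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ y)  # penalized solution
s = np.sum(beta_ridge**2)                                         # budget implied by this lambda

rss = lambda b: np.sum((y - X @ b) ** 2)
constraint = {"type": "ineq", "fun": lambda b: s - np.sum(b**2)}   # sum_j beta_j^2 <= s
res = minimize(rss, x0=np.zeros(p), constraints=[constraint])

print(beta_ridge.round(4))
print(res.x.round(4))   # matches the penalized solution up to solver tolerance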

Some graphical intuition for p = 2
Suppose you only have p = 2, i.e. there are two features:
\[
\min_{\beta}\; \underbrace{\sum_{i=1}^{n}\big(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2}\big)^2}_{\text{Residual Sum of Squares}} \quad \text{subject to} \quad \beta_1^2 + \beta_2^2 \le s \tag{3}
\]
What shape do the objective function and the constraint have?

Some graphical intuition for p = 2

[Figure slides: contours of the residual sum of squares plotted together with the circular constraint region β_1^2 + β_2^2 ≤ s from (3).]

Ridge regression is not scale invariant
▶ Remember: the standard OLS coefficient estimates are scale equivariant: multiplying x_j by a constant c simply leads to a scaling of the least squares coefficient estimate by a factor of 1/c. In other words, regardless of how the jth predictor is scaled, X_j β̂_j will remain the same.
▶ For ridge regression this is not the case, since the β's enter the penalty term directly.
▶ You do not want to penalize coefficients that are large merely because of the scaling of the underlying x variable.
▶ We need to make the different β_j's comparable, which we can do easily by standardizing the predictors.

Ridge regression is not scale invariant
▶ We can scale x_j by its standard deviation; then all the x_ij values are expressed in terms of standard deviations.
▶ i.e. the common unit is one standard deviation.
▶ Divide each value by an estimate of the standard deviation:
\[
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}}
\]
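A small sketch of this standardization step, using made-up sqft and baths columns (the names echo the hedonic pricing example below, but the numbers are simulated assumptions): divide each column by its standard deviation so all predictors share a common unit.

import numpy as np

rng = np.random.default_rng(4)
n = 100
sqft = rng.normal(1500.0, 400.0, size=n)            # measured in square feet (large scale)
baths = rng.integers(1, 4, size=n).astype(float)    # a count (small scale)
X = np.column_stack([sqft, baths])

# x_tilde_ij = x_ij / sqrt( (1/n) * sum_i (x_ij - xbar_j)^2 )
X_std = X / X.std(axis=0, ddof=0)

print(X.std(axis=0).round(2))      # very different scales before
print(X_std.std(axis=0).round(2))  # both columns now have standard deviation 1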

Shrinking hedonic pricing coecients?
Let's plot some of the coefficients as they evolve with increasing λ.

[Figure: ridge coefficient paths for baths, bedrooms, builtyryr, distgas2009, distgas2010 and lat.]

The vertical axis presents the estimated coefficients β_j, while the horizontal axis plots log(λ). A large value means a high value of λ, which implies significant shrinking of the coefficients. Due to the very large values that λ takes in this example, I used a logarithmic transformation to condense the picture.

Shrinking hedonic pricing coecients?
Let's plot some of the coefficients as they evolve with increasing λ.

[Figure: ridge coefficient paths for baths, bedrooms, distgas2009, distgas2010 and sqft.]

The vertical axis presents the estimated coefficients β_j, while the horizontal axis plots log(λ). A large value means a high value of λ, which implies significant shrinking of the coefficients. Due to the very large values that λ takes in this example, I used a logarithmic transformation to condense the picture.

Shrinking hedonic pricing coecients?
An alternative way to plot these is to express the penalty term relative to the benchmark OLS, i.e. to normalize the horizontal axis to range from 0 to 1, where 1 corresponds to OLS:
\[
\frac{\|\hat{\beta}_\lambda\|_2}{\|\hat{\beta}^{OLS}\|_2}, \qquad \text{where} \quad \|\beta\|_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2}
\]
This is called the ℓ2 norm, which measures the Euclidean distance from the origin. For p = 2, this is \|\beta\|_2 = \sqrt{\beta_1^2 + \beta_2^2}.
So when λ → 0, \|\hat{\beta}_\lambda\|_2 / \|\hat{\beta}^{OLS}\|_2 → 1.
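The quantity on the horizontal axis of the next plots can be computed directly. A small sketch on simulated, illustrative data: the ratio equals 1 at λ = 0 and falls towards 0 as λ grows.

import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

for lam in [0.0, 1.0, 100.0, 10000.0]:
    beta_l = np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ y)
    ratio = np.linalg.norm(beta_l) / np.linalg.norm(beta_ols)   # ||beta_lambda||_2 / ||beta_OLS||_2
    print(f"lambda = {lam:8.1f}  l2 ratio = {ratio:.3f}")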

Shrinking hedonic pricing coecients?
An alternative way to plot these is to express the penalty term relative to the benchmark OLS, i.e. to compute the ratio \|\hat{\beta}_\lambda\|_2 / \|\hat{\beta}^{OLS}\|_2 defined above.

[Figure: ridge coefficient paths for baths, bedrooms, builtyryr, distgas2009, distgas2010 and lat, plotted against the l2betaratio on the horizontal axis (ranging from 0 to 1).]

The vertical axis presents the estimated coefficients β_j, while the horizontal axis plots \|\hat{\beta}_\lambda\|_2 / \|\hat{\beta}^{OLS}\|_2. The closer the value is to 1 on the horizontal axis, the closer we get to the OLS solution.

Shrinking hedonic pricing coecients?
An alternative way to plot these is to express the penalty term relative to the benchmark OLS, i.e. to compute the ratio \|\hat{\beta}_\lambda\|_2 / \|\hat{\beta}^{OLS}\|_2 defined above.

[Figure: ridge coefficient paths for baths, bedrooms, distgas2009, distgas2010 and sqft, plotted against the l2betaratio on the horizontal axis (ranging from 0 to 1).]

The vertical axis presents the estimated coefficients β_j, while the horizontal axis plots \|\hat{\beta}_\lambda\|_2 / \|\hat{\beta}^{OLS}\|_2. The closer the value is to 1 on the horizontal axis, the closer we get to the OLS solution.

What do we gain using Ridge regression?
▶ This course revolves around the bias-variance tradeoff.
▶ Ridge regression, by shrinking coefficients, helps reduce the variance of the prediction and thus reduces prediction error.

What do we dislike about Ridge regression?
▶ As opposed to subset selection, ridge regression does not actually result in simpler models.
▶ Some estimated coefficients are simply reduced in absolute value, but we do not get rid of irrelevant predictors completely, as we did with subset selection.
▶ Ideally, we want to allow coefficients to be exactly equal to zero.
▶ This can be achieved if we make the optimization problem slightly more complicated.
