Lecture 5: Linear regression (1) CS 189 (CDSS offering)
2022/01/28
Today’s lecture
• Last lecture, we ended by relating least squares linear regression to MLE
• Today, we will continue our discussion of linear regression, starting by recapping our solution from last time and examining it in greater detail
• We will see how different assumptions about the underlying data-generating process lead to different MLE formulations of linear regression
• Time permitting, we will also briefly take a geometric view of linear regression
From MLE to least squares linear regression
add a 1 to the end of each input x to fold the bias b into w, then let's do some linear algebra on the MLE objective for this setup:

$\arg\max_w \sum_i \log \mathcal{N}(y_i;\, w^\top x_i,\, \sigma^2) \;=\; \arg\min_w \sum_i (w^\top x_i - y_i)^2$

this looks like an ℓ2 norm squared, $\|v\|_2^2 = \sum_i v_i^2$, but the i-th element of what vector?

stack the $x_i^\top$ as the rows of X and the $y_i$ into the vector y; then the i-th element of $Xw - y$ is exactly $w^\top x_i - y_i$, so the objective becomes

$\arg\min_w \|Xw - y\|_2^2$
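(Not from the slides.) A quick numerical sanity check, assuming numpy and scipy are available, that the Gaussian negative log-likelihood and the squared error differ only by terms that do not depend on w, which is why the arg max and arg min above coincide:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 1.5
y = rng.normal(size=10)
pred = rng.normal(size=10)  # stand-in for the predictions w^T x_i

# negative log-likelihood under N(pred, sigma^2) and the plain squared error
nll = -norm.logpdf(y, loc=pred, scale=sigma).sum()
sse = ((pred - y) ** 2).sum()

# nll = sse / (2 sigma^2) + n * log(sigma * sqrt(2 pi)); the extra term is constant in w
const = len(y) * np.log(sigma * np.sqrt(2 * np.pi))
assert np.isclose(nll, sse / (2 * sigma**2) + const)
print(nll, sse)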
Solving least squares linear regression
objective:

$\arg\min_w \|Xw - y\|_2^2$

again, $\|Xw - y\|_2^2 = (Xw - y)^\top (Xw - y)$, so this is $\arg\min_w\; w^\top X^\top X w - 2\, y^\top X w + y^\top y$

let's take the gradient and set equal to zero:

$\nabla_w = 2 X^\top X w - 2 X^\top y$; set to 0: $X^\top X\, w_{\mathrm{MLE}} = X^\top y \;\Rightarrow\; w_{\mathrm{MLE}} = (X^\top X)^{-1} X^\top y$

Hessian (second derivative) check:

$2 X^\top X$ is PSD because, for any $v \in \mathbb{R}^d$, $v^\top (2 X^\top X)\, v = 2\,(Xv)^\top (Xv) = 2\,\|Xv\|_2^2 \ge 0$
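(Not from the slides.) A minimal numpy sketch of this closed-form solution on made-up data, checked against numpy's built-in least squares routine:

import numpy as np

rng = np.random.default_rng(0)

# toy data: n points, d features, with a column of ones folded in for the bias
n, d = 100, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# closed-form MLE / least squares solution: w = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# sanity check against numpy's built-in least squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_mle, w_lstsq)
print(w_mle)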
Examining the least squares solution
• The solution we found needs XᵀX to be invertible — when is this not the case?
• When the columns of X are not linearly independent — i.e., when certain
features of the input are (linearly) determined by a set of other features
• If this is the case, there are actually an infinite number of solutions!
• Intuitively, we may think to simply remove features that are already determined by other features — this is a form of feature selection
• We can also consider other constraints on the solution, such as minimizing the norm of the solution — this is a form of regularization (next week)
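(Not from the slides.) One illustration of the minimum-norm option: when a column of X is a linear combination of other columns, XᵀX is singular and the normal equations have infinitely many solutions; the Moore-Penrose pseudoinverse picks the one with the smallest ℓ2 norm. A sketch assuming numpy:

import numpy as np

rng = np.random.default_rng(0)

# design matrix with a redundant feature: column 2 is column 0 + column 1,
# so the columns are not linearly independent and X^T X is singular
n = 50
A = rng.normal(size=(n, 2))
X = np.hstack([A, A[:, :1] + A[:, 1:2]])
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=n)

# the normal equations now have infinitely many solutions;
# the Moore-Penrose pseudoinverse returns the one with the smallest l2 norm
w_min_norm = np.linalg.pinv(X) @ y
print(w_min_norm)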
An aside: feature selection
• If we believe that we don’t need the entire set of features, can we “prune” them?
• There are many ideas for how to do so; one of the simplest ideas is to either
select or prune features greedily
• Forward selection: initialize an empty set of features; for each candidate feature, add it to the set and train the model; pick the best feature to add; repeat (a rough sketch follows this list)
• Backward elimination: initialize the full set of features; for each feature, remove it from the set and train; pick the best feature to remove; repeat
• Most machine learning relies on regularization rather than feature selection
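(Not from the slides.) A rough sketch of greedy forward selection on made-up data; using scikit-learn's LinearRegression and cross_val_score here is just one convenient choice of model and scoring, not something prescribed by the lecture:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k, cv=5):
    """Greedily add the feature that most improves cross-validated R^2."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        # score every candidate feature when added to the current set
        scores = {
            j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# toy usage: 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y, k=3))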
Other MLE formulations for linear regression
• The specific assumption of i.i.d. Gaussian noise on each output, conditioned on the input, led to MLE being equivalent to least squares linear regression
• What about other assumptions?
• What if the i.i.d. noise instead follows a Laplace distribution?
• We will see that this leads to least absolute deviations linear regression
• What if the noise is not i.i.d.?
• We will examine the case where there are different variances for each data point and how this leads to weighted linear regression
From MLE to least absolute deviation
assume that the output given the input is generated i.i.d. as
$y_i \mid x_i \sim \mathrm{Laplace}(w^\top x_i,\, b)$

where $\mathrm{Laplace}(y;\, \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|y - \mu|}{b}\right)$
the objective is:
$\arg\max_w \sum_i \log \mathrm{Laplace}(y_i;\, w^\top x_i,\, b) \;=\; \arg\min_w \sum_i |w^\top x_i - y_i| \;(+\ \text{const})$

this is "least absolute deviations"
Least absolute deviations vs. least squares
• Sadly, least absolute deviations has no analytical closed form solution
• We must use iterative optimization — the most commonly used methods for this problem are simplex methods, which are discussed in optimization classes (see the LP sketch after this list)
• Why might we choose this over least squares?
• One reason is that least absolute deviations is more robust to outliers
• This robustness is thanks to the properties of heavy-tailed distributions (the Laplace has heavier tails than the Gaussian)
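(Not from the slides.) One standard way to solve least absolute deviations is to rewrite it as a linear program, minimize Σᵢ tᵢ subject to −tᵢ ≤ wᵀxᵢ − yᵢ ≤ tᵢ, and hand it to an LP solver. The sketch below assumes scipy is available; its default solver is HiGHS rather than a textbook simplex method, but the formulation is the same:

import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """Least absolute deviations via an LP over the variables z = [w, t]."""
    n, d = X.shape
    # objective: sum of the slack variables t (the absolute residuals)
    c = np.concatenate([np.zeros(d), np.ones(n)])
    # constraints: X w - y <= t  and  -(X w - y) <= t
    A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * n  # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]

# toy usage with one large outlier
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
y[0] += 50.0  # outlier
print(lad_fit(X, y))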
From MLE to weighted linear regression
assume that the output given the input is generated independently (not i.i.d.!) as

$y_i \mid x_i \sim \mathcal{N}(w^\top x_i,\, \sigma_i^2)$

the objective is:

$\arg\max_w \sum_i \log \mathcal{N}(y_i;\, w^\top x_i,\, \sigma_i^2) \;=\; \arg\min_w \sum_i \frac{1}{\sigma_i^2}\, (w^\top x_i - y_i)^2$

(exercise: write this objective, and its solution, in matrix form)

this is useful for more than just when we “trust” some of the data more than other data —
we may, e.g., “care more” about getting the right answer for some data
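(Not from the slides.) If you work the matrix-form exercise above by redoing the gradient step, you should get the closed form w = (XᵀΩX)⁻¹XᵀΩy with Ω = diag(1/σ₁², ..., 1/σₙ²); a quick numpy sketch with made-up per-point variances:

import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([1.0, -1.0, 2.0])

# heteroscedastic noise: each point gets its own (made-up) variance
sigma = rng.uniform(0.1, 2.0, size=n)
y = X @ w_true + sigma * rng.normal(size=n)

# weighted least squares: w = (X^T Omega X)^{-1} X^T Omega y, Omega = diag(1 / sigma_i^2)
Omega = np.diag(1.0 / sigma**2)
w_wls = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

# compare to the unweighted (ordinary least squares) solution
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_wls, w_ols)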
If time: a geometric perspective on least squares
• What predictions ŷ = Xw are even possible for the model to make?
• The predictions ŷ are constrained to be in the column space of X
• The targets y almost certainly are not in the column space of X, due to noise, outliers, etc.
• So we wish to find predictions ŷ in the column space of X that are as close as possible (in terms of ℓ2 distance) to y
• This is the projection of y onto the column space of X!
If time: a geometric perspective on least squares
[figure: y ∈ ℝⁿ and its projection ŷ = Xw onto the column space of X, an (at most) d-dimensional subspace]

remember: the orthogonal complement of the column space is the (left) null space

the residual y − ŷ must be orthogonal to the column space, i.e., Xᵀ(y − ŷ) = 0

now substitute in ŷ = Xw:

$X^\top (y - Xw) = 0 \;\Rightarrow\; X^\top y - X^\top X w = 0 \;\Rightarrow\; w = (X^\top X)^{-1} X^\top y$

the same solution we found before
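(Not from the slides.) A small numpy check that projecting y onto the column space of X with the projection matrix X(XᵀX)⁻¹Xᵀ gives exactly the least squares predictions Xw, and that the residual is orthogonal to every column of X:

import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# least squares weights from the normal equations
w = np.linalg.solve(X.T @ X, X.T @ y)

# projection of y onto the column space of X: P = X (X^T X)^{-1} X^T
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y

# the two ways of computing the predictions agree,
# and the residual is orthogonal to the columns of X
assert np.allclose(y_hat, X @ w)
assert np.allclose(X.T @ (y - y_hat), 0)
print(y_hat[:5])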