
Introduction to Machine Learning Regression and Regularization
Prof. Kutty

Supervised Learning


– Given data x̄ (in feature space) and the labels y
– Learn to predict y from x̄
• Labels could be discrete or continuous
– Discrete labels: classification
– Continuous labels: regression
– e.g., product price, average star rating (fraction of 5)

Regression vs Classification
Classification problem: y ∈ {−1, 1} or y ∈ {0, 1}
Regression problem: y ∈ R
Regression function f : R^d → R where f ∈ F
x̄ ∈ R^d (feature space); y ∈ R (label space); f(x̄) is the predicted label

Linear Regression
A linear regression function is simply a linear function of the feature vector:
f(x̄; θ̄, b) = θ̄ · x̄ + b
Learning task:
Choose parameters (the parameter vector θ̄ and the offset b) in response to training set
S_n = {(x̄^(i), y^(i))}_{i=1}^n, where x̄^(i) ∈ R^d and y^(i) ∈ R

Empirical risk for Linear Regression
Recall empirical risk for linear classification
R_n(θ̄) = (1/n) Σ_{i=1}^n Loss(y^(i) (θ̄ · x̄^(i)))

Least Squares Loss function
Squared Loss: Loss(z) = z² / 2
R_n(θ̄) = (1/n) Σ_{i=1}^n Loss(y^(i) − θ̄ · x̄^(i)) = (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2
Here θ̄ · x̄^(i) is the predicted real-valued label and y^(i) is the actual real-valued label.
The squared loss permits small discrepancies but heavily penalizes large deviations.
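For example, Loss(0.1) = 0.1²/2 = 0.005, while Loss(10) = 10²/2 = 50: a residual that is 100 times larger incurs a loss that is 10,000 times larger.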

Stochastic Gradient Descent
for Optimizing the Least Squares Criterion
k = 0, θ̄^(0) = 0̄
while convergence criterion is not met:
    randomly shuffle points
    for i = 1, ..., n:
        θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss_sq(y^(i) − θ̄ · x̄^(i)) |_{θ̄ = θ̄^(k)}
        k = k + 1
Here the objective is R_n(θ̄) = (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2, and each update uses the loss of a single example.

Least Squares Loss function
Recall the chain rule. For the squared loss of example i, (y^(i) − θ̄ · x̄^(i))² / 2:
∇_θ̄ Loss_sq(y^(i) − θ̄ · x̄^(i)) = −(y^(i) − θ̄ · x̄^(i)) x̄^(i)
so
∇_θ̄ R_n(θ̄) = −(1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i)) x̄^(i)
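As a sanity check, here is a short numpy sketch that compares this chain-rule gradient to a finite-difference estimate; the data, the random seed, and the epsilon are arbitrary assumptions for illustration.

import numpy as np

# Finite-difference check of the chain-rule gradient above on made-up data:
# grad wrt theta of (y - theta.x)^2 / 2  should equal  -(y - theta.x) * x
rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)       # one feature vector x^(i)  (arbitrary)
y = rng.normal()             # its label y^(i)           (arbitrary)
theta = rng.normal(size=d)   # current parameter vector  (arbitrary)

analytic = -(y - theta @ x) * x

loss = lambda t: (y - t @ x) ** 2 / 2
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(analytic, numeric))   # True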

Stochastic Gradient Descent
for Optimizing the Least Squares Criterion
k = 0, θ̄^(0) = 0̄
while convergence criterion is not met:
    randomly shuffle points
    for i = 1, ..., n:
        θ̄^(k+1) = θ̄^(k) + η_k (y^(i) − θ̄^(k) · x̄^(i)) x̄^(i)
        k = k + 1
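A minimal Python sketch of this procedure follows. The function name, the data, the fixed number of epochs, and the constant step size (used in place of a tuned schedule η_k) are illustrative assumptions, not part of the lecture.

import numpy as np

# A minimal sketch of SGD for the least-squares criterion, following the
# update theta^(k+1) = theta^(k) + eta_k (y^(i) - theta^(k).x^(i)) x^(i).
def sgd_least_squares(X, y, n_epochs=50, eta=0.05):
    n, d = X.shape
    theta = np.zeros(d)                 # theta^(0) = 0-bar
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):           # stand-in for "while not converged"
        for i in rng.permutation(n):    # randomly shuffle points
            theta = theta + eta * (y[i] - theta @ X[i]) * X[i]
    return theta

# Tiny synthetic check: labels generated by theta = [1.5, -0.5] (assumed data)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 4.0]])
y = X @ np.array([1.5, -0.5])
print(sgd_least_squares(X, y))          # close to [1.5, -0.5]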

Lecture 5: Regression and Regularization
Section 1: Closed form solution for Empirical Risk with Squared Loss

Optimal value of θ̄ for R_n(θ̄)
R_n(θ̄) = (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2
where θ̄ · x̄^(i) is the predicted label and y^(i) is the true label
1. Find the gradient of R_n(θ̄) with respect to θ̄
2. Set it to zero and solve for θ̄

Find the gradient, set it to 0, and solve for θ̄
∇_θ̄ R_n(θ̄) = −(1/n) Σ_{i=1}^n x̄^(i) y^(i) + (1/n) Σ_{i=1}^n (θ̄ · x̄^(i)) x̄^(i)
Since scalar-vector multiplication commutes and the dot product can be rewritten as a matrix product, (θ̄ · x̄^(i)) x̄^(i) = x̄^(i) (x̄^(i))^T θ̄.
Setting ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄*} = 0̄ gives θ̄* = A^{-1} b, where
b = (1/n) Σ_{i=1}^n x̄^(i) y^(i)        dimension: d x 1
A = (1/n) Σ_{i=1}^n x̄^(i) (x̄^(i))^T    dimension: d x d

Alternative notation
Let X = [x̄^(1), ..., x̄^(n)]^T be the input matrix whose rows are the feature vectors, i.e. entry X_{ij} = x_j^(i)    dimension: n x d
and ȳ = [y^(1), ..., y^(n)]^T    dimension: n x 1
Then
b = (1/n) Σ_{i=1}^n x̄^(i) y^(i) = (1/n) X^T ȳ

Alternative notation
A = (1/n) Σ_{i=1}^n x̄^(i) (x̄^(i))^T = (1/n) X^T X
convince yourself of this!
X = [x̄^(1), ..., x̄^(n)]^T    dimension: n x d
ȳ = [y^(1), ..., y^(n)]^T    dimension: n x 1
θ̄* = A^{-1} b = ((1/n) X^T X)^{-1} (1/n) X^T ȳ = (X^T X)^{-1} X^T ȳ
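One quick way to convince yourself of these identities numerically is the following sketch; the random data and dimensions are arbitrary assumptions.

import numpy as np

# Numerical check on small made-up data:
#   sum_i x^(i) (x^(i))^T = X^T X   and   sum_i x^(i) y^(i) = X^T y
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))     # rows are the x^(i)
y = rng.normal(size=n)

A_sum = sum(np.outer(X[i], X[i]) for i in range(n))
b_sum = sum(X[i] * y[i] for i in range(n))

print(np.allclose(A_sum, X.T @ X))   # True
print(np.allclose(b_sum, X.T @ y))   # True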

Exact Solution for Regression
The parameter value computed as
θ̄* = (X^T X)^{-1} X^T ȳ
exactly minimizes Empirical Risk with Squared Loss
R_n(θ̄) = (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2
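A short numpy sketch of this closed-form solution on made-up data; a linear solve is used instead of forming the inverse explicitly, which is numerically preferable.

import numpy as np

# Sketch of the exact solution theta* = (X^T X)^{-1} X^T y on made-up data.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 4.0]])
y = X @ np.array([1.5, -0.5])           # labels from a known parameter vector

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)                       # [ 1.5 -0.5]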

If an exact solution exists, why use SGD?
short answer: efficiency. Forming X^T X costs O(nd²) and solving the resulting d x d system costs O(d³), which gets expensive for large d and n; SGD sidesteps this with cheap per-example updates.

What if XTX is singular?
– the columns of X are linearly dependent.
• implication: features are redundant
• Solution:
– identify and remove offending features!
– use regularization
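A small illustration (with made-up data) of how a redundant feature makes X^T X singular, and how adding a multiple of the identity, as regularization will do, restores invertibility:

import numpy as np

# A duplicated feature column makes X^T X singular.
X = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, 1.0],
              [3.0, 4.0, 4.0]])         # third column duplicates the second

print(np.linalg.matrix_rank(X.T @ X))                      # 2 (singular)

lam = 0.1                                                  # assumed value
print(np.linalg.matrix_rank(X.T @ X + lam * np.eye(3)))    # 3 (invertible)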

Lecture 5: Regression and Regularization
Section 2: Regularization and Ridge Regression

Regularization
Idea: prefer a simpler hypothesis
– pushes parameters toward some default value (typically zero)
– resists setting parameters away from the default value when the data only weakly suggest otherwise
J_{n,λ}(θ̄) = λ Z(θ̄) + R_n(θ̄)
Z(θ̄) is the regularization term/penalty; λ ≥ 0 is a hyperparameter

What should Z(θ̄) be?
• Desirable characteristics:
– should force components of θ̄ to be small (close to zero)
– convex, smooth
• A popular choice: ℓp norms
– Let's use the ℓ2 norm as the penalty function

Ridge Regression

Ridge regression
squared loss + L2 regularization:
J_{n,λ}(θ̄) = λ Z(θ̄) + R_n(θ̄)
with R_n(θ̄) = (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2 (squared loss)
and Z(θ̄) = ||θ̄||² / 2 (L2 regularization), so
J_{n,λ}(θ̄) = λ ||θ̄||² / 2 + (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2

Ridge regression
J_{n,λ}(θ̄) = λ ||θ̄||² / 2 + (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2
– when λ = 0, this is linear regression with squared loss
– as λ → ∞, the penalty dominates and the objective is minimized at θ̄ = 0̄
• picking an appropriate λ balances between these two extremes

Lecture 5: Regression and Regularization
Section 3: Closed form solution for Ridge Regression

Ridge regression Closed form solution
1. Find the gradient of J_{n,λ}(θ̄) with respect to θ̄
2. Set it to zero and solve for θ̄
J_{n,λ}(θ̄) = λ ||θ̄||² / 2 + (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2

Ridge regression Closed form solution
J_{n,λ}(θ̄) = λ ||θ̄||² / 2 + (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2
∇_θ̄ J_{n,λ}(θ̄) = λ θ̄ − b + A θ̄ = (λI + A) θ̄ − b
where b = (1/n) Σ_{i=1}^n x̄^(i) y^(i) and A = (1/n) Σ_{i=1}^n x̄^(i) (x̄^(i))^T

Ridge regression Closed form solution
J_{n,λ}(θ̄) = λ ||θ̄||² / 2 + (1/n) Σ_{i=1}^n (y^(i) − θ̄ · x̄^(i))² / 2
We say θ̄* = argmin_θ̄ J_{n,λ}(θ̄), so ∇_θ̄ J_{n,λ}(θ̄) |_{θ̄ = θ̄*} = 0̄
θ̄* = (λI + A)^{-1} b
    = (λ'I + X^T X)^{-1} X^T ȳ, where λ' = nλ
invertible as long as λ > 0
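A short numpy sketch of this closed-form ridge solution; the function name, the data, and the λ values are illustrative assumptions.

import numpy as np

# Sketch of the ridge closed-form solution
#   theta* = (lambda' I + X^T X)^{-1} X^T y,  with lambda' = n * lambda.
def ridge_closed_form(X, y, lam):
    n, d = X.shape
    return np.linalg.solve(n * lam * np.eye(d) + X.T @ X, X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 4.0]])
y = X @ np.array([1.5, -0.5])
print(ridge_closed_form(X, y, lam=0.0))   # lambda = 0: ordinary least squares
print(ridge_closed_form(X, y, lam=1.0))   # larger lambda shrinks theta toward 0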

Ridge regression Closed form solution
invertible as long as λ > 0
• A matrix is positive definite iff all its eigenvalues are positive.
• A positive definite matrix is invertible.
• A matrix is positive semi-definite (PSD) iff all its eigenvalues are non-negative.
• X^T X is positive semi-definite (PSD).
• If a matrix M has eigenvalue μ, then M + λI has eigenvalue μ + λ.
• Hence every eigenvalue of λ'I + X^T X is at least λ' = nλ > 0, so the matrix is positive definite and therefore invertible.
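A quick numerical check of these facts on made-up data; the matrix dimensions and λ value are arbitrary assumptions.

import numpy as np

# X^T X is PSD, and adding lambda*I shifts every eigenvalue up by lambda.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
lam = 0.5

eig = np.linalg.eigvalsh(X.T @ X)
eig_shifted = np.linalg.eigvalsh(X.T @ X + lam * np.eye(3))

print(np.all(eig >= 0))                      # True: X^T X is PSD
print(np.allclose(eig_shifted, eig + lam))   # True: eigenvalues shift by lambda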
