Recitation 3
Regularization: Motivation and Effect
Feb. 09, 2022
Regularization and its effects
Motivation
It is hard to choose a good hypothesis space when we know too little about the data or the underlying truth.
If the space is too small, we cannot model the data accurately.
If the space is too large, we overfit the training data: amazing in training, useless when deployed.
Idea: start with a large space, then shrink it down.
Types of Regularization
Implicit regularization
  Initialization
  Training strategy
  Model structure
Explicit regularization (what we refer to in this course)
  Classics (what we will discuss today): L1 & L2 & Elastic Net
  Early stopping
  Data augmentation
  Dropout
Examples of other types of regularization
L2 (Ridge) and L1(Lasso) Regularization
L2 (Ridge)
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \left(w^{T} x_i - y_i\right)^2 + \lambda \lVert w \rVert_2^2$$
L1 (Lasso)
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \left(w^{T} x_i - y_i\right)^2 + \lambda \lVert w \rVert_1$$
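To make the two objectives concrete, here is a minimal NumPy sketch (the data, the value of $\lambda$, and the solver choices are invented for illustration, not taken from the recitation). Ridge is solved via its closed form; lasso via a basic proximal-gradient (ISTA) loop with soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = np.array([4.0, 0.0, 0.0, 2.0, 0.0])   # made-up sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.1

# Ridge: minimizing (1/n) * sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2
# has the closed form w_hat = (X^T X + n * lam * I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Lasso has no closed form; ISTA alternates a gradient step on the squared
# loss with soft-thresholding, the proximal operator of the L1 penalty.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1/L, L = Lipschitz constant of the gradient
w_lasso = np.zeros(d)
for _ in range(5000):
    grad = (2.0 / n) * X.T @ (X @ w_lasso - y)
    w_lasso = soft_threshold(w_lasso - step * grad, step * lam)

print("ridge:", np.round(w_ridge, 3))   # coefficients shrunk toward 0, typically none exactly 0
print("lasso:", np.round(w_lasso, 3))   # coefficients whose true value is 0 end up exactly 0
```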
Effect on linearly dependent features
Suppose we have a single feature $x_1$ and a response variable $y$, and that the ERM is
$$\hat{f}(x_1) = 4 x_1.$$
What happens if we add a new feature $x_2$ that is an exact copy, $x_2 = x_1$?
Effect on linearly dependent features
The new feature $x_2$ gives no new information: $\hat{f}(x_1) = 4 x_1$ is still an ERM. But now there are additional ERMs, for example:
$$\hat{f}(x_1, x_2) = 2 x_1 + 2 x_2, \qquad \hat{f}(x_1, x_2) = x_1 + 3 x_2, \qquad \hat{f}(x_1, x_2) = 8 x_1 - 4 x_2.$$
What happens if we introduce L1 or L2 regularization?
Effect on linearly dependent features
$f(x_1, x_2) = w_1 x_1 + w_2 x_2$ is an ERM iff $w_1 + w_2 = 4$. Consider the L1 and L2 norms of the various solutions:
$\lVert w \rVert_1$ does not discriminate between them, as long as both weights have the same sign.
$\lVert w \rVert_2$ is minimized when the weight is spread equally, i.e. $w_1 = w_2 = 2$.
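As a numerical illustration of this (assuming scikit-learn is available; the data and penalty strengths are invented, and sklearn's penalty scaling differs from the $\lambda$ on the slides, so only the qualitative behavior matters), fit ridge and lasso with two identical copies of the same feature:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])          # x2 is an exact copy of x1
y = 4 * x1                             # noise-free response, so ERM weights satisfy w1 + w2 = 4

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))   # roughly (2, 2): weight spread evenly
print("lasso:", np.round(lasso.coef_, 3))   # e.g. roughly (4, 0): any same-sign split has the
                                            # same L1 norm, so the solver just returns one of them
```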
L2 Contour Line
L1 Contour Line
Effect on linearly dependent features
Now let's consider the case where $x_2 = 2 x_1$, keeping the rest of the setup the same. Then any model satisfying $w_1 + 2 w_2 = 4$ is an ERM, for example:
$$\hat{f}(x_1, x_2) = 2 x_1 + x_2, \qquad \hat{f}(x_1, x_2) = 3 x_1 + 0.5 x_2, \qquad \hat{f}(x_1, x_2) = 6 x_1 - x_2.$$
How would L1 or L2 regularization change the outcome?
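The same kind of sketch (same invented data and caveats as in the previous example) for the scaled case $x_2 = 2 x_1$:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1])      # x2 = 2 * x1, a larger-scale copy
y = 4 * x1                             # ERM weights satisfy w1 + 2 * w2 = 4

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))   # roughly (0.8, 1.6): weight proportional to scale
print("lasso:", np.round(lasso.coef_, 3))   # roughly (0, 2): all weight on the larger-scale feature
```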
L2 Contour Line
L1 Contour Line
For identical features
  L1 regularization spreads the weight arbitrarily (as long as all weights have the same sign).
  L2 regularization spreads the weight evenly.
For linearly related features
  L1 regularization puts all of the weight on the variable with the larger scale and gives zero weight to the others.
  L2 regularization also prefers the variable with the larger scale, but spreads the weight in proportion to scale.
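In the limit of vanishing regularization strength, both claims for the scaled case can be checked by minimizing each norm directly over the ERM set (a short side calculation, not from the slides):
$$\min_{w_1 + 2 w_2 = 4} \lVert w \rVert_2^2 \;\Rightarrow\; w \propto (1, 2) \;\Rightarrow\; w = \left(\tfrac{4}{5}, \tfrac{8}{5}\right), \qquad \min_{w_1 + 2 w_2 = 4} \lVert w \rVert_1 \;\Rightarrow\; w = (0, 2).$$
So L2 spreads the weight in proportion to each feature's scale, while L1 puts everything on the larger-scale feature $x_2$.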
Note on contour lines
Recall our discussion of linear predictors $f(x) = w^{T} x$ with square loss: the sets of $w$ giving the same empirical risk (i.e., the level sets) formed ellipsoids around the ERM.
With $x_1$ and $x_2$ linearly related, $X^{T} X$ has a zero eigenvalue.
So the level set $\{\, w \mid (w - \hat{w})^{T} X^{T} X (w - \hat{w}) = c \,\}$ is no longer a proper ellipsoid.
It is a degenerate ellipsoid, which is why the level sets were pairs of parallel lines in this case.
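A quick numerical check of this claim, with toy data assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1])      # x2 = 2 * x1, so the columns are linearly dependent

# X^T X is rank one, so one eigenvalue is (numerically) zero.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
print("eigenvalues of X^T X:", eigvals)

# Moving w along the zero-eigenvalue direction (proportional to (2, -1) here)
# does not change X @ w, hence does not change the empirical risk: the level
# sets are flat along that direction, i.e. degenerate ellipsoids.
null_dir = eigvecs[:, 0]               # eigenvector of the (numerically) zero eigenvalue
print("max |X @ null_dir|:", np.abs(X @ null_dir).max())
```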
Note on contour lines
Elastic Net
A way of combining L1 and L2 regularization:
$$\hat{w} = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \left(w^{T} x_i - y_i\right)^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$$
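A minimal sketch of elastic net on the duplicated-feature example (assuming scikit-learn; its `alpha` / `l1_ratio` parameterization is not the same as the $\lambda_1, \lambda_2$ above, so treat the result as qualitative only):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])          # identical features again
y = 4 * x1

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print("elastic net:", np.round(enet.coef_, 3))
# Unlike pure lasso, the L2 part of the penalty pushes the two identical
# features toward sharing the weight rather than one coefficient taking it all.
```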
Elastic Net
A not-so-inspiring way of compromising between the two penalties.
Elastic Net
Ratio of L2 to L1 regularization roughly 2:1.
Generalization to more complicated models
The goal is to make the model remember only the relevant information, and to reduce its dependence on any single feature as much as possible.
Methods may vary when we have billions of parameters.