
Controlling Complexity: Feature Selection and Regularization
DS-GA 1003, Feb 8, 2022


Complexity of Hypothesis Spaces
What is the trade-off between approximation error and estimation error?
Bigger F: better approximation but can overfit
Smaller F: less likely to overfit but can be farther from the Bayes prediction function
To control the “size” of F, we need some measure of its complexity:
Number of variables / features
Degree of polynomial

General Approach to Control Complexity
1. Learn a sequence of models varying in complexity from the training data: F1 ⊂ F2 ⊂ ··· ⊂ Fn ⊂ F
Example: Polynomial Functions
F = {all polynomial functions}
Fd = {all polynomials of degree ≤ d}
2. Select one of these models based on a score (e.g. validation error)

Feature Selection in Linear Regression
Nested sequence of hypothesis spaces: F1 ⊂ F2 ⊂ ··· ⊂ Fn ⊂ F
F = {linear functions using all features}
Fd = {linear functions using fewer than d features}
Best subset selection:
Choose the subset of features that is best according to the score (e.g. validation error)
Example with two features: Train models using {}, {X1}, {X2}, {X1,X2}, respectively
No efficient search algorithm; iterating over all subsets becomes very expensive with a large number of features
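A minimal sketch of this brute-force search in Python (not from the original slides; assumes numpy and scikit-learn, with hypothetical arrays X_train, y_train, X_val, y_val), scoring each subset by validation MSE:

    import itertools
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def best_subset(X_train, y_train, X_val, y_val):
        """Try every feature subset; keep the one with the lowest validation MSE."""
        d = X_train.shape[1]
        best_score, best_feats = np.inf, ()
        for k in range(d + 1):
            for feats in itertools.combinations(range(d), k):
                cols = list(feats)
                if not cols:
                    preds = np.full(len(y_val), y_train.mean())  # empty model: predict the training mean
                else:
                    model = LinearRegression().fit(X_train[:, cols], y_train)
                    preds = model.predict(X_val[:, cols])
                score = mean_squared_error(y_val, preds)
                if score < best_score:
                    best_score, best_feats = score, feats
        return best_feats, best_score  # cost grows as 2^d, hence intractable for many features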

Complexity Penalty
An objective that balances number of features and prediction performance:
score(S ) = training_loss(S ) + λ|S |
λ balances the training loss and the number of features used:
Adding an extra feature must be justified by at least λ improvement in training loss
Larger λ → complex models are penalized more heavily
Criteria such as AIC and BIC use different values of λ

Greedy Selection Methods
Forward selection:
1. Start with an empty set of features S
2. For each feature i not in S
Learn a model using features S ∪ i
Compute score of the model: αi
3. Find the candidate feature with the highest score: j = arg maxi αi
4. If αj improves the current best score, add feature j : S ← S ∪ j and go to step 2; return S otherwise.
Backward Selection:
Start with all features; in each iteration, remove the worst feature
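A sketch of the forward-selection steps above in Python (not from the original slides; `score` is a hypothetical helper returning negative validation MSE, so higher is better, matching the argmax in step 3):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def score(feats, X_train, y_train, X_val, y_val):
        # Higher is better: negative validation MSE of a linear model using `feats`.
        model = LinearRegression().fit(X_train[:, feats], y_train)
        return -mean_squared_error(y_val, model.predict(X_val[:, feats]))

    def forward_selection(X_train, y_train, X_val, y_val):
        d = X_train.shape[1]
        S, best = [], float("-inf")
        while True:
            candidates = [i for i in range(d) if i not in S]
            if not candidates:
                return S
            # Steps 2-3: score each candidate feature added to the current set S.
            alphas = {i: score(S + [i], X_train, y_train, X_val, y_val) for i in candidates}
            j = max(alphas, key=alphas.get)
            if alphas[j] > best:        # step 4: keep j only if it improves the best score
                best = alphas[j]
                S.append(j)
            else:
                return S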

Feature Selection: Discussion
Number of features as a measure of the complexity of a linear prediction function
General approach to feature selection:
Define a score that balances training error and complexity
Find the subset of features that maximizes the score
The search problem over 2k feature subsets is hard
Forward selection is often used in practice, but isn’t guaranteed to find the best solution
Forward and backward selection do not in general result in the same subset

l2 and l1 Regularization

Complexity Penalty
Goal: Balance the complexity of the hypothesis space F and the training loss
Complexity measure: Ω : F → [0, ∞), e.g. number of features
Penalized ERM (Tikhonov regularization)
For complexity measure Ω : F → [0, ∞) and fixed λ ≥ 0,
min_{f∈F} (1/n) Σ_{i=1}^n l(f(xi), yi) + λΩ(f)
As usual, we find λ using the validation data.
Number of features as complexity measure is hard to optimize—other measures?

Weight Shrinkage: Intuition
(Draw on board)
Why would we prefer a regression line with smaller slope (unless the data strongly supports a larger slope)?
More conservative: small change in the input does not cause large change in the output
If we push the estimated weights to be small, re-estimating them on a new dataset wouldn’t cause the prediction function to change dramatically (less sensitive to noise in data)
Bayesian intuition: pull the regression weights towards a prior centered at 0

Weight Shrinkage: Polynomial Regression
Large weights are needed to make the curve wiggle sufficiently to overfit the data
ŷ = 0.001x⁷ + 0.003x³ + 1 is less likely to overfit than ŷ = 1000x⁷ + 500x³ + 1
(Adapted from ’s slide)
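A quick numeric sketch of this point (not from the original slides; assumes scikit-learn and uses made-up 1-D data): fit a degree-7 polynomial with and without an l2 penalty and compare the coefficient magnitudes.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 20)
    y = x**3 + 0.1 * rng.standard_normal(20)        # hypothetical noisy cubic data

    Phi = PolynomialFeatures(degree=7).fit_transform(x.reshape(-1, 1))
    plain = LinearRegression().fit(Phi, y)
    shrunk = Ridge(alpha=1.0).fit(Phi, y)           # alpha plays the role of lambda here

    print(np.abs(plain.coef_).max())                # unpenalized weights are typically much larger
    print(np.abs(shrunk.coef_).max())               # ridge keeps the weights small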

Linear Regression with l2 Regularization
We have a linear model
F = {f : ℝᵈ → ℝ | f(x) = wᵀx for w ∈ ℝᵈ}
Square loss: l(ŷ, y) = (y − ŷ)²
Training data Dn = ((x1,y1),…,(xn,yn))
Linear least squares regression is ERM for square loss over F:
ŵ = argmin_{w∈ℝᵈ} (1/n) Σ_{i=1}^n (wᵀxi − yi)²
This often overfits, especially when d is large compared to n (e.g. in NLP one can have 1M features for 10K documents).

Linear Regression with L2 Regularization
Penalizes large weights:
ŵ = argmin_{w∈ℝᵈ} (1/n) Σ_{i=1}^n (wᵀxi − yi)² + λ∥w∥₂²,
where ∥w∥₂² = w1² + ··· + wd² is the square of the l2-norm.
Also known as ridge regression.
Equivalent to linear least squares regression when λ = 0.
l2 regularization can be used for other models too (e.g. neural networks).

l2 regularization reduces sensitivity to changes in input
f̂(x) = ŵᵀx is Lipschitz continuous with Lipschitz constant L = ∥ŵ∥₂: when moving from x to x + h, f̂ changes by no more than L∥h∥.
l2 regularization controls the maximum rate of change of f̂.
Proof:
|f̂(x + h) − f̂(x)| = |ŵᵀ(x + h) − ŵᵀx| = |ŵᵀh| ≤ ∥ŵ∥₂∥h∥₂ (Cauchy–Schwarz inequality)
Other norms also provide a bound on L due to the equivalence of norms: ∃ C > 0 s.t. ∥ŵ∥₂ ≤ C∥ŵ∥_p

Linear Regression vs. Ridge Regression
Objective:
Linear: L(w) = (1/2)∥Xw − y∥₂²
Ridge: L(w) = (1/2)∥Xw − y∥₂² + (λ/2)∥w∥₂²
Gradient:
Linear: ∇L(w) = Xᵀ(Xw − y)
Ridge: ∇L(w) = Xᵀ(Xw − y) + λw
Also known as weight decay in neural networks
Closed-form solution:
Linear: XᵀXw = Xᵀy
Ridge: (XᵀX + λI)w = Xᵀy
(XᵀX + λI) is always invertible for λ > 0
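A minimal numpy sketch of the two closed-form solutions (hypothetical X and y; the ridge system is well-posed for any λ > 0 since XᵀX + λI is positive definite):

    import numpy as np

    def ols_weights(X, y):
        # Solve X^T X w = X^T y; can fail or be unstable when X^T X is singular.
        return np.linalg.solve(X.T @ X, X.T @ y)

    def ridge_weights(X, y, lam):
        d = X.shape[1]
        # Solve (X^T X + lam * I) w = X^T y; always well-posed for lam > 0.
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)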

Ridge Regression: Regularization Path
Modified from Hastie, Tibshirani, and Wainwright’s Statistical Learning with Sparsity, Fig 2.1. About predicting crime in 50 US cities.

Lasso Regression (“Least Absolute Shrinkage and Selection Operator”)
Penalize the l1 norm of the weights (Tikhonov form):
ŵ = argmin_{w∈ℝᵈ} (1/n) Σ_{i=1}^n (wᵀxi − yi)² + λ∥w∥₁,
where ∥w∥₁ = |w1| + ··· + |wd| is the l1-norm.
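A concrete sketch (not from the original slides) using scikit-learn's Lasso, which minimizes (1/(2n))∥Xw − y∥₂² + α∥w∥₁, so its α corresponds to λ up to the factor of 1/2; the data below is synthetic, with only 3 truly relevant features:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -1.5, 1.0]                   # only the first 3 features matter
    y = X @ w_true + 0.1 * rng.standard_normal(100)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)

    print((np.abs(ridge.coef_) < 1e-8).sum())       # ridge: usually no exact zeros
    print((np.abs(lasso.coef_) < 1e-8).sum())       # lasso: many coefficients exactly 0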

Ridge vs. Lasso: Regularization Paths
Lasso yields sparse weights:
Modified from Hastie, Tibshirani, and Wainwright’s Statistical Learning with Sparsity, Fig 2.1. About predicting crime in 50 US cities.

The Benefits of Sparsity
The coefficient for a feature is 0 =⇒ the feature is not needed for prediction. Why is that useful?
Faster to compute the features; cheaper to measure or annotate them
Less memory to store features (deployment on a mobile device)
Interpretability: identifies the important features
Prediction function may generalize better
Feature-selection step for training a slower non-linear model

Why does l1 Regularization Lead to Sparsity?

Regularization as Constrained Empirical Risk Minimization
Constrained ERM (Ivanov regularization)
For complexity measure Ω : F → [0, ∞) and fixed r ≥ 0,
min_{f∈F} (1/n) Σ_{i=1}^n l(f(xi), yi)
s.t. Ω(f) ≤ r
The lasso regression solution for complexity parameter r ≥ 0 is (Ivanov form)
ŵ = argmin_{∥w∥₁ ≤ r} (1/n) Σ_{i=1}^n (wᵀxi − yi)².
r has the same role as λ in penalized ERM (Tikhonov).

Ivanov vs. Tikhonov Regularization
Let L : F → ℝ be any performance measure of f, e.g. L(f) could be the empirical risk of f.
For many L and Ω, Ivanov and Tikhonov are equivalent:
Any solution f ∗ we can get from Ivanov, we can also get from Tikhonov.
Any solution f* we can get from Tikhonov, we can also get from Ivanov.
The conditions for this equivalence can be derived from Lagrangian duality theory.
In practice, both approaches are effective: we will use whichever one is more convenient for training or analysis.

The l1 and l2 Norm Constraints
Let’s consider F = {f(x) = w1x1 + w2x2} (a 2-dimensional parameter space).
We can represent each function in F as a point (w1,w2) ∈ R2.
Where in R2 are the functions that satisfy the Ivanov regularization constraint for l1 and l2?
l2 contour: w1² + w2² = r
l1 contour: |w1| + |w2| = r
Where are the sparse solutions?

Visualizing Regularization
f*_r = argmin_{w∈ℝ²} (1/n) Σ_{i=1}^n (wᵀxi − yi)² subject to w1² + w2² ≤ r
Blue region: area satisfying the complexity constraint w1² + w2² ≤ r
Red lines: contours of the empirical risk R̂n(w) = Σ_{i=1}^n (wᵀxi − yi)².
KPM Fig. 13.3

Why Does l1 Regularization Encourage Sparse Solutions?
f*_r = argmin_{w∈ℝ²} (1/n) Σ_{i=1}^n (wᵀxi − yi)² subject to |w1| + |w2| ≤ r
Blue region: area satisfying the complexity constraint |w1| + |w2| ≤ r
Red lines: contours of the empirical risk R̂n(w) = Σ_{i=1}^n (wᵀxi − yi)².
The l1 solution tends to touch the corners.
KPM Fig. 13.3

Why Does l1 Regularization Encourage Sparse Solutions?
Geometric intuition: Projection onto the diamond encourages solutions at corners.
wˆ in red/green regions are closest to corners in the l1 “ball”.
Fig from Mairal et al.’s Sparse Modeling for Image and Vision Processing Fig 1.6

Why Does l1 Regularization Encourage Sparse Solutions?
Geometric intuition: Projection onto l2 sphere favors all directions equally.
Fig from Mairal et al.’s Sparse Modeling for Image and Vision Processing Fig 1.6

Why Does l1 Regularization Encourage Sparsity? An Optimization Perspective
For l2 regularization,
As wi becomes smaller, there is less and less penalty
What is the l2 penalty for wi = 0.0001? (just λ · 0.0001² = λ · 10⁻⁸)
The gradient—which determines the pace of optimization—decreases as wi approaches zero
Less incentive to make a small weight equal to exactly zero
For l1 regularization,
The gradient stays the same as the weights approach zero
This pushes the weights to be exactly zero even if they are already small
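A tiny numeric sketch of this argument (illustration only, plain numpy): the gradient of the l2 penalty λw² shrinks with w, while the subgradient magnitude of the l1 penalty λ|w| stays at λ.

    import numpy as np

    lam = 1.0
    for w in [1.0, 0.1, 0.01, 0.0001]:
        grad_l2 = 2 * lam * w           # d/dw of lam * w^2 vanishes as w -> 0
        grad_l1 = lam * np.sign(w)      # subgradient of lam * |w| (for w != 0) stays at lam
        print(f"w={w:8}: l2 gradient = {grad_l2:.8f}, |l1 gradient| = {abs(grad_l1):.1f}")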

lq Regularization
We can generalize to lq: (∥w∥_q)^q = |w1|^q + |w2|^q.
Note: ∥w∥_q is only a norm for q ≥ 1, not for q ∈ (0, 1)
When q < 1, the lq constraint is non-convex, so it is hard to optimize; lasso is good enough in practice
l0 (∥w∥₀) is defined as the number of non-zero weights, i.e. subset selection

Minimizing the lasso objective

Minimizing the lasso objective
The ridge regression objective is differentiable (and there is a closed-form solution)
The lasso objective
Σ_{i=1}^n (wᵀxi − yi)² + λ∥w∥₁
is not differentiable, because ∥w∥₁ = |w1| + ... + |wd| is not differentiable!
We will briefly review three approaches for finding the minimum:
Quadratic programming
Projected SGD
Coordinate descent

Rewriting the Absolute Value
Consider any number a ∈ ℝ.
Let the positive part of a be a⁺ = a·1(a ≥ 0).
Let the negative part of a be a⁻ = −a·1(a ≤ 0).
Is it always the case that a⁺ ≥ 0 and a⁻ ≥ 0?
How do you write a in terms of a⁺ and a⁻?
How do you write |a| in terms of a⁺ and a⁻?

The Lasso as a Quadratic Program
Substituting w = w⁺ − w⁻ and |w| = w⁺ + w⁻ results in an equivalent problem:
min_{w⁺,w⁻} Σ_{i=1}^n ((w⁺ − w⁻)ᵀxi − yi)² + λ1ᵀ(w⁺ + w⁻)
subject to wᵢ⁺ ≥ 0 for all i and wᵢ⁻ ≥ 0 for all i
This objective is differentiable (in fact, convex and quadratic)
How many variables does the new objective have?
This is a quadratic program: a convex quadratic objective with linear constraints.
Quadratic programming is a very well understood problem; we can plug this into a generic QP solver.

Are we missing some constraints?
We have claimed that the following objective is equivalent to the lasso problem:
min_{w⁺,w⁻} Σ_{i=1}^n ((w⁺ − w⁻)ᵀxi − yi)² + λ1ᵀ(w⁺ + w⁻)
subject to wᵢ⁺ ≥ 0 for all i, wᵢ⁻ ≥ 0 for all i
When we plug this optimization problem into a QP solver, it just sees 2d variables and 2d constraints.
It doesn’t know we want wᵢ⁺ and wᵢ⁻ to be the positive and negative parts of wᵢ.
It turns out that these constraints will be satisfied anyway!
To make it clear that the solver isn’t aware of the constraints on wᵢ⁺ and wᵢ⁻, let’s denote them aᵢ and bᵢ.

The Lasso as a Quadratic Program
(Trivially) reformulating the lasso problem:
min_w min_{a,b} Σ_{i=1}^n ((a − b)ᵀxi − yi)² + λ1ᵀ(a + b)
subject to aᵢ ≥ 0 for all i, bᵢ ≥ 0 for all i, a − b = w, a + b = |w|
Claim: Don’t need the constraint a + b = |w|.
Exercise: Prove by showing that the optimal solutions a* and b* satisfy min(a*, b*) = 0, hence a* + b* = |w|.

The Lasso as a Quadratic Program
Claim: Can remove min_w and the constraint a − b = w.
Exercise: Prove by switching the order of the minimization.
min_{a,b} Σ_{i=1}^n ((a − b)ᵀxi − yi)² + λ1ᵀ(a + b)
subject to aᵢ ≥ 0 for all i, bᵢ ≥ 0 for all i

Projected SGD
Now that we have a differentiable objective, we could also use gradient descent
But how do we handle the constraints?
min_{w⁺,w⁻} Σ_{i=1}^n ((w⁺ − w⁻)ᵀxi − yi)² + λ1ᵀ(w⁺ + w⁻)
subject to wᵢ⁺ ≥ 0 for all i, wᵢ⁻ ≥ 0 for all i
Projected SGD is just like SGD, but after each step we project w⁺ and w⁻ into the constraint set.
In other words, if any component of w⁺ or w⁻ becomes negative, we set it back to 0.
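A minimal full-batch sketch of this idea (not the course's implementation; plain numpy, with an arbitrary step size and iteration count, and full gradients rather than stochastic ones):

    import numpy as np

    def lasso_projected_gd(X, y, lam, lr=0.001, n_iters=5000):
        """Minimize sum((X @ (wp - wn) - y)**2) + lam * sum(wp + wn) with wp, wn >= 0."""
        n, d = X.shape
        wp = np.zeros(d)                        # w+ (positive part)
        wn = np.zeros(d)                        # w- (negative part)
        for _ in range(n_iters):
            residual = X @ (wp - wn) - y
            grad = 2.0 * (X.T @ residual)       # gradient of the squared-error term w.r.t. (wp - wn)
            wp -= lr * (grad + lam)             # gradient step for w+
            wn -= lr * (-grad + lam)            # gradient step for w-
            wp = np.maximum(wp, 0.0)            # projection: clip negative entries back to 0
            wn = np.maximum(wn, 0.0)
        return wp - wn                          # recover w = w+ - w-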

Coordinate Descent Method
Goal: Minimize L(w) = L(w1, ..., wd) over w = (w1, ..., wd) ∈ ℝᵈ.
In gradient descent or SGD, each step potentially changes all entries of w.
In coordinate descent, each step adjusts only a single coordinate wi:
wi_new = argmin_{wi} L(w1, ..., wi−1, wi, wi+1, ..., wd)
Solving this argmin may itself be an iterative process.
Coordinate descent is an effective method when it’s easy or easier to minimize with respect to one coordinate at a time.

Coordinate Descent Method
Goal: Minimize L(w) = L(w1, ..., wd) over w = (w1, ..., wd) ∈ ℝᵈ.
Initialize w(0) = 0
while not converged:
  Choose a coordinate j ∈ {1, ..., d}
  wj_new ← argmin_{wj} L(w1(t), ..., wj−1(t), wj, wj+1(t), ..., wd(t))
  wj(t+1) ← wj_new and w−j(t+1) ← w−j(t)
Random coordinate choice ⟹ stochastic coordinate descent
Cyclic coordinate choice ⟹ cyclic coordinate descent

Coordinate Descent Method for the Lasso Objective
The coordinate minimization has a closed form! If
ŵj = argmin_{wj∈ℝ} Σ_{i=1}^n (wᵀxi − yi)² + λ∥w∥₁,
then, with aj = 2 Σ_{i=1}^n xi,j² and cj = 2 Σ_{i=1}^n xi,j (yi − w−jᵀ xi,−j),
ŵj = (cj + λ)/aj if cj < −λ,  ŵj = 0 if cj ∈ [−λ, λ],  ŵj = (cj − λ)/aj if cj > λ,
where w−j is w without the j-th component, and xi,−j is xi without the j-th component.
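A sketch of this closed-form update with cyclic coordinate choice (not the course's implementation; plain numpy, using the aj and cj defined above):

    import numpy as np

    def lasso_coordinate_descent(X, y, lam, n_iters=100):
        """Cyclic coordinate descent for sum_i (w^T x_i - y_i)^2 + lam * ||w||_1."""
        n, d = X.shape
        w = np.zeros(d)
        a = 2.0 * np.sum(X ** 2, axis=0)            # a_j = 2 * sum_i x_ij^2
        for _ in range(n_iters):
            for j in range(d):
                # Residual of the model that excludes coordinate j.
                r_minus_j = y - X @ w + X[:, j] * w[j]
                c_j = 2.0 * (X[:, j] @ r_minus_j)   # c_j = 2 * sum_i x_ij * (y_i - w_{-j}^T x_{i,-j})
                if c_j < -lam:
                    w[j] = (c_j + lam) / a[j]
                elif c_j > lam:
                    w[j] = (c_j - lam) / a[j]
                else:                               # c_j in [-lam, lam] => soft-threshold to exactly 0
                    w[j] = 0.0
        return w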

Coordinate Descent in General
In general, coordinate descent is not competitive with gradient descent: its convergence rate is slower and the iteration cost is similar
But it works very well for certain problems
Very simple and easy to implement
Example applications: lasso regression, SVMs
