Machine learning lecture slides
COMS 4771 Fall 2020
Regression II: Regularization
Outline
- Inductive biases in linear regression
- Regularization
- Model averaging
- Bayesian interpretation of regularization
Inductive bias
- In linear regression, it is possible for the least squares solution to be non-unique, in which case there are infinitely many solutions.
- Which one should we pick?
- Possible answer: pick the shortest solution, i.e., the one of minimum (squared) Euclidean norm ‖w‖₂².
- Small norm ⇒ small changes in output in response to changes in input:
\[
\underbrace{|w^\top x - w^\top x'|}_{\text{change in output}}
= |w^\top (x - x')|
\;\le\; \|w\|_2 \cdot \underbrace{\|x - x'\|_2}_{\text{change in input}}
\]
(an easy consequence of the Cauchy–Schwarz inequality)
- Note: the data do not give any reason to choose a shorter w over a longer w.
- The preference for short w is an example of an inductive bias.
- All learning algorithms encode some form of inductive bias.
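A minimal numpy sketch (with synthetic data, not from the lecture) of the point above: when the least squares problem is underdetermined, there are infinitely many solutions, `np.linalg.pinv` picks the one of minimum Euclidean norm, and adding any null(A) direction gives a longer solution with exactly the same fit.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))      # n = 5 examples, d = 8 features: underdetermined
b = rng.normal(size=5)

w_min = np.linalg.pinv(A) @ b    # the minimum Euclidean norm solution
# Any null-space direction can be added without changing the predictions.
_, _, Vt = np.linalg.svd(A)
r = Vt[-1]                       # a vector in null(A) (rank(A) = 5 < 8)
w_other = w_min + 3.0 * r        # another solution, but longer
print(np.allclose(A @ w_min, A @ w_other))               # True: identical fits
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))   # True: w_min is shorter
```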
Example of minimum norm inductive bias
- Trigonometric feature expansion ϕ(x) = (sin(x), cos(x), ..., sin(32x), cos(32x)) ∈ ℝ⁶⁴
- n = 32 training examples
- Infinitely many solutions to the normal equations
[Plot: training data together with three fits — an arbitrary solution, the least Euclidean norm solution, and the least weighted norm solution]
Figure 1: Fitted linear models with trigonometric feature expansion
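A sketch of the setup behind Figure 1, assuming synthetic training points and labels (the actual data are not shown on the slide); `np.linalg.lstsq` returns the least Euclidean norm solution when the system is underdetermined.

```python
import numpy as np

def trig_features(x, K=32):
    """Features sin(kx), cos(kx) for k = 1, ..., K (all sines, then all cosines)."""
    ks = np.arange(1, K + 1)
    return np.concatenate([np.sin(np.outer(x, ks)), np.cos(np.outer(x, ks))], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=32)               # n = 32 training points
y = np.sin(x) + 0.1 * rng.normal(size=32)     # synthetic labels
Phi = trig_features(x)                        # 32 x 64 design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # lstsq returns the least-norm solution
```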
Representation of minimum norm solution (1)
- Claim: the minimum (Euclidean) norm solution to the normal equations lives in the span of the xᵢ's (i.e., in range(Aᵀ)).
- I.e., we can write
\[
w = A^\top \alpha = \sum_{i=1}^{n} \alpha_i x_i
\]
for some α = (α₁, ..., αₙ) ∈ ℝⁿ.
- (Replace xᵢ with ϕ(xᵢ) if using a feature map ϕ.)
- Proof: suppose we have any solution of the form w = s + r, where s ∈ range(Aᵀ) and r ≠ 0 is in null(A) (i.e., Ar = 0). Then we can remove r and obtain a shorter solution, since
\[
A^\top b = A^\top A w = A^\top A (s + r) = A^\top A s + A^\top (A r) = A^\top A s .
\]
(Recall the Pythagorean theorem: ‖w‖₂² = ‖s‖₂² + ‖r‖₂², so ‖s‖₂ < ‖w‖₂ whenever r ≠ 0.)
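A quick numeric check of the claim on synthetic data: the minimum-norm solution returned by the pseudoinverse can indeed be written as Aᵀα for some α ∈ ℝⁿ.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 10))
b = rng.normal(size=6)

w = np.linalg.pinv(A) @ b                        # minimum-norm solution
alpha, *_ = np.linalg.lstsq(A.T, w, rcond=None)  # solve A^T alpha = w for alpha in R^6
print(np.allclose(A.T @ alpha, w))               # True: w = A^T alpha, i.e. w in range(A^T)
```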
Representation of minimum norm solution (2)
- In fact, the minimum Euclidean norm solution is unique!
- If two distinct solutions w and w′ had the same (minimum) length, then averaging them would give another solution ½(w + w′) of strictly shorter length, a contradiction.
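One way to spell out the "strictly shorter" step (not spelled out on the slide): since the normal equations are linear, ½(w + w′) is also a solution, and the parallelogram identity gives
\[
\Big\|\tfrac{1}{2}(w + w')\Big\|_2^2
= \tfrac{1}{2}\|w\|_2^2 + \tfrac{1}{2}\|w'\|_2^2 - \tfrac{1}{4}\|w - w'\|_2^2
< \|w\|_2^2
\]
whenever ‖w‖₂ = ‖w′‖₂ and w ≠ w′.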
Regularization
- Combine the two concerns: make both R̂(w) and ‖w‖₂² small.
- Pick λ ≥ 0, and minimize
\[
\hat{R}(w) + \lambda \|w\|_2^2
\]
- If λ > 0, the solution is always unique (even if n < d).
- This is called ridge regression.
- λ = 0 recovers OLS/ERM.
- λ controls how much attention to pay to the regularizer ‖w‖₂² relative to the data-fitting term R̂(w).
- λ is a hyperparameter to tune (e.g., using cross-validation).
- The solution is also in the span of the xᵢ's (i.e., in range(Aᵀ)).
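A minimal numpy sketch (synthetic data; variable names are placeholders) of ridge regression using the closed form ŵ = (AᵀA + λI)⁻¹Aᵀb derived on the data-augmentation slide below; if R̂(w) carries a 1/n factor, the λ passed here plays the role of nλ.

```python
import numpy as np

def ridge_regression(A, b, lam):
    """Minimize ||A w - b||_2^2 + lam * ||w||_2^2 by solving (A^T A + lam I) w = A^T b."""
    d = A.shape[1]
    # For lam > 0, A^T A + lam I is positive definite, so the solution is unique.
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Synthetic usage: n < d, yet the lam > 0 solution is unique
rng = np.random.default_rng(0)
A = rng.normal(size=(32, 64))
b = rng.normal(size=32)
w_hat = ridge_regression(A, b, lam=0.1)
```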
Example of regularization with squared norm penalty

- Trigonometric feature expansion ϕ(x) = (sin(x), cos(x), ..., sin(32x), cos(32x))
- Trade-off between fit to data and regularizer:
\[
\min_{w \in \mathbb{R}^{64}} \; \frac{1}{n} \sum_{i=1}^{n} \big( w^\top \phi(x_i) - y_i \big)^2
\;+\; \lambda \sum_{j=1}^{32} 2^j \big( w_{\sin,j}^2 + w_{\cos,j}^2 \big)
\]

[Plot: training data with squared-weighted-norm penalized solutions for λ = 0.1, λ = 0.01, and λ = 0]

Figure 2: Trading off between the data-fitting term and the regularizer

Data augmentation (1)

- Let
\[
\tilde{A} = \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix} \in \mathbb{R}^{(n+d) \times d}
\qquad \text{and} \qquad
\tilde{b} = \begin{bmatrix} b \\ 0 \end{bmatrix} \in \mathbb{R}^{n+d}
\]
- Then ‖Ãw − b̃‖₂² = R̂(w) + λ‖w‖₂² (the ridge regression objective)
- Interpretation:
  - d "fake" data points ensure the augmented Ã has rank d
  - All corresponding labels are zero
- ÃᵀÃ = AᵀA + λI and Ãᵀb̃ = Aᵀb
- So the ridge regression solution is ŵ = (AᵀA + λI)⁻¹Aᵀb

Data augmentation (2)

- Domain-specific data augmentation: e.g., image transformations

Figure 3: What data augmentations make sense for OCR digit recognition?

Lasso

- Lasso: minimize R̂(w) + λ‖w‖₁
- Here, ‖v‖₁ = Σⱼ |vⱼ|, the sum of absolute values of the vector's entries
- Also prefers short w, but with length measured using a different norm
- Tends to produce w that are sparse (i.e., have few non-zero entries), or at least are well approximated by sparse vectors
- A different inductive bias: |wᵀx − wᵀx′| ≤ ‖w‖₁ · ‖x − x′‖∞

Lasso vs ridge regression

- Example: coefficient profiles of Lasso vs ridge
- x = clinical measurements, y = level of prostate cancer antigen
- Horizontal axis: varying λ (large λ to the left, small λ to the right)
- Vertical axis: coefficient values in the Lasso and ridge solutions, for eight different features

Inductive bias from minimum ℓ₁ norm

- Theorem: Pick any w ∈ ℝᵈ and any ε ∈ (0, 1). Form w̃ ∈ ℝᵈ by keeping the ⌈1/ε²⌉ largest (by magnitude) coefficients of w and setting the remaining entries to zero. Then ‖w̃ − w‖₂ ≤ ε‖w‖₁.
- If ‖w‖₁ is small (compared to ‖w‖₂), then the theorem says w is well approximated by a sparse vector.

Sparsity

- Lasso also tries to make coefficients small. What if we only care about sparsity?
- Subset selection: minimize empirical risk among all k-sparse solutions
- Greedy algorithms: repeatedly choose new variables to "include" in the support of w until k variables are included
- Forward stepwise regression / orthogonal matching pursuit: each time a new variable is "included", re-fit all coefficients for the included variables
- Often works as well as Lasso
- Why do we care about sparsity?

Detour: Model averaging

- Suppose we have M real-valued predictors, f̂₁, ..., f̂_M
- How to take advantage of all of them?
- Model selection: pick the best one, e.g., using the hold-out method
- Model averaging: form the "ensemble" predictor f̂_avg, where for any x,
\[
\hat{f}_{\mathrm{avg}}(x) := \frac{1}{M} \sum_{j=1}^{M} \hat{f}_j(x).
\]

Risk of model averaging

- R(f) := E[(f(X) − Y)²] for some random variable (X, Y) taking values in 𝒳 × ℝ
- Theorem: For any f̂₁, ..., f̂_M : 𝒳 → ℝ, the ensemble predictor f̂_avg := (1/M) Σⱼ f̂ⱼ satisfies
\[
R(\hat{f}_{\mathrm{avg}})
= \frac{1}{M} \sum_{j=1}^{M} R(\hat{f}_j)
\;-\; \frac{1}{M} \sum_{j=1}^{M} \mathbb{E}\big[ (\hat{f}_{\mathrm{avg}}(X) - \hat{f}_j(X))^2 \big].
\]
- Better than model selection when:
  - all f̂ⱼ have similar risks, and
  - all f̂ⱼ predict very differently from each other

Stacking and features

- In model averaging, the "weights" of 1/M for all f̂ⱼ seem arbitrary
- Can "learn" the weights using linear regression!
- Use the feature expansion ϕ(x) = (f̂₁(x), ..., f̂_M(x))
- Called stacking
- Use additional data (independent of f̂₁, ..., f̂_M)
- Upshot: any function (even a learned function) can be a feature
- Conversely: behind every feature is a deliberate modeling choice

Detour: Bayesian statistics

- Bayesian inference: a probabilistic approach to updating beliefs
- Posit a (parametric) statistical model for data (likelihood)
- Start with some beliefs about the parameters of the model (prior)
- Update beliefs after seeing data (posterior):
\[
\underbrace{\Pr(w \mid \text{data})}_{\text{posterior}(w)}
= \frac{1}{Z_{\text{data}}} \, \underbrace{\Pr(w)}_{\text{prior}(w)} \cdot \underbrace{\Pr(\text{data} \mid w)}_{\text{likelihood}(w)}
\]
- (Finding the normalization constant Z_data is often the computationally challenging part of belief updating.)
- Basis for reasoning in humans (maybe?), robots, etc.

Beyond Bayesian inference

- Can use the Bayesian inference framework to design estimation/learning algorithms (even if you aren't a Bayesian!)
- E.g., instead of computing the entire posterior distribution, find the w with highest posterior probability
- Called the maximum a posteriori (MAP) estimator
- Just find w to maximize prior(w) × likelihood(w)
- (Avoids the issue of finding the normalization constant.)

Bayesian approach to linear regression

- In the linear regression model, express prior belief about w = (w₁, ..., w_d) using a probability distribution with a density function
- Simple choice:
\[
\mathrm{prior}(w_1, \ldots, w_d) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big( -\frac{w_j^2}{2\sigma^2} \Big)
\]
- I.e., treat w₁, ..., w_d as independent N(0, σ²) random variables
- Likelihood model: (X₁, Y₁), ..., (Xₙ, Yₙ) are conditionally independent given w, and Yᵢ | (Xᵢ, w) ∼ N(Xᵢᵀw, 1)
- What is the MAP?

MAP for Bayesian linear regression

- Find w to maximize
\[
\underbrace{\prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big( -\frac{w_j^2}{2\sigma^2} \Big)}_{\text{prior}(w)}
\cdot
\underbrace{\prod_{i=1}^{n} p(x_i) \, \frac{1}{\sqrt{2\pi}} \exp\!\big( -(y_i - x_i^\top w)^2 / 2 \big)}_{\text{likelihood}(w)} .
\]
(Here, p is the marginal density of X; its form is unimportant.)
- Take the logarithm and omit terms not involving w:
\[
-\frac{1}{2\sigma^2} \sum_{j=1}^{d} w_j^2 \;-\; \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i^\top w)^2 .
\]
- For σ² = 1/(nλ), maximizing this is the same as minimizing
\[
\frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2 + \lambda \|w\|_2^2 ,
\]
which is the ridge regression objective!

Example: Dartmouth data

- Dartmouth data example, where we considered intervals for the HS GPA variable: (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], ...
- Use ϕ(x) = (1{x ∈ (0.00, 0.25]}, 1{x ∈ (0.25, 0.50]}, ...) with a linear function
- Regularization: λ Σⱼ (wⱼ − µ)², where µ = 2.46 is the mean of the College GPA values
- What's the Bayesian interpretation of minimizing the following objective?
\[
\frac{1}{n} \sum_{i=1}^{n} \big( \phi(x_i)^\top w - y_i \big)^2 + \lambda \sum_{j=1}^{d} (w_j - \mu)^2
\]
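A minimal numpy sketch of minimizing the objective on this last slide, assuming synthetic indicator features and synthetic College GPA values (the actual Dartmouth data are not reproduced here); setting the gradient to zero gives a closed form analogous to ridge regression, but shrinking the coefficients toward µ instead of toward 0.

```python
import numpy as np

def ridge_toward_mean(Phi, y, lam, mu):
    """Minimize (1/n) * ||Phi w - y||^2 + lam * ||w - mu * 1||^2.

    Setting the gradient to zero gives
        (Phi^T Phi + n * lam * I) w = Phi^T y + n * lam * mu * 1.
    """
    n, d = Phi.shape
    lhs = Phi.T @ Phi + n * lam * np.eye(d)
    rhs = Phi.T @ y + n * lam * mu * np.ones(d)
    return np.linalg.solve(lhs, rhs)

# Hypothetical usage: one-hot features for HS GPA bins, shrinking toward mu = 2.46
rng = np.random.default_rng(0)
Phi = np.eye(16)[rng.integers(0, 16, size=100)]   # each row indicates one GPA bin
y = rng.normal(loc=2.46, scale=0.4, size=100)     # synthetic College GPA values
w_hat = ridge_toward_mean(Phi, y, lam=0.5, mu=2.46)
```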