Machine learning lecture slides
COMS 4771 Fall 2020
Regression II: Regularization
Outline
Inductive biases in linear regression
Regularization
Model averaging
Bayesian interpretation of regularization
Inductive bias
In linear regression, the least squares solution may be non-unique, in which case there are infinitely many solutions.
Which one should we pick?
Possible answer: Pick the shortest solution, i.e., the one of minimum (squared) Euclidean norm ∥w∥₂².
Small norm ⇒ small changes in output in response to changes in input:
|w^T x − w^T x′| ≤ ∥w∥₂ · ∥x − x′∥₂
(left-hand side: change in output; right-hand side: change in input; an easy consequence of Cauchy–Schwarz)
Note: the data itself gives no reason to prefer a shorter w over a longer one.
Preference for short w is an example of an inductive bias.
All learning algorithms encode some form of inductive bias.
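A minimal numpy sketch (synthetic data, not from the slides): an underdetermined least squares problem has infinitely many solutions, and np.linalg.pinv returns the one of minimum Euclidean norm.

```python
# Sketch (synthetic data): with more features than examples, the normal
# equations have infinitely many solutions; np.linalg.pinv picks the one
# of minimum Euclidean norm.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer examples than features
A = rng.normal(size=(n, d))       # rows are the x_i's
b = rng.normal(size=n)            # labels

w_min = np.linalg.pinv(A) @ b     # minimum-norm least squares solution

# Build a different solution by adding a vector from null(A):
# any r with A r = 0 leaves the predictions A w unchanged.
_, _, Vt = np.linalg.svd(A)
r = Vt[-1]                        # a null-space direction (rank(A) <= n < d)
w_other = w_min + 3.0 * r

print(np.allclose(A @ w_min, A @ w_other))             # True: same fit on the data
print(np.linalg.norm(w_min), np.linalg.norm(w_other))  # w_min is shorter
```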
Example of minimum norm inductive bias
Trigonometric feature expansion
φ(x) = (sin(x), cos(x), …, sin(32x), cos(32x)) ∈ R^64
n = 32 training examples
Infinitely many solutions to normal equations
Figure 1: Fitted linear models with trigonometric feature expansion, comparing an arbitrary solution, the least Euclidean norm solution, and the least weighted norm solution against the training data.
Representation of minimum norm solution (1)
Claim: The minimum (Euclidean) norm solution to the normal equations lives in the span of the x_i's (i.e., in range(A^T)).
I.e., we can write
w = A^T α = Σ_{i=1}^n α_i x_i
for some α = (α_1, …, α_n) ∈ R^n.
(Replace x_i with φ(x_i) if using a feature map φ.)
Proof: Suppose we have a solution of the form w = s + r, where s ∈ range(A^T) and r ≠ 0 is in null(A) (i.e., Ar = 0). Then
A^T b = A^T A w = A^T A(s + r) = A^T A s + A^T(Ar) = A^T A s,
so s is also a solution to the normal equations. By the Pythagorean theorem, ∥w∥₂² = ∥s∥₂² + ∥r∥₂² > ∥s∥₂², so removing r gives a strictly shorter solution.
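A small numpy check of the claim (synthetic data; the matrix sizes are arbitrary): the minimum-norm solution equals A^T α, with α solving (A A^T) α = b, so it lies in the span of the rows of A.

```python
# Sketch (synthetic data): the minimum-norm solution can be written as
# w = A^T alpha, i.e., it lies in the span of the training points x_i.
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 30
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ b

# Solve (A A^T) alpha = b, then map back through A^T.
alpha = np.linalg.solve(A @ A.T, b)   # A A^T is n x n and (generically) invertible here
w_span = A.T @ alpha

print(np.allclose(w_min, w_span))     # True: same vector
```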
Representation of minimum norm solution (2)
In fact, the minimum Euclidean norm solution is unique!
If two distinct solutions w and w′ have the same length, then averaging them gives another solution, ½(w + w′), of strictly shorter length.
Regularization
Combine two concerns: making both R(w) and ∥w∥₂² small.
Pick λ ≥ 0, and minimize
R(w) + λ∥w∥₂²
If λ > 0, the solution is always unique (even if n < d).
Called ridge regression.
λ = 0 recovers OLS/ERM.
λ controls how much attention to pay to the regularizer ∥w∥₂² relative to the data fitting term R(w).
λ is a hyperparameter to tune (e.g., using cross-validation; a sketch follows below).
The solution also lies in the span of the x_i's (i.e., in range(A^T)).
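A minimal sketch (synthetic data; the ridge helper and the hold-out split are illustrative assumptions, not from the slides): ridge regression via the closed form (A^T A + λI)^{−1} A^T b, with λ tuned on a validation set. It uses the unnormalized data-fitting term ∥Aw − b∥₂²; if R(w) is the averaged risk (1/n)∥Aw − b∥₂², replace λ with nλ in the closed form.

```python
# Sketch (synthetic data): ridge regression via its closed form,
# w_hat = (A^T A + lam*I)^{-1} A^T b, with lam chosen on a hold-out set.
import numpy as np

def ridge(A, b, lam):
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(2)
n, d = 100, 40
A = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:5] = 1.0
b = A @ w_true + rng.normal(scale=0.5, size=n)

# Hold-out split for tuning the hyperparameter lam.
A_tr, b_tr = A[:70], b[:70]
A_va, b_va = A[70:], b[70:]

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = ridge(A_tr, b_tr, lam)
    err = np.mean((A_va @ w - b_va) ** 2)
    print(f"lam={lam:6.1f}  validation MSE={err:.3f}")
```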
Example of regularization with squared norm penalty
Trigonometric feature expansion
φ(x) = (sin(x), cos(x), …, sin(32x), cos(32x))
Trade-off between fit to data and regularizer:
min_{w ∈ R^64}  (1/n) Σ_{i=1}^n (w^T φ(x_i) − y_i)² + λ Σ_{j=1}^{32} 2^j (w_{sin,j}² + w_{cos,j}²)
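A sketch of this weighted-penalty fit (synthetic data; the 2^j weights follow the reconstruction of the objective above, and trig_features is a hypothetical helper): the minimizer solves (Φ^T Φ + nλD) w = Φ^T y, where D is the diagonal matrix of per-coefficient weights.

```python
# Sketch (synthetic data): weighted squared-norm penalty with weight 2^j on
# the frequency-j coefficients, solved in closed form.
import numpy as np

def trig_features(x, J=32):
    # phi(x) = (sin(x), cos(x), ..., sin(Jx), cos(Jx)) for each x
    cols = []
    for j in range(1, J + 1):
        cols.append(np.sin(j * x))
        cols.append(np.cos(j * x))
    return np.column_stack(cols)                    # shape (len(x), 2J)

rng = np.random.default_rng(3)
n, J = 32, 32
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)            # hypothetical training data

Phi = trig_features(x, J)
weights = np.repeat(2.0 ** np.arange(1, J + 1), 2)  # weight 2^j for (sin_j, cos_j)
D = np.diag(weights)

lam = 0.01
w = np.linalg.solve(Phi.T @ Phi + n * lam * D, Phi.T @ y)
print(w[:4])   # low-frequency coefficients dominate the fit
```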
Figure 2: Trading off between the data fitting term and the regularizer; squared weighted norm penalized solutions for λ = 0.1, λ = 0.01, and λ = 0, shown with the training data.
Data augmentation (1)
Let Ã = [A; √λ·I] ∈ R^{(n+d)×d} (A stacked on top of √λ·I) and b̃ = [b; 0] ∈ R^{n+d}.
Then ∥Ãw − b̃∥₂² = R(w) + λ∥w∥₂² (the ridge regression objective).
Interpretation:
The d “fake” data points ensure the augmented matrix Ã has rank d.
All corresponding labels are zero.
Ã^T Ã = A^T A + λI and Ã^T b̃ = A^T b
So the ridge regression solution is ŵ = (A^T A + λI)^{−1} A^T b.
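A quick numpy check (synthetic data): ordinary least squares on the augmented (Ã, b̃) matches the ridge closed form.

```python
# Sketch (synthetic data): least squares on the augmented (A_tilde, b_tilde)
# gives the same w as the ridge closed form (A^T A + lam*I)^{-1} A^T b.
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 50, 10, 2.0
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# d "fake" rows sqrt(lam)*I with labels 0
A_tilde = np.vstack([A, np.sqrt(lam) * np.eye(d)])
b_tilde = np.concatenate([b, np.zeros(d)])

w_aug, *_ = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

print(np.allclose(w_aug, w_ridge))   # True
```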
Data augmentation (2)
Domain-specific data augmentation: e.g., image transformations
Figure 3: What data augmentations make sense for OCR digit recognition?
Lasso
Lasso: minimize R(w) + λ∥w∥₁
Here, ∥v∥₁ = Σ_j |v_j|, the sum of absolute values of the vector's entries.
Prefers short w, where length is measured using a different norm.
Tends to produce w that are sparse (i.e., have few non-zero entries), or at least are well-approximated by sparse vectors.
A different inductive bias:
|w^T x − w^T x′| ≤ ∥w∥₁ · ∥x − x′∥_∞
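A sketch using scikit-learn (synthetic data with 5 relevant features out of 50): the Lasso solution is sparse while the ridge solution is dense. Note that sklearn's Lasso minimizes (1/(2n))∥y − Xw∥₂² + α∥w∥₁, a rescaling of the objective above, so its α plays the role of λ.

```python
# Sketch (synthetic data): Lasso produces sparse coefficients; ridge does not.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:5] = 2.0          # only 5 relevant features
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("nonzero Lasso coefficients:", np.sum(lasso.coef_ != 0))   # few (sparse)
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all 50
```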
Lasso vs ridge regression
Example: coefficient profile of Lasso vs ridge
x = clinical measurements, y = level of prostate cancer antigen.
Horizontal axis: varying λ (large λ to the left, small λ to the right).
Vertical axis: coefficient value in the Lasso and ridge solutions, for eight different features.
Inductive bias from minimum ℓ₁ norm
Theorem: Pick any w ∈ R^d and any ε ∈ (0, 1). Form w̃ ∈ R^d by keeping the ⌈1/ε²⌉ largest (by magnitude) coefficients of w and setting the remaining entries to zero. Then
∥w̃ − w∥₂ ≤ ε∥w∥₁.
If ∥w∥₁ is small (compared to ∥w∥₂), then the theorem says w is well-approximated by a sparse vector.
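A small numerical check of the theorem (random w; the Laplace draw is an arbitrary choice): keep the ⌈1/ε²⌉ largest-magnitude entries and compare the tail's Euclidean norm to ε∥w∥₁.

```python
# Sketch: checking the bound numerically. Keep the ceil(1/eps^2) largest
# coefficients of w; the Euclidean norm of what is dropped is at most
# eps * ||w||_1.
import numpy as np

rng = np.random.default_rng(6)
d = 1000
w = rng.laplace(size=d)            # arbitrary vector
eps = 0.2
k = int(np.ceil(1.0 / eps**2))     # number of coefficients to keep

w_tilde = np.zeros(d)
top = np.argsort(np.abs(w))[-k:]   # indices of the k largest |w_j|
w_tilde[top] = w[top]

lhs = np.linalg.norm(w_tilde - w)           # ||w_tilde - w||_2
rhs = eps * np.linalg.norm(w, ord=1)        # eps * ||w||_1
print(lhs <= rhs)                           # True
```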
Sparsity
Lasso also tries to make coefficients small. What if we only care about sparsity?
Subset selection: minimize empirical risk among all k-sparse solutions
Greedy algorithms: repeatedly choose new variables to “include” in support of w until k variables are included.
Forward stepwise regression / orthogonal matching pursuit: each time you “include” a new variable, re-fit all coefficients for the included variables (see the sketch below).
Often works as well as Lasso.
Why do we care about sparsity?
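A sketch of greedy forward selection in the spirit of forward stepwise regression / orthogonal matching pursuit (synthetic data; forward_stepwise is an illustrative helper, not a reference implementation).

```python
# Sketch (synthetic data): at each step, add the feature most correlated
# with the current residual, then re-fit least squares on all selected features.
import numpy as np

def forward_stepwise(X, y, k):
    n, d = X.shape
    selected = []
    residual = y.copy()
    for _ in range(k):
        # score each unused feature by |correlation with the residual|
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf
        selected.append(int(np.argmax(scores)))
        # re-fit all coefficients for the selected features
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    w = np.zeros(d)
    w[selected] = coef
    return w

rng = np.random.default_rng(7)
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = forward_stepwise(X, y, k=3)
print(np.nonzero(w_hat)[0])    # usually recovers features 3, 17, 41
```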
Detour: Model averaging
Suppose we have M real-valued predictors, f̂_1, …, f̂_M.
How to take advantage of all of them?
Model selection: pick the best one, e.g., using the hold-out method.
Model averaging: form the “ensemble” predictor f̂_avg, where for any x,
f̂_avg(x) := (1/M) Σ_{j=1}^M f̂_j(x).
Risk of model averaging
R(f) := E[(f(X) − Y)²] for some random variable (X, Y) taking values in X × R.
Theorem: For any f̂_1, …, f̂_M : X → R, the ensemble predictor f̂_avg := (1/M) Σ_{j=1}^M f̂_j satisfies
R(f̂_avg) = (1/M) Σ_{j=1}^M R(f̂_j) − (1/M) Σ_{j=1}^M E[(f̂_avg(X) − f̂_j(X))²].
Better than model selection when all the f̂_j have similar risks, and the f̂_j predict very differently from each other.
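A quick numerical check of the identity (random predictions standing in for the f̂_j): it holds exactly even with expectations replaced by sample averages, since it follows from a pointwise algebraic identity.

```python
# Sketch (synthetic predictors): the theorem's identity, verified with
# sample averages in place of expectations.
import numpy as np

rng = np.random.default_rng(8)
M, N = 5, 1000
preds = rng.normal(size=(M, N))        # f_j(X) for M predictors on N samples
y = rng.normal(size=N)                 # corresponding Y values

f_avg = preds.mean(axis=0)

risk = lambda p: np.mean((p - y) ** 2)
lhs = risk(f_avg)
rhs = np.mean([risk(p) for p in preds]) - np.mean((preds - f_avg) ** 2)
print(np.isclose(lhs, rhs))            # True
```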
Stacking and features
In model averaging, “weights” of 1/M for all f̂_j seem arbitrary.
Can “learn” the weights using linear regression!
Use feature expansion φ(x) = (f̂_1(x), …, f̂_M(x)).
Called stacking (see the sketch below).
Use additional data (independent of f̂_1, …, f̂_M).
Upshot: any function (even a learned function) can be a feature.
Conversely: behind every feature is a deliberate modeling choice.
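A sketch of stacking (synthetic data; the two base predictors and the three-way split are illustrative assumptions): base models are fit on one split, and their predictions become features for a linear regression fit on independent data.

```python
# Sketch (synthetic data): stacking. Base predictors are fit on one split;
# their predictions become features for a linear regression fit on a
# separate split, which learns the combination weights.
import numpy as np

rng = np.random.default_rng(9)
n, d = 300, 5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)

X_base, y_base = X[:100], y[:100]          # data for fitting the base predictors
X_stack, y_stack = X[100:200], y[100:200]  # independent data for the weights
X_test, y_test = X[200:], y[200:]

# Two simple base predictors: least squares on different feature subsets.
w1, *_ = np.linalg.lstsq(X_base[:, :2], y_base, rcond=None)
w2, *_ = np.linalg.lstsq(X_base[:, 2:], y_base, rcond=None)
f1 = lambda X: X[:, :2] @ w1
f2 = lambda X: X[:, 2:] @ w2

# Stacking: feature expansion phi(x) = (f1(x), f2(x)), then linear regression.
Phi = np.column_stack([f1(X_stack), f2(X_stack)])
beta, *_ = np.linalg.lstsq(Phi, y_stack, rcond=None)

Phi_test = np.column_stack([f1(X_test), f2(X_test)])
print("learned weights:", beta)
print("stacked test MSE:", np.mean((Phi_test @ beta - y_test) ** 2))
```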
Detour: Bayesian statistics
Bayesian inference: probabilistic approach to updating beliefs
Posit a (parametric) statistical model for data (likelihood)
Start with some beliefs about the parameters of model (prior)
Update beliefs after seeing data (posterior)
posterior(w) = Pr(w | data) = (1/Z_data) · Pr(w) · Pr(data | w)
where Pr(w) is the prior and Pr(data | w) is the likelihood.
(Finding normalization constant Zdata is often the computationally challenging part of belief updating.)
Basis for reasoning in humans (maybe?), robots, etc.
Beyond Bayesian inference
Can use Bayesian inference framework for designing estimation/learning algorithms (even if you aren’t a Bayesian!)
E.g., instead of computing entire posterior distribution, find the
w with highest posterior probability
Called maximum a posteriori (MAP) estimator
Just find w to maximize
prior(w) × likelihood(w).
(Avoids issue with finding normalization constant.)
Bayesian approach to linear regression
In the linear regression model, express prior beliefs about w = (w_1, …, w_d) using a probability distribution with a density function.
Simple choice: prior(w_1, …, w_d) = Π_{j=1}^d (1/√(2πσ²)) exp(−w_j²/(2σ²))
I.e., treat w_1, …, w_d as independent N(0, σ²) random variables.
Likelihood model: (X_1, Y_1), …, (X_n, Y_n) are conditionally independent given w, and Y_i | (X_i, w) ∼ N(X_i^T w, 1).
What is the MAP?
MAP for Bayesian linear regression
Find w to maximize
[Π_{j=1}^d (1/√(2πσ²)) exp(−w_j²/(2σ²))] · [Π_{i=1}^n p(x_i) · (1/√(2π)) exp(−(y_i − x_i^T w)²/2)],
i.e., prior(w) · likelihood(w). (Here, p is the marginal density of X; unimportant.)
Take the logarithm and omit terms not involving w:
−(1/(2σ²)) Σ_{j=1}^d w_j² − (1/2) Σ_{i=1}^n (y_i − x_i^T w)².
For σ² = 1/(nλ), this is the same as minimizing
(1/n) Σ_{i=1}^n (x_i^T w − y_i)² + λ∥w∥₂²,
which is the ridge regression objective!
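A numerical check (synthetic data; sigma2 is an arbitrary prior variance): minimizing the negative log posterior with scipy recovers the ridge solution with λ = 1/(nσ²).

```python
# Sketch (synthetic data): numerically maximizing the posterior above
# (i.e., minimizing the negative log posterior) recovers the ridge solution
# with lambda = 1/(n*sigma^2).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
n, d = 60, 8
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
sigma2 = 0.05                                   # prior variance (assumed value)

def neg_log_posterior(w):
    # up to additive constants not involving w
    return 0.5 / sigma2 * w @ w + 0.5 * np.sum((b - A @ w) ** 2)

w_map = minimize(neg_log_posterior, np.zeros(d)).x

lam = 1.0 / (n * sigma2)
w_ridge = np.linalg.solve(A.T @ A + n * lam * np.eye(d), A.T @ b)

print(np.allclose(w_map, w_ridge, atol=1e-4))   # True (up to optimizer tolerance)
```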
Example: Dartmouth data example
Dartmouth data example, where we considered intervals for the HS GPA variable:
(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·
Use φ(x) = (1{x∈(0.00,0.25]}, 1{x∈(0.25,0.50]}, . . . ) with a linear
function
Regularization: λ Σ_{j=1}^d (w_j − μ)², where μ = 2.46 is the mean of the College GPA values.
What's the Bayesian interpretation of minimizing the following objective?
(1/n) Σ_{i=1}^n (φ(x_i)^T w − y_i)² + λ Σ_{j=1}^d (w_j − μ)²
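A sketch of one possible reading (synthetic stand-in for the Dartmouth data; the bin counts and noise level are made up): the penalty shrinks coefficients toward μ, and substituting v = w − μ·1 reduces it to ordinary ridge regression on shifted targets, since each φ(x) is a one-hot interval indicator.

```python
# Sketch (synthetic stand-in for the Dartmouth data): ridge regression that
# shrinks coefficients toward mu instead of toward 0. With one-hot features,
# Phi @ 1 is the all-ones vector, so substituting v = w - mu*1 gives an
# ordinary ridge problem on the shifted targets y - mu.
import numpy as np

rng = np.random.default_rng(11)
n, d, lam, mu = 200, 16, 0.1, 2.46      # mu = mean College GPA from the slide

bins = rng.integers(0, d, size=n)       # which HS GPA interval each student falls in
Phi = np.eye(d)[bins]                   # one-hot feature expansion phi(x_i)
y = mu + 0.3 * rng.normal(size=n)       # hypothetical College GPA values

# Solve for v = w - mu*1 with an ordinary ridge objective, then shift back.
v = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), Phi.T @ (y - mu))
w = v + mu

print(w[:5])    # coefficients pulled toward mu = 2.46 (exactly mu for empty bins)
```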