Machine learning lecture slides
COMS 4771 Fall 2020
0/22
Regression II: Regularization
Outline
I Inductive biases in linear regression
I Regularization
I Model averaging
I Bayesian interpretation of regularization
1/22
Inductive bias
I In linear regression, it is possible for the least squares solution to be non-unique, in which case there are infinitely many solutions.
I Which one should we pick?
I Possible answer: pick the shortest solution, i.e., the one of minimum (squared) Euclidean norm ‖w‖₂².
I Small norm ⇒ small changes in output in response to changes in input (an easy consequence of Cauchy–Schwarz):
\[
\underbrace{|w^\top x - w^\top x'|}_{\text{change in output}} \;\le\; \|w\|_2 \cdot \underbrace{\|x - x'\|_2}_{\text{change in input}}
\]
I Note: the data alone does not give a reason to choose a shorter w over a longer one.
I Preference for short w is an example of an inductive bias.
I All learning algorithms encode some form of inductive bias.
2/22
Example of minimum norm inductive bias
I Trigonometric feature expansion
φ(x) = (sin(x), cos(x), …, sin(32x), cos(32x)) ∈ ℝ⁶⁴
I n = 32 training examples
I Infinitely many solutions to normal equations
Figure 1: Fitted linear models with trigonometric feature expansion, showing an arbitrary solution, the least Euclidean norm solution, the least weighted norm solution, and the training data.
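To make this example concrete, here is a minimal numpy sketch. The training data below is synthetic (not the data from the figure), and the features are ordered as all sines followed by all cosines rather than interleaved. For this rank-deficient system, `np.linalg.lstsq` returns the minimum Euclidean norm solution among the infinitely many interpolating solutions.

```python
import numpy as np

def trig_features(x, k=32):
    """Map scalar inputs to (sin(jx), cos(jx)) for j = 1..k, giving 2k features."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    j = np.arange(1, k + 1)
    return np.hstack([np.sin(x * j), np.cos(x * j)])

rng = np.random.default_rng(0)
n = 32
x_train = rng.uniform(-3, 3, size=n)                           # synthetic inputs (placeholder data)
y_train = np.sin(2 * x_train) + 0.1 * rng.standard_normal(n)   # synthetic labels

A = trig_features(x_train)          # n x 64 design matrix: rank <= 32 < 64
# Infinitely many w satisfy the normal equations; lstsq returns the one of
# minimum Euclidean norm (equivalently, np.linalg.pinv(A) @ y_train).
w_min_norm, *_ = np.linalg.lstsq(A, y_train, rcond=None)

print("training error:", np.mean((A @ w_min_norm - y_train) ** 2))
print("norm of solution:", np.linalg.norm(w_min_norm))
```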
3/22
Representation of minimum norm solution (1)
I Claim: The minimum (Euclidean) norm solution to the normal equations lives in the span of the xᵢ's (i.e., in range(Aᵀ)).
I I.e., we can write
\[
w = A^\top \alpha = \sum_{i=1}^{n} \alpha_i x_i
\]
for some α = (α₁, …, αₙ) ∈ ℝⁿ.
I (Replace xᵢ with φ(xᵢ) if using a feature map φ.)
I Proof: If we have any solution of the form w = s + r, where s ∈ range(Aᵀ) and r ≠ 0 is in null(A) (i.e., Ar = 0), then we can remove r and get a shorter solution s, since
\[
A^\top b = A^\top A w = A^\top A (s + r) = A^\top A s + A^\top (A r) = A^\top A s.
\]
(Recall the Pythagorean theorem: ‖w‖₂² = ‖s‖₂² + ‖r‖₂², so ‖s‖₂ < ‖w‖₂ when r ≠ 0.) A numerical check follows below.
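Here is a quick numerical check of the claim, assuming A has full row rank so that AAᵀ is invertible: the minimum-norm solution equals Aᵀα with α solving (AAᵀ)α = b, and it matches the pseudoinverse solution. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 25                       # underdetermined: more unknowns than equations
A = rng.standard_normal((n, d))     # full row rank with probability 1
b = rng.standard_normal(n)

# Minimum-norm solution via the pseudoinverse.
w_min = np.linalg.pinv(A) @ b

# Representation in the row space: w = A^T alpha with alpha solving (A A^T) alpha = b.
alpha = np.linalg.solve(A @ A.T, b)
w_rowspace = A.T @ alpha

print(np.allclose(w_min, w_rowspace))   # True: the min-norm solution lies in range(A^T)
print(np.allclose(A @ w_min, b))        # True: it is indeed a solution
```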
4/22
Representation of minimum norm solution (2)
I In fact, the minimum Euclidean norm solution is unique!
I If two distinct solutions w and w′ have the same length, then averaging them gives another solution ½(w + w′) of shorter length.
5/22
Regularization
I Combine two concerns: making both R̂(w) and ‖w‖₂² small
I Pick λ ≥ 0, and minimize
\[
\hat{R}(w) + \lambda \|w\|_2^2
\]
I If λ > 0, the solution is always unique (even if n < d).
I Called ridge regression.
I λ = 0 is OLS/ERM.
I λ controls how much attention to pay to the regularizer ‖w‖₂² relative to the data fitting term R̂(w)
I λ is a hyperparameter to tune (e.g., using cross-validation)
I Solution is also in the span of the xᵢ's (i.e., in range(Aᵀ)); a short numpy sketch follows below
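To make the objective concrete, here is a minimal numpy sketch of ridge regression on synthetic data. It follows the sum-of-squares convention used on the data augmentation slide below (R̂(w) = ‖Aw − b‖₂², so the minimizer is (AᵀA + λI)⁻¹Aᵀb); the function name `ridge` and the data are mine, not from the slides.

```python
import numpy as np

def ridge(A, b, lam):
    """Minimize ||A w - b||_2^2 + lam * ||w||_2^2 (sum-of-squares convention)."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(2)
n, d = 20, 50                        # n < d: the OLS solution is not unique, ...
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

w_ridge = ridge(A, b, lam=0.1)       # ... but for lam > 0 the ridge solution is unique
print(np.linalg.norm(w_ridge))
```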
6/22
Definition (from Wikipedia)
“[...] regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting”
Example of regularization with squared norm penalty
I Trigonometric feature expansion
  φ(x) = (sin(x), cos(x), …, sin(32x), cos(32x))
I Trade-off between fit to data and regularizer:
\[
\min_{w \in \mathbb{R}^{64}} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(w^\top \varphi(x_i) - y_i\bigr)^2 \;+\; \lambda \sum_{j=1}^{32} 2^j \bigl(w_{\sin,j}^2 + w_{\cos,j}^2\bigr)
\]
7/22
Figure 2: Trading off between the data fitting term and the regularizer, showing the squared weighted norm penalized solutions for λ = 0.1, λ = 0.01, and λ = 0, along with the training data.
8/22
Data augmentation (1)
I Let
\[
\tilde{A} = \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix} \in \mathbb{R}^{(n+d)\times d}
\quad\text{and}\quad
\tilde{b} = \begin{bmatrix} b \\ 0 \end{bmatrix} \in \mathbb{R}^{n+d}
\]
I Then ‖Ãw − b̃‖₂² = R̂(w) + λ‖w‖₂² (the ridge regression objective)
I Interpretation: d “fake” data points ensure the augmented Ã has rank d
I All corresponding labels are zero.
I ÃᵀÃ = AᵀA + λI and Ãᵀb̃ = Aᵀb
I So the ridge regression solution is ŵ = (AᵀA + λI)⁻¹Aᵀb (checked numerically in the sketch below)
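A quick numerical check of the augmentation trick (synthetic data, variable names mine): solving ordinary least squares on the augmented system (Ã, b̃) matches the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 15, 40, 0.5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Augmented design: d "fake" points sqrt(lam) * e_j, each with label 0.
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(d)])
b_aug = np.concatenate([b, np.zeros(d)])

w_via_augmentation = np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]
w_closed_form = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

print(np.allclose(w_via_augmentation, w_closed_form))   # True
```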
9/22
Data augmentation (2)
I Domain-specific data augmentation: e.g., image transformations
Figure 3: What data augmentations make sense for OCR digit recognition?
10/22
Lasso
I Lasso: minimize R̂(w) + λ‖w‖₁
I Here, ‖v‖₁ = ∑ⱼ |vⱼ| is the sum of absolute values of the vector's entries
I Prefers short w, where length is measured using a different norm
I Tends to produce w that are sparse (i.e., have few non-zero entries), or at least are well-approximated by sparse vectors (see the example below).
I A different inductive bias, by Hölder's inequality:
\[
|w^\top x - w^\top x'| \;\le\; \|w\|_1 \cdot \|x - x'\|_\infty
\]
  (here ‖x − x′‖_∞ = maxᵢ |xᵢ − x′ᵢ| is the max norm)
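A small illustration of the sparsity effect using scikit-learn on synthetic data (the data and model choices are mine, not from the slides). Note that sklearn's `Lasso` minimizes (1/(2n))‖y − Xw‖₂² + α‖w‖₁, so its `alpha` plays the role of λ only up to the scaling of the data-fitting term.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
n, d = 100, 30
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]                  # only 3 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=0.1, fit_intercept=False).fit(X, y)

print("nonzeros in lasso solution:", np.sum(lasso.coef_ != 0))   # typically ~3
print("nonzeros in ridge solution:", np.sum(ridge.coef_ != 0))   # typically all 30
```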
11/22
Lasso vs ridge regression
I Example: coefficient profile of Lasso vs ridge
I x = clinical measurements, y = level of prostate cancer antigen
I Horizontal axis: varying λ (large λ to the left, small λ to the right).
I Vertical axis: coefficient value in the Lasso and ridge solutions, for eight different features
12/22
Inductive bias from minimum ℓ₁ norm
I Theorem: Pick any w ∈ ℝᵈ and any ε ∈ (0, 1). Form w̃ ∈ ℝᵈ by including the ⌈1/ε²⌉ largest (by magnitude) coefficients of w, and setting the remaining entries to zero. Then
\[
\|\tilde{w} - w\|_2 \le \varepsilon \|w\|_1 .
\]
I If ‖w‖₁ is small (compared to ‖w‖₂), then the theorem says w is well-approximated by a sparse vector (a numerical check follows below).
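A numerical illustration of the theorem (the vector w below is arbitrary synthetic data): keep the ⌈1/ε²⌉ largest-magnitude coefficients and check the bound.

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps = 1000, 0.2
w = rng.standard_normal(d) / (1 + np.arange(d))   # arbitrary vector with decaying entries

k = int(np.ceil(1 / eps**2))                      # number of coefficients to keep
w_sparse = np.zeros(d)
top = np.argsort(np.abs(w))[-k:]                  # indices of the k largest magnitudes
w_sparse[top] = w[top]

lhs = np.linalg.norm(w_sparse - w)                # ||w_tilde - w||_2
rhs = eps * np.sum(np.abs(w))                     # eps * ||w||_1
print(lhs <= rhs)                                 # True, as the theorem guarantees
```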
13/22
Sparsity
I Lasso also tries to make coefficients small. What if we only care about sparsity?
I Subset selection: minimize empirical risk among all k-sparse solutions
I Greedy algorithms: repeatedly choose new variables to “include” in the support of w until k variables are included.
I Forward stepwise regression / orthogonal matching pursuit: each time you “include” a new variable, re-fit all coefficients for the included variables (a minimal sketch follows below).
I Often works as well as Lasso
I Why do we care about sparsity?
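Below is a minimal greedy forward-selection sketch in the spirit of orthogonal matching pursuit, on synthetic data; it is an illustration under my own simplifications, not the exact algorithm from the course. At each step it adds the column most correlated with the current residual, then re-fits least squares on all included columns.

```python
import numpy as np

def greedy_forward_selection(A, b, k):
    """Greedily build a k-sparse least squares solution (OMP-style sketch)."""
    n, d = A.shape
    support, residual = [], b.copy()
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        scores = np.abs(A.T @ residual)
        scores[support] = -np.inf                 # don't re-pick included columns
        support.append(int(np.argmax(scores)))
        # Re-fit all coefficients on the included columns.
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    w = np.zeros(d)
    w[support] = coef
    return w

rng = np.random.default_rng(6)
A = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
b = A @ w_true + 0.05 * rng.standard_normal(100)
print(np.nonzero(greedy_forward_selection(A, b, k=3))[0])   # often recovers {3, 17, 41}
```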
14/22
Detour: Model averaging
I Suppose we have M real-valued predictors, f̂₁, …, f̂_M
I How to take advantage of all of them?
I Model selection: pick the best one, e.g., using hold-out method
I Model averaging: form the “ensemble” predictor f̂_avg, where for any x,
\[
\hat{f}_{\mathrm{avg}}(x) := \frac{1}{M} \sum_{j=1}^{M} \hat{f}_j(x).
\]
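A tiny sketch of the ensemble predictor: given a list of fitted real-valued predictors (any callables), f̂_avg just averages their outputs pointwise. The three toy predictors below are placeholders, not models from the course.

```python
import numpy as np

def make_average_predictor(predictors):
    """Return f_avg with f_avg(x) = (1/M) * sum_j f_j(x) for a list of predictors."""
    def f_avg(x):
        return np.mean([f(x) for f in predictors], axis=0)
    return f_avg

# Toy stand-ins for M = 3 fitted predictors.
predictors = [lambda x: 2.0 * x, lambda x: x + 1.0, np.sin]
f_avg = make_average_predictor(predictors)
print(f_avg(np.array([0.0, 1.0, 2.0])))
```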
15/22
Risk of model averaging
I R(f) := E[(f(X) − Y)²] for some random variable (X, Y) taking values in 𝒳 × ℝ.
I Theorem: For any f̂₁, …, f̂_M : 𝒳 → ℝ, the ensemble predictor f̂_avg := (1/M) ∑ⱼ f̂ⱼ satisfies
\[
R(\hat{f}_{\mathrm{avg}}) \;=\; \frac{1}{M} \sum_{j=1}^{M} R(\hat{f}_j) \;-\; \frac{1}{M} \sum_{j=1}^{M} \mathbb{E}\Bigl[\bigl(\hat{f}_{\mathrm{avg}}(X) - \hat{f}_j(X)\bigr)^2\Bigr].
\]
  (A numerical check of this identity appears below.)
I Better than model selection when:
  I all f̂ⱼ have similar risks, and
  I all f̂ⱼ predict very differently from each other
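Since the decomposition holds pointwise, replacing the expectations with empirical means over a synthetic sample makes the two sides agree up to floating-point error. The predictors below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
X = rng.uniform(-3, 3, size=N)
Y = np.sin(X) + 0.5 * rng.standard_normal(N)       # some joint distribution over (X, Y)

preds = [np.sin(X), 0.5 * X, np.zeros(N)]          # f_1(X), f_2(X), f_3(X)
f_avg = np.mean(preds, axis=0)

def emp_risk(f_of_X):
    """Empirical stand-in for R(f) = E[(f(X) - Y)^2]."""
    return np.mean((f_of_X - Y) ** 2)

lhs = emp_risk(f_avg)
rhs = np.mean([emp_risk(f) for f in preds]) - np.mean([np.mean((f_avg - f) ** 2) for f in preds])
print(lhs, rhs)                                    # equal up to floating-point error
```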
16/22
Stacking and features
I In model averaging, the “weights” of 1/M for all f̂ⱼ seem arbitrary
I Can “learn” the weights using linear regression (see the sketch below)!
I Use feature expansion φ(x) = (f̂₁(x), …, f̂_M(x))
I Called stacking
I Use additional data (independent of f̂₁, …, f̂_M)
I Upshot: any function (even a learned function) can be a feature
I Conversely: behind every feature is a deliberate modeling choice
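A minimal stacking sketch with scikit-learn (synthetic data and arbitrary base models of my choosing): the base predictors' outputs become the features φ(x) = (f̂₁(x), …, f̂_M(x)), and the combination weights are learned by linear regression on data held out from the base-model fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X[:, 0]) + 0.2 * rng.standard_normal(300)

# Split: fit base models on the first half, learn the stacking weights on the second.
X1, y1, X2, y2 = X[:150], y[:150], X[150:], y[150:]

base_models = [
    DecisionTreeRegressor(max_depth=2).fit(X1, y1),
    DecisionTreeRegressor(max_depth=5).fit(X1, y1),
    LinearRegression().fit(X1, y1),
]

# Feature expansion phi(x) = (f_1(x), ..., f_M(x)) on the held-out data.
Phi2 = np.column_stack([m.predict(X2) for m in base_models])
stacker = LinearRegression().fit(Phi2, y2)
print("learned ensemble weights:", stacker.coef_)
```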
17/22
Detour: Bayesian statistics
I Bayesian inference: probabilistic approach to updating beliefs
I Posit a (parametric) statistical model for data (likelihood)
I Start with some beliefs about the parameters of the model (prior)
I Update beliefs after seeing data (posterior)
\[
\underbrace{\Pr(w \mid \text{data})}_{\text{posterior}(w)} \;=\; \frac{1}{Z_{\text{data}}}\, \underbrace{\Pr(w)}_{\text{prior}(w)} \cdot \underbrace{\Pr(\text{data} \mid w)}_{\text{likelihood}(w)}
\]
I (Finding the normalization constant Z_data is often the computationally challenging part of belief updating.)
I Basis for reasoning in humans (maybe?), robots, etc.
18/22
Beyond Bayesian inference
I Can use Bayesian inference framework for designing
estimation/learning algorithms (even if you aren’t a Bayesian!)
I E.g., instead of computing entire posterior distribution, find the w with highest posterior probability
I Called the maximum a posteriori (MAP) estimator
I Just find w to maximize
prior(w) × likelihood(w).
I (Avoids issue with finding normalization constant.)
19/22
Bayesian approach to linear regression
I In the linear regression model, express prior beliefs about w = (w₁, …, w_d) using a probability distribution with a density function
I Likelihood model: (X₁, Y₁), …, (Xₙ, Yₙ) are conditionally independent given w, and Yᵢ | (Xᵢ, w) ∼ N(Xᵢᵀw, 1).
I Simple choice:
\[
\text{prior}(w_1, \dots, w_d) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Bigl(-\frac{w_j^2}{2\sigma^2}\Bigr)
\]
I I.e., treat w₁, …, w_d as independent N(0, σ²) random variables
I What is the MAP?
20/22
MAP for Bayesian linear regression
I Find w to maximize
\[
\underbrace{\prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Bigl(-\frac{w_j^2}{2\sigma^2}\Bigr)}_{\text{prior}(w)} \;\cdot\; \underbrace{\prod_{i=1}^{n} p(x_i) \cdot \frac{1}{\sqrt{2\pi}} \exp\!\bigl(-(y_i - x_i^\top w)^2 / 2\bigr)}_{\text{likelihood}(w)} .
\]
(Here, p is the marginal density of X; unimportant.)
I Take the logarithm and omit terms not involving w:
\[
-\frac{1}{2\sigma^2} \sum_{j=1}^{d} w_j^2 \;-\; \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i^\top w)^2 .
\]
I For σ² = 1/(nλ), same as minimizing
\[
\frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2 + \lambda \|w\|_2^2 ,
\]
which is the ridge regression objective!
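A quick numpy check (on synthetic data) that the MAP under this prior and likelihood solves the ridge problem: with σ² = 1/(nλ), the MAP has the closed form (AᵀA + nλI)⁻¹Aᵀb, and the gradient of the ridge objective above vanishes there. (The nλ, rather than λ, reflects the 1/n in front of the data-fitting term on this slide.)

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, lam = 50, 10, 0.3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

sigma2 = 1.0 / (n * lam)                     # prior variance from the slide
# MAP = minimizer of ||A w - b||^2 / 2 + ||w||^2 / (2 sigma^2)
w_map = np.linalg.solve(A.T @ A + (1.0 / sigma2) * np.eye(d), A.T @ b)

# Gradient of the ridge objective (1/n) sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2
grad = (2.0 / n) * A.T @ (A @ w_map - b) + 2.0 * lam * w_map
print(np.allclose(grad, 0))                  # True: the MAP solves the ridge problem
```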
21/22
Example: Dartmouth data
I Dartmouth data example, where we considered intervals for the HS GPA variable:
(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·
I Use φ(x) = (1{x∈(0.00,0.25]}, 1{x∈(0.25,0.50]}, …) with a linear function
I Regularization: λ ∑ⱼ (wⱼ − μ)², where μ = 2.46 is the mean of the College GPA values.
I What's the Bayesian interpretation of minimizing the following objective?
\[
\frac{1}{n} \sum_{i=1}^{n} \bigl(\varphi(x_i)^\top w - y_i\bigr)^2 + \lambda \sum_{j=1}^{d} (w_j - \mu)^2
\]
22/22