
Machine learning lecture slides
COMS 4771 Fall 2020

Regression II: Regularization

Outline
- Inductive biases in linear regression
- Regularization
- Model averaging
- Bayesian interpretation of regularization

Inductive bias
- In linear regression, it is possible for the least squares solution to be non-unique, in which case there are infinitely many solutions.
- Which one should we pick?
- Possible answer: pick the shortest solution, i.e., the one of minimum (squared) Euclidean norm ∥w∥_2^2.
- Small norm ⇒ small changes in output in response to changes in input:

    |w^T x − w^T x′| ≤ ∥w∥_2 · ∥x − x′∥_2
    (change in output)        (change in input)

  (an easy consequence of Cauchy–Schwarz; a short numerical check follows this slide)
- Note: the data does not give a reason to choose a shorter w over a longer w.
- Preference for short w is an example of an inductive bias.
- All learning algorithms encode some form of inductive bias.
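As a quick illustration of the Cauchy–Schwarz bound above, here is a minimal sketch (assuming NumPy; the vectors and scales are arbitrary choices of mine, not from the lecture). It checks that the change in output |w^T x − w^T x′| never exceeds ∥w∥_2 · ∥x − x′∥_2, so rescaling w to a smaller norm shrinks how much the predictions can move.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
x, x_prime = rng.standard_normal(d), rng.standard_normal(d)
w_dir = rng.standard_normal(d)          # fixed direction; only the norm varies

for scale in (0.1, 1.0, 10.0):
    w = scale * w_dir
    change_out = abs(w @ x - w @ x_prime)
    bound = np.linalg.norm(w) * np.linalg.norm(x - x_prime)
    # Cauchy-Schwarz: the output change is at most ||w||_2 * ||x - x'||_2.
    print(f"||w||_2 = {np.linalg.norm(w):6.2f}   "
          f"|w.x - w.x'| = {change_out:7.3f} <= {bound:7.3f}")
```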

Example of minimum norm inductive bias
- Trigonometric feature expansion:
    φ(x) = (sin(x), cos(x), …, sin(32x), cos(32x)) ∈ R^64
- n = 32 training examples
- Infinitely many solutions to the normal equations (a minimal fitting sketch follows the figure)
[Figure 1: Fitted linear models with trigonometric feature expansion. Legend: arbitrary solution, least Euclidean norm solution, least weighted norm solution, training data.]
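To make this example concrete, here is a minimal sketch (assuming NumPy; the training data are a synthetic stand-in of my own choosing, not the lecture's actual dataset). With 64 trigonometric features and only 32 examples the least squares problem has infinitely many solutions; `np.linalg.lstsq` returns the one of least Euclidean norm, and adding any null-space direction gives another solution with identical predictions but larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)

def trig_features(x, K=32):
    """Features sin(kx), cos(kx) for k = 1..K (same span as the slide's phi)."""
    k = np.arange(1, K + 1)
    return np.concatenate([np.sin(np.outer(x, k)), np.cos(np.outer(x, k))], axis=1)

# Synthetic stand-in for the 32 training examples (hypothetical data).
n = 32
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)

Phi = trig_features(x)                    # shape (32, 64): more features than examples
# lstsq returns the minimum Euclidean norm solution of this underdetermined system.
w_min, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Any null-space direction of Phi can be added without changing the fit.
_, _, Vt = np.linalg.svd(Phi)             # rows 32..63 of Vt (generically) span null(Phi)
w_other = w_min + Vt[-1]

print(np.allclose(Phi @ w_min, Phi @ w_other))            # True: identical predictions
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))    # True: w_min is shorter
```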

Representation of minimum norm solution (1)
- Claim: the minimum (Euclidean) norm solution to the normal equations lives in the span of the x_i's (i.e., in range(A^T)); this is illustrated numerically below.
- I.e., we can write

    w = A^T α = Σ_{i=1}^n α_i x_i

  for some α = (α_1, …, α_n) ∈ R^n.
- (Replace x_i with φ(x_i) if using a feature map φ.)
- Proof: if we have any solution of the form w = s + r, where s ∈ range(A^T) and r ≠ 0 is in null(A) (i.e., Ar = 0), we can remove r and obtain a shorter solution, since

    A^T b = A^T A w = A^T A(s + r) = A^T A s + A^T(Ar) = A^T A s.

  (Recall the Pythagorean theorem: because s ⊥ r, we have ∥w∥_2^2 = ∥s∥_2^2 + ∥r∥_2^2.)
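Here is a minimal numerical check of the claim (assuming NumPy; the random underdetermined system is an arbitrary illustration of mine). The minimum-norm solution computed via the pseudoinverse coincides with w = A^T α, where α solves the n×n system (A A^T) α = b.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                          # fewer equations than unknowns
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Minimum-norm least squares solution via the pseudoinverse.
w_pinv = np.linalg.pinv(A) @ b

# The same solution written as w = A^T alpha, with alpha in R^n:
# A A^T is n x n and (generically) invertible when the rows of A are independent.
alpha = np.linalg.solve(A @ A.T, b)
w_span = A.T @ alpha                   # lies in range(A^T) = span of the data points

print(np.allclose(w_pinv, w_span))     # True: both are the min-norm solution
print(np.allclose(A @ w_span, b))      # True: it solves the (consistent) system
```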

Representation of minimum norm solution (2)
- In fact, the minimum Euclidean norm solution is unique!
- If two distinct solutions w and w′ had the same (minimum) length, then averaging them would give another solution, (1/2)(w + w′), of strictly shorter length.

Regularization
- Combine two concerns: making both R̂(w) and ∥w∥_2^2 small.
- Pick λ ≥ 0, and minimize
    R̂(w) + λ∥w∥_2^2
- If λ > 0, the solution is always unique (even if n < d).
- Called ridge regression.
- λ = 0 is OLS/ERM.
- λ controls how much to pay attention to the regularizer ∥w∥_2^2 relative to the data fitting term R̂(w).
- λ is a hyperparameter to tune (e.g., using cross-validation).
- The solution is also in the span of the x_i's (i.e., in range(A^T)).

Example of regularization with squared norm penalty
- Trigonometric feature expansion φ(x) = (sin(x), cos(x), …, sin(32x), cos(32x))
- Trade-off between fit to data and regularizer:
    min_{w ∈ R^64}  (1/n) Σ_{i=1}^n (w^T φ(x_i) − y_i)^2 + λ Σ_{j=1}^{32} 2^j (w_{sin,j}^2 + w_{cos,j}^2)

[Figure 2: Trading off between the data fitting term and the regularizer. Legend: squared weighted norm penalized solutions for λ = 0.1, λ = 0.01, and λ = 0; training data.]

Data augmentation (1)
- Let Ã = [ A ; √λ·I ] ∈ R^{(n+d)×d} and b̃ = [ b ; 0 ] ∈ R^{n+d} (stacked vertically).
- Then ∥Ãw − b̃∥_2^2 = R̂(w) + λ∥w∥_2^2 (the ridge regression objective).
- Interpretation:
  - d "fake" data points ensure the augmented Ã has rank d.
  - All corresponding labels are zero.
- Ã^T Ã = A^T A + λI and Ã^T b̃ = A^T b.
- So the ridge regression solution is ŵ = (A^T A + λI)^{−1} A^T b.

Data augmentation (2)
- Domain-specific data augmentation: e.g., image transformations.

[Figure 3: What data augmentations make sense for OCR digit recognition?]

Lasso
- Lasso: minimize R̂(w) + λ∥w∥_1.
- Here, ∥v∥_1 = Σ_j |v_j|, the sum of the absolute values of the entries of v.
- Prefers short w, where length is measured using a different norm.
- Tends to produce w that are sparse (i.e., have few non-zero entries), or at least are well-approximated by sparse vectors.
- A different inductive bias: |w^T x − w^T x′| ≤ ∥w∥_1 · ∥x − x′∥_∞.

Lasso vs ridge regression
- Example: coefficient profiles of Lasso vs ridge.
- x = clinical measurements, y = level of prostate cancer antigen.
- Horizontal axis: varying λ (large λ to the left, small λ to the right).
- Vertical axis: coefficient values in the Lasso and ridge solutions, for eight different features.

Inductive bias from minimum ℓ1 norm
- Theorem: Pick any w ∈ R^d and any ε ∈ (0, 1). Form w̃ ∈ R^d by keeping the ⌈1/ε^2⌉ largest (by magnitude) coefficients of w and setting the remaining entries to zero. Then ∥w̃ − w∥_2 ≤ ε∥w∥_1.
- If ∥w∥_1 is small (compared to ∥w∥_2), then the theorem says w is well-approximated by a sparse vector.

Sparsity
- Lasso also tries to make the coefficients small. What if we only care about sparsity?
- Subset selection: minimize empirical risk among all k-sparse solutions.
- Greedy algorithms: repeatedly choose new variables to "include" in the support of w until k variables are included.
- Forward stepwise regression / orthogonal matching pursuit: each time you "include" a new variable, re-fit all coefficients for the included variables.
- Often works as well as Lasso.
- Why do we care about sparsity?

Detour: Model averaging
- Suppose we have M real-valued predictors, f̂_1, …, f̂_M.
- How to take advantage of all of them?
- Model selection: pick the best one, e.g., using the hold-out method.
- Model averaging: form the "ensemble" predictor f̂_avg, where for any x,
    f̂_avg(x) := (1/M) Σ_{j=1}^M f̂_j(x)
  (a small sketch follows below).
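Before the risk statement on the next slide, here is a minimal sketch of model averaging (assuming NumPy; the M predictors are hypothetical ridge fits on bootstrap resamples of synthetic data, an illustrative choice of mine rather than the lecture's setup). It forms f̂_avg by averaging predictions and empirically checks the decomposition stated next, which in fact holds as a pointwise algebraic identity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: M ridge-like predictors fit on bootstrap resamples
# of synthetic data (purely illustrative).
n, d, M = 200, 10, 8
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + rng.standard_normal(n)

X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true + rng.standard_normal(1000)

preds = []
for _ in range(M):
    idx = rng.integers(0, n, size=n)                              # bootstrap resample
    A, b = X[idx], y[idx]
    w_hat = np.linalg.solve(A.T @ A + 0.1 * np.eye(d), A.T @ b)   # ridge fit (lambda = 0.1)
    preds.append(X_test @ w_hat)
preds = np.stack(preds)                    # shape (M, n_test)

f_avg = preds.mean(axis=0)                 # ensemble prediction

risk_avg = np.mean((f_avg - y_test) ** 2)
mean_risks = np.mean((preds - y_test) ** 2)
spread = np.mean((preds - f_avg) ** 2)

# Empirical version of the decomposition on the next slide:
# R(f_avg) = (1/M) sum_j R(f_j) - (1/M) sum_j E[(f_avg(X) - f_j(X))^2]
print(np.isclose(risk_avg, mean_risks - spread))   # True (it is an algebraic identity)
```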
Risk of model averaging
- R(f) := E[(f(X) − Y)^2] for some random variable (X, Y) taking values in X × R.
- Theorem: For any f̂_1, …, f̂_M : X → R, the ensemble predictor f̂_avg := (1/M) Σ_{j=1}^M f̂_j satisfies

    R(f̂_avg) = (1/M) Σ_{j=1}^M R(f̂_j) − (1/M) Σ_{j=1}^M E[(f̂_avg(X) − f̂_j(X))^2].

- Better than model selection when:
  - all f̂_j have similar risks, and
  - all f̂_j predict very differently from each other.

Stacking and features
- In model averaging, "weights" of 1/M for all f̂_j seem arbitrary.
- Can "learn" the weights using linear regression!
- Use the feature expansion φ(x) = (f̂_1(x), …, f̂_M(x)).
- Called stacking.
- Use additional data (independent of f̂_1, …, f̂_M).
- Upshot: any function (even a learned function) can be a feature.
- Conversely: behind every feature is a deliberate modeling choice.

Detour: Bayesian statistics
- Bayesian inference: a probabilistic approach to updating beliefs.
- Posit a (parametric) statistical model for the data (likelihood).
- Start with some beliefs about the parameters of the model (prior).
- Update beliefs after seeing the data (posterior):

    Pr(w | data) = (1/Z_data) · Pr(w) · Pr(data | w)
     posterior(w)                prior(w)  likelihood(w)

- (Finding the normalization constant Z_data is often the computationally challenging part of belief updating.)
- Basis for reasoning in humans (maybe?), robots, etc.

Beyond Bayesian inference
- Can use the Bayesian inference framework for designing estimation/learning algorithms (even if you aren't a Bayesian!).
- E.g., instead of computing the entire posterior distribution, find the w with highest posterior probability.
- Called the maximum a posteriori (MAP) estimator.
- Just find w to maximize prior(w) × likelihood(w).
- (Avoids the issue of finding the normalization constant.)

Bayesian approach to linear regression
- In the linear regression model, express prior beliefs about w = (w_1, …, w_d) using a probability distribution with a density function.
- Simple choice:

    prior(w_1, …, w_d) = Π_{j=1}^d (1/√(2πσ^2)) exp(−w_j^2 / (2σ^2))

- I.e., treat w_1, …, w_d as independent N(0, σ^2) random variables.
- Likelihood model: (X_1, Y_1), …, (X_n, Y_n) are conditionally independent given w, and Y_i | (X_i, w) ∼ N(X_i^T w, 1).
- What is the MAP?

MAP for Bayesian linear regression
- Find w to maximize

    Π_{j=1}^d (1/√(2πσ^2)) exp(−w_j^2 / (2σ^2)) · Π_{i=1}^n p(x_i) · (1/√(2π)) exp(−(y_i − x_i^T w)^2 / 2),
          prior(w)                                      likelihood(w)

  (Here, p is the marginal density of X; it is unimportant.)
- Take the logarithm and omit terms not involving w:

    −(1/(2σ^2)) Σ_{j=1}^d w_j^2 − (1/2) Σ_{i=1}^n (y_i − x_i^T w)^2.

- For σ^2 = 1/(nλ), maximizing this is the same as minimizing

    (1/n) Σ_{i=1}^n (x_i^T w − y_i)^2 + λ∥w∥_2^2,

  which is the ridge regression objective!

Example: Dartmouth data
- Dartmouth data example, where we considered intervals for the HS GPA variable: (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], …
- Use φ(x) = (1{x ∈ (0.00, 0.25]}, 1{x ∈ (0.25, 0.50]}, …) with a linear function.
- Regularization: λ Σ_{j=1}^d (w_j − μ)^2, where μ = 2.46 is the mean of the College GPA values.
- What is the Bayesian interpretation of minimizing the following objective?

    (1/n) Σ_{i=1}^n (φ(x_i)^T w − y_i)^2 + λ Σ_{j=1}^d (w_j − μ)^2
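To close the loop on the MAP/ridge connection, here is a minimal sketch (assuming NumPy and SciPy, with synthetic data of my own choosing). It numerically maximizes the log-posterior under the independent N(0, σ^2) prior with σ^2 = 1/(nλ) and checks that the result matches the closed-form minimizer of the ridge objective above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Synthetic regression data (illustrative only).
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

lam = 0.5
sigma2 = 1.0 / (n * lam)          # prior variance chosen as on the MAP slide

def neg_log_posterior(w):
    # -log prior(w) - log likelihood(w), dropping constants not involving w.
    return 0.5 / sigma2 * np.sum(w ** 2) + 0.5 * np.sum((b - A @ w) ** 2)

w_map = minimize(neg_log_posterior, np.zeros(d)).x

# The ridge objective (1/n) sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2 has the
# closed-form minimizer (A^T A + n*lam*I)^{-1} A^T b.
w_ridge = np.linalg.solve(A.T @ A + n * lam * np.eye(d), A.T @ b)

print(np.allclose(w_map, w_ridge, atol=1e-4))   # True: MAP = ridge here
# Aside (standard fact, hedged): replacing the prior with independent N(mu, sigma^2)
# coordinates turns the penalty into lam * sum_j (w_j - mu)^2, as on the last slide.
```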