
Machine learning lecture slides
COMS 4771 Fall 2020

Regression II: Regularization

Outline
- Inductive biases in linear regression
- Regularization
- Model averaging
- Bayesian interpretation of regularization

Inductive bias
- In linear regression, it is possible for the least squares solution to be non-unique, in which case there are infinitely many solutions.
- Which one should we pick?
- Possible answer: pick the shortest solution, i.e., the one of minimum (squared) Euclidean norm $\|w\|_2^2$.
- Small norm $\Rightarrow$ small changes in output in response to changes in input:
  \[
  \underbrace{|w^{\mathsf{T}} x - w^{\mathsf{T}} x'|}_{\text{change in output}} \;\le\; \|w\|_2 \cdot \underbrace{\|x - x'\|_2}_{\text{change in input}}
  \]
  (an easy consequence of Cauchy–Schwarz)
- Note: the data does not give a reason to choose a shorter $w$ over a longer $w$.
- Preference for short $w$ is an example of an inductive bias.
- All learning algorithms encode some form of inductive bias.

Example of minimum norm inductive bias
- Trigonometric feature expansion $\varphi(x) = (\sin(x), \cos(x), \dots, \sin(32x), \cos(32x)) \in \mathbb{R}^{64}$
- $n = 32$ training examples
- Infinitely many solutions to the normal equations
[Figure 1: Fitted linear models with trigonometric feature expansion, showing an arbitrary solution, the least Euclidean norm solution, the least weighted norm solution, and the training data.]
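The following Python sketch (an added illustration, not from the original slides) reproduces the flavor of this example: with 64 trigonometric features and only 32 training points there are infinitely many interpolating solutions, and `np.linalg.lstsq` returns the minimum Euclidean norm one. The data-generating function, noise level, and seed are invented, and the features are ordered as all sines followed by all cosines.

```python
import numpy as np

# Added sketch: 64 trig features, 32 points -> underdetermined least squares.
# np.linalg.lstsq returns the minimum Euclidean norm solution in this case.
rng = np.random.default_rng(0)
n, k = 32, 32
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)      # hypothetical noisy targets

def trig_features(x, k=32):
    """phi(x) with frequencies 1..k (here: all sines, then all cosines)."""
    freqs = np.arange(1, k + 1)
    return np.concatenate([np.sin(np.outer(x, freqs)),
                           np.cos(np.outer(x, freqs))], axis=1)

A = trig_features(x)                              # shape (32, 64)
w, *_ = np.linalg.lstsq(A, y, rcond=None)         # minimum-norm least squares fit
print(np.linalg.norm(A @ w - y))                  # ~0: interpolates the training data
print(np.linalg.norm(w))                          # norm of the shortest interpolant
```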

Representation of minimum norm solution (1)
- Claim: The minimum (Euclidean) norm solution to the normal equations lives in the span of the $x_i$'s (i.e., in $\operatorname{range}(A^{\mathsf{T}})$).
- I.e., can write
  \[
  w = A^{\mathsf{T}} \alpha = \sum_{i=1}^{n} \alpha_i x_i
  \]
  for some $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^{n}$.
- (Replace $x_i$ with $\varphi(x_i)$ if using feature map $\varphi$.)
- Proof: If we have any solution of the form $w = s + r$, where $s \in \operatorname{range}(A^{\mathsf{T}})$ and $r \neq 0$ is in $\operatorname{null}(A)$ (i.e., $Ar = 0$), we can remove $r$ and have a shorter solution:
  \[
  A^{\mathsf{T}} b = A^{\mathsf{T}} A w = A^{\mathsf{T}} A (s + r) = A^{\mathsf{T}} A s + A^{\mathsf{T}} (A r) = A^{\mathsf{T}} A s.
  \]
  (Recall the Pythagorean theorem: $\|w\|_2^2 = \|s\|_2^2 + \|r\|_2^2$.)
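A quick numerical check of this claim (an added sketch, under the assumption that $A$ is $n \times d$ with $n < d$ and full row rank): the minimum-norm solution equals $A^{\mathsf{T}}\alpha$ with $\alpha = (AA^{\mathsf{T}})^{-1} b$.

```python
import numpy as np

# Added sketch: verify that the minimum-norm solution lies in range(A^T),
# assuming A is n-by-d with n < d and full row rank.
rng = np.random.default_rng(1)
n, d = 32, 64
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

w_min_norm = np.linalg.pinv(A) @ b            # minimum-norm least squares solution

alpha = np.linalg.solve(A @ A.T, b)           # alpha = (A A^T)^{-1} b
w_from_alpha = A.T @ alpha                    # w = A^T alpha = sum_i alpha_i x_i

print(np.allclose(w_min_norm, w_from_alpha))  # True (up to round-off)
```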

Representation of minimum norm solution (2)
- In fact, the minimum Euclidean norm solution is unique!
- If two distinct solutions $w$ and $w'$ have the same length, then averaging them gives another solution $\frac{1}{2}(w + w')$ of shorter length.
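A quick numeric check of this averaging argument (an added sketch with a random wide matrix $A$, full row rank assumed): perturb the minimum-norm solution by two different null-space directions of equal length; the average of the two resulting solutions is strictly shorter.

```python
import numpy as np

# Added sketch: two distinct solutions of equal length average to a shorter solution.
rng = np.random.default_rng(2)
n, d = 32, 64
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

w0 = np.linalg.pinv(A) @ b                      # minimum-norm solution (orthogonal to null(A))

# Build two distinct null-space directions of equal (unit) length.
_, _, Vt = np.linalg.svd(A)
r1, r2 = Vt[n], Vt[n + 1]                       # orthonormal vectors spanning part of null(A)
w1, w2 = w0 + r1, w0 + r2                       # two solutions with the same norm

print(np.allclose([np.linalg.norm(A @ w1 - b), np.linalg.norm(A @ w2 - b)], 0.0))
print(np.linalg.norm(w1), np.linalg.norm(w2))   # equal lengths
print(np.linalg.norm((w1 + w2) / 2))            # strictly shorter
```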

[Announcements from the live lecture: recitation tomorrow; the Jupyter notebook will be released; HW grades will be posted tonight on Gradescope; please type questions in the chat, and if you want to ask verbally, write that in the chat (for online participants).]

Regularization
- Combine two concerns: making both $\hat{R}(w)$ and $\|w\|_2^2$ small
- Pick $\lambda \ge 0$, and minimize
  \[
  \hat{R}(w) + \lambda \|w\|_2^2
  \]
- If $\lambda > 0$, the solution is always unique (even if $n < d$).
- Called ridge regression.
- $\lambda = 0$ is OLS/ERM.
- $\lambda$ controls how much to pay attention to the regularizer $\|w\|_2^2$ relative to the data fitting term $\hat{R}(w)$
- $\lambda$ is a hyperparameter to tune (e.g., using cross-validation)
- Solution is also in the span of the $x_i$'s (i.e., in $\operatorname{range}(A^{\mathsf{T}})$)

Definition from Wikipedia

"[...] regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting"

Example of regularization with squared norm penalty

- Trigonometric feature expansion $\varphi(x) = (\sin(x), \cos(x), \dots, \sin(32x), \cos(32x))$
- Trade-off between fit to data and regularizer:
  \[
  \min_{w \in \mathbb{R}^{64}} \; \sum_{i=1}^{n} \bigl(w^{\mathsf{T}} \varphi(x_i) - y_i\bigr)^2 + \lambda \sum_{j=1}^{32} 2^{j} \bigl(w_{\sin,j}^2 + w_{\cos,j}^2\bigr)
  \]

[Figure 2: Trading off between the data fitting term and the regularizer; squared weighted norm penalized solutions for $\lambda = 0.1$, $\lambda = 0.01$, and $\lambda = 0$, shown with the training data.]

Data augmentation (1)

- Let $\tilde{A} = \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix} \in \mathbb{R}^{(n+d) \times d}$ and $\tilde{b} = \begin{bmatrix} b \\ 0 \end{bmatrix} \in \mathbb{R}^{n+d}$
- Then $\|\tilde{A} w - \tilde{b}\|_2^2 = \hat{R}(w) + \lambda \|w\|_2^2$ (the ridge regression objective)
- $d$ "fake" data points; ensures the augmented $\tilde{A}$ has rank $d$
- All corresponding labels are zero.
- $\tilde{A}^{\mathsf{T}} \tilde{A} = A^{\mathsf{T}} A + \lambda I$ and $\tilde{A}^{\mathsf{T}} \tilde{b} = A^{\mathsf{T}} b$
- So the ridge regression solution is $\hat{w} = (A^{\mathsf{T}} A + \lambda I)^{-1} A^{\mathsf{T}} b$

Data augmentation (2)

- Domain-specific data augmentation: e.g., image transformations

[Figure 3: What data augmentations make sense for OCR digit recognition?]

Lasso

- Lasso: minimize $\hat{R}(w) + \lambda \|w\|_1$
- Here, $\|v\|_1 = \sum_{j} |v_j|$, the sum of absolute values of the vector's entries
- Prefers short $w$, where length is measured using a different norm
- Tends to produce $w$ that are sparse (i.e., have few non-zero entries), or at least are well-approximated by sparse vectors.
- A different inductive bias (by Hölder's inequality):
  \[
  |w^{\mathsf{T}} x - w^{\mathsf{T}} x'| \le \|w\|_1 \cdot \|x - x'\|_{\infty},
  \]
  where $\|x - x'\|_{\infty} = \max_i |x_i - x_i'|$.

Lasso vs ridge regression

- Example: coefficient profile of Lasso vs ridge
- $x$ = clinical measurements, $y$ = level of prostate cancer antigen
- Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right).
- Vertical axis: coefficient value in the Lasso and ridge solutions, for eight different features

Inductive bias from minimum $\ell_1$ norm

- Theorem: Pick any $w \in \mathbb{R}^d$ and any $\varepsilon \in (0,1)$. Form $\tilde{w} \in \mathbb{R}^d$ by including the $\lceil 1/\varepsilon^2 \rceil$ largest (by magnitude) coefficients of $w$, and setting the remaining entries to zero. Then $\|\tilde{w} - w\|_2 \le \varepsilon \|w\|_1$.
- If $\|w\|_1$ is small (compared to $\|w\|_2$), then the theorem says $w$ is well-approximated by a sparse vector.

Sparsity

- Lasso also tries to make coefficients small. What if we only care about sparsity?
- Subset selection: minimize empirical risk among all $k$-sparse solutions
- Greedy algorithms: repeatedly choose new variables to "include" in the support of $w$ until $k$ variables are included.
  - Forward stepwise regression / orthogonal matching pursuit: each time you "include" a new variable, re-fit all coefficients for the included variables.
  - Often works as well as Lasso
- Why do we care about sparsity?

Detour: Model averaging

- Suppose we have $M$ real-valued predictors $\hat{f}_1, \dots, \hat{f}_M$
- How to take advantage of all of them?
- Model selection: pick the best one, e.g., using the hold-out method
- Model averaging: form the "ensemble" predictor $\hat{f}_{\text{avg}}$, where for any $x$,
  \[
  \hat{f}_{\text{avg}}(x) := \frac{1}{M} \sum_{j=1}^{M} \hat{f}_j(x).
  \]
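To make the averaging construction concrete, here is a small added Python sketch with invented base predictors and an invented $(X, Y)$ distribution: the ensemble simply averages the $M$ predictions pointwise, and its empirical squared error can be compared against the average of the individual errors (see the risk identity on the next slide).

```python
import numpy as np

# Added sketch: model averaging over M fixed real-valued predictors.
# The base predictors and the (X, Y) distribution are invented for illustration.
predictors = [
    lambda x: np.sin(0.9 * x),     # hypothetical predictor 1
    lambda x: x / 2.0,             # hypothetical predictor 2
    lambda x: np.tanh(x),          # hypothetical predictor 3
]

def f_avg(x):
    """Ensemble predictor: average the M base predictions at x."""
    return np.mean([f(x) for f in predictors], axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=10_000)
Y = np.sin(X) + 0.5 * rng.standard_normal(X.shape)

risks = [np.mean((f(X) - Y) ** 2) for f in predictors]
print("average of individual risks:", np.mean(risks))
print("risk of ensemble predictor: ", np.mean((f_avg(X) - Y) ** 2))
# The ensemble's risk is never larger than the average of the individual risks.
```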
Risk of model averaging

- $R(f) := \mathbb{E}[(f(X) - Y)^2]$ for some random variable $(X, Y)$ taking values in $\mathcal{X} \times \mathbb{R}$.
- Theorem: For any $\hat{f}_1, \dots, \hat{f}_M : \mathcal{X} \to \mathbb{R}$, the ensemble predictor $\hat{f}_{\text{avg}} := \frac{1}{M} \sum_{j=1}^{M} \hat{f}_j$ satisfies
  \[
  R(\hat{f}_{\text{avg}}) = \frac{1}{M} \sum_{j=1}^{M} R(\hat{f}_j) - \frac{1}{M} \sum_{j=1}^{M} \mathbb{E}\bigl[(\hat{f}_{\text{avg}}(X) - \hat{f}_j(X))^2\bigr].
  \]
- Better than model selection when:
  - all $\hat{f}_j$ have similar risks, and
  - all $\hat{f}_j$ predict very differently from each other

Stacking and features

- In model averaging, "weights" of $1/M$ for all $\hat{f}_j$ seem arbitrary
- Can "learn" the weights using linear regression!
- Use feature expansion $\varphi(x) = (\hat{f}_1(x), \dots, \hat{f}_M(x))$
- Use additional data (independent of $\hat{f}_1, \dots, \hat{f}_M$)
- Called stacking
- Upshot: any function (even a learned function) can be a feature
- Conversely: behind every feature is a deliberate modeling choice

Detour: Bayesian statistics

- Bayesian inference: probabilistic approach to updating beliefs
- Posit a (parametric) statistical model for the data (likelihood)
- Start with some beliefs about the parameters of the model (prior)
- Update beliefs after seeing the data (posterior):
  \[
  \underbrace{\Pr(w \mid \text{data})}_{\text{posterior}(w)} = \frac{1}{Z_{\text{data}}} \; \underbrace{\Pr(w)}_{\text{prior}(w)} \cdot \underbrace{\Pr(\text{data} \mid w)}_{\text{likelihood}(w)}
  \]
- (Finding the normalization constant $Z_{\text{data}}$ is often the computationally challenging part of belief updating.)
- Basis for reasoning in humans (maybe?), robots, etc.

Beyond Bayesian inference

- Can use the Bayesian inference framework for designing estimation/learning algorithms (even if you aren't a Bayesian!)
- E.g., instead of computing the entire posterior distribution, find the $w$ with highest posterior probability
- Called the maximum a posteriori (MAP) estimator
- Just find $w$ to maximize $\text{prior}(w) \times \text{likelihood}(w)$.
- (Avoids the issue of finding the normalization constant.)

Bayesian approach to linear regression

- In the linear regression model, express prior belief about $w = (w_1, \dots, w_d)$ using a probability distribution with a density function
- Likelihood model: $(X_1, Y_1), \dots, (X_n, Y_n)$ are conditionally independent given $w$, and $Y_i \mid (X_i, w) \sim \mathrm{N}(X_i^{\mathsf{T}} w, 1)$.
- Simple choice: $\text{prior}(w_1, \dots, w_d) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigl(-\frac{w_j^2}{2\sigma^2}\bigr)$
- I.e., treat $w_1, \dots, w_d$ as independent $\mathrm{N}(0, \sigma^2)$ random variables
- What is the MAP? (Maximize $\ln \text{prior}(w) + \ln \text{likelihood}(w)$.)

MAP for Bayesian linear regression

- Find $w$ to maximize
  \[
  \underbrace{\prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl(-\frac{w_j^2}{2\sigma^2}\Bigr)}_{\text{prior}(w)} \cdot \underbrace{\prod_{i=1}^{n} p(x_i) \cdot \frac{1}{\sqrt{2\pi}} \exp\bigl(-(y_i - x_i^{\mathsf{T}} w)^2 / 2\bigr)}_{\text{likelihood}(w)}.
  \]
  (Here, $p$ is the marginal density of $X$; unimportant.)
- Take the logarithm and omit terms not involving $w$:
  \[
  -\frac{1}{2\sigma^2} \sum_{j=1}^{d} w_j^2 - \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i^{\mathsf{T}} w)^2.
  \]
- For $\sigma^2 = \frac{1}{n\lambda}$, this is the same as minimizing
  \[
  \frac{1}{n} \sum_{i=1}^{n} (x_i^{\mathsf{T}} w - y_i)^2 + \lambda \|w\|_2^2,
  \]
  which is the ridge regression objective!

Example: Dartmouth data

- Dartmouth data example, where we considered intervals for the HS GPA variable: $(0.00, 0.25]$, $(0.25, 0.50]$, $(0.50, 0.75]$, ...
- Use $\varphi(x) = (\mathbb{1}\{x \in (0.00, 0.25]\}, \mathbb{1}\{x \in (0.25, 0.50]\}, \dots)$ with a linear function
- Regularization: $\lambda \sum_{j=1}^{d} (w_j - \mu)^2$, where $\mu = 2.46$ is the mean of the College GPA values.
- What's the Bayesian interpretation of minimizing the following objective?
  \[
  \frac{1}{n} \sum_{i=1}^{n} (\varphi(x_i)^{\mathsf{T}} w - y_i)^2 + \lambda \sum_{j=1}^{d} (w_j - \mu)^2
  \]
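As a closing illustration (an added sketch, not from the slides; the feature matrix, targets, $\mu$, and $\lambda$ below are hypothetical placeholders, not the actual Dartmouth data), the objective above can be minimized in closed form by shifting $v = w - \mu \mathbf{1}$, which turns it into an ordinary ridge problem. This matches the Bayesian reading suggested by the preceding MAP derivation: a Gaussian prior centered at $\mu$ rather than at $0$.

```python
import numpy as np

# Added sketch: minimize (1/n)*||Phi w - y||^2 + lam*||w - mu*1||^2 in closed form.
# Phi, y, mu, and lam are hypothetical placeholders (not the actual Dartmouth data).
def shifted_ridge(Phi, y, lam, mu):
    n, d = Phi.shape
    y_shift = y - mu * Phi.sum(axis=1)        # targets after the shift v = w - mu*1
    v = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), Phi.T @ y_shift)
    return v + mu                             # w = v + mu*1

rng = np.random.default_rng(0)
n, d = 200, 4
Phi = np.eye(d)[rng.integers(0, d, size=n)]   # one-hot indicators for d intervals
y = Phi @ np.array([2.2, 2.5, 2.7, 3.0]) + 0.3 * rng.standard_normal(n)
mu, lam = 2.46, 0.1

print(np.round(shifted_ridge(Phi, y, lam, mu), 3))  # coefficients shrink toward mu, not 0
```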