
Machine learning lecture slides
COMS 4771 Fall 2020
0/52

Announcements
- HW1, HW2: HW2 (linear regression) due Nov 18th; solutions shared with you on Google Drive
- Recitation this week: WS 1 tomorrow, WS 2 later in the week; Zoom link on Courseworks
- Quiz 1: available on Courseworks on Friday at 11:20 am (see instructions on Piazza); open notes; do it by yourself; 40 min time limit; be done by Saturday

Regression I: Linear regression

Outline
- Statistical model for regression
- College GPA example
- Ordinary least squares for linear regression
- The expected mean squared error
- Different views of ordinary least squares
- Features and linearity
- Over-fitting
- Beyond empirical risk minimization
1/52

Figure 1: Galton board
2/52

Real-valued predictions
- Example: Galton board
  - Physical model: hard
  - Statistical model: final position of the ball is random
    - Normal (Gaussian) distribution with mean μ and variance σ²
    - Written N(μ, σ²)
    - Probability density function is
      p_{μ,σ²}(y) = (1/√(2πσ²)) · e^{−(y−μ)²/(2σ²)},  y ∈ ℝ.
- Goal: predict the final position accurately; measure the squared loss (also called squared error):
  (prediction − outcome)²
- Outcome is random, so look at the expected squared loss (also called mean squared error)
3/52

Optimal prediction for mean squared error
- Predict ŷ ∈ ℝ; the true final position is Y (a random variable) with mean E(Y) = μ and variance var(Y) = E[(Y − E(Y))²] = σ².
- Squared error is (ŷ − Y)².
- Bias-variance decomposition:
  E[(ŷ − Y)²] = E[(ŷ − μ + μ − Y)²]
              = (ŷ − μ)² + 2(ŷ − μ)·E[μ − Y] + E[(μ − Y)²]
              = (ŷ − μ)² + σ².
- This is true for any random variable Y; we don't need the normality assumption.
- So the optimal prediction is ŷ = μ.
- When the parameters are unknown, can estimate them from related data . . .
- Can also do an analysis of a plug-in prediction . . .
4/52
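This decomposition is easy to check numerically. A minimal sketch (not from the slides; the parameter values and names are arbitrary) that compares a Monte Carlo estimate of E[(ŷ − Y)²] against (ŷ − μ)² + σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.5      # assumed mean and standard deviation of the final position Y
y_hat = 0.7               # an arbitrary prediction

Y = rng.normal(mu, sigma, size=1_000_000)        # simulated outcomes
mc_estimate = np.mean((y_hat - Y) ** 2)          # Monte Carlo estimate of E[(y_hat - Y)^2]
decomposition = (y_hat - mu) ** 2 + sigma ** 2   # (y_hat - mu)^2 + sigma^2

print(mc_estimate, decomposition)   # the two agree up to Monte Carlo error
```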

Statistical model for regression
- Setting is the same as for classification except:
  - Label is a real number, rather than {0,1} or {1,2,…,K}
  - Care about squared loss, rather than whether the prediction is correct
- Mean squared error of a prediction function (regressor) f:
  mse(f) := E[(f(X) − Y)²],
  the expected squared loss of f on a random example
5/52

Optimal prediction function for regression
- If (X, Y) is a random test example, then the optimal prediction function is
  f*(x) = E[Y | X = x]
- Also called the regression function or conditional mean function
- Prediction function with smallest MSE
- Depends on the conditional distribution of Y given X
6/52

Test MSE (1)
- Just like in classification, we can use test data to estimate mse(f̂) for a function f̂ that depends only on training data.
- IID model: (X₁,Y₁),…,(Xₙ,Yₙ),(X′₁,Y′₁),…,(X′ₘ,Y′ₘ),(X,Y) are iid
  - Training examples (that you have): S := ((X₁,Y₁),…,(Xₙ,Yₙ))
  - Test examples (that you have): T := ((X′₁,Y′₁),…,(X′ₘ,Y′ₘ))
  - Test example (that you don't have) used to define mse: (X,Y)
- Predictor f̂ is based only on the training examples
  - Hence, the test examples are independent of f̂ (very important!)
- We would like to estimate mse(f̂)
7/52

Test MSE (2)
- Test MSE: mse(f̂, T) := (1/m) Σᵢ₌₁ᵐ (f̂(X′ᵢ) − Y′ᵢ)²
- By the law of large numbers, mse(f̂, T) → mse(f̂) as m → ∞
8/52
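A minimal sketch of this estimate (function and variable names are illustrative, not from the slides): given a predictor fit only on training data, its test MSE is just the average squared error on the held-out examples.

```python
import numpy as np

def test_mse(predict, X_test, y_test):
    """Plug-in estimate of mse(f_hat): average squared error on held-out test examples."""
    return np.mean((predict(X_test) - y_test) ** 2)

# Example: a constant predictor fit on training labels only (predicts the training mean).
y_train = np.array([2.1, 3.0, 2.7, 1.9])
predict = lambda X: np.full(len(X), y_train.mean())

X_test = np.zeros((3, 1))            # features are unused by this trivial predictor
y_test = np.array([2.5, 2.0, 3.2])
print(test_mse(predict, X_test, y_test))
```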

Example: College GPA
- Data on 750 Dartmouth students' College GPA
  - Mean: 2.46
  - Standard deviation: 0.746
- Assume this data is an iid sample from the population of Dartmouth students (false)
- Absent any other features, the best constant prediction of a uniformly random Dartmouth student's College GPA is ŷ := 2.46.
Figure 2: Histogram of College GPA
9/52

Predicting College GPA from HS GPA (1)
- Students represented in the data also have a High School (HS) GPA
- Maybe HS GPA is predictive of College GPA?
- Data: S := ((x₁,y₁),…,(xₙ,yₙ))
  - xᵢ is the HS GPA of the i-th student
  - yᵢ is the College GPA of the i-th student
10/52

Figure 3: Plot of College GPA vs HS GPA
11/52

Predicting College GPA from HS GPA (2)
- First attempt:
  - Define intervals of possible HS GPAs:
    (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], …
  - For each such interval I, record the mean μ̂_I of the College GPAs of students whose HS GPA falls in I.
  - Define the piecewise-constant function
    f̂(x) := μ̂_(0.00,0.25] if x ∈ (0.00, 0.25],
            μ̂_(0.25,0.50] if x ∈ (0.25, 0.50],
            μ̂_(0.50,0.75] if x ∈ (0.50, 0.75],
            ⋮
- (What to do about an interval I that contains no student's HS GPA?)
12/52
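A minimal sketch of this binning estimator (the data here is synthetic; the real Dartmouth data is not included in the slides). Empty intervals fall back to the overall mean, which is one possible answer to the question above.

```python
import numpy as np

def fit_binned_means(x, y, width=0.25, x_max=5.0):
    """Piecewise-constant predictor: mean of y over each interval ((j-1)*width, j*width]."""
    edges = np.arange(0.0, x_max + width, width)
    idx = np.digitize(x, edges, right=True)          # interval index of each training example
    means = np.full(len(edges) + 1, y.mean())        # fallback for empty intervals: overall mean
    for j in np.unique(idx):
        means[j] = y[idx == j].mean()
    return lambda x_new: means[np.digitize(x_new, edges, right=True)]

# Synthetic stand-in for (HS GPA, College GPA) pairs.
rng = np.random.default_rng(0)
hs_gpa = rng.uniform(1.5, 4.5, size=750)
college_gpa = 0.5 * hs_gpa + rng.normal(1.0, 0.5, size=750)

f_hat = fit_binned_means(hs_gpa, college_gpa)
print(f_hat(np.array([2.3, 3.6])))
```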

Figure 4: Plot of mean College GPA vs binned HS GPA
13/52

Predicting College GPA from HS GPA (3)
- Define
  mse(f, S) := (1/|S|) Σ_{(x,y)∈S} (f(x) − y)²,
  the mean squared error of predictions made by f on the examples in S.
  - "mean" is with respect to the uniform distribution on examples in S.
- mse(f̂, S) = 0.376, so √mse(f̂, S) = 0.613 < 0.746 (the standard deviation of the yᵢ's)
- The piecewise-constant function f̂ is an improvement over the constant function (i.e., just predicting the mean 2.46 for all x)!
14/52

Predicting College GPA from HS GPA (4)
- But f̂ has some quirks.
  - E.g., those with HS GPA between 2.50 and 2.75 are predicted to have a lower College GPA than those with HS GPA between 2.25 and 2.50.
  - E.g., something unusual with the student who has HS GPA of 4.5

Figure 5: Plot of mean College GPA vs binned HS GPA
15/52

Least squares linear regression (1)
- Suppose we'd like to only consider functions with a specific functional form, e.g., a linear function:
  f(x) = mx + θ
  for m, θ ∈ ℝ.
- Technically, x ↦ mx + θ is linear iff θ = 0. If θ ≠ 0, the function is not linear but affine.
- Semantics: positive m means a higher HS GPA gets a higher prediction of College GPA.
16/52

Least squares linear regression (2)
- What is the linear function with smallest MSE on (x₁,y₁),…,(xₙ,yₙ) ∈ ℝ × ℝ? This is the problem of least squares linear regression.
- Find (m, θ) ∈ ℝ² to minimize
  (1/n) Σᵢ₌₁ⁿ (mxᵢ + θ − yᵢ)².
- Also called ordinary least squares (OLS)
17/52

Figure 6: Plot of least squares linear regression line
18/52

Computing OLS (1)
- "Derivatives equal zero" conditions (normal equations):
  \frac{\partial}{\partial\theta}\Big[\frac{1}{n}\sum_{i=1}^n (mx_i+\theta-y_i)^2\Big] = \frac{2}{n}\sum_{i=1}^n (mx_i+\theta-y_i) = 0
  \frac{\partial}{\partial m}\Big[\frac{1}{n}\sum_{i=1}^n (mx_i+\theta-y_i)^2\Big] = \frac{2}{n}\sum_{i=1}^n (mx_i+\theta-y_i)\,x_i = 0.
- System of two linear equations with two unknowns (m, θ).
- Define
  \bar{x} := \frac{1}{n}\sum_{i=1}^n x_i, \quad \bar{y} := \frac{1}{n}\sum_{i=1}^n y_i, \quad \overline{x^2} := \frac{1}{n}\sum_{i=1}^n x_i^2, \quad \overline{xy} := \frac{1}{n}\sum_{i=1}^n x_i y_i,
  so the system can be re-written as
  \bar{x}\,m + \theta = \bar{y}
  \overline{x^2}\,m + \bar{x}\,\theta = \overline{xy}.
19/52

Computing OLS (2)
- Write in matrix notation:
  \begin{bmatrix} \bar{x} & 1 \\ \overline{x^2} & \bar{x} \end{bmatrix}\begin{bmatrix} m \\ \theta \end{bmatrix} = \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix}.
- Solution: (m̂, θ̂) ∈ ℝ² given by
  \hat{m} := \frac{\overline{xy}-\bar{x}\,\bar{y}}{\overline{x^2}-\bar{x}^2}, \qquad \hat{\theta} := \bar{y} - \frac{\overline{xy}-\bar{x}\,\bar{y}}{\overline{x^2}-\bar{x}^2}\,\bar{x}.
  (A short code sketch of this closed form appears after "Decomposition of expected MSE (1)" below.)
20/52

Computing OLS (3)
- Catch: the above solution only makes sense if \overline{x^2}-\bar{x}^2 ≠ 0, i.e., the variance of the xᵢ's is non-zero.
- If \overline{x^2}-\bar{x}^2 = 0, then the matrix defining the LHS of the system of equations is singular.
21/52

Computing OLS (4)
- In general, "derivative equals zero" is only a necessary condition for a solution to be optimal; not necessarily a sufficient condition!
- Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
22/52

Decomposition of expected MSE (1)
- Two different functions of HS GPA for predicting College GPA.
- What makes them different?
- We care about the prediction of College GPA for a student we haven't seen before, based on their HS GPA.
- IID model: (X₁,Y₁),…,(Xₙ,Yₙ),(X,Y) are iid
- Say training examples (X₁,Y₁),…,(Xₙ,Yₙ) are used to determine f̂.
- What is E[mse(f̂)]?
23/52
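A minimal sketch of the closed-form solution from "Computing OLS (2)" above (synthetic data; the variable names are illustrative, not from the slides):

```python
import numpy as np

def ols_1d(x, y):
    """Least squares fit of y ≈ m*x + theta via the closed form (needs non-zero variance in x)."""
    x_bar, y_bar = x.mean(), y.mean()
    x2_bar, xy_bar = (x ** 2).mean(), (x * y).mean()
    m_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    theta_hat = y_bar - m_hat * x_bar
    return m_hat, theta_hat

rng = np.random.default_rng(0)
x = rng.uniform(1.5, 4.5, size=750)             # synthetic HS GPAs
y = 0.5 * x + rng.normal(1.0, 0.5, size=750)    # synthetic College GPAs
print(ols_1d(x, y))    # roughly recovers slope 0.5 and intercept 1.0
```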
Decomposition of expected MSE (2)

E[mse(f̂)] = E[ E[(f̂(X) − Y)² | f̂] ]
          = E[ E[(f̂(X) − Y)² | f̂, X] ]
          = E[ var(Y | X) + (f̂(X) − E[Y | X])² ]
          = E[ var(Y | X) + E[(f̂(X) − E[Y | X])² | X] ]
          = E[ var(Y | X) + var(f̂(X) | X) + (E[f̂(X) | X] − E[Y | X])² ]
          = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²]
            (unavoidable error)   (variability of f̂)   (approximation error of f̂)

The last two terms are affected by your choice of learning algorithm.
24/52

Decomposition of expected MSE (3)
- First term quantifies the inherent unpredictability of Y (even after seeing X)
- Second term measures the "variability" of f̂ due to the random nature of the training data. Depends on:
  - probability distribution of training data,
  - type of function being fit (e.g., piecewise constant, linear),
  - method of fitting (e.g., OLS),
  - etc.
- Third term quantifies how well a function produced by the fitting procedure can approximate the regression function, even after removing the "variability" of f̂.
25/52

Multivariate linear regression (1)
- For the Dartmouth data, we also have the SAT Score for all students.
- Can we use both predictor variables (HS GPA and SAT Score) to get an even better prediction of College GPA?
- Binning approach: instead of a 1-D grid (intervals), consider a 2-D grid (squares).
- Linear regression: a function f : ℝ² → ℝ of the form
  f(x) = m₁x₁ + m₂x₂ + θ
  for some (m₁, m₂) ∈ ℝ² and θ ∈ ℝ.
26/52

Multivariate linear regression (2)
- The general case: a (homogeneous) linear function f : ℝᵈ → ℝ of the form
  f(x) = xᵀw
  for some w ∈ ℝᵈ.
- What about inhomogeneous linear functions?
  - Just always include a "feature" that always has value 1. Then the corresponding weight acts like θ from before.
- w is called the weight vector or coefficient vector.
27/52

Multivariate ordinary least squares (1)
- What is the linear function with smallest MSE on (x₁,y₁),…,(xₙ,yₙ) ∈ ℝᵈ × ℝ?
- Find w ∈ ℝᵈ to minimize
  R̂(w) := (1/n) Σᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².
- Notation warning: xᵢ ∈ ℝᵈ
28/52

Multivariate ordinary least squares (2)
- In matrix notation: R̂(w) := ‖Aw − b‖₂², where
  A := \frac{1}{\sqrt{n}} \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad
  b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.
- If we put a vector v ∈ ℝᵈ in the context of matrix multiplication, it is treated as a column vector by default!
- If we want a row vector, we write vᵀ.
- Therefore
  Aw - b = \frac{1}{\sqrt{n}} \begin{bmatrix} x_1^T w - y_1 \\ \vdots \\ x_n^T w - y_n \end{bmatrix}.
29/52

Figure 7: Geometric picture of least squares linear regression
30/52

Multivariate normal equations (1)
- Like the one-dimensional case, optimal solutions are characterized by a system of linear equations (the "derivatives equal zero" conditions) called the normal equations:
  \nabla_w \hat{R}(w) = \begin{bmatrix} \partial \hat{R}(w)/\partial w_1 \\ \vdots \\ \partial \hat{R}(w)/\partial w_d \end{bmatrix} = 2A^T(Aw - b) = 0,
  which is equivalent to
  AᵀAw = Aᵀb.
- (Note: AᵀA ∈ ℝᵈˣᵈ and Aᵀb ∈ ℝᵈ.)
31/52

Multivariate normal equations (2)
- If AᵀA is non-singular (i.e., invertible), then there is a unique solution, given by
  ŵ := (AᵀA)⁻¹Aᵀb.
- If AᵀA is singular, then there are infinitely many solutions!
- Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
32/52

Algorithm for least squares linear regression
- How to solve the least squares linear regression problem?
  - Just solve the normal equations, a system of d linear equations in d unknowns.
  - Time complexity (naïve) of the Gaussian elimination algorithm: O(d³).
  - Actually, also need to count the time to form the system of equations (AᵀA and Aᵀb), which is O(nd²); so O(nd² + d³) time in total.
33/52

Announcements
- Check Piazza regularly for announcements.
- Quiz: check scores on Courseworks; there is a change to the overall grade contributions.

Recap
- Regression: predict a real outcome Y from features X; treat X and Y as random variables. X is available at the time of prediction; apply a prediction function f to X to get a prediction of Y. Judge the badness of a prediction ŷ by its squared error (ŷ − y)²; mse(f) = E[(f(X) − Y)²].
- Linear regression / ordinary least squares: X takes values in ℝᵈ; the regression functions f(x) = xᵀw are linear (or affine), with weight vector w ∈ ℝᵈ. Given training data S = ((x₁,y₁),…,(xₙ,yₙ)) ⊂ ℝᵈ × ℝ, find w ∈ ℝᵈ that minimizes
  mse(w, S) = (1/n) Σᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².

Classical statistics view of OLS (1)
- Normal linear regression model
  - Model training examples (X₁,Y₁),…,(Xₙ,Yₙ) as iid random variables taking values in ℝᵈ × ℝ, where
    Yᵢ | Xᵢ = xᵢ ∼ N(xᵢᵀw, σ²)
  - w ∈ ℝᵈ and σ² > 0 are the parameters of the model.
- The least squares linear regression problem is the same as the problem of finding the maximum likelihood value for w.
34/52
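A minimal sketch of the procedure from the "Algorithm for least squares linear regression" slide above: form the normal equations and solve them (illustrative code; the 1/√n scaling in the definition of A cancels out and is omitted).

```python
import numpy as np

def ols(X, y):
    """Solve the normal equations X^T X w = X^T y (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)   # np.linalg.lstsq also handles the singular case

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

print(ols(X, y))   # close to w_true
```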

Classical statistics view of OLS (2)
- Suppose your data really does come from a distribution in this statistical model, say, with parameters w and σ².
- Then the function with smallest MSE is the linear function f*(x) = xᵀw, and its MSE is mse(f*) = σ².
- So estimating w is a sensible idea! (Plug-in principle . . . )
35/52

Statistical learning view of OLS (1)
- IID model: (X₁,Y₁),…,(Xₙ,Yₙ),(X,Y) ∼ P are iid random variables taking values in ℝᵈ × ℝ
  - (X, Y) is the (unseen) "test" example
- Goal: find a (linear) function w ∈ ℝᵈ with small MSE
  mse(w) = E[(Xᵀw − Y)²].
- We cannot directly minimize mse(w) as a function of w ∈ ℝᵈ, since it is an expectation (e.g., an integral) with respect to the unknown distribution P
36/52

Statistical learning view of OLS (2)
- However, we have an iid sample S := ((X₁,Y₁),…,(Xₙ,Yₙ)).
- We swap out P in the definition of mse(f) and replace it with the empirical distribution on S (a distribution over ℝᵈ × ℝ):
  Pₙ(x, y) := (1/n) Σᵢ₌₁ⁿ 1{(x,y) = (Xᵢ,Yᵢ)}.
- This is the distribution that puts probability mass 1/n on the i-th training example.
- The resulting objective function (the empirical risk) is
  E[(X̃ᵀw − Ỹ)²] = (1/n) Σᵢ₌₁ⁿ (Xᵢᵀw − Yᵢ)²,
  where (X̃, Ỹ) ∼ Pₙ.
37/52

Statistical learning view of OLS (3)
- In some circles:
  - (True/population) risk of w: R(w) := E[(Xᵀw − Y)²]
  - Empirical risk of w: R̂(w) := (1/n) Σᵢ₌₁ⁿ (Xᵢᵀw − Yᵢ)²
- We want to minimize mse(w), but we don't know P, so we replace it with our estimate Pₙ.
- This is another instance of the plug-in principle!
- OLS is empirical risk minimization (with squared loss, over linear functions).
38/52

Statistical learning view of OLS (4)
- This is not specific to linear regression; it also works for other types of functions, and for other types of prediction problems, including classification.
- For classification (f : X → {0,1}):
  - (True/population) risk of f: R(f) := E[1{f(X) ≠ Y}]
  - Empirical risk of f: R̂(f) := (1/n) Σᵢ₌₁ⁿ 1{f(Xᵢ) ≠ Yᵢ}
- All that changed is the loss function (squared loss versus zero/one loss)
- A procedure that minimizes the empirical risk: empirical risk minimization (ERM)
39/52
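A minimal sketch of the plug-in/ERM idea with the two losses side by side (all names and numbers are illustrative, not from the slides):

```python
import numpy as np

def empirical_risk(loss, predict, X, y):
    """Average loss over the sample: the plug-in estimate of the true risk."""
    return np.mean([loss(predict(x), yi) for x, yi in zip(X, y)])

squared_loss = lambda y_hat, y: (y_hat - y) ** 2        # regression
zero_one_loss = lambda y_hat, y: float(y_hat != y)      # classification

# Regression with a linear predictor w.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([2.5, 0.0, 2.0])
w = np.array([1.0, 1.0])
print(empirical_risk(squared_loss, lambda x: x @ w, X, y))

# Classification with a thresholded linear predictor.
labels = np.array([1, 0, 1])
print(empirical_risk(zero_one_loss, lambda x: int(x @ w > 0), X, labels))
```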

Upgrading linear regression (1)
- Make linear regression more powerful by being creative about features
- We are forced to do this if x is not already provided as a vector of numbers
- Instead of using x directly, use φ(x) for some transformation φ (possibly vector-valued)
40/52

Upgrading linear regression (2)
- Examples:
  - Affine feature expansion, e.g., φ(x) = (1, x), to accommodate an intercept
  - Standardization, e.g., φ(x) = (x − μ)/σ, where (μ, σ²) are (estimates of) the mean and variance of the feature value
  - Non-linear scalar transformations, e.g., φ(x) = ln(1 + x)
  - Logical formula, e.g., φ(x) = (x₁ ∧ x₅ ∧ ¬x₁₀) ∨ (¬x₂ ∧ x₇)
  - Trigonometric expansion, e.g., φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), …)
  - Polynomial expansion, e.g., φ(x) = (1, x₁, …, x_d, x₁², …, x_d², x₁x₂, …, x_{d−1}x_d)
  - Headless neural network: φ(x) = N(x) ∈ ℝᵏ, where N : ℝᵈ → ℝᵏ is a map computed by an intermediate layer of a neural network
    - (Later, we'll talk about how to "learn" N.)
41/52

Example: Taking advantage of linearity
- Example: y is a health outcome, x is body temperature
- A physician suggests that the relevant feature is the (squared) deviation from normal body temperature: (x − 98.6)²
- What if you didn't know the magic constant 98.6? (Apparently it is wrong in the US anyway)
- Use φ(x) = (1, x, x²)
- Can learn coefficients w ∈ ℝ³ such that φ(x)ᵀw = (x − 98.6)², or any other quadratic polynomial in x (which could be better!)
  - E.g., w = (98.6², −2·98.6, 1) gives w₁ + w₂x + w₃x² = (x − 98.6)².
42/52
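A minimal sketch of this idea (synthetic data; here the "health outcome" is simulated as exactly the squared deviation plus noise, which is an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(95.0, 104.0, size=200)                    # body temperatures
y = (x - 98.6) ** 2 + rng.normal(scale=0.1, size=200)     # outcome driven by squared deviation

Phi = np.column_stack([np.ones_like(x), x, x ** 2])       # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # approximately (98.6**2, -2*98.6, 1): the "magic constant" is learned from data
```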

Example: Binning features
- Dartmouth data example, where we considered intervals for the HS GPA variable:
  (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], …
- Use φ(x) = (1{x∈(0.00,0.25]}, 1{x∈(0.25,0.50]}, …) with a linear function
- What is φ(x)ᵀw?
  - φ(x)ᵀw = w_j if x is in the j-th interval.
43/52
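A minimal sketch showing that OLS with these indicator features reproduces the per-interval means from the earlier binning approach (illustrative code; the interval width and range are assumptions):

```python
import numpy as np

def phi_bins(x, width=0.25, x_max=5.0):
    """Indicator features: phi(x)_j = 1 exactly when x lies in the interval (j*width, (j+1)*width]."""
    n_bins = int(np.ceil(x_max / width))
    j = np.clip((np.ceil(x / width) - 1).astype(int), 0, n_bins - 1)   # interval index of each x
    Phi = np.zeros((len(x), n_bins))
    Phi[np.arange(len(x)), j] = 1.0
    return Phi

x = np.array([2.3, 2.6, 3.1, 3.2])      # HS GPAs
y = np.array([2.0, 2.4, 3.0, 3.1])      # College GPAs
w, *_ = np.linalg.lstsq(phi_bins(x), y, rcond=None)
print(phi_bins(x) @ w)   # each prediction is the mean College GPA of that example's interval
```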

Effect of feature expansion on expected MSE

E[mse(f̂)] = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²]
             (unavoidable error)   (variability of f̂)   (approximation error of f̂)

- Feature expansion can help reduce the third term (approximation error)
- But maybe at the cost of increasing the second term (variability)
44/52

Performance of OLS (1)
- Study in the context of the IID model
  - (X₁,Y₁),…,(Xₙ,Yₙ),(X,Y) are iid, and assume E[XXᵀ] is invertible (WLOG, i.e., without loss of generality).
- Let w* denote the minimizer of mse(w) over all w ∈ ℝᵈ.
- Inductive bias assumption: mse(w*) is small, i.e., there is a linear function with low MSE.
  - This is a fairly "weak" modeling assumption, especially compared to the normal regression model.
- How much larger is mse(ŵ) compared to mse(w*)?
45/52

Performance of OLS (2)
- Theorem: In the IID model, the OLS solution ŵ satisfies
  n·(E[mse(ŵ)] − mse(w*)) → tr(cov(εW))
  as n → ∞, where W = E[XXᵀ]^{−1/2} X and ε = Y − Xᵀw*.
- Corollary: If, in addition, (X, Y) follows the normal linear regression model Y | X = x ∼ N(xᵀw*, σ²), then
  n·(E[mse(ŵ)] − mse(w*)) → σ²d,
  which is more typically written as
  E[mse(ŵ)] ≈ (1 + d/n)·mse(w*).
46/52
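The corollary can be checked by simulation. A minimal sketch under the normal linear regression model with standard normal X, so that mse(w) = ‖w − w*‖² + σ²; the constants are arbitrary and the code is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 5, 1.0
w_star = rng.normal(size=d)

def mse_of_ols_fit():
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # For standard normal X: mse(w) = E[(X^T w - Y)^2] = ||w - w_star||^2 + sigma^2
    return np.sum((w_hat - w_star) ** 2) + sigma ** 2

avg_mse = np.mean([mse_of_ols_fit() for _ in range(2000)])
print(avg_mse, (1 + d / n) * sigma ** 2)   # E[mse(w_hat)] vs (1 + d/n)*mse(w_star); close for n >> d
```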

Linear algebraic view of OLS (1)
- Write A = [a₁ ⋯ a_d] (the a_j are the columns)
  - a_j ∈ ℝⁿ is the j-th column of A
- The span of a₁, …, a_d is range(A), a subspace of ℝⁿ
- Minimizing R̂(w) = ‖Aw − b‖₂² over w ∈ ℝᵈ is the same as finding the vector b̂ in range(A) closest to b
47/52

Figure 8: Orthogonal projection of b onto range(A)
48/52
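A quick numerical illustration of this picture (an arbitrary small example): the least squares fit gives b̂ = Aŵ, and the residual b − b̂ is orthogonal to every column of A, hence to all of range(A).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))            # n = 6 examples, d = 3 features (1/sqrt(n) scaling omitted)
b = rng.normal(size=6)

w_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_hat = A @ w_hat                      # orthogonal projection of b onto range(A)
print(A.T @ (b - b_hat))               # ~ zero vector: residual is orthogonal to range(A)
```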

Linear algebraic view of OLS (2)
- Solution b̂ is the orthogonal projection of b onto range(A)
- b̂ is unique
- Residual b − b̂ is orthogonal to b̂
- To get w from b̂, solve Aw = b̂ for w.
- If rank(A) < d (always the case if n < d), then there are infinitely many ways to write b̂ as a linear combination of a₁, …, a_d.
- Upshot: uniqueness of the least squares solution requires n ≥ d, and n < d guarantees non-uniqueness!
49/52

Over-fitting (1)
- In the IID model, over-fitting is the phenomenon where the true risk is much worse than the empirical risk.
50/52

Over-fitting (2)
- Example:
  - φ(x) = (1, x, x², …, xᵏ), degree-k polynomial expansion
  - Dimension is d = k + 1
  - Any function of ≤ k + 1 points can be interpolated by a polynomial of degree ≤ k
  - So if n ≤ k + 1 = d, the least squares solution ŵ will have zero empirical risk, regardless of its true risk (assuming no two training examples share the same xᵢ but have different yᵢ's).

Figure 9: Polynomial interpolation
51/52

Beyond empirical risk minimization
- Recall the plug-in principle
  - Want to minimize risk with respect to the (unavailable) P; use Pₙ instead
- What if we can't regard the data as iid from P?
- Example: Suppose we know P = ½M + ½F (a mixture distribution). We get a size-n₁ iid sample from M and a size-n₂ iid sample from F, with n₂ ≪ n₁.
- How to implement the plug-in principle?
  - Weight the two samples: estimate E_P[g(X,Y)] by
    ½ · (1/n₁) Σᵢ₌₁^{n₁} g(xᵢ, yᵢ) + ½ · (1/n₂) Σᵢ₌₁^{n₂} g(x̃ᵢ, ỹᵢ),
    where the (xᵢ, yᵢ) are the M-sample and the (x̃ᵢ, ỹᵢ) are the F-sample.
52/52
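A minimal sketch of that re-weighting applied to least squares (synthetic data; the component distributions and sample sizes are arbitrary assumptions). Each M-sample example gets weight 1/(2n₁) and each F-sample example gets weight 1/(2n₂), so the weighted empirical distribution mimics P = ½M + ½F.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d = 1000, 50, 3
w_true = np.array([1.0, 0.0, -1.0])

X_M = rng.normal(loc=0.0, size=(n1, d))              # large sample from component M
X_F = rng.normal(loc=2.0, size=(n2, d))              # small sample from component F
y_M = X_M @ w_true + rng.normal(scale=0.1, size=n1)
y_F = X_F @ w_true + rng.normal(scale=0.1, size=n2)

X = np.vstack([X_M, X_F])
y = np.concatenate([y_M, y_F])
weights = np.concatenate([np.full(n1, 0.5 / n1), np.full(n2, 0.5 / n2)])

# Weighted least squares: minimize sum_i weights_i * (x_i^T w - y_i)^2
# by rescaling each row with sqrt(weights_i) and running ordinary least squares.
sw = np.sqrt(weights)
w_hat = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print(w_hat)   # close to w_true
```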