Machine learning lecture slides
COMS 4771 Fall 2020
Announcements
- HW1: solutions shared with you on Google Drive
- HW2 (linear regression): due Nov 18th
- Recitation sections this week: WS1 tomorrow, WS2 later in the week (Zoom links on Courseworks)
- Quiz 1: available on Courseworks on Friday at 11:20 am; see instructions on Piazza
  - Open notes; do it by yourself; 40-minute time limit; complete it by Saturday
Regression I: Linear regression
Outline
- Statistical model for regression
- College GPA example
- Ordinary least squares for linear regression
- The expected mean squared error
- Different views of ordinary least squares
- Features and linearity
- Over-fitting
- Beyond empirical risk minimization
Figure 1: Galton board
Real-valued predictions
- Example: Galton board
  - Physical model: hard
  - Statistical model: final position of the ball is random
    - Normal (Gaussian) distribution with mean μ and variance σ²
    - Written N(μ, σ²)
    - Probability density function:
        p_{μ,σ²}(y) = (1/√(2πσ²)) · exp(−(y − μ)² / (2σ²)),   y ∈ ℝ.
- Goal: predict the final position accurately; measure the quality of a prediction with the squared loss (also called squared error): (prediction − outcome)²
- The outcome is random, so look at the expected squared loss (also called mean squared error)
Optimal prediction for mean squared error
- Predict ŷ ∈ ℝ; the true final position is Y (a random variable) with mean E(Y) = μ and variance var(Y) = E[(Y − E(Y))²] = σ².
- Squared error is (ŷ − Y)².
- Bias-variance decomposition (a numerical check follows below):
    E[(ŷ − Y)²] = E[(ŷ − μ + μ − Y)²]
                = (ŷ − μ)² + 2(ŷ − μ)·E[μ − Y] + E[(μ − Y)²]
                = (ŷ − μ)² + σ².
- This is true for any random variable Y; we don't need the normality assumption.
- So the optimal prediction is ŷ = μ.
- When the parameters are unknown, they can be estimated from related data, …
- Can also do an analysis of a plug-in prediction …
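A quick Monte Carlo check of the decomposition above; the distribution and the parameter values below are arbitrary illustrative choices, not from the slides.

    # Monte Carlo check of E[(yhat - Y)^2] = (yhat - mu)^2 + sigma^2 for Y ~ N(mu, sigma^2).
    # The parameter values are arbitrary (illustration only).
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.5, 2.0
    Y = rng.normal(mu, sigma, size=1_000_000)

    for yhat in [0.0, 1.0, mu, 2.0]:
        mc = np.mean((yhat - Y) ** 2)               # empirical expected squared error
        formula = (yhat - mu) ** 2 + sigma ** 2     # bias-variance decomposition
        print(f"yhat={yhat:.2f}  monte_carlo={mc:.3f}  formula={formula:.3f}")
    # Smallest expected squared error at yhat = mu, as claimed.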
Statistical model for regression
- Setting is the same as for classification, except:
  - The label is a real number, rather than {0,1} or {1,2,…,K}
  - We care about the squared loss, rather than whether the prediction is correct
- Mean squared error of a prediction function (or regressor) f:
    mse(f) := E[(f(X) − Y)²],
  the expected squared loss of f on a random example
Optimal prediction function for regression
- If (X, Y) is a random test example, then the optimal prediction function is
    f*(x) = E[Y | X = x]
- Also called the regression function or conditional mean function
- It is the prediction function with the smallest MSE
- Depends on the conditional distribution of Y given X
Test MSE (1)
- Just like in classification, we can use test data to estimate mse(f̂) for a function f̂ that depends only on training data.
- IID model: (X1,Y1), …, (Xn,Yn), (X1′,Y1′), …, (Xm′,Ym′), (X,Y) are iid
  - Training examples (that you have): S := ((X1,Y1), …, (Xn,Yn))
  - Test examples (that you have): T := ((X1′,Y1′), …, (Xm′,Ym′))
  - Test example (that you don't have) used to define the MSE: (X,Y)
- The predictor f̂ is based only on the training examples
- Hence, the test examples are independent of f̂ (very important!)
- We would like to estimate mse(f̂)
Test MSE (2)
- Test MSE: mse(f̂, T) := (1/m) Σ_{i=1}^m (f̂(Xi′) − Yi′)²   (a sketch follows below)
- By the law of large numbers, mse(f̂, T) → mse(f̂) as m → ∞
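A minimal sketch of this estimate (the arrays X_test, y_test and the fitted predictor f_hat are hypothetical, not from the slides):

    # Held-out estimate of mse(f_hat): average squared loss over test examples.
    import numpy as np

    def test_mse(f_hat, X_test, y_test):
        """f_hat maps a single feature vector to a real-valued prediction."""
        preds = np.array([f_hat(x) for x in X_test])
        return np.mean((preds - np.asarray(y_test)) ** 2)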
Example: College GPA
- Data: College GPAs of 750 Dartmouth students
  - Mean: 2.46
  - Standard deviation: 0.746
- Assume this data is an iid sample from the population of Dartmouth students (false)
- Absent any other features, the best constant prediction of a uniformly random Dartmouth student's College GPA is ŷ := 2.46.

Figure 2: Histogram of College GPA
Predicting College GPA from HS GPA (1)
- Students represented in the data also have a High School (HS) GPA
- Maybe HS GPA is predictive of College GPA?
- Data: S := ((x1,y1), …, (xn,yn))
  - x_i is the HS GPA of the i-th student
  - y_i is the College GPA of the i-th student
Figure 3: Plot of College GPA vs HS GPA
Predicting College GPA from HS GPA (2)
- First attempt (a code sketch follows below):
  - Define intervals of possible HS GPAs:
      (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], …
  - For each such interval I, record the mean μ̂_I of the College GPAs of students whose HS GPA falls in I.
  - Predict with the piecewise-constant function
      f̂(x) := μ̂_(0.00,0.25]  if x ∈ (0.00, 0.25]
              μ̂_(0.25,0.50]  if x ∈ (0.25, 0.50]
              μ̂_(0.50,0.75]  if x ∈ (0.50, 0.75]
              ⋮
- (What to do about an interval I that contains no student's HS GPA?)
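A minimal sketch of this binning predictor (hypothetical arrays x, y of HS and College GPAs; bins of width 0.25). One possible answer to the question above: an empty bin falls back to the overall mean.

    # Piecewise-constant predictor: mean College GPA within each HS-GPA bin.
    import numpy as np

    def fit_binned_means(x, y, width=0.25):
        x, y = np.asarray(x, float), np.asarray(y, float)
        bins = np.floor(x / width).astype(int)
        overall_mean = y.mean()
        bin_means = {b: y[bins == b].mean() for b in np.unique(bins)}
        def f_hat(x_new):
            b = int(np.floor(x_new / width))
            return bin_means.get(b, overall_mean)   # empty bin -> overall mean
        return f_hat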
Figure 4: Plot of mean College GPA vs binned HS GPA
Predicting College GPA from HS GPA (3)
- Define
    mse(f, S) := (1/|S|) Σ_{(x,y)∈S} (f(x) − y)²,
  the mean squared error of the predictions made by f on the examples in S.
  - The "mean" is with respect to the uniform distribution on the examples in S.
- For the piecewise-constant predictor: mse(f̂, S) = 0.376, so √mse(f̂, S) = 0.613 < 0.746 (the standard deviation of the y_i's).
- The piecewise-constant function f̂ is an improvement over the constant function (i.e., just predicting the mean 2.46 for all x)!
Predicting College GPA from HS GPA (4)
- But f̂ has some quirks.
  - E.g., those with HS GPA between 2.50 and 2.75 are predicted to have a lower College GPA than those with HS GPA between 2.25 and 2.50.
  - E.g., something unusual with the student who has an HS GPA of 4.5
Figure 5: Plot of mean College GPA vs binned HS GPA
Least squares linear regression (1)
- Suppose we'd like to only consider functions with a specific functional form, e.g., a linear function:
    f(x) = mx + θ
  for m, θ ∈ ℝ.
- Technically, x ↦ mx + θ is linear iff θ = 0. If θ ≠ 0, the function is not linear but affine.
- Semantics: positive m means a higher HS GPA gets a higher prediction of College GPA.
Least squares linear regression (2)
- What is the linear function with the smallest MSE on (x1,y1), …, (xn,yn) ∈ ℝ × ℝ? This is the problem of least squares linear regression.
- Find (m, θ) ∈ ℝ² to minimize
    (1/n) Σ_{i=1}^n (m x_i + θ − y_i)².
- Also called ordinary least squares (OLS)
Figure 6: Plot of least squares linear regression line
Computing OLS (1)
- Derivatives-equal-zero conditions (normal equations):
    ∂/∂θ [ (1/n) Σ_{i=1}^n (m x_i + θ − y_i)² ] = (2/n) Σ_{i=1}^n (m x_i + θ − y_i) = 0
    ∂/∂m [ (1/n) Σ_{i=1}^n (m x_i + θ − y_i)² ] = (2/n) Σ_{i=1}^n (m x_i + θ − y_i) x_i = 0.
- System of two linear equations with two unknowns (m, θ).
- Define the sample averages
    avg(x) := (1/n) Σ_{i=1}^n x_i,     avg(y) := (1/n) Σ_{i=1}^n y_i,
    avg(x²) := (1/n) Σ_{i=1}^n x_i²,   avg(xy) := (1/n) Σ_{i=1}^n x_i y_i,
  so the system can be re-written as
    avg(x)·m + θ = avg(y)
    avg(x²)·m + avg(x)·θ = avg(xy).
Computing OLS (2)
- Write in matrix notation:
    [ avg(x)    1      ] [ m ]     [ avg(y)  ]
    [ avg(x²)   avg(x) ] [ θ ]  =  [ avg(xy) ]
- Solution: (m̂, θ̂) ∈ ℝ² given by
    m̂ := (avg(xy) − avg(x)·avg(y)) / (avg(x²) − avg(x)²),
    θ̂ := avg(y) − m̂·avg(x).
- In other words, m̂ is the sample covariance of the x_i's and y_i's divided by the sample variance of the x_i's. (A code sketch follows below.)
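A short sketch of the closed form above, on made-up arrays (not from the slides):

    # One-dimensional OLS: m_hat = cov(x, y) / var(x), theta_hat = avg(y) - m_hat * avg(x).
    import numpy as np

    def ols_1d(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        m_hat = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
        theta_hat = y.mean() - m_hat * x.mean()
        return m_hat, theta_hat

    # Example usage on made-up points:
    m_hat, theta_hat = ols_1d([1.0, 2.0, 3.0, 4.0], [2.1, 2.9, 4.2, 4.8])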
Computing OLS (3)
- Catch: the above solution only makes sense if avg(x²) − avg(x)² ≠ 0, i.e., the variance of the x_i's is non-zero.
- If avg(x²) − avg(x)² = 0, then the matrix defining the LHS of the system of equations is singular.
  - In that case, set m := 0 and choose θ optimally, i.e., θ := avg(y).
Computing OLS (4)
- In general, "derivative equals zero" is only a necessary condition for a solution to be optimal; it is not necessarily a sufficient condition!
- Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
Decomposition of expected MSE (1)
- Two different functions of HS GPA for predicting College GPA.
- What makes them different?
- We care about the prediction of College GPA for a student we haven't seen before, based on their HS GPA.
- IID model: (X1,Y1), …, (Xn,Yn), (X,Y) are iid
- Say the training examples (X1,Y1), …, (Xn,Yn) are used to determine f̂.
- What is E[mse(f̂)]?
Decomposition of expected MSE (2)
    E[mse(f̂)] = E[ E[(f̂(X) − Y)² | f̂] ]
              = E[ E[(f̂(X) − Y)² | f̂, X] ]
              = E[ var(Y | X) + (f̂(X) − E[Y | X])² ]
              = E[ var(Y | X) + E[(f̂(X) − E[Y | X])² | X] ]
              = E[ var(Y | X) + var(f̂(X) | X) + (E[f̂(X) | X] − E[Y | X])² ]
              = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²]
                (unavoidable error)   (variability of f̂)   (approximation error of f̂)
- The last two terms are affected by your choice of learning algorithm.
Decomposition of expected MSE (3)
- The first term quantifies the inherent unpredictability of Y (even after seeing X)
- The second term measures the "variability" of f̂ due to the random nature of the training data. It depends on:
  - the probability distribution of the training data,
  - the type of function being fit (e.g., piecewise constant, linear),
  - the method of fitting (e.g., OLS),
  - etc.
- The third term quantifies how well a function produced by the fitting procedure can approximate the regression function, even after removing the "variability" of f̂.
Multivariate linear regression (1)
- For the Dartmouth data, we also have the SAT Score of every student.
- Can we use both predictor variables (HS GPA and SAT Score) to get an even better prediction of College GPA?
- Binning approach: instead of a 1-D grid (intervals), consider a 2-D grid (squares).
- Linear regression: a function f : ℝ² → ℝ of the form
    f(x) = m1·x1 + m2·x2 + θ
  for some (m1, m2) ∈ ℝ² and θ ∈ ℝ.
Multivariate linear regression (2)
- The general case: a (homogeneous) linear function f : ℝᵈ → ℝ of the form
    f(x) = xᵀw
  for some w ∈ ℝᵈ.
- What about inhomogeneous linear functions?
  - Just always include a "feature" that always has value 1. Then the corresponding weight acts like θ from before.
- w is called the weight vector or coefficient vector.
Multivariate ordinary least squares (1)
- What is the linear function with the smallest MSE on (x1,y1), …, (xn,yn) ∈ ℝᵈ × ℝ?
- Find w ∈ ℝᵈ to minimize
    R̂(w) := (1/n) Σ_{i=1}^n (x_iᵀw − y_i)².
- Notation warning: x_i ∈ ℝᵈ
Multivariate ordinary least squares (2)
- In matrix notation:
    R̂(w) := ‖Aw − b‖²
  where A ∈ ℝⁿˣᵈ is the matrix whose i-th row is x_iᵀ and b ∈ ℝⁿ collects the labels, both scaled by 1/√n:
    A := (1/√n) · [x1ᵀ; …; xnᵀ],   b := (1/√n) · (y1, …, yn)ᵀ.
- If we put a vector v ∈ ℝᵈ in the context of matrix multiplication, it is treated as a column vector by default!
- If we want a row vector, we write vᵀ.
- Therefore
    Aw − b = (1/√n) · (x1ᵀw − y1, …, xnᵀw − yn)ᵀ.
Figure 7: Geometric picture of least squares linear regression
Multivariate normal equations (1)
- Like the one-dimensional case, optimal solutions are characterized by a system of linear equations (the "derivatives equal zero" conditions) called the normal equations:
    ∇_w R̂(w) = (∂R̂(w)/∂w1, …, ∂R̂(w)/∂wd)ᵀ = 2Aᵀ(Aw − b) = 0,
  which is equivalent to
    AᵀAw = Aᵀb.
  (Note that AᵀA ∈ ℝᵈˣᵈ and Aᵀb ∈ ℝᵈ.)
Multivariate normal equations (2)
- If AᵀA is non-singular (i.e., invertible), then there is a unique solution, given by
    ŵ := (AᵀA)⁻¹Aᵀb.
- If AᵀA is singular, then there are infinitely many solutions!
- Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
  (A code sketch follows below.)
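A minimal sketch with NumPy on synthetic data (not from the slides). Solving the normal equations directly works when AᵀA is invertible; np.linalg.lstsq also handles the singular case.

    # Multivariate OLS via the normal equations A^T A w = A^T b.
    # (The 1/sqrt(n) scaling of A and b cancels out of the normal equations.)
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 3
    A = rng.normal(size=(n, d))                  # rows are feature vectors x_i^T
    w_true = np.array([1.0, -2.0, 0.5])
    b = A @ w_true + 0.1 * rng.normal(size=n)    # noisy labels y_i

    w_hat = np.linalg.solve(A.T @ A, A.T @ b)         # unique solution (A^T A invertible here)
    w_alt, *_ = np.linalg.lstsq(A, b, rcond=None)     # works even if A^T A is singular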
Algorithm for least squares linear regression
- How do we solve the least squares linear regression problem?
- Just solve the normal equations, a system of d linear equations in d unknowns.
- Time complexity (naïve) of the Gaussian elimination algorithm: O(d³).
- Actually, we also need to count the time to form the system of equations, which is O(nd²).
- Total: O(nd² + d³) time.
Announcements
- Check your scores on Courseworks.
- Check Piazza regularly for announcements, including a change to how the quizzes contribute to the overall grade.

Recap: regression
- Features X (available at the time of prediction); outcome Y to predict. Treat (X, Y) as random variables.
- Judge a prediction ŷ of the outcome Y by its squared error (ŷ − Y)².
- Apply a prediction function f to X to get a prediction of Y; the badness of f is its mean squared error
    mse(f) := E[(f(X) − Y)²].

Recap: linear regression and ordinary least squares
- Assume X takes values in ℝᵈ and the regression functions are linear (or affine): f(x) = xᵀw, where w ∈ ℝᵈ is the weight vector.
- Ordinary least squares: given training data S := ((x1,y1), …, (xn,yn)) in ℝᵈ × ℝ, find w ∈ ℝᵈ that minimizes
    mse(w, S) = (1/n) Σ_{i=1}^n (x_iᵀw − y_i)².
Classical statistics view of OLS (1)
- Normal linear regression model:
  - Model the training examples (X1,Y1), …, (Xn,Yn) as iid random variables taking values in ℝᵈ × ℝ, where
      Yi | Xi = xi  ~  N(x_iᵀw, σ²)
  - w ∈ ℝᵈ and σ² > 0 are the parameters of the model.
- The least squares linear regression problem is the same as the problem of finding the maximum likelihood value for w. (A short derivation follows below.)
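To spell out the equivalence claimed above (a standard derivation, not an addition to the model): under the normal linear regression model, the log-likelihood of w is, up to terms not depending on w, a negative multiple of the sum of squared errors.

    \log \prod_{i=1}^n p_{w,\sigma^2}(y_i \mid x_i)
      = \sum_{i=1}^n \Bigl( -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - x_i^\top w)^2}{2\sigma^2} \Bigr)
      = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i^\top w - y_i)^2,

so maximizing the likelihood over w is the same as minimizing Σ_{i=1}^n (x_iᵀw − y_i)².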
Classical statistics view of OLS (2)
- Suppose your data really does come from a distribution in this statistical model, say, with parameters w and σ².
- Then the function with the smallest MSE is the linear function f*(x) = xᵀw, and its MSE is mse(f*) = σ².
- So estimating w is a sensible idea! (Plug-in principle…)
Statistical learning view of OLS (1)
- IID model: (X1,Y1), …, (Xn,Yn), (X,Y) ~iid P are iid random variables taking values in ℝᵈ × ℝ
  - (X, Y) is the (unseen) "test" example
- Goal: find a (linear) function w ∈ ℝᵈ with small MSE
    mse(w) = E[(Xᵀw − Y)²].
- We cannot directly minimize mse(w) as a function of w ∈ ℝᵈ, since it is an expectation (e.g., an integral) with respect to the unknown distribution P
Statistical learning view of OLS (2)
- However, we have an iid sample S := ((X1,Y1), …, (Xn,Yn)).
- We swap out P in the definition of mse(f) and replace it with the empirical distribution on S (a distribution over ℝᵈ × ℝ):
    Pn(x, y) := (1/n) Σ_{i=1}^n 1{(x,y) = (Xi,Yi)}.
- This is the distribution that puts probability mass 1/n on the i-th training example.
- The resulting objective function is the empirical risk
    E[(X̃ᵀw − Ỹ)²] = (1/n) Σ_{i=1}^n (Xiᵀw − Yi)²,
  where (X̃, Ỹ) ~ Pn.
Statistical learning view of OLS (3)
- In some circles:
  - (True/population) risk of w: R(w) := E[(Xᵀw − Y)²]
  - Empirical risk of w: R̂(w) := (1/n) Σ_{i=1}^n (Xiᵀw − Yi)²
- This is another instance of the plug-in principle!
- We want to minimize mse(w), but we don't know P, so we replace it with our estimate Pn.
- OLS is empirical risk minimization with the squared loss over linear functions.
Statistical learning view of OLS (4)
- This is not specific to linear regression; it also works for other types of functions, and for other types of prediction problems, including classification.
- For classification (f : X → {0,1}):
  - (True/population) risk of f: R(f) := E[1{f(X) ≠ Y}]
  - Empirical risk of f: R̂(f) := (1/n) Σ_{i=1}^n 1{f(Xi) ≠ Yi}
- All that changed is the loss function (squared loss versus zero/one loss)
- A procedure that minimizes the empirical risk: empirical risk minimization (ERM)
Upgrading linear regression (1)
- Make linear regression more powerful by being creative about features
- We are forced to do this if x is not already provided as a vector of numbers
- Instead of using x directly, use φ(x) for some transformation φ (possibly vector-valued)
Upgrading linear regression (2)
- Examples (see the sketch after this list):
  - Affine feature expansion, e.g., φ(x) = (1, x), to accommodate an intercept
  - Standardization, e.g., φ(x) = (x − μ)/σ, where (μ, σ²) are (estimates of) the mean and variance of the feature value
  - Non-linear scalar transformations, e.g., φ(x) = ln(1 + x)
  - Logical formula, e.g., φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7)
  - Trigonometric expansion, e.g., φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), …)
  - Polynomial expansion, e.g., φ(x) = (1, x1, …, xd, x1², …, xd², x1x2, …, x_{d−1}x_d)
  - Headless neural network: φ(x) = N(x) ∈ ℝᵏ, where N : ℝᵈ → ℝᵏ is a map computed by an intermediate layer of a neural network
    - (Later, we'll talk about how to "learn" N.)
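A minimal sketch of the recipe (anticipating the body-temperature example on the next slide): expand the scalar feature with φ(x) = (1, x, x²) and reuse the same OLS machinery. The data here is synthetic.

    # OLS after a feature expansion phi(x) = (1, x, x^2).
    import numpy as np

    def phi(x):
        x = np.asarray(x, float)
        return np.column_stack([np.ones_like(x), x, x ** 2])

    rng = np.random.default_rng(1)
    x = rng.uniform(95.0, 103.0, size=200)              # e.g., body temperatures
    y = (x - 98.6) ** 2 + 0.5 * rng.normal(size=200)    # quadratic target plus noise

    w_hat, *_ = np.linalg.lstsq(phi(x), y, rcond=None)  # recovers approx. (98.6^2, -2*98.6, 1)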
Example: Taking advantage of linearity
- Example: y is a health outcome, x is body temperature
- A physician suggests that the relevant feature is the (squared) deviation from normal body temperature, (x − 98.6)²
- What if you didn't know the magic constant 98.6? (Apparently it is wrong in the US anyway)
- Use φ(x) = (1, x, x²)
- Can learn coefficients w ∈ ℝ³ such that φ(x)ᵀw = (x − 98.6)², or any other quadratic polynomial in x (which could be better!)
  - E.g., (x − 98.6)² = 98.6² − 2·98.6·x + x², so w = (98.6², −2·98.6, 1) works.
Example: Binning features
- Dartmouth data example, where we considered intervals for the HS GPA variable:
    (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], …
- Use φ(x) = (1{x∈(0.00,0.25]}, 1{x∈(0.25,0.50]}, …) with a linear function
- What is φ(x)ᵀw?
  - φ(x)ᵀw = w_j if x is in the j-th interval.
Effect of feature expansion on expected MSE

    E[mse(f̂)] = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²]
                (unavoidable error)   (variability of f̂)   (approximation error of f̂)

- Feature expansion can help reduce the third term (the approximation error)
- But maybe at the cost of increasing the second term (the variability)
Performance of OLS (1)
- Study in the context of the IID model
  - (X1,Y1), …, (Xn,Yn), (X,Y) are iid, and assume E[XXᵀ] is invertible (WLOG, i.e., without loss of generality).
- Let w* denote the minimizer of mse(w) over all w ∈ ℝᵈ.
- Inductive bias assumption: mse(w*) is small, i.e., there is a linear function with low MSE.
  - This is a fairly "weak" modeling assumption, especially compared to the normal linear regression model.
- How much larger is mse(ŵ) compared to mse(w*)?
Performance of OLS (2)
- Theorem: In the IID model, the OLS solution ŵ satisfies
    n·(E[mse(ŵ)] − mse(w*)) → tr(cov(εW))
  as n → ∞, where W = E[XXᵀ]^{−1/2} X and ε = Y − Xᵀw*.
- Corollary: If, in addition, (X, Y) follows the normal linear regression model Y | X = x ~ N(xᵀw*, σ²), then
    n·(E[mse(ŵ)] − mse(w*)) → σ²d,
  which is more typically written as
    E[mse(ŵ)] ≈ (1 + d/n)·mse(w*).
Linear algebraic view of OLS (1)
- Write A = [a1 ··· ad] in terms of its columns
  - a_j ∈ ℝⁿ is the j-th column of A
- The span of a1, …, ad is range(A), a subspace of ℝⁿ
- Minimizing R̂(w) = ‖Aw − b‖² over w ∈ ℝᵈ is the same as finding the vector b̂ in range(A) closest to b
Figure 8: Orthogonal projection of b onto range(A)
Linear algebraic view of OLS (2)
- The solution b̂ is the orthogonal projection of b onto range(A)
  - b̂ is unique
  - The residual b − b̂ is orthogonal to b̂ (a numerical check follows below)
- To get w from b̂, solve Aw = b̂ for w.
- If rank(A) < d (always the case if n < d), then there are infinitely many ways to write b̂ as a linear combination of a1, …, ad.
- Upshot: uniqueness of the least squares solution requires n ≥ d, and n < d guarantees non-uniqueness!
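A small numerical check of this picture on made-up data (not from the slides): the residual b − b̂ is orthogonal to every column of A, i.e., Aᵀ(b − Aŵ) = 0, which is just the normal equations again.

    # Check that the OLS residual is orthogonal to range(A).
    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(10, 3))
    b = rng.normal(size=10)

    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    b_hat = A @ w_hat                  # orthogonal projection of b onto range(A)
    print(A.T @ (b - b_hat))           # approximately the zero vector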
Over-fitting (1)
- In the IID model, over-fitting is the phenomenon where the true risk is much worse than the empirical risk.
Over-fitting (2)
- Example (a demonstration follows below):
  - φ(x) = (1, x, x², …, xᵏ), the degree-k polynomial expansion
  - Dimension is d = k + 1
  - Any function of ≤ k + 1 points can be interpolated by a polynomial of degree ≤ k
  - So if n ≤ k + 1 = d, the least squares solution ŵ will have zero empirical risk, regardless of its true risk (assuming no two training examples with the same x_i have different y_i's).
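A minimal sketch of this phenomenon on made-up data: with n = k + 1 = d distinct training points, the degree-k least squares fit interpolates them exactly, so the empirical risk is zero regardless of the true risk.

    # Degree-k polynomial fit with n = k + 1 points: zero empirical risk.
    import numpy as np

    rng = np.random.default_rng(3)
    k = 9
    x = np.sort(rng.uniform(0.0, 1.0, size=k + 1))
    y = rng.normal(size=k + 1)                      # labels are pure noise

    A = np.vander(x, N=k + 1, increasing=True)      # columns: 1, x, x^2, ..., x^k
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(np.mean((A @ w_hat - y) ** 2))            # ~ 0 (interpolation)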
Figure 9: Polynomial interpolation
Beyond empirical risk minimization
- Recall the plug-in principle
  - We want to minimize the risk with respect to the (unavailable) P; use Pn instead
- What if we can't regard the data as iid from P?
- Example: suppose we know P = (1/2)M + (1/2)F (a mixture distribution)
  - We get a size-n1 iid sample from M and a size-n2 iid sample from F, with n2 ≪ n1
- How do we implement the plug-in principle?
  - One natural answer (sketched below): plug in the estimate P̂ := (1/2)M̂ + (1/2)F̂ built from the two subsamples, i.e., minimize
      (1/2)·(1/n1)·Σ_{i in M-sample} (x_iᵀw − y_i)² + (1/2)·(1/n2)·Σ_{i in F-sample} (x_iᵀw − y_i)².
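A hedged sketch of that weighting (the arrays X_m, y_m, X_f, y_f are hypothetical): each subsample is weighted by its mixture proportion (1/2 each) rather than by its sample size.

    # Weighted empirical risk for P = 0.5*M + 0.5*F with unequal sample sizes n1, n2.
    import numpy as np

    def weighted_risk(w, X_m, y_m, X_f, y_f):
        risk_m = np.mean((X_m @ w - y_m) ** 2)   # empirical risk on the M sample
        risk_f = np.mean((X_f @ w - y_f) ** 2)   # empirical risk on the F sample
        return 0.5 * risk_m + 0.5 * risk_f

    def weighted_ols(X_m, y_m, X_f, y_f):
        # Rescale rows so plain least squares on the stacked data minimizes weighted_risk.
        s_m, s_f = np.sqrt(0.5 / len(y_m)), np.sqrt(0.5 / len(y_f))
        A = np.vstack([s_m * X_m, s_f * X_f])
        b = np.concatenate([s_m * y_m, s_f * y_f])
        w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
        return w_hat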