Machine learning lecture slides
COMS 4771 Fall 2020
Regression I: Linear regression
Outline
I Statistical model for regression
I College GPA example
I Ordinary least squares for linear regression
I The expected mean squared error
I Different views of ordinary least squares
I Features and linearity
I Over-fitting
I Beyond empirical risk minimization
Figure 1: Galton board
Real-valued predictions
I Example: Galton board
I Physical model: hard
I Statistical model: final position of ball is random
I Normal (Gaussian) distribution with mean µ and variance σ2
I Written N(µ, σ2)
I Probability density function is
pµ,σ²(y) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²)), y ∈ R.
I Goal: predict final position accurately, measure squared loss
(also called squared error)
(prediction− outcome)2
I Outcome is random, so look at expected squared loss (also
called mean squared error)
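As a quick illustration of the statistical model (not from the original slides): the final position of a ball is the sum of many small left/right deflections, so it is approximately normal by the central limit theorem. A minimal numpy sketch, with the number of rows and balls chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(0)
    n_rows, n_balls = 20, 100_000                         # arbitrary illustration values
    steps = rng.choice([-1, 1], size=(n_balls, n_rows))   # each peg deflects the ball left or right
    final_position = steps.sum(axis=1)                    # approximately N(0, n_rows) by the CLT
    print(final_position.mean(), final_position.var())    # close to 0 and 20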
Optimal prediction for mean squared error
I Predict ŷ ∈ R; true final position is Y (random variable) with
mean E(Y ) = µ and variance var(Y ) = E[(Y − E(Y ))2] = σ2.
I Squared error is (ŷ − Y )2.
I Bias-variance decomposition:
E[(ŷ − Y)²] = E[(ŷ − µ + µ − Y)²]
= (ŷ − µ)² + 2(ŷ − µ) E[µ − Y] + E[(µ − Y)²]
= (ŷ − µ)² + σ²   (since E[µ − Y] = 0).
I This is true for any random variable Y ; don’t need normality
assumption.
I So optimal prediction is ŷ = µ.
I When parameters are unknown, can estimate from related data,
. . .
I Can also do an analysis of a plug-in prediction . . .
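A minimal numerical check of this decomposition (the constants below are arbitrary choices for illustration): the expected squared error should match (ŷ − µ)² + σ² and be smallest at ŷ = µ.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 0.5
    Y = rng.normal(mu, sigma, size=1_000_000)          # samples of the random outcome

    for y_hat in [1.0, 1.5, 2.0, 2.5]:
        empirical = np.mean((y_hat - Y) ** 2)          # Monte Carlo estimate of E[(yhat - Y)^2]
        theoretical = (y_hat - mu) ** 2 + sigma ** 2
        print(y_hat, round(empirical, 4), round(theoretical, 4))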
Statistical model for regression
I Setting is same as for classification except:
I Label is real number, rather than {0, 1} or {1, 2, . . . ,K}
I Care about squared loss, rather than whether prediction is
correct
I Mean squared error of f :
mse(f) := E[(f(X)− Y )2],
the expected squared loss of f on random example
Optimal prediction function for regression
I If (X,Y ) is random test example, then
optimal prediction function is
f⋆(x) = E[Y | X = x]
I Also called the regression function or conditional mean function
I Prediction function with smallest MSE
I Depends on conditional distribution of Y given X
Test MSE (1)
I Just like in classification, we can use test data to estimate
mse(f̂) for a function f̂ that depends only on training data.
I IID model:
(X1, Y1), . . . , (Xn, Yn), (X′1, Y′1), . . . , (X′m, Y′m), (X, Y) are iid
I Training examples (that you have):
S := ((X1, Y1), . . . , (Xn, Yn))
I Test examples (that you have): T := ((X′1, Y′1), . . . , (X′m, Y′m))
I Test example (that you don’t have) used to define MSE: (X,Y )
I Predictor f̂ is based only on training examples
I Hence, test examples are independent of f̂ (very
important!)
I We would like to estimate mse(f̂)
Test MSE (2)
I Test MSE: mse(f̂ , T) = (1/m) ∑_{i=1}^m (f̂(X′i) − Y′i)²
I By law of large numbers, mse(f̂ , T )→ mse(f̂) as m→∞
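In code, this estimate is just the average squared error over the held-out examples; f_hat, X_test, and y_test below are placeholder names, not from the slides.

    import numpy as np

    def test_mse(f_hat, X_test, y_test):
        """Average squared error of predictor f_hat on held-out examples."""
        predictions = np.array([f_hat(x) for x in X_test])
        return float(np.mean((predictions - np.asarray(y_test)) ** 2))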
Example: College GPA
I College GPA data from 750 Dartmouth students
I Mean: 2.46
I Standard deviation: 0.746
I Assume this data is iid sample from the population of
Dartmouth students (false)
I Absent any other features, best constant prediction of a
uniformly random Dartmouth student’s College GPA is
ŷ := 2.46.
Figure 2: Histogram of College GPA
Predicting College GPA from HS GPA (1)
I Students represented in data have High School (HS) GPA
I Maybe HS GPA is predictive of College GPA?
I Data: S := ((x1, y1), . . . , (xn, yn))
I xi is HS GPA of i-th student
I yi is College GPA of i-th student
Figure 3: Plot of College GPA vs HS GPA
Predicting College GPA from HS GPA (2)
I First attempt:
I Define intervals of possible HS GPAs:
(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·
I For each such interval I, record the mean µ̂I of the College
GPAs of students whose HS GPA falls in I.
f̂(x) :=
µ̂(0.00,0.25] if x ∈ (0.00, 0.25]
µ̂(0.25,0.50] if x ∈ (0.25, 0.50]
µ̂(0.50,0.75] if x ∈ (0.50, 0.75]
…
I (What to do about an interval I that contains no student’s HS
GPA?)
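A sketch of this first attempt in numpy; hs_gpa and college_gpa are assumed to be arrays of the training data, and empty bins fall back to the overall mean (one possible answer to the question above).

    import numpy as np

    def fit_binned_means(hs_gpa, college_gpa, width=0.25, max_gpa=5.0):
        edges = np.arange(width, max_gpa + width, width)   # right endpoints 0.25, 0.50, ...
        fallback = college_gpa.mean()                      # used for empty bins
        bins = np.clip(np.digitize(hs_gpa, edges, right=True), 0, len(edges) - 1)
        means = np.full(len(edges), fallback)
        for j in range(len(edges)):
            if np.any(bins == j):
                means[j] = college_gpa[bins == j].mean()
        def f_hat(x):                                      # piecewise-constant predictor
            return means[np.clip(np.digitize(x, edges, right=True), 0, len(edges) - 1)]
        return f_hat

    # Training MSE of this predictor:
    # np.mean((fit_binned_means(hs_gpa, college_gpa)(hs_gpa) - college_gpa) ** 2)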
Figure 4: Plot of mean College GPA vs binned HS GPA
Predicting College GPA from HS GPA (3)
I Define
mse(f, S) := (1/|S|) ∑_{(x,y)∈S} (f(x) − y)²,
the mean squared error of predictions made by f on examples
in S.
I “mean” is with respect to the uniform distribution on examples
in S.
mse(f̂ , S) = 0.376, and √mse(f̂ , S) = 0.613 < 0.746 (the standard deviation of the yi’s)
I Piece-wise constant function f̂ is an improvement over the
constant function (i.e., just predicting the mean 2.46 for all x)!
Predicting College GPA from HS GPA (4)
I But f̂ has some quirks.
I E.g., those with HS GPA between 2.50 and 2.75 are predicted
to have a lower College GPA than those with HS GPA between
2.25 and 2.50.
I E.g., something unusual with the student who has HS GPA of
4.5
Figure 5: Plot of mean College GPA vs binned HS GPA
Least squares linear regression (1)
I Suppose we’d like to only consider functions with a specific
functional form, e.g., a linear function:
f(x) = mx+ θ
for m, θ ∈ R.
I Technically, x ↦ mx + θ is linear iff θ = 0. If θ ≠ 0, the function is not linear but affine.
I Semantics: Positive m means higher HS GPA gets a higher
prediction of College GPA.
Least squares linear regression (2)
I What is the linear function with smallest MSE on
(x1, y1), . . . , (xn, yn) ∈ R× R? This is the problem of
least squares linear regression.
I Find (m, θ) ∈ R² to minimize
(1/n) ∑_{i=1}^n (mxi + θ − yi)².
I Also called ordinary least squares (OLS)
Figure 6: Plot of least squares linear regression line
Computing OLS (1)
I Derivatives equal zero conditions (normal equations):
∂/∂θ [ (1/n) ∑_{i=1}^n (mxi + θ − yi)² ] = (2/n) ∑_{i=1}^n (mxi + θ − yi) = 0
∂/∂m [ (1/n) ∑_{i=1}^n (mxi + θ − yi)² ] = (2/n) ∑_{i=1}^n (mxi + θ − yi) xi = 0.
I System of two linear equations with two unknowns (m, θ).
I Define the sample averages
mean(x) := (1/n) ∑_{i=1}^n xi,   mean(x²) := (1/n) ∑_{i=1}^n xi²,
mean(xy) := (1/n) ∑_{i=1}^n xiyi,   mean(y) := (1/n) ∑_{i=1}^n yi,
so the system can be re-written as
mean(x) m + θ = mean(y)
mean(x²) m + mean(x) θ = mean(xy).
Computing OLS (2)
I Write in matrix notation:
[ mean(x)    1       ] [ m ]   [ mean(y)  ]
[ mean(x²)   mean(x) ] [ θ ] = [ mean(xy) ]
I Solution: (m̂, θ̂) ∈ R² given by
m̂ := (mean(xy) − mean(x)·mean(y)) / (mean(x²) − mean(x)²),
θ̂ := mean(y) − m̂ · mean(x).
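A sketch of this closed-form solution in numpy (x and y would be the HS GPA and College GPA arrays in the running example):

    import numpy as np

    def ols_1d(x, y):
        """Closed-form least squares fit of f(x) = m*x + theta."""
        mean_x, mean_y = x.mean(), y.mean()
        mean_x2, mean_xy = (x ** 2).mean(), (x * y).mean()
        denom = mean_x2 - mean_x ** 2        # variance of the xi's; must be nonzero
        m_hat = (mean_xy - mean_x * mean_y) / denom
        theta_hat = mean_y - m_hat * mean_x
        return m_hat, theta_hat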
Computing OLS (3)
I Catch: The above solution only makes sense if mean(x²) − mean(x)² ≠ 0, i.e., the variance of the xi’s is non-zero.
I If mean(x²) − mean(x)² = 0, then the matrix on the LHS of the system of equations is singular.
Computing OLS (4)
I In general, “derivative equals zero” is only a necessary
condition for a solution to be optimal; not necessarily a
sufficient condition!
I Theorem: Every solution to the normal equations is an
optimal solution to the least squares linear regression problem.
Decomposition of expected MSE (1)
I Two different functions of HS GPA for predicting College GPA.
I What makes them different?
I We care about the prediction of College GPA for a student we haven’t seen before, based on their HS GPA.
I IID model: (X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid
I Say training examples (X1, Y1), . . . , (Xn, Yn) are used to
determine f̂ .
I What is E[mse(f̂)]?
Decomposition of expected MSE (2)
E[mse(f̂)]
= E[ E[(f̂(X) − Y)² | f̂ ] ]
= E[ E[(f̂(X) − Y)² | f̂ , X] ]
= E[ var(Y | X) + (f̂(X) − E[Y | X])² ]
= E[ var(Y | X) + E[(f̂(X) − E[Y | X])² | X] ]
= E[ var(Y | X) + var(f̂(X) | X) + (E[f̂(X) | X] − E[Y | X])² ]
= E[ var(Y | X) ]   (unavoidable error)
+ E[ var(f̂(X) | X) ]   (variability of f̂)
+ E[ (E[f̂(X) | X] − E[Y | X])² ]   (approximation error of f̂).
Decomposition of expected MSE (3)
I First term quantifies the inherent unpredictability of Y (even after seeing X)
I Second term measures the “variability” of f̂ due to the random
nature of training data. Depends on:
I probability distribution of training data,
I type of function being fit (e.g., piecewise constant, linear),
I method of fitting (e.g., OLS),
I etc.
I Third term quantifies how well a function produced by the
fitting procedure can approximate the regression function, even
after removing the “variability” of f̂ .
Multivariate linear regression (1)
I For Dartmouth data, also have SAT Score for all students.
I Can we use both predictor variables (HS GPA and SAT Score)
to get an even better prediction of College GPA?
I Binning approach: instead of a 1-D grid (intervals), consider a
2-D grid (squares).
I Linear regression: a function f : R2 → R of the form
f(x) = m1x1 +m2x2 + θ
for some (m1,m2) ∈ R2 and θ ∈ R.
Multivariate linear regression (2)
I The general case: a (homogeneous) linear function f : Rd → R
of the form
f(x) = xTw
for some w ∈ Rd.
I w is called the weight vector or coefficient vector.
I What about inhomogeneous linear functions?
I Just always include a “feature” whose value is always 1; the corresponding weight then acts like θ from before.
Multivariate ordinary least squares (1)
I What is the linear function with smallest MSE on
(x1, y1), . . . , (xn, yn) ∈ Rd × R?
I Find w ∈ Rd to minimize
R̂(w) := (1/n) ∑_{i=1}^n (xiᵀw − yi)².
I Notation warning: xi ∈ Rd
Multivariate ordinary least squares (2)
I In matrix notation:
R̂(w) = ‖Aw − b‖₂²,
where A ∈ Rn×d is the matrix whose i-th row is (1/√n) xiᵀ, and b := (1/√n) (y1, . . . , yn) ∈ Rn.
I If we put vector v ∈ Rd in the context of matrix multiplication,
it is treated as a column vector by default!
I If we want a row vector, we write vT.
I Therefore
Aw − b = (1/√n) (x1ᵀw − y1, . . . , xnᵀw − yn).
Figure 7: Geometric picture of least squares linear regression
Multivariate normal equations (1)
I Like the one-dimensional case, optimal solutions are
characterized by a system of linear equations (the “derivatives
equal zero” conditions) called the normal equations:
∇w R̂(w) = (∂R̂(w)/∂w1, . . . , ∂R̂(w)/∂wd) = 2Aᵀ(Aw − b) = 0,
which is equivalent to
AᵀAw = Aᵀb.
Multivariate normal equations (2)
I If AᵀA is non-singular (i.e., invertible), then there is a unique solution given by
ŵ := (AᵀA)⁻¹Aᵀb.
I If ATA is singular, then there are infinitely many solutions!
I Theorem: Every solution to the normal equations is an
optimal solution to the least squares linear regression problem.
Algorithm for least squares linear regression
I How to solve least squares linear regression problem?
I Just solve the normal equations, a system of d linear equations
in d unknowns.
I Time complexity (naïve) of Gaussian elimination algorithm: O(d³).
I Actually, also need to count time to form the system of equations, which is O(nd²).
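A sketch of this computation in numpy. Forming AᵀA and solving takes the O(nd²) + O(d³) time noted above; np.linalg.lstsq is the more robust standard routine.

    import numpy as np

    def ols(X, y):
        """Least squares weights for data matrix X (n-by-d) and labels y (length n)."""
        n = len(y)
        A, b = X / np.sqrt(n), y / np.sqrt(n)   # the 1/sqrt(n) scaling does not change the minimizer
        gram, rhs = A.T @ A, A.T @ b            # forming the normal equations: O(n d^2)
        return np.linalg.solve(gram, rhs)       # solving them: O(d^3); requires A^T A non-singular

    # More robust alternative (also handles singular A^T A):
    # w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)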
Classical statistics view of OLS (1)
I Normal linear regression model
I Model training examples (X1, Y1), . . . , (Xn, Yn) as iid random
variables taking values in Rd × R, where
Yi | Xi = xi ∼ N(xiᵀw, σ²)
I w ∈ Rd and σ² > 0 are the parameters of the model.
I The least squares linear regression problem is the same as the
problem of finding the maximum likelihood value for w.
Classical statistics view of OLS (2)
I Suppose your data really does come from a distribution in this
statistical model, say, with parameters w and σ2.
I Then the function with smallest MSE is the linear function f⋆(x) = xᵀw, and its MSE is mse(f⋆) = σ².
I So estimating w is a sensible idea! (Plug-in principle. . . )
Statistical learning view of OLS (1)
I IID model: (X1, Y1), . . . , (Xn, Yn), (X,Y ) ∼iid P are iid
random variables taking values in Rd × R
I (X,Y ) is the (unseen) “test” example
I Goal: find a (linear) function w ∈ Rd with small MSE
mse(w) = E[(XTw − Y )2].
I We cannot directly minimize mse(w) as a function of w ∈ Rd,
since it is an expectation (e.g., integral) with respect to the
unknown distribution P
Statistical learning view of OLS (2)
I However, we have an iid sample S := ((X1, Y1), . . . , (Xn, Yn)).
I We swap out P in the definition of mse(f), and replace it with
the empirical distribution on S:
Pn(x, y) := (1/n) ∑_{i=1}^n 1{(x,y)=(xi,yi)}.
I This is the distribution that puts probability mass 1/n on the
i-th training example.
I Resulting objective function is
E[(X̃ᵀw − Ỹ)²] = (1/n) ∑_{i=1}^n (Xiᵀw − Yi)²,
where (X̃, Ỹ) ∼ Pn.
Statistical learning view of OLS (3)
I In some circles:
I (True/population) risk of w: R(w) := E[(Xᵀw − Y)²]
I Empirical risk of w: R̂(w) := (1/n) ∑_{i=1}^n (Xiᵀw − Yi)²
I This is another instance of the plug-in principle!
I We want to minimize mse(w) but we don’t know P , so we
replace it with our estimate Pn.
Statistical learning view of OLS (4)
I This is not specific to linear regression; also works for other
types of functions, and also other types of prediction problems,
including classification.
I For classification:
I (True/population) risk of f: R(f) := E[1{f(X) ≠ Y}]
I Empirical risk of f: R̂(f) := (1/n) ∑_{i=1}^n 1{f(Xi) ≠ Yi}
I All that changed is the loss function (squared loss versus
zero/one loss)
I Procedure that minimizes empirical risk:
Empirical risk minimization (ERM)
Upgrading linear regression (1)
I Make linear regression more powerful by being creative about
features
I We are forced to do this if x is not already provided as a vector
of numbers
I Instead of using x directly, use ϕ(x) for some transformation ϕ
(possibly vector-valued)
Upgrading linear regression (2)
I Examples:
I Affine feature expansion, e.g., ϕ(x) = (1, x), to accommodate
intercept
I Standardization, e.g., ϕ(x) = (x− µ)/σ where (µ, σ2) are
(estimates of) the mean and variance of the feature value
I Non-linear scalar transformations, e.g., ϕ(x) = ln(1 + x)
I Logical formula, e.g., ϕ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7)
I Trigonometric expansion, e.g.,
ϕ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . )
I Polynomial expansion, e.g.,
ϕ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , xd−1xd)
I Headless neural network: ϕ(x) = N(x) ∈ Rk, where N : Rd → Rk is a map computed by an intermediate layer of a neural network
I (Later, we’ll talk about how to “learn” N .)
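For instance, a polynomial feature expansion followed by ordinary least squares might look like the sketch below (poly_expand and the placeholder arrays x_train, y_train are illustrative names, not from the slides).

    import numpy as np

    def poly_expand(x, k):
        """phi(x) = (1, x, x^2, ..., x^k) applied to a vector of scalar inputs."""
        x = np.asarray(x, dtype=float)
        return np.column_stack([x ** j for j in range(k + 1)])

    # Fit a linear function of the expanded features by OLS:
    # Phi = poly_expand(x_train, k=3)
    # w_hat, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    # predictions = poly_expand(x_new, k=3) @ w_hat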
Example: Taking advantage of linearity
I Example: y is health outcome, x is body temperature
I Physician suggests relevant feature is (square) deviation from
normal body temperature (x− 98.6)2
I What if you didn’t know the magic constant 98.6? (Apparently
it is wrong in the US anyway)
I Use ϕ(x) = (1, x, x2)
I Can learn coefficients w ∈ R3 such that ϕ(x)Tw = (x− 98.6)2,
or any other quadratic polynomial in x (which could be better!)
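For concreteness: (x − 98.6)² = x² − 197.2 x + 9721.96, so the coefficient vector w = (9721.96, −197.2, 1) gives ϕ(x)ᵀw = (x − 98.6)² exactly; OLS is free to find these coefficients or any others that fit the data better.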
Example: Binning features
I Dartmouth data example, where we considered intervals for the
HS GPA variable:
(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·
I Use ϕ(x) = (1{x∈(0.00,0.25]},1{x∈(0.25,0.50]}, . . . ) with a linear
function
I What is ϕ(x)Tw?
I ϕ(x)Tw = wj if x is in the j-th interval.
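A sketch of this indicator feature map in numpy (the interval edges and names are illustrative):

    import numpy as np

    def phi_binned(x, edges=None):
        """One-hot indicator of which interval (edges[j-1], edges[j]] contains each x."""
        if edges is None:
            edges = np.arange(0.25, 5.25, 0.25)    # right endpoints 0.25, 0.50, ..., 5.00
        x = np.atleast_1d(np.asarray(x, dtype=float))
        j = np.clip(np.digitize(x, edges, right=True), 0, len(edges) - 1)
        one_hot = np.zeros((len(x), len(edges)))
        one_hot[np.arange(len(x)), j] = 1.0
        return one_hot

    # phi_binned(x) @ w simply picks out w[j] for the interval containing x.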
Effect of feature expansion on expected MSE
E[mse(f̂)] = E[ var(Y | X) ]   (unavoidable error)
+ E[ var(f̂(X) | X) ]   (variability of f̂)
+ E[ (E[f̂(X) | X] − E[Y | X])² ]   (approximation error of f̂).
I Feature expansion can help reduce the third term
(approximation error)
I But maybe at the cost of increasing the second term
(variability)
Performance of OLS (1)
I Study in context of IID model
I (X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid, and assume E[XXT] is
invertible (WLOG).
I Let w∗ denote the minimizer of mse(w) over all w ∈ Rd.
I Inductive bias assumption: mse(w∗) is small, i.e., there is a
linear function with low MSE.
I This is a fairly “weak” modeling assumption, especially
compared to the normal regression model.
I How much larger is mse(ŵ) compared to mse(w∗)?
Performance of OLS (2)
I Theorem: In the IID model, the OLS solution ŵ satisfies
n (E[mse(ŵ)] − mse(w∗)) → tr(cov(εW))
as n → ∞, where W = E[XXᵀ]^(−1/2) X and ε = Y − Xᵀw∗.
I Corollary: If, in addition, (X, Y) follows the normal linear regression model
Y | X = x ∼ N(xᵀw∗, σ²), then
n (E[mse(ŵ)] − mse(w∗)) → σ²d,
which is more typically written as
E[mse(ŵ)] → (1 + d/n) mse(w∗).
Linear algebraic view of OLS (1)
I Write A = [a1 · · · ad]
I aj ∈ Rn is the j-th column of A
I Span of a1, . . . , ad is range(A), a subspace of Rn
I Minimizing R̂(w) = ‖Aw − b‖₂² over w ∈ Rd is the same as finding the vector b̂ in range(A) closest to b
Figure 8: Orthogonal projection of b onto range(A)
Linear algebraic view of OLS (2)
I Solution b̂ is orthogonal projection of b onto range(A)
I b̂ is unique
I Residual b− b̂ is orthogonal to b̂
I To get w from b̂, solve Aw = b̂ for w.
I If rank(A) < d (always the case if n < d), then there are infinitely many ways to write b̂ as a linear combination of a1, . . . , ad.
I Upshot: Uniqueness of least squares solution requires n ≥ d,
and n < d guarantees non-uniqueness!
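A quick numerical illustration of this picture (random A and b chosen for the sketch, so here n > d and AᵀA is invertible with probability 1):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 3))                 # n = 50, d = 3
    b = rng.normal(size=50)

    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    b_hat = A @ w_hat                            # orthogonal projection of b onto range(A)
    print(np.allclose(A.T @ (b - b_hat), 0))     # residual is orthogonal to every column of A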
Over-fitting (1)
I In the IID model, over-fitting is the phenomenon where the
true risk is much worse than the empirical risk.
Over-fitting (2)
I Example:
I ϕ(x) = (1, x, x², . . . , x^k), degree-k polynomial expansion
I Dimension is d = k + 1
I Any ≤ k + 1 points with distinct x values can be interpolated by a polynomial of degree ≤ k
I So if n ≤ k + 1 = d, the least squares solution ŵ will have zero empirical risk, regardless of its true risk (assuming no two training examples have the same xi but different yi’s).
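A sketch of this phenomenon with made-up data (degree chosen so that d = k + 1 ≥ n):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 8, 7                                            # n <= k + 1 = d
    x = rng.uniform(0, 1, size=n)
    y = rng.normal(size=n)                                 # pure noise: nothing real to learn

    Phi = np.column_stack([x ** j for j in range(k + 1)])  # phi(x) = (1, x, ..., x^k)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(np.mean((Phi @ w_hat - y) ** 2))                 # empirical risk is (numerically) zero
    # ...yet the true risk of this wildly wiggly polynomial is large.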
Figure 9: Polynomial interpolation
Beyond empirical risk minimization
I Recall plug-in principle
I Want to minimize risk with respect to (unavailable) P ; use Pn
instead
I What if we can’t regard data as iid from P?
I Example: Suppose we know P = (1/2) M + (1/2) F (mixture distribution)
I We get a size-n1 iid sample from M, and a size-n2 iid sample from F, with n2 ≪ n1
I How to implement plug-in principle?