Machine learning lecture slides
COMS 4771 Fall 2020
Regression I: Linear regression
Outline
Statistical model for regression
College GPA example
Ordinary least squares for linear regression
The expected mean squared error
Different views of ordinary least squares
Features and linearity
Over-fitting
Beyond empirical risk minimization
Figure 1: Galton board
Real-valued predictions
Example: Galton board
Physical model: hard
Statistical model: final position of ball is random
Normal (Gaussian) distribution with mean μ and variance σ², written N(μ, σ²)
Probability density function is
  p_{μ,σ²}(y) = (1/√(2πσ²))·exp(−(y − μ)²/(2σ²)),   y ∈ R.
Goal: predict final position accurately, measure squared loss
(also called squared error)
(prediction − outcome)²
Outcome is random, so look at expected squared loss (also called mean squared error)
Optimal prediction for mean squared error
Predict ŷ ∈ R; true final position is Y (random variable) with mean E(Y) = μ and variance var(Y) = E[(Y − E(Y))²] = σ².
Squared error is (ŷ − Y)².
Bias-variance decomposition:
  E[(ŷ − Y)²] = E[(ŷ − μ + μ − Y)²]
             = (ŷ − μ)² + 2(ŷ − μ)·E[μ − Y] + E[(μ − Y)²]
             = (ŷ − μ)² + σ².
This is true for any random variable Y; don't need normality assumption.
So optimal prediction is ŷ = μ.
When parameters are unknown, can estimate from related data,
…
Can also do an analysis of a plug-in prediction . . .
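As a quick numerical check of the decomposition above, here is a small simulation sketch (assuming NumPy; the values of μ, σ, and ŷ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8   # hypothetical mean and standard deviation of the final position
y_hat = 2.0            # some fixed prediction

Y = rng.normal(mu, sigma, size=1_000_000)        # simulated outcomes
empirical = np.mean((y_hat - Y) ** 2)            # Monte Carlo estimate of E[(ŷ − Y)²]
theoretical = (y_hat - mu) ** 2 + sigma ** 2     # (ŷ − μ)² + σ²
print(empirical, theoretical)                    # the two values should nearly agree
```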
Statistical model for regression
Setting is same as for classification except:
Label is a real number, rather than {0, 1} or {1, 2, …, K}
Care about squared loss, rather than whether prediction is correct
Mean squared error of f:
  mse(f) := E[(f(X) − Y)²],
the expected squared loss of f on random example
Optimal prediction function for regression
If (X, Y ) is random test example, then optimal prediction function is
f⋆(x) = E[Y | X = x]
Also called the regression function or conditional mean function
Prediction function with smallest MSE
Depends on conditional distribution of Y given X
Test MSE (1)
Just like in classification, we can use test data to estimate mse(f̂) for a function f̂ that depends only on training data.
IID model: (X1, Y1), …, (Xn, Yn), (X1′, Y1′), …, (Xm′, Ym′), (X, Y) are iid
Training examples (that you have): S := ((X1, Y1), …, (Xn, Yn))
Test examples (that you have): T := ((X1′, Y1′), …, (Xm′, Ym′))
Test example (that you don't have) used to define MSE: (X, Y)
Predictor f̂ is based only on training examples
Hence, test examples are independent of f̂ (very important!)
We would like to estimate mse(f̂)
Test MSE (2)
Test MSE: mse(f̂, T) := (1/m) ∑_{i=1}^m (f̂(Xi′) − Yi′)²
By the law of large numbers, mse(f̂, T) → mse(f̂) as m → ∞
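A minimal sketch of this estimate (assuming NumPy; `f_hat`, `X_test`, and `y_test` are hypothetical names for a fitted predictor and held-out test arrays):

```python
import numpy as np

def test_mse(f_hat, X_test, y_test):
    """Average squared error on held-out examples; estimates mse(f_hat)."""
    return np.mean((f_hat(X_test) - y_test) ** 2)
```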
Example: College GPA
Data from 750 Dartmouth students’ College GPA
Mean: 2.46
Standard deviation: 0.746
Assume this data is an iid sample from the population of Dartmouth students (false)
Absent any other features, best constant prediction of a uniformly random Dartmouth student’s College GPA is ŷ := 2.46.
Figure 2: Histogram of College GPA
Predicting College GPA from HS GPA (1)
Students represented in data have High School (HS) GPA
Maybe HS GPA is predictive of College GPA?
Data: S := ((x1, y1), …, (xn, yn))
xi is HS GPA of i-th student
yi is College GPA of i-th student
Figure 3: Plot of College GPA vs HS GPA
Predicting College GPA from HS GPA (2)
First attempt:
Define intervals of possible HS GPAs:
(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·
For each such interval I, record the mean μ̂_I of the College GPAs of students whose HS GPA falls in I.
  f̂(x) :=  μ̂_(0.00,0.25]   if x ∈ (0.00, 0.25]
           μ̂_(0.25,0.50]   if x ∈ (0.25, 0.50]
           μ̂_(0.50,0.75]   if x ∈ (0.50, 0.75]
           ⋮
(What to do about an interval I that contains no student's HS GPA?)
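A sketch of this first attempt in code (assuming NumPy and 1-D arrays `hs_gpa`, `college_gpa`; the bin width and the fallback for empty bins are choices not specified on the slide):

```python
import numpy as np

def fit_binned_predictor(hs_gpa, college_gpa, width=0.25, lo=0.0, hi=5.0):
    """Predict College GPA by the mean within each HS GPA interval (lo, lo+width], ..."""
    edges = np.arange(lo, hi + width, width)
    overall_mean = college_gpa.mean()
    means = np.full(len(edges) - 1, overall_mean)        # fallback for empty bins
    for j in range(len(edges) - 1):
        in_bin = (hs_gpa > edges[j]) & (hs_gpa <= edges[j + 1])
        if in_bin.any():
            means[j] = college_gpa[in_bin].mean()

    def f_hat(x):
        # right=True matches the half-open intervals (edges[j], edges[j+1]]
        j = np.clip(np.digitize(x, edges, right=True) - 1, 0, len(means) - 1)
        return means[j]

    return f_hat
```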
Figure 4: Plot of mean College GPA vs binned HS GPA
Predicting College GPA from HS GPA (3)
Define
  mse(f, S) := (1/|S|) ∑_{(x,y)∈S} (f(x) − y)²,
the mean squared error of predictions made by f on examples in S.
“mean” is with respect to the uniform distribution on examples in S.
mse(f̂, S) = 0.376
Piece-wise constant function f̂ is an improvement over the constant function (i.e., just predicting the mean 2.46 for all x)!
√mse(f̂, S) = 0.613 < 0.746 (the standard deviation of the yi's)
Predicting College GPA from HS GPA (4)
But f̂ has some quirks.
E.g., those with HS GPA between 2.50 and 2.75 are predicted to have a lower College GPA than those with HS GPA between 2.25 and 2.50.
E.g., something unusual with the student who has HS GPA of 4.5
Figure 5: Plot of mean College GPA vs binned HS GPA
Least squares linear regression (1)
Suppose we’d like to only consider functions with a specific functional form, e.g., a linear function:
f(x) = mx + θ
for m, θ ∈ R.
Technically, x ↦ mx + θ is linear iff θ = 0. If θ ≠ 0, the function is not linear but affine.
Semantics: Positive m means higher HS GPA gets a higher prediction of College GPA.
Least squares linear regression (2)
What is the linear function with smallest MSE on (x1, y1), …, (xn, yn) ∈ R × R? This is the problem of least squares linear regression.
Find (m, θ) ∈ R² to minimize
  (1/n) ∑_{i=1}^n (m·xi + θ − yi)².
Also called ordinary least squares (OLS).
Figure 6: Plot of least squares linear regression line
Computing OLS (1)
Derivatives equal zero conditions (normal equations):
  ∂/∂θ [ (1/n) ∑_{i=1}^n (m·xi + θ − yi)² ] = (2/n) ∑_{i=1}^n (m·xi + θ − yi) = 0
  ∂/∂m [ (1/n) ∑_{i=1}^n (m·xi + θ − yi)² ] = (2/n) ∑_{i=1}^n (m·xi + θ − yi)·xi = 0.
System of two linear equations with two unknowns (m, θ).
Define the sample averages
  \bar{x} := (1/n) ∑_{i=1}^n xi,    \overline{x^2} := (1/n) ∑_{i=1}^n xi²,
  \bar{y} := (1/n) ∑_{i=1}^n yi,    \overline{xy} := (1/n) ∑_{i=1}^n xi·yi,
so the system can be re-written as
  \bar{x}·m + θ = \bar{y}
  \overline{x^2}·m + \bar{x}·θ = \overline{xy}.
Computing OLS (2)
Write in matrix notation:
  \begin{pmatrix} \bar{x} & 1 \\ \overline{x^2} & \bar{x} \end{pmatrix}
  \begin{pmatrix} m \\ \theta \end{pmatrix}
  =
  \begin{pmatrix} \bar{y} \\ \overline{xy} \end{pmatrix}.
Solution: (m̂, θ̂) ∈ R² is given by
  m̂ := (\overline{xy} − \bar{x}\,\bar{y}) / (\overline{x^2} − \bar{x}^2),
  θ̂ := \bar{y} − ((\overline{xy} − \bar{x}\,\bar{y}) / (\overline{x^2} − \bar{x}^2))·\bar{x}.
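A sketch of this closed-form solution (assuming NumPy arrays `x` and `y` of equal length, with non-zero variance in `x`):

```python
import numpy as np

def ols_1d(x, y):
    """Least squares fit of y ≈ m·x + θ via the sample averages above."""
    x_bar, y_bar = x.mean(), y.mean()
    xsq_bar, xy_bar = (x * x).mean(), (x * y).mean()
    m_hat = (xy_bar - x_bar * y_bar) / (xsq_bar - x_bar ** 2)
    theta_hat = y_bar - m_hat * x_bar
    return m_hat, theta_hat
```

As a sanity check, np.polyfit(x, y, 1) should return essentially the same slope and intercept.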
Computing OLS (3)
Catch: The above solution only makes sense if \overline{x^2} − \bar{x}^2 ≠ 0, i.e., the variance of the xi's is non-zero.
If \overline{x^2} − \bar{x}^2 = 0, then the matrix defining the LHS of the system of equations is singular.
Computing OLS (4)
In general, “derivative equals zero” is only a necessary condition for a solution to be optimal; not necessarily a sufficient condition!
Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
Decomposition of expected MSE (1)
Two different functions of HS GPA for predicting College GPA. What makes them different?
We care about prediction of College GPA for student we haven’t
seen before based on their HS GPA.
IID model: (X1,Y1),...,(Xn,Yn), (X,Y) are iid
Say training examples (X1, Y1), …, (Xn, Yn) are used to determine f̂.
What is E[mse(f̂)]?
Decomposition of expected MSE (2)
E[mse(f̂)]
  = E[ E[(f̂(X) − Y)² | f̂] ]
  = E[ E[(f̂(X) − Y)² | f̂, X] ]
  = E[ var(Y | X) + (f̂(X) − E[Y | X])² ]
  = E[ var(Y | X) + E[(f̂(X) − E[Y | X])² | X] ]
  = E[ var(Y | X) + var(f̂(X) | X) + (E[f̂(X) | X] − E[Y | X])² ]
  = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²]
The three terms are, respectively, the unavoidable error, the variability of f̂, and the approximation error of f̂.
Decomposition of expected MSE (3)
First term quantifies inherent unpredictability of Y (even after seeing X)
Second term measures the "variability" of f̂ due to the random nature of training data. Depends on:
  probability distribution of training data,
  type of function being fit (e.g., piecewise constant, linear),
  method of fitting (e.g., OLS),
  etc.
Third term quantifies how well a function produced by the fitting procedure can approximate the regression function, even after removing the "variability" of f̂.
Multivariate linear regression (1)
For Dartmouth data, also have SAT Score for all students.
Can we use both predictor variables (HS GPA and SAT Score)
to get an even better prediction of College GPA?
Binning approach: instead of a 1-D grid (intervals), consider a
2-D grid (squares).
Linear regression: a function f : R² → R of the form
  f(x) = m1·x1 + m2·x2 + θ
for some (m1, m2) ∈ R² and θ ∈ R.
Multivariate linear regression (2)
The general case: a (homogeneous) linear function f : Rᵈ → R of the form
  f(x) = xᵀw
for some w ∈ Rᵈ.
w is called the weight vector or coefficient vector.
What about inhomogeneous linear functions? Just always include a "feature" that always has value 1. Then the corresponding weight acts like θ from before.
Multivariate ordinary least squares (1)
What is the linear function with smallest MSE on (x1, y1), …, (xn, yn) ∈ Rᵈ × R?
Find w ∈ Rᵈ to minimize
  R(w) := (1/n) ∑_{i=1}^n (xiᵀw − yi)².
Notation warning: xi ∈ Rᵈ
Multivariate ordinary least squares (2)
In matrix notation:
  R(w) := ∥Aw − b∥₂²
where
  A := (1/√n)·(x1, …, xn)ᵀ ∈ R^(n×d)  (the matrix whose i-th row is xiᵀ),
  b := (1/√n)·(y1, …, yn)ᵀ ∈ Rⁿ.
If we put a vector v ∈ Rᵈ in the context of matrix multiplication, it is treated as a column vector by default!
If we want a row vector, we write vᵀ.
Therefore
  Aw − b = (1/√n)·(x1ᵀw − y1, …, xnᵀw − yn)ᵀ.
Figure 7: Geometric picture of least squares linear regression
Multivariate normal equations (1)
Like the one-dimensional case, optimal solutions are characterized by a system of linear equations (the “derivatives equal zero” conditions) called the normal equations:
  ∇w R(w) = ( ∂R(w)/∂w1, …, ∂R(w)/∂wd )ᵀ = 2Aᵀ(Aw − b) = 0,
which is equivalent to
  AᵀAw = Aᵀb.
Multivariate normal equations (2)
If AᵀA is non-singular (i.e., invertible), then there is a unique solution, given by
  ŵ := (AᵀA)⁻¹Aᵀb.
If ATA is singular, then there are infinitely many solutions!
Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
Algorithm for least squares linear regression
How to solve least squares linear regression problem?
Just solve the normal equations, a system of d linear equations
in d unknowns.
Time complexity (naïve) of the Gaussian elimination algorithm: O(d³).
Actually, we also need to count the time to form the system of equations, which is O(nd²).
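A sketch of this procedure (assuming NumPy; `A` and `b` are as defined earlier, though the 1/√n factors cancel in the normal equations, so any consistent scaling works):

```python
import numpy as np

def ols(A, b):
    """Solve the normal equations AᵀA w = Aᵀ b; assumes AᵀA is invertible."""
    gram = A.T @ A                           # forming AᵀA costs O(n d²)
    return np.linalg.solve(gram, A.T @ b)    # Gaussian elimination, O(d³)

# np.linalg.lstsq(A, b, rcond=None) is a more numerically robust alternative,
# and returns a minimum-norm solution even when AᵀA is singular.
```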
Classical statistics view of OLS (1)
Normal linear regression model
Model training examples (X1, Y1), …, (Xn, Yn) as iid random variables taking values in Rᵈ × R, where
  Yi | Xi = xi ∼ N(xiᵀw, σ²)
w ∈ Rᵈ and σ² > 0 are the parameters of the model.
The least squares linear regression problem is the same as the problem of finding the maximum likelihood value for w.
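To see why, here is a short derivation sketch using only the model above: the negative log-likelihood of the observed labels is

```latex
-\log \prod_{i=1}^n p_{w,\sigma^2}(y_i \mid x_i)
  = \frac{1}{2\sigma^2} \sum_{i=1}^n \bigl(x_i^{\mathsf{T}} w - y_i\bigr)^2
  + \frac{n}{2} \log\bigl(2\pi\sigma^2\bigr),
```

so for any fixed σ² > 0, maximizing the likelihood over w is exactly minimizing the sum of squared errors, i.e., the OLS objective (up to the 1/n factor).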
Classical statistics view of OLS (2)
Suppose your data really does come from a distribution in this statistical model, say, with parameters w and σ2.
Then the function with smallest MSE is the linear function f⋆(x) = xᵀw, and its MSE is mse(f⋆) = σ².
So estimating w is a sensible idea! (Plug-in principle. . . )
Statistical learning view of OLS (1)
IID model: (X1, Y1), …, (Xn, Yn), (X, Y) are iid random variables with common distribution P, taking values in Rᵈ × R
(X, Y) is the (unseen) "test" example
Goal: find a (linear) function w ∈ Rᵈ with small MSE
  mse(w) = E[(Xᵀw − Y)²].
We cannot directly minimize mse(w) as a function of w ∈ Rᵈ, since it is an expectation (e.g., an integral) with respect to the unknown distribution P
Statistical learning view of OLS (2)
However, we have an iid sample S := ((X1, Y1), …, (Xn, Yn)).
We swap out P in the definition of mse(w), and replace it with the empirical distribution on S:
  Pn(x, y) := (1/n) ∑_{i=1}^n 1{(x, y) = (xi, yi)}.
This is the distribution that puts probability mass 1/n on the i-th training example.
Resulting objective function is
  E[(X̃ᵀw − Ỹ)²] = (1/n) ∑_{i=1}^n (Xiᵀw − Yi)²,
where (X̃, Ỹ) ∼ Pn.
Statistical learning view of OLS (3)
In some circles:
(True/population) risk of w: R(w) := E[(Xᵀw − Y)²]
Empirical risk of w: R̂(w) := (1/n) ∑_{i=1}^n (Xiᵀw − Yi)²
This is another instance of the plug-in principle!
We want to minimize mse(w) but we don’t know P, so we replace it with our estimate Pn.
Statistical learning view of OLS (4)
This is not specific to linear regression; also works for other types of functions, and also other types of prediction problems, including classification.
For classification:
(True/population) risk of f: R(f) := E[1{f(X) ≠ Y}]
Empirical risk of f: R̂(f) := (1/n) ∑_{i=1}^n 1{f(Xi) ≠ Yi}
All that changed is the loss function (squared loss versus zero/one loss)
Procedure that minimizes empirical risk: empirical risk minimization (ERM)
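A minimal sketch of the two empirical risks side by side (assuming NumPy; the array names are illustrative):

```python
import numpy as np

def empirical_risk_squared(predictions, labels):
    """Empirical risk under squared loss (regression)."""
    return np.mean((predictions - labels) ** 2)

def empirical_risk_zero_one(predictions, labels):
    """Empirical risk under zero/one loss (classification): fraction of mistakes."""
    return np.mean(predictions != labels)
```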
Upgrading linear regression (1)
Make linear regression more powerful by being creative about features
We are forced to do this if x is not already provided as a vector of numbers
Instead of using x directly, use φ(x) for some transformation φ (possibly vector-valued)
Upgrading linear regression (2)
Examples:
Affine feature expansion, e.g., φ(x) = (1, x), to accommodate intercept
Standardization, e.g., φ(x) = (x − μ)/σ where (μ, σ²) are (estimates of) the mean and variance of the feature value
Non-linear scalar transformations, e.g., φ(x) = ln(1 + x)
Logical formula, e.g., φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7)
Trigonometric expansion, e.g., φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), …)
Polynomial expansion, e.g., φ(x) = (1, x1, …, xd, x1², …, xd², x1x2, …, x_{d−1}·xd)
Headless neural network, e.g., φ(x) = N(x) ∈ Rᵏ, where N : Rᵈ → Rᵏ is a map computed by an intermediate layer of a neural network
(Later, we’ll talk about how to “learn” N.)
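A sketch of the general recipe (assuming NumPy; `phi` is a user-supplied feature map applied to each example, and the names are hypothetical):

```python
import numpy as np

def fit_with_features(phi, X, y):
    """OLS after replacing each input x with φ(x)."""
    Phi = np.array([phi(x) for x in X])          # n × k matrix of expanded features
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least squares in the expanded space
    return w

def predict(phi, w, x):
    return phi(x) @ w

# Example feature map: affine + quadratic expansion of a scalar input.
phi_quadratic = lambda x: np.array([1.0, x, x * x])
```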
Example: Taking advantage of linearity
Example: y is health outcome, x is body temperature
Physician suggests relevant feature is (square) deviation from
normal body temperature (x − 98.6)2
What if you didn’t know the magic constant 98.6? (Apparently
it is wrong in the US anyway)
Use φ(x) = (1, x, x²)
Can learn coefficients w ∈ R³ such that φ(x)ᵀw = (x − 98.6)², or any other quadratic polynomial in x (which could be better!)
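A small simulation sketch of this point (assuming NumPy; the temperature range, noise level, and "true" relationship are made up for illustration): OLS over φ(x) = (1, x, x²) recovers the quadratic (x − 98.6)² without being told the constant 98.6.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(95.0, 103.0, size=500)                  # body temperatures
y = (x - 98.6) ** 2 + rng.normal(0.0, 0.1, size=500)    # hypothetical health outcome

Phi = np.column_stack([np.ones_like(x), x, x ** 2])     # φ(x) = (1, x, x²)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # ≈ (98.6², −2·98.6, 1), the coefficients of (x − 98.6)²
```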
Example: Binning features
Dartmouth data example, where we considered intervals for the HS GPA variable:
(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·
Use φ(x) = (1{x∈(0.00,0.25]}, 1{x∈(0.25,0.50]}, …) with a linear function
What is φ(x)ᵀw?
φ(x)ᵀw = wj if x is in the j-th interval.
Effect of feature expansion on expected MSE
E[mse(f̂)] = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²]
The three terms are the unavoidable error, the variability of f̂, and the approximation error of f̂.
Feature expansion can help reduce the third term (approximation error)
But maybe at the cost of increasing the second term (variability)
Performance of OLS (1)
Study in context of IID model
(X1, Y1), …, (Xn, Yn), (X, Y) are iid, and assume E[XXᵀ] is invertible (WLOG).
Let w∗ denote the minimizer of mse(w) over all w ∈ Rᵈ.
Inductive bias assumption: mse(w∗) is small, i.e., there is a linear function with low MSE.
This is a fairly “weak” modeling assumption, especially compared to the normal regression model.
How much larger is mse(ŵ) compared to mse(w∗)?
Performance of OLS (2)
Theorem: In the IID model, the OLS solution ŵ satisfies
  n·(E[mse(ŵ)] − mse(w∗)) → tr(cov(εW))
as n → ∞, where W = E[XXᵀ]^(−1/2)·X and ε = Y − Xᵀw∗.
Corollary: If, in addition, (X, Y) follows the normal linear regression model Y | X = x ∼ N(xᵀw∗, σ²), then
  n·(E[mse(ŵ)] − mse(w∗)) → σ²·d,
which is more typically written as
  E[mse(ŵ)] → (1 + d/n)·mse(w∗).
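A Monte Carlo sketch of the corollary (assuming NumPy; n, d, σ, and the number of trials are arbitrary choices): under the normal linear regression model with E[XXᵀ] = I, the excess MSE of ŵ equals ∥ŵ − w∗∥², and its average over fresh training sets should be close to σ²·d/n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 10, 1.0
w_star = rng.normal(size=d)

excess = []
for _ in range(500):                               # repeat over independent training sets
    X = rng.normal(size=(n, d))                    # so E[XXᵀ] = I
    Y = X @ w_star + sigma * rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    excess.append(np.sum((w_hat - w_star) ** 2))   # mse(ŵ) − mse(w∗) when E[XXᵀ] = I

print(np.mean(excess), sigma ** 2 * d / n)         # the two numbers should be close
```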
Linear algebraic view of OLS (1)
Write A = (a1 | ⋯ | ad), where aj ∈ Rⁿ is the j-th column of A
Span of a1, …, ad is range(A), a subspace of Rⁿ
Minimizing R(w) = ∥Aw − b∥₂² over w ∈ Rᵈ is the same as finding the vector b̂ in range(A) closest to b
Figure 8: Orthogonal projection of b onto range(A)
Linear algebraic view of OLS (2)
Solution b̂ is the orthogonal projection of b onto range(A)
b̂ is unique
Residual b − b̂ is orthogonal to b̂
To get w from b̂, solve Aw = b̂ for w.
If rank(A) < d (always the case if n < d), then there are infinitely many ways to write b̂ as a linear combination of a1, …, ad.
Upshot: Uniqueness of the least squares solution requires n ≥ d, and n < d guarantees non-uniqueness!
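A sketch of the non-uniqueness when n < d (assuming NumPy): adding any vector from the null space of A to a least squares solution leaves Aw, and hence the empirical risk, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                           # fewer examples than dimensions
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min_norm, *_ = np.linalg.lstsq(A, b, rcond=None)    # the minimum-norm least squares solution
null_space = np.linalg.svd(A)[2][n:]                  # rows spanning the null space of A
w_other = w_min_norm + 3.0 * null_space[0]            # a different least squares solution

print(np.allclose(A @ w_min_norm, A @ w_other))       # True: identical predictions and risk
```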
Over-fitting (1)
In the IID model, over-fitting is the phenomenon where the true risk is much worse than the empirical risk.
Over-fitting (2)
Example:
φ(x) = (1, x, x², …, xᵏ), degree-k polynomial expansion
Dimension is d = k + 1
Any function on ≤ k + 1 (distinct) points can be interpolated by a polynomial of degree ≤ k
So if n ≤ k + 1 = d, the least squares solution ŵ will have zero empirical risk, regardless of its true risk (assuming no two training examples with the same xi have different yi's).
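A sketch of this phenomenon (assuming NumPy; the data-generating process is arbitrary): with n = k + 1 = d training points, the degree-k polynomial fit interpolates the data, so the empirical risk is (numerically) zero regardless of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
k = n - 1                                                  # so d = k + 1 = n
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)   # noisy targets

Phi = np.vander(x, k + 1)                                  # φ(x) = (x^k, ..., x, 1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.mean((Phi @ w - y) ** 2))                         # empirical risk ≈ 0 (interpolation)
```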
Figure 9: Polynomial interpolation
Beyond empirical risk minimization
Recall plug-in principle
Want to minimize risk with respect to (unavailable) P; use Pn
instead
What if we can’t regard data as iid from P ?
Example: Suppose we know P = (1/2)·M + (1/2)·F (mixture distribution)
We get a size-n1 iid sample from M, and a size-n2 iid sample from F, with n2 ≪ n1
How to implement plug-in principle?