
Machine learning lecture slides

COMS 4771 Fall 2020

0 / 52

Regression I: Linear regression

Outline

I Statistical model for regression
I College GPA example
I Ordinary least squares for linear regression
I The expected mean squared error
I Different views of ordinary least squares
I Features and linearity
I Over-fitting
I Beyond empirical risk minimization

1 / 52

Figure 1: Galton board

2 / 52

Real-valued predictions

I Example: Galton board
I Physical model: hard
I Statistical model: final position of ball is random

I Normal (Gaussian) distribution with mean µ and variance σ2
I Written N(µ, σ2)
I Probability density function is

pµ,σ2(y) = (1/√(2πσ2)) exp(−(y − µ)2/(2σ2)),  y ∈ R.

I Goal: predict final position accurately, measure squared loss
(also called squared error)

(prediction− outcome)2

I Outcome is random, so look at expected squared loss (also
called mean squared error)

3 / 52

Optimal prediction for mean squared error

I Predict ŷ ∈ R; true final position is Y (random variable) with
mean E(Y ) = µ and variance var(Y ) = E[(Y − E(Y ))2] = σ2.

I Squared error is (ŷ − Y )2.
I Bias-variance decomposition:

E[(ŷ − Y )2] = E[(ŷ − µ+ µ− Y )2]
= (ŷ − µ)2 + 2(ŷ − µ)E[(µ− Y )] + E[(µ− Y )2]
= (ŷ − µ)2 + σ2.

I This is true for any random variable Y ; don’t need normality
assumption.

I So optimal prediction is ŷ = µ.
I When parameters are unknown, can estimate from related data, . . .
I Can also do an analysis of a plug-in prediction . . .
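
A minimal Python sketch of the decomposition above, using made-up values µ = 0.3 and σ = 0.5: the Monte Carlo estimate of E[(ŷ − Y)2] matches (ŷ − µ)2 + σ2 and is smallest at ŷ = µ.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.3, 0.5                          # assumed (made-up) parameters of Y
    Y = rng.normal(mu, sigma, size=1_000_000)     # draws of the final position

    for y_hat in [mu - 0.5, mu, mu + 0.5]:
        mc = np.mean((y_hat - Y) ** 2)            # Monte Carlo estimate of E[(y_hat - Y)^2]
        exact = (y_hat - mu) ** 2 + sigma ** 2    # bias-variance decomposition
        print(f"y_hat={y_hat:+.2f}  MC={mc:.4f}  (y_hat-mu)^2 + sigma^2 = {exact:.4f}")
    # The expected squared error is smallest at y_hat = mu.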

4 / 52

Statistical model for regression

I Setting is same as for classification except:
I Label is real number, rather than {0, 1} or {1, 2, . . . ,K}
I Care about squared loss, rather than whether prediction is correct
I Mean squared error of f :

mse(f) := E[(f(X)− Y )2],

the expected squared loss of f on random example

5 / 52

Optimal prediction function for regression

I If (X,Y ) is random test example, then
optimal prediction function is

f⋆(x) = E[Y | X = x]

I Also called the regression function or conditional mean function
I Prediction function with smallest MSE
I Depends on conditional distribution of Y given X

6 / 52

Test MSE (1)

I Just like in classification, we can use test data to estimate
mse(f̂) for a function f̂ that depends only on training data.

I IID model:
(X1, Y1), . . . , (Xn, Yn), (X ′1, Y ′1), . . . , (X ′m, Y ′m), (X,Y ) are iid
I Training examples (that you have):

S := ((X1, Y1), . . . , (Xn, Yn))
I Test examples (that you have): T := ((X ′1, Y ′1), . . . , (X ′m, Y ′m))
I Test example (that you don’t have) used to define MSE: (X,Y )

I Predictor f̂ is based only on training examples
I Hence, test examples are independent of f̂ (very important!)
I We would like to estimate mse(f̂)

7 / 52

Test MSE (2)

I Test MSE mse(f̂ , T ) = (1/m) ∑_{i=1}^m (f̂(X′i)− Y′i)2

I By law of large numbers, mse(f̂ , T ) → mse(f̂) as m → ∞
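
A minimal Python sketch of estimating mse(f̂) from held-out data; the predictor f_hat and the arrays X_test, y_test are illustrative stand-ins.

    import numpy as np

    def test_mse(f_hat, X_test, y_test):
        """Average squared error of f_hat on held-out examples."""
        preds = np.array([f_hat(x) for x in X_test])
        return np.mean((preds - y_test) ** 2)

    # toy usage with a made-up predictor and made-up test data
    f_hat = lambda x: 0.7 * x + 0.5
    X_test = np.array([2.0, 3.0, 3.5])
    y_test = np.array([2.1, 2.5, 3.2])
    print(test_mse(f_hat, X_test, y_test))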

8 / 52

Example: College GPA

I Data from 750 Dartmouth students’ College GPA
I Mean: 2.46
I Standard deviation: 0.746

I Assume this data is iid sample from the population of
Dartmouth students (false)

I Absent any other features, best constant prediction of a
uniformly random Dartmouth student’s College GPA is
ŷ := 2.46.

Figure 2: Histogram of College GPA

9 / 52

Predicting College GPA from HS GPA (1)

I Students represented in data have High School (HS) GPA
I Maybe HS GPA is predictive of College GPA?

I Data: S := ((x1, y1), . . . , (xn, yn))
I xi is HS GPA of i-th student
I yi is College GPA of i-th student

10 / 52

Figure 3: Plot of College GPA vs HS GPA

11 / 52

Predicting College GPA from HS GPA (2)

I First attempt:
I Define intervals of possible HS GPAs:

(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·

I For each such interval I, record the mean µ̂I of the College
GPAs of students whose HS GPA falls in I.

f̂(x) :=  µ̂(0.00,0.25]  if x ∈ (0.00, 0.25],
          µ̂(0.25,0.50]  if x ∈ (0.25, 0.50],
          µ̂(0.50,0.75]  if x ∈ (0.50, 0.75],
          · · ·

I (What to do about an interval I that contains no student’s HS
GPA?)
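
A possible Python sketch of this first attempt, with an assumed bin width of 0.25 and the global mean as a fallback for empty bins (one arbitrary choice); the toy data is made up.

    import numpy as np

    def fit_binned_means(x, y, width=0.25):
        """Return a predictor that outputs the mean y within each HS-GPA bin."""
        edges = np.arange(0.0, x.max() + width, width)    # bin edges 0.00, 0.25, 0.50, ...
        idx = np.digitize(x, edges, right=True)           # which interval (edge, edge] each x falls in
        means = {j: y[idx == j].mean() for j in np.unique(idx)}
        default = y.mean()                                # used for bins with no students
        def f_hat(x_new):
            j = np.digitize(x_new, edges, right=True)
            return means.get(j, default)
        return f_hat

    # toy usage with made-up (HS GPA, College GPA) pairs
    x = np.array([2.1, 2.2, 2.6, 3.4, 3.9])
    y = np.array([1.9, 2.3, 2.4, 3.0, 3.6])
    f_hat = fit_binned_means(x, y)
    print(f_hat(2.15), f_hat(3.8))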

12 / 52

Figure 4: Plot of mean College GPA vs binned HS GPA

13 / 52

Predicting College GPA from HS GPA (3)

I Define

mse(f, S) := (1/|S|) ∑_{(x,y)∈S} (f(x)− y)2,

the mean squared error of predictions made by f on examples in S.
I “mean” is with respect to the uniform distribution on examples in S.

mse(f̂ , S) = 0.376
√mse(f̂ , S) = 0.613 < 0.746 (the standard deviation of the yi’s)

I Piece-wise constant function f̂ is an improvement over the constant
function (i.e., just predicting the mean 2.46 for all x)!

14 / 52

Predicting College GPA from HS GPA (4)

I But f̂ has some quirks.
I E.g., those with HS GPA between 2.50 and 2.75 are predicted to have
a lower College GPA than those with HS GPA between 2.25 and 2.50.
I E.g., something unusual with the student who has HS GPA of 4.5

Figure 5: Plot of mean College GPA vs binned HS GPA

15 / 52

Least squares linear regression (1)

I Suppose we’d like to only consider functions with a specific
functional form, e.g., a linear function:

f(x) = mx + θ

for m, θ ∈ R.
I Technically, x ↦ mx + θ is linear iff θ = 0. If θ ≠ 0, the function
is not linear but affine.
I Semantics: Positive m means higher HS GPA gets a higher
prediction of College GPA.

16 / 52

Least squares linear regression (2)

I What is the linear function with smallest MSE on
(x1, y1), . . . , (xn, yn) ∈ R × R? This is the problem of
least squares linear regression.
I Find (m, θ) ∈ R2 to minimize

(1/n) ∑_{i=1}^n (mxi + θ − yi)2.

I Also called ordinary least squares (OLS)

17 / 52

Figure 6: Plot of least squares linear regression line

18 / 52

Computing OLS (1)

I Derivatives equal zero conditions (normal equations):

∂/∂θ [ (1/n) ∑_{i=1}^n (mxi + θ − yi)2 ] = (2/n) ∑_{i=1}^n (mxi + θ − yi) = 0
∂/∂m [ (1/n) ∑_{i=1}^n (mxi + θ − yi)2 ] = (2/n) ∑_{i=1}^n (mxi + θ − yi)xi = 0.

I System of two linear equations with two unknowns (m, θ).
I Define

mean(x) := (1/n) ∑_{i=1}^n xi,   mean(x2) := (1/n) ∑_{i=1}^n xi2,
mean(xy) := (1/n) ∑_{i=1}^n xiyi,   mean(y) := (1/n) ∑_{i=1}^n yi,

so the system can be re-written as

mean(x) m + θ = mean(y)
mean(x2) m + mean(x) θ = mean(xy).

19 / 52

Computing OLS (2)

I Write in matrix notation:

[ mean(x)    1       ] [ m ]     [ mean(y)  ]
[ mean(x2)   mean(x) ] [ θ ]  =  [ mean(xy) ]

I Solution: (m̂, θ̂) ∈ R2 given by

m̂ := (mean(xy)− mean(x) · mean(y)) / (mean(x2)− mean(x)2),
θ̂ := mean(y)− m̂ · mean(x).

20 / 52

Computing OLS (3)

I Catch: The above solution only makes sense if
mean(x2)− mean(x)2 ≠ 0, i.e., the variance of the xi’s is non-zero.
I If mean(x2)− mean(x)2 = 0, then the matrix defining the LHS of
the system of equations is singular.

21 / 52

Computing OLS (4)

I In general, “derivative equals zero” is only a necessary condition
for a solution to be optimal; not necessarily a sufficient condition!
I Theorem: Every solution to the normal equations is an optimal
solution to the least squares linear regression problem.

22 / 52

Decomposition of expected MSE (1)

I Two different functions of HS GPA for predicting College GPA.
I What makes them different?
I We care about prediction of College GPA for student we haven’t
seen before based on their HS GPA.
I IID model: (X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid
I Say training examples (X1, Y1), . . . , (Xn, Yn) are used to
determine f̂ .
I What is E[mse(f̂)]?

23 / 52

Decomposition of expected MSE (2)

E[mse(f̂)] = E[ E[(f̂(X)− Y )2 | f̂ ] ]
          = E[ E[(f̂(X)− Y )2 | f̂ , X] ]
          = E[ var(Y | X) + (f̂(X)− E[Y | X])2 ]
          = E[ var(Y | X) + E[(f̂(X)− E[Y | X])2 | X] ]
          = E[ var(Y | X) + var(f̂(X) | X) + (E[f̂(X) | X]− E[Y | X])2 ]
          = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X]− E[Y | X])2]
            (unavoidable error)   (variability of f̂)   (approximation error of f̂)

24 / 52

Decomposition of expected MSE (3)

I First term quantifies inherent unpredictability of Y (even after
seeing X)
I Second term measures the “variability” of f̂ due to the random
nature of training data.
Depends on:
I probability distribution of training data,
I type of function being fit (e.g., piecewise constant, linear),
I method of fitting (e.g., OLS),
I etc.
I Third term quantifies how well a function produced by the fitting
procedure can approximate the regression function, even after
removing the “variability” of f̂ .

25 / 52

Multivariate linear regression (1)

I For Dartmouth data, also have SAT Score for all students.
I Can we use both predictor variables (HS GPA and SAT Score) to
get an even better prediction of College GPA?
I Binning approach: instead of a 1-D grid (intervals), consider a
2-D grid (squares).
I Linear regression: a function f : R2 → R of the form

f(x) = m1x1 + m2x2 + θ

for some (m1,m2) ∈ R2 and θ ∈ R.

26 / 52

Multivariate linear regression (2)

I The general case: a (homogeneous) linear function f : Rd → R of
the form f(x) = xTw for some w ∈ Rd.
I w is called the weight vector or coefficient vector.
I What about inhomogeneous linear functions?
I Just always include a “feature” that always has value 1. Then the
corresponding weight acts like θ from before.

27 / 52

Multivariate ordinary least squares (1)

I What is the linear function with smallest MSE on
(x1, y1), . . . , (xn, yn) ∈ Rd × R?
I Find w ∈ Rd to minimize

R̂(w) := (1/n) ∑_{i=1}^n (xiTw − yi)2.

I Notation warning: xi ∈ Rd

28 / 52

Multivariate ordinary least squares (2)

I In matrix notation: R̂(w) := ‖Aw − b‖22 where

A := (1/√n) · (matrix whose i-th row is xiT) ∈ Rn×d,
b := (1/√n) · (y1, . . . , yn)T ∈ Rn.

I If we put a vector v ∈ Rd in the context of matrix multiplication,
it is treated as a column vector by default!
I If we want a row vector, we write vT.
I Therefore

Aw − b = (1/√n) · (x1Tw − y1, . . . , xnTw − yn)T.

29 / 52

Figure 7: Geometric picture of least squares linear regression

30 / 52

Multivariate normal equations (1)

I Like the one-dimensional case, optimal solutions are characterized
by a system of linear equations (the “derivatives equal zero”
conditions) called the normal equations:

∇w R̂(w) = (∂R̂(w)/∂w1, . . . , ∂R̂(w)/∂wd)T = 2AT(Aw − b) = 0,

which is equivalent to

ATAw = ATb.

31 / 52

Multivariate normal equations (2)

I If ATA is non-singular (i.e., invertible), then there is a unique
solution given by

ŵ := (ATA)−1ATb.

I If ATA is singular, then there are infinitely many solutions!
I Theorem: Every solution to the normal equations is an optimal
solution to the least squares linear regression problem.

32 / 52

Algorithm for least squares linear regression

I How to solve least squares linear regression problem?
I Just solve the normal equations, a system of d linear equations in
d unknowns.
I Time complexity (naïve) of Gaussian elimination algorithm: O(d3).
I Actually, also need to count time to form the system of equations,
which is O(nd2).

33 / 52

Classical statistics view of OLS (1)

I Normal linear regression model
I Model training examples (X1, Y1), . . . , (Xn, Yn) as iid random
variables taking values in Rd × R, where

Yi | Xi = xi ∼ N(xiTw, σ2)

I w ∈ Rd and σ2 > 0 are the parameters of the model.
I The least squares linear regression problem is the same as the
problem of finding the maximum likelihood value for w.
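
A Python sketch of the OLS computation on synthetic data, solving the normal equations ATAw = ATb and comparing against numpy's general least-squares solver; the data-generating model and names are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 3
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])              # made-up "true" weights
    y = X @ w_true + 0.1 * rng.normal(size=n)

    A = X / np.sqrt(n)                               # the 1/sqrt(n) scaling from the slides
    b = y / np.sqrt(n)
    w_hat = np.linalg.solve(A.T @ A, A.T @ b)        # normal equations (A^T A invertible here)
    w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # library least-squares solver

    print(w_hat)
    print(np.allclose(w_hat, w_lstsq))               # both solve the same problem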

34 / 52

Classical statistics view of OLS (2)

I Suppose your data really does come from a distribution in this
statistical model, say, with parameters w and σ2.
I Then the function with smallest MSE is the linear function
f⋆(x) = xTw, and its MSE is mse(f⋆) = σ2.
I So estimating w is a sensible idea! (Plug-in principle. . . )

35 / 52

Statistical learning view of OLS (1)

I IID model: (X1, Y1), . . . , (Xn, Yn), (X,Y ) ∼iid P are iid
random variables taking values in Rd × R
I (X,Y ) is the (unseen) “test” example

I Goal: find a (linear) function w ∈ Rd with small MSE

mse(w) = E[(XTw − Y )2].

I We cannot directly minimize mse(w) as a function of w ∈ Rd,
since it is an expectation (e.g., integral) with respect to the
unknown distribution P

36 / 52

Statistical learning view of OLS (2)

I However, we have an iid sample S := ((X1, Y1), . . . , (Xn, Yn)).
I We swap out P in the definition of mse(f), and replace it with
the empirical distribution on S:

Pn(x, y) := (1/n) ∑_{i=1}^n 1{(x,y)=(xi,yi)}.

I This is the distribution that puts probability mass 1/n on the
i-th training example.

I Resulting objective function is

E[(X̃Tw − Ỹ )2] = (1/n) ∑_{i=1}^n (XiTw − Yi)2

where (X̃, Ỹ ) ∼ Pn.

37 / 52

Statistical learning view of OLS (3)

I In some circles:
I (True/population) risk of w: R(w) := E[(XTw − Y )2]
I Empirical risk of w: R̂(w) := (1/n) ∑_{i=1}^n (XiTw − Yi)2

I This is another instance of the plug-in principle!
I We want to minimize mse(w) but we don’t know P , so we
replace it with our estimate Pn.

38 / 52

Statistical learning view of OLS (4)

I This is not specific to linear regression; also works for other
types of functions, and also other types of prediction problems,
including classification.

I For classification:
I (True/population) risk of f : R(f) := E[1{f(X) ≠ Y }]
I Empirical risk of f : R̂(f) := (1/n) ∑_{i=1}^n 1{f(Xi) ≠ Yi}

I All that changed is the loss function (squared loss versus
zero/one loss)
I Procedure that minimizes empirical risk:
Empirical risk minimization (ERM)
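
A small Python sketch of empirical risk with a pluggable loss, to emphasize that only the loss changes between the two settings; function names and toy inputs are illustrative.

    import numpy as np

    def empirical_risk(predictions, labels, loss):
        """Average loss on the sample (the plug-in estimate of the true risk)."""
        return np.mean([loss(p, y) for p, y in zip(predictions, labels)])

    squared_loss = lambda p, y: (p - y) ** 2        # regression
    zero_one_loss = lambda p, y: float(p != y)      # classification

    print(empirical_risk([2.4, 3.1], [2.0, 3.0], squared_loss))
    print(empirical_risk([1, 0, 1], [1, 1, 1], zero_one_loss))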

39 / 52

Upgrading linear regression (1)

I Make linear regression more powerful by being creative about
features
I We are forced to do this if x is not already provided as a vector
of numbers
I Instead of using x directly, use ϕ(x) for some transformation ϕ
(possibly vector-valued)

40 / 52

Upgrading linear regression (2)

I Examples:
I Affine feature expansion, e.g., ϕ(x) = (1, x), to accommodate an
intercept
I Standardization, e.g., ϕ(x) = (x− µ)/σ where (µ, σ2) are
(estimates of) the mean and variance of the feature value
I Non-linear scalar transformations, e.g., ϕ(x) = ln(1 + x)
I Logical formula, e.g., ϕ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7)
I Trigonometric expansion, e.g.,
ϕ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . )
I Polynomial expansion, e.g.,
ϕ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , xd−1xd)
I Headless neural network ϕ(x) = N(x) ∈ Rk, where N : Rd → Rk
is a map computed by an intermediate layer of a neural network

I (Later, we’ll talk about how to “learn” N .)
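
Python sketches of a few of the feature maps above (affine, standardization, degree-2 polynomial); the exact conventions, such as the ordering of the cross terms, are one possible choice.

    import numpy as np

    def phi_affine(x):
        # prepend a constant-1 feature to accommodate an intercept
        return np.concatenate(([1.0], np.atleast_1d(x)))

    def phi_standardize(x, mu, sigma):
        # standardize using (estimates of) the feature mean and standard deviation
        return (np.atleast_1d(x) - mu) / sigma

    def phi_poly2(x):
        # degree-2 expansion: 1, each x_j, and all products x_j * x_k for j <= k
        # (the products include the squares x_j^2)
        x = np.atleast_1d(x)
        cross = np.outer(x, x)[np.triu_indices(len(x))]
        return np.concatenate(([1.0], x, cross))

    x = np.array([2.0, 3.0])
    print(phi_affine(x))                   # [1. 2. 3.]
    print(phi_standardize(x, 2.5, 0.5))    # [-1.  1.]
    print(phi_poly2(x))                    # [1. 2. 3. 4. 6. 9.]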

41 / 52

Example: Taking advantage of linearity

I Example: y is health outcome, x is body temperature
I Physician suggests relevant feature is (square) deviation from
normal body temperature: (x− 98.6)2
I What if you didn’t know the magic constant 98.6? (Apparently it
is wrong in the US anyway)
I Use ϕ(x) = (1, x, x2)
I Can learn coefficients w ∈ R3 such that ϕ(x)Tw = (x− 98.6)2,
or any other quadratic polynomial in x (which could be better!)
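
A Python sketch of this example on made-up temperature data: OLS with ϕ(x) = (1, x, x2) approximately recovers the coefficients of (x − 98.6)2 without being told the constant 98.6.

    import numpy as np

    rng = np.random.default_rng(0)
    temps = rng.normal(98.6, 1.5, size=500)                       # made-up body temperatures
    health = (temps - 98.6) ** 2 + 0.05 * rng.normal(size=500)    # made-up outcome

    Phi = np.column_stack([np.ones_like(temps), temps, temps ** 2])   # phi(x) = (1, x, x^2)
    w_hat, *_ = np.linalg.lstsq(Phi, health, rcond=None)

    print(w_hat)    # roughly (98.6**2, -2*98.6, 1), the expansion of (x - 98.6)^2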

42 / 52

Example: Binning features

I Dartmouth data example, where we considered intervals for the
HS GPA variable:

(0.00, 0.25] , (0.25, 0.50] , (0.50, 0.75] , · · ·

I Use ϕ(x) = (1{x∈(0.00,0.25]},1{x∈(0.25,0.50]}, . . . ) with a linear
function

I What is ϕ(x)Tw?
I ϕ(x)Tw = wj if x is in the j-th interval.
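
A Python sketch of the indicator-feature encoding, checking that OLS with one-hot bin features reproduces the per-bin means; the bin edges and toy data are made up.

    import numpy as np

    def phi_bins(x, edges):
        """One-hot indicator features: which interval (edges[j-1], edges[j]] contains x."""
        j = np.digitize(x, edges, right=True)
        out = np.zeros(len(edges) + 1)    # index 0 is the catch-all bin x <= edges[0]
        out[j] = 1.0
        return out

    edges = np.arange(0.25, 4.25 + 0.25, 0.25)   # right endpoints of (0.00,0.25], (0.25,0.50], ...
    x = np.array([2.1, 2.2, 2.6, 3.4])
    y = np.array([1.9, 2.3, 2.4, 3.0])

    Phi = np.vstack([phi_bins(xi, edges) for xi in x])
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(w_hat[np.digitize(2.2, edges, right=True)])   # per-bin mean for the bin of x = 2.2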

43 / 52

Effect of feature expansion on expected MSE

E[mse(f̂)] = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X]− E[Y | X])2]
             (unavoidable error)   (variability of f̂)   (approximation error of f̂)

I Feature expansion can help reduce the third term
(approximation error)

I But maybe at the cost of increasing the second term
(variability)

44 / 52

Performance of OLS (1)

I Study in context of IID model
I (X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid, and assume E[XXT] is
invertible (WLOG).
I Let w∗ denote the minimizer of mse(w) over all w ∈ Rd.

I Inductive bias assumption: mse(w∗) is small, i.e., there is a
linear function with low MSE.

I This is a fairly “weak” modeling assumption, especially
compared to the normal regression model.

I How much larger is mse(ŵ) compared to mse(w∗)?

45 / 52

Performance of OLS (2)

I Theorem: In the IID model, the OLS solution ŵ satisfies

n (E[mse(ŵ)]− mse(w∗)) → tr(cov(εW ))

as n→∞, where W = E[XXT]−1/2X and ε = Y −XTw∗.

I Corollary: If, in addition, (X,Y ) follows the normal linear
regression model Y | X = x ∼ N(xTw∗, σ2), then

n (E[mse(ŵ)]− mse(w∗)) → σ2d,

which is more typically written as

E[mse(ŵ)] → (1 + d/n) mse(w∗).
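
A Python simulation sketch of the corollary, additionally assuming X ∼ N(0, Id) so that mse(ŵ) = σ2 + ‖ŵ − w∗‖2 has a closed form; the constants n, d, σ are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma, trials = 200, 10, 1.0, 500
    w_star = rng.normal(size=d)

    excess = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))                    # X ~ N(0, I_d), so E[XX^T] = I
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # mse(w_hat) - mse(w_star) = ||w_hat - w_star||^2 when E[XX^T] = I
        excess.append(np.sum((w_hat - w_star) ** 2))

    print(np.mean(excess))       # average excess MSE over training sets
    print(sigma ** 2 * d / n)    # the asymptotic value sigma^2 * d / n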

46 / 52

Linear algebraic view of OLS (1)

I Write A = [a1 · · · ad]
I aj ∈ Rn is the j-th column of A
I Span of a1, . . . , ad is range(A), a subspace of Rn

I Minimizing R̂(w) = ‖Aw− b‖22 over w ∈ Rd is same as finding
vector b̂ in range(A) closest to b
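
A quick numpy check of this picture on synthetic A and b: the fitted vector b̂ = Aŵ is the orthogonal projection of b onto range(A), so the residual b − b̂ is orthogonal to every column of A.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 4))
    b = rng.normal(size=50)

    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    b_hat = A @ w_hat                                        # orthogonal projection of b onto range(A)
    print(np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-8))    # residual is orthogonal to range(A)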

47 / 52

Figure 8: Orthogonal projection of b onto range(A)

48 / 52

Linear algebraic view of OLS (2)

I Solution b̂ is orthogonal projection of b onto range(A)
I b̂ is unique
I Residual b− b̂ is orthogonal to b̂
I To get w from b̂, solve Aw = b̂ for w.
I If rank(A) < d (always the case if n < d), then there are
infinitely-many ways to write b̂ as a linear combination of
a1, . . . , ad.
I Upshot: Uniqueness of least squares solution requires n ≥ d, and
n < d guarantees non-uniqueness!

49 / 52

Over-fitting (1)

I In the IID model, over-fitting is the phenomenon where the true
risk is much worse than the empirical risk.

50 / 52

Over-fitting (2)

I Example:
I ϕ(x) = (1, x, x2, . . . , xk), degree-k polynomial expansion
I Dimension is d = k + 1
I Any function of ≤ k + 1 points can be interpolated by a
polynomial of degree ≤ k
I So if n ≤ k + 1 = d, least squares solution ŵ will have zero
empirical risk, regardless of its true risk (assuming no two training
examples with the same xi have different yi’s).

Figure 9: Polynomial interpolation

51 / 52

Beyond empirical risk minimization

I Recall plug-in principle
I Want to minimize risk with respect to (unavailable) P ; use Pn
instead
I What if we can’t regard data as iid from P?
I Example: Suppose we know P = (1/2)M + (1/2)F (mixture
distribution)
I We get size n1 iid sample from M , and size n2 iid sample from F ,
with n2 ≪ n1
I How to implement plug-in principle?

52 / 52

Regression I: Linear regression