
Machine learning lecture slides
COMS 4771 Fall 2020

Regression I: Linear regression

Outline
- Statistical model for regression
- College GPA example
- Ordinary least squares for linear regression
- The expected mean squared error
- Different views of ordinary least squares
- Features and linearity
- Over-fitting
- Beyond empirical risk minimization

Figure 1: Galton board

Real-valued predictions
- Example: Galton board
- Physical model: hard
- Statistical model: final position of the ball is random
  - Normal (Gaussian) distribution with mean μ and variance σ², written N(μ, σ²)
  - Probability density function:
    p_{μ,σ²}(y) = (1/√(2πσ²)) · exp(−(y − μ)²/(2σ²)),   y ∈ R.
- Goal: predict the final position accurately; measure squared loss (also called squared error):
  (prediction − outcome)²
- Outcome is random, so look at the expected squared loss (also called mean squared error)

Optimal prediction for mean squared error
- Predict ŷ ∈ R; true final position is Y (random variable) with mean E(Y) = μ and variance var(Y) = E[(Y − E(Y))²] = σ².
- Squared error is (ŷ − Y)².
- Bias-variance decomposition:
  E[(ŷ − Y)²] = E[(ŷ − μ + μ − Y)²]
              = (ŷ − μ)² + 2(ŷ − μ)·E[μ − Y] + E[(μ − Y)²]
              = (ŷ − μ)² + σ².
- This is true for any random variable Y; we don't need the normality assumption.
- So the optimal prediction is ŷ = μ.
- When the parameters are unknown, we can estimate them from related data (e.g., previously observed outcomes).
- Can also do an analysis of a plug-in prediction . . .
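A quick numerical check of the decomposition above, as a sketch with made-up parameters (μ, σ, and the sample size are arbitrary choices, not from the slides): the average squared error of a prediction ŷ should be close to (ŷ − μ)² + σ², and smallest at ŷ = μ.

```python
# Sketch: verify E[(yhat - Y)^2] ≈ (yhat - mu)^2 + sigma^2 by simulation.
# The parameters mu, sigma are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
Y = rng.normal(mu, sigma, size=1_000_000)  # many draws of the final position

for yhat in [0.0, 1.0, mu, 2.0]:
    empirical = np.mean((yhat - Y) ** 2)          # Monte Carlo estimate of expected squared loss
    predicted = (yhat - mu) ** 2 + sigma ** 2     # bias-variance decomposition
    print(f"yhat={yhat:.2f}  empirical={empirical:.4f}  (yhat-mu)^2 + sigma^2 = {predicted:.4f}")
```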

Statistical model for regression
- Setting is the same as for classification, except:
  - Label is a real number, rather than {0, 1} or {1, 2, …, K}
  - Care about squared loss, rather than whether the prediction is correct
- Mean squared error of f:
  mse(f) := E[(f(X) − Y)²],
  the expected squared loss of f on a random example

Optimal prediction function for regression
- If (X, Y) is a random test example, then the optimal prediction function is
  f⋆(x) = E[Y | X = x]
- Also called the regression function or conditional mean function
- Prediction function with smallest MSE
- Depends on the conditional distribution of Y given X

Test MSE (1)
- Just like in classification, we can use test data to estimate mse(f̂) for a function f̂ that depends only on training data.
- IID model: (X1, Y1), …, (Xn, Yn), (X1′, Y1′), …, (Xm′, Ym′), (X, Y) are iid
  - Training examples (that you have): S := ((X1, Y1), …, (Xn, Yn))
  - Test examples (that you have): T := ((X1′, Y1′), …, (Xm′, Ym′))
  - Test example (that you don't have) used to define MSE: (X, Y)
- Predictor f̂ is based only on the training examples
  - Hence, the test examples are independent of f̂ (very important!)
- We would like to estimate mse(f̂)

Test MSE (2)
- Test MSE: mse(f̂, T) := (1/m) Σ_{i=1}^m (f̂(Xi′) − Yi′)²
- By the law of large numbers, mse(f̂, T) → mse(f̂) as m → ∞
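A minimal sketch of this estimate (the data-generating process and predictor below are invented placeholders, not the course's data): fit on the training part only, then average squared errors on the held-out part.

```python
# Sketch: estimate mse(f_hat) with a held-out test set.
# The data-generating process below is invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 10_000
X = rng.uniform(0, 4, size=n + m)
Y = 0.5 * X + 0.8 + rng.normal(0, 0.5, size=n + m)   # hypothetical linear-plus-noise relationship

X_train, Y_train = X[:n], Y[:n]      # S: used to fit f_hat
X_test, Y_test = X[n:], Y[n:]        # T: never touched during fitting

slope, intercept = np.polyfit(X_train, Y_train, deg=1)   # f_hat depends only on S
preds = slope * X_test + intercept
test_mse = np.mean((preds - Y_test) ** 2)                # mse(f_hat, T), an estimate of mse(f_hat)
print(f"test MSE ≈ {test_mse:.4f}")
```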

Example: College GPA
- Data: College GPA of 750 Dartmouth students
  - Mean: 2.46
  - Standard deviation: 0.746
- Assume this data is an iid sample from the population of Dartmouth students (false)
- Absent any other features, the best constant prediction of a uniformly random Dartmouth student's College GPA is ŷ := 2.46.
Figure 2: Histogram of College GPA

Predicting College GPA from HS GPA (1)
- Students represented in the data also have a High School (HS) GPA
- Maybe HS GPA is predictive of College GPA?
- Data: S := ((x1, y1), …, (xn, yn))
  - xi is the HS GPA of the i-th student
  - yi is the College GPA of the i-th student

Figure 3: Plot of College GPA vs HS GPA

Predicting College GPA from HS GPA (2)
- First attempt:
  - Define intervals of possible HS GPAs:
    (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], · · ·
  - For each such interval I, record the mean μ̂_I of the College GPAs of students whose HS GPA falls in I.
  - Resulting predictor:
    f̂(x) := μ̂_(0.00,0.25]   if x ∈ (0.00, 0.25]
             μ̂_(0.25,0.50]   if x ∈ (0.25, 0.50]
             μ̂_(0.50,0.75]   if x ∈ (0.50, 0.75]
             ⋮
- (What to do about an interval I that contains no student's HS GPA?)
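A minimal sketch of the binned-mean construction above (the GPA arrays are synthetic stand-ins, not the actual Dartmouth data); falling back to the overall mean is one possible answer to the empty-interval question, though the slides leave it open.

```python
# Sketch: piecewise-constant predictor from binned means of a feature.
# hs_gpa / college_gpa are made-up stand-ins for the Dartmouth data.
import numpy as np

rng = np.random.default_rng(2)
hs_gpa = rng.uniform(1.5, 5.0, size=750)
college_gpa = 0.5 * hs_gpa + 0.5 + rng.normal(0, 0.6, size=750)

edges = np.arange(0.0, 5.25, 0.25)                 # (0.00, 0.25], (0.25, 0.50], ...
bin_idx = np.digitize(hs_gpa, edges, right=True)

overall_mean = college_gpa.mean()
bin_means = np.full(len(edges) + 1, overall_mean)  # fallback for empty bins
for j in np.unique(bin_idx):
    bin_means[j] = college_gpa[bin_idx == j].mean()

def f_hat(x):
    """Predict College GPA as the mean of the bin containing HS GPA x."""
    return bin_means[np.digitize(x, edges, right=True)]

print(f_hat(np.array([2.1, 3.6])))
```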

Figure 4: Plot of mean College GPA vs binned HS GPA

Predicting College GPA from HS GPA (3)
- Define
  mse(f, S) := (1/|S|) Σ_{(x,y)∈S} (f(x) − y)²,
  the mean squared error of predictions made by f on examples in S.
  - "mean" is with respect to the uniform distribution on examples in S.
- For the binned predictor: mse(f̂, S) = 0.376, so √mse(f̂, S) = 0.613 < 0.746 (the standard deviation of the yi's)
- Piece-wise constant function f̂ is an improvement over the constant function (i.e., just predicting the mean 2.46 for all x)!

Predicting College GPA from HS GPA (4)
- But f̂ has some quirks.
  - E.g., those with HS GPA between 2.50 and 2.75 are predicted to have a lower College GPA than those with HS GPA between 2.25 and 2.50.
  - E.g., something unusual with the student who has an HS GPA of 4.5

Figure 5: Plot of mean College GPA vs binned HS GPA

Least squares linear regression (1)
- Suppose we'd like to only consider functions with a specific functional form, e.g., a linear function:
  f(x) = m·x + θ
  for m, θ ∈ R.
- Technically, x ↦ m·x + θ is linear iff θ = 0. If θ ≠ 0, the function is not linear but affine.
- Semantics: Positive m means a higher HS GPA gets a higher prediction of College GPA.

Least squares linear regression (2)
- What is the linear function with smallest MSE on (x1, y1), …, (xn, yn) ∈ R × R? This is the problem of least squares linear regression.
- Find (m, θ) ∈ R² to minimize
  (1/n) Σ_{i=1}^n (m·xi + θ − yi)².
- Also called ordinary least squares (OLS)

Figure 6: Plot of least squares linear regression line

Computing OLS (1)
- "Derivatives equal zero" conditions (normal equations):
  ∂/∂θ [ (1/n) Σ_{i=1}^n (m·xi + θ − yi)² ] = (2/n) Σ_{i=1}^n (m·xi + θ − yi) = 0
  ∂/∂m [ (1/n) Σ_{i=1}^n (m·xi + θ − yi)² ] = (2/n) Σ_{i=1}^n (m·xi + θ − yi)·xi = 0.
- System of two linear equations with two unknowns (m, θ).
- Define the sample averages
  mean(x) := (1/n) Σ_{i=1}^n xi,   mean(y) := (1/n) Σ_{i=1}^n yi,
  mean(x²) := (1/n) Σ_{i=1}^n xi²,   mean(xy) := (1/n) Σ_{i=1}^n xi·yi,
  so the system can be re-written as
  mean(x)·m + θ = mean(y)
  mean(x²)·m + mean(x)·θ = mean(xy).

Computing OLS (2)
- Write in matrix notation:
  [ mean(x)    1       ] [ m ]   =   [ mean(y)  ]
  [ mean(x²)   mean(x) ] [ θ ]       [ mean(xy) ].
- Solution: (m̂, θ̂) ∈ R² given by
  m̂ := (mean(xy) − mean(x)·mean(y)) / (mean(x²) − mean(x)²),   θ̂ := mean(y) − m̂·mean(x).

Computing OLS (3)
- Catch: The above solution only makes sense if mean(x²) − mean(x)² ≠ 0, i.e., the variance of the xi's is non-zero.
- If mean(x²) − mean(x)² = 0, then the matrix defining the LHS of the system of equations is singular.

Computing OLS (4)
- In general, "derivative equals zero" is only a necessary condition for a solution to be optimal; it is not necessarily a sufficient condition!
- Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.
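A quick numerical check of the closed form from the Computing OLS (2) slide above (the data is synthetic): m̂ and θ̂ computed from sample averages should agree with a generic least squares fit.

```python
# Sketch: closed-form 1-D OLS via sample averages, checked against np.polyfit.
# x, y are arbitrary synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1.5, 5.0, size=100)
y = 0.7 * x + 0.3 + rng.normal(0, 0.4, size=100)

mean_x, mean_y = x.mean(), y.mean()
mean_xx, mean_xy = (x * x).mean(), (x * y).mean()

m_hat = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x ** 2)  # requires non-zero variance of x
theta_hat = mean_y - m_hat * mean_x

slope, intercept = np.polyfit(x, y, deg=1)
print(m_hat, theta_hat)        # closed form
print(slope, intercept)        # should match (up to floating point)
```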
Decomposition of expected MSE (1)
- Two different functions of HS GPA for predicting College GPA.
- What makes them different?
- We care about the prediction of College GPA for a student we haven't seen before, based on their HS GPA.
- IID model: (X1, Y1), …, (Xn, Yn), (X, Y) are iid
- Say the training examples (X1, Y1), …, (Xn, Yn) are used to determine f̂.
- What is E[mse(f̂)]?

Decomposition of expected MSE (2)
  E[mse(f̂)]
  = E[ E[(f̂(X) − Y)² | f̂] ]
  = E[ E[(f̂(X) − Y)² | f̂, X] ]
  = E[ var(Y | X) + (f̂(X) − E[Y | X])² ]
  = E[ var(Y | X) + E[(f̂(X) − E[Y | X])² | X] ]
  = E[ var(Y | X) + var(f̂(X) | X) + (E[f̂(X) | X] − E[Y | X])² ]
  = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²],
  where the three terms are the unavoidable error, the variability of f̂, and the approximation error of f̂, respectively.

Decomposition of expected MSE (3)
- First term quantifies the inherent unpredictability of Y (even after seeing X)
- Second term measures the "variability" of f̂ due to the random nature of the training data. Depends on:
  - probability distribution of training data,
  - type of function being fit (e.g., piecewise constant, linear),
  - method of fitting (e.g., OLS),
  - etc.
- Third term quantifies how well a function produced by the fitting procedure can approximate the regression function, even after removing the "variability" of f̂.

Multivariate linear regression (1)
- For the Dartmouth data, we also have an SAT Score for all students.
- Can we use both predictor variables (HS GPA and SAT Score) to get an even better prediction of College GPA?
- Binning approach: instead of a 1-D grid (intervals), consider a 2-D grid (squares).
- Linear regression: a function f : R² → R of the form
  f(x) = m1·x1 + m2·x2 + θ
  for some (m1, m2) ∈ R² and θ ∈ R.

Multivariate linear regression (2)
- The general case: a (homogeneous) linear function f : R^d → R of the form f(x) = xᵀw for some w ∈ R^d.
- w is called the weight vector or coefficient vector.
- What about inhomogeneous linear functions?
  - Just always include a "feature" that always has value 1. Then the corresponding weight acts like θ from before.

Multivariate ordinary least squares (1)
- What is the linear function with smallest MSE on (x1, y1), …, (xn, yn) ∈ R^d × R?
- Find w ∈ R^d to minimize
  R̂(w) := (1/n) Σ_{i=1}^n (xiᵀw − yi)².
- Notation warning: xi ∈ R^d

Multivariate ordinary least squares (2)
- In matrix notation: R̂(w) = ∥Aw − b∥², where
  A := (1/√n) · (matrix with rows x1ᵀ, …, xnᵀ) ∈ R^{n×d},   b := (1/√n) · (y1, …, yn) ∈ R^n.
- If we put a vector v ∈ R^d in the context of matrix multiplication, it is treated as a column vector by default!
- If we want a row vector, we write vᵀ.
- Therefore
  Aw − b = (1/√n) · (x1ᵀw − y1, …, xnᵀw − yn).

Figure 7: Geometric picture of least squares linear regression

Multivariate normal equations (1)
- Like the one-dimensional case, optimal solutions are characterized by a system of linear equations (the "derivatives equal zero" conditions), called the normal equations:
  ∇_w R̂(w) = (∂R̂(w)/∂w1, …, ∂R̂(w)/∂wd) = 2Aᵀ(Aw − b) = 0,
  which is equivalent to
  AᵀAw = Aᵀb.

Multivariate normal equations (2)
- If AᵀA is non-singular (i.e., invertible), then there is a unique solution, given by
  ŵ := (AᵀA)⁻¹Aᵀb.
- If AᵀA is singular, then there are infinitely many solutions!
- Theorem: Every solution to the normal equations is an optimal solution to the least squares linear regression problem.

Algorithm for least squares linear regression
- How to solve the least squares linear regression problem?
- Just solve the normal equations, a system of d linear equations in d unknowns.
- Time complexity (naïve) of the Gaussian elimination algorithm: O(d³).
- Actually, we also need to count the time to form the system of equations, which is O(nd²).
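A minimal sketch of the algorithm just described (synthetic data; this is an illustration, not the course's code): form the normal equations AᵀAw = Aᵀb and solve the d × d system, falling back to a generic least squares solver when AᵀA is singular.

```python
# Sketch: multivariate OLS by solving the normal equations A^T A w = A^T b.
# The data matrix below is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 4
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -0.5, 2.0, 0.0]) + rng.normal(0, 0.1, size=n)

A = X / np.sqrt(n)            # the 1/sqrt(n) scaling from the slides
b = y / np.sqrt(n)

gram = A.T @ A                # forming A^T A costs O(n d^2)
rhs = A.T @ b
if np.linalg.matrix_rank(gram) == d:
    w_hat = np.linalg.solve(gram, rhs)                 # unique solution; O(d^3) to solve
else:
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)      # singular case: one of infinitely many minimizers
print(w_hat)
```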
Classical statistics view of OLS (1)
- Normal linear regression model:
  - Model the training examples (X1, Y1), …, (Xn, Yn) as iid random variables taking values in R^d × R, where
    Yi | Xi = xi ∼ N(xiᵀw, σ²)
  - w ∈ R^d and σ² > 0 are the parameters of the model.
- The least squares linear regression problem is the same as the problem of finding the maximum likelihood value for w.
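A sketch of the standard argument behind this equivalence (not spelled out on the slide): under the normal linear regression model, the negative log-likelihood of w is, up to terms that do not depend on w, a positive multiple of the least squares objective.

```latex
% Negative log-likelihood under Y_i | X_i = x_i ~ N(x_i^T w, sigma^2):
-\log L(w, \sigma^2)
  = \frac{n}{2}\log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(x_i^{\mathsf{T}} w - y_i\right)^2 .
% For any fixed sigma^2 > 0, minimizing this over w is the same as minimizing
% \frac{1}{n}\sum_{i=1}^{n} (x_i^{\mathsf{T}} w - y_i)^2, i.e., the OLS objective.
```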

Classical statistics view of OLS (2)
- Suppose your data really does come from a distribution in this statistical model, say, with parameters w and σ².
- Then the function with smallest MSE is the linear function f⋆(x) = xᵀw, and its MSE is mse(f⋆) = σ².
- So estimating w is a sensible idea! (Plug-in principle . . . )

Statistical learning view of OLS (1)
- IID model: (X1, Y1), …, (Xn, Yn), (X, Y) are iid random variables taking values in R^d × R, with common distribution P
- (X, Y) is the (unseen) "test" example
- Goal: find a (linear) function w ∈ R^d with small MSE
  mse(w) = E[(Xᵀw − Y)²].
- We cannot directly minimize mse(w) as a function of w ∈ R^d, since it is an expectation (e.g., an integral) with respect to the unknown distribution P

Statistical learning view of OLS (2)
- However, we have an iid sample S := ((X1, Y1), …, (Xn, Yn)).
- We swap out P in the definition of mse(f), and replace it with the empirical distribution on S:
  P_n(x, y) := (1/n) Σ_{i=1}^n 1{(x, y) = (xi, yi)}.
- This is the distribution that puts probability mass 1/n on the i-th training example.
- Resulting objective function:
  Ẽ[(X̃ᵀw − Ỹ)²] = (1/n) Σ_{i=1}^n (Xiᵀw − Yi)²,
  where (X̃, Ỹ) ∼ P_n.

Statistical learning view of OLS (3)
- In some circles:
  - (True/population) risk of w: R(w) := E[(Xᵀw − Y)²]
  - Empirical risk of w: R̂(w) := (1/n) Σ_{i=1}^n (Xiᵀw − Yi)²
- This is another instance of the plug-in principle!
- We want to minimize mse(w), but we don't know P, so we replace it with our estimate P_n.

Statistical learning view of OLS (4)
- This is not specific to linear regression; it also works for other types of functions, and for other types of prediction problems, including classification.
- For classification:
  - (True/population) risk of f: R(f) := E[1{f(X) ≠ Y}]
  - Empirical risk of f: R̂(f) := (1/n) Σ_{i=1}^n 1{f(Xi) ≠ Yi}
- All that changed is the loss function (squared loss versus zero/one loss)
- Procedure that minimizes the empirical risk: empirical risk minimization (ERM)
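A small sketch contrasting the two empirical risks (the data and the candidate predictor are invented for illustration):

```python
# Sketch: empirical risk = average loss over the training sample,
# with squared loss for regression and zero-one loss for classification.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
w = np.array([1.0, -2.0, 0.5])          # some candidate weight vector

# Regression: real-valued labels, squared loss
y_reg = X @ w + rng.normal(0, 0.3, size=100)
emp_risk_sq = np.mean((X @ w - y_reg) ** 2)

# Classification: binary labels, zero-one loss for the classifier sign(x^T w)
y_cls = (X @ w + rng.normal(0, 0.3, size=100) > 0).astype(int)
preds = (X @ w > 0).astype(int)
emp_risk_01 = np.mean(preds != y_cls)

print(f"empirical risk (squared loss): {emp_risk_sq:.4f}")
print(f"empirical risk (zero-one loss): {emp_risk_01:.4f}")
```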

Upgrading linear regression (1)
- Make linear regression more powerful by being creative about features
- We are forced to do this if x is not already provided as a vector of numbers
- Instead of using x directly, use φ(x) for some transformation φ (possibly vector-valued)

Upgrading linear regression (2)
- Examples:
  - Affine feature expansion, e.g., φ(x) = (1, x), to accommodate an intercept
  - Standardization, e.g., φ(x) = (x − μ)/σ, where (μ, σ²) are (estimates of) the mean and variance of the feature value
  - Non-linear scalar transformations, e.g., φ(x) = ln(1 + x)
  - Logical formula, e.g., φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7)
  - Trigonometric expansion, e.g., φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), …)
  - Polynomial expansion, e.g., φ(x) = (1, x1, …, xd, x1², …, xd², x1·x2, …, x(d−1)·xd)
  - Headless neural network: φ(x) = N(x) ∈ R^k, where N : R^d → R^k is a map computed by an intermediate layer of a neural network
    - (Later, we'll talk about how to "learn" N.)
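A small sketch of a feature map in this spirit (the particular mix of expansions, and the synthetic data, are arbitrary choices for illustration):

```python
# Sketch: a feature map phi combining a few of the expansions listed above,
# followed by OLS on the transformed features.
import numpy as np

def phi(x):
    """Map a scalar x to (1, x, x^2, ln(1+x), sin(x), cos(x))."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2, np.log1p(x), np.sin(x), np.cos(x)], axis=-1)

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, size=300)
y = 2.0 * np.sin(x) + 0.5 * x + rng.normal(0, 0.2, size=300)   # invented target

Phi = phi(x)                                     # n x 6 design matrix
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # linear in w, non-linear in x
print(w_hat)
print(phi(1.0) @ w_hat)                          # prediction at x = 1.0
```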

Example: Taking advantage of linearity
- Example: y is a health outcome, x is body temperature
- Physician suggests the relevant feature is the (square) deviation from normal body temperature, (x − 98.6)²
- What if you didn't know the magic constant 98.6? (Apparently it is wrong in the US anyway)
- Use φ(x) = (1, x, x²)
- Can learn coefficients w ∈ R³ such that φ(x)ᵀw = (x − 98.6)², or any other quadratic polynomial in x (which could be better!)
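A brief sketch of this point on simulated temperatures (the data-generating constants are invented): fitting φ(x) = (1, x, x²) by least squares recovers a quadratic whose minimizer sits near the "unknown" normal temperature.

```python
# Sketch: learn a quadratic in x without knowing the constant 98.6 in advance.
# The simulated outcome uses (x - 98.6)^2 plus noise, purely for illustration.
import numpy as np

rng = np.random.default_rng(7)
temps = rng.uniform(96.0, 102.0, size=400)
health = (temps - 98.6) ** 2 + rng.normal(0, 0.2, size=400)

Phi = np.column_stack([np.ones_like(temps), temps, temps ** 2])  # phi(x) = (1, x, x^2)
w0, w1, w2 = np.linalg.lstsq(Phi, health, rcond=None)[0]

vertex = -w1 / (2 * w2)   # minimizer of the fitted quadratic w0 + w1*x + w2*x^2
print(f"recovered 'normal' temperature ≈ {vertex:.2f}")   # should be close to 98.6
```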

Example: Binning features
- Dartmouth data example, where we considered intervals for the HS GPA variable:
  (0.00, 0.25], (0.25, 0.50], (0.50, 0.75], · · ·
- Use φ(x) = (1{x ∈ (0.00, 0.25]}, 1{x ∈ (0.25, 0.50]}, …) with a linear function
- What is φ(x)ᵀw?
  - φ(x)ᵀw = wj if x is in the j-th interval.
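A short sketch connecting this feature map back to the binned-mean predictor from earlier (synthetic data again): with one-hot interval indicators, the OLS weight for a non-empty bin equals that bin's mean label.

```python
# Sketch: one-hot bin-indicator features; the OLS weight w_j recovers the j-th bin mean.
import numpy as np

rng = np.random.default_rng(8)
hs_gpa = rng.uniform(1.5, 5.0, size=750)
college_gpa = 0.5 * hs_gpa + 0.5 + rng.normal(0, 0.6, size=750)

edges = np.arange(0.0, 5.25, 0.25)
bin_idx = np.digitize(hs_gpa, edges, right=True)          # which interval each x falls in
Phi = np.eye(len(edges) + 1)[bin_idx]                      # phi(x): one-hot indicator features

w_hat, *_ = np.linalg.lstsq(Phi, college_gpa, rcond=None)
j = bin_idx[0]
print(w_hat[j], college_gpa[bin_idx == j].mean())          # same value for non-empty bins
```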

Effect of feature expansion on expected MSE
  E[mse(f̂)] = E[var(Y | X)] + E[var(f̂(X) | X)] + E[(E[f̂(X) | X] − E[Y | X])²],
  where the three terms are the unavoidable error, the variability of f̂, and the approximation error of f̂, respectively.
- Feature expansion can help reduce the third term (approximation error)
- But maybe at the cost of increasing the second term (variability)

Performance of OLS (1)
- Study in the context of the IID model
  - (X1, Y1), …, (Xn, Yn), (X, Y) are iid, and assume E[XXᵀ] is invertible (WLOG).
- Let w∗ denote the minimizer of mse(w) over all w ∈ R^d.
- Inductive bias assumption: mse(w∗) is small, i.e., there is a linear function with low MSE.
  - This is a fairly "weak" modeling assumption, especially compared to the normal regression model.
- How much larger is mse(ŵ) compared to mse(w∗)?

Performance of OLS (2)
- Theorem: In the IID model, the OLS solution ŵ satisfies
  n·(E[mse(ŵ)] − mse(w∗)) → tr(cov(εW))
  as n → ∞, where W = E[XXᵀ]^{−1/2}·X and ε = Y − Xᵀw∗.
- Corollary: If, in addition, (X, Y) follows the normal linear regression model Y | X = x ∼ N(xᵀw∗, σ²), then
  n·(E[mse(ŵ)] − mse(w∗)) → σ²·d,
  which is more typically written as
  E[mse(ŵ)] → (1 + d/n)·mse(w∗).
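A rough simulation consistent with the corollary (the distribution of X, w∗, and σ are all arbitrary choices): averaging mse(ŵ) over many training sets of size n should land near (1 + d/n)·σ², since mse(w∗) = σ² here.

```python
# Sketch: Monte Carlo check that E[mse(w_hat)] ≈ (1 + d/n) * sigma^2
# in the normal linear regression model. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(9)
d, n, sigma = 5, 100, 1.0
w_star = rng.normal(size=d)

mses = []
for _ in range(2000):
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(0, sigma, size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # mse(w_hat) = E[(X^T w_hat - Y)^2] = sigma^2 + ||w_hat - w_star||^2 when X ~ N(0, I)
    mses.append(sigma**2 + np.sum((w_hat - w_star) ** 2))

print(np.mean(mses), (1 + d / n) * sigma**2)   # the two numbers should be close
```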

Linear algebraic view of OLS (1)
- Write A = [a1 · · · ad], i.e., aj ∈ R^n is the j-th column of A
- Span of a1, …, ad is range(A), a subspace of R^n
- Minimizing R̂(w) = ∥Aw − b∥² over w ∈ R^d is the same as finding the vector b̂ in range(A) closest to b

Figure 8: Orthogonal projection of b onto range(A)
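A small numerical sketch of this picture (random A and b, assumed to have full column rank): b̂ = A(AᵀA)⁻¹Aᵀb lies in range(A), and the residual b − b̂ is orthogonal to every column of A.

```python
# Sketch: b_hat is the orthogonal projection of b onto range(A);
# the residual b - b_hat is orthogonal to the columns of A.
import numpy as np

rng = np.random.default_rng(10)
n, d = 8, 3
A = rng.normal(size=(n, d))          # assumed full column rank
b = rng.normal(size=n)

b_hat = A @ np.linalg.solve(A.T @ A, A.T @ b)   # projection of b onto range(A)
residual = b - b_hat

print(np.round(A.T @ residual, 10))  # ≈ 0: residual is orthogonal to range(A), hence to b_hat
print(residual @ b_hat)              # ≈ 0
```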

Linear algebraic view of OLS (2)
- Solution b̂ is the orthogonal projection of b onto range(A)
- b̂ is unique
- Residual b − b̂ is orthogonal to b̂
- To get w from b̂, solve Aw = b̂ for w.
- If rank(A) < d (always the case if n < d), then there are infinitely many ways to write b̂ as a linear combination of a1, …, ad.
- Upshot: Uniqueness of the least squares solution requires n ≥ d, and n < d guarantees non-uniqueness!

Over-fitting (1)
- In the IID model, over-fitting is the phenomenon where the true risk is much worse than the empirical risk.

Over-fitting (2)
- Example:
  - φ(x) = (1, x, x², …, xᵏ), degree-k polynomial expansion
  - Dimension is d = k + 1
  - Any function of ≤ k + 1 points can be interpolated by a polynomial of degree ≤ k
  - So if n ≤ k + 1 = d, the least squares solution ŵ will have zero empirical risk, regardless of its true risk (assuming no two training examples with the same xi have different yi's).

Figure 9: Polynomial interpolation

Beyond empirical risk minimization
- Recall the plug-in principle
  - Want to minimize risk with respect to (unavailable) P; use P_n instead
- What if we can't regard the data as iid from P?
- Example: Suppose we know P = (1/2)M + (1/2)F (mixture distribution)
  - We get a size-n1 iid sample from M, and a size-n2 iid sample from F, with n2 ≪ n1
- How to implement the plug-in principle?
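A minimal illustration of the Over-fitting (2) example above (the underlying function and noise level are invented): with n = k + 1 training points, the degree-k least squares fit interpolates the data, so the empirical risk is essentially zero while the risk on fresh data is large.

```python
# Sketch: degree-k polynomial expansion with n = k + 1 training points
# interpolates the training data (zero empirical risk) but has large true risk.
import numpy as np

rng = np.random.default_rng(11)
k = 9
n = k + 1                                     # n <= k + 1 = d
x_train = np.sort(rng.uniform(0, 1, size=n))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, size=n)   # invented target

Phi = np.vander(x_train, k + 1)               # phi(x) = (x^k, ..., x, 1)
w_hat = np.linalg.lstsq(Phi, y_train, rcond=None)[0]

x_test = rng.uniform(0, 1, size=10_000)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, size=10_000)

train_risk = np.mean((np.polyval(w_hat, x_train) - y_train) ** 2)
test_risk = np.mean((np.polyval(w_hat, x_test) - y_test) ** 2)
print(f"empirical risk ≈ {train_risk:.2e}, estimated true risk ≈ {test_risk:.2e}")
```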