
Predictive Analytics
Week 8: Linear Methods for Regression II

Semester 2, 2018

Discipline of Business Analytics, The University of Sydney Business School

Week 8: Linear Methods for Regression II

1. Introduction

2. Principal components regression

3. Partial least squares (optional)

4. Illustration and discussion

5. Considerations in high dimensions

6. Robust regression

Reading: Chapters 6.3 and 6.4 of ISL.

Exercise questions: Chapter 6.8 of ISL, Q5 and Q6.

2/43

Introduction

Dimension reduction methods (key concept)

Dimension reduction methods consist of building M < p transformed variables which are linear combinations (projections) of the predictors. We then fit a linear regression of the response on the new variables.

3/43

Dimension reduction methods

Given the original predictors x_1, x_2, \ldots, x_p, we let z_1, z_2, \ldots, z_M represent M < p linear combinations of the original predictors, that is

z_m = \sum_{j=1}^{p} \phi_{jm} x_j,

for some constants \phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}, m = 1, \ldots, M, to be determined.

4/43

Dimension reduction methods

We consider a linear regression model for the transformed predictors

Y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \varepsilon_i,

which we fit by OLS. We therefore estimate only M + 1 < p + 1 regression parameters, reducing variance compared to OLS.

5/43

Dimension reduction methods

The model for the transformed predictors implies a model for the original predictors:

\sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right) = \sum_{j=1}^{p} \beta_j x_{ij}, \quad \text{where } \beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}.

Dimension reduction is therefore a constraint on the original linear regression model. The cost of imposing this restriction is bias.

6/43

Dimension reduction methods

• The reduction in variance compared to OLS can be substantial when M << p.

• If M = p and the z_m are linearly independent, no dimension reduction occurs and dimension reduction methods are equivalent to OLS on the original p predictors.

• Dimension reduction methods can be useful when p > N.

7/43
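To make the constraint above concrete, here is a minimal numpy sketch (not from the lecture; the data and loadings are synthetic) verifying that OLS on M linear combinations of the predictors gives the same fitted values as the original predictors with the implied coefficients \beta_j = \sum_m \theta_m \phi_{jm}.

```python
# Minimal numpy sketch (synthetic data) of the identity beta_j = sum_m theta_m * phi_jm:
# fitting OLS on M transformed variables implies a constrained coefficient
# vector on the original p predictors.
import numpy as np

rng = np.random.default_rng(0)
N, p, M = 200, 6, 2                      # hypothetical sizes with M < p

X = rng.standard_normal((N, p))          # original predictors
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

Phi = rng.standard_normal((p, M))        # some fixed loadings phi_{jm}
Z = X @ Phi                              # transformed variables z_m

# OLS of y on the M transformed variables (intercept omitted for simplicity)
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Implied coefficients on the original predictors
beta_implied = Phi @ theta

# Fitted values agree whether we use (Z, theta) or (X, beta_implied)
print(np.allclose(Z @ theta, X @ beta_implied))   # True
```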

Principal components regression

Principal components analysis (key concept)

Principal component analysis (PCA) is a popular way of deriving a low-dimensional set of features from a large set of variables. In our setting, we want to use PCA to reduce the dimension of the N × p design matrix X.

In our discussion below, we assume that we first center and
standardise all the predictors.

8/43

Principal components (key concept)

We define the first principal component of X as the linear combination

z_1 = \phi_{11} x_1 + \phi_{21} x_2 + \ldots + \phi_{p1} x_p,

such that z_1 has the largest sample variance among all linear combinations whose coefficients satisfy ||\phi_1||_2^2 = \sum_{j=1}^{p} \phi_{j1}^2 = 1.

9/43

Principal components (key concept)

The m-th principal component of X is the linear combination

z_m = \sum_{j=1}^{p} \phi_{jm} x_j,

such that z_m has the largest sample variance among all linear combinations that are orthogonal to z_1, \ldots, z_{m-1} and satisfy ||\phi_m||_2^2 = 1.

10/43

Principal components (key concept)

The first m principal components of the design matrix X provide
the best m-dimensional linear approximation to it, in the sense of
capturing variation in the predictor data.

11/43
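As a complement to the definitions above, a small numpy sketch (synthetic data, not part of the original slides) that computes the principal components from the eigendecomposition of X^T X after standardising, in line with the technical appendix at the end of the deck.

```python
# Minimal numpy sketch: principal components of a standardised design matrix X
# via the eigendecomposition of X^T X (see the technical appendix). Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 3
X = rng.standard_normal((N, p)) @ rng.standard_normal((p, p))  # correlated predictors

# Center and standardise the predictors
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of X^T X: columns of V are the loadings phi_m
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

Z = X @ V                                  # principal components z_1, ..., z_p

# Sample variance of each component equals lambda_m / N
print(np.allclose(Z.var(axis=0), eigvals / N))   # True
```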

Principal components analysis

[Figure: scatter plot of Ad Spending against Population. The two axes represent predictors. The green line indicates the first principal component and the blue dashed line shows the second principal component.]

12/43

Principal components analysis

[Figure: the left panel plots Ad Spending against Population; the right panel plots the same observations against the first and second principal components.]
13/43

Principal components analysis

[Figure: Population (left panel) and Ad Spending (right panel) plotted against the first principal component.]
14/43

Principal components analysis

[Figure: Population (left panel) and Ad Spending (right panel) plotted against the second principal component.]

15/43

Principal Components Regression

The principal components regression (PCR) method consists of running a regression of Y on the first m principal components of X. The PCR method implicitly assumes that the directions of highest variance in X are the ones most associated with the response.

16/43

Principal components regression

Algorithm 1 Principal components regression
1: Center and standardise the predictors.
2: Use PCA to obtain z_1, \ldots, z_p, the p principal components of the design matrix X.
3: for m = 1, \ldots, p do
4:   Regress the response y on z_1, \ldots, z_m (the first m principal components) by OLS and call this model M_m.
5: end for
6: Select the best model out of M_1, \ldots, M_p by cross-validation.

17/43
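A possible implementation of Algorithm 1 with scikit-learn is sketched below. The data are synthetic placeholders; the number of components is chosen by 5-fold cross-validated MSE, as in step 6.

```python
# Sketch of Algorithm 1 (PCR) with scikit-learn, using synthetic placeholder data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 8))                 # replace with your own data
y = X[:, :3] @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(150)

pcr = Pipeline([
    ("scale", StandardScaler()),      # step 1: center and standardise
    ("pca", PCA()),                   # step 2: principal components
    ("ols", LinearRegression()),      # steps 3-5: OLS on the first m components
])

# Step 6: select m = 1, ..., p by cross-validated MSE
param_grid = {"pca__n_components": list(range(1, X.shape[1] + 1))}
cv = GridSearchCV(pcr, param_grid, scoring="neg_mean_squared_error", cv=5)
cv.fit(X, y)

print(cv.best_params_)                # chosen number of components
y_hat = cv.predict(X)                 # fitted values from the selected model
```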

Principal Component Regression

• PCR can lead to substantial variance reduction compared to
OLS when a small number of components account for a large
part of the variation in the predictor data.

• Additional principal components lead to smaller bias, but larger variance.

• In PCR, the number of principal components is typically
chosen by cross-validation.

• Try to sketch the learning curve for PCR (training and validation (test) MSE versus the number of components).

• PCR does not perform variable selection.

18/43

Comparison with ridge regression

• There is a close connection between the ridge regression and
PCR methods.

• Ridge regression shrinks the coefficients of all principal components, with the least shrinkage for the first component and progressively more shrinkage for subsequent components (see the sketch after this slide).

• PCR leaves the components with largest variance alone and
discards the ones with smallest variance.

• We can therefore think of ridge regression as a continuous
version of PCR. Ridge may be preferred in most cases as it
shrinks smoothly.

19/43
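The "continuous version of PCR" idea can be made concrete. A standard result (see e.g. ESL, Section 3.4.1) is that ridge scales the coefficient along the j-th principal component direction by d_j^2/(d_j^2 + \lambda), where d_j is the j-th singular value of the centered design matrix, while PCR applies a hard 0/1 cutoff. The numpy sketch below (synthetic data; \lambda and m chosen arbitrarily) compares the two sets of shrinkage factors.

```python
# Sketch comparing ridge and PCR "shrinkage factors" along the principal
# component directions: ridge scales the j-th component by d_j^2 / (d_j^2 + lambda),
# while PCR uses a hard 0/1 cutoff. Synthetic X; lambda and m are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)                       # centered design matrix

d = np.linalg.svd(X, compute_uv=False)       # singular values d_1 >= ... >= d_p

lam = 10.0                                   # ridge penalty (arbitrary choice)
ridge_factors = d**2 / (d**2 + lam)          # smooth shrinkage, largest for d_1

m = 2                                        # number of PCR components retained
pcr_factors = (np.arange(len(d)) < m).astype(float)  # 1 for kept, 0 for dropped

print("ridge:", np.round(ridge_factors, 3))
print("pcr:  ", pcr_factors)
```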

Comparison with ridge regression

20/43

Partial least squares (optional)

Partial Least Squares

The partial least squares (PLS) method tries to identify the best linear combinations of the predictors in a supervised way, in the sense that it takes the information in y into account when constructing the new features. When constructing z_m, PLS weights the predictors by the strength of their univariate effect on y.

That contrasts with PCA, which identifies promising directions in
an unsupervised way.

21/43

Partial Least Squares

Algorithm 2 Partial Least Squares (initialisation)
1: Center and standardise the predictors.
2: Run p simple linear regressions of y on each predictor x_j and denote the associated coefficients by \phi_{1j}.
3: Compute the first direction z_1 = \sum_{j=1}^{p} \phi_{1j} x_j.
4: Run a SLR of y on z_1 and let the coefficient be \hat{\theta}_1. Call this model M_1.
5: Orthogonalise each predictor with respect to z_1: x_j^{(1)} = x_j - \left[ (z_1^T x_j)/(z_1^T z_1) \right] z_1. These are the residuals of a SLR of x_j on z_1. (continues on the next slide)

22/43

Partial Least Squares

Algorithm Partial Least Squares (continued)
1: for m = 2, \ldots, p do
2:   Run p simple linear regressions of y on each x_j^{(m-1)} and denote the associated coefficients by \phi_{mj}.
3:   Compute the new direction z_m = \sum_{j=1}^{p} \phi_{mj} x_j^{(m-1)}.
4:   Run a SLR of y on z_m and let the coefficient be \hat{\theta}_m. Call the linear regression model with response y, inputs z_1, \ldots, z_m, and estimated coefficients \hat{\theta}_1, \ldots, \hat{\theta}_m model M_m.
5:   Orthogonalise each x_j^{(m-1)} with respect to z_m: x_j^{(m)} = x_j^{(m-1)} - \left[ (z_m^T x_j^{(m-1)})/(z_m^T z_m) \right] z_m.

6: end for

7: Select the best model out of M1, . . . ,Mp by cross-validation.

23/43
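A minimal numpy sketch of the initialisation steps of Algorithm 2 (the first PLS direction) is given below, using synthetic data. Note that library implementations such as sklearn.cross_decomposition.PLSRegression follow a related but differently normalised construction.

```python
# Minimal numpy sketch of the first PLS direction from Algorithm 2 (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
N, p = 200, 4
X = rng.standard_normal((N, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(N)

# Step 1: center and standardise the predictors (and center y)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

# Step 2: p simple linear regressions of y on each x_j -> coefficients phi_1j
phi1 = X.T @ y / (X**2).sum(axis=0)

# Step 3: first direction z_1 as a weighted combination of the predictors
z1 = X @ phi1

# Step 4: SLR of y on z_1 gives theta_1 (model M_1)
theta1 = (z1 @ y) / (z1 @ z1)

# Step 5: orthogonalise each predictor with respect to z_1 (SLR residuals)
X1 = X - np.outer(z1, (z1 @ X) / (z1 @ z1))

print(np.round(np.abs(z1 @ X1).max(), 10))   # ~0: residuals are orthogonal to z_1
```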

PCR and PLS: discussion

• While PCR seeks directions with high variance, PLS seeks
directions with high variance and correlation with response.

• The variance aspect tends to dominate, such that PLS
behaves similarly to PCR and ridge regression.

• Using y reduces bias but potentially increases variance. PLS
shrinks low variance directions, but can actually inflate high
variance ones.

• In practice, PLS often does no better or slightly worse than
PCR and ridge regression.

24/43

Illustration and discussion

Illustration: predicting the equity premium

Since the cross-validation performance is nearly identical with 2
and 3 components, we select M = 2 for the results below.

25/43

Dimension reduction methods

Equity premium prediction results

Method   Train R2   Test R2
OLS      0.108      0.014
PCR      0.039      0.048
PLS      0.085      0.036

For this example, PCR has the best test performance among all
linear methods that we have discussed.

26/43

Comparison of shrinkage and selection methods

27/43

Considerations in high dimensions

Considerations in high dimensions

A high-dimensional regime occurs when the number of predictors
is larger than the number of observations (p > N). Similar issues
occur when p ≈ N .

We cannot perform least squares in this setting (recall Lecture 2). If p = N, the training R2 is always one. OLS is too flexible when p > N and will overfit the data when p ≈ N.

28/43

Example

Text analytics. In this type of analysis, the predictors are often a large number of binary variables indicating the presence of words in a document, search history, etc. This is called a bag of words model. With thousands of possible words, the number of predictors is very large in this type of analysis.

We can further extend the feature space to include n-grams,
recording the appearance of words together in a sequence.

29/43
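As an illustration (not from the original slides), the sketch below builds a small bag-of-words design matrix with unigrams and bigrams using scikit-learn's CountVectorizer; the example documents are made up.

```python
# Illustrative sketch: building a bag-of-words design matrix, optionally with
# 2-grams, using scikit-learn. The documents below are made up; in practice
# the number of columns (features) quickly exceeds the number of documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "interest rates rise",
    "rates fall as markets rally",
    "markets rally on earnings",
]

# Unigrams and bigrams; binary=True records presence/absence of each term
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(docs)           # sparse N x p matrix

print(X.shape)                               # p grows rapidly with n-grams
print(vectorizer.get_feature_names_out()[:5])
```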

Considerations in high dimensions

• We can apply variable selection, shrinkage, and dimension
reduction methods with carefully tuned hyper-parameters to
high-dimensional settings.

• However, even these methods are subject to marked
deterioration in performance as the number of irrelevant or
very weak predictors increases relative to N .

• Therefore, we cannot blindly rely on standard methods in high
dimensional regimes. We need to carefully consider
appropriate dimension reduction and penalisation schemes,
preferably based on understanding of the substantive problem.

• The next slide shows an example.
30/43

Supervised Principal Components

Algorithm 3 Supervised Principal Components
1: Center and standardise the predictors.

2: Run p separate simple linear regressions of y on each individual predictor and record the estimated coefficients.

3: for θ in 0 ≤ θ_1 < \ldots < θ_K do
4:   Form a reduced design matrix X_θ consisting only of the predictors whose SLR coefficient is higher than θ in absolute value.
5:   Use PCA to obtain z_1, \ldots, z_m, the first m principal components of X_θ.
6:   Use these principal components to predict the response.
7: end for
8: Select θ and m by cross-validation.

31/43

Robust regression

Robust regression

All the linear regression methods that we have seen so far were based on the squared error loss function, which is equivalent to assuming a Gaussian likelihood for the data. However, estimation based on the squared error loss can result in a poor fit when there are outliers.

This is because the squared error penalises deviations quadratically, so that points with larger residuals have more effect on the estimation than points with small residuals (near the regression line).

32/43

Robust regression

One way to achieve robustness to outliers is to replace the squared error loss with other losses that are less influenced by unusual observations.

Alternatively (and equivalently in some cases), we replace the Gaussian likelihood with that of a distribution with heavy tails. Such a distribution assigns higher likelihood to outliers, without having to adjust the regression fit to account for them.

33/43

Least absolute deviation

The least absolute deviation (LAD) estimator is

\hat{\beta}_{lad} = \arg\min_{\beta} \sum_{i=1}^{N} \left| y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right|.

LAD estimation is equivalent to ML based on the Laplace distribution.

In the special case when we formulate the minimisation problem

\hat{m} = \arg\min_{m} \sum_{i=1}^{n} |Y_i - m|,

the LAD estimator \hat{m} is the sample median of the response.

34/43

Huber loss

A popular method for robust regression is based on the Huber loss:

\hat{\beta}_{huber} = \arg\min_{\beta} \sum_{i=1}^{N} L_\delta\!\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right),

L_\delta(e) = \begin{cases} e^2 & \text{if } |e| \le \delta \\ 2\delta|e| - \delta^2 & \text{if } |e| > \delta \end{cases}

The Huber loss combines the good properties of the squared and absolute errors.

35/43

Loss functions

36/43

Review questions

• What are dimension reduction methods for regression?

• What is principal components analysis (PCA)?

• What is the relationship between PCR and ridge regression?

• What is the high-dimensional regime?

• Explain the purpose of robust regression.

37/43

Technical appendix (Optional)

A column vector v is an eigenvector of a square matrix A if it satisfies the equation

Av = \lambda v,

where \lambda is a scalar known as the eigenvalue associated with v. The eigenvectors of A do not change direction when multiplied by A.

A scalar \lambda is an eigenvalue of A if (A - \lambda I) is singular: \det(A - \lambda I) = 0.

38/43

Principal components analysis

The eigendecomposition of a diagonalisable p × p symmetric real square matrix A has the form

A = V \Lambda V^T,

where \Lambda is a p × p diagonal matrix whose diagonal elements \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0 are the eigenvalues of A, and V is a p × p orthogonal matrix whose columns v_j are the eigenvectors of A.

If one or more eigenvalues \lambda_j are zero, then A is singular (non-invertible).

39/43

Principal components analysis

An orthogonal matrix V is a square matrix whose columns and rows are orthonormal, i.e., V V^T = V^T V = I, such that V^{-1} = V^T.

40/43

Principal components analysis

Now, let X be the N × p matrix of centered predictors. The sample variance-covariance matrix of X is

S = (X^T X)/N,

where X^T X has an eigendecomposition denoted as V \Lambda V^T.

The eigenvalues of X^T X are all positive provided that there is no perfect multicollinearity. Eigenvalues near zero indicate the presence of multicollinearity.

41/43

Principal components analysis

The first principal component of X is z_1 = X v_1. The sample variance of the first principal component is

s^2_{z_1} = \frac{v_1^T X^T X v_1}{N} = \frac{v_1^T V \Lambda V^T v_1}{N} = \frac{\lambda_1}{N},

where v_1^T v_1 = 1 and v_1^T v_j = 0 (j ≠ 1) since V is an orthogonal matrix.

The first principal component is therefore the linear combination of the columns of X that has the largest variance among all possible normalised linear combinations.

42/43

Principal components analysis

The principal components of X are z_m = X v_m, for m = 1, \ldots, p, with decreasing sample variance

s^2_{z_m} = \frac{\lambda_m}{N},

since \lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p > 0. Since the eigenvectors are orthogonal, the principal components have sample correlation zero.

The principal component zm is the direction of largest variance
that is orthogonal to z1, . . . ,zm−1.

43/43
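To connect the robust regression slides above to software, here is a minimal sketch (synthetic data; δ chosen arbitrarily) that fits the Huber-loss regression from those slides by direct numerical minimisation with scipy and compares it with OLS. sklearn.linear_model.HuberRegressor provides a similar, differently parameterised estimator.

```python
# Minimal sketch (synthetic data): fitting the Huber-loss regression from the
# robust regression slides by direct minimisation with scipy, then comparing
# with OLS, which is pulled towards the outliers.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p = 100, 2
X = rng.standard_normal((N, p))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.standard_normal(N)
y[:5] += 15.0                                 # a few outliers

def huber_loss(e, delta=1.0):
    """L_delta(e) from the slides: e^2 inside [-delta, delta], linear outside."""
    return np.where(np.abs(e) <= delta, e**2, 2 * delta * np.abs(e) - delta**2)

def objective(b):
    residuals = y - b[0] - X @ b[1:]
    return huber_loss(residuals).sum()

beta_start = np.zeros(p + 1)                  # intercept and slopes
fit = minimize(objective, beta_start, method="Nelder-Mead")
print(np.round(fit.x, 3))                     # robust estimates of (beta_0, beta)

ols = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)[0]
print(np.round(ols, 3))                       # OLS estimates for comparison
```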
