
Linear Regression
MAST90083 Computational Statistics and Data Mining
Dr Karim Seghouane
School of Mathematics & Statistics, The University of Melbourne

Outline
§i. Introduction
§ii. Linear regression
§iii. Other Considerations
§iv. Selection and Regularization
§v. Dimension Reduction Methods
§vi. Multiple Outcome Shrinkage

Statistical Models
- What is the simplest mathematical model that describes the relationship between two variables?
- A straight line.
- Statistical models are fitted for a variety of reasons:
  - Explanation and prediction: uncover causes by studying the relationship between a variable of interest (the response) and a set of variables called the explanatory variables, and use the model for prediction.
  - Examine and test scientific hypotheses.

Linear Models
- Linear models have a long history in statistics, but even in today's computer era they remain important and widely used in supervised learning.
- They are simple and provide an interpretable picture of how the inputs affect the output.
- For prediction purposes they can sometimes outperform fancier nonlinear models, particularly with small samples, low signal-to-noise ratios or sparse data.
- We will study some of the key questions associated with the linear regression model.

Linear regression
Given a vector of input variables x = (x_1, …, x_p)⊤ ∈ ℝ^p and a response variable y ∈ ℝ,
      y ≈ f(x)
where
      f(x) = β_0 + Σ_{j=1}^p β_j x_j = β_0 + β_1 x_1 + … + β_p x_p
- The linear model assumes that the dependence of y on x_1, x_2, …, x_p is linear, or at least that it can be well approximated by a linear relationship.
- Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
- The β_j's are the unknown parameters that need to be determined.

Parameter Estimation
- We have at our disposal a set of training data (x_1, y_1), …, (x_N, y_N) from which to estimate the parameters β.
- The most popular method is least squares, where β is obtained by minimizing the residual sum of squares
      RSS(β) = Σ_{i=1}^N (y_i − f(x_i))² = Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )²
- What is the statistical interpretation?

Statistical Interpretation
- From a statistical point of view, this is the maximum likelihood estimate of β assuming
      y_i = x_i⊤ β + ε_i,   i = 1, …, N
- where ε_1, …, ε_N are independent random samples from N(0, σ²), with σ > 0 an unknown parameter, so that ε ∼ N(0, σ²I_N).
- Taking X as the N × (p + 1) matrix with each row an input vector (with a 1 in the first position) and y as the N × 1 vector of responses: E(y) = Xβ and Cov(y) = σ²I_N, so that
      y ∼ N(Xβ, σ²I_N)
  (Note 1)

Matrix Form
- The residual sum of squares takes the matrix form
      RSS(β) = (y − Xβ)⊤ (y − Xβ)
- Assuming that X has full column rank, so that X⊤X is positive definite, gives the unique solution
      β̂ = (X⊤X)^{-1} X⊤ y
- The fitted values at the training inputs are (Note 2, and see the sketch below)
      ŷ = Xβ̂ = X(X⊤X)^{-1}X⊤ y = Hy,
  where H = X(X⊤X)^{-1}X⊤ is the hat matrix.
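For illustration (not part of the original slides), a minimal numpy sketch of the closed-form least squares fit and the hat matrix; the simulated data and variable names are ours, and X is assumed to already contain a leading column of ones.

import numpy as np

# Simulate a small regression problem: N observations, p predictors plus an intercept
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1) design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X^T X)^{-1} X^T y, computed with solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y                             # fitted values, identical to X @ beta_hat

Using np.linalg.solve avoids forming the inverse explicitly, which is the numerically preferable way to evaluate the closed-form solution.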

Geometric Interpretation
- The hat matrix H is square and satisfies H² = H and H⊤ = H. (Note 3)
- H is the orthogonal projector onto V = Sp(X), the column space of X, i.e. the subspace of ℝ^N spanned by the column vectors of X.
- ŷ is the orthogonal projection of y onto Sp(X).
- The residual vector y − ŷ is orthogonal to this subspace.

Statistical Properties
- Assuming the model y ∼ N(Xβ, σ²I_N) gives
      β̂ ∼ N( β, σ² (X⊤X)^{-1} )
  or
      ŷ ∼ N( Xβ, σ² H )
- where σ² is estimated by
      σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)² = (1/(N − p − 1)) (y − ŷ)⊤(y − ŷ)
- and (N − p − 1) σ̂² ∼ σ² χ²_{N−p−1}
- Furthermore, σ̂² and β̂ are statistically independent. (Why?)

Assessing the Accuracy of the Coefficient Estimates
- Approximate confidence set for the parameter vector β:
      C_β = { β : (β̂ − β)⊤ X⊤X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)} }
- Test the null hypothesis H_0: β_j = 0 vs. H_1: β_j ≠ 0 using
      z_j = β̂_j / (σ̂ √v_j)
- where v_j is the jth diagonal element of (X⊤X)^{-1}. Under H_0, z_j is t_{N−p−1} distributed (see the sketch below).
- Testing for a group of variables, H_0: the smaller model is correct,
      F = [ (RSS_0 − RSS_1)/(p_1 − p_0) ] / [ RSS_1/(N − p_1 − 1) ] ∼ F_{p_1−p_0, N−p_1−1}
  (Note 4)
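A small sketch (not from the slides) of these tests in Python, assuming a design matrix X whose first column is the intercept; scipy's t and F distributions supply the reference quantiles.

import numpy as np
from scipy import stats

def ols_inference(X, y):
    # Standard errors, t-statistics and two-sided p-values for the coefficients
    N, k = X.shape                                  # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - k)            # estimate of sigma^2
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))     # se(beta_j) = sigma_hat * sqrt(v_j)
    z = beta_hat / se                               # t_{N-p-1} distributed under H0: beta_j = 0
    p_values = 2 * stats.t.sf(np.abs(z), df=N - k)
    return beta_hat, se, z, p_values

def f_test(X_full, X_reduced, y):
    # F statistic for H0: the smaller (reduced) model is correct
    def rss(Xc):
        b = np.linalg.lstsq(Xc, y, rcond=None)[0]
        r = y - Xc @ b
        return r @ r
    N = X_full.shape[0]
    p1, p0 = X_full.shape[1] - 1, X_reduced.shape[1] - 1
    F = ((rss(X_reduced) - rss(X_full)) / (p1 - p0)) / (rss(X_full) / (N - p1 - 1))
    return F, stats.f.sf(F, p1 - p0, N - p1 - 1)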

Gauss-Markov Theorem
- The LSE of β has the smallest variance among all linear unbiased estimates.
- Consider the estimation of θ = a⊤β; its least squares estimate is
      θ̂ = a⊤β̂ = a⊤(X⊤X)^{-1}X⊤ y = c_0⊤ y
- Then E(a⊤β̂) = a⊤β, i.e. θ̂ is unbiased, and
      Var(θ̂) ≤ Var(c⊤y)
- for any other linear unbiased estimator θ̃ = c⊤y.

Reducing the MSE
- The LSE has the smallest MSE of all linear estimators with no bias.
- A biased estimator can nevertheless achieve a smaller MSE.
- Shrinking a set of coefficients towards zero may result in such a biased estimate.
- MSE is related to the prediction accuracy of a new response y_0 = f(x_0) + ε_0 at input x_0:
      E( y_0 − f̃(x_0) )² = σ² + E( x_0⊤β̃ − f(x_0) )² = σ² + MSE( f̃(x_0) )
  (Note 5)

Assessing the Overall Accuracy of the Model
- Quantify how well the model fits the observations.
- Two quantities are used (see the sketch below). The residual standard error (RSE) measures the lack of fit:
      RSE = sqrt( RSS(β̂) / (N − p − 1) ) = sqrt( Σ_{i=1}^N (y_i − ŷ_i)² / (N − p − 1) )
- The R² statistic
      R² = (TSS − RSS) / TSS
- measures the proportion of the total variability (TSS = Σ_i (y_i − ȳ)²) removed by the model.
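A short helper (not from the slides) computing both quantities, assuming y_hat are the fitted values of a model with p predictors plus an intercept.

import numpy as np

def rse_r2(y, y_hat, p):
    N = len(y)
    RSS = np.sum((y - y_hat) ** 2)
    TSS = np.sum((y - y.mean()) ** 2)
    RSE = np.sqrt(RSS / (N - p - 1))     # residual standard error
    R2 = (TSS - RSS) / TSS               # equivalently 1 - RSS/TSS
    return RSE, R2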

Other Considerations in the Regression Model
- Correlation of the error terms
- Interactions or collinearity
- Categorical predictors and their interpretation (two or more categories)
- Non-linear effects of predictors
- Outliers and high-leverage points
- Multiple outputs

Correlation of the error terms
- Use generalized least squares:
      RSS(β) = (y − Xβ)⊤ Σ^{-1} (y − Xβ)
- This is similar to assuming
      y ∼ N(Xβ, Σ)
- It is still least squares, but in the metric defined by Σ^{-1} instead of I (see the sketch below). (Note 6)
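A minimal sketch (not from the slides) of generalized least squares by whitening; Sigma is an assumed known N × N error covariance matrix.

import numpy as np

def gls_fit(X, y, Sigma):
    # Whiten with the Cholesky factor L of Sigma (Sigma = L L^T),
    # then apply ordinary least squares to the transformed data.
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X)
    yw = np.linalg.solve(L, y)
    # Equivalent to (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)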

Interactions or collinearity
- Variables closely related to one another lead to linear dependence, or collinearity, among the columns of X.
- It is difficult to separate the individual effects of collinear variables on the response.
      Var(β̂) = σ² (X⊤X)^{-1}
- Collinearity has a considerable effect on the precision of β̂ → large variances, wide confidence intervals and low power of the tests.
- It is important to identify and address potential collinearity problems.

Detection of collinearity
- Look at the correlation matrix of the variables to detect pairs of highly correlated variables.
- For collinearity between three or more variables, compute the variance inflation factor (VIF) for each variable (see the sketch below):
      VIF(β̂_j) = 1 / (1 − R_j²)
- Geometrically, 1 − R_j² measures how close x_j is to the subspace spanned by X_{−j}, and
      Var(β̂_j) ∝ σ² / (1 − R_j²)
- Examine the eigenvalues and eigenvectors of X⊤X; the ratio λ_max/λ_min of its extreme eigenvalues equals κ(X)², the squared condition number of X. (Note 7)
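A sketch (not from the slides) of the VIF computation: each column of the predictor matrix (without the intercept) is regressed on the remaining columns.

import numpy as np

def vif(Xp):
    # Xp: N x p matrix of predictors, intercept excluded
    n, p = Xp.shape
    out = np.empty(p)
    for j in range(p):
        xj = Xp[:, j]
        Xother = np.column_stack([np.ones(n), np.delete(Xp, j, axis=1)])
        coef = np.linalg.lstsq(Xother, xj, rcond=None)[0]
        resid = xj - Xother @ coef
        R2j = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - R2j)       # VIF_j = 1 / (1 - R_j^2)
    return out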

Categorical predictors
- Also referred to as qualitative or discrete predictors (variables).
- The prediction task is called regression for quantitative outputs and classification for qualitative outputs.
- Qualitative variables are represented by numerical codes (see the sketch below), e.g.
      x_i = 1 if the ith experiment is a success, and x_i = 0 if the ith experiment is a failure
- This results in the model
      y_i = β_0 + β_1 x_i + ε_i = β_0 + β_1 + ε_i if the ith exp. is a success, and β_0 + ε_i if the ith exp. is a failure
  (Note 8)
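A small sketch (not from the slides) of this 0/1 dummy coding; the data values here are invented for illustration.

import numpy as np

status = np.array(["success", "failure", "success", "success", "failure"])
x = (status == "success").astype(float)          # 1 = success, 0 = failure
y = np.array([2.1, 0.9, 2.4, 1.8, 1.1])
X = np.column_stack([np.ones_like(x), x])        # columns: intercept, dummy
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]    # b0 = failure mean, b0 + b1 = success mean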

Non-linear effects of predictors
- The linear model assumes a linear relationship between the response and the predictors.
- The true relationship between the response and the predictors may be non-linear.
- Polynomial regression is a simple way to extend linear models to accommodate non-linear relationships.
- In this case non-linearity is obtained by considering transformed versions of the predictors (see the sketch below).
- The parameters can still be estimated using standard linear regression methods. (Note 9)
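As an illustration (not from the slides), a quadratic polynomial fit obtained by ordinary least squares on transformed predictors; the simulated data are ours.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 80)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

X_poly = np.column_stack([np.ones_like(x), x, x**2])     # basis: 1, x, x^2
beta_hat = np.linalg.lstsq(X_poly, y, rcond=None)[0]     # ordinary least squares
y_fit = X_poly @ beta_hat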

Outliers
- Different reasons can lead to outliers, for example the incorrect recording of an observation.
- The residuals, as estimates of the errors, can be used to identify outliers by inspecting them for extreme values.
- It is better to use the studentized residuals:
      E[e] = E[(I_N − H) y] = (I_N − H) δ
- If the diagonal entries of H are not close to 1 (i.e. they are small), then e reflects the presence of outliers. (Note 10)

High-leverage points
- Observations with high leverage have an unusual value for x_i.
- They are difficult to identify when there are multiple predictors.
- To quantify an observation's leverage, use the leverage statistic (see the sketch below)
      h_i = 1/N + (N − 1)^{-1} (x_i − x̄)⊤ S^{-1} (x_i − x̄)
- S is the sample covariance matrix, x_i the ith row of X and x̄ the average row.
- The leverage statistic satisfies 1/N ≤ h_i ≤ 1 and its average is (p + 1)/N.
- If h_i greatly exceeds (p + 1)/N, then we may suspect that the corresponding point has high leverage. (Note 11)
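Equivalently (not from the slides), the leverages are the diagonal entries of the hat matrix; X is assumed to be the N × (p + 1) design matrix including the intercept column.

import numpy as np

def leverages(X):
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)          # 1/N <= h_i <= 1 and sum(h_i) = p + 1

# Points whose leverage greatly exceeds the average (p + 1)/N deserve a closer look,
# e.g. h_i > 2 (p + 1)/N or h_i > 3 (p + 1)/N as rough rules of thumb.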

Multiple outputs
- Multiple outputs y_1, …, y_K need to be predicted from x_1, …, x_p, where a linear model is assumed for each output:
      y_k = β_{0k} + Σ_{j=1}^p x_j β_{jk} + ε_k = f_k(X) + ε_k
- In matrix notation Y = XB + E, where Y is N × K, X is N × (p + 1), B is (p + 1) × K (the matrix of parameters) and E is the N × K matrix of errors.
      RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (y_ik − f_k(x_i))² = tr( (Y − XB)⊤ (Y − XB) )

Multiple outputs
- The least squares estimates have exactly the same form (see the sketch below):
      B̂ = (X⊤X)^{-1} X⊤ Y
- In the case of correlated errors ε ∼ N(0, Σ), the multivariate criterion becomes
      RSS(B; Σ) = Σ_{i=1}^N (y_i − f(x_i))⊤ Σ^{-1} (y_i − f(x_i))
  (Note 11)
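A sketch (not from the slides) of the multi-output least squares fit; each column of B_hat coincides with the univariate OLS fit for the corresponding output column.

import numpy as np

rng = np.random.default_rng(2)
N, p, K = 60, 4, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # shared design matrix
B_true = rng.normal(size=(p + 1, K))
Y = X @ B_true + rng.normal(scale=0.2, size=(N, K))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (p+1) x K, same formula as the single-output case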

Why?
- The least squares estimate is often not satisfactory when a large number of potential explanatory variables is available.
- Improving prediction accuracy: the LSE often has low bias but large variance; sacrificing a little bias can reduce the variance of the predicted values and improve overall prediction accuracy.
- Interpretation: do all the predictors help to explain y? Determine a smaller subset with the strongest effects and sacrifice the small details.

Linear Model Selection and Regularization

Deciding on the Important Variables
Subset selection
- All subsets or best subsets regression (examine all potential combinations).
- Forward selection: begin with the intercept and iteratively add one variable.
- Backward selection: begin with the full model and iteratively remove one variable.
- What is best for cases where p > n?

Best Subset Selection
- Retain a subset of the predictors and eliminate the rest.
- LSE is used to obtain the coefficients of the retained variables.
- For each k ∈ {0, 1, 2, …, p}, find the subset of size k that gives the smallest residual sum of squares.
- The choice of k is obtained using a criterion and involves a trade-off between bias and variance.
- The different criteria minimize an estimate of the expected prediction error.
- Infeasible for large p. (Note 12)

Forward Selection
- Sequential addition of predictors → forward stepwise selection.
- Start with the intercept and sequentially add the predictor that most improves the fit (see the sketch below).
- Add the predictor producing the largest value of
      F = [ RSS(β̂_i) − RSS(β̂_{i+1}) ] / [ RSS(β̂_{i+1}) / (N − k − 2) ]
- Use the 90th or 95th percentile of F_{1, N−k−2} as the entry threshold F_e. (Note 14)
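A sketch (not from the slides) of one forward step: among the candidate predictors not yet in the model, pick the one with the largest entry F statistic.

import numpy as np

def rss(Xc, y):
    beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
    r = y - Xc @ beta
    return r @ r

def forward_step(X_current, candidates, y):
    # X_current: N x (k+1) design already in the model (including the intercept);
    # candidates: dict mapping a name to an N-vector not yet included.
    N, k1 = X_current.shape                  # k1 = k + 1
    rss_old = rss(X_current, y)
    best = None
    for name, xj in candidates.items():
        rss_new = rss(np.column_stack([X_current, xj]), y)
        F = (rss_old - rss_new) / (rss_new / (N - k1 - 1))   # 1 and N - k - 2 degrees of freedom
        if best is None or F > best[1]:
            best = (name, F)
    return best          # compare best[1] with the F_{1, N-k-2} entry threshold F_e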

Backward Elimination
- Start with the full model and sequentially remove predictors.
- Use F_d to choose the predictor to delete (the one with the smallest value of F).
- Stop when every predictor remaining in the model produces F > F_d.
- Can be used only when N > p.
- F_d ≃ F_e (Note 15)

Alternative: Shrinkage Methods
- Subset selection produces an interpretable model with possibly lower prediction error than the full model.
- The selection is discrete → it often exhibits high variance.
- Shrinkage methods are continuous and don't suffer as much from high variability.
- We fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
- Shrinking the coefficient estimates can significantly reduce their variance (not immediately obvious).

Ridge Regression
- Ridge regression shrinks the regression coefficients by constraining their size.
- This is the approach used in neural networks, where it is known as weight decay.
- The larger the value of λ, the greater the amount of shrinkage.
- β_0 is left out of the penalty term. (Note 16)

Ridge Regression
- Because of the penalty term, the ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant.
- It is therefore best to apply ridge regression after standardizing the predictors, using the formula
      x̃_ij = x_ij / sqrt( (1/n) Σ_{i=1}^n (x_ij − x̄_j)² )

Ridge Regression
      RSS(λ) = (y − Xβ)⊤ (y − Xβ) + λ β⊤β
      β̂^ridge = (X⊤X + λI)^{-1} X⊤ y
  (Note 17)
- The ridge solution is a linear function of y (see the sketch below).
- It avoids singularity when X⊤X is not of full rank by adding a positive constant to the diagonal of X⊤X.
- For orthogonal predictors, β̂^ridge = γ β̂ where 0 ≤ γ ≤ 1.
- The effective degrees of freedom of the ridge regression fit is
      df(λ) = tr( X (X⊤X + λI)^{-1} X⊤ )
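A minimal sketch (not from the slides) of the closed-form ridge fit on standardized predictors; the unpenalized intercept is handled separately as the mean of y.

import numpy as np

def ridge_fit(Xp, y, lam):
    # Xp: N x p predictors without the intercept column
    Xs = (Xp - Xp.mean(axis=0)) / Xp.std(axis=0)     # standardize the predictors
    yc = y - y.mean()
    p = Xs.shape[1]
    A = Xs.T @ Xs + lam * np.eye(p)                  # X^T X + lambda I
    beta = np.linalg.solve(A, Xs.T @ yc)             # ridge coefficients
    df = np.trace(Xs @ np.linalg.solve(A, Xs.T))     # effective degrees of freedom df(lambda)
    return y.mean(), beta, df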

Ridge regression – credit data example
balance ∼ age, cards, education, income, limit, rating, gender, student, status, ethnicity

Ridge regression vs. LS

Lasso
- Ridge regression disadvantage: it includes all p predictors (some of them with minor influence).
- The lasso, in contrast, selects a subset.
- The lasso coefficients β̂^L_λ minimize the quantity (written here in the usual penalized form; see the sketch below)
      Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p β_j x_ij )² + λ Σ_{j=1}^p |β_j|
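A sketch (not from the slides) of the lasso computed by cyclic coordinate descent with soft-thresholding, minimizing the objective above on centred data without an intercept; in practice a dedicated solver would be used.

import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Minimizes sum_i (y_i - x_i^T beta)^2 + lam * sum_j |beta_j|
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

The soft-thresholding step is what sets some coefficients exactly to zero, which is the variable selection property discussed next.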

The Variable Selection Property of the Lasso

Some Remarks on Lasso
- Making s sufficiently small will cause some of the coefficients to be exactly zero → continuous subset selection.
- If s = Σ_{j=1}^p |β̂_j^ls|, then the lasso estimates are the β̂_j^ls's.
- s should be chosen adaptively to minimize an estimate of the expected prediction error.

The Variable Selection Property of the Lasso

Profiles of Lasso Coefficients
- Profiles of the lasso coefficients as the tuning parameter t is varied. The coefficients are plotted versus t = s / Σ_{j=1}^p |β̂_j^ls|.

Dimension Reduction Methods
- When there is a large number of predictors, often correlated, we can select which variables (dimensions) to use.
- But why not transform the predictors (to a lower dimension) and then fit the least squares model using the transformed variables?
- We will refer to these techniques as dimension reduction methods.
- They use a small number of linear combinations z_m, m = 1, …, M, of the x_j.
- The methods differ in how the linear combinations are obtained.

Principal Components Regression
- The linear combinations z_m are the principal components z_m = X v_m, where v_m solves
      max_{∥α∥=1} Var(Xα)   subject to   v_l⊤ S α = 0,  l = 1, …, m − 1
- The constraint v_l⊤ S α = 0 ensures that z_m = X v_m is uncorrelated with all previous linear combinations z_l = X v_l, l = 1, …, m − 1.
- y is regressed on z_1, …, z_M for some M ≤ p.

Principal Components Regression
- Since the z_m are orthogonal, this regression is just a sum of univariate regressions (see the sketch below):
      ŷ^pcr = ȳ + Σ_{m=1}^M θ̂_m z_m,    θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩
- and, on the original predictor scale,
      β̂^pcr = Σ_{m=1}^M θ̂_m v_m
- If M = p, then ŷ^pcr = ŷ^LS, since the columns of Z = UD span the column space of X.
- PCR discards the p − M smallest eigenvalue components.
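A sketch (not from the slides) of PCR through the SVD of the centred predictor matrix, keeping the M leading components.

import numpy as np

def pcr_fit(Xp, y, M):
    Xc = Xp - Xp.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                              # scores z_m = X v_m
    theta = (Z.T @ yc) / np.sum(Z**2, axis=0)      # univariate fits <z_m, y> / <z_m, z_m>
    beta = Vt[:M].T @ theta                        # coefficients on the original predictors
    return y.mean(), beta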

Principal Components
- Effectively, we change the orientation of the axes.
- The 1st principal component is the (normalized) linear combination of the variables with the largest variance.
- The 2nd principal component has the largest remaining variance, subject to being uncorrelated with the first.
- And so on…
- Often we can explain most of the variation with only a few principal components.
- More details in Chapter 10.2 of the book 'An Introduction to Statistical Learning'.

Principal Components Regression
- Now we can run a regression analysis on only the first several principal components.
- We call this Principal Components Regression (PCR).
- Note that these directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.
- Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

Partial Least Squares (PLS)
- A dimension reduction method which first identifies a new set of features Z_1, …, Z_M that are linear combinations of the original features.
- It then fits a linear model via OLS using these M new features.
- Up to this point, very much like PCR.
- PLS, however, identifies these new features using the response Y (in a supervised way).
- The PLS approach attempts to find directions that help explain both the response and the predictors (see the sketch below).
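A sketch (not from the slides) of one common way to compute PLS components (PLS1 with deflation of X); it assumes X and y have already been centred.

import numpy as np

def pls1(X, y, M):
    Xk = X.copy()
    scores = []
    for _ in range(M):
        w = Xk.T @ y                                # weight vector, proportional to Cov(X_k, y)
        w /= np.linalg.norm(w)
        t = Xk @ w                                  # new PLS component (score vector)
        Xk = Xk - np.outer(t, t @ Xk) / (t @ t)     # deflate: orthogonalize X_k with respect to t
        scores.append(t)
    T = np.column_stack(scores)
    y_hat = T @ np.linalg.lstsq(T, y, rcond=None)[0]   # projection of y onto the span of the components
    return T, y_hat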

Partial Least Squares (PLS)
- Uses y to construct linear combinations of the inputs.
- The inputs are weighted by the strength of their univariate effect on y.
- Regress y on z_m → θ̂_m, and orthogonalize the inputs with respect to z_m.
- Continue the process until M < p directions are obtained.
- PLS seeks directions that have high variance and high correlation with the response:
      max_{∥α∥=1} Corr²(y, Xα) Var(Xα)   subject to   v_l⊤ S α = 0,  l = 1, …, m − 1

Partial Least Squares (PLS)
- The first component, say t_1 = Xα_1, maximizes the squared covariance with y:
      α_1 = arg max_{∥α∥=1} Cov²(y, Xα)
- Subsequent components t_2, t_3, … are chosen such that they maximize the squared covariance with y and all components are mutually orthogonal.
- Orthogonality is enforced by deflating the original variables X:
      X_i = X − P_{t_1,…,t_{i−1}} X
- P_{t_1,…,t_{i−1}} denotes the orthogonal projection onto the space spanned by t_1, …, t_{i−1}.
- ŷ = P_{t_1,…,t_M} y instead of ŷ = Xβ̂ = X(X⊤X)^{-1}X⊤ y.

Illustrating the connection
The connection between these methods can be seen through the optimisation criteria they use to define the projection directions.
- PCR extracts components that explain the variance of the predictor space:
      max_{∥α∥=1} Var(Xα)   subject to   v_l⊤ S α = 0,  l = 1, …, m − 1
- PLS extracts components that have a high covariance with the response:
      max_{∥α∥=1} Corr²(y, Xα) Var(Xα)   subject to   v_l⊤ S α = 0,  l = 1, …, m − 1
- Both methods are similar in their aim of extracting m components from the predictor space X.

Illustrating the connection
Both methods aim
- at expressing the solution in a lower dimensional subspace, β = Vz, where V is a p × m matrix with orthonormal columns.
- Using this basis for the subspace, an alternative approximate minimization problem is considered:
      min_β ∥y − Xβ∥ ≈ min_z ∥y − XVz∥
- In PCR, V is obtained directly from X.
- In PLS, V depends on y in a complicated nonlinear way.

Illustrating the connection
Considering
- the singular value decomposition X = UDV⊤, where U is N × p, D = diag(d_1, …, d_p) is p × p and V is p × p,
- the columns of U and V are orthogonal such that U⊤U = I_p and V⊤V = I_p.
- The least squares solution takes the form
      β̂ = Σ_{i=1}^p (u_i⊤ y / d_i) v_i = Σ_{i=1}^p β_i
- The other estimators are shrinkage estimators and can be expressed as
      β̂ = Σ_{i=1}^p f(d_i) β_i

Multiple Outcome Shrinkage
- When the outputs are not correlated → apply a univariate technique individually to each outcome, i.e. work with each output column individually.
- Other approaches exploit correlations between the different responses → canonical correlation analysis.
- CCA finds a sequence of linear combinations Xv_m and Yu_m such that the correlations Corr²(Yu_m, Xv_m) are maximized.
- Reduced rank regression:
      B̂^rr(m) = arg min_{rank(B)=m} Σ_{i=1}^N (y_i − B x_i)⊤ Σ^{-1} (y_i − B x_i)
  (Note 19)

For more readings
- Summaries on LMS.
- Chapters 3, 5 & 14.5 from 'The Elements of Statistical Learning'.
- Chapters 3, 6 & 10.2 from 'An Introduction to Statistical Learning'.