Linear Regression
MAST90083 Computational Statistics and Data Mining
Dr Karim Seghouane
School of Mathematics & Statistics The University of Melbourne
Outline
§i. Introduction
§ii. Linear regression
§iii. Other Considerations
§iv. Selection and Regularization
§v. Dimension Reduction Methods
§vi. Multiple Outcome Shrinkage
Statistical Models
What is the simplest mathematical model that describes the relationship between two variables?
Straight line
Statistical models are fitted for a variety of reasons:
Explanation and prediction: uncover causes by studying the relationship between a variable of interest (the response) and a set of variables called the explanatory variables, and use the model for prediction
Examine and test scientific hypotheses
Linear Models
Linear models have a long history in statistics, but even in today’s computer era they are still important and widely used in supervised learning.
They are simple and provide a picture of how the inputs affect the output
For prediction purposes they can sometimes outperform fancier nonlinear models, particularly with small samples, a low signal-to-noise ratio or sparse data
We will study some of the key questions associated with the linear regression model
Linear regression
Given a vector of input variables x = (x1,…,xp)⊤ ∈ Rp and a response variable y ∈ R
y ≈ f (x)
where
$$f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
The linear model assumes that the dependence of y on x1, x2, …, xp is linear, or at least that it can be well approximated by a linear relationship.
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
The βj′s are the unknown parameters that need to be determined.
Parameter Estimation
We have at our disposal a set of training data
(x1 , y1 ), …, (xN , yN ) from which to estimate the parameters β
The most popular method is least squares where β is obtained by minimizing the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2$$

What is the statistical interpretation?
Statistical Interpretation
From a statistical point of view, this represents the maximum likelihood estimation of β assuming
$$y_i = x_i^\top \beta + \varepsilon_i, \quad i = 1, \ldots, N$$
and ε1, …, εN are independent random samples from N(0, σ²), with σ > 0 an unknown parameter, so that ε ∼ N(0, σ²IN)
Taking X as the N × (p + 1) matrix with each row an input vector (with a 1 in the first position) and y as the N × 1 vector of responses: E(y) = Xβ and Cov(y) = σ²IN, so that
Note 1
$$y \sim \mathcal{N}(X\beta, \sigma^2 I_N)$$
Matrix Form
The residual sum of squares takes the matrix form

$$\mathrm{RSS}(\beta) = (y - X\beta)^\top (y - X\beta)$$
Assuming that X has full column rank (equivalently, X⊤X is positive definite) gives the unique solution
$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
and the fitted values at the training inputs are ( Note 2 )
$$\hat{y} = X\hat{\beta} = X (X^\top X)^{-1} X^\top y = H y$$
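As an illustration, here is a minimal NumPy sketch of these computations on simulated data (the data, dimensions and seed are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3

# Simulated training data: X has a leading column of ones for the intercept
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least squares estimate beta_hat = (X^T X)^{-1} X^T y
# (np.linalg.lstsq is the numerically stable way to solve the normal equations)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values via the projection ("hat") matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                               # same as X @ beta_hat

print(beta_hat)
print(np.allclose(y_hat, X @ beta_hat))     # True
```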
Geometric Interpretation
The hat matrix H is square and satisfies: H² = H and H⊤ = H ( Note 3 )
H is the orthogonal projector onto V = Sp(X) (column space of X or the subspace of RN spanned by the column vectors of X)
and ˆy is the orthogonal projection of y onto Sp(X)
The residual vector y − ˆy is orthogonal to this subspace
Statistical Properties
Assuming the model y ∼ N(Xβ, σ²IN) gives

$$\hat{\beta} \sim \mathcal{N}\big(\beta, \sigma^2 (X^\top X)^{-1}\big)$$

or

$$\hat{y} \sim \mathcal{N}(X\beta, \sigma^2 H)$$

where σ² is estimated by

$$\hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N - p - 1} (y - \hat{y})^\top (y - \hat{y})$$

and $(N - p - 1)\,\hat{\sigma}^2 \sim \sigma^2 \chi^2_{N-p-1}$
Furthermore σˆ2 and βˆ are statistically independent. ( Why ?)
Assessing the Accuracy of the coefficient estimates
Approximate confidence set for the parameter vector β:

$$C_\beta = \Big\{\beta \;\Big|\; (\hat{\beta} - \beta)^\top X^\top X\, (\hat{\beta} - \beta) \le \hat{\sigma}^2\, \chi^{2\,(1-\alpha)}_{p+1}\Big\}$$

Test the null hypothesis H0 : βj = 0 vs. H1 : βj ≠ 0 using

$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{v_j}}$$

and vj is the jth diagonal element of $(X^\top X)^{-1}$. Under H0, zj is distributed as $t_{N-p-1}$.

Testing for a group of variables, H0 : the smaller model is correct:

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\; N - p_1 - 1}$$
Note 4
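A hedged sketch of how these tests could be computed with NumPy/SciPy; the helper names `z_statistics` and `nested_f_test` are illustrative, not from the slides:

```python
import numpy as np
from scipy import stats

def z_statistics(X, y):
    """t-statistics z_j = beta_hat_j / (sigma_hat * sqrt(v_j)); X includes the intercept column."""
    N, q = X.shape                                # q = p + 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (N - q)          # estimate of sigma^2 on N - p - 1 df
    v = np.diag(np.linalg.inv(X.T @ X))           # v_j = [(X^T X)^{-1}]_{jj}
    z = beta_hat / np.sqrt(sigma2_hat * v)
    pvals = 2 * stats.t.sf(np.abs(z), df=N - q)   # two-sided p-values under t_{N-p-1}
    return z, pvals

def nested_f_test(X0, X1, y):
    """F test of H0: the smaller model X0 is correct, against the larger model X1."""
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    rss0, rss1 = rss(X0), rss(X1)
    df1 = X1.shape[1] - X0.shape[1]               # p1 - p0
    df2 = len(y) - X1.shape[1]                    # N - p1 - 1
    F = ((rss0 - rss1) / df1) / (rss1 / df2)
    return F, stats.f.sf(F, df1, df2)
```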
Gauss-Markov Theorem
The LSE of β has the smallest variance among all linear unbiased estimators
Consider estimating θ = a⊤β; its LSE is

$$\hat{\theta} = a^\top \hat{\beta} = a^\top (X^\top X)^{-1} X^\top y = c_0^\top y$$

Then $E\big(a^\top \hat{\beta}\big) = a^\top \beta$, i.e. it is unbiased, and

$$\mathrm{Var}\big(\hat{\theta}\big) \le \mathrm{Var}\big(c^\top y\big)$$

for any other linear unbiased estimator θ̃ = c⊤y
Reducing the MSE
The LSE has the smallest MSE of all linear estimators with no bias.
A biased estimator can achieve a smaller MSE

Shrinking a set of coefficients to zero may result in a biased estimate
MSE is related to the prediction accuracy of a new response y0 = f(x0) + ε0 at input x0

$$E\big(y_0 - \tilde{f}(x_0)\big)^2 = \sigma^2 + E\big(x_0^\top \tilde{\beta} - f(x_0)\big)^2 = \sigma^2 + \mathrm{MSE}\big(\tilde{f}(x_0)\big)$$

Note 5
Assessing the Overall Accuracy of the Model
Quantify how well the model fits the observations
Two quantities are used: the residual standard error (RSE)
which measures the lack of fit
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}(\hat{\beta})}{N - p - 1}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - p - 1}}$$

and the R² statistic

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}}$$

which measures the proportion of the variability (TSS = $\sum_i (y_i - \bar{y})^2$) explained by the model
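A small sketch of both quantities, assuming X already contains the intercept column (the helper name `fit_summary` is illustrative):

```python
import numpy as np

def fit_summary(X, y):
    """Residual standard error and R^2 for a least squares fit."""
    N, q = X.shape                       # q = p + 1 (intercept included)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    rse = np.sqrt(rss / (N - q))         # sqrt(RSS / (N - p - 1))
    r2 = (tss - rss) / tss               # equivalently 1 - RSS/TSS
    return rse, r2
```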
Other Considerations in the Regression Model
Correlation of the error terms
Interactions or collinearity
Categorical predictors and their interpretation (two or more categories).
Non-linear effects of predictors
Outliers and high-leverage points
Multiple outputs
Correlation of the error terms
Use generalized LS
$$\mathrm{RSS}(\beta) = (y - X\beta)^\top \Sigma^{-1} (y - X\beta)$$

Similar to assuming

$$y \sim \mathcal{N}(X\beta, \Sigma)$$

Still least squares, but using a different metric matrix Σ instead of I ( Note 6 )
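One possible implementation sketch: whiten the data with a Cholesky factor of Σ and then apply ordinary least squares (the function name `gls` and the whitening route are choices of this sketch, not prescribed by the slide):

```python
import numpy as np

def gls(X, y, Sigma):
    """Generalized least squares: minimize (y - X b)^T Sigma^{-1} (y - X b)."""
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    Xw = np.linalg.solve(L, X)           # whitened design  L^{-1} X
    yw = np.linalg.solve(L, y)           # whitened response L^{-1} y
    beta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta_hat
```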
Interactions or collinearity
Variables closely related to one another which leads to linear dependence or collinearity among the columns of X.
It is difficult to separate the individual effects of collinear variables on the response.
$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$$

Collinearity has a considerable effect on the precision of β̂ → large variances, wide confidence intervals and low power of the tests
It is important to identify and address potential collinearity problems
Detection of collinearity
Look at the correlation matrix of the variables to detect pairs of highly correlated variables

For collinearity among three or more variables, compute the variance inflation factor (VIF) for each variable

$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R_j^2}$$
Geometrically 1 − Rj2 measures how close xj is to the subspace spanned by X−j
$$\mathrm{Var}(\hat{\beta}_j) = \frac{\hat{\sigma}^2}{1 - R_j^2} \le \hat{\sigma}^2\, \lambda_{\max}\lambda_{\min}^{-1} = \hat{\sigma}^2\, \kappa(X)^2$$
Examine the eigenvalues and eigenvectors Note 7
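A sketch of how the VIFs could be computed by regressing each predictor on the remaining ones (the helper `vif` is illustrative; X is assumed to hold the predictors without an intercept column):

```python
import numpy as np

def vif(X):
    """Variance inflation factors VIF_j = 1 / (1 - R_j^2), one per column of X."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        xj = X[:, j]
        # regress x_j on the other predictors (with an intercept)
        Xother = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(Xother, xj, rcond=None)
        resid = xj - Xother @ coef
        r2 = 1.0 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs
```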
Categorical predictors
Also referred to as qualitative or discrete predictors or variables.
The prediction task is called regression for quantitative outputs and classification for qualitative outputs
Qualitative variables are represented by numerical codes
$$x_i = \begin{cases} 1 & \text{if the } i\text{th experiment is a success} \\ 0 & \text{if the } i\text{th experiment is a failure} \end{cases}$$
This results in the model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i & \text{if the } i\text{th exp. is a success} \\ \beta_0 + \varepsilon_i & \text{if the } i\text{th exp. is a failure} \end{cases}$$

Note 8
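A tiny illustration with made-up numbers: with this 0/1 coding, the fitted intercept estimates the mean response for failures and the intercept plus the dummy coefficient estimates the mean for successes.

```python
import numpy as np

# Binary categorical predictor coded 0/1 (the "success" indicator above)
success = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([5.1, 2.9, 4.8, 5.4, 3.2, 3.0, 5.0, 2.7])

X = np.column_stack([np.ones_like(y), success])   # intercept + dummy variable
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# b0 is the mean response for failures, b0 + b1 the mean for successes
print(b0, b1)
print(y[success == 0].mean(), y[success == 1].mean())
```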
Non-linear effects of predictors
The linear model assumes a linear relationship between the response and predictors
The true relationship between the response and the predictors may be non-linear
Polynomial regression is a simple way to extend linear models to accommodate non-linear relationships
In this case non-linearity is obtained by considering transformed versions of the predictors
The parameters can be estimated using standard linear regression methods.
Note 9
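A minimal sketch, on simulated data, of fitting a cubic polynomial by ordinary least squares applied to transformed versions of a single predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Degree-3 polynomial basis: columns 1, x, x^2, x^3
X = np.vander(x, N=4, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)          # estimated coefficients of the cubic fit
```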
Outliers
Different reasons can lead to outliers. Example: incorrect recording of an observation
The residual as an estimate of the error can be used to identify outliers by examination for extreme values
Better to use the studentized residuals; note that $E[e] = E[(I_N - H)y] = (I_N - H)\delta$

If the diagonal entries of H are not close to 1 (i.e., they are small), then e reflects the presence of outliers.
Note 10
High-leverage points
Observations with high leverage have an unusual value for xi
Difficult to identify when there are multiple predictors
To quantify an observation’s leverage use the leverage statistic
$$h_i = \frac{1}{N} + (N - 1)^{-1} (x_i - \bar{x})^\top S^{-1} (x_i - \bar{x})$$

S is the sample covariance matrix, xi the ith row of X and x̄ the average row

The leverage statistic satisfies 1/N ≤ hi ≤ 1, and the average leverage is (p + 1)/N.

If hi greatly exceeds (p + 1)/N, then we may suspect that the corresponding point has high leverage.
Note 11
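A short sketch computing the leverage statistics as the diagonal of the hat matrix; the flagging threshold of three times the average leverage is an arbitrary illustrative choice, not from the slides:

```python
import numpy as np

def leverage(X):
    """Leverage statistics h_i = diag(H), H = X (X^T X)^{-1} X^T; X includes the intercept column."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

def high_leverage(X, factor=3.0):
    """Indices of observations whose leverage greatly exceeds the average (p + 1) / N."""
    h = leverage(X)
    return np.where(h > factor * X.shape[1] / X.shape[0])[0]
```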
Multiple outputs
Multiple outputs y1, …, yK need to be predicted from x1, …, xp where a linear model is assumed for each output
$$y_k = \beta_{0k} + \sum_{j=1}^{p} x_j \beta_{jk} + \varepsilon_k = f_k(X) + \varepsilon_k$$

In matrix notation Y = XB + E, where Y is N × K, X is N × (p + 1), B is (p + 1) × K (the matrix of parameters) and E is the N × K matrix of errors.

$$\mathrm{RSS}(B) = \sum_{k=1}^{K} \sum_{i=1}^{N} \big(y_{ik} - f_k(x_i)\big)^2 = \mathrm{tr}\big[(Y - XB)^\top (Y - XB)\big]$$
Multiple outputs
The least squares estimates have exactly the same form

$$\hat{B} = (X^\top X)^{-1} X^\top Y$$
In case of correlated errors ε ∼ N (0, Σ) the multivariate
criterion becomes
Note 11
$$\mathrm{RSS}(B) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^\top \Sigma^{-1} \big(y_i - f(x_i)\big)$$
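For the uncorrelated-error case, a minimal sketch of the multi-output estimate, which is the same as fitting each output column separately:

```python
import numpy as np

def multi_output_ls(X, Y):
    """Multi-output least squares B_hat = (X^T X)^{-1} X^T Y.

    X is N x (p+1) with an intercept column, Y is N x K; the result has shape (p+1, K)
    and equals the column-by-column univariate least squares fits.
    """
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B_hat
```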
Why?

The least squares estimates are in most cases not satisfactory when a large number of potential explanatory variables are available

Improving prediction accuracy: the LSE often has low bias but large variance; sacrificing a little bias to reduce the variance of the predicted values can improve the overall prediction accuracy

Interpretation: do all the predictors help to explain y? Determine a smaller subset with the strongest effects and sacrifice the small details
Linear Model Selection and Regularization
Deciding on the Important Variables
Subset selection
All subsets or best subsets regression (examine all potential combinations!)
Forward selection – begin with intercept and iteratively add one variable.
Backward selection – begin with the full model and iteratively remove one variable.
What is best for cases where p > n?
Retain a subset of predictors and eliminate the rest
LSE is used to obtain the coefficients of the retained variables
For each k ∈ {0, 1, 2, …, p} find the subset of size k that gives the smallest residual sum of squares
The choice of k is obtained using a criterion and involves a tradeoff between bias & variance
Different criteria can be used ← each minimizes an estimate of the expected prediction error
Infeasible for large p Note 12
Forward Selection
Sequential addition of predictors → forward stepwise selection
Starts with the intercept and sequentially adds the predictor that most improves the fit
Add predictor producing the largest value of

$$F = \frac{\mathrm{RSS}(\hat{\beta}_i) - \mathrm{RSS}(\hat{\beta}_{i+1})}{\mathrm{RSS}(\hat{\beta}_{i+1}) / (N - k - 2)}$$
Use 90th or 95th percentile of F1,N−k−2 as Fe Note 14
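A possible sketch of forward stepwise selection driven by this F-to-enter rule; the stopping quantile `alpha` and the helper names are illustrative choices:

```python
import numpy as np
from scipy import stats

def _rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def forward_selection(X, y, alpha=0.10):
    """Greedy forward stepwise selection using the F-to-enter statistic.

    X holds the candidate predictors (no intercept column); a predictor is added
    while its F statistic exceeds the (1 - alpha) quantile of F_{1, N-k-2}.
    Returns the list of selected column indices.
    """
    N, p = X.shape
    design = lambda cols: np.column_stack([np.ones(N), X[:, cols]]) if cols else np.ones((N, 1))
    selected, remaining = [], list(range(p))
    while remaining:
        k = len(selected)
        rss_current = _rss(design(selected), y)
        # F-to-enter statistic for every candidate predictor
        scores = []
        for j in remaining:
            rss_new = _rss(design(selected + [j]), y)
            scores.append(((rss_current - rss_new) / (rss_new / (N - k - 2)), j))
        F_best, j_best = max(scores)
        if F_best < stats.f.ppf(1 - alpha, 1, N - k - 2):
            break
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```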
Backward Elimination
Start with the full model and sequentially remove predictors

Use Fd to choose the predictor to delete (smallest value of F)

Stop when each predictor in the model produces F > Fd

Can be used only when N > p
Fd ≃ Fe Note 15
Alternative: Shrinkage Methods
Subset selection produces an interpretable model with possibly lower prediction error than the full model
The selection is discrete → often exhibits high variance
Shrinkage methods are continuous and don’t suffer as much
from high variability
We fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
Shrinking the coefficient estimates can significantly reduce their variance (not immediately obvious).
Ridge Regression
Ridge regression shrinks the regression coefficients by constraining their size
This is the approach used in neural networks where it is known as weight decay
The larger the value of λ, the greater the amount of shrinkage
β0 is left out of the penalty term Note 16
Ridge Regression
Because of the added penalty term, the ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant!
It is best to apply ridge regression after standardizing the predictors, using the formula
$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
Ridge Regression
Note 17
$$\mathrm{RSS}(\lambda) = (y - X\beta)^\top (y - X\beta) + \lambda\, \beta^\top \beta$$

$$\hat{\beta}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

The ridge solution is a linear function of y

Avoids singularity when X⊤X is not of full rank by adding a positive constant to the diagonal of X⊤X.

For orthonormal predictors, β̂ridge = γ β̂ where 0 ≤ γ ≤ 1
The effective degrees of freedom of the ridge regression fit is
$$\mathrm{df}(\lambda) = \mathrm{tr}\big[X (X^\top X + \lambda I)^{-1} X^\top\big]$$
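A small sketch of the ridge solution and its effective degrees of freedom, assuming the predictors have already been centred and standardized so that the (unpenalized) intercept is simply the mean of y:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients (X^T X + lam I)^{-1} X^T y for centred, standardized X (no intercept column)."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return y.mean(), beta                # (unpenalized intercept, shrunk coefficients)

def ridge_df(X, lam):
    """Effective degrees of freedom df(lam) = tr[X (X^T X + lam I)^{-1} X^T]."""
    p = X.shape[1]
    return np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T))
```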
Ridge regression – credit data example
balance ∼ age, cards, education, income, limit, rating, gender, student, status, ethnicity
Ridge regression vs. LS
Lasso
Ridge regression disadvantage: it includes all p predictors (some of them with minor influence)

The lasso, in contrast, selects a subset.

The lasso coefficients, β̂λL, minimize the quantity

$$\sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
The Variable Selection Property of the Lasso
Some Remarks on Lasso
Making s sufficiently small will cause some of the coefficients to be exactly zero → continuous subset selection
If $s = \sum_{j=1}^{p} |\hat{\beta}^{\,\mathrm{ls}}_j|$, then the lasso estimates are the $\hat{\beta}^{\,\mathrm{ls}}_j$'s.
s should be adaptively chosen to minimize an estimate of expected prediction error.
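As an illustration, a minimal coordinate-descent sketch of the lasso in its penalized (λ) form rather than the constrained (s) form; the 1/(2n) scaling of the loss and the standardization assumptions are choices of this sketch:

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent on (1/(2n)) ||y - X b||^2 + lam ||b||_1.

    X is assumed centred and standardized (no intercept column) and y centred.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n            # (1/n) sum_i x_ij^2 per column
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]           # partial residual excluding x_j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
            resid -= X[:, j] * beta[j]
    return beta                                  # some entries are exactly zero
```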
The Variable Selection Property of the Lasso
Profiles of Lasso Coefficients

Profiles of the lasso coefficients as the tuning parameter t is varied. The coefficients are plotted versus $t = s / \sum_{j=1}^{p} |\hat{\beta}^{\,\mathrm{ls}}_j|$
Dimension Reduction Methods
When there is a large number of predictors, often correlated, we can select what variables (dimensions) to use.
But why not transform the predictors (to a lower dimension) and then fit the least squares model using the transformed variables?
We will refer to these techniques as dimension reduction methods.
Use a small number of linear combinations zm, m = 1, …, M, of the xj
The methods differ in how the linear combinations are obtained
Principal Components Regression
The linear combinations zm are the principal components zm = X vm, obtained from

$$\max_{\|\alpha\|=1} \mathrm{Var}(X\alpha) \quad \text{subject to} \quad v_l^\top S \alpha = 0, \;\; l = 1, \ldots, m-1$$

The constraint $v_l^\top S \alpha = 0$ ensures that zm = X vm is uncorrelated with all previous linear combinations zl = X vl, l = 1, …, m − 1

y is regressed on z1, …, zM for some M ≤ p
Principal Components Regression
Since the zm are orthogonal, this regression is just a sum of univariate regressions

$$\hat{y}^{\mathrm{pcr}} = \bar{y} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \qquad \hat{\theta}_m = \frac{\langle z_m, y\rangle}{\langle z_m, z_m\rangle}$$

$$\hat{\beta}^{\mathrm{pcr}} = \sum_{m=1}^{M} \hat{\theta}_m v_m$$

If M = p, then ŷpcr = ŷLS since the columns of Z = UD span the column space of X
PCR discards the p − M smallest eigenvalue components.
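A compact sketch of PCR via the SVD of the centred predictor matrix (the helper name `pcr` is illustrative):

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression with M components.

    X holds the predictors (no intercept); columns are centred internally and
    y is centred, so the intercept is just the mean of y.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # principal directions v_m = rows of Vt
    Z = Xc @ Vt[:M].T                                   # scores z_m = X v_m
    theta = (Z.T @ yc) / np.sum(Z ** 2, axis=0)         # theta_m = <z_m, y> / <z_m, z_m>
    beta = Vt[:M].T @ theta                             # beta_pcr = sum_m theta_m v_m
    y_hat = y.mean() + Z @ theta
    return beta, y_hat
```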
Principal Components
Effectively, we change the orientation of the axes.
The 1st principal component is that (normalized) linear
combination of the variables with the largest variance.
The 2nd principal component has the largest (remaining) variance, subject to being uncorrelated with the first.
And so on…
Many times we can explain most of the variation with only
few principal components.
More details in Chapter 10.2 of the book ‘An introduction to statistical learning’
Principal Components Regression
Now we can run a regression analysis on only several principal components.
We call it Principal Components Regression ( PCR )
Note, these directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.
Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
Partial Least Squares (PLS)
It is a dimension reduction method, which first identifies a new set of features Z1, …, ZM that are linear combinations of the original features.
Then fits a linear model via OLS using these M new features.
Up to this point, this is very much like PCR.

PLS identifies these new features using the response Y (supervised way).

The PLS approach attempts to find directions that help explain both the response and the predictors.
Partial Least Squares (PLS)
Uses y to construct linear combinations of the inputs.
The inputs are weighted by the strength of their univariate
effect on y
Regress y on zm → θ̂m, and orthogonalize with respect to zm
Continue the process until M < p directions are obtained
PLS seeks directions that have high variance and have high
correlation with the response
$$\max_{\|\alpha\|=1} \mathrm{Corr}^2(y, X\alpha)\, \mathrm{Var}(X\alpha) \quad \text{subject to} \quad v_l^\top S \alpha = 0, \;\; l = 1, \ldots, m-1$$
Partial Least Squares (PLS)
The first component, say t1 = X α1, is defined by

$$\alpha_1 = \operatorname*{arg\,max}_{\|\alpha\|=1} \mathrm{Cov}^2(y, X\alpha)$$

Subsequent components t2, t3, … are chosen such that they maximize the squared covariance with y and all components are mutually orthogonal

Orthogonality is enforced by deflating the original variables X:

$$X_i = X - P_{t_1, \ldots, t_{i-1}} X$$

$P_{t_1, \ldots, t_{i-1}}$ denotes the orthogonal projection onto the space spanned by t1, …, ti−1

$$\hat{y} = P_{t_1, \ldots, t_m}\, y \quad \text{instead of} \quad \hat{y} = X\hat{\beta} = X (X^\top X)^{-1} X^\top y$$
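A minimal sketch of single-response PLS with deflation; PLS variants differ in the exact deflation details, so this is one illustrative version rather than the canonical algorithm:

```python
import numpy as np

def pls1(X, y, M):
    """Single-response PLS: extract M components t_1, ..., t_M and return the fitted values.

    X (no intercept column) and y are centred internally.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    y_hat = np.full_like(y, y.mean(), dtype=float)
    for _ in range(M):
        w = Xc.T @ yc                               # inputs weighted by their univariate effect on y
        w /= np.linalg.norm(w)
        t = Xc @ w                                  # component t_m
        theta = (t @ yc) / (t @ t)                  # regress y on t_m
        y_hat += theta * t
        yc = yc - theta * t                         # deflate the response
        Xc = Xc - np.outer(t, t @ Xc) / (t @ t)     # orthogonalize X with respect to t_m
    return y_hat
```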
Illustrating the connection
The connection between these methods can be seen through the optimisation criterion they use to define projection directions
PCR extracts components that explain the variance of the predictor space
$$\max_{\|\alpha\|=1} \mathrm{Var}(X\alpha) \quad \text{subject to} \quad v_l^\top S \alpha = 0, \;\; l = 1, \ldots, m-1$$

PLS extracts components that have a high covariance with the response y

$$\max_{\|\alpha\|=1} \mathrm{Corr}^2(y, X\alpha)\, \mathrm{Var}(X\alpha) \quad \text{subject to} \quad v_l^\top S \alpha = 0, \;\; l = 1, \ldots, m-1$$

Both methods are similar in their aim to extract m components from the predictor space X
Illustrating the connection
Both methods aim at expressing the solution in a lower-dimensional subspace

β = V z where V is a p × m matrix with orthonormal columns

Using this basis for the subspace, an alternative approximate minimization problem is considered

$$\min_{\beta} \|y - X\beta\| \approx \min_{z} \|y - XVz\|$$
In PCR V is directly obtained from X
in PLS V depends on y in a complicated nonlinear way
Illustrating the connection
Considering
The singular value decomposition X = U D V⊤, where U is N × p, D = diag(d1, …, dp) is p × p and V is p × p

The columns of U and V are orthonormal, so that U⊤U = Ip and V⊤V = Ip

The least squares solution takes the form

$$\hat{\beta} = \sum_{i=1}^{p} \frac{u_i^\top y}{d_i}\, v_i = \sum_{i=1}^{p} \hat{\beta}_i$$

The other estimators are shrinkage estimators and can be expressed as

$$\hat{\beta} = \sum_{i=1}^{p} f(d_i)\, \hat{\beta}_i$$
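As an illustration of this shrinkage view: least squares uses f(di) = 1 for every component, ridge shrinks component i by di²/(di² + λ), and PCR keeps only the leading M components. The specific ridge and PCR factors are the standard results from ESL Chapter 3 rather than something stated explicitly on the slide.

```python
import numpy as np

def shrinkage_factors(X, lam, M):
    """Shrinkage factors f(d_i) applied to the components beta_i = (u_i^T y / d_i) v_i.

    Returns the factors for least squares, ridge with parameter lam, and PCR with M components.
    """
    d = np.linalg.svd(X, compute_uv=False)          # singular values d_1 >= ... >= d_p
    ls = np.ones_like(d)                            # least squares: no shrinkage
    ridge = d ** 2 / (d ** 2 + lam)                 # ridge: smooth shrinkage of small-d_i directions
    pcr = (np.arange(len(d)) < M).astype(float)     # PCR: hard truncation after M components
    return ls, ridge, pcr
```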
Multiple Outcome Shrinkage
When the outputs are not correlated → apply a univariate technique individually to each outcome, i.e., work with each output column individually
Other approaches exploit correlations in the different responses → canonical correlation analysis
CCA finds a sequence of linear combinations Xvm and Yum such that the correlations are maximized
Corr2 (Yum,Xvm)
Reduced rank regression

$$\hat{B}^{\mathrm{rr}}(m) = \operatorname*{arg\,min}_{\mathrm{rank}(B) = m} \sum_{i=1}^{N} (y_i - B^\top x_i)^\top \Sigma^{-1} (y_i - B^\top x_i)$$

Note 19
For further reading
Summaries on LMS.
Chapters 3, 5 & 14.5 from 'The Elements of Statistical Learning' book.
Chapters 3, 6 & 10.2 from ’An introduction to statistical learning’ book.