Section 3 notes
STAT318/462 — Data Mining
Dr Gábor Erdélyi
University of Canterbury, Christchurch
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
This section provides a brief introduction to linear regression. Linear regression is a fundamental statistical learning method (and it is the basis of many methods) and I expect that some of you will have studied it before. However, there are a number of students in this class that have not covered linear regression. The purpose of this section is to introduce/refresh linear regression, rather than providing a full treatment of the subject (STAT202, STAT315 and STAT448 cover linear regression in more detail). I encourage students that have not studied linear regression before to carefully read chapter 3 of the course textbook, including the sections that are not covered in these lecture notes. It is important to have a basic understanding of linear regression to fully appreciate the material covered later in the course.

Linear regression
Linear regression is a simple parametric approach to supervised learning that assumes there is an approximately linear relationship between the predictors $X_1, X_2, \ldots, X_p$ and the response $Y$.
Although true regression functions are never linear, linear regression is an extremely useful and widely used method.
Although linear models are simple, in some cases they can perform better than more sophisticated non-linear models. They can be particularly useful when the number of training observations is relatively small, when the signal-to-noise ratio is low (the $\epsilon$ term is relatively large) or when the training data sparsely populate the predictor space.

Linear regression: advertising data
[Figure: scatterplots of Sales against the TV, Radio and Newspaper advertising budgets.]

Simple linear regression
In simple (one predictor) linear regression, we assume a model
$$Y = \beta_0 + \beta_1 X + \epsilon,$$
where $\beta_0$ and $\beta_1$ are two unknown parameters and $\epsilon$ is an error term with $E(\epsilon) = 0$.
Given some parameter estimates $\hat\beta_0$ and $\hat\beta_1$, the prediction of $Y$ at $X = x$ is given by
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x.$$
The population regression line is
$$E(Y \mid X = x) = E(\beta_0 + \beta_1 x + \epsilon) = \beta_0 + \beta_1 x,$$
where $E(\epsilon) = 0$ by assumption. The parameters $\beta_0$ (intercept) and $\beta_1$ (slope) are called the regression coefficients. The line
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$
is the estimated regression line, which is an approximation to the population regression line based on observed training data. To derive statistical properties of the estimators $(\hat\beta_0, \hat\beta_1)$, further model assumptions are required (other than a linear relationship and $E(\epsilon) = 0$). Slide 10 requires the errors ($\epsilon$) to be uncorrelated with constant variance $\sigma^2$. Slide 11 requires the errors to be independent and identically distributed normal random variables with mean 0 and variance $\sigma^2$ (in statistical notation: $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$). These additional assumptions are only required for these specific statistical properties and not to fit the model. For example, you do not require the normality assumption to fit a useful linear model.
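As a small illustration of that last point, here is a minimal sketch in R (base R only, simulated data with illustrative coefficient values chosen for this example) showing that lm fits a sensible least squares line even when the errors are uniform rather than normal:

set.seed(1)
n   <- 100
x   <- runif(n, 0, 10)
eps <- runif(n, -2, 2)        # non-normal errors, but E(eps) = 0
y   <- 1 + 0.5 * x + eps      # illustrative true values: beta0 = 1, beta1 = 0.5
fit <- lm(y ~ x)
coef(fit)                     # estimates close to 1 and 0.5

Normality only becomes relevant for the exact confidence intervals and tests discussed later (slide 11), not for fitting the line itself.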

Estimating the parameters: least squares approach
Let $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction of $Y$ at $X = x_i$, where $x_i$ is the predictor value at the $i$th training observation. Then, the $i$th residual is defined as
$$e_i = y_i - \hat{y}_i,$$
where $y_i$ is the response value at the $i$th training observation.
The least squares approach chooses $\hat\beta_0$ and $\hat\beta_1$ to minimize the residual sum of squares (RSS)
$$\mathrm{RSS} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
We want the regression line to be as close to the data points as possible. A popular approach is the method of least squares:
$$\min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
This quadratic function is relatively easy to minimize (by taking the partial derivatives and setting them equal to zero) and the solution is given on slide 8.
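To see that a numerical minimiser of the RSS agrees with lm, here is a minimal sketch on simulated data (the data and starting values are illustrative, not from the slides):

set.seed(2)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)   # RSS as a function of (beta0, beta1)
optim(c(0, 0), rss)$par   # numerical minimiser of the RSS
coef(lm(y ~ x))           # least squares solution from lm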

Advertising example
$$\hat{y} = 7.03 + 0.0475x$$
[Figure: Sales versus TV advertising budget with the least squares fit.]
The least squares solution to regressing Sales on TV (using TV to predict Sales) is
$$\mathrm{sales} = 7.03 + 0.0475 \times \mathrm{TV},$$
which was computed using the lm function in R.
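A sketch of this fit in R, assuming the Advertising data from the textbook's website has been saved locally as Advertising.csv with columns named TV and sales (the file and column names are assumptions here):

Advertising <- read.csv("Advertising.csv")   # assumed local copy of the Advertising data
fit <- lm(sales ~ TV, data = Advertising)    # regress sales on TV
coef(fit)                                    # roughly 7.03 and 0.0475
summary(fit)                                 # coefficients, standard errors, t-tests, RSE, R-squared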


Advertising example
[Figure: contour plot of the RSS on the advertising data, using TV as the predictor, over a grid of $(\beta_0, \beta_1)$ values.]
The RSS function is quadratic (bowl shaped) and hence has a unique minimizer (shown by the red dot).
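A minimal sketch of how such a contour plot can be produced in base R (simulated data standing in for the advertising data, with an illustrative grid of parameter values):

set.seed(3)
x <- rnorm(100)
y <- 7 + 0.05 * x + rnorm(100)
b0 <- seq(5, 9, length.out = 60)
b1 <- seq(-0.5, 0.6, length.out = 60)
RSS <- outer(b0, b1, Vectorize(function(a, b) sum((y - a - b * x)^2)))   # RSS on the grid
contour(b0, b1, RSS, xlab = expression(beta[0]), ylab = expression(beta[1]))
fit <- coef(lm(y ~ x))
points(fit[1], fit[2], col = "red", pch = 19)   # the unique minimiser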

Estimating the parameters: least squares approach
Using some calculus, we can show that
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n (x_i - \bar{x})\, y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
and
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.
There are two important consequences of the least squares fit. Firstly, the residuals sum to zero:
$$\sum_{i=1}^n e_i = \sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = \sum_{i=1}^n (y_i - \bar{y} + \hat\beta_1 \bar{x} - \hat\beta_1 x_i) = n\bar{y} - n\bar{y} + n\hat\beta_1 \bar{x} - n\hat\beta_1 \bar{x} = 0.$$
Secondly, the regression line passes through the centre of mass $(\bar{x}, \bar{y})$. The predicted response at $X = \bar{x}$ is
$$\hat{y} = \hat\beta_0 + \hat\beta_1 \bar{x} = \bar{y} - \hat\beta_1 \bar{x} + \hat\beta_1 \bar{x} = \bar{y} \qquad ((\bar{x}, \bar{y}) \text{ is on the regression line}).$$
It is also relatively easy to show that $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators. That is, $E(\hat\beta_0 \mid X) = \beta_0$ and $E(\hat\beta_1 \mid X) = \beta_1$ (we will not prove this result in this course).
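A minimal sketch verifying these formulas and the two consequences above on simulated data (illustrative values only):

set.seed(4)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope formula
b0 <- mean(y) - b1 * mean(x)                                      # intercept formula
c(b0, b1)                     # agrees with coef(lm(y ~ x))
e <- y - (b0 + b1 * x)
sum(e)                        # residuals sum to zero (up to rounding error)
b0 + b1 * mean(x) - mean(y)   # zero: the line passes through (x-bar, y-bar)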

Assessing the accuracy of the parameter estimates
[Figure: simulated data and least squares fits. The true model (red) is $Y = 2 + 3X + \epsilon$, where $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$.]
The true regression function is linear, so we would expect the simple linear regression model to perform well. Ten least squares fits using different training data are shown in the right panel. Observations:
• All ten fits have slightly different slopes.
• All ten fits pivot around the mean $(\bar{x}, \bar{y}) = (0, 2)$.
• The population regression line and the fitted regression lines get further apart as $x$ moves away from $\bar{x}$.
• We are less certain about predictions for an $x$ far from $\bar{x}$ (the variability in the mean response increases as $x$ moves away from $\bar{x}$, as seen by the different fits).
• The linear model is useful for interpolation (predictions within the range of the training data), but not extrapolation (beyond the range of the training data).
To quantify the variability in the regression coefficients, we compute/estimate their standard errors (which are simply the standard deviations of the estimators).
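A minimal sketch of this repeated-sampling idea, using the same true model as the figure ($Y = 2 + 3X + \epsilon$ with normal errors; the sample size and error standard deviation are illustrative):

set.seed(5)
slopes <- replicate(10, {
  x <- rnorm(100)                       # predictors centred at 0, as in the figure
  y <- 2 + 3 * x + rnorm(100, sd = 2)   # true model Y = 2 + 3X + eps
  coef(lm(y ~ x))[2]                    # fitted slope for this training set
})
slopes       # ten slightly different slope estimates
sd(slopes)   # empirical variability of the slope estimator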

Assessing the accuracy of the parameter estimates
The standard errors for the parameter estimates are
$$SE(\hat\beta_0) = \sqrt{V(\hat\beta_0 \mid X)} = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}
\quad \text{and} \quad
SE(\hat\beta_1) = \sqrt{V(\hat\beta_1 \mid X)} = \frac{\sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}},$$
where $\sigma^2 = V(\epsilon)$.
Usually $\sigma$ is not known and needs to be estimated from data using the residual standard error (RSE)
$$\mathrm{RSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p - 1}},$$
where $p$ is the number of predictors ($p = 1$ here).
The standard error reflects how much the estimator varies under repeated sampling. You can think about the SE of $\hat\beta_1$ in the following way. Assume we have many training data sets from the population of interest. Then, for each training data set, we fit a linear model (each fit has a different $\hat\beta_1$ value). The standard error of $\hat\beta_1$ is the standard deviation of the $\hat\beta_1$ values we obtained. If $\sigma$ is large, the standard errors will tend to be large. This means $\hat\beta_1$ can vary wildly for different training sets.
If the $x_i$'s are well spread over the predictor's range, the estimators will tend to be more precise (small standard errors). If an $x_i$ is far from $\bar{x}$, $x_i$ is called a high leverage point. These points can have a huge impact on the estimated regression line.
We can also construct confidence intervals (CIs) for the regression coefficients; for example, an approximate 95% CI for the slope parameter $\beta_1$ is
$$\hat\beta_1 \pm 2\, SE(\hat\beta_1).$$
Assumptions: The SE formulas require the errors to be uncorrelated with constant variance $\sigma^2$. The CI requires a stronger assumption: $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$. Bootstrap CIs can be constructed if the normality assumption fails (see Section 5).
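In R the standard errors appear in the summary output and exact $t$-based confidence intervals come from confint; a minimal sketch on simulated data:

set.seed(6)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
summary(fit)$coefficients     # estimates, standard errors, t-statistics, p-values
confint(fit, level = 0.95)    # exact t-based confidence intervals
b1  <- coef(fit)[2]
se1 <- summary(fit)$coefficients[2, "Std. Error"]
b1 + c(-2, 2) * se1           # the approximate interval beta1-hat +/- 2 SE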

Hypothesis testing
If $\beta_1 = 0$, then the simple linear model reduces to $Y = \beta_0 + \epsilon$, and $X$ is not associated with $Y$.
To test whether $X$ is associated with $Y$, we perform a hypothesis test:
$H_0: \beta_1 = 0$ (there is no relationship between $X$ and $Y$)
$H_A: \beta_1 \neq 0$ (there is some relationship between $X$ and $Y$)
If the null hypothesis is true ($\beta_1 = 0$), then
$$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}$$
will have a $t$-distribution with $n - 2$ degrees of freedom.
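A minimal sketch of this test carried out by hand (the same numbers are reported by summary(lm(...)); data simulated for illustration):

set.seed(7)
n <- 40
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(y ~ x)
b1     <- coef(fit)[2]
se1    <- summary(fit)$coefficients[2, 2]
t_star <- (b1 - 0) / se1                     # observed t-statistic
p_val  <- 2 * pt(-abs(t_star), df = n - 2)   # two-sided p-value
c(t_star, p_val)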
We look for evidence to reject the null hypothesis ($H_0$) to establish the alternative hypothesis $H_A$. We reject $H_0$ if the p-value for the test is small. The p-value is the probability of observing a $t$-statistic more extreme than the observed statistic $t^*$ if $H_0$ is true. This is a two-sided test, so more extreme means $t \leq -|t^*|$ or $t \geq |t^*|$.
• A large p-value is NOT strong evidence in favour of H0.
• The p-value is NOT the probability that HA is true.
• When we reject H0 we say that the result is statistically significant (which does not imply scientific significance).
• A level $0 < \alpha < 1$ test rejects $H_0: \beta_1 = 0$ if and only if the $(1 - \alpha)100\%$ confidence interval for $\beta_1$ does not include 0.
[Figure: $t$-distribution density showing the two-sided p-value for $t^* = 2$ (or $t^* = -2$).]

Results for the advertising data set

            Coefficient   Std. Error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         <0.0001
TV          0.0475        0.0027       17.67         <0.0001

We fit linear models using R and it performs hypothesis tests for us. We need to be able to interpret the output and draw valid conclusions.
• The intercept $\hat\beta_0$ tells us the predicted sales when the TV budget is set to zero.
• The p-value is very small for TV, so there is some relationship between TV budget and sales ($\beta_1 \neq 0$).
• An approximate (we are estimating the standard error using the RSE) 95% confidence interval for $\beta_1$ is $0.0475 \pm 2(0.0027) \approx (0.042, 0.053)$, which does not contain zero. That is, $\beta_1 \neq 0$ at the 95% confidence level.

Assessing the overall accuracy
Once we have established that there is some relationship between $X$ and $Y$, we want to quantify the extent to which the linear model fits the data.
The residual standard error (RSE) provides an absolute measure of lack of fit for the linear model, but it is not always clear what a good RSE is.
An alternative measure of fit is R-squared ($R^2$),
$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},$$
where TSS is the total sum of squares.
The $R^2$ statistic measures the fraction of variation in $y$ that is explained by the model and satisfies $0 \leq R^2 \leq 1$. The closer $R^2$ is to 1, the better the model.

Results for the advertising data set

Quantity                          Value
Residual standard error (RSE)     3.26
$R^2$                             0.612

The $R^2$ statistic has an interpretable advantage over the RSE because it always lies between 0 and 1. A good $R^2$ value usually depends on the application.
Approximately 61% of the variation in sales is explained by the linear model with TV as a predictor.

Multiple linear regression
In multiple linear regression, we assume a model
$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon,$$
where $\beta_0, \beta_1, \ldots, \beta_p$ are $p + 1$ unknown parameters and $\epsilon$ is an error term with $E(\epsilon) = 0$.
Given some parameter estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$, the prediction of $Y$ at $X = x$ is given by
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_p x_p.$$
The slope parameters have a mathematical interpretation in multiple linear regression: $\beta_i$ estimates the expected change in $Y$ per unit change in $X_i$, with all other predictors fixed. This is a useful mathematical property, but usually the predictors are correlated and hence tend to change together (an increase in one predictor tends to increase another, etc.).

Multiple linear regression
[Figure: the least squares plane $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$ fitted to observations in $(X_1, X_2, Y)$ space.]

Estimating the parameters: least squares approach
The parameters $\beta_0, \beta_1, \ldots, \beta_p$ are estimated using the least squares approach. We choose $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$ to minimize the sum of squared residuals
$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \ldots - \hat\beta_p x_{ip})^2.$$
We will calculate these parameter estimates using R (see the sketch below).
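A sketch of the multiple regression fit in R (again assuming a local Advertising.csv with columns named sales, TV, radio and newspaper):

Advertising <- read.csv("Advertising.csv")                         # assumed file/column names
fit_all <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(fit_all)   # coefficient table, RSE, R-squared and the overall F-statistic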
Results for the advertising data

             Coefficient   Std. Error   t-statistic   p-value
Intercept     2.939        0.3119        9.42         <0.0001
TV            0.046        0.0014       32.81         <0.0001
Radio         0.189        0.0086       21.89         <0.0001
Newspaper    -0.001        0.0059       -0.18          0.8599

When reading this output, each statement is made conditional on the other predictors being in the model.
• Given that TV and Radio are in the model, Newspaper is not useful for predicting sales (high p-value).
• Possible cause: There is a moderate correlation between Newspaper and Radio of approximately 0.35. If Radio is included in the model, Newspaper is not needed.
Revision material: The sample correlation for variables $x$ and $y$ is
$$r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}$$
and satisfies $-1 \leq r_{xy} \leq 1$. The sample correlation measures how close the pairs $(x_i, y_i)$ are to falling on a line. If $r_{xy} > 0$, then $x$ and $y$ are positively correlated. If $r_{xy} < 0$, then $x$ and $y$ are negatively correlated. Finally, if $r_{xy} \approx 0$, then $x$ and $y$ are uncorrelated (not linearly related, but they could be non-linearly related).

Is there a relationship between Y and X?
To test whether $X$ is associated with $Y$, we perform a hypothesis test:
$H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$ (there is no relationship)
$H_A$: at least one $\beta_j$ is non-zero (there is some relationship)
If the null hypothesis is true (no relationship), then
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$
will have an $F$-distribution with parameters $p$ and $n - p - 1$.
We will use R to compute the $F$-statistic and its corresponding p-value. The $F$-test is a one-sided test with a distribution whose shape is determined by $p$ and $n - p - 1$:
[Figure: comparison of $F(p, n-p-1)$ density curves for $F(20, 1000)$, $F(10, 50)$, $F(5, 10)$ and $F(3, 5)$.]
We reject the null hypothesis (that there is no relationship) if $F$ is sufficiently large (small p-value).

Is the model a good fit?
Once we have established that there is some relationship between the response and the predictors, we want to quantify the extent to which the multiple linear model fits the data. The residual standard error (RSE) and $R^2$ are commonly used. For the advertising data we have:

Quantity                          Value
Residual standard error (RSE)     1.69
$R^2$                             0.897
F-statistic                       570 (p-value << 0.0001)

The $F$-statistic is very large (the p-value is essentially zero), which gives us strong evidence for a relationship (we reject $H_0$ and accept $H_A$). By including all three predictors in the linear model, we have explained approximately 90% of the variation in sales.
The $F$-statistic does not tell us which predictors are important, only that at least one of the slope parameters is non-zero. To determine which predictors are important, further analysis is required.
Warning: $R^2$ increases (or remains the same) if more predictors are added to the model. Hence, we should not use $R^2$ for comparing models that contain different numbers of predictors. We will consider model selection in Section 5.

Extensions to the linear model
We can remove the additive assumption and allow for interaction effects. Consider the standard linear model with two predictors
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.$$
An interaction term is included by adding a third predictor to the standard model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.$$
For example, spending money on radio advertising could increase the effectiveness of TV advertising.
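A sketch of fitting such an interaction model in R (same assumed Advertising.csv; in an R formula, TV * radio expands to TV + radio + TV:radio):

Advertising <- read.csv("Advertising.csv")             # assumed file/column names
fit_int <- lm(sales ~ TV * radio, data = Advertising)
summary(fit_int)   # main effects plus the TV:radio interaction term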
Results for the advertising data
Consider the model
$$\mathrm{Sales} = \beta_0 + \beta_1 \mathrm{TV} + \beta_2 \mathrm{Radio} + \beta_3 (\mathrm{TV} \times \mathrm{Radio}) + \epsilon.$$
The results are:

             Coefficient   Std. Error   t-statistic   p-value
Intercept    6.7502        0.248        27.23         <0.0001
TV           0.0191        0.002        12.70         <0.0001
Radio        0.0289        0.009         3.24          0.0014
TV×Radio     0.0011        0.000        20.73         <0.0001

The estimated coefficient for the interaction term is statistically significant when TV and Radio are included in the model.
• There is strong evidence for $\beta_3 \neq 0$.
• $R^2$ has gone from approximately 90% (the model with all three predictors) to approximately 97% by including the interaction term. Note: these models have the same complexity (4 parameters, $\beta_0, \ldots, \beta_3$) and hence can be compared using $R^2$.
Hierarchical Principle: If we include an interaction term in the model, we should also include the main predictors (even if they are not significant).
• We include them for better interpretation (as interpretable results are often one of the reasons for choosing a linear model in the first place).

Extensions to the linear model
We can accommodate non-linear relationships using polynomial regression. Consider the simple linear model
$$Y = \beta_0 + \beta_1 X + \epsilon.$$
Non-linear relationships can be captured by including powers of $X$ in the model. For example, a quadratic model is
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon.$$
The model is linear in $\beta_0$, $\beta_1$ and $\beta_2$. The hierarchical principle applies here as well. If you include $X^2$ in your model, you should also include $X$ (even if it is not significant).

Polynomial regression: Auto data
[Figure: miles per gallon versus horsepower for the Auto data, with linear, degree 2 and degree 5 polynomial fits.]
The linear fit fails to capture the non-linear structure in the training data. To objectively choose the best fit, we need to use model selection methods (more about this in Section 5). We cannot use $R^2$ because
$$R^2 \text{ for degree 1} \leq R^2 \text{ for degree 2} \leq R^2 \text{ for degree 5}.$$
Subjectively, we could argue that the quadratic model fits best (the degree 5 polynomial is more complex and does not appear to capture much more than the quadratic).

Results for the auto data
The figure suggests that
$$\mathrm{mpg} = \beta_0 + \beta_1 \mathrm{Horsepower} + \beta_2 \mathrm{Horsepower}^2 + \epsilon$$
may fit the data better than a simple linear model. The results are:

                Coefficient   Std. Error   t-statistic   p-value
Intercept       56.9001       1.8004        31.6         <0.0001
Horsepower      -0.4662       0.0311       -15.0         <0.0001
Horsepower²      0.0012       0.0001        10.1         <0.0001

Horsepower and Horsepower² are significant and hence useful for predicting mpg (an R sketch of this fit appears at the end of these notes).

What we did not cover
Qualitative predictors need to be coded using dummy variables for linear regression (R does this automatically for us).
Deciding on important variables.
Outliers (unusual $y$ value) and high leverage points ($x$ value far from $\bar{x}$).
Non-constant variance and correlation of error terms.
Collinearity.
The textbook covers these topics and you are encouraged to read these sections if you are unfamiliar with this material. A basic knowledge of linear regression is essential to fully appreciate the methods covered in this course. The previous lecture slides (and the course textbook) provide this basic knowledge.
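Finally, a sketch of the quadratic Auto fit discussed above (the Auto data set is assumed here to come from the ISLR2 package; I(horsepower^2) adds the squared term):

library(ISLR2)                                        # assumed package providing the Auto data
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit_quad)   # compare with the simple fit lm(mpg ~ horsepower, data = Auto)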