Regression
(Module 5)
Statistics (MAST20005) & Elements of Statistics (MAST90058) Semester 2, 2022
Contents

1 Introduction
2 Regression
3 Simple linear regression
  3.1 Point estimation of the mean
  3.2 Interlude: Analysis of variance
  3.3 Point estimation of the variance
  3.4 Standard errors of the estimates
  3.5 Confidence intervals
  3.6 Prediction intervals
  3.7 R examples
  3.8 Model checking
4 Further regression models
5 Correlation
  5.1 Definitions
  5.2 Point estimation
  5.3 Relationship to regression
  5.4 Confidence interval
  5.5 R example
Aims of this module
• Introduce the concept of regression
• Show a simple model for studying the relationship between two variables
• Discuss correlation and how it relates to regression
1 Introduction
Relationships between two variables
We have studied how to do estimation for some simple scenarios:
• iid samples from a single distribution (Xi)
• comparing iid samples from two different distributions (Xi and Yj)
• differences between paired measurements (Xi − Yi)
We now consider how to analyse bivariate data more generally: two variables, X and Y, measured at the same time (i.e. as a pair).
The data consist of pairs of data points, (xi,yi). These can be visualised using a scatter plot.
Example data
xi: 1.80 1.40 2.10 0.30 3.60 0.70 1.10 2.10 0.90 3.80
yi: 9.18 7.66 6.33 4.51 14.04 4.94 4.24 8.19 4.55 11.57
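As a quick sketch, the example data can be entered and plotted in R as follows (the vector names x and y are the same ones used in the R examples later in these notes):

> x <- c(1.80, 1.40, 2.10, 0.30, 3.60, 0.70, 1.10, 2.10, 0.90, 3.80)
> y <- c(9.18, 7.66, 6.33, 4.51, 14.04, 4.94, 4.24, 8.19, 4.55, 11.57)
> # Scatter plot of the pairs (xi, yi)
> plot(x, y)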
Often interested in how Y depends on X. For example, we might want to use X to predict Y .
In such a setting, we will assume that the X values are known and fixed (henceforth, x instead of X), and look at
how Y varies given x.
Example: Y is a student’s final mark for Statistics, and x is their mark for the prerequisite subject Probability. Does
x help to predict Y ?
The regression of Y on x is the conditional mean, E(Y | x) = μ(x).
The regression can take any form. We consider simple linear regression, which has the form of a straight line:
E(Y |x)=α+βx and var(Y |x)=σ2.
2 Regression
Example: simple linear regression model
E(Y | x) = α + βx
var(Y | x) = σ²
Terminology
• Y is called a response variable. Can also be called an outcome or target variable. Please do not call it the
‘dependent’ variable.
• x is called a predictor variable. Can also be called an explanatory variable. Please do not call it an ‘independent’ variable.
• μ(x) is called the (linear) predictor function or sometimes the regression curve or the model equation.
• The parameters in the predictor function are called regression coefficients.
Why ‘regression’?
It is strange terminology, but it has stuck.
Refers to the idea of ‘regression to the mean’: if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement, and vice versa.
First described by Sir Francis Galton when studying the inheritance of height between fathers and sons. In doing so, he invented the technique of simple linear regression.
A regression model is called linear if it is linear in the coefficients. It doesn’t have to define a straight line!
Complex and non-linear functions of x are allowed, as long as the resulting predictor function is a linear combination (i.e. an additive function) of them, with the coefficients ‘out the front’.
For example, the following are linear models:
μ(x) = α + βx + γx²
μ(x) = α + β…
μ(x) = α sin x + β log x

The following are NOT linear models:

μ(x) = α sin(βx)
μ(x) = α / (1 + βx)
μ(x) = αx^β

…but the last one can be re-expressed as a linear model on a log scale (by taking logs of both sides),

μ*(x) = α* + β log x

3 Simple linear regression

Estimation goals

Back to our simple linear regression model:

E(Y | x) = α + βx and var(Y | x) = σ².

• We wish to estimate the slope (β), the intercept (α) and the error variance (σ²), obtain their standard errors, and construct confidence intervals for these quantities.
• Often we want to use the fitted model to make predictions about future observations (i.e. predict Y for a new x).
• Note: the Yi are not iid. They are independent but have different means, since these depend on the xi.
• We have not (yet) assumed any specific distribution for Y, only a conditional mean and variance.

Reparameterisation

Changing our model slightly… Let α0 = α + βx̄, which gives:

E(Y | x) = α + βx = α0 + β(x − x̄)

Now our model is in terms of α0 and β. This will make calculations and proofs simpler.

3.1 Point estimation of the mean

Least squares estimation

Choose α0 and β to minimize the sum of squared deviations:

H(α0, β) = ∑_{i=1}^n (yi − α0 − β(xi − x̄))²

Solve this by finding the partial derivatives and setting them to zero:

∂H(α0, β)/∂α0 = 2 ∑_{i=1}^n [yi − α0 − β(xi − x̄)](−1)
∂H(α0, β)/∂β = 2 ∑_{i=1}^n [yi − α0 − β(xi − x̄)](−(xi − x̄))

These are called the normal equations.
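As an illustration, H(α0, β) can also be minimised numerically rather than via the normal equations. A minimal sketch using R's general-purpose optimiser optim() (assuming the x and y vectors entered earlier):

> # Sum of squared deviations, as a function of (alpha0, beta)
> H <- function(par) sum((y - par[1] - par[2] * (x - mean(x)))^2)
> # Numerical minimisation, starting from (0, 0)
> optim(c(0, 0), H)$par
> # Should agree (to numerical accuracy) with the closed-form
> # least squares solution derived next: roughly (7.52, 2.59)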
Least squares estimators
Some algebra yields the least squares estimators,

α̂0 = Ȳ,    β̂ = ∑_{i=1}^n (xi − x̄)Yi / ∑_{i=1}^n (xi − x̄)².

An equivalent expression for β̂ is:

β̂ = ∑_{i=1}^n (xi − x̄)(Yi − Ȳ) / ∑_{i=1}^n (xi − x̄)².

These are equivalent, due to the following result:

∑_{i=1}^n (xi − x̄)(Yi − Ȳ) = ∑_{i=1}^n (xi − x̄)Yi.

Can also then get an estimator for α:

α̂ = α̂0 − β̂x̄ = Ȳ − β̂x̄.

And also an estimator for the predictor function,

μ̂(x) = α̂ + β̂x = α̂0 + β̂(x − x̄) = Ȳ + β̂(x − x̄).
Ordinary least squares
This method is sometimes called ordinary least squares or OLS.
Other variants of least squares estimation exist, with different names. For example, ‘weighted least squares’.
Example: least squares estimates
For our data:

x̄ = 1.78
ȳ = 7.52 = α̂0
β̂ = 2.59
α̂ = 2.91

The fitted model equation is then:

μ̂(x) = 2.91 + 2.59x
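These numbers can be reproduced directly from the formulas. A minimal sketch (using the x and y vectors from earlier; compare with the lm() output below):

> K <- sum((x - mean(x))^2)           # K = 12.34
> beta.hat <- sum((x - mean(x)) * y) / K
> alpha0.hat <- mean(y)               # = 7.52
> alpha.hat <- alpha0.hat - beta.hat * mean(x)
> c(alpha.hat, beta.hat)              # approximately (2.91, 2.59)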
> rbind(y, x)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
y 9.18 7.66 6.33 4.51 14.04 4.94 4.24 8.19 4.55 11.57
x 1.80 1.40 2.10 0.30 3.60 0.70 1.10 2.10 0.90 3.80
> model1 <- lm(y ~ x)
> model1

Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
2.911 2.590
Properties of these estimators
What do we know about these estimators? They are all linear combinations of the Yi:

α̂0 = (1/n) ∑_{i=1}^n Yi,    β̂ = ∑_{i=1}^n ((xi − x̄)/K) Yi,

where K = ∑_{i=1}^n (xi − x̄)².

This allows us to easily calculate means and variances.

E(α̂0) = E(Ȳ) = (1/n) ∑_{i=1}^n E(Yi) = (1/n) ∑_{i=1}^n [α0 + β(xi − x̄)] = α0

Similarly,

E(β̂) = (1/K) ∑_{i=1}^n (xi − x̄) E(Yi) = (1/K) ∑_{i=1}^n (xi − x̄)[α0 + β(xi − x̄)]
     = (∑_{i=1}^n (xi − x̄)/K) α0 + (K/K) β = β

This also implies E(α̂) = α and E(μ̂(x)) = μ(x), and so we have that all of the estimators are unbiased.

Variances?

var(α̂0) = var(Ȳ) = (1/n²) ∑_{i=1}^n var(Yi) = σ²/n

var(β̂) = var(∑_{i=1}^n ((xi − x̄)/K) Yi) = ∑_{i=1}^n ((xi − x̄)/K)² var(Yi)
       = (∑_{i=1}^n (xi − x̄)²/K²) σ² = (K/K²) σ² = σ²/K

var(α̂) = (1/n + x̄²/K) σ²

var(μ̂(x)) = (1/n + (x − x̄)²/K) σ²

cov(α̂0, β̂) = 0

Can we get their standard errors? We need an estimate of σ².
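These mean and variance results can also be checked by simulation. A minimal sketch (the 'true' parameter values below are arbitrary choices for illustration; x is the predictor vector from the example data):

> set.seed(1)
> alpha <- 3; beta <- 2.5; sigma <- 1.5      # arbitrary 'true' values
> K <- sum((x - mean(x))^2)
> beta.sim <- replicate(10000, {
+     y.sim <- alpha + beta * x + rnorm(length(x), 0, sigma)
+     sum((x - mean(x)) * y.sim) / K
+ })
> mean(beta.sim)     # close to beta (unbiased)
> var(beta.sim)      # close to sigma^2 / K
> sigma^2 / K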
3.2 Interlude: Analysis of variance
Analysis of variance: iid model
For Xi ∼ N(μ, σ²) iid,

∑_{i=1}^n (Xi − μ)² = ∑_{i=1}^n (Xi − X̄)² + n(X̄ − μ)²

Analysis of variance: regression model

∑_{i=1}^n (Yi − α0 − β(xi − x̄))²
  = ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄) + α̂0 + β̂(xi − x̄) − α0 − β(xi − x̄))²
  = ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄) + (α̂0 − α0) + (β̂ − β)(xi − x̄))²
  = ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄))² + n(α̂0 − α0)² + K(β̂ − β)²

Note that the cross-terms disappear. Let's see...

The cross-terms...

The three cross-terms are:

t1 = 2 ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄))(α̂0 − α0)
t2 = 2 ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄))(β̂ − β)(xi − x̄)
t3 = 2 ∑_{i=1}^n (xi − x̄)(β̂ − β)(α̂0 − α0)

Since ∑_{i=1}^n (xi − x̄) = 0 and ∑_{i=1}^n (Yi − α̂0) = ∑_{i=1}^n (Yi − Ȳ) = 0, the first and third cross-terms are easily shown to be zero.

For the second term,

t2 = 2(β̂ − β) [∑_{i=1}^n (Yi − Ȳ)(xi − x̄) − β̂ ∑_{i=1}^n (xi − x̄)²]
   = 2(β̂ − β) [∑_{i=1}^n (Yi − Ȳ)(xi − x̄) − β̂K]
   = 2(β̂ − β) [∑_{i=1}^n Yi(xi − x̄) − ∑_{i=1}^n Yi(xi − x̄)]
   = 0

Therefore, all the cross-terms are zero.

Back to the analysis of variance formula...

∑_{i=1}^n (Yi − α0 − β(xi − x̄))² = ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄))² + n(α̂0 − α0)² + K(β̂ − β)²

Taking expectations gives,

nσ² = E(D²) + σ² + σ²   ⇒   E(D²) = (n − 2)σ²

where D² = ∑_{i=1}^n (Yi − α̂0 − β̂(xi − x̄))².
3.3 Point estimation of the variance
Variance estimator
Based on these results, we have an unbiased estimator of the variance,

σ̂² = D² / (n − 2).

The inferred mean for each observation is called its fitted value, Ŷi = α̂0 + β̂(xi − x̄). The deviation from each fitted value is called a residual, Ri = Yi − Ŷi.

The variance estimator is based on the sum of squared residuals, D² = ∑_{i=1}^n Ri².
Example: variance estimate

For our data:

d² = 16.12,   σ̂² = 2.015,   σ̂ = 1.42

3.4 Standard errors of the estimates

Standard errors

We can substitute σ̂² into the formulae for the standard deviations of the estimators in order to calculate standard errors. For example,

var(β̂) = σ²/K   ⇒   se(β̂) = σ̂/√K
Example: standard errors
For our data:

se(α̂0) = σ̂/√n = 0.449
se(β̂) = σ̂/√K = 0.404
se(μ̂(x)) = σ̂ √(1/n + (x − x̄)²/K) = 1.42 × √(1/10 + (x − 1.78)²/12.34)
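A minimal sketch of these calculations in R (using the fitted object model1 from the earlier example; resid() extracts the residuals):

> d2 <- sum(resid(model1)^2)                # = 16.12
> sigma.hat <- sqrt(d2 / (length(x) - 2))   # = 1.42
> K <- sum((x - mean(x))^2)                 # = 12.34
> sigma.hat / sqrt(length(x))               # se of alpha0-hat, = 0.449
> sigma.hat / sqrt(K)                       # se of beta-hat, = 0.404
> # sigma.hat matches the 'Residual standard error' and the last value
> # matches the 'Std. Error' for x in summary(model1), shown in Section 3.7.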
3.5 Confidence intervals
Maximum likelihood estimation
Want to also construct confidence intervals. This requires further assumptions about the population distribution. Let’s assume a normal distribution:
Yi ∼N(α+βxi,σ2). Alternative notation (commonly used for regression/linear models):
Yi =α+βxi +εi, where εi ∼N(0,σ2). Since the Yi’s are independent, the likelihood is:
L(α, β, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(yi − α − βxi)² / (2σ²))
            = (1/√(2πσ²))^n exp(−∑_{i=1}^n (yi − α0 − β(xi − x̄))² / (2σ²))

Let's maximise the likelihood...

−ln L(α, β, σ²) = (n/2) ln(2πσ²) + ∑_{i=1}^n (yi − α0 − β(xi − x̄))² / (2σ²)
                = (n/2) ln(2πσ²) + (1/(2σ²)) H(α0, β)
The α0 and β that maximise the likelihood (i.e. minimise the negative log-likelihood) are the same as those that minimise the sum of squares, H.
The OLS estimates are the same as the MLEs! What about σ2?
Differentiate by σ, set to zero, solve. . .
σ̂²_MLE = (1/n) D²

This is biased. Prefer to use the previous, unbiased estimator,

σ̂² = (1/(n − 2)) D²
Sampling distributions
The Y1, · · · , Yn are independent normally distributed random variables.
Except for σˆ2, our estimators are linear combinations of the Yi so will also have normal distributions, with mean and variance as previously derived.
For example,

β̂ ∼ N(β, σ²/K).

Moreover, we know α̂0 and β̂ are independent, because they are bivariate normal rvs with zero covariance. Using the analysis of variance decomposition (from earlier), we can show that,

(n − 2)σ̂² / σ² ∼ χ²_{n−2}.
Therefore, we can define pivots for the various mean parameters. For example,

(β̂ − β) / (σ̂/√K) ∼ t_{n−2}   and   (μ̂(x) − μ(x)) / (σ̂ √(1/n + (x − x̄)²/K)) ∼ t_{n−2}.

This allows us to construct confidence intervals.
Example: confidence intervals
For our data, a 95% CI for β is:
β̂ ± c σ̂/√K = 2.59 ± 2.31 × 0.404 = (1.66, 3.52)
where c is the 0.975 quantile of t_{n−2}. A 95% CI for μ(3) is:

μ̂(3) ± c × se(μ̂(3)) = 10.68 ± 2.31 × 0.667 = (9.14, 12.22)
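A minimal sketch of these calculations in R (using the quantities alpha.hat, beta.hat, sigma.hat and K computed in the earlier sketches; qt() gives the quantile c):

> crit <- qt(0.975, df = length(x) - 2)                  # = 2.31
> # 95% CI for beta
> beta.hat + c(-1, 1) * crit * sigma.hat / sqrt(K)       # (1.66, 3.52)
> # 95% CI for mu(3)
> mu3 <- alpha.hat + beta.hat * 3                        # = 10.68
> se.mu3 <- sigma.hat * sqrt(1/length(x) + (3 - mean(x))^2 / K)
> mu3 + c(-1, 1) * crit * se.mu3                         # (9.14, 12.22)
> # Compare with confint(model1) and predict() in Section 3.7.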
3.6 Prediction intervals
Deriving prediction intervals
Use the same trick as we used for the simple model,

Y* ∼ N(μ(x*), σ²)
μ̂(x*) ∼ N(μ(x*), (1/n + (x* − x̄)²/K) σ²)
Y* − μ̂(x*) ∼ N(0, (1 + 1/n + (x* − x̄)²/K) σ²)

A 95% PI for Y* is given by:

μ̂(x*) ± c σ̂ √(1 + 1/n + (x* − x̄)²/K)

Example: prediction interval

A 95% PI for Y* corresponding to x* = 3 is:

10.68 ± 2.31 × 1.42 × √(1 + 1/10 + (3 − 1.78)²/12.34) = (7.06, 14.30)

Much wider than the corresponding CI, as we've seen previously.
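A minimal sketch of the same calculation in R (continuing from the objects defined in the previous sketches):

> se.pred <- sigma.hat * sqrt(1 + 1/length(x) + (3 - mean(x))^2 / K)
> mu3 + c(-1, 1) * crit * se.pred     # (7.06, 14.30)
> # Compare with predict(model1, newdata = data2, interval = "prediction").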
3.7 R examples
> model1 <- lm(y ~ x)
> summary(model1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.01970 -1.05963 0.02808 1.04774 1.80580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9114 0.8479 3.434 0.008908 **
x 2.5897 0.4041 6.408 0.000207 ***
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.419 on 8 degrees of freedom
Multiple R-squared: 0.8369, Adjusted R-squared: 0.8166
F-statistic: 41.06 on 1 and 8 DF, p-value: 0.0002074
> # Confidence intervals for mean parameters
> confint(model1)
2.5 % 97.5 %
(Intercept) 0.9560629 4.866703
x 1.6577220 3.521623
> # Data to use for prediction.
> data2 <- data.frame(x = 3)
> # Confidence interval for mu(3).
> predict(model1, newdata = data2, interval = "confidence")
fit lwr upr
1 10.6804 9.142823 12.21798
> # Prediction interval for y when x = 3.
> predict(model1, newdata = data2, interval = "prediction")
fit lwr upr
1 10.6804 7.064 14.2968
R example explained
• The lm (linear model) command fits the model.
• model1 is an object that contains all the results of the regression needed for later calculations.
• summary(model1) acts on model1 and summarizes the regression.
• predict can calculate CIs and PIs.
• R provides more detail than we need at the moment. Much of the output relates to hypothesis testing that we will get to later.
Plot data and fitted model
> plot(x, y, col = "blue")
> abline(model1, col = "blue")
The command abline(model1) adds the fitted line to a plot.
Fitted values and CIs for their means
> predict(model1, interval = “confidence”)
fit lwr upr
1 7.572793 6.537531 8.608056
2 6.536924 5.442924 7.630925
3 8.349695 7.272496 9.426895
4 3.688285 1.963799 5.412771
5 12.234204 10.247160 14.221248
6 4.724154 3.280382 6.167925
7 5.760023 4.546338 6.973707
8 8.349695 7.272496 9.426895
9 5.242088 3.921478 6.562699
10 12.752138 10.603796 14.900481
Confidence band for the mean
> data3 <- data.frame(x = seq(-1, 5, 0.05))
> y.conf <- predict(model1, data3, interval = "confidence")
> head(cbind(data3, y.conf))
x fit lwr upr
1 -1.00 0.3217104 -2.468232 3.111653
2 -0.95 0.4511941 -2.295531 3.197919
3 -0.90 0.5806777 -2.122943 3.284298
4 -0.85 0.7101613 -1.950472 3.370794
5 -0.80 0.8396449 -1.778124 3.457414
6 -0.75 0.9691286 -1.605906 3.544164
> matplot(data3$x, y.conf, type = "l", lty = c(1, 2, 2),
+ lwd = 2, xlab = "x", ylab = "y")
> points(x, y, col = "blue")
Prediction bands for new observations
> y.pred <- predict(model1, data3, interval = "prediction")
> head(cbind(data3, y.pred))
x fit lwr upr
1 -1.00 0.3217104 -3.979218 4.622639
2 -0.95 0.4511941 -3.821827 4.724215
3 -0.90 0.5806777 -3.664763 4.826119
4 -0.85 0.7101613 -3.508034 4.928357
5 -0.80 0.8396449 -3.351646 5.030936
6 -0.75 0.9691286 -3.195606 5.133863
> matplot(data3$x, y.pred, type = "l", lty = c(1, 3, 3),
+ lwd = 2, xlab = "x", ylab = "y")
> points(x, y, col = "blue")
Both bands plotted together
> matplot(data3$x, cbind(y.conf, y.pred[, -1]),
+ type = "l", lwd = 2, lty = c(1, 2, 2, 3, 3),
+ xlab = "x", ylab = "y")
> points(x, y)
3.8 Model checking

Checking our assumptions

What modelling assumptions have we made?

• Linear model for the mean
• Equal variances for all observations (homoscedasticity)
• Normally distributed residuals

Ways to check these:

• Plot the data and fitted model together (done!)
• Plot residuals vs fitted values
• QQ plot of the residuals

In R, the last two of these are very easy to do:

> plot(model1, 1:2)

[Figures: "Residuals vs Fitted" (residuals against fitted values) and "Normal Q−Q" (standardized residuals against theoretical quantiles) diagnostic plots for model1.]
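The same two diagnostic plots can also be built directly from the fitted model, which makes explicit what plot(model1, 1:2) is showing. A minimal sketch (note that plot(model1, 2) uses standardised residuals, whereas raw residuals are used here for simplicity):

> # Residuals vs fitted values
> plot(fitted(model1), resid(model1),
+      xlab = "Fitted values", ylab = "Residuals")
> abline(h = 0, lty = 2)
> # Normal QQ plot of the residuals
> qqnorm(resid(model1))
> qqline(resid(model1))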
4 Further regression models
Multiple regression
• What if we have more than one predictor?
• Observe xi1, …, xik as well as yi (for each i)
• Can fit a multiple regression model (an R sketch of the syntax is given after this list):
E(Y | x1, …, xk) = β0 + β1x1 + β2x2 + ··· + βkxk
• This is linear in the coefficients, so is still a linear model
• Fit by the method of least squares, by minimising:

H = ∑_{i=1}^n (yi − β0 − β1xi1 − β2xi2 − ··· − βkxik)²
• Take partial derivatives, etc., and solve for β0,…,βk.
• The subject Linear Statistical Models (MAST30025) looks into these types of models in much more detail.
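A minimal sketch of fitting a multiple regression in R (the data frame dat and the predictors x1, x2 below are hypothetical, included only to show the syntax):

> # Hypothetical data: one response and two predictors
> dat <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
> model2 <- lm(y ~ x1 + x2, data = dat)
> summary(model2)     # coefficient estimates, standard errors, etc.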
Two-sample problem
• The two-sample problem can be expressed as a linear model!
• Sample Y1,…,Yn ∼ N(μ1,σ2) and Yn+1,…,Yn+m ∼ N(μ2,σ2).
• Define indicator variables (xi1,xi2) where (xi1,xi2) = (1,0) for i = 1,…,n and (xi1,xi2) = (0,1) for i = n+1,…,n+m.
• Observed data: (yi, xi1, xi2)
• Then Y1,…,Yn each have mean 1×β1 +0×β2 = μ1 and Yn+1,…,Yn+m each have mean 0×β1 +1×β2 = μ2.
• This is in the form of a multiple regression model (see the sketch below).
• The general linear model unifies many different types of models together into a common framework. The subject MAST30025 covers this in more detail.
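A minimal sketch of this equivalence in R (hypothetical data; fitting without an intercept makes the two coefficients correspond directly to the group means, matching the indicator coding above):

> # Hypothetical two-sample data
> y1 <- rnorm(10, mean = 5)
> y2 <- rnorm(12, mean = 7)
> y.all <- c(y1, y2)
> g <- factor(rep(c("A", "B"), times = c(10, 12)))
> coef(lm(y.all ~ g - 1))      # estimates of mu_1 and mu_2
> tapply(y.all, g, mean)       # the same as the group sample means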
5 Correlation
5.1 Definitions
Correlation coefficient
(Revision) For two rvs X and Y, the correlation coefficient, or simply the correlation, is defined as:

ρ = ρXY = cov(X, Y) / √(var X · var Y) = σXY / (σX σY)
This is a quantitative measure of the strength of relationship, or association, between X and Y .
We will now consider inference on ρ, based on an iid sample of pairs (Xi, Yi).

Note: unlike in regression, X is now considered as a random variable.

[Figure: scatter plots of simulated samples illustrating different values of ρ.]
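A minimal sketch of estimating a correlation in R (using the x and y vectors from the regression example earlier; cor() computes the sample correlation and cor.test() also reports a confidence interval, anticipating the later subsections):

> # Sample correlation between x and y; its square equals the
> # 'Multiple R-squared' (0.8369) reported by summary(model1)
> cor(x, y)
> # cor.test() also gives a 95% confidence interval for rho
> cor.test(x, y)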