
Subject 9 Simple Linear Regression

STAT 205 L01-02, Fall 2021

Instructor: Dr. Bingrui (Cindy) Sun
Department of Mathematics and Statistics

University of Calgary

Outline of Topics and Where to Find Them in the Pearson
eText

Topic                                                Pearson eText

9.1 Scatter-Plots, Correlation,                      4.1-4.3
    Model Interpretation
9.2 Assumptions behind the Linear Model              4.4
9.3 Analysis of Variance,                            4.4
    Coefficient of Determination
9.4 Inferences about the Slope β1                    Not covered
9.5 Confidence Interval for the Response Mean,       Not covered
    Prediction Interval for a Single Response Value

All data sets used in this subject are stored in the Subject9Data.csv
file in D2L.

Learning Outcomes of Subject 9
(i) Determine whether a linear relationship exists between the two
variables of interest, and determine the strength of that linear
relationship, based on a scatter plot, the correlation coefficient,
and the coefficient of determination.
(ii) Find the least-squares regression line. Interpret the meaning of
the y-intercept and slope (in the context of the data). Apply the
least-squares estimate of the model to predict a value of Y for a
given value of X, and compute the residual.
(iii) Construct and interpret confidence interval estimates for both
the slope and y-intercept terms in the model. Apply the T-test to test
the value of both the y-intercept and slope terms.
(iv) Construct and interpret confidence interval estimates for the
response/dependent variable: for the mean and for a single value
(prediction interval).
(v) Understand the conditions under which the simple linear regression
model holds: normality and homoscedasticity of the residuals. Use a
residual plot and QQ plot to check whether these conditions hold.

9.1 Simple Linear Regression

In this chapter, we show how to use sample data to estimate the
straight-line relationship between the mean value of one variable
(Y), E(Y), and a second variable, X. The methodology of estimating
and using such a straight-line relationship is referred to as simple
linear regression analysis.

The methods detailed in this chapter help answer questions such as:
In the transportation industry, is a correlation evident between the
price of transportation and the mass of the object being shipped? If
so, how strong is the correlation? What is the average increase in
price per unit change in the mass of the object being shipped? Can I
build a linear equation to estimate the price of transportation for a
given mass? Are there any conditions required for my linear equation
to hold, and how do I assess these conditions?

Example 9.1 Refer to the Subject9Data.csv data set in D2L. Ex-
amine the linear relationship between software sales and computer
sales.

Use a scatter plot to visually examine whether there is a linear
relationship between Y and X. StatCrunch commands: Graph →
Scatterplot → Y variable: Software Sales; X variable: Computer Sales.

A linear relationship between computer sales and software sales is
obvious.
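The notes use StatCrunch throughout; for readers who want to reproduce
the plot outside StatCrunch, here is a minimal Python sketch. It
assumes the CSV columns are named exactly as the StatCrunch variables
quoted above ("Computer Sales" and "Software Sales"); adjust the names
to match the actual file.

```python
# A minimal scatter plot sketch (not course material); column names
# are assumed to match the StatCrunch variable names quoted above.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Subject9Data.csv")  # the data file named in the notes
plt.scatter(df["Computer Sales"], df["Software Sales"])
plt.xlabel("Computer Sales")
plt.ylabel("Software Sales")
plt.title("Software Sales vs. Computer Sales")
plt.show()
```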

The linear regression model

Y = β0 + β1X + ϵ

where
Y = response or dependent variable,
X = explanatory/predictor or independent variable (the variable used
as a predictor of y),
β0 (beta zero) = y-intercept of the line, the point at which the line
intersects, or cuts through, the y-axis,
β1 (beta one) = slope of the line, the change (amount of increase or
decrease) in the deterministic component of y for every one-unit
increase in x.
A positive slope implies that E (y) increases by the amount β1, and
a negative slope implies that E (y) decreases by the amount β1, per
unit increase in x .
ϵ (epsilon) =random error component, which is assumed to follow
a Normal distribution with mean 0 and standard deviation σ, where
σ measures the variation of y about the population regression line:
E (Y ) = β0 + β1x .

E (Y ) = β0 + β1x is also referred to as the line of means.
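To make the roles of β0, β1, and ϵ concrete, here is a small
simulation sketch (not part of the course materials). The parameter
values are arbitrary, chosen only for illustration: each response is a
point on the line of means plus a Normal(0, σ) error.

```python
# Simulation sketch of Y = beta0 + beta1*X + eps; the parameter values
# (beta0=2, beta1=0.5, sigma=1) are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=50)        # predictor values
eps = rng.normal(0.0, sigma, size=50)  # random error ~ Normal(0, sigma)
line_of_means = beta0 + beta1 * x      # deterministic component E(Y)
y = line_of_means + eps                # observed responses scatter about it
```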

[Figure omitted; image courtesy: www.brighton-webs.co.uk]

β0 and β1 are population parameters that will be known only if we
have access to the entire population of (X ,Y ) measurements. To-
gether with a speci�c value of the predictor variable X , they deter-
mine the mean value of Y , which is just a speci�c point on the line
of means.

[Figure omitted; image courtesy: cnx.org]

The response Y to a given x is a random variable. The greater the
variability (σ²) of the random error ϵ, the greater the errors in
estimating β0 and β1, and the greater the error in using ŷ to predict
y for some value of x.

Fitting the Model: The Least Squares Approach

For a given data point, say (xi, yi), the observed value of Y is yi,
and the predicted value of Y is obtained by substituting xi into the
prediction equation:

ŷi = β̂0 + β̂1xi

The deviation of the ith value of y from its predicted value is
yi − ŷi = yi − (β̂0 + β̂1xi). Then the sum of the squares of the
deviations of the y-values about their predicted values for all n
data points is

SSE = ∑[yi − (β̂0 + β̂1xi)]²

The quantities β̂0 and β̂1 that make the SSE a minimum are called
the least squares estimates of the population parameters β0 and β1.

The least squares line ŷ = β̂0 + β̂1x is the line that has the following
two properties:

1. The sum of the errors equals 0, i.e., mean error of prediction=0.
2. The sum of squared errors (SSE) is smaller than that for any
other straight-line model.
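As an illustrative sketch (not course material), the least squares
estimates can be computed directly with the standard closed-form
solutions β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² and β̂0 = ȳ − β̂1x̄,
which minimize the SSE above. The toy data below are made up, and the
check at the end verifies property 1.

```python
# Least squares estimates from scratch; toy data for illustration only.
import numpy as np

def least_squares(x, y):
    x_bar, y_bar = x.mean(), y.mean()
    ss_xx = np.sum((x - x_bar) ** 2)           # sum (xi - x_bar)^2
    ss_xy = np.sum((x - x_bar) * (y - y_bar))  # sum (xi - x_bar)(yi - y_bar)
    b1 = ss_xy / ss_xx                         # slope estimate
    b0 = y_bar - b1 * x_bar                    # intercept estimate
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
b0, b1 = least_squares(x, y)
residuals = y - (b0 + b1 * x)
print(round(residuals.sum(), 10))  # property 1: the errors sum to ~0
```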

Interpreting the Estimates of β0 and β1 in Simple Linear Regression

y−intercept: β̂0 represents the predicted value of y when x = 0.
(Caution: This value will not be meaningful if the value x = 0 is
nonsensical or outside the range of the sample data.)

slope: β̂1 represents the increase (or decrease) in E (Y ) for every
1-unit increase in x . (Caution: This interpretation is valid only for
x−values within the range of the sampled data.)

StatCrunch: Stat → Regression → Simple linear → X variable: Computer
Sales, Y variable: Software Sales. Graphs: Fitted line plot, QQ plot
of residuals, Predicted values vs. residuals.

Example 9.1 continuation
(a) Use the method of least squares to estimate the values of β0 and
β1.
β̂0 = −31901.227, β̂1 = 0.63441913.

(b) Predict the software sales when computer sales are 300,000.
ŷ = β̂0 + β̂1x = −31901.227 + 0.63441913(300000) = 158424.512.

(c) Give a practical interpretation of β̂0 and β̂1.
The model parameters should be interpreted only within the sampled
range of the independent variable (X), in this case for computer sales
between 195,000 and 441,050. Thus the y-intercept, which is by
definition at x = 0, is not within the range of the sampled values of
x and is not subject to meaningful interpretation.

The slope β̂1 = 0.63441913 implies that for every unit increase
in computer sales within the sampled range between 195,000 and
441,050, software sales will increase by 0.63441913 unit, on average.
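As a quick arithmetic check of parts (a) and (b), using the estimates
reported above:

```python
# Reproducing the Example 9.1 prediction from the reported estimates.
b0, b1 = -31901.227, 0.63441913
x_new = 300_000
y_hat = b0 + b1 * x_new
print(y_hat)  # approximately 158424.512, as in part (b)
```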

9.2 Assumptions behind the Linear Model

The following are the assumptions of simple linear regression analysis:

1. The model is linear. (Check with a scatter plot.)
2. The error terms have constant variance. (Check with a residual
plot, in which each residual is plotted against its associated fitted
value as an ordered pair (ŷ, y − ŷ); see the plotting sketch after
this list.) A residual plot indicates that a line is a good fit to the
data if the plot shows no systematic pattern. Residual plots are more
meaningful with larger sample sizes; for small sample sizes, residual
plot analysis can be problematic and subject to over-interpretation.
The assumption of constant error variance is sometimes called
homoscedasticity. If the error variances are not constant, this is
called heteroscedasticity.
3. The error terms are independent.
4. The error terms are normally distributed. (Check with a Normal
quantile (QQ) plot.)
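A sketch of the two diagnostic plots named in assumptions 2 and 4,
assuming fitted values and residuals have already been computed (for
example, with the least squares sketch earlier):

```python
# Diagnostic plots sketch: residuals vs. fitted values, and a Normal QQ plot.
import matplotlib.pyplot as plt
from scipy import stats

def diagnostic_plots(y_hat, residuals):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    # Residual plot: the ordered pairs (y_hat_i, e_i); look for no
    # systematic pattern and a roughly constant vertical spread
    # (homoscedasticity).
    ax1.scatter(y_hat, residuals)
    ax1.axhline(0.0, linestyle="--")
    ax1.set_xlabel("fitted value")
    ax1.set_ylabel("residual")
    # QQ plot: points near a straight line suggest normal errors.
    stats.probplot(residuals, dist="norm", plot=ax2)
    plt.show()
```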

Example 9.1 continuation

The data points are close to the regression line, indicating that the
line might be a good fit for this data set. The Normal quantile plot
suggests that the residuals are normally distributed, and the error
terms appear to have constant variance.

9.3 Introduction to Analysis of Variance

Analysis of variance, often abbreviated as ANOVA, is essential for
regression (Subject 9) and for comparing several means (Subject 7).
Analysis of variance summarizes information about the sources of
variation in the data.

The total variation in the response y is expressed by the deviation
yi − ȳ. The overall deviation of any y observation from the mean of
the y's is the sum of two deviations:

(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi )

The differences (ŷi − ȳ) reflect the variation in mean response due to
differences in the xi. This variation is accounted for by the
regression line, because the ŷi's lie exactly on the line. The
differences (yi − ŷi) reflect the scatter of the actual observations
about the fitted line.

It is an algebraic fact that the sums of squares add:

∑(yi − ȳ)² = ∑(ŷi − ȳ)² + ∑(yi − ŷi)²

We rewrite this equation as

SST = SSR + SSE

where SST = ∑(yi − ȳ)², SSR = ∑(ŷi − ȳ)², and SSE = ∑(yi − ŷi)², where
“SS” stands for sum of squares, “T” for total, “R” for regression
(model), and “E” for error. The total degrees of freedom DFT is the
sum of DFR and DFE, the degrees of freedom for the model and for the
error:

DFT = DFR + DFE

(n − 1) = (1) + (n − 2)

To calculate the mean squares, use the formula

MS = (sum of squares) / (degrees of freedom)

To test the null hypothesis H0 : β1 = 0 (Y is not linearly related to
X ) against
Ha : β1 ̸= 0 (Y is linearly related to X ), one can use the F statistic:

F = MSR / MSE

When H0 is true, this statistic has an F distribution with 1 degree
of freedom in the numerator and n − 2 degrees of freedom in the
denominator. Large values of F are evidence against H0 in favor of
the two-sided alternative.
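A sketch of the ANOVA computations just described, assuming arrays y
(observed) and y_hat (fitted from a simple linear regression on n
points); the function name is my own:

```python
# ANOVA quantities for simple linear regression: SST = SSR + SSE,
# mean squares, and the F test of H0: beta1 = 0.
import numpy as np
from scipy import stats

def regression_anova(y, y_hat):
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)   # regression (model) sum of squares
    sse = np.sum((y - y_hat) ** 2)          # error sum of squares
    msr = ssr / 1                           # DFR = 1
    mse = sse / (n - 2)                     # DFE = n - 2
    f_stat = msr / mse
    p_value = stats.f.sf(f_stat, 1, n - 2)  # P(F >= f_stat) under H0
    return f_stat, p_value, ssr / sst       # last value is R^2
```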

The coefficient of determination, R², is the fraction of the variation
in the value of Y that is explained by the least-squares regression of
Y on X. That is,

R² = SSR / SST = ∑(ŷi − ȳ)² / ∑(yi − ȳ)².

Relationship between R² and the correlation coefficient r:

r = √R² if β1 > 0, r = 0 if β1 = 0, and r = −√R² if β1 < 0.

The correlation coefficient r tells the strength of a straight-line
relation between the response and explanatory variables.

Example 9.1 continuation
From the results on page 15, r = 0.92, so the linear relationship is
positive and strong.
R² = 0.83777732, meaning that about 84% of the variation in software
sales is explained by its linear relationship with computer sales, and
16% of the variation in software sales is not explained by its linear
regression model with computer sales.
The F statistic is 113.61605, with a p-value smaller than 0.0001.
Since the p-value is smaller than 0.05, we reject the null hypothesis
H0: β1 = 0 in favor of Ha: β1 ≠ 0 and conclude that there is a
significant linear relationship between computer sales and software
sales.

9.4 Inferences about the slope parameter β1

Confidence intervals and significance tests for the regression slope β1:

Type of Test    Lower-Tailed        Two-Tailed            Upper-Tailed
Hypotheses      H0: β1 = c          H0: β1 = c            H0: β1 = c
                Ha: β1 < c          Ha: β1 ≠ c            Ha: β1 > c
Test Statistic  TCalc = (β̂1 − c)/sβ̂1   (same for all three tests)
RR              {T: T < −tα,n−2}    {T: |T| > tα/2,n−2}   {T: T > tα,n−2}
P-value         P(T ≤ TCalc)        2P(T ≥ |TCalc|)       P(T ≥ TCalc)
CI              β̂1 ± (tα/2,n−2)sβ̂1    (same for all three tests)

where c is a constant, T has a t distribution with n − 2 degrees of
freedom, and sβ̂1 can be read from the StatCrunch regression analysis
output.

To test the null hypothesis that the linear model contributes no in-
formation for the prediction of Y against the alternative hypothesis
that the linear model is useful in predicting Y , we test

H0 : β1 = 0

Ha : β1 ̸= 0

In other words, the null hypothesis says that the mean of Y does not
vary with X , and that there is no straight-line relationship between
X and Y . If the data support the alternative hypothesis, we will
conclude that X does contribute information for the prediction of Y
with the straight-line model.

The algebraic relationship between the t statistic used for this
two-tailed test and the F test is F = t². For linear regression with
one explanatory variable, we prefer the t form of the test because it
more easily allows us to test one-sided alternatives and is closely
related to the confidence interval for β1.
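A sketch of the slope t test and confidence interval, using scipy for
the t distribution. In the usage line, the estimate and standard error
are the Example 9.1 values quoted in these notes, and n = 24 is
implied by the 22 degrees of freedom used in the example below.

```python
# t test and CI for the slope; se_b1 is read from regression output.
from scipy import stats

def slope_t_test(b1, se_b1, n, c=0.0, alpha=0.05):
    t_calc = (b1 - c) / se_b1
    p_two_sided = 2 * stats.t.sf(abs(t_calc), df=n - 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return t_calc, p_two_sided, ci

# Example 9.1 values from the notes; c=0 tests H0: beta1 = 0.
print(slope_t_test(0.63441913, 0.059519108, n=24))
```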

Example 9.1 continuation Conduct a test at α = 0.05 to determine
whether, per 1 dollar increase in computer sales, software sales
increase by more than 1 dollar, on average.

H0: β1 = 1
Ha: β1 > 1

Rejection region: {T : T > tα,n−2 = t0.05,22 = 1.71714}

Test statistic: TCalc = (β̂1 − c)/sβ̂1 = (0.63441913 − 1)/0.059519108
= −6.14 < 1.71714

Since the calculated T-value does not fall into the rejection region,
we fail to reject the null hypothesis and conclude that, per 1 unit
increase in computer sales, the average increase in software sales is
not more than 1 unit.

A 95% confidence interval estimate of β1:

β̂1 ± tα/2,n−2 sβ̂1 = 0.63441913 ± 2.07387(0.059519108) = (0.51, 0.76)

9.5 Confidence Interval for the Response Mean and Prediction Interval
for a Single Response

For any specific value of X, say x∗, the mean of the response Y in
this sub-population is given by µY = β0 + β1x∗, which can be estimated
as µ̂Y = β̂0 + β̂1x∗.

The standard error that we use for a confidence interval for the
estimated mean response corresponding to X = x∗ is

SEµ̂Y = s √(1/n + (x∗ − x̄)²/SSxx),

where SSxx = ∑(xi − x̄)² and s = √MSE are calculated from the original
n observations. A 100(1 − α)% confidence interval for the mean
response µY at X = x∗ is

µ̂Y ± tα/2,n−2 SEµ̂Y.

These confidence intervals tell us what average value of Y to expect
at a particular x value. To know the range of Y values to expect for a
particular x, we need the prediction interval.

The predicted response y for an individual case is ŷ = β̂0 + β̂1x∗,
which is the same as the expression for µ̂Y. That is, the fitted line
is used both to estimate the mean response when X = x∗ and to predict
a single future response. We use the two notations µ̂Y and ŷ to remind
ourselves of these two distinct uses.

The margin of error for a prediction interval is larger than that of
the confidence interval for the mean at x∗, because it is harder to
predict an individual value than to estimate a mean. The standard
error used for a prediction interval at X = x∗ is

SEŷ = s √(1 + 1/n + (x∗ − x̄)²/SSxx),

and a 100(1 − α)% prediction interval for y is

ŷ ± tα/2,n−2 SEŷ.

(A computational sketch of these two intervals appears at the end of
this subject.)

Example 9.1 continuation
From the prediction results on page 15, the 95% confidence interval
for the average software sales when computer sales are 300,000 is
(150716.19, 166132.83); the 95% prediction interval for a single
software sales value when computer sales are 300,000 is
(119892.62, 196956.4).

Example 9.2 Suppose a fire insurance company wants to relate the
amount of fire damage in major residential areas to the distance
between the burning house and the nearest fire station. The study is
to be conducted in a large suburb of a major city; a sample of 40
recent fires in this suburb is selected. The amount of damage in
thousands of dollars, Y, and the distance between the fire and the
nearest fire station, X, in kilometers, are recorded for each fire.
(a) Check the linear relationship between Y and X. Find the least
squares line. Interpret the estimates of β0 and β1.
(b) Interpret the coefficient of determination. Compute the
correlation coefficient.
(c) Is there sufficient evidence to indicate that X and Y are linearly
correlated? Use the α = 0.05 level of significance.
(d) Find a 95% confidence interval for β1. Interpret this estimate.
(e) Can we say that per 1 kilometer increase in the distance between a
burning house and the nearest fire station, the average damage
increases by more than 5,000 dollars? Use α = 0.05.
(f) Construct an interval estimate of the average damage for houses
that are 5 kilometers away from the nearest fire station, with 95%
confidence.
(g) Check the residual plot and comment on whether the residuals have
constant variance.
Distance  Damage    Distance  Damage    Distance  Damage
8.5       42.5      9.5       54.5      6.6       40.9
7.8       46.3      10.9      62.4      11.9      67.0
9.7       58.0      11.9      67.0      13.2      85.5
11.2      69.0      13.4      71.8      9.6       54.5
8.2       48.2      16.7      86.7      7.8       46.3
7.4       44.5      8.6       58.1      8.3       48.6
12.5      69.6      8.3       50.5      9.8       68.5
7.1       39.2      8.0       47.2      12.1      65.8
11.4      62.7      9.8       55.4      10.0      56.3
10.7      59.5      9.7       56.4      8.1       47.7
11.3      62.2      11.2      60.0      7.7       46.4
4.5       31.3      9.3       53.2      11.3      62.6
13.6      84.7      12.0      65.4      7.5       46.0
10.8      68.8

StatCrunch: Stat → Regression → Simple linear → X variable: Distance,
Y variable: Damage. Graphs: Fitted line plot, QQ plot of residuals,
Y-values vs. residuals. For multiple graphs: Rows per page: 2, columns
per page: 2.

(a) The estimated least squares line is ŷ = 6.894 + 5.1154x.
The estimated y-intercept, β̂0 = 6.894, would suggest that a fire 0
kilometers from the station has an estimated mean damage of $6,894.
Although this would seem to apply to the fire station itself, X = 0 is
outside the sample range, so β̂0 has no practical interpretation.
The estimated slope, β̂1 = 5.1154, implies that per 1 km increase in
the distance between a burning house and the nearest fire station, the
damage will increase by 5,115.4 dollars, on average.

Answers to (b)-(g) will be discussed in the lecture video.

Example 9.3 Refer to Exercise 4.45 in the Pearson eText.
(a) Check the linear relationship between calories and fat content (in
grams). Let calories be Y and fat in grams be X. Find the least
squares line. Interpret the estimates of the y-intercept and slope.
(b) Find and interpret the coefficient of determination and the
correlation coefficient.
(c) Is there sufficient evidence to indicate that X and Y are linearly
correlated? Use the α = 0.05 level of significance.
(d) Find a 95% confidence interval for β1. Interpret this estimate.
(e) Use the regression line to estimate the calories of breakfast
items with a fat content of 25 grams.
(f) Construct an interval estimate of the average calories for
breakfast items with a fat content of 25 grams. Compute the residual
for the data point (x, y) = (25, 430).
(g) Check the residual plot and comment on whether the residuals have
constant variance.

Solutions to be discussed in the lecture video.
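Finally, the computational sketch promised in Subject 9.5: the
confidence interval for the mean response and the prediction interval
for a single response at X = x∗, implemented directly from the SE
formulas given there. The function name and variable names are my own;
arrays x and y hold the sample data.

```python
# CI for the mean response and PI for a single response at x_star,
# following the SE formulas in Subject 9.5.
import numpy as np
from scipy import stats

def mean_ci_and_pi(x, y, x_star, alpha=0.05):
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    ss_xx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / ss_xx  # least squares slope
    b0 = y_bar - b1 * x_bar                         # least squares intercept
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))       # s = sqrt(MSE)
    mu_hat = b0 + b1 * x_star                       # estimate at x_star
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    se_mean = s * np.sqrt(1 / n + (x_star - x_bar) ** 2 / ss_xx)
    se_pred = s * np.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / ss_xx)
    ci = (mu_hat - t_crit * se_mean, mu_hat + t_crit * se_mean)
    pi = (mu_hat - t_crit * se_pred, mu_hat + t_crit * se_pred)
    return ci, pi
```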