Correlation and Regression
• a scatterplot is used to assess the
relationship between two variables
• each point shows the values of the two
variables (xi, yi) measured on the same
individual
• look for the overall pattern and for
striking deviations from it
• two variables are associated if some
values of one variable tend to occur more
often with some values of the other
variable
• can describe the form, direction and
strength of any association
– form can be linear or nonlinear;
direction can be positive or negative
• sometimes we hope to explain one
variable by the other
– we call them the response and
explanatory variables
– the response variable is shown on
the vertical axis
• we may want to explain or predict the
usable volume of a tree (in board feet/10)
given its diameter at chest height
(in inches)
MTB > set c1
DATA> 36 28 28 41 19 32 22 38 25 17 31 20 25 19 39 33 17 37 23 39
DATA> set c2
DATA> 192 113 88 294 28 123 51 252 56 16 141 32 86 21 231 187 22 205 57 265
MTB > name c1 'diameter'
MTB > name c2 'volume'
MTB > plot c2 c1
[Character scatterplot: volume (vertical axis, 0 to 300) against diameter (horizontal axis, 15.0 to 40.0)]
Correlation
• the correlation coefficient measures the
direction and strength of the linear
association between two quantitative
variables
• given data (xi, yi), i = 1, . . . , n, the
correlation coefficient is
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
• the product of the two terms in
parentheses is positive if xi and yi are
both above or both below their means
• r must be between -1 and 1
• r = 0 means no linear association
• r = 1 (r = −1) means all points fall
exactly on a line with positive (negative)
slope
• calculating correlation coefficient in
MINITAB
MTB > corr c1 c2
Correlation of diameter and volume = 0.976
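As a cross-check on the MINITAB result, the definition of r can be applied directly to the tree data. The following is a minimal Python sketch, not part of the original MINITAB session; the function names `mean`, `sample_sd` and `correlation` are illustrative choices.

```python
from math import sqrt

# Tree data from the MINITAB session above (c1 = diameter, c2 = volume).
diameter = [36, 28, 28, 41, 19, 32, 22, 38, 25, 17,
            31, 20, 25, 19, 39, 33, 17, 37, 23, 39]
volume = [192, 113, 88, 294, 28, 123, 51, 252, 56, 16,
          141, 32, 86, 21, 231, 187, 22, 205, 57, 265]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """r = 1/(n - 1) times the sum of products of standardized values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)


r_tree = correlation(diameter, volume)
print(round(r_tree, 3))  # 0.976, agreeing with the MINITAB output
```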
• some sample plots
[Figure: six scatterplots of y against x, arranged in three rows of two, labelled r = −0.67 (top left), r = 0 (top right), r = 0.42 (middle left), r = 0.86 (middle right), r = −0.1 (bottom left) and r = 0.79 (bottom right)]
• top left – moderately strong negative
linear association (r = −.67)
• top right – no association (r = 0)
• middle left – weak positive association
(r = .42)
• middle right – strong positive association
(r = .86)
• bottom left – strong quadratic
association (essentially no linear
association, r = −.1)
• bottom right – perfect negative
association with one influential outlier
(r = .79)
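The last two cases can be imitated with small made-up data sets. The sketch below uses hypothetical points (not the ones in the plots) to show that a perfect quadratic pattern can have r of essentially zero, and that a single outlier can turn a perfect negative association into a strongly positive r.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient from the standardized-product definition."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical quadratic pattern: y is completely determined by x,
# yet the linear association is essentially zero.
x_quad = [-2, -1, 0, 1, 2]
y_quad = [xi ** 2 for xi in x_quad]
r_quad = correlation(x_quad, y_quad)

# Hypothetical perfect negative line, then the same points plus one outlier.
x_line = [1, 2, 3, 4, 5]
y_line = [5, 4, 3, 2, 1]
r_line = correlation(x_line, y_line)
r_outlier = correlation(x_line + [20], y_line + [20])

print(r_quad, r_line, r_outlier)
```

Here r_line is −1 up to rounding, while adding the single point (20, 20) makes r_outlier positive and large, which is why r should always be read alongside a plot.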
Alternative Formulae
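• the numerator of the formula for r is
$$SS_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
• so
$$r = \frac{1}{n-1}\,\frac{SS_{XY}}{s_x s_y}$$
• we can also write
$$SS_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad\text{and}\quad s_x = \sqrt{\frac{SS_{XX}}{n-1}}$$
so
$$r = \frac{SS_{XY}}{\sqrt{SS_{XX}\,SS_{YY}}}$$
where $SS_{YY}$ is defined similarly to $SS_{XX}$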
• note that SSXY can be written in
various ways
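$$\begin{aligned}
SS_{XY} &= \sum_{i=1}^{n} (x_i - \bar{x})\,y_i \\
&= \sum_{i=1}^{n} x_i\,(y_i - \bar{y}) \\
&= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \\
&= \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
\end{aligned}$$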
• the version to use depends on what you
are given
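The reason these versions agree is that deviations from a mean always sum to zero. A short sketch of the missing step:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = 0$$

so that

$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
&= \sum_{i=1}^{n} (x_i - \bar{x})\,y_i - \bar{y}\sum_{i=1}^{n} (x_i - \bar{x}) \\
&= \sum_{i=1}^{n} (x_i - \bar{x})\,y_i \\
&= \sum_{i=1}^{n} x_i y_i - \bar{x}\sum_{i=1}^{n} y_i
= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{aligned}$$

The version with $x_i(y_i - \bar{y})$ follows in the same way with the roles of x and y exchanged.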
Example: To study the effect of ozone
pollution on soybean yield, data were
collected at four ozone dose levels and the
resulting soybean seed yield was monitored.
Ozone dose levels (in ppm) were reported as
the average ozone concentration during the
growing season. Soybean yield was reported
in grams per plant.
X             Y
Ozone (ppm)   Yield (gm/plant)
.02           242
.07           237
.11           231
.15           201
• to calculate the correlation coefficient by
hand, we obtain the squares and cross
products and their sums
X      Y      X^2      Y^2      XY
.02    242    .0004    58564     4.84
.07    237    .0049    56169    16.59
.11    231    .0121    53361    25.41
.15    201    .0225    40401    30.15
• Column sums: $\sum x_i = .35$, $\sum y_i = 911$,
$\sum x_i^2 = .0399$, $\sum y_i^2 = 208{,}495$, and
$\sum x_i y_i = 76.99$
• Means: x̄ = .35/4 = .0875 and ȳ = 911/4 = 227.75
• Intermediate terms:
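$$SS_{XX} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n} = .0399 - \frac{(.35)^2}{4} = .009275$$
$$SS_{YY} = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n} = 208{,}495 - \frac{(911)^2}{4} = 1014.75$$
and
$$SS_{XY} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n} = 76.99 - \frac{.35(911)}{4} = -2.7225$$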
• the correlation coefficient is
$$r = \frac{SS_{XY}}{\sqrt{SS_{XX}\,SS_{YY}}} = \frac{-2.7225}{\sqrt{.009275(1014.75)}} = -.8874$$
• there is a strong negative linear
association between yield and ozone
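The hand calculation can be reproduced in a few lines. This is a minimal Python sketch of the same steps (column sums, then the three sums of squares, then r); the variable names are illustrative and not from the original notes.

```python
from math import sqrt

ozone = [0.02, 0.07, 0.11, 0.15]   # X, ozone dose in ppm
yield_g = [242, 237, 231, 201]     # Y, yield in grams per plant
n = len(ozone)

# Column sums, as in the table of squares and cross products.
sum_x = sum(ozone)
sum_y = sum(yield_g)
sum_xx = sum(x * x for x in ozone)
sum_yy = sum(y * y for y in yield_g)
sum_xy = sum(x * y for x, y in zip(ozone, yield_g))

# Computational forms of the sums of squares and cross products.
ss_xx = sum_xx - sum_x ** 2 / n
ss_yy = sum_yy - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

r_ozone = ss_xy / sqrt(ss_xx * ss_yy)
print(round(ss_xx, 6), round(ss_yy, 2), round(ss_xy, 4), round(r_ozone, 4))
```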
Simple Linear Regression
• a line summarizing the relationship
between two variables
• has form y = β0 + β1x
– must choose which variable is the
response y and which the
explanatory variable x
– β0 is the y-intercept, the value for y
when x = 0
– β1 is the slope, the change in y for a
unit change in x
• can be used to predict value of y for a
given x
• obtain by minimizing the sum of squares
of vertical deviations from the line
$$SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
• note that SSE is a function of β0 and β1
only, because the data (xi, yi),
i = 1, . . . , n are known
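To make the idea of SSE as a function of the two coefficients concrete, the sketch below evaluates it for a few candidate lines on the ozone data. The first two candidate lines are arbitrary guesses chosen only for comparison; the third uses the least squares coefficients reported later in these notes.

```python
# Ozone data from the example earlier in these notes.
ozone = [0.02, 0.07, 0.11, 0.15]   # explanatory variable x (ppm)
yield_g = [242, 237, 231, 201]     # response y (grams per plant)


def sse(beta0, beta1, x, y):
    """Sum of squared vertical deviations of the points from the line."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Two arbitrary candidate lines (hypothetical guesses) and the
# least squares coefficients given later in the notes.
sse_guess_a = sse(250.0, -250.0, ozone, yield_g)
sse_guess_b = sse(260.0, -350.0, ozone, yield_g)
sse_least_squares = sse(253.434, -293.531, ozone, yield_g)

print(sse_guess_a, sse_guess_b, sse_least_squares)
```

Both guesses give a larger SSE than the least squares line, as they must.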
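• the least squares slope has a surprisingly
simple formula
$$\hat{\beta}_1 = r\,\frac{s_y}{s_x}$$
• the fitted intercept is
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
• the equation of the least squares line is
$$\begin{aligned}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 x \\
&= \bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x \\
&= \bar{y} + \hat{\beta}_1 (x - \bar{x}) \\
&= \bar{y} + r\,\frac{s_y}{s_x}\,(x - \bar{x})
\end{aligned}$$
• from the last formula, we see that the
fitted value when x = x̄ is ȳ, so the least
squares line always goes through the
point (x̄, ȳ)
• rearranging further, we get
$$\frac{\hat{y} - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}$$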
• this gives another interpretation of the
correlation coefficient, namely that it is
the slope of the best fitting line if both
the x and y variables are standardized
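This interpretation can be checked numerically. The sketch below standardizes the ozone data from the earlier example and computes the least squares slope for the standardized variables as the sum of cross products of deviations divided by the sum of squared x deviations (the form SSXY/SSXX derived later in these notes). The helper names are illustrative.

```python
from math import sqrt

ozone = [0.02, 0.07, 0.11, 0.15]
yield_g = [242, 237, 231, 201]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / s for v in values]


def least_squares_slope(x, y):
    """Cross products of deviations over squared x deviations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    return ss_xy / ss_xx


z_x = standardize(ozone)
z_y = standardize(yield_g)
slope_standardized = least_squares_slope(z_x, z_y)
print(round(slope_standardized, 4))  # -0.8874, the value of r found by hand
```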
Example: for the tree data, ȳ = 123.0,
x̄ = 28.45, r = .976, sy = 91.7 and sx = 8.11
• the estimated slope is
β̂1 = r sy/sx = .976(91.7)/8.11 = 11.036
• the estimated intercept is
β̂0 = ȳ − β̂1x̄ = 123.0 − 11.036(28.45)
= −190.97
• the fitted line is
volume = −190.97 + 11.036 diameter
• if the diameter were 27 inches, we would
predict a volume of
−190.97 + 11.036(27) = 107.00
(in board feet/10)
• these results differ slightly from MINITAB
due to round-off error
MTB > regress c2 1 c1;
SUBC> residuals c3.
The regression equation is
volume = - 191 + 11.0 diameter

Predictor       Coef     Stdev   t-ratio       p
Constant     -191.12     16.98    -11.25   0.000
diameter     11.0413    0.5752     19.19   0.000

s = 20.33    R-sq = 95.3%    R-sq(adj) = 95.1%

Analysis of Variance
SOURCE        DF        SS        MS         F       p
Regression     1    152259    152259    368.43   0.000
Error         18      7439       413
Total         19    159698
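The MINITAB coefficients can be reproduced from the raw data without intermediate rounding. The Python sketch below uses the slope in the form SSXY/SSXX, which is algebraically the same as r sy/sx; the `predict` helper is an illustrative name, not something from the MINITAB session.

```python
# Tree data from the MINITAB session (c1 = diameter, c2 = volume).
diameter = [36, 28, 28, 41, 19, 32, 22, 38, 25, 17,
            31, 20, 25, 19, 39, 33, 17, 37, 23, 39]
volume = [192, 113, 88, 294, 28, 123, 51, 252, 56, 16,
          141, 32, 86, 21, 231, 187, 22, 205, 57, 265]

n = len(diameter)
x_bar = sum(diameter) / n
y_bar = sum(volume) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(diameter, volume))
ss_xx = sum((x - x_bar) ** 2 for x in diameter)

beta1_hat = ss_xy / ss_xx              # least squares slope
beta0_hat = y_bar - beta1_hat * x_bar  # least squares intercept


def predict(d):
    """Predicted volume (board feet/10) for a diameter of d inches."""
    return beta0_hat + beta1_hat * d


print(round(beta1_hat, 4), round(beta0_hat, 2))  # 11.0413 -191.12
print(round(predict(27), 1))                     # 107.0
```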
Example: For the ozone data,
• β̂1 = SSXY/SSXX = −2.7225/.009275
= −293.531
• β̂0 = ȳ − β̂1x̄ =
227.75 − (−293.531)(.0875) = 253.434
• the least squares line is
$$\widehat{\text{yield}} = 253.434 - 293.531\,\text{ozone}$$
Derivation of formulae for intercept and slope
• those who have taken calculus will know
that one can use derivatives to find the
maximum or minimum of a function
• in this case there are two variables β0
and β1, and so both partial derivatives
can be set to zero and solved
• the partial derivatives are
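$$\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)$$
and
$$\frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i)$$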
• when equated to zero and rearranged,
these give the so-called “normal
equations”.
$$n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
and
$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
• (the term “normal” here has nothing to
do with the normal distribution; it refers
instead to the geometric idea of
orthogonality or perpendicularity)
• the two normal equations are solved
simultaneously to obtain
$$\hat{\beta}_1 = \frac{SS_{XY}}{SS_{XX}}$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
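As a numerical check, the two normal equations can be solved directly as a 2 × 2 linear system. The sketch below does this for the ozone data using Cramer's rule, which is one convenient way to solve such a system and is an assumption of this example rather than a method stated in the notes.

```python
ozone = [0.02, 0.07, 0.11, 0.15]
yield_g = [242, 237, 231, 201]
n = len(ozone)

sum_x = sum(ozone)
sum_y = sum(yield_g)
sum_xx = sum(x * x for x in ozone)
sum_xy = sum(x * y for x, y in zip(ozone, yield_g))

# Normal equations:
#   n     * b0 + sum_x  * b1 = sum_y
#   sum_x * b0 + sum_xx * b1 = sum_xy
det = n * sum_xx - sum_x ** 2
beta0_hat = (sum_y * sum_xx - sum_x * sum_xy) / det
beta1_hat = (n * sum_xy - sum_x * sum_y) / det

print(round(beta0_hat, 3), round(beta1_hat, 3))

# Residuals from the fitted line satisfy both normal equations:
# they sum to zero and are orthogonal to the x values.
residuals = [y - beta0_hat - beta1_hat * x for x, y in zip(ozone, yield_g)]
```

The solution agrees with SSXY/SSXX and ȳ − β̂1x̄ computed from the sums of squares.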
• there is a second derivation which
doesn’t require calculus
• notice that SSE is a quadratic function
of β0 and β1
• first consider β1 to be fixed and find the
value of β0 which minimizes SSE
• expanding the square and summing
gives
$$SSE = A\beta_0^2 + B\beta_0 + C$$
where
$$A = n, \qquad B = 2\beta_1 \sum x_i - 2\sum y_i, \qquad C = \sum y_i^2 - 2\beta_1 \sum x_i y_i + \beta_1^2 \sum x_i^2$$
• the minimum of a quadratic occurs at
−B/2A, so whatever the value of β1 the
best choice for β0 is
$$\beta_0 = \frac{-2\beta_1 \sum x_i + 2\sum y_i}{2n} = \bar{y} - \beta_1 \bar{x}$$
• now substitute this choice into SSE, so
that it is now a quadratic function of β1
only
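$$\begin{aligned}
SSE &= \sum_{i=1}^{n} \left(y_i - (\bar{y} - \beta_1 \bar{x}) - \beta_1 x_i\right)^2 \\
&= \sum_{i=1}^{n} \left(y_i - \bar{y} - \beta_1 (x_i - \bar{x})\right)^2 \\
&= A\beta_1^2 + B\beta_1 + C
\end{aligned}$$
• where now
$$A = SS_{XX}, \qquad B = -2\,SS_{XY}, \qquad C = SS_{YY}$$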
• the minimum occurs at
$$\hat{\beta}_1 = \frac{-B}{2A} = \frac{SS_{XY}}{SS_{XX}}$$
• substituting into the expression for β0 gives
β̂0 = ȳ − β̂1x̄
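The second step of this derivation can also be verified numerically. The sketch below uses the ozone data to check that, once β0 is set to ȳ − β1x̄, SSE really is the quadratic Aβ1² + Bβ1 + C with A = SSXX, B = −2SSXY and C = SSYY, and that its minimum is at −B/(2A). The trial slopes are arbitrary values chosen only for the check.

```python
ozone = [0.02, 0.07, 0.11, 0.15]
yield_g = [242, 237, 231, 201]
n = len(ozone)
x_bar = sum(ozone) / n
y_bar = sum(yield_g) / n

ss_xx = sum((x - x_bar) ** 2 for x in ozone)
ss_yy = sum((y - y_bar) ** 2 for y in yield_g)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ozone, yield_g))

A, B, C = ss_xx, -2 * ss_xy, ss_yy


def sse_profile(beta1):
    """SSE with the intercept set to its best value for this slope."""
    beta0 = y_bar - beta1 * x_bar
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(ozone, yield_g))


# Arbitrary trial slopes: the direct SSE matches the quadratic form.
for b in (-400.0, -293.5, 0.0, 100.0):
    assert abs(sse_profile(b) - (A * b * b + B * b + C)) < 1e-6

beta1_hat = -B / (2 * A)               # vertex of the quadratic
beta0_hat = y_bar - beta1_hat * x_bar
print(round(beta1_hat, 3), round(beta0_hat, 3))
```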