Statistical Inference STAT 431
Lecture 16: Multiple Regression (III) Specially Constructed Predictor Variables
Example: Arrival Pattern of Calls to a Financial Call Center
• The data set is from the call center of a major US bank
• Each data point contains two measurements
– Time: a time period of a quarter hour in a day (e.g. 7, 7.25, etc.) [predictor]
– N: total number of calls received in the corresponding quarter hour [response]
• Information about call arrival rates is needed in order to help plan staffing levels, i.e., (# agents on hand) at different times of day
• The current data is for the week of 07/15/2002 – 07/19/2002
• Sample size = 340 = 5 (days) × 68 (quarter-hour periods from 7AM to midnight)
– So, we have 68 different values of Time
• Total number of calls in 5 days = 277,680
• It has been found that transforming N to √N leads to reasonable homoscedasticity in the scatter plot [see plot]
[Scatter plot: Root N (√N) vs. Time — the spread is roughly constant, but the trend is clearly non-linear]
• We still suffer from non-linearity
• Note that Tukey’s bulging rule does not work here – no monotone transformation of the predictor leads to a linear relationship
• In other words, the simple regression technique does not work for this data set!
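For reference, a plot like the one above can be reproduced in base R; a minimal sketch, assuming the callcenter.csv file used in the fits below:
> callarr <- read.csv("callcenter.csv", header = TRUE)  # columns Time and N
> plot(sqrt(N) ~ Time, data = callarr, xlab = "Time", ylab = "Root N")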
Polynomial Regression
• Basic idea: add powers of x as additional predictor variables
• This results in a polynomial regression model:
  y = β0 + β1 x + β2 x² + · · · + βk x^k + ε
– In general, there could be powers of several different predictors present in the model, e.g.:
  y = β0 + β1 x1 + β2 x1² + β3 x2 + β4 x2² + ε
• We treat different powers of x as different predictors in multiple regression [because they are linearly independent]: x1 = Time, x2 = Time², . . . , xk = Time^k
• Connection to multiple regression:
– Estimating the βj’s by LS estimators
– Hypothesis testing and CIs for the coefficients using the t-distribution and F-distribution
– Confidence and prediction intervals for the response
• Major difference:
– The usual interpretation of regression coefficients in multiple regression no longer applies [it is impossible to change x while holding x² fixed]
Fitting Polynomial Regression in R
• To fit a polynomial regression in R, we use the usual least squares fitting command lm together with the polynomial construction command poly:
> callarr <- read.csv("callcenter.csv", header = TRUE)
> callfit2 <- lm(sqrt(N) ~ poly(Time, degree = 2, raw = TRUE), data = callarr)
> summary(callfit2)
Call:
lm(formula = sqrt(N) ~ poly(Time, degree = 2, raw = TRUE), data = callarr)
…
Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                         -10.914447   2.192613  -4.978 1.03e-06 ***
poly(Time, degree = 2, raw = TRUE)1   7.001833   0.304302  23.009  < 2e-16 ***
poly(Time, degree = 2, raw = TRUE)2  -0.267495   0.009797 -27.304  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.889 on 337 degrees of freedom
Multiple R-squared: 0.822, Adjusted R-squared: 0.8209
F-statistic: 778 on 2 and 337 DF, p-value: < 2.2e-16
Determining the Order of the Polynomial
• Question: Is it necessary to include the k-th order term in addition to the existing (k−1) lower-order terms?
• Cast as a model comparison problem:
– Full model: y = β0 + β1 x + · · · + βk x^k + ε
– Reduced model: y = β0 + β1 x + · · · + βk−1 x^(k−1) + ε
– We can decide by testing H0 : βk = 0 in the full model
• The test can be performed in two ways using either a t-test or an F-test
• In practice, we could keep increasing the order of the polynomial as long as the test of H0 : βk = 0 still rejects the null hypothesis [at level α = 0.05 or 0.01]
• Illustration on the call center data [see R code]
• For the call center data, the final model we select is a 4-th order polynomial
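A minimal sketch of this sequential procedure on the call center data, reusing the callarr data frame and callfit2 from above; each anova call is the partial F-test of the highest-order coefficient:
> callfit3 <- lm(sqrt(N) ~ poly(Time, degree = 3, raw = TRUE), data = callarr)
> callfit4 <- lm(sqrt(N) ~ poly(Time, degree = 4, raw = TRUE), data = callarr)
> callfit5 <- lm(sqrt(N) ~ poly(Time, degree = 5, raw = TRUE), data = callarr)
> anova(callfit2, callfit3)  # tests H0: beta3 = 0
> anova(callfit3, callfit4)  # tests H0: beta4 = 0
> anova(callfit4, callfit5)  # tests H0: beta5 = 0; stopping when this fails to reject yields the 4th-order model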
Example: Timing Production Runs
• The time [minutes] required to complete a production run increases with the number of items produced. Data were collected on the time required to process 20 randomly selected orders supervised by each of three managers.
• Here are a few entries from the data set:
Manager   Time For Run   Run Size
a         252            204
a         215            103
...       ...            ...
b         229            297
b         208            96
...       ...            ...
c         195            175
c         215            189
...       ...            ...
• Question: How do the managers compare?
Need to adjust for the effect of different Run Sizes!
• Here is a scatter plot of Time for Run vs. Run Size, with different colors representing different managers. (Black: a; Red: b; Green: c)
[Scatter plot: Time For Run vs. Run Size, colored by manager (black: a; red: b; green: c)]
Dummy Variables
• If we assume a common linear effect of Run Size on Time for Run for all three managers, then we could write three parallel simple regressions:
– For manager a: Time for Run = β0,a + β1 × Run Size
– For manager b: Time for Run = β0,b + β1 × Run Size
– For manager c: Time for Run = β0,c + β1 × Run Size
• Now, comparing the managers is equivalent to comparing the three intercepts
• It is more convenient to organize the three parallel regressions into one big model:
Time for Run = β0 + β1 × Run Size + β2 × I(manager b) + β3 × I(manager c)
– Here, I(manager b) = 1 for any data entry corresponding to manager b, and equals 0 otherwise; I(manager c) is defined similarly
– In the big model: β0,a = β0, β0,b = β0 + β2, β0,c = β0 + β3
– The indicators I(manager b) and I(manager c) are called dummy variables
• Are there any other assumptions made? [Hint: count the number of parameters]
• In the production data set, the variable Manager is a categorical variable with 3 categories, which is then represented by the two dummy variables I(manager b), I(manager c)
• We can include a categorical variable with k categories in a regression analysis by introducing k−1 dummy variables [see the design-matrix sketch at the end of this slide]
– One of the categories serves as the reference/control level, or baseline
– E.g., in the production data, manager = a serves as the reference level
– Question: Why not use k dummy variables?
• The multiple regression machinery is then used to do statistical inference
– LS estimators, testing, CIs, PIs, etc.
• Interpretation: the regression coefficient for each dummy variable represents the effect on the response of changing the categorical variable from the reference level to the level that particular dummy variable stands for
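To see the dummy coding explicitly, one can inspect the design matrix that R builds from the factor; a minimal sketch, assuming the prodtime data frame loaded in the next slide's code:
> head(model.matrix(~ RunSize + factor(Manager), data = prodtime))
# columns: (Intercept), RunSize, factor(Manager)b, factor(Manager)c;
# rows for manager a carry 0 in both dummy columns (a is the baseline)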
Dummy Variables in R
• To fit a multiple regression model with a categorical predictor, we could apply the usual fit function lm together with the dummy variable construction function factor applied to the categorical predictor:
> prodtime <- read.csv("prodtime.csv", header = TRUE)
> prodfit <- lm(TimeForRun ~ RunSize + factor(Manager), data = prodtime)
> summary(prodfit)
Call:
lm(formula = TimeForRun ~ RunSize + factor(Manager), data = prodtime)
…
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      215.11848    6.14820  34.989  < 2e-16 ***
RunSize            0.24337    0.02508   9.705 1.34e-13 ***
factor(Manager)b -53.06082    5.24159 -10.123 2.93e-14 ***
factor(Manager)c -62.16817    5.18003 -12.002  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.38 on 56 degrees of freedom
Multiple R-squared: 0.8131, Adjusted R-squared: 0.8031
F-statistic: 81.23 on 3 and 56 DF, p-value: < 2.2e-16
• Back to the question: Are the three managers equally efficient?
• Under the model
Time for Run = β0 + β1 × Run Size + β2 × I(manager b) + β3 × I(manager c)
we can answer the above question by testing H0 : β2 = β3 = 0 vs. the alternative that H0 is false, using an F-test.
• We conduct the test in R, and reject H0 [that the managers are equivalent] at level 0.05.
> prodfit2 <- lm(TimeForRun ~ RunSize, data = prodtime)
> anova(prodfit2, prodfit)
Analysis of Variance Table

Model 1: TimeForRun ~ RunSize
Model 2: TimeForRun ~ RunSize + factor(Manager)
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1     58 59792
2     56 15018  2     44774 83.477 < 2.2e-16 ***
Example: Timing Production Runs
• The analysis we have just done assumes a common linear effect of Run Size on Time for Run for all three managers.
[Scatter plot (repeated): Time For Run vs. Run Size, colored by manager — the slopes appear to differ]
• Looking at the scatter plot again, it might make sense to have different slopes in the simple regressions for different managers
• But as we have seen, to compare managers, it is easier if we include everything in a single big model
• Can we construct a single model here?
Interaction
• Answer: Yes, we can!
• Basic idea: include interaction terms in the multiple regression model:
Time for Run = β0 + β1 × Run Size
             + β2 × I(manager b) + β3 × I(manager c)
             + β4 × Run Size × I(manager b) + β5 × Run Size × I(manager c)
• Then, we identify the parameters in the above model with those in separate simple regressions for each manager:
– Manager a: Time for Run = β0,a + β1,a × Run Size, where β0,a = β0, β1,a = β1
– Manager b: Time for Run = β0,b + β1,b × Run Size, where β0,b = β0 + β2, β1,b = β1 + β4
– Manager c: Time for Run = β0,c + β1,c × Run Size, where β0,c = β0 + β3, β1,c = β1 + β5
Fitting a Model with Interaction Terms
• In R, we fit the model on the previous page as follows. Pay special attention to the formula we use in the lm function. The symbol * tells R to include all the interaction terms [and also the individual terms!].
> prodfit3 <- lm(TimeForRun ~ RunSize * factor(Manager), data = prodtime)
> summary(prodfit3)
Call:
lm(formula = TimeForRun ~ RunSize * factor(Manager), data = prodtime)
…
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)               202.53415    9.26933  21.850  < 2e-16 ***
RunSize                     0.30727    0.04358   7.051 3.41e-09 ***
factor(Manager)b          -16.04029   14.82719  -1.082   0.2841
factor(Manager)c          -52.78645   12.25963  -4.306 7.06e-05 ***
RunSize:factor(Manager)b   -0.17049    0.06492  -2.626   0.0112 *
RunSize:factor(Manager)c   -0.04802    0.05639  -0.852   0.3982
…
Interpretation
• Back to the model:
Time for Run = β0 + β1 × Run Size
             + β2 × I(manager b) + β3 × I(manager c)
             + β4 × Run Size × I(manager b) + β5 × Run Size × I(manager c)
• If H0 : β4 = β5 = 0 is rejected, then we say that there is significant interaction between Run Size and Manager. This means that the impact of Run Size on Time for Run differs among the managers.
– For this data set, the F-test for H0 : β4 = β5 = 0 gives P-value = 0.033 [see the R call below]. So, the interaction is significant at level 0.05.
• In particular, β4 stands for the change of the effect on Time for Run per unit change of Run Size when Manager b instead of Manager a supervises the production; β5 has an analogous interpretation
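The interaction F-test quoted above is a comparison of the nested fits already computed; a minimal sketch:
> anova(prodfit, prodfit3)  # partial F-test of H0: beta4 = beta5 = 0 (P-value 0.033, per the slide above)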
Class Summary
• Key points of this class
– Polynomial regression
– Dummy variables for categorical predictors
– Interaction terms in multiple regression
– Interpretation of the coefficients
• Read Section 11.6 of the textbook
• Next class: Multiple Regression (IV) – Collinearity