
Contents
Multiple Regression
I. Building Models
    I.1 Simple Models
    I.2 Models with Factors and Interactions
        I.2.1 A Model with One Factor (ANOVA)
        I.2.2 A Model with One Factor and One Quantitative Variable (ANCOVA)
        I.2.3 Model with Many Factors (MANOVA)
        I.2.4 A Model with Many Factors and Quantitative Variables
        I.2.5 A Model with Interaction Terms (Effects)
    I.3 Regression Splines
    I.4 Standardized Regression Coefficients
    I.5 Bootstrapping Regression Models
Note: To access your textbook resources type the following on the console:
I. Building Models
The Multiple Linear Regression (MR) model is a natural extension of the SLR framework that allows for many predictors. However, the model is still linear in the parameters and thus, has the form
y = β1 + β2x2 + β3x3 + · · · + βkxk + e,
where the parameters βk = ∂E[y]/∂xk are referred to as the partial marginal effects. The assumptions of the
model are given in the table below:
When working with more than one predictor, many more issues can arise when building an appropriate model. Some of the more common ones are listed below (a short diagnostic sketch in R follows the list):
#library(car)
#carWeb()

1. Multicollinearity
2. Omitted Variable Bias
3. Variable Selection
4. Irrelevant Variables
5. Heteroskedasticity
6. Model Misspecification
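As a preview, several of these issues can be screened for with standard diagnostics once a model has been fit. The sketch below is not from the text; it assumes the car and lmtest packages are installed and uses the built-in mtcars data purely as a stand-in.
# Quick diagnostic screens for a fitted lm object (illustrative only):
library(car)
library(lmtest)
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)  # any fitted model
vif(mod)        # variance inflation factors: large values flag multicollinearity
bptest(mod)     # Breusch-Pagan test for heteroskedasticity
resettest(mod)  # Ramsey RESET test for functional-form misspecification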
We will discuss each one in detail and learn how to mitigate them, but first we will introduce the subject with a simple example. Consider the U.S. gasoline market. We will estimate a model for gasoline consumption/person using the predictors income/person, the price of a gallon of gasoline, the price of new cars, and the price of used cars. We will consider taking logs to help equalize the variances. The model is therefore:
log(G/pop) = β1 + β2 log(income/pop) + β3 log(price) + β4 log(pricenew) + β5 log(priceused) + e
Question: What is the sign of β4? Cars and gasoline are complements:
• If the prices of new cars rise, then gasoline consumption should fall, i.e., β4 < 0.
• If the prices of new cars rise, consumers will buy fewer of them and keep their used cars longer; if older cars use more gasoline than newer ones, consumption should increase, i.e., β4 > 0.
## Call: lm(formula = log(gas/population) ~ log(price) + log(income) + log(newcar)
## + log(usedcar), data = as.data.frame(USGasG))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.34184 0.67489 -18.287 <2e-16 *** ## log(price) -0.05910 0.03248 -1.819 0.0786 . ## log(income) 1.37340 0.07563 18.160 <2e-16 *** ## log(newcar) -0.12680 0.12699 -0.998 0.3258 ## log(usedcar) -0.11871 0.08134 -1.459 0.1545 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard deviation: 0.03304 on 31 degrees of freedom ## Multiple R-squared: 0.958 ## F-statistic: 176.7 on 4 and 31 DF, p-value: < 2.2e-16 ## AIC BIC ## -136.75 -127.25 # It looks like economic theory wins since beta_4 < 0. library(AER) data("USGasG", package = "AER") reg.mod <- lm(log(gas/population) ~ log(price) + log(income) S(reg.mod) + log(newcar) + log(usedcar),data = as.data.frame(USGasG)) 2 I.1 Simlpe Models Consider a model for sales revenue (in $1000s) as a function of advertising expenditure (in $1000s) and price/burger of a local burger restaurant. Our model is: SALES = β1 + β2PRICE + β3ADVERT + e. # Example 1: Sales Revenue (y=sales) vs. Price (x1=price) & Advertising Expenditure (x2=ad library(PoEdata) library("margins") data(andy) plot(andy$price,andy$sales, xlab="Price", ylab="Sales",pch=20,ylim=c(55,95)) vert) 5.0 5.5 6.0 6.5 plot(andy$advert,andy$sales, xlab="Advertising", ylab="Sales",pch=20,ylim=c(55,95)) Price 3 Sales 60 70 80 90 0.5 1.0 1.5 2.0 Advertising # We can look at the pairwise scatterplot matrix 2.5 3.0 library(lattice) scatterplotMatrix(andy) sales 5.0 5.5 6.0 6.5 price 0.5 1.5 2.5 65 75 85 5.0 5.5 6.0 6.5 Sales 60 70 80 90 65 70 75 80 85 90 advert 0.5 1.0 1.5 2.0 2.5 3.0 # We can look at the correlation plot library(corrplot) corrplot(cor(andy)) 4 sales price advert 1 0.8 0.6 0.4 0.2 0 −0.2 −0.4 −0.6 −0.8 −1 # We can look at the hyperplane as well: mreg.mod <- lm(sales~price+advert, data=andy) persp(mreg.mod, "price", "advert") 85 80 75 70 5.0 5.5 6.0 3.0 2.5 2.0 1.5 1.0 0.5 5 sales price advert Predicted value price advert library(Rcmdr) attach(andy) scatter3d(advert, price, sales) # To visually inspect the fit we can plot y vs y_hat. A perfect fit would look like a line plot(predict(mreg.mod),andy$sales,xlab="Predicted",ylab="Actual",pch=20) abline(a=0,b=1,col="red",lwd=2) at 45o. 70 75 80 85 Predicted # The standard regression plots are generated by plotig the regression model object: par(mfrow=c(2,2)) plot(mreg.mod,pch=20,lwd=2) 6 Actual 65 70 75 80 85 90 Residuals vs Fitted Normal Q−Q −2 −1 0 1 2 Theoretical Quantiles Residuals vs Leverage 9 Cook's distance 38 3 39 70 75 80 85 Fitted values Scale−Location 75 80 85 Fitted values 70 0.00 0.02 39 3 0.04 0.06 0.08 Leverage ## ## Call: ## lm(formula = sales ~ price + advert, data = andy) ## ## Residuals: ## Min 1Q Median 3Q Max ## -13.4825 -3.1434 -0.3456 2.8754 11.3049 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.9136     6.3516  18.722  < 2e-16 ***
## price         -7.9079     1.0960  -7.215 4.42e-10 ***
## advert         1.8626     0.6832   2.726  0.00804 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.886 on 72 degrees of freedom
## Multiple R-squared: 0.4483, Adjusted R-squared: 0.4329
## F-statistic: 29.25 on 2 and 72 DF, p-value: 5.041e-10

# Effects plot
summary(mreg.mod)
library(effects)
plot(allEffects(mreg.mod))

[Figure: price and advert effect plots for the fitted model]

A popular econometric operation is to compute combinations of parameters. For example, assume Andy's marketing team asks you to test whether increasing advertising by $800/month and lowering the price by $0.40/burger will lead to higher sales revenue than their current marketing strategy of $1,500/month and $0.20, respectively. We can address this by estimating the regression SALES = β1 + β2PRICE + β3ADVERT + e, and once we have the parameter estimates, we can compute λ̂ = c1β̂2 + c2β̂3, where c1 and c2 are constants (they will be replaced with the proposed changes in PRICE and ADVERT, respectively). For example, for the proposed marketing strategy, λ̂ = −0.4β̂2 + 0.8β̂3.

# Example 1 (continued): Sales Revenue (y=sales) vs. Price (x1=price) & Advert. Expend. (x2=advert)
#install.packages("TH.data")
library(PoEdata)
library("margins")
library(foreign)
library(multcomp)
data(andy)
mreg.mod <- lm(sales~price+advert, data=andy)
summary(glht(mreg.mod, linfct = c("-0.4*price + 0.8*advert = 0")))

##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = sales ~ price + advert, data = andy)
##
## Linear Hypotheses:
##                                  Estimate Std. Error t value Pr(>|t|)
## -0.4 * price + 0.8 * advert == 0 4.6532 0.7096 6.557 7.16e-09 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
## (Adjusted p values reported — single-step method)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = sales ~ price + advert, data = andy)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## -0.2 * price + 1.5 * advert == 0 4.375 1.054 4.153 8.91e-05 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
## (Adjusted p values reported — single-step method)
# According to the output: the expected increase in sales revenue from the
# proposed strategy (price down $0.40, advertising up $800) is about $4,653.
summary(glht(mreg.mod, linfct = c("-0.20*price + 1.5*advert = 0")))
# According to the output: the expected increase in sales revenue from the
# current strategy (price down $0.20, advertising up $1,500) is about $4,375.
# The proposed strategy appears to be better for revenue.
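The same linear combination and its standard error can also be computed directly from the coefficient covariance matrix; a minimal sketch, assuming the fitted mreg.mod from Example 1 above (the weight vector holds the proposed price and advertising changes):
# lambda-hat = c'b and se(lambda-hat) = sqrt(c' V c), with V = vcov(mreg.mod)
b  <- coef(mreg.mod)                     # (Intercept), price, advert
cc <- c(0, -0.4, 0.8)                    # weights for the proposed strategy
lam.hat <- sum(cc * b)
lam.se  <- sqrt(as.numeric(t(cc) %*% vcov(mreg.mod) %*% cc))
c(estimate = lam.hat, std.error = lam.se)   # matches the glht() output above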
# Q: Assume Andy needs the proposed strategy to generate at least $5,800 in additional
# monthly sales revenue. How would you test if the proposed marketing strategy will
# help them meet this requirement?
# Method 1: Hypothesis Test: -0.4*price + 0.8*advert = 5.8 ?
library(car)
linearHypothesis(mreg.mod, "-0.4*price + 0.8*advert = 5.8")
  Res.Df      RSS Df Sum of Sq       F   Pr(>F)
1     73 1781.295
2     72 1718.943  1  62.35251 2.61171 0.110451
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = sales ~ price + advert, data = andy)
##
## Quantile = 1.9935
## 95% family-wise confidence level
# Looks like they’ll stay in business!
# Method 2: Confidence Interval: [low, high]
mod.lh <- glht(mreg.mod, linfct = c("-0.4*price + 0.8*advert = 0")) confint(mod.lh) 9 ## ## ## Linear Hypotheses: ## Estimate lwr upr ## -0.4 * price + 0.8 * advert == 0 4.6532 3.2386 6.0678 # Still in business, but $58k is too close to the upper bound of $60k. The previous example failed to capture the expected diminishing returns effect for advertising. We may account for this by also including a quadratic term in the model. For example, we might consider. sales = β1 +β2price+β3advert+β4advert2 +e. If β4 < 0, our intuition would be correct, and in fact, we can even ask and check what is the optimal level of advertising. We can do this by solving for the the marginal effect of advertising on sales, ME = ∂(sales)/∂advert = β3 + 2β4advert = 1. Why do we set it equal to 1 and not 0? Andy will advertise up to the point where another $1 of expenditure adds $1 of sales. # Example 2: Sales Revenue (y=sales) vs. Price (x1=price) & # Advertising Expenditure (x2=advert + x3 = advert^2) library(PoEdata) library(effects) data(andy) mreg.mod <- lm(sales~price+advert + I(advert^2), data=andy) persp(mreg.mod, "price", "advert") 85 80 75 70 5.0 5.5 6.0 summary(mreg.mod) 3.0 2.5 2.0 1.5 1.0 0.5 ## ## Call: ## lm(formula = sales ~ price + advert + I(advert^2), data = andy) 10 Predicted value price advert ## ## Residuals: ## Min 1Q Median 3Q Max ## -12.2553 -3.1430 -0.0117 2.8513 11.8050 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.7190
## price -7.6400
## advert 12.1512
## I(advert^2) -2.7680
## —
6.7990 16.137 < 2e-16 *** 1.0459 -7.304 3.24e-10 *** 3.5562 3.417 0.00105 ** 0.9406 -2.943 0.00439 ** ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.645 on 71 degrees of freedom ## Multiple R-squared: 0.5082, Adjusted R-squared: 0.4875 ## F-statistic: 24.46 on 3 and 71 DF, p-value: 5.6e-11 plot(allEffects(mreg.mod)) price effect plot advert effect plot 0.5 1.0 1.5 2.0 2.5 3.0 advert 85 80 75 80 78 76 74 72 70 80 78 76 74 72 70 5.0 5.5 price 6.0 6.5 advert effect plot 0.5 1.0 1.5 2.0 2.5 3.0 advert # Q: From the effects plots, what would be the optimal level of advertising? # We can use the 'margins' library to estimate the Averag Marginal Effect (AME) library("margins") m <- margins(mreg.mod) summary(m) 11 sales sales sales factor AME SE z p advert 1.942988 0.6500951 2.988775 0.002801 price -7.640000 1.0459308 -7.304499 0.000000 lower upper 0.6688248 3.217151 -9.6899867 -5.590013 plot(m) 4 2 0 −2 −4 −6 −8 −10 price advert summary(margins(mreg.mod, variables = "advert")) factor AME SE z p advert 1.942988 0.6500951 2.988775 0.002801 ## xvals yvals upper lower ## 1 0.5000000 71.65245 74.58998 68.71493 ## 2 0.6083333 72.63649 75.12920 70.14379 ## 3 0.7166667 73.55556 75.67128 71.43984 ## 4 0.8250000 74.40965 76.22435 72.59495 ## 5 0.9333333 75.19878 76.79607 73.60149 ## 6 1.0416667 75.92293 77.38899 74.45688 ## 7 1.1500000 76.58212 77.99518 75.16906 ## 8 1.2583333 77.17633 78.59567 75.75700 ## 9 1.3666667 77.70558 79.16666 76.24449 ## 10 1.4750000 78.16985 79.68671 76.65299 ## 11 1.5833333 78.56915 80.13996 76.99835 ## 12 1.6916667 78.90349 80.51603 77.29095 ## 13 1.8000000 79.17285 80.80886 77.53684 ## 14 1.9083333 79.37724 81.01568 77.73881 ## 15 2.0166667 79.51667 81.13643 77.89690 ## 16 2.1250000 79.59112 81.17379 78.00845 ## 17 2.2333333 79.60060 81.13359 78.06761 lower 0.6688248 3.217151 upper # We can also show the conditional predicted value plots or conditional effect plots: cplot(mreg.mod, "advert", what = "prediction", main = "Predicted Sales Revenue, Given Adve 12 rtising") Average Marginal Effect ## 18 2.3416667 79.54511 81.02586 78.06437 ## 19 2.4500000 79.42466 80.86580 77.98351 ## 20 2.5583333 79.23923 80.67406 77.80440 80 78 76 74 72 70 Predicted Sales Revenue, Given Advertising 0.5 1.0 1.5 2.0 2.5 3.0 advert cplot(mreg.mod, "advert", what = "effect", main = "Average Marginal Effect of Advertising") Average Marginal Effect of Advertising 15 10 5 0 −5 −10 0.5 1.0 1.5 2.0 2.5 3.0 advert 13 Marginal effect of advert Predicted value # Image Plot (Contours) image(mreg.mod, "price", "advert") 5.0 5.5 6.0 6.5 􏰄 price For Andy’s Burger example, we could mathematically compute the optimal level of advertising by simply solving ME = ∂sales/∂advert = 0. In this case, however, we a nonlinear function of the parameters, namely, ME = 0 ⇐⇒ β3 + 2β4advert = 0 → advert∗ = −β3/2β4. Since we have the estimates b3 and b4, a point estimate of ME can be readily obtained as well. However, this marginal effect is a nonlinear function of the parameters, and therefore, the respective standard error is not as trivial to obtain. We can use the Delta Method to estimate the standard error for any nonlinear combination of parameters. For example, in our case, let λ = −β3/2β4, then we can estimate the variance as: var(􏰃λ) = The standard error is simply var(b3) + var(􏰃λ). Below is an example of how to do this in R. 􏰀 ∂λ 􏰁 ∂β3 􏰀 ∂λ 􏰁 ∂β4 􏰀 ∂λ 􏰁 􏰀 ∂λ 􏰁 ∂β3 ∂β4 var(b4) + 2 cov(b3, b4). # Example: Delta Method # Sales Revenue (y=sales) vs. 
Price (x1=price) & # Advertising Expenditure (x2=advert + x3 = advert^2) library(PoEdata) library(effects) data(andy) mreg.mod <- lm(sales~price+advert + I(advert^2), data=andy) deltaMethod(mreg.mod, "-b3/(2*b4)", parameterNames= paste("b", 1:4, sep="")) Estimate SE 2.5 % 97.5 % -b3/(2 * b4) 2.194978 0.1638247 1.873887 2.516068 # From the output it looks like the optimal level is $2,195. 14 72 73 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 advert 0.5 1.0 1.5 2.0 2.5 3.0 I.2 Models with Factors and Interactions The multiple regression framework can easily be extended to include factor variables and interactions between the predictors. In this section we will discuss the following models: 1. One Factor (ANOVA) 2. One Factor and Several Quantitative Variables (ANCOVA) 3. Many Factors (MANOVA) 4. Several Factors and Quantitative Variables 5. Interaction Terms I.2.1 A Model with One Factor (ANOVA) We can consider a linear regression model where there are no predictors and the response variable is an indicator variable. This model is known as a ‘One-Way Analysis of Variance (ANOVA)’. Following the textbooks example, consider the study by Moore and McCabe, 1993 on teaching reading using 3 different methods (Standard =Basal (baseline or control group), New method 1 = DTRA, and New method 2 = Strat). The data consist of 2 pretests and 3 post tests of reading comprehension, but we will only focus on the third post test. group pretest.1 pretest.2 post.test.1 post.test.2 post.test.3 Basal 4 3 5 4 41 Basal 6 5 9 5 41 Basal 9 4 5 3 43 Basal 12 6 8 5 46 Basal 16 5 10 9 46 Basal 15 13 9 8 45 # Example 3: Reading Compression Scores based on Different Teaching Methods library(car) head(Baumann) # Step 1: Count the number of observations in ech group xtabs(~group,data=Baumann) Basal DRTA Strat 22 22 22 # Step 2: Compute the means for each method with(Baumann, tapply(post.test.3, group, mean)) ## Basal ## 41.04545 46.72727 44.27273 ## Basal DRTA Strat ## 5.635578 7.388420 5.766750 DRTA Strat # Step 3: Compute the std. devs for each method with(Baumann, tapply(post.test.3, group, sd)) 15 # Step 4: Plot the results plot(post.test.3 ~ group, data=Baumann, xlab="Group", ylab="Reading Score") Basal DRTA Strat Group ## Call: lm(formula = post.test.3 ~ group, data = Baumann) ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
# Step 5: Estimate the one-way ANOVA model:
S(baum.mod.1 <- lm(post.test.3 ~ group, data=Baumann)) ## (Intercept) 41.045 ## groupDRTA 5.682 ## groupStrat 3.227 ## --- 1.346 30.490 < 2e-16 *** 1.904 2.985 0.00404 ** 1.904 1.695 0.09498 . ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard deviation: 6.314 on 63 degrees of freedom ## Multiple R-squared: 0.1245 ## F-statistic: 4.481 on 2 and 63 DF, p-value: 0.01515 ## AIC BIC ## 435.48 444.24 # The estimates are the differences in reading scores between the baseline # Basal group and the respective one indicated. # The y-interept is the mean of the baseline group. # Note: you can change reference groupd using 'relevel' # A nicer way to look at the pairwise comprisons: library("emmeans") emmeans(baum.mod.1, pairwise ~ group) 16 Reading Score 30 35 40 45 50 55 ## $emmeans ## group emmean SE df lower.CL upper.CL ## Basal 41.0 1.35 63 ## DRTA 46.7 1.35 63 ## Strat 44.3 1.35 63 ## 38.4 43.7 44.0 49.4 41.6 47.0 ## Confidence level used: 0.95 ## ## $contrasts ## contrast ## Basal - DRTA ## Basal - Strat ## DRTA - Strat ## estimate SE df t.ratio p.value -5.68 1.9 63 -2.985 0.0111 -3.23 1.9 63 -1.695 0.2150 2.45 1.9 63 1.289 0.4064 ## P value adjustment: tukey method for comparing a family of 3 estimates # Or if you just want to look at the results w.r.t the control group: emmeans(baum.mod.1, trt.vs.ctrl ~ group) ## $emmeans ## group emmean SE df lower.CL upper.CL ## Basal 41.0 1.35 63 ## DRTA 46.7 1.35 63 ## Strat 44.3 1.35 63 ## 38.4 43.7 44.0 49.4 41.6 47.0 ## Confidence level used: 0.95 ## ## $contrasts ## contrast estimate SE df t.ratio p.value 5.68 1.9 63 2.985 0.0079 3.23 1.9 63 1.695 0.1712 ## DRTA - Basal ## Strat - Basal ## ## P value adjustment: dunnettx method for 2 tests # We can look at the respective effects plot for further insights: library(effects) plot(effect("group",baum.mod.1)) 17 group effect plot 48 46 44 42 40 Basal DRTA Strat group I.2.2 A Model with One Factor and One Quantitative Variable (ANCOVA) This case and the more general one where we can have more than one quantitative variable is known as ANCOVA for Anaylysis of Covaraince. If the factor variable (call it D) is additive, it is known as an Intercept Indicator Variable because it has the effect of shifting the regression line by an amount δ which represents the estimated coefficient of D. For example, consider modeling home prices as a function of living area (sqft) and location (D): PRICE = β1 + δD + β2SQFT + e. This suggests we have two regression lines: 􏰂 (β1+δ)+β2SQFT whenD=1 E(PRICE) = β1 + β2SQFT when D = 0. The effect of the dummy variable D can be shown graphically as: To better understand this model, we will estimate the model above where the variable utown is an indicator variable = 1 (if the house located in the university town), and = 0 (otherwise). ## ## Call: # Example 4: Home Prices(y=price) vs. Area(x1=sqft) + Location(D=utown) library(PoEdata) library(effects) data("utown", package="PoEdata") mreg.mod <- lm(price~utown+sqft, data=utown) summary(mreg.mod) 18 post.test.3 Figure 1: Intercept Dummy Variable ## lm(formula = price ~ utown + sqft, data = utown) ## ## Residuals: ## Min 1Q Median 3Q Max ## -47.536 -10.144 0.544 10.271 44.469 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6809
## utown 60.3690
## sqft 8.3557
## —
4.2902 1.324
0.9827 61.432
0.1683 49.642
0.186
<2e-16 *** <2e-16 *** ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 15.52 on 997 degrees of freedom ## Multiple R-squared: 0.8649, Adjusted R-squared: 0.8647 ## F-statistic: 3192 on 2 and 997 DF, p-value: < 2.2e-16 ## ## utown effect ## utown ## 0 0.2 0.5 0.8 1 ## 216.3242 228.3980 246.5087 264.6194 276.6932 # We can look at the adjusted means (fitted values at the various levels of the factor). # In our case, the factor utown only has two levels (=0,1) effect("utown",mreg.mod) # Note: As expected, if you atke the differene between the two effects, we get # the coefficient of utown. # Interpretation: There appears to be a statistically and economically significant 19 # 'location premium' for houses located in University Town. For example. # the same house would cost about $60k more in Univ. Town. # Q: Does the result that this location premium is the same for all houses # make subject-matter sense? I.2.3 Model with Many Factors (MANOVA) This class of models are referred to as Multiway ANOVA (MANOVA) which are simply multifactor regression models. These problems are typically characterized by examining the between and withing group variations. Following your textbook’s example, we will look at the data by Moore and Krupat (1971) on partner status and conformity. partner.status conformity fcategory fscore # Example 5: Conformity vs. Authoritarianism and Partner Status # Prepare the data and cehck the numbe of obs in each category head(Moore) low low low low low low 8 low 37 4 high 57 8 high 65 7 low 20 10 low 36 6 low 18 fcategory/partner.status low high low 10 5 medium 4 11 high 87 Moore$fcategory = factor(Moore$fcategory,levels=c("low","medium","high")) Moore$partner.status = relevel(Moore$partner.status, ref="low") xtabs(~fcategory + partner.status, data = Moore) # We can cumpute the means for all cases with(Moore, tapply(conformity, list("Authoritarianism"=fcategory, "Partener's Status"=partner.status),mean)) low medium high low high 8.900 17.40000 7.250 14.27273 12.625 11.85714 # We can cumpute the std. devs for all cases with(Moore, tapply(conformity, list("Authoritarianism"=fcategory, "Partener's Status"=partner.status),sd)) 20 low high low 2.643651 4.505552 medium 3.947573 3.951985 high 7.347254 3.933979 boxplot(conformity ~ fcategory:partner.status, data=Moore, xlab="Condition (Authorit. & Status)", ylab="Conformity") abline(v=3.5,lty="dashed",col="red",lwd=2) low.low high.low low.high Condition (Authorit. & Status) high.high # The treatment combinations are represented by the 3 boxplots on the right. # Next, we will model the data: mod.moore.1 = lm(conformity ~ fcategory*partner.status, data=Moore) summary(mod.moore.1) ## ## Call: ## lm(formula = conformity ~ fcategory * partner.status, data = Moore) ## ## Residuals: ## Min 1Q Median 3Q Max ## -8.6250 -2.9000 -0.2727 2.7273 11.3750 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          8.900      1.448   6.146 3.24e-07 ***
## fcategorymedium                     -1.650      2.709  -0.609  0.54601
## fcategoryhigh                        3.725      2.172   1.715  0.09429 .
## partner.statushigh                   8.500      2.508   3.389  0.00162 **
## fcategorymedium:partner.statushigh  -1.477      3.666  -0.403  0.68917
## fcategoryhigh:partner.statushigh    -9.268      3.451  -2.686  0.01057 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.579 on 39 degrees of freedom
## Multiple R-squared: 0.3237, Adjusted R-squared: 0.237
## F-statistic: 3.734 on 5 and 39 DF, p-value: 0.007397

with(Moore, {interaction.plot(fcategory, partner.status, conformity,
                              type = "b", pch = c(1, 16), cex = 2,
                              ylim = range(conformity), leg.bty = "o");
             points(jitter(as.numeric(fcategory), factor = 0.5), conformity,
                    pch = ifelse(partner.status == "low", "L", "H"))})

[Figure: interaction plot of mean conformity by fcategory and partner.status,
with individual observations marked L (low-status) and H (high-status)]
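To connect the dummy coding back to the cell means computed with tapply() earlier, here is a small sketch (assuming mod.moore.1 from above) that rebuilds four of the six cell means from the estimated coefficients:
# e.g. the high-authoritarianism / high-status cell mean is
# intercept + fcategoryhigh + partner.statushigh + their interaction
b <- coef(mod.moore.1)
c(low.low   = unname(b["(Intercept)"]),
  high.low  = unname(b["(Intercept)"] + b["fcategoryhigh"]),
  low.high  = unname(b["(Intercept)"] + b["partner.statushigh"]),
  high.high = unname(b["(Intercept)"] + b["fcategoryhigh"] + b["partner.statushigh"] +
                       b["fcategoryhigh:partner.statushigh"]))
# These reproduce the corresponding entries of the tapply() table of means shown earlier.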
##
## Call:
## lm(formula = conformity ~ fcategory + partner.status, data = Moore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7236 -3.1978 -0.1978 2.8831 13.8831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.19778 1.37267 7.429 4.11e-09 ***
## fcategorymedium -1.17600 1.90203 -0.618 0.5398
## fcategoryhigh -0.08089 1.80919 -0.045 0.9646
## partner.statushigh 4.60667 1.55646 2.960 0.0051 **
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
# We can remove the interaction term and review the main effects plot:
mod.moore.2 = update(mod.moore.1, .~. -fcategory:partner.status, data=Moore)
summary(mod.moore.2)

##
## Residual standard error: 4.922 on 41 degrees of freedom
## Multiple R-squared: 0.1786, Adjusted R-squared: 0.1185
## F-statistic: 2.971 on 3 and 41 DF, p-value: 0.04279
plot(allEffects(mod.moore.2))

[Figure: fcategory and partner.status effect plots for mod.moore.2]
I.2.4 A Model with Many Factors and Quantitative Variables
This model can be treated the same way as the previous one, but with more factors there are more regression equations to compare: with two factors of two levels each, for example, we now have four models to evaluate. Let's revisit the Prestige data set.
# Example 6: Prestige (y=prestige) vs. Education (x1=education),
# Income (x2=income) and Type (D=type)
library("car")
library(effects)
summary(Prestige)
   education          income          women           prestige         census        type
 Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80   Min.   :1113   bc  :44
 1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23   1st Qu.:3120   prof:31
 Median :10.540   Median : 5930   Median :13.600   Median :43.60   Median :5135   wc  :23
 Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83   Mean   :5402   NA's: 4
 3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27   3rd Qu.:8312
 Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20   Max.   :9517

mreg.mod = lm(prestige ~ education + log2(income) + type, data=Prestige)
summary(mreg.mod)
##
## Call:
## lm(formula = prestige ~ education + log2(income) + type, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.511 -3.746 1.011 4.356 18.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -81.2019    13.7431  -5.909 5.63e-08 ***
## education      3.2845     0.6081   5.401 5.06e-07 ***
## log2(income)   7.2694     1.1900   6.109 2.31e-08 ***
## typeprof       6.7509     3.6185   1.866   0.0652 .
## typewc        -1.4394     2.3780  -0.605   0.5465
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 6.637 on 93 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8493
## F-statistic: 137.6 on 4 and 93 DF, p-value: < 2.2e-16 ## ## type effect ## type ## bc prof wc ## 47.01505 53.76594 45.57565 plot(allEffects(mreg.mod)) # Based on the output we an see that the baseline group is type = bc. Therefore, # the coefficients of typeprof and typewc represent the difference in intercepts # (whch are the same as the differences in means -only in this case) between bc # and prof, and bc and wc respectively. effect("type",mreg.mod) 24 education effect plot income effect plot 50001000105002000020500300000 income 70 60 50 40 30 55 50 45 70 60 50 40 8 10 12 education 14 16 type effect plot prestige prestige prestige bc prof wc type # We can also look at a different regression model that includes the percentage of women i mreg.mod = lm(prestige ~ education + log2(income) + women, data=Prestige) plot(allEffects(mreg.mod)) 25 n the occupati education effect plot income effect plot 0 50010000105002000020500300000 income 70 60 50 40 30 56 54 52 50 48 46 70 60 50 40 30 20 10 8 10 12 14 16 education women effect plot 0 20 40 60 80 100 women 26 w to display t , data=Prestig prestige prestige prestige prestige.mod <- lm(prestige ~ education + log2(income) + women, data=Prestige) # Let's add interaction terms: # Note: In the next section we will discuss interaction terms, but here's an exmaple of ho mreg.mod = lm(prestige ~ education + log2(income) +type +log2(income):type +education:type plot(allEffects(mreg.mod),layout=c(1,3)) income*type effect plot type = wc type = prof type = bc 50010000105002000020500300000 income education*type effect plot 80 70 60 50 40 30 20 80 70 60 50 40 30 20 80 70 60 50 40 30 20 80 60 40 20 80 60 40 20 type = wc type = prof type = bc 8 10 12 14 16 education 80 60 40 20 # We can include one more interaction term prestige.mod.2 <- update(prestige.mod, ~ . -women + education:type + log2(income):type) # # Show the predictor effects plot by predictor: education and income plot(predictorEffects(prestige.mod.2, ~ education), lattice=list(key.args=list(cex=.8, cex.title=.8)), lines=list(multiline=TRUE)) 27 prestige prestige education predictor effect plot type bc prof wc 70 60 50 40 30 20 8 10 12 14 16 education plot(predictorEffects(prestige.mod.2, ~ income), lines=list(multiline=TRUE)) 28 prestige income predictor effect plot type bc prof wc 60 50 40 30 5000 10000 15000 20000 income 25000 plot(predictorEffects(prestige.mod.2, ~ type, xlevels=list(income=c(1000, 5000, 8000, 12000), education=c(8, 10, 12, 14))), lines=list(multiline=TRUE), confint=list(style="bars")) 29 prestige type predictor effect plot education 8 12 10 14 bc prof wc income = 8000 income = 12000 income = 1000 income = 5000 70 60 50 40 30 20 10 70 60 50 40 30 20 10 bc prof wc type I.2.5 A Model with Interaction Terms (Effects) An interaction effect between two predictors is represented by the product of the two variables (we can have more than two). For example, if one of the variables is factor (=D) and the other variable is quantitative (e.g., PRICE), then D × PRICE is called a slope interaction (or indicator) variable because its effect is to change the slope of the regression line. We can examine Example 4 (Home Prices vs. Location) but instead introduce an interaction term between location and area. Effectively we are capturing the influence that one predictor has on the other. 􏰂 β1 +(β2 +γ)SQFT when D = 1 E(PRICE) = β1 + β2SQFT + γ(SQFT × D) = β1 + β2SQFT when D = 0. The figure below illustrates this case. 30 prestige # Example 7: (Slope-Interaction) Home Prices(y=price) vs. 
#                            Area(x1=sqft) x Location(D=utown)
library(PoEdata)
library(effects)
data("utown", package="PoEdata")
mreg.mod <- lm(price~sqft + utown*sqft, data=utown)
summary(mreg.mod)

##
## Call:
## lm(formula = price ~ sqft + utown * sqft, data = utown)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.298  -9.970   0.301  10.526  44.363
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  23.0625     6.2392   3.696 0.000231 ***
## sqft          7.6642     0.2466  31.079  < 2e-16 ***
## utown        28.1235     8.5106   3.305 0.000985 ***
## sqft:utown    1.2795     0.3355   3.814 0.000145 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.42 on 996 degrees of freedom
## Multiple R-squared: 0.8669, Adjusted R-squared: 0.8665
## F-statistic: 2162 on 3 and 996 DF, p-value: < 2.2e-16

# Note that the slope has changed from 8.36 (w/o the interaction term) to a higher value
# of (7.66 + 1.28 = 8.94).
# We can obtain this also by looking directly at the AME:
#library("margins")
#margins(mreg.mod, at = list(utown = 0:1))
#margins(mreg.mod, variables=c("utown","sqft"))
# In this case, the slope-indicator variable suggests a higher location premium for
# larger houses than for smaller houses. For example, if we compute
# E[Price]_utown - E[Price]_other = gamma x sqft, the larger sqft, the larger the
# difference in expected prices.
# Lastly, we can also look at the adjusted means (fitted values at the various
# levels of the factor).
#effect("utown",mreg.mod)

Another possibility is that we have both intercept- and slope-interaction variables:

E(PRICE) = β1 + δD + β2SQFT + γ(SQFT × D) = (β1 + δ) + (β2 + γ)SQFT   when D = 1
                                          = β1 + β2SQFT              when D = 0.

The figure below illustrates this case.

# Example 8: (Slope and y-Intercept Interactions):
# Home Prices(y=price) vs. Area(x1=sqft) x Location(D=utown)
library(PoEdata)
library(effects)
data("utown", package="PoEdata")
mreg.mod <- lm(price~utown + sqft + utown*sqft, data=utown)
summary(mreg.mod)

##
## Call:
## lm(formula = price ~ utown + sqft + utown * sqft, data = utown)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.298  -9.970   0.301  10.526  44.363
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.0625 6.2392 3.696 0.000231 ***
## utown 28.1235 8.5106 3.305 0.000985 ***

## sqft 7.6642 0.2466 31.079 < 2e-16 *** ## utown:sqft 1.2795 0.3355 3.814 0.000145 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 15.42 on 996 degrees of freedom ## Multiple R-squared: 0.8669, Adjusted R-squared: 0.8665 ## F-statistic: 2162 on 3 and 996 DF, p-value: < 2.2e-16 Interactions between quantitative variables are very common in economics, and exemplify how the effect of one variable influences the other. For example, consider a model for pizza expenditure as a function of income and age, with an interaction variable for age and income as follows: PIZZA = β1 + β2AGE + β3INCOME + β4(AGE × INCOME) + e. In this case the effects of AGE and INCOME make more economic sense (assuming the coefficients have the expected signs) than a model without the interaction. For example, ∂PIZZA/∂AGE = β2 + β4INCOME suggests that the older the person, the lower the expenditure on pizza but even lower at higher levels of income. ## ## Call: ## lm(formula = pizza ~ age * income, data = pizza4) ## ## Residuals: ## Min 1Q Median 3Q Max ## -200.86 -83.82 20.70 85.04 254.23 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.46543 120.66341 1.338 0.1892
## age -2.97742 3.35210 -0.888 0.3803
## income 6.97991 2.82277 2.473 0.0183 *
## age:income -0.12324 0.06672 -1.847 0.0730 .
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 127 on 36 degrees of freedom
## Multiple R-squared: 0.3873, Adjusted R-squared: 0.3363
## F-statistic: 7.586 on 3 and 36 DF, p-value: 0.0004681
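By the same logic, the marginal effect of income depends on age in this model. A short sketch, assuming the age-income interaction model mreg.mod estimated in this example (income is measured in $1,000s in pizza4, as in the computation below):
# dPIZZA/dINCOME = beta_3 + beta_4*AGE, evaluated at ages 25 and 50:
ages <- c(25, 50)
b3 <- coef(mreg.mod)[["income"]]
b4 <- coef(mreg.mod)[["age:income"]]
(DpDi <- b3 + b4*ages)   # the effect of extra income shrinks with age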
# Example 9: Interactions between Quantitative variables
# Pizza Expenditure (y=pizza) vs. Age (x1=age) and Income (x2=income)
library(PoEdata)
library(effects)
data("pizza4", package="PoEdata")
mreg.mod <- lm(pizza~age*income, data=pizza4) summary(mreg.mod) # Let's compute the effect of age on pizza expenditure at two levels of income: $25k & $90 #library("margins") #margins(mreg.mod, at = list(income = c(25,90))) # We can compute them directly as well: 33 k inc <- c(25, 90) b2 <- coef(mreg.mod)[[2]] b4 <- coef(mreg.mod)[[4]] (DpDa <- b2+b4*inc) ## [1] -6.058407 -14.068965 Lastly, we will consider another popular type of model in econometrics; one that includes interactions between factor variables. As an illustration we will consider a model for hourly wages as a function of years of education, race and gender. WAGE = β1 + β2EDUC + δ1BLACK + δ2FEMALE + γ(BLACK × FEMALE) + e. The indicator variable FEMALE=1 (person is a female), and BLACK= 1(person is black). In this case we have # This suggests that the marginal effect of age is a decrease in annual spending # of $6.05 for an income of $25, and $14.07 for an income of $90k. four models to consider: E(WAGE) =  β1 + β2EDUC  (β1 + δ1) + β2EDUC  (β1 + δ2) + β2EDUC  (β1 +δ1 +δ2 +γ)+β2EDUC WHITE − BLACK BLACK − MALE WHITE − FEMALE BLACK−FEMALE. Given the four models above, we can see that the baseline (reference or control) group (BLACK = 0 and FEMALE = 0) would be WHITE-MALE. The table below shows the estimates of the model: ## ## Call: ## lm(formula = wage ~ educ + black * female, data = cps4_small) ## ## Residuals: ## Min 1Q Median 3Q Max # Example 10: Intercations between Factors # Wages (y=wage) vs. Education (x1=educ), Race (x2=black), and Gender (x3=female) library(PoEdata) library(effects) data("cps4_small", package="PoEdata") mreg.mod = lm(wage~educ+black*female, data=cps4_small) summary(mreg.mod) 34 ## -27.162 -6.912 -2.845 4.901 57.470 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.2812
## educ 2.0704
## black -4.1691
## female -4.7846
## black:female 3.8443
## —
1.9005 -2.779 0.00556 **
0.1349 15.350 < 2e-16 *** 1.7747 -2.349 0.01901 * 0.7734 -6.186 8.98e-10 *** 2.3277 1.652 0.09894 . ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.44 on 995 degrees of freedom ## Multiple R-squared: 0.2089, Adjusted R-squared: 0.2057 ## F-statistic: 65.67 on 4 and 995 DF, p-value: < 2.2e-16 # From the output, we can provide a couple of interpretations by comparing different groups: # (1) WHITE-MALE vs. WHITE-FEMALE: = delta_2 = -$4.78 Other things being equal, we expect white males to # (2) WHITE-MALE vs. BLACK-FEMALE: = (delta_1 + delta_2 + gamma) = -4.1691 -4.7846 + 3.8443 = -$5.11. Ot #(3) WHITE-FEMALE vs. BLACK-FEMALE: delta_1 + gamma = -$0.3248. Other things being equal, we expect whit # We can show the effects plot plot(allEffects(mreg.mod)) educ effect plot black*female effect plot 30 20 10 0 −10 0.0.02.04.06.18.0 female = 0fe.8male = 1 24 22 20 18 16 female =fe0male =f0e.m2ale = 0.5 wage wage 0 5 10 educ 15 20 0.0.02.04.06.18.0 0.0.02.04.06.18.0 # We can also test directly the joint hypothesis that that neither race nor gender affects wage: # H0: delta_1 = 0, delta_2 =0, gamma = 0 against that at least one of them is different from 0. # Note: hyp <- c("black=0", "female=0", "black:female=0") 35 24 22 20 18 16 black linearHypothesis(mreg.mod, hyp) Res.Df 998 995 RSS Df 135771.1 NA 130194.7 3 Sum of Sq NA 5576.473 F Pr(>F)
NA NA 14.20588 0
# Note: The test-statistic is F = [(SSE_R -SSE_U)/J] / [SSE_U/(N-K)] where R =restricted,
# U = unrestricted, J = # of restrictions, N = # of observations, K = # of parameters.
# N = 1000, K = 5, J =3. SSE_R = 135771, SSE_U = 130195
# The Restricted model is: WAGE = beta_1 + beta_2 EDUC and the Unrestricted model
# is the original one with all the predictors.
#
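As a quick check, the reported F statistic can be reproduced by hand from the quantities in the note above (SSE_R = 135771.1, SSE_U = 130194.7, J = 3, N = 1000, K = 5):
sse_r <- 135771.1; sse_u <- 130194.7
J <- 3; N <- 1000; K <- 5
((sse_r - sse_u)/J) / (sse_u/(N - K))   # approximately 14.21, matching the output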
In many applications we may need to include factor variables with more than 2 levels. For example, suppose we were interested in estimating the previous model but based on regional differences by comparing wages between the Northeast, Midwest, West, and South of the country. Our dummy variable (D) would then take on 4 values. We can easily incorporate this into our model by adding 3 factor variables as follows:
WAGE = β1 + β2EDUC + δ1SOUTH + δ2MIDWEST + δ3WEST + e.
Notice that we omitted a variable for the Northeast region. This is to avoid the Dummy Variable Trap. Basically,
we only need N-1 dummy variables for the N levels. In this case, the Northeast would be our baseline group.
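For intuition, here is a minimal sketch (an illustration, not part of the original example) of how R builds the N-1 dummies from a single factor and how relevel() changes the omitted baseline; it assumes the south, midwest, and west indicators in cps4_small are mutually exclusive:
library(PoEdata)
data("cps4_small", package = "PoEdata")
region <- with(cps4_small,
               factor(ifelse(south == 1, "south",
                      ifelse(midwest == 1, "midwest",
                      ifelse(west == 1, "west", "northeast"))),
                      levels = c("northeast", "south", "midwest", "west")))
head(model.matrix(~ region))            # 3 dummy columns; northeast is the baseline
levels(relevel(region, ref = "south"))  # relevel() would make south the baseline instead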
# Example 11: Interactions between Factors (with more than two levels)
# Wages (y=wage) vs. Education (x1=educ), Race (x2=black), Gender (x3=female),
# and Location (Northeast, South, West, Midwest)
# For simplicity, assume we are interested in comparing regional wage differences
# between the South and the rest of the country.
library(PoEdata)
library(effects)
data("cps4_small", package="PoEdata")
mreg.mod = lm(wage~educ+south+midwest+west+black*female, data=cps4_small)
summary(mreg.mod)
##
## Call:
## lm(formula = wage ~ educ + south + midwest + west + black * female,
## data = cps4_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.646 -7.121 -2.608 4.690 57.393
##
## Coefficients:
36

##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -4.8062     2.0287  -2.369    0.018 *
## educ           2.0712     0.1345  15.403  < 2e-16 ***
## south         -0.4499     1.0250  -0.439    0.661
## midwest       -2.6084     1.0596  -2.462    0.014 *
## west           0.9866     1.0598   0.931    0.352
## black         -3.9055     1.7863  -2.186    0.029 *
## female        -4.7441     0.7698  -6.163 1.04e-09 ***
## black:female   3.6250     2.3184   1.564    0.118
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 11.38 on 992 degrees of freedom
## Multiple R-squared: 0.2189, Adjusted R-squared: 0.2134
## F-statistic: 39.71 on 7 and 992 DF, p-value: < 2.2e-16 The regression results suggest that maybe the wages differences between e.g., the South and the rest of the country are not statistically significant. We can perform a Chow Test to see if the two regressions are the same or not. The null hypothesis is that they are the same. If they are, we may want to consider pooling the data together. We can write the model as: In this case the null hypothesis can be written as: H0 : θ1 = 0,θ2 = 0,θ3 = 0,θ4 = 0,θ5 = 0 against H1 :θi ̸=0,foranyi. 2.0712 -0.4499 -2.6084 0.1345 15.403 < 2e-16 *** 0.9866 -3.9055 -4.7441 0.661 0.014 * 0.352 0.029 * 1.0250 -0.439 1.0596 -2.462 1.0598 0.931 1.7863 -2.186 0.7698 -6.163 1.04e-09 *** 2.3184 1.564 0.118 # Example 12: Intercations between Factors (with more than two levels) # Wages (y=wage) vs. Education (x1=educ), Race (x2=black), Gender (x3=female), # and Location (South) # For simplicity, assume we are interested in comparing regional wage differences # between the South and the rest of the country. library(PoEdata) library(effects) data("cps4_small", package="PoEdata") 37 mod6 <- lm(wage~educ+black*female+south/(educ+black*female), data=cps4_small) summary(mreg.mod) ## ## Call: ## lm(formula = wage ~ educ + south + midwest + west + black * female, ## data = cps4_small) ## ## Residuals: ## Min 1Q Median 3Q Max ## -25.646 -7.121 -2.608 4.690 57.393 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.8062 2.0287 -2.369 0.018 *
## educ
## south
## midwest
## west
## black
## female
## black:female 3.6250
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 11.38 on 992 degrees of freedom
## Multiple R-squared: 0.2189, Adjusted R-squared: 0.2134
## F-statistic: 39.71 on 7 and 992 DF, p-value: < 2.2e-16 2.0712 -0.4499 -2.6084 0.1345 15.403 < 2e-16 *** 0.9866 -3.9055 -4.7441 0.661 0.014 * 0.352 0.029 * 1.0250 -0.439 1.0596 -2.462 1.0598 0.931 1.7863 -2.186 0.7698 -6.163 1.04e-09 *** 2.3184 1.564 0.118 # Note: We can estimate the two regression models separately, and then # compare them using the Chow Test: # install.packages("stargazer") library(stargazer) dnosouth <- cps4_small[which(cps4_small$south==0),]#no south dsouth <- cps4_small[which(cps4_small$south==1),] #south modns <- lm(wage~educ+black*female, data=dnosouth) mods <- lm(wage~educ+black*female, data=dsouth) modfull <- lm(wage~educ+black*female+south/(educ+black*female), data=cps4_small) #stargazer(modfull, modns, mods) hyp <- c("south=0", "educ:south=0", "black:south=0", "female:south=0", "black:female:south linearHypothesis(modfull, hyp) Sum of Sq F Pr(>F)
         "black:female:south=0")
  Res.Df      RSS Df Sum of Sq         F    Pr(>F)
1    995 130194.7
2    990 129984.4  5  210.2582 0.3202779 0.9009447
# We fail to reject H0 and conclude that there does not appear to be statistically
# significant differences between wages in the south and the rest of the country.
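A useful sanity check on the Chow-test setup (assuming modfull, modns, and mods from the code above): because modfull interacts every term with south, its residual sum of squares should equal the sum of the SSEs from the two separate regressions.
sse_u  <- sum(residuals(modfull)^2)   # unrestricted SSE, about 129984.4
sse_ns <- sum(residuals(modns)^2)     # non-South subsample
sse_s  <- sum(residuals(mods)^2)      # South subsample
c(pooled.interacted = sse_u, sum.of.separate = sse_ns + sse_s)  # should match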

(1)
2.173∗∗∗ (0.166)
−5.089∗ (2.643)
−5.005∗∗∗ (0.899)
3.944 (4.048)
Table 14:
Dependent variable:
wage (2)
2.173∗∗∗ (0.164)
−5.089∗ (2.604)
−5.005∗∗∗
(0.886) (1.581)
educ
black
female
south black:female educ:south black:south female:south black:female:south Constant
Observations R2
Adjusted R2 Residual Std. Error F Statistic
Note:
(3)
1.864∗∗∗ (0.240)
−3.385 (2.579)
5.306 5.306 2.370
(3.497)
−0.309 (0.286)
1.704 (3.633)
0.901 (1.773)
−2.936 (4.788)
−6.606∗∗∗ (2.337)
1,000
0.210
0.203
11.459 (df = 990) 29.264∗∗∗ (df = 9; 990)
(3.446) (3.383)
−6.606∗∗∗ (2.302)
704
0.221
0.217
11.289 (df = 699) 49.717∗∗∗ (df = 4; 699)
−2.662 (3.420)
296
0.184
0.173
11.855 (df = 291) 16.427∗∗∗ (df = 4; 291)
−4.104∗∗∗
∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01 # Example 13: Intercations between Factors (with more than two levels) I.3 Regression Splines Regression splines are an excellent alternative to the models we have been considering because they provide a greater deal flexibility at the expense of interpretability. These are a special case of the more general Piecewise REgression. Splines are designed for modeling complex dynamics that high degree polynomials are not able to capture. Briefly, they entail subsetting the data and fitting piecewise polynomials to each region. The points where the segments connect are called the knots of the spline. Two popular methods are the Simple Splines and B-Splines (B = Basis). Below we examine a couple of examples. # Example 13: Splines: (Per Person GDP vs. Infant Mortality Rates per 1000) data(UN) attach(UN) plot(ppgdp,infantMortality,pch=20) 0e+00 2e+04 4e+04 6e+04 8e+04 1e+05 ppgdp # Model 1: We can try log-log transformation: plot(log(ppgdp),log(infantMortality)) 40 infantMortality 0 20 40 60 80 100 5 6 7 8 9 10 11 plot(ppgdp,infantMortality,pch=20) loglog.fit <- lm(I(log(infantMortality)) ~ I(log(ppgdp))) curve(exp(coef(loglog.fit)[1] + coef(loglog.fit)[2]*log(x)),add=T,col="red", lwd=2) 0e+00 2e+04 4e+04 6e+04 8e+04 1e+05 ppgdp log(ppgdp) # Model 2: Linear Spline # Notice that for GDP > 2000 the decline in infant mortality is not as high. Therefore, we
# We can try a piecewise regression by using an indicator variable to separate the two regions.
# We will use a Linear Spline Regression:
# E(Y|X) = beta_0 + beta_1 X + beta_2 (X - a_1)*(X > a_1) + ... + beta_{k+1} (X - a_k)*(X > a_k),
#          where a_1, ..., a_k are the knots

# E(child.mortality|gdp) = beta_0 + beta_1 gdp + beta_2 (gdp > 2000) + beta_3 gdp * (gdp > 2000)
# Note: We will only use one knot (the code below places it at ppgdp = 1750)
mod.linspline <- lm(infantMortality ~1 + ppgdp + I((ppgdp-1750)*(ppgdp>1750)))
summary(mod.linspline)
##
## Call:
## lm(formula = infantMortality ~ 1 + ppgdp + I((ppgdp – 1750) *
## (ppgdp > 1750)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.377 -9.290 -3.038 5.420 76.074
##
## Coefficients:
##
## (Intercept)
## ppgdp
## I((ppgdp – 1750) * (ppgdp > 1750)) 0.040420 0.002797 14.45 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 17.39 on 190 degrees of freedom ## (20 observations deleted due to missingness) ## Multiple R-squared: 0.6501, Adjusted R-squared: 0.6464 ## F-statistic: 176.5 on 2 and 190 DF, p-value: < 2.2e-16 Estimate Std. Error t value Pr(>|t|)
95.012335 4.016111 23.66 <2e-16 *** -0.040837 0.002767 -14.76 <2e-16 *** 2000) b.0 <- coef(mod.linspline)[1] b.1 <- coef(mod.linspline)[2] b.2 <- coef(mod.linspline)[3] x.0 <- seq(0,1750,1) x.1 <- seq(1750,42000,1) y.0 <- b.0 + b.1 * x.0 y.1 <- (b.0 + b.1 * 1750 + (b.1 + b.2)* x.1) plot(ppgdp,infantMortality,pch=20) lines(x.0,y.0, col="red",lwd=3) lines(x.1,y.1, col="blue",lwd=3) 42 0e+00 2e+04 4e+04 6e+04 8e+04 ppgdp # Model 3: Cubic Spline # E(Y|X) = beta_0 + beta_1X + beta_2 X2 + beta_3 X3 + # beta_4 (X-a1)^3 + beta_5 (X-a2)^3+ ...+ beta_{k+3}(X-ak)^3 data=na.omit(UN) attach(data) library(splines) plot(ppgdp,infantMortality,pch=20) smoothspline.reg <- smooth.spline(ppgdp,infantMortality,nknots=10) lines(smoothspline.reg,col="red",lwd=3) 1e+05 infantMortality infantMortality 0 20 40 60 80 100 0 20 40 60 80 100 0e+00 2e+04 4e+04 6e+04 8e+04 ppgdp 1e+05 43 # Example 14: Splines: Canadian Survey of Labor Income Dynamics (SLID) # wages vs age, education, and sex(M/F) library(car) data(SLID) attach(SLID) mod.slid.int <- lm(log(wages) ~ sex*(ns(education, df=5) + poly(age, degree=2)), data=SLID, subset = age >= 18 & age <= 65) plot(predictorEffects(mod.slid.int, predictors = ~ education + age, transformation=list(link=log, inverse=exp)), lines=list(multiline=TRUE), confint=list(style="bands"), axes=list(y=list(lab="Hourly Wage Rate"))) education predictor effect plot age predictor effect plot 20 15 10 20 18 16 14 12 10 8 Female Male Female Male 0 5 10 education 15 20 20 30 40 50 60 age detach(SLID) I.4 Standardized Regression Coefficients sex sex Hourly Wage Rate Hourly Wage Rate Often times it may be helpful to standardize (i.e., convert to z-scores) both the response and predictor variables in order to better model underlying associations between the variables. The respective estimated coefficients are known as standardized coefficients. For example, consider a model that includes major financial indices in their original currencies. 44 Given the different scales, it would make sense to have a common currency, or even better, a dimensionless scale that we can all easily interpret such as z-scores. Below we show a plot of the standardized indices. In R we can easily standardize variables by using the scale function. Below is an example using the Prestige dataset. 45 # Example 15: Prestige Dataset Regression with Standardized Variables library("car") Prestige.1 = with(Prestige, data.frame(prestige=prestige, log2inc=log2(income), education=education,women=women)) Prestige.scaled = data.frame(scale(Prestige.1)) head(Prestige.scaled) prestige 1.2767988 1.2942361 0.9629272 0.5793063 1.5499834 1.7882933 Stdrd. Prestige log2inc 1.2866434 2.5372918 0.8016552 0.7259423 0.6354491 1.0953873 education 0.8693455 0.5578127 0.7447324 0.2499449 1.4227745 1.7966139 # We can verify the scaling is correct by simply plotting the respective histograms. par(mfrow=c(2,2)) hist(Prestige.scaled$prestige,main="Stdrd. Prestige") hist(Prestige.scaled$log2inc,main="Stdrd. log(inc)") hist(Prestige.scaled$education,main="Stdrd. Education") hist(Prestige.scaled$women,main="Stdrd. Women") −2 −1 0 1 2 Prestige.scaled$prestige Stdrd. Education −2 −1 0 1 2 Prestige.scaled$education women -0.5616725 -0.7867320 -0.4185673 -0.6262904 -0.5452816 -0.7517438 Stdrd. log(inc) −4 −3 −2 −1 0 1 2 3 Prestige.scaled$log2inc Stdrd. 
Women −1.0 0.0 1.0 2.0 Prestige.scaled$women # Now we can estimated the regression model summary(lm(prestige ~ log2inc + education + women, data=Prestige.scaled)) ## 46 Frequency Frequency 0 10 0 10 20 Frequency Frequency 0 20 50 0 20 40 ## Call: ## lm(formula = prestige ~ log2inc + education + women, data = Prestige.scaled) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.00927 -0.25745 -0.00587 0.25087 1.11479 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.548e-17 4.082e-02 0.000 1.00
## log2inc
## education
## women
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
##
## Residual standard error: 0.4123 on 98 degrees of freedom
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16 I.5 Boostrapping Regression Models In general, for any regression model estimated, it is a good idea to also include bootstrap estimates as way to test for the robustness of the estimates. The core principle being simple random sampling with replacement from the original dataset, and for each sample estimating the same regression model and then collecting the usual statistics. 4.620e-01 6.579e-02 7.022 2.9e-10 *** 5.916e-01 5.620e-02 10.527 < 2e-16 *** 8.647e-02 5.513e-02 1.568 0.12 # Example 16: Bank Transcation Time as a function of two types of transcations' times library(boot) library(car) set.seed(3435) head(Transact) t1 0 0 0 516 623 395 R original (Intercept) 999 144.369443 t1 999 5.462057 t2 999 2.034549 t2 time 1166 2396 1656 2348 899 2403 3315 13518 3969 13437 3087 7914 bootBias bootSE bootMed trans.mod = lm(time ~t1+t2, data=Transact) betahat.boot = Boot(trans.mod, R=999) usualEsts = summary(trans.mod)$coef[, 1:2] summary(betahat.boot) -0.0267235 195.1679081 150.650703 0.0357081 0.6881960 5.517476 -0.0047433 0.1562401 2.028783 47 confint(betahat.boot) hist(betahat.boot) Normal Density Kernel Density bca 95% CI Obs. Value −600 −200 200 600 (Intercept) 1.6 1.8 2.0 2.2 2.4 2.6 t2 2.5 % 97.5 % (Intercept) -297.646909 498.719579 t1 t2 3.863345 1.790541 6.599426 2.372642 3 4 5 6 7 8 t1 Density Density 0.0 1.5 3.0 0.0000 0.0020 Density 0.0 0.4 # Bootstrap for the estimated residual standard deviation: sigmahat.boot <- Boot(trans.mod, R=199, f=sigmaHat, labels="sigmaHat") summary(sigmahat.boot) R original bootBias bootSE bootMed sigmaHat 199 1142.556 -12.35542 98.92577 1129.662 confint(sigmahat.boot) 2.5 % 97.5 % sigmaHat 955.7003 1315.347 48
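As a small follow-up (assuming the betahat.boot object created above), car's confint() method for Boot objects can also report other interval types besides the default BCa intervals, which is a quick robustness check on the interval construction itself:
confint(betahat.boot, type = "perc")   # percentile bootstrap intervals
confint(betahat.boot, type = "norm")   # normal-approximation intervals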