
Linear Regression

Contents

I. The Linear Model and Least-Squares Regression
   I.1 Simple Linear Regression
   I.2 Nonlinear Transformations
   I.3 Prediction vs. Confidence Intervals
   I.4 Models with Factors (Indicator Variables)
Note: To access your textbook resources, type the following on the console:

# library(car)
# carWeb()

I. The Linear Model and Least-Squares Regression

I.1 Simple Linear Regression
The Simple Linear Regression (SLR) model is given by:
y = β0 + β1x + e
In R (and most other software packages) the default null hypothesis tested is H0 : βk = 0 against the alternative H1 : βk ≠ 0, for k = 0, 1. Therefore, if the respective estimated p-value is small (p-value < α), we reject H0 and conclude that the respective parameter is statistically significantly different from 0.

The Simple Linear Regression (SR) assumptions of the model are that:

1. The value of y, for each value of x, is y = β0 + β1x + e
2. E[e] = 0, which is the same as E[y] = β0 + β1x
3. var[e] = var[y] = σ^2 (constant variance condition, i.e., the errors are homoskedastic)
4. cov(yi, yj) = cov(ei, ej) = 0 (even better if the errors are independent, which is a stronger condition)
5. x is not random and takes on at least two different values
6. Optional: The errors are normally distributed
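These assumptions can be made concrete with a small simulation. The sketch below is illustrative only; the sample size, parameter values, and error standard deviation are arbitrary choices, not taken from these notes:

# Simulate data satisfying SLR1-SLR6 and check that lm() recovers the parameters.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)             # SLR5: x is non-random here and takes many values
e <- rnorm(n, mean = 0, sd = 2)  # SLR2-SLR4, SLR6: mean-zero, homoskedastic, independent, normal errors
y <- 2 + 0.5*x + e               # SLR1: y is generated by the linear model (beta_0 = 2, beta_1 = 0.5)
coef(lm(y ~ x))                  # estimates should be close to 2 and 0.5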
# Example 1: Actual measured weight (y=weight) vs. reported weight (x=repwt)
library("car")
#summary(Davis)
davis.mod <- lm(weight ~ repwt, data=Davis)
S(davis.mod)

## Call: lm(formula = weight ~ repwt, data = Davis)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.3363     3.0369   1.757   0.0806 .
## repwt         0.9278     0.0453  20.484   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 8.419 on 181 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.6986
## F-statistic: 419.6 on 1 and 181 DF, p-value: < 2.2e-16
##     AIC     BIC
## 1303.06 1312.69

# In this case we see that the y-intercept is not stat. significant but the slope is.
names(davis.mod)

##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "na.action"     "xlevels"       "call"          "terms"
## [13] "model"

Confint(davis.mod)

             Estimate      2.5 %    97.5 %
(Intercept) 5.3362605 -0.6560394 11.328561
repwt       0.9278428  0.8384665  1.017219

# Note: When the interval crosses zero, as it does for the y-intercept, this suggests
# that the parameter estimate is not statistically significant.

plot(weight ~ repwt, data=Davis, pch=20)
abline(0, 1, lty="dashed", lwd=2, col="red")
abline(davis.mod, lwd=2, col="blue")
legend("bottomright", c("Unbiased Reporting", "Least Squares"),
       lty=c("dashed", "solid"), col=c("red","blue"), lwd=2, inset=0.02)
with(Davis, showLabels(repwt, weight, n=2, method="mahal"))

## [1] 12 21

[Figure: weight vs. repwt scatterplot with the unbiased-reporting line (dashed red), the least-squares line (solid blue), and cases 12 and 21 labeled]

# We can now try to test the SLR assumptions 1-6:

# SLR1: From the scatterplot, the linear relationship does appear to hold.
# Also, the regression output confirms this.
S(davis.mod)
## Call: lm(formula = weight ~ repwt, data = Davis)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.3363     3.0369   1.757   0.0806 .
## repwt         0.9278     0.0453  20.484   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 8.419 on 181 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.6986
## F-statistic: 419.6 on 1 and 181 DF, p-value: < 2.2e-16
##     AIC     BIC
## 1303.06 1312.69

# SLR2: E[e] = 0
plot(davis.mod$residuals, pch=20, ylab="Residuals", ylim=c(-10,10))
abline(h=0, lwd=2, col="red")

[Figure: Residuals vs. Index, with a horizontal reference line at 0]

mean(davis.mod$residuals)

## [1] -3.342068e-18

# SLR3: var[e] = var[y]
var(davis.mod$residuals)

## [1] 70.48366

var(Davis$weight)

## [1] 227.8593

# SLR4: cov(y_i, y_j) = cov(e_i, e_j) = 0
# This can be difficult to assess from the data alone.

# SLR5: From the plot, x takes on many different values.

# SLR6: e ~ normal
hist(davis.mod$residuals, breaks="FD", col="skyblue2", freq=FALSE, ylab="Density",
     main="Histogram of the Residuals", xlim=c(-10,10))
lines(density(davis.mod$residuals), lwd=2, col="red")

[Figure: Histogram of the Residuals, with kernel density overlay]

We can see that observation 12 is an influential outlier, i.e., it can destabilize the regression fit. Therefore, we might consider removing it and re-estimating the regression fit.
# Note: A Jarque-Bera test is more objective, but for now a visual
# inspection of the histogram will suffice.

davis.mod.2 = update(davis.mod, subset=-12)
S(davis.mod.2)

## Call: lm(formula = weight ~ repwt, data = Davis, subset = -12)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.73380    0.81479   3.355 0.000967 ***
## repwt        0.95837    0.01214  78.926  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 2.254 on 180 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.9719
## F-statistic: 6229 on 1 and 180 DF, p-value: < 2.2e-16
##    AIC    BIC
## 816.27 825.88

# We can compare the estimates with and without the outlier:
cbind(Original=coef(davis.mod), NoCase12=coef(davis.mod.2))

             Original  NoCase12
(Intercept) 5.3362605 2.7338020
repwt       0.9278428 0.9583743

# Given that the slope did not change much, this outlier is considered a
# low-leverage point (i.e., its predicted value is near
# the centroid of the predictions).

When determining how well the model fits the data, we look at:

• t-stats and respective p-values of the estimated parameters
• Coefficient of Determination: R^2
• F-stat of the model and its respective p-value
• Plot of the residuals:
  – (1) randomly distributed about 0, and
  – (2) constant variance
• Interpretability of the estimates
• Robustness of the model estimates
• High correlation between actual and predicted values

The normality of the residuals can be helpful but is not required. We can perform a Jarque-Bera (JB) test on the residuals to determine if they are normally distributed. The JB statistic is given by:

JB = (N/6) [ S^2 + (K − 3)^2 / 4 ],

where K is the kurtosis and S is the skewness. For a normal distribution K = 3 and S = 0. Under the null hypothesis that the residuals are normally distributed (S = 0 and K = 3), JB is asymptotically chi-squared distributed with ν = 2 degrees of freedom, so large values of JB lead us to reject normality.
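As a quick sanity check, the JB statistic can be computed by hand from the sample skewness and kurtosis of the residuals and compared with the packaged test. A minimal sketch, assuming davis.mod.2 from above and the tseries package (used again later in these notes):

# Hand-rolled Jarque-Bera statistic for the residuals of davis.mod.2
e  <- residuals(davis.mod.2)
N  <- length(e)
S  <- mean((e - mean(e))^3) / mean((e - mean(e))^2)^(3/2)    # sample skewness
K  <- mean((e - mean(e))^4) / mean((e - mean(e))^2)^2        # sample kurtosis
JB <- (N/6) * (S^2 + (K - 3)^2/4)
c(JB = JB, p.value = pchisq(JB, df = 2, lower.tail = FALSE)) # chi-squared with 2 df

# Compare with the packaged version:
library(tseries)
jarque.bera.test(e)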
Another useful diagnostic tool is K-Fold Cross Validation. This is a test of how well the model generalizes to new data. Briefly, the data are separated into a testing set (we predict these values) and a training set (used to estimate the model parameters). By comparing the true values to the predicted values, we have a measure of how well the model predicted the correct values (this is known as the Holdout Method). Since this result can be biased by the choice of training set, a better option is to split the data into k mutually exclusive random subsets and repeat the holdout method k times (a hand-rolled version of this procedure is sketched after Example 3 below).

[Figure 1: K-Fold Cross Validation Algorithm (from Wikipedia)]

# Example 2: Wages (y=wage) vs. Years of Education (x=education)
#library(PoEdata)
#data("cps_small")
library(tidyverse)
library(readxl)
library(car)
cps_small = read_excel('cps_small.xls')
reg.mod = lm(cps_small$wage ~ cps_small$educ)
scatterplot(cps_small$educ, cps_small$wage, xlab="Years of Education",
            ylab="Hourly Wages ($)", pch=20)
abline(reg.mod, lwd=2, col="red")

[Figure: Hourly Wages ($) vs. Years of Education with the fitted line]

S(reg.mod)

## Call: lm(formula = cps_small$wage ~ cps_small$educ)
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -4.91218    0.96679  -5.081 4.48e-07 ***
## cps_small$educ  1.13852    0.07155  15.912  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 5.582 on 998 degrees of freedom
## Multiple R-squared: 0.2024
## F-statistic: 253.2 on 1 and 998 DF, p-value: < 2.2e-16
##     AIC     BIC
## 6280.86 6295.58

# The linear fit does not appear to be adequate despite
# the stat. sig. of the parameter estimates. We can look at
# the residuals plot to determine if the model is appropriate:
plot(cps_small$educ, reg.mod$residuals, pch=20, ylab="Residuals",
     xlab="Years of Education")
abline(h=0, lwd=2, col="red")

[Figure: Residuals vs. Years of Education]

# The plot shows that both of the residuals' conditions are not satisfied.
# Q1: Do you think a log transformation of wages might improve our model? Try it!

# Q2: Are the residuals normally distributed?
hist(reg.mod$residuals, breaks="FD", col="skyblue2", freq=FALSE, ylab="Density",
     xlab="Residuals", main="Histogram of the Residuals")
lines(density(reg.mod$residuals), lwd=2, col="red")

[Figure: Histogram of the Residuals, with kernel density overlay]

library(tseries)
jarque.bera.test(reg.mod$residuals)

##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 3023.5, df = 2, p-value < 2.2e-16

# From the histogram and JB test we can conclude that the
# residuals are not normally distributed.

# Another diagnostic is the prediction accuracy.
# A correlation between the actual and predicted values can
# be used as a form of accuracy measure.
actuals_preds <- data.frame(cbind(actuals=cps_small$wage,
                                  predicteds=reg.mod$fitted.values))
correlation_accuracy <- cor(actuals_preds)
print(correlation_accuracy)

##              actuals predicteds
## actuals    1.0000000  0.4498506
## predicteds 0.4498506  1.0000000

head(actuals_preds)

  actuals predicteds
     2.03   9.888543
     2.07   8.750025
     2.12   8.750025
     2.54  13.304094
     2.68   8.750025
     3.09   9.888543

library(DAAG)

## Loading required package: lattice
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:car':
##
##     vif
## The following object is masked from 'package:survival':
##
##     lung
## The following object is masked from 'package:MASS':
##
##     hills
## The following object is masked from 'package:plyr':
##
##     ozone

cvResults <- suppressWarnings(CVlm(data=as.data.frame(cps_small),
    form.lm=formula(wage ~ educ), m=5, dots=FALSE, legend.pos="topleft",
    printit=FALSE,
    main="Small symbols are predicted values while bigger ones are actuals."))

[Figure: wage vs. educ with the 5-fold CV fits (Fold 1 to Fold 5); small symbols are predicted values, bigger ones are actuals]

attr(cvResults, 'ms')

## [1] 31.29037

# Example 3: Stopping Distance vs. Speed
library(DAAG)
library(car)
cvResults <- suppressWarnings(CVlm(data=cars, form.lm=dist ~ speed, m=5,
    dots=FALSE, seed=29, legend.pos="topleft", printit=FALSE,
    main="Small symbols are predicted values while bigger ones are actuals"))

[Figure: dist vs. speed with the 5-fold CV fits (Fold 1 to Fold 5)]

attr(cvResults, 'ms')

## [1] 254.2661

# We need to check that the model's prediction accuracy doesn't vary too much
# for any one particular sample, and that the lines of best fit don't vary
# too much with respect to the slope and level.
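Returning to the wage ~ educ model from Example 2: CVlm() hides the mechanics of the procedure described above. The following hand-rolled version is a minimal sketch (the seed and fold assignment below are arbitrary choices), and its mean squared error can be compared with attr(cvResults, 'ms'):

# Manual 5-fold cross-validation for wage ~ educ on cps_small
set.seed(123)                                    # fold assignment is random
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(cps_small)))
mse   <- numeric(k)
for (i in 1:k) {
  train  <- cps_small[folds != i, ]              # estimate on the other k-1 folds
  test   <- cps_small[folds == i, ]              # predict the held-out fold
  fit    <- lm(wage ~ educ, data = train)
  mse[i] <- mean((test$wage - predict(fit, newdata = test))^2)
}
mse          # per-fold mean squared prediction error
mean(mse)    # overall CV estimate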
# Example 4: Speed vs. Stopping Distance
require(boot)
library(MASS)
plot(speed ~ dist, data=cars, main="Cars", xlab="Stopping Distance",
     ylab="Speed", pch=20)

[Figure: Cars - Speed vs. Stopping Distance]

glm.fit = glm(speed ~ dist, data=cars)
degree = 1:5
cv.error5 = rep(0, 5)
for (d in degree) {
  glm.fit = glm(speed ~ poly(dist, d), data=cars)
  cv.error5[d] = cv.glm(cars, glm.fit, K=5)$delta[1]
}

# The plot suggests that a low-degree polynomial fits best. At higher orders,
# we are overfitting and thus lose predictive performance.
plot(degree, cv.error5, type="o", xlab="Polynomial Fit Degree", lwd=2,
     ylab="5-Fold CV Error")

[Figure: 5-Fold CV Error vs. Polynomial Fit Degree]

I.2 Nonlinear Transformations

When the data are not suitable for a linear model, we can consider non-linear transformations of y and/or x. In econometrics, four very popular non-linear models are the (1) log-log, (2) log-linear, (3) linear-log, and (4) quadratic models. The table below shows a summary of these models and their respective interpretations.
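The table referenced above did not survive the conversion of these notes. For reference, the standard textbook interpretations of the slope β1 in these four models are, approximately:

Model        Equation                     Approximate interpretation of β1
log-log      ln(y) = β0 + β1 ln(x) + e    a 1% change in x is associated with a β1% change in y (β1 is an elasticity)
log-linear   ln(y) = β0 + β1 x + e        a one-unit change in x changes y by about 100·β1 percent
linear-log   y = β0 + β1 ln(x) + e        a 1% change in x changes y by about β1/100 units
quadratic    y = β0 + β1 x^2 + e          the marginal effect of x on y is 2β1x, so it depends on x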
Next we examine a quadratic model (y = β0 + β1x^2 + e) as an illustration of a non-linear model.

# install.packages('devtools')
# library(devtools)
# install_git('https://github.com/ccolonescu/PoEdata')

# Example 5: Home Prices (y=price) vs. Living Area (x=sqft)
library(PoEdata)
data(br)
reg.mod <- lm(price ~ I(sqft^2), data=br)
S(reg.mod)

## Call: lm(formula = price ~ I(sqft^2), data = br)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.578e+04  2.890e+03   19.30   <2e-16 ***
## I(sqft^2)   1.542e-02  3.131e-04   49.25   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 68210 on 1078 degrees of freedom
## Multiple R-squared: 0.6923
## F-statistic: 2426 on 1 and 1078 DF, p-value: < 2.2e-16
##      AIC      BIC
## 27110.35 27125.30

b1 <- coef(reg.mod)[[1]]
b2 <- coef(reg.mod)[[2]]
plot(br$sqft, br$price, xlab="Area (ft^2)", ylab="Home Price ($)", pch=20)
curve(b1 + b2*x^2, col="red", add=TRUE, lwd=2)

[Figure: Home Price ($) vs. Area (ft^2) with the fitted quadratic curve]

# Q1: Is the quadratic fit a good one?
# A1: From the scatterplot we can see that for
# homes with sqft > 4000, the spread increases and the number
# of observations decreases.
# This suggests that our fit will worsen as we consider
# larger values of x beyond ~4000 ft^2.
#Q2: Are the residuals normally distributed?
hist(reg.mod$residuals, breaks="FD", col="skyblue2", freq=FALSE, ylab="Density",
     main="Histogram of the Residuals")
lines(density(reg.mod$residuals), lwd=2, col="red")
[Figure: Histogram of the Residuals (quadratic model), density vs. reg.mod$residuals]
library(tseries)
jarque.bera.test(reg.mod$residuals)
##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 61385, df = 2, p-value < 2.2e-16

# We can also compute traditional metrics such as predicted
# values and elasticities
sqftx = c(2000, 4000, 6000)        # given values for sqft
pricex = b1 + b2*sqftx^2           # prices corresponding to given sqft
DpriceDsqft <- 2*b2*sqftx          # marginal effect of sqft on price
elasticity = DpriceDsqft*sqftx/pricex
cat("DpriceDsqft (x =2000, 4000, 6000) =", DpriceDsqft)

## DpriceDsqft (x =2000, 4000, 6000) = 61.68521 123.3704 185.0556

cat("elasticity (x =2000, 4000, 6000) =", elasticity)

## elasticity (x =2000, 4000, 6000) = 1.050303 1.631251 1.817408
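The two sets of numbers printed above follow from differentiating the fitted quadratic: dprice/dsqft = 2·b2·sqft, and elasticity = (dprice/dsqft)·(sqft/price) = 2·b2·sqft^2 / (b1 + b2·sqft^2). A quick arithmetic check of the first marginal effect, using the estimate b2 ≈ 1.542e-02 from the output above:

2 * 1.542e-02 * 2000   # ~ 61.7, matching DpriceDsqft at sqft = 2000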
# Example 6: Weekly Food Expenditure (y=food_exp in $) vs.
# Income (x=income in units of $100)
# We will consider a Linear-Log Model: y = beta_0 + beta_1 log(x)
#library(PoEdata)
data(food)
head(food)

  food_exp income
    115.22   3.69
    135.98   4.39
    119.34   4.75
    114.96   6.03
    187.05  12.47
    243.92  12.98

plot(food$income, food$food_exp, xlab="Income ($100)",
     ylab="Food Expenditure ($)", pch=20)
reg.mod <- lm(food$food_exp ~ log(food$income))
S(reg.mod)

## Call: lm(formula = food$food_exp ~ log(food$income))
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -97.19      84.24  -1.154    0.256
## log(food$income)   132.17      28.80   4.588 4.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 91.57 on 38 degrees of freedom
## Multiple R-squared: 0.3565
## F-statistic: 21.05 on 1 and 38 DF, p-value: 4.76e-05
##    AIC    BIC
## 478.83 483.90
y = predict(reg.mod, interval="confidence")
matlines(food$income, y, lwd=2, col="blue")
[Figure: Food Expenditure ($) vs. Income ($100) with the fitted linear-log curve and its confidence bands]
#Interpretation 1: For a household with $1000 weekly income,
# we expect an additional weekly food spending of $13.22
# for an additional $100 increase in weekly income.
#Interpretation 2: For a 1% increase in income,
# we expect food expenditure to increase by $1.32/week.
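Both interpretations follow from the linear-log form, for which dy/dx = β1/x, so the effect of an extra $100 of income shrinks as income grows. A quick check using the slope estimate from the output above (roughly 132.17):

slope <- coef(reg.mod)[[2]]
slope / 10    # effect of an extra $100 at income = 10 (i.e., $1,000/week): ~ 13.22
slope / 100   # effect of a 1% increase in income: ~ 1.32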
plot(food$income, reg.mod$residuals, pch=20, ylab="Residuals", xlab="Income")
abline(h=0, lwd=2, col="red")

[Figure: Residuals vs. Income]
#Q: Are the residuals normally distributed?
hist(reg.mod$residuals, breaks="FD", col="skyblue2", freq=FALSE, ylab="Density",
     main="Histogram of the Residuals", xlab="Residuals")
lines(density(reg.mod$residuals), lwd=2, col="red")

[Figure: Histogram of the Residuals, with kernel density overlay]

library(tseries)
jarque.bera.test(reg.mod$residuals)

##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 0.19989, df = 2, p-value = 0.9049
I.3 Prediction vs. Confidence Intervals
In Example 6 we showed the confidence interval for the predicted values. However, any predicted value ŷi is a random variable, since its value depends on the sample. This suggests that we should also estimate a corresponding prediction interval for ŷi. Consider, for example, finding y(x = x0) = y0 = β0 + β1x0 + e0. Based on the SR assumptions 1-5, the confidence interval estimates are about E[y0], which is not random. We can define the forecast error as

f = y0 − ŷ0 = (β0 + β1x0 + e0) − (b0 + b1x0),

where b0 and b1 are the least-squares estimates, E[f] = 0, and

var[f] = σ^2 [ 1 + 1/N + (x0 − x̄)^2 / Σ(xi − x̄)^2 ].

Since we do not know σ^2, we can estimate it from the residuals, which gives an estimate of the standard error of the forecast, se(f) = sqrt(var[f]) with σ^2 replaced by its estimate. We can then compute the prediction interval for ŷ0 as ŷ0 ± tc se(f), where tc = t(1−α/2, N−2). Below we show a general plot of prediction interval behavior.
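Before relying on predict(), the formula can be verified once by hand. A minimal sketch using the food data and the simple regression of food_exp on income (the same model as in Example 7 below); the hand-computed interval should match predict(..., interval="prediction"):

# 95% prediction interval at income = x0 = 20, computed from the formula above
library(PoEdata)
data(food)
mod   <- lm(food_exp ~ income, data=food)
x     <- food$income
N     <- nrow(food)
x0    <- 20
y0hat <- as.numeric(predict(mod, newdata=data.frame(income=x0)))
sig2  <- summary(mod)$sigma^2                 # estimate of sigma^2
sef   <- sqrt(sig2 * (1 + 1/N + (x0 - mean(x))^2 / sum((x - mean(x))^2)))
tc    <- qt(0.975, df = N - 2)
c(fit = y0hat, lwr = y0hat - tc*sef, upr = y0hat + tc*sef)

# Compare with:
predict(mod, newdata=data.frame(income=x0), interval="prediction", level=0.95)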

# Example 7: Weekly Food Expenditure (y=food_exp in $) vs.
# Income (x=income in units of $100)
# Prediction vs. Confidence Intervals
library(PoEdata)
data(food)
attach(food)
reg.mod <- lm(food_exp ~ income, data=food)
predict(reg.mod, newdata=data.frame(income=20), interval="confidence", level=0.95)

       fit      lwr      upr
  287.6089 258.9069 316.3108

predict(reg.mod, newdata=data.frame(income=20), interval="prediction", level=0.95)

       fit      lwr      upr
  287.6089 104.1323 471.0854

plot(income, food_exp, pch=20, xlab="Income ($100)", ylab="Food Expenditure ($)")
abline(reg.mod, col="blue", lwd=2)
pred_int = predict(reg.mod, newdata=data.frame(income), interval="prediction", level=0.95)
conf_int = predict(reg.mod, newdata=data.frame(income), interval="confidence", level=0.95)

# We can plot these for the entire dataset:
lines(income, conf_int[,2], col="orange", lty=2, lwd=2)
lines(income, conf_int[,3], col="orange", lty=2, lwd=2)
lines(income, pred_int[,2], col="red", lty=2, lwd=2)
lines(income, pred_int[,3], col="red", lty=2, lwd=2)

[Figure: Food Expenditure ($) vs. Income ($100) with the fitted line (blue), confidence bands (orange), and prediction bands (red)]

# As expected, the prediction interval is wider than the confidence interval for the
# same level of confidence.

# A prettier version of this plot:
library(ggplot2)
temp_var <- predict(reg.mod, interval="prediction")
new_df <- cbind(food, temp_var)
ggplot(new_df, aes(income, food_exp)) +
  geom_point() +
  geom_line(aes(y=lwr), color="red", linetype="dashed") +
  geom_line(aes(y=upr), color="red", linetype="dashed") +
  geom_smooth(method=lm, se=TRUE)

[Figure: ggplot version - food_exp vs. income with the prediction bounds (dashed red) and the lm fit with its confidence band]

I.4 Models with Factors (Indicator Variables)

Factor variables (also known as indicator or dummy variables) are used in regression to represent categorical variables, i.e., non-quantitative characteristics such as gender, race, drink preference, etc. The simplest example of a factor variable is a binary one, i.e., one that can take on only two values (these are referred to as the levels of the factor). For example, we can create a binary indicator variable for gender, I, such that I = 1 (gender = female) and I = 0 (gender = male).

For example, consider a regression of hourly wages (WAGE) on gender (FEMALE = 1 if female). In this case our model can be expressed as:

E(WAGE) = β0 + β1 FEMALE = β0 + β1   if FEMALE = 1
                           β0        if FEMALE = 0.

In this case, our interpretation of the model parameters is different from the previous examples, since β0 = E(WAGE | Male) and β1 = E(WAGE | Female) − E(WAGE | Male).

# Example 8: Wages (y=wage) vs. Gender (x=female)
library(PoEdata)
data("cps_small")
attach(cps_small)

# Note: When dealing with indicator variables it is important to check the number of
# observations in each category to see if the data are balanced or not.
S(cps_small[,1:4])

      wage            educ           exper           female
 Min.   : 2.03   Min.   : 1.00   Min.   : 0.00   Min.   :0.000
 1st Qu.: 5.53   1st Qu.:12.00   1st Qu.:10.00   1st Qu.:0.000
 Median : 8.79   Median :13.00   Median :18.00   Median :0.000
 Mean   :10.21   Mean   :13.29   Mean   :18.78   Mean   :0.494
 3rd Qu.:12.78   3rd Qu.:16.00   3rd Qu.:26.00   3rd Qu.:1.000
 Max.   :60.19   Max.   :18.00   Max.   :52.00   Max.   :1.000

S(cps_small[,5:9])

     black           white          midwest          south            west
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000
 Median :0.000   Median :1.000   Median :0.000   Median :0.000   Median :0.000
 Mean   :0.088   Mean   :0.912   Mean   :0.237   Mean   :0.315   Mean   :0.222
 3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000
 Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000

# Did you find any indicator variables that are not well balanced?
wage_female = wage[which(female==1)]
wage_male = wage[which(female==0)]

# First we will look at the distribution of wages conditioned on gender
par(mfrow=c(2,1))
hist(wage_female, breaks="FD", col="skyblue2", freq=FALSE, ylab="Density",
     main="Histogram of the Wages (Female)", xlab="Wages", ylim=c(0,0.17), xlim=c(0,60))
lines(density(wage_female), lwd=2, col="red")
hist(wage_male, breaks="FD", col="skyblue2", freq=FALSE, ylab="Density",
     main="Histogram of the Wages (Male)", xlab="Wages", ylim=c(0,0.17), xlim=c(0,60))
lines(density(wage_male), lwd=2, col="red")

[Figure: Histogram of the Wages (Female), top panel, and Histogram of the Wages (Male), bottom panel, each with a kernel density overlay]
# Now we can estimate the model and interpret the results
reg.mod = lm(wage ~ female)
S(reg.mod)

## Call: lm(formula = wage ~ female)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  11.5255     0.2715  42.455  < 2e-16 ***
## female       -2.6568     0.3862  -6.878 1.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 6.107 on 998 degrees of freedom
## Multiple R-squared: 0.04526
## F-statistic: 47.31 on 1 and 998 DF, p-value: 1.067e-11
##     AIC     BIC
## 6460.65 6475.37

# We interpret the y-intercept as the mean hourly wage for men (beta_0 = $11.52), and
# the slope as the difference in hourly wages between women and men (beta_1 = -$2.66).
# On average, all other things being equal, women are predicted to earn $2.66/hr less
# than men.
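Because FEMALE is a 0/1 indicator, these estimates can be verified directly from the conditional sample means; a quick check, assuming cps_small is still attached as above:

# The intercept should equal the mean male wage; the slope, the female-male difference.
mean_male   <- mean(wage[female == 0])
mean_female <- mean(wage[female == 1])
c(intercept = mean_male, slope = mean_female - mean_male)
coef(reg.mod)   # should match the values above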