
Linear Regression
Mariia Okuneva
The main difference from the Econometrics course: there, when using linear models, we often emphasize distributional results useful for hypothesis testing, and the main goal is to explain a relationship.
Now, we use linear regression as a tool to predict. We seek a model which minimizes prediction errors!
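To see the difference in practice, here is a minimal sketch of the predictive workflow (the 70/30 split and the choice of variables are illustrative assumptions, not part of this tutorial): fit on a training sample, then judge the model by its squared errors on held-out data.
library(MASS)                                                   # Boston data, used throughout this tutorial
set.seed(1)                                                     # make the random split reproducible
train = sample(nrow(Boston), size = round(0.7 * nrow(Boston))) # indices of a 70% training sample
fit   = lm(medv ~ lstat, data = Boston[train, ])                # estimate on the training data only
pred  = predict(fit, newdata = Boston[-train, ])                # predict the held-out neighbourhoods
mean((Boston$medv[-train] - pred)^2)                            # test mean squared error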
Remember that the library() function is used to load packages that are not included in the base R distribution. Here we load the MASS package, which contains a large collection of data sets and functions. We also load the ISLR package, which includes the data sets associated with the ISLR textbook.
setwd("X:/Doctor/data mining/SS20/Tutorial_2") # change the working directory, your path here!
library(MASS)
install.packages('ISLR', repos = "http://cran.us.r-project.org")
## package 'ISLR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mokuneva\AppData\Local\Temp\2\Rtmpm2sYIJ\downloaded_packages
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
Simple Linear Regression
Task: The MASS library contains the Boston data set, which records medv (median house value) for 506 neighbourhoods around Boston.
How many variables are included in this data set? What is the size of the dataframe?
fix(Boston) # put the dataframe into your working environment and get the chart
names(Boston) # How many variables are included in this data set?
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
dim(Boston) # What is the size of the dataframe?
## [1] 506 14
?Boston # information on the dataset
## starting httpd help server ... done
Explore the data before running any regression!
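For example, a few quick commands for a first look (a suggestion, not part of the original script; summary() and pairs() are base R):
summary(Boston$medv)                               # five-number summary of the response
pairs(Boston[, c("medv", "lstat", "rm", "age")])   # pairwise scatter plots of selected variables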
About problems with the data: Gilley, O.W., and R. Kelley Pace, On the Harrison and Rubinfeld Data, Journal of Environmental Economics and Management, 31 (1996), 403-405.
plot(density(Boston$medv), xlab = “Median value of homes”)
[Figure: density plot of medv (x-axis: Median value of homes; y-axis: Density)]
Boston[Boston$medv==50, ]
##        crim zn indus chas    nox    rm   age    dis rad tax ptratio  black lstat medv
## 162 1.46336  0 19.58    0 0.6050 7.489  90.8 1.9709   5 403    14.7 374.43  1.73   50
## 163 1.83377  0 19.58    1 0.6050 7.802  98.2 2.0407   5 403    14.7 389.61  1.92   50
## 164 1.51902  0 19.58    1 0.6050 8.375  93.9 2.1620   5 403    14.7 388.45  3.32   50
## 167 2.01019  0 19.58    0 0.6050 7.929  96.2 2.0459   5 403    14.7 369.30  3.70   50
## 187 0.05602  0  2.46    0 0.4880 7.831  53.6 3.1992   3 193    17.8 392.63  4.45   50
## 196 0.01381 80  0.46    0 0.4220 7.875  32.0 5.6484   4 255    14.4 394.23  2.97   50
## 205 0.02009 95  2.68    0 0.4161 8.034  31.9 5.1180   4 224    14.7 390.55  2.88   50
## 226 0.52693  0  6.20    0 0.5040 8.725  83.0 2.8944   8 307    17.4 382.00  4.63   50
## 258 0.61154 20  3.97    0 0.6470 8.704  86.9 1.8010   5 264    13.0 389.70  5.12   50
## 268 0.57834 20  3.97    0 0.5750 8.297  67.0 2.4216   5 264    13.0 384.54  7.44   50
## 284 0.01501 90  1.21    1 0.4010 7.923  24.8 5.8850   1 198    13.6 395.52  3.16   50
## 369 4.89822  0 18.10    0 0.6310 4.970 100.0 1.3325  24 666    20.2 375.52  3.26   50
## 370 5.66998  0 18.10    1 0.6310 6.683  96.8 1.3567  24 666    20.2 375.33  3.73   50
## 371 6.53876  0 18.10    1 0.6310 7.016  97.5 1.2024  24 666    20.2 392.05  2.96   50
## 372 9.23230  0 18.10    0 0.6310 6.216 100.0 1.1691  24 666    20.2 366.15  9.53   50
## 373 8.26725  0 18.10    1 0.6680 5.875  89.6 1.1296  24 666    20.2 347.88  8.88   50
Note that 16 neighbourhoods have medv of exactly 50: the median house value appears to be censored (top-coded) at $50,000, one of the data problems discussed by Gilley and Pace (1996). It is always a good idea to visualize your data.
plot(medv~lstat, data=Boston) # plot the response against the feature
[Figure: scatter plot of medv against lstat]
# As the lower status percentage in the region increases, the median housing value decreases.
We will start by using the lm() function to fit a simple linear regression model, with medv as the dependent variable and lstat (percent of households with low socioeconomic status) as the predictor (feature/independent variable).
Task: The basic syntax is lm(y~x, data). Try to fit a simple linear model with medv and lstat variables.
fit1 = lm(medv ~ lstat, data=Boston) # fit1 is just the name of the model, you could also store it as fit
# if we attach Boston, we might skip the data parameter.

attach(Boston)
fit1 = lm(medv ~ lstat)
fit1 # some basic information about the model
##
## Call:
## lm(formula = medv ~ lstat)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
We get a negative estimate of the coefficient (as expected).
summary(fit1) # more detailed information
##
## Call:
## lm(formula = medv ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Here we get the residuals, the call that created the fit, the coefficients, their standard errors, t-statistics and p-values. Are the coefficient estimates significant?
names(fit1) # use the names() function to find out what else is stored in fit1
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
fit1$coefficients # we can extract these quantities by name
## (Intercept) lstat
## 34.5538409 -0.9500494
fit1$coefficients[2]
## lstat
## -0.9500494
coef(fit1) # or you can use this command to extract the coefficients
## (Intercept) lstat
## 34.5538409 -0.9500494
In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command:
confint(fit1) # by default level = 0.95
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
confint(fit1, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 33.099101 36.0085810
## lstat -1.050199 -0.8498995
The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat. A confidence interval quantifies the uncertainty surrounding the average medv over a large number of neighbourhoods. A prediction interval quantifies the uncertainty surrounding medv for one particular neighbourhood. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f(X) and the uncertainty as to how much an individual point will differ from the population regression plane: there is more uncertainty about medv for a given neighbourhood than about the average medv over many neighbourhoods. (See ISLR p. 82.)
# confidence and prediction intervals for the prediction of medv
# for given values of lstat (5, 10, 15)
predict(fit1, data.frame(lstat=(c(5,10,15))), interval = 'confidence')
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(fit1, data.frame(lstat=(c(5,10,15))), interval = 'prediction')
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
For example, the 95% confidence interval associated with an lstat value of 10 is (24.47, 25.63), and the 95% prediction interval is (12.83, 37.28). Interpretation:
1) CI: given that each neighbourhood has an lstat value of 10, 95% of confidence intervals constructed this way will contain the true average medv.
2) PI: 95% of prediction intervals constructed this way will contain the true medv of this particular neighbourhood.
Now we can add the fitted line to the plot:
plot(medv~lstat)
abline(fit1) # add the regression line
[Figure: scatter plot of medv against lstat with the fitted regression line]
Play with some graphics options:
plot(medv~lstat)
abline(fit1, lwd=3) # add a thicker regression line
[Figure: the same scatter plot with a thicker regression line]
Plot the data using different graphics options:
plot(lstat, medv, col='red') # change the colour
[Figure: scatter plot with red points]
plot(lstat, medv, pch=20) # change the symbol used to plot the points
[Figure: scatter plot drawn with filled circles]
plot(lstat, medv, pch='+') # sometimes you can replace the number of the symbol with the symbol itself
[Figure: scatter plot drawn with '+' symbols]
?pch # information on the different options available
plot(1:20, 1:20, pch = 1:20) # the available data point icons
[Figure: the 20 plotting symbols]
Now we want to examine some diagnostic plots. If you apply the plot() function directly to the output from lm() in a normal R script (not in a notebook), it produces one plot at a time; hit Enter to generate the next plot.
# mfrow - a vector of the form c(number of rows, number of columns)
par(mfrow=c(2,2)) # combine plots in a 2 by 2 matrix
plot(fit1)
[Figure: diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
These plots help to see if there are outliers (unusual values of y), high leverage points (unusual values of x), or nonlinear relationships. Ideally the residual plot will show no discernible pattern.
# return the display options to 1 plot per frame
par(mfrow=c(1,1))
Multiple linear regression
# Now we are going to fit a linear regression with lstat and age:
fit1 = lm(medv ~ lstat + age)
summary(fit1)
##
## Call:
## lm(formula = medv ~ lstat + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
Another important characteristic of the output is the R-squared: the percentage of variance explained. In the multiple regression setting we are interested in the adjusted R-squared.
If we want to use all features from the Boston dataset, we can run the following command:
fit2 = lm(medv ~ ., data=Boston)
summary(fit2)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288
## chas         2.687e+00  8.616e-01   3.118 0.001925 **
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Most of the parameter estimates are significant, some are not. E.g., age is no longer significant. Why? There are a lot of other predictors correlated with age; in their presence, age is no longer required.
fit2 = lm(medv ~ .-age, data=Boston) # all regressors but age
summary(fit2)
##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
## crim         -0.108006   0.032832  -3.290 0.001075 **
## zn            0.046334   0.013613   3.404 0.000719 ***
## indus         0.020562   0.061433   0.335 0.737989
## chas          2.689026   0.859598   3.128 0.001863 **
## nox         -17.713540   3.679308  -4.814 1.97e-06 ***
## rm            3.814394   0.408480   9.338  < 2e-16 ***
## dis          -1.478612   0.190611  -7.757 5.03e-14 ***
## rad           0.305786   0.066089   4.627 4.75e-06 ***
## tax          -0.012329   0.003755  -3.283 0.001099 **
## ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
## black         0.009321   0.002678   3.481 0.000544 ***
## lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
# another command to update a model:
fit2 = update(fit2, ~.-indus)
summary(fit2)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
##     tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 **
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 **
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Interaction terms, qualitative predictors and nonlinearities
library(ISLR)
fix(Wage)
names(Wage)
## [1] "year" "age" "maritl" "race" "education"
## [6] "region" "jobclass" "health" "health_ins" "logwage"
## [11] "wage"
help(Wage)
Examine the data:
range(Wage$year)
## [1] 2003 2009
class(Wage$education)
## [1] "factor"
levels(Wage$education)
## [1] "1. < HS Grad" "2. HS Grad" "3. Some College"
## [4] "4. College Grad" "5. Advanced Degree"
sum(Wage$education == '3. Some College')
## [1] 650
# Construct new data set
new.Wage = Wage[Wage$year==2003,]
# Exclude year
new.Wage = new.Wage[,c(2:11)]
Mincer-Becker human capital hypothesis (non-linear relationship between age and wage):
fit3 = lm(wage ~ age, data=new.Wage)
summary(fit3)
##
## Call:
## lm(formula = wage ~ age, data = new.Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.792 -25.129 -6.664 17.898 210.758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept)  79.0119     6.6569  11.869  < 2e-16 ***
## age           0.6463     0.1523   4.244 2.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41 on 511 degrees of freedom
## Multiple R-squared: 0.03405, Adjusted R-squared: 0.03215
## F-statistic: 18.01 on 1 and 511 DF, p-value: 2.61e-05
# Investigate the patterns in the data
plot(new.Wage$age, new.Wage$wage)
[Figure: scatter plot of wage against age]
# Normality of residuals, outliers?
par(mfrow=c(2,2))
plot(fit3)
[Figure: diagnostic plots for fit3: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
par(mfrow=c(1,1))
The lm function can accommodate non-linear transformations of the predictors.
# linear regression with a polynomial term
fit4 = lm(wage ~ age + I(age^2), data=new.Wage)
summary(fit4)
##
## Call:
## lm(formula = wage ~ age + I(age^2), data = new.Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.179 -23.981 -4.117 16.785 205.308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.067585 18.033034 -0.336 0.737
## age 4.870285 0.848212 5.742 1.61e-08 ***
## I(age^2) -0.048464 0.009581 -5.058 5.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.05 on 510 degrees of freedom
## Multiple R-squared: 0.08019, Adjusted R-squared: 0.07658
## F-statistic: 22.23 on 2 and 510 DF, p-value: 5.533e-10
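Both terms are significant and the quadratic coefficient is negative, consistent with the inverted-U age profile. As a quick check (not in the original script), the fitted parabola b0 + b1*age + b2*age^2 peaks at age -b1/(2*b2):
b1 = coef(fit4)["age"]        #  4.870285
b2 = coef(fit4)["I(age^2)"]   # -0.048464
-b1 / (2 * b2)                # wage-maximizing age: roughly 50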
plot(new.Wage$age, new.Wage$wage)
# Include the quadratic fit. We can't use abline anymore, because that only works for a straight line.
points(new.Wage$age, (coef(fit4)[1]+coef(fit4)[2]*new.Wage$age+coef(fit4)[3]*new.Wage$age^2),
       pch=20, col='red')
[Figure: scatter of wage against age with the quadratic fit overlaid in red]
Alternatively:
plot(new.Wage$age, new.Wage$wage)
points(new.Wage$age, fitted(fit4), pch=20, col='red')
[Figure: the same plot, produced with fitted(fit4)]
# easier way to fit polynomials in R
fit5 = lm(wage ~ poly(age,8), data=new.Wage)
plot(new.Wage$age, new.Wage$wage)
points(new.Wage$age, fitted(fit4), pch=20, col='red')
points(new.Wage$age, fitted(fit5), pch=20, col='green')
[Figure: scatter of wage against age with the quadratic fit (red) and the degree-8 polynomial fit (green)]
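A side note on poly(): by default it uses orthogonal polynomials, so the individual coefficients of fit5 are not directly comparable to coefficients on raw powers of age; the fitted curve is what matters. With raw = TRUE you get ordinary powers, e.g. for degree 2 the coefficients coincide with fit4 (a quick check, not in the original script; fit4_raw is a hypothetical name):
fit4_raw = lm(wage ~ poly(age, 2, raw = TRUE), data = new.Wage)  # ordinary powers age and age^2
cbind(coef(fit4), coef(fit4_raw))  # same estimates, different coefficient names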
# Adding race and education
fit6 = lm(wage ~ age + I(age^2) + race + education, data = new.Wage)
summary(fit6)
##
## Call:
## lm(formula = wage ~ age + I(age^2) + race + education, data = new.Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.66 -19.42 -4.10 14.11 209.05
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 -1.216273  16.630304  -0.073 0.941727
## age                          3.550465   0.777016   4.569 6.16e-06 ***
## I(age^2)                    -0.034887   0.008762  -3.981 7.86e-05 ***
## race2. Black                -2.491223   6.105730  -0.408 0.683437
## race3. Asian                11.132666   6.625862   1.680 0.093542 .
## race4. Other                 7.060022  12.932366   0.546 0.585364
## education2. HS Grad         10.023424   5.655510   1.772 0.076945 .
## education3. Some College    21.952763   5.982125   3.670 0.000269 ***
## education4. College Grad    39.889255   6.010439   6.637 8.32e-11 ***
## education5. Advanced Degree 53.849743   6.637965   8.112 3.83e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.13 on 503 degrees of freedom
## Multiple R-squared: 0.2618, Adjusted R-squared: 0.2486
## F-statistic: 19.83 on 9 and 503 DF, p-value: < 2.2e-16
summary(fit6)$adj.r.squared
## [1] 0.2486383
# Dummies are generated automatically from factor variables.
# The contrasts function shows you how R will code a variable when it is put in a linear model:
contrasts(new.Wage$education)
##                    2. HS Grad 3. Some College 4. College Grad 5. Advanced Degree
## 1. < HS Grad                0               0               0                  0
## 2. HS Grad                  1               0               0                  0
## 3. Some College             0               1               0                  0
## 4. College Grad             0               0               1                  0
## 5. Advanced Degree          0               0               0                  1
# Drop race from the model:
fit7 = update(fit6, ~.-race)
summary(fit7)
##
## Call:
## lm(formula = wage ~ age + I(age^2) + education, data = new.Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.153 -19.326 -4.244 14.084 208.642
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 -1.112027  16.591625  -0.067 0.946590
## age                          3.600899   0.776115   4.640 4.45e-06 ***
## I(age^2)                    -0.035781   0.008744  -4.092 4.98e-05 ***
## education2. HS Grad          9.844077   5.655088   1.741 0.082336 .
## education3. Some College    21.631993   5.978405   3.618 0.000326 ***
## education4. College Grad    40.424315   5.997684   6.740 4.33e-11 ***
## education5. Advanced Degree 54.780530   6.604022   8.295 9.90e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
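The dummy coding above implies that the first level, '1. < HS Grad', is the baseline absorbed into the intercept. If another baseline is preferred, relevel the factor before fitting (an illustrative sketch; the column name education2 is hypothetical):
new.Wage$education2 = relevel(new.Wage$education, ref = "2. HS Grad")  # make HS Grad the baseline
coef(lm(wage ~ age + I(age^2) + education2, data = new.Wage))          # dummies now relative to HS graduates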
# Interaction terms
fit8 = lm(wage ~ age + I(age^2) + education*race, data=new.Wage)
summary(fit8)
##
## Call:
## lm(formula = wage ~ age + I(age^2) + education * race, data = new.Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.161 -18.228 -2.322 13.726 208.504
##
## Coefficients: (1 not defined because of singularities)
##
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -2.665e+00  1.685e+01  -0.158 0.874350
## age                                       3.570e+00  7.796e-01   4.579 5.93e-06 ***
## I(age^2)                                 -3.494e-02  8.783e-03  -3.979 7.98e-05 ***
## education2. HS Grad                       1.144e+01  6.120e+00   1.869 0.062160 .
## education3. Some College                  2.330e+01  6.502e+00   3.584 0.000373 ***
## education4. College Grad                  4.041e+01  6.475e+00   6.241 9.39e-10 ***
## education5. Advanced Degree               5.305e+01  7.116e+00   7.455 4.09e-13 ***
## race2. Black                              1.505e+01  1.696e+01   0.887 0.375536
## race3. Asian                             -1.820e+01  2.145e+01  -0.849 0.396391
## race4. Other                              5.011e+01  3.629e+01   1.381 0.167975
## education2. HS Grad:race2. Black         -2.257e+01  1.959e+01  -1.152 0.249946
## education3. Some College:race2. Black    -1.468e+01  1.987e+01  -0.739 0.460263
## education4. College Grad:race2. Black    -3.740e+01  2.496e+01  -1.498 0.134717
## education5. Advanced Degree:race2. Black -2.776e+00  3.996e+01  -0.069 0.944655
## education2. HS Grad:race3. Asian          2.472e+01  2.616e+01   0.945 0.345182
## education3. Some College:race3. Asian    -5.764e+00  2.823e+01  -0.204 0.838303
## education4. College Grad:race3. Asian     4.569e+01  2.430e+01   1.880 0.060657 .
## education5. Advanced Degree:race3. Asian  3.936e+01  2.499e+01   1.575 0.115913
## education2. HS Grad:race4. Other         -4.431e+01  4.062e+01  -1.091 0.275859
## education3. Some College:race4. Other    -2.538e+01  4.442e+01  -0.571 0.567961
## education4. College Grad:race4. Other    -1.188e+02  5.118e+01  -2.322 0.020660 *
## education5. Advanced Degree:race4. Other         NA         NA      NA       NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.88 on 492 degrees of freedom
## Multiple R-squared: 0.288, Adjusted R-squared: 0.259
## F-statistic: 9.948 on 20 and 492 DF, p-value: < 2.2e-16
Now use fit6 (no interactions) to predict wage for different groups:
predict(fit6, newdata = data.frame(age=35, race = '2. Black', education = '1. < HS Grad'))
## 1
## 77.82192
predict(fit6, newdata = data.frame(age=35, race = '1. White', education = '1. < HS Grad'))
## 1
## 80.31314
predict(fit6, newdata = data.frame(age=35, race = '2. Black', education = '4. College Grad'))
## 1
## 117.7112
predict(fit6, data.frame(age=35, race = '1. White', education = '4. College Grad'))
## 1
## 120.2024
Note that the difference between the White and Black predictions, 80.31314 - 77.82192 = 120.2024 - 117.7112 = 2.49122, equals minus the race2. Black coefficient of fit6: without interaction terms the group gap is the same at every age and education level.
Generalized error loss function:
$$L_{GEL}(Y - \hat{Y}; \alpha, p) = \left[\alpha + (1 - 2\alpha)\,\mathbf{1}_{\{Y < \hat{Y}\}}\right] \cdot |Y - \hat{Y}|^{p}$$
The parameter α ∈ (0, 1) controls the asymmetry of the loss and p its power. For α = 0.5 and p = 2 the loss is proportional to the squared error, so minimizing it should reproduce the least-squares estimates of fit4:
summary(fit4)
##
## Call:
## lm(formula = wage ~ age + I(age^2), data = new.Wage)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.067585 18.033034 -0.336 0.737
## age 4.870285 0.848212 5.742 1.61e-08 ***
## I(age^2) -0.048464 0.009581 -5.058 5.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.05 on 510 degrees of freedom
## Multiple R-squared: 0.08019, Adjusted R-squared: 0.07658
## F-statistic: 22.23 on 2 and 510 DF, p-value: 5.533e-10
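Minimizing this loss numerically requires an R function of the coefficient vector. A minimal sketch of such an objective function, assuming α = 0.5 and p = 2 (so it is proportional to the residual sum of squares) and a design matrix X holding age and age^2 without an intercept column:
# Design matrix (without intercept) and response for the quadratic wage model
X = cbind(new.Wage$age, new.Wage$age^2)
y = new.Wage$wage
# Sum of generalized error losses for coefficient vector b = (intercept, slopes);
# alpha = 0.5 and p = 2 are assumed here, which makes this proportional to the RSS
objective_function = function(b, alpha = 0.5, p = 2) {
  e = y - (b[1] + X %*% b[-1])                          # residuals Y - Yhat
  sum((alpha + (1 - 2 * alpha) * (e < 0)) * abs(e)^p)   # asymmetric power loss
}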
Initial values:
start.b = c(-6, 4, 0)
Use the optim function to minimize the objective function:
opt.b = optim(start.b, objective_function, method = "BFGS")$par
cat("The estimated coefficients are", opt.b)
## The estimated coefficients are -5.963193 4.865477 -0.04841186
opt.b_0.1 = optim(start.b, objective_function, method = 'BFGS')$par
cat("The estimated coefficients are", opt.b_0.1)
## The estimated coefficients are -5.963193 4.865477 -0.04841186
plot(new.Wage$age, new.Wage$wage)
points(new.Wage$age, opt.b[1] + X %*% opt.b[2:length(start.b)], pch=20, col='red')
points(new.Wage$age, opt.b_0.1[1] + X %*% opt.b_0.1[2:length(start.b)], pch=20, col='green')
[Figure: scatter of wage against age with the two optim-based fits overlaid in red and green]
to_compare = lm(wage ~ age + I(age^2) + education, data=new.Wage)
summary(to_compare)
##
## Call:
## lm(formula = wage ~ age + I(age^2) + education, data = new.Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.153 -19.326 -4.244 14.084 208.642
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 -1.112027  16.591625  -0.067 0.946590
## age                          3.600899   0.776115   4.640 4.45e-06 ***
## I(age^2)                    -0.035781   0.008744  -4.092 4.98e-05 ***
## education2. HS Grad          9.844077   5.655088   1.741 0.082336 .
## education3. Some College    21.631993   5.978405   3.618 0.000326 ***
## education4. College Grad    40.424315   5.997684   6.740 4.33e-11 ***
## education5. Advanced Degree 54.780530   6.604022   8.295 9.90e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.14 on 506 degrees of freedom
## Multiple R-squared: 0.2569, Adjusted R-squared: 0.2481
## F-statistic: 29.16 on 6 and 506 DF, p-value: < 2.2e-16
#install.packages("dummies")
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
education_dummies = dummy(new.Wage$education)
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
Initial values:
start.b = c(-1, 4, 0, 10, 22, 40, 55)
X = cbind(new.Wage$age, I(new.Wage$age^2), education_dummies[,2], education_dummies[,3],
          education_dummies[,4], education_dummies[,5])
y = new.Wage$wage
opt.b = optim(start.b, objective_function, method = "BFGS")$par
cat("The estimated coefficients are", opt.b)
## The estimated coefficients are -1.070875 3.598445 -0.03575212 9.847834 21.64831 40.43435 54.77937
These estimates are very close to the least-squares coefficients of to_compare, as expected for a symmetric loss with p = 2.