Linear Regression
Bowei Chen, Deema Hafeth and Jingmin Huang
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
Today’s Objectives
• Study the slides in Part I, including:
– Implementation of linear regression in R
– Interpretation of results of linear regression in R
• Do the exercises and advanced exercises in Part II and Part III
– Part II is compulsory and you should be able to complete Part II during
the workshop.
– Part III is a bit challenging but will help you to further understand the
lecture contents. If you’ve completed Part II, please try Part III!
Part I:
Examples of Linear Regression in R
Example (1/13)
This is an example of implementing linear regression models in R.
We will use the R dataset Cars93 in the MASS library
> library(MASS)
> df <- Cars93
> dim(df)
[1] 93 27
Use the dim() function to see the size of the data. There are 93
observations and 27 features/predictors in the dataset
Example (2/13)
> head(df,3)
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain
1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front
2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front
3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only Front
Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1 4 1.8 140 6300 2890 Yes 13.2 5 177
2 6 3.2 200 5500 2335 Yes 18.0 5 195
3 6 2.8 172 5500 2280 Yes 16.9 5 180
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1 102 68 37 26.5 11 2705 non-USA Acura Integra
2 115 71 38 30.0 15 3560 non-USA Acura Legend
3 102 67 37 28.0 14 3375 non-USA Audi 90
Use the head() function to look at a few
sample observations of the data. This
is an important step in data analysis!
Example (3/13)
> sapply(df, class)
Manufacturer Model Type Min.Price Price Max.Price
"factor" "factor" "factor" "numeric" "numeric" "numeric"
MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize
"integer" "integer" "factor" "factor" "factor" "numeric"
Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
"integer" "integer" "integer" "factor" "numeric" "integer"
Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room
"integer" "integer" "integer" "integer" "numeric" "integer"
Weight Origin Make
"integer" "factor" "factor"
Use sapply() to see the data
type of each variable
Example (4/13)
> plot(df$Horsepower, df$Price,
+      xlab = "Horsepower",
+      ylab = "Price")
Let's look at two variables of cars:
horsepower and price. Are they
correlated?
Example (5/13)
> # Simple linear regression (method 2) -----------------
> x <- df$Horsepower
> y <- df$Price
> model <- lm(y ~ x)
> model$coefficients
(Intercept)           x
 -1.3987691   0.1453712
> beta0 <- model$coefficients[1]
> beta1 <- model$coefficients[2]
>
> plot(df$Horsepower, df$Price,
+      xlab = "Horsepower",
+      ylab = "Price")
> y_hat_vec <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat_vec, lty = 2, col = 4)
> legend(50,
+        30,
+        lty = 2,
+        col = 4,
+        "Regression line")
Estimate parameters of a simple linear regression model using the R function lm()
> residuals_vec <- df$Price - y_hat_vec
> summary(residuals_vec)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-16.4100  -2.7920  -0.8208   0.0000   1.8030  31.7500
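As a quick sanity check (a sketch, not part of the original slides), the manually computed residuals should match the residuals R stores in the fitted model:

```r
# Sketch: verify manual residuals against R's own, on the same Cars93 data
library(MASS)

df <- Cars93
model <- lm(Price ~ Horsepower, data = df)

# Manual residuals: observed price minus fitted price
y_hat_vec <- coef(model)[2] * df$Horsepower + coef(model)[1]
residuals_vec <- df$Price - y_hat_vec

# Should agree with residuals(model) up to floating-point error
max(abs(residuals_vec - residuals(model)))
```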
Example (6/13)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-16.413 -2.792 -0.821 1.803 31.753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16

The residual here means the error y_i − ŷ_i
Estimate parameters of a simple linear regression model using the R function lm()

Example (7/13)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-16.413 -2.792 -0.821 1.803 31.753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16

Std. Error is the standard deviation of the sampling distribution of the
coefficient estimate under standard regression assumptions. You are not
required to understand how standard errors are calculated; if you are
interested, please read Casella's book, Chapters 11-12.
Estimate parameters of a simple linear regression model using the R function lm()

Example (8/13)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-16.413 -2.792 -0.821 1.803 31.753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16
• t value is the t-statistic for testing
whether the corresponding regression
coefficient is different from 0.
• Pr(>|t|) is the p-value for that
hypothesis test; the null hypothesis is
that the coefficient is zero.
You are not required to understand how
the t value and p-value are calculated;
if you are interested, please read
Casella's book, Chapters 11-12.
Estimate parameters of a simple linear regression model using the R function lm()
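To make the connection concrete, the t value and p-value in the coefficient table can be reproduced from the estimate and standard error (a sketch; it refits the same Cars93 model so it is self-contained):

```r
# Sketch: reproduce the slope's t value and p-value by hand
library(MASS)

model <- lm(Price ~ Horsepower, data = Cars93)
coefs <- coef(summary(model))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

estimate  <- coefs["Horsepower", "Estimate"]
std_error <- coefs["Horsepower", "Std. Error"]

t_value <- estimate / std_error                            # t statistic
p_value <- 2 * pt(-abs(t_value), df = model$df.residual)   # two-sided p-value

# Both should match the summary table
c(t_value, coefs["Horsepower", "t value"])
```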
Example (9/13)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-16.413 -2.792 -0.821 1.803 31.753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16

R-squared is a statistical measure of how close the data are to the fitted
regression line. It is also known as the coefficient of determination, defined by

R² = Explained variation / Total variation

In general, the higher the R-squared, the better the model fits your data.
You are not required to understand how R-squared, multiple R-squared,
adjusted R-squared and their tests are calculated; if you are interested,
please read Casella's book, Chapters 11-12.
Estimate parameters of a simple linear regression model using the R function lm()

Example (10/13)
Prediction
If a new Audi A4 has 175 horsepower, what is the selling price of this Audi A4?
> # Prediction ------------------------------------------
>
> x_i <- 175
> y_hat_i <- beta1 * x_i + beta0
>
> plot(df$Horsepower, df$Price,
+      xlab = "Horsepower",
+      ylab = "Price")
> y_hat <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat, lty = 2, col = 4)
> points(x_i, y_hat_i, col = 2, pch = 9)
> legend(75,
+        50,
+        lty = c(2, NA),
+        pch = c(NA, 9),
+        col = c(4, 2),
+        c("Regression line", "New Audi A4"))
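The same prediction can be obtained with R's built-in predict() function, which Exercise 2 in Part II also asks you to use (a sketch; it refits the Cars93 model so it runs on its own):

```r
# Sketch: predict() gives the same answer as the manual beta0 + beta1 * x
library(MASS)

model <- lm(Price ~ Horsepower, data = Cars93)

# New car with 175 horsepower; the column name must match the model formula
new_car <- data.frame(Horsepower = 175)
predict(model, newdata = new_car)

# Manual equivalent using the fitted coefficients
coef(model)[1] + coef(model)[2] * 175
```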
Example (11/13)
> attach(df)
> pairs(
+ data.frame(
+ MPG.city,
+ MPG.highway,
+ EngineSize,
+ Horsepower,
+ Fuel.tank.capacity,
+ Length,
+ Width,
+ Rear.seat.room,
+ Luggage.room
+ )
+ )
> detach(df)
Let’s look at many
variables of cars
Example (12/13)
> attach(df)
> model.multiple <-
+   lm(
+     Price ~ MPG.city + MPG.highway + EngineSize + Horsepower +
+       Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room
+   )
> detach(df)
> model.multiple$coefficients
       (Intercept)           MPG.city        MPG.highway         EngineSize         Horsepower
        59.1474034          0.2363122         -0.3766282          1.8048313          0.1290087
Fuel.tank.capacity             Length              Width     Rear.seat.room       Luggage.room
         0.6154648          0.1150924         -1.3785983          0.1206144          0.2735771
Estimate parameters of a multiple linear
regression model using the R function lm()
> summary(model.multiple)
Call:
lm(formula = Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room)
Residuals:
Min 1Q Median 3Q Max
-11.7444 -3.7098 -0.2932 2.9824 28.7627
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.14740 27.51934 2.149 0.03497 *
MPG.city 0.23631 0.44678 0.529 0.59848
MPG.highway -0.37663 0.44106 -0.854 0.39598
EngineSize 1.80483 1.85233 0.974 0.33314
Horsepower 0.12901 0.02576 5.008 3.78e-06 ***
Fuel.tank.capacity 0.61546 0.50620 1.216 0.22801
Length 0.11509 0.11504 1.000 0.32044
Width -1.37860 0.49336 -2.794 0.00666 **
Rear.seat.room 0.12061 0.33957 0.355 0.72348
Luggage.room 0.27358 0.39166 0.699 0.48711
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.868 on 72 degrees of freedom (11 observations deleted due to missingness)
Multiple R-squared: 0.6914, Adjusted R-squared: 0.6528
F-statistic: 17.92 on 9 and 72 DF, p-value: 3.547e-15
Example (13/13)
Part II:
Exercises
Exercise 1/3
a) Download and import the "Housing.csv" data
b) Make a scatter plot of the two variables price and house size
c) Take price as the response variable and house size as the predictor variable.
Implement your own simple linear regression model: calculate the parameters'
values directly [Hint: check the lecture slides]
d) Plot the data points and the "predicting line". Does this line fit the data well?
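If you get stuck on part c), the closed-form least-squares estimates can be sketched on made-up data first (the variables below are hypothetical stand-ins; swap in the price and house-size columns from "Housing.csv"):

```r
# Sketch: closed-form simple linear regression on synthetic data
set.seed(1)
house_size <- runif(50, 1000, 5000)                   # hypothetical predictor
price <- 50 + 0.03 * house_size + rnorm(50, sd = 5)   # hypothetical response

# Slope = sample covariance / sample variance; intercept from the means
beta1 <- cov(house_size, price) / var(house_size)
beta0 <- mean(price) - beta1 * mean(house_size)

# lm() should produce the same estimates
coef(lm(price ~ house_size))
c(beta0, beta1)
```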
Exercise 2/3
a) Now use the R built-in linear regression function lm() to generate the simple linear
regression model for the "Housing.csv" data
– Still examine the two variables: price and house size
– Do you get the same intercept and slope values?
b) Use the R built-in function predict() to get the predicted price value when the house
size is 7000 sq ft [Hint: use 'help(predict)' or Google to learn how to pass the
parameters]
c) Calculate the sum of squared errors (SSE)
[Hint: you can get the SSE by using the predicted price values and the real price values]
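The SSE asked for here is just the sum of squared differences between observed and predicted values; a sketch on hypothetical data (replace x and y with the "Housing.csv" columns):

```r
# Sketch: sum of squared errors (SSE) from a fitted lm model
set.seed(2)
x <- runif(50, 0, 100)            # hypothetical predictor
y <- 2 + 0.5 * x + rnorm(50)      # hypothetical response

model <- lm(y ~ x)
y_pred <- predict(model)          # fitted values on the training data

sse <- sum((y - y_pred)^2)
# Equivalent shortcut: sum(residuals(model)^2)
```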
Exercise 3/3
a) Build a multiple linear regression model based on all predictors for the "Housing.csv"
data by using the lm() function
b) What is the SSE now?
– How different is it from the previous SSE value?
– Do more predictor variables give better results?
c) Can we remove some predictor variables from the multiple linear regression? If so,
which variables do you think can be removed?
Part III:
Advanced Exercises
Advanced Exercise 1/2
a) Generate two random number series, each with 1000 numbers following 𝑈(0,100)
b) Build a simple linear regression model based on these two series
– Do you think it is an appropriate model?
c) What is the R-squared value for this trained model? Why do we need this value?
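A sketch of parts a) and b): with two independently generated uniform series, the fit should be poor and the R-squared close to zero, since one series explains almost none of the variation in the other.

```r
# Sketch: regressing one independent random series on another
set.seed(3)
u <- runif(1000, 0, 100)   # first series, U(0, 100)
v <- runif(1000, 0, 100)   # second, independent series

model <- lm(v ~ u)
summary(model)$r.squared   # expect a value near 0
```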
Advanced Exercise 2/2
For a multiple linear regression, please complete the following questions:
a) Estimate the parameters by using the least squares estimation (LSE) method
b) Estimate the parameters by using the maximum likelihood method
c) Implement your multiple linear regression models based on the "Housing.csv" data
and compare your results with the results from the R function lm()
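For part a), the least-squares estimate has the closed form β̂ = (XᵀX)⁻¹Xᵀy, which under Gaussian errors coincides with the maximum-likelihood estimate of the coefficients; a sketch on synthetic data (swap in the "Housing.csv" predictors for the real exercise):

```r
# Sketch: multiple linear regression via the normal equations
set.seed(4)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.1)   # hypothetical data

X <- cbind(1, x1, x2)                        # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # solves (X'X) beta = X'y

# lm() should agree with the normal-equation solution
coef(lm(y ~ x1 + x2))
t(beta_hat)
```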
Thank You
bchen@lincoln.ac.uk
dabdalhafeth@lincoln.ac.uk
jhua8590@gmail.com