Introduction to information system

Linear Regression

Bowei Chen, Deema Hafeth and Jingmin Huang

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

Today’s Objectives

• Study the slides in Part I, including:

– Implementation of linear regression in R

– Interpretation of results of linear regression in R

• Do the exercises and advanced exercises in Part II and Part III

– Part II is compulsory and you should be able to complete Part II during

the workshop.

– Part III is a bit challenging but will help you to further understand the

lecture contents. If you’ve completed Part II, please try Part III! 

Part I:

Examples of Linear Regression in R

Example (1/13)

This is an example of implementing linear regression models in R.

We will use the R dataset Cars93 in the MASS library

> library(MASS)

> df <- Cars93

> dim(df)

[1] 93 27

Use the dim() function to see the size of the data. There are 93
observations (rows) and 27 variables (columns) in the dataset.

Example (2/13)

> head(df,3)

  Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags DriveTrain
1        Acura Integra   Small      12.9  15.9      18.8       25          31               None      Front
2        Acura  Legend Midsize      29.2  33.9      38.7       18          25 Driver & Passenger      Front
3         Audi      90 Compact      25.9  29.1      32.3       20          26        Driver only      Front

  Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1         4        1.8        140 6300         2890             Yes               13.2          5    177
2         6        3.2        200 5500         2335             Yes               18.0          5    195
3         6        2.8        172 5500         2280             Yes               16.9          5    180

  Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1       102    68          37           26.5           11   2705 non-USA Acura Integra
2       115    71          38           30.0           15   3560 non-USA  Acura Legend
3       102    67          37           28.0           14   3375 non-USA       Audi 90

Use the head() function to look at a few sample observations of the data.
This is an important step in data analysis!

Example (3/13)

> sapply(df, class)

      Manufacturer              Model               Type          Min.Price              Price          Max.Price
          "factor"           "factor"           "factor"          "numeric"          "numeric"          "numeric"

          MPG.city        MPG.highway            AirBags         DriveTrain          Cylinders         EngineSize
         "integer"          "integer"           "factor"           "factor"           "factor"          "numeric"

        Horsepower                RPM       Rev.per.mile    Man.trans.avail Fuel.tank.capacity         Passengers
         "integer"          "integer"          "integer"           "factor"          "numeric"          "integer"

            Length          Wheelbase              Width        Turn.circle     Rear.seat.room       Luggage.room
         "integer"          "integer"          "integer"          "integer"          "numeric"          "integer"

            Weight             Origin               Make
         "integer"           "factor"           "factor"

Use sapply() with class() to see the data type of each variable.
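The class listing above can be summarised by counting how many variables fall into each type. The following is a minimal sketch that is not part of the original slides; the names col_classes, class_counts and numeric_cols are illustrative only.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93

# Class of every column, then the number of columns per class
col_classes <- sapply(df, class)
class_counts <- table(col_classes)
print(class_counts)

# Names of the columns stored as numbers (integer or numeric),
# which can be used directly as numeric predictors
numeric_cols <- names(df)[sapply(df, is.numeric)]
print(numeric_cols)
```

Factor columns such as Type or Origin are categorical and are not used in the regression examples of this workshop.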

Example (4/13)

> plot(df$Horsepower, df$Price,
+      xlab = "Horsepower",
+      ylab = "Price")

Let’s look at two variables of the cars: horsepower and price.
Are they correlated?
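Besides looking at the scatter plot, the strength of the linear relationship can be measured with the sample (Pearson) correlation coefficient. The sketch below is an addition to the slides and uses the base R function cor(); the variable name r is illustrative.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93

# Pearson correlation between horsepower and price
r <- cor(df$Horsepower, df$Price)
print(r)

# For a simple linear regression with an intercept, the squared
# correlation equals the R-squared reported by summary(lm(...))
print(r^2)
```

A value of r close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates a weak one.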

Example (5/13)
> # Simple linear regression (method 2)
> x <- df$Horsepower   # predictor
> y <- df$Price        # response
> model <- lm(y ~ x)
> model$coefficients
(Intercept)           x
 -1.3987691   0.1453712

> beta0 <- model$coefficients[1]
> beta1 <- model$coefficients[2]
>
> plot(df$Horsepower, df$Price,
+      xlab = "Horsepower",
+      ylab = "Price")
> y_hat_vec <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat_vec, lty = 2, col = 4)
> legend(50,
+        30,
+        lty = 2,
+        col = 4,
+        "Regression line")

Estimate the parameters of a simple linear regression model by using the R function lm()
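The estimates returned by lm() can be checked against the closed-form least-squares solution for a single predictor: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. This is a minimal sketch added for illustration; the names beta0_manual, beta1_manual and fit are not from the slides.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93
x <- df$Horsepower  # predictor
y <- df$Price       # response

# Closed-form least-squares estimates for y = beta0 + beta1 * x
beta1_manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_manual <- mean(y) - beta1_manual * mean(x)

# Compare with the estimates from lm()
fit <- lm(y ~ x)
print(c(beta0_manual, beta1_manual))
print(unname(coef(fit)))
```

Both approaches should agree up to floating-point rounding.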

> residuals_vec <- df$Price - y_hat_vec
> summary(residuals_vec)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-16.4100  -2.7920  -0.8208   0.0000   1.8030  31.7500
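The residuals can be condensed into a single number. The sketch below, added for illustration, computes the sum of squared errors (SSE) and from it the residual standard error, dividing by n - 2 because two parameters (intercept and slope) are estimated. The names fit, res, sse and rse are illustrative.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93
fit <- lm(Price ~ Horsepower, data = df)

res <- residuals(fit)        # observed price minus fitted price
sse <- sum(res^2)            # sum of squared errors
n <- nrow(df)                # number of observations
rse <- sqrt(sse / (n - 2))   # residual standard error

print(mean(res))  # essentially zero for a model with an intercept
print(sse)
print(rse)
```

The value of rse should match the "Residual standard error" line printed by summary() for the same model.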

Example (6/13)
> summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-16.413  -2.792  -0.821   1.803  31.753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.3988     1.8200  -0.769    0.444
x             0.1454     0.0119  12.218   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared:  0.6213,  Adjusted R-squared:  0.6171
F-statistic: 149.3 on 1 and 91 DF,  p-value: < 2.2e-16

The residual here means the error $y_i - \hat{y}_i$, that is, the observed value minus the fitted value.

Example (7/13)

> summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-16.413  -2.792  -0.821   1.803  31.753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.3988     1.8200  -0.769    0.444
x             0.1454     0.0119  12.218   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared:  0.6213,  Adjusted R-squared:  0.6171
F-statistic: 149.3 on 1 and 91 DF,  p-value: < 2.2e-16

Std. Error is the standard deviation of the sampling distribution of the coefficient estimate under standard regression assumptions.

It should be noted that you are not required to understand how standard errors are calculated. However, if you are interested, please read Casella’s book, Chapters 11-12.

Example (8/13)

> summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-16.413  -2.792  -0.821   1.803  31.753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.3988     1.8200  -0.769    0.444
x             0.1454     0.0119  12.218   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared:  0.6213,  Adjusted R-squared:  0.6171
F-statistic: 149.3 on 1 and 91 DF,  p-value: < 2.2e-16

• t value is the t-statistic for testing whether the corresponding regression coefficient is different from 0.

• Pr(>|t|) is the p-value of that test. The null hypothesis is that the coefficient is zero, so a small p-value (for example, below 0.05) is evidence that the coefficient differs from zero.

It should be noted that you are not required to understand how the t value and p-value are calculated. However, if you are interested, please read Casella’s book, Chapters 11-12.
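For readers who want to see where the Std. Error, t value and Pr(>|t|) columns come from, the following sketch reproduces them for the slope of the simple model. It is an optional illustration, not required material. It assumes the standard textbook formulas for simple linear regression, and the names sigma_hat, se_beta1, t_beta1 and p_beta1 are illustrative.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93
x <- df$Horsepower
y <- df$Price
fit <- lm(y ~ x)

n <- length(y)

# Residual standard error: sqrt(SSE / (n - 2))
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - 2))

# Standard error of the slope estimate
se_beta1 <- sigma_hat / sqrt(sum((x - mean(x))^2))

# t-statistic for the null hypothesis "slope = 0"
t_beta1 <- unname(coef(fit)["x"]) / se_beta1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_beta1 <- 2 * pt(abs(t_beta1), df = n - 2, lower.tail = FALSE)

print(c(se = se_beta1, t = t_beta1, p = p_beta1))

# The same row as reported by summary()
print(coef(summary(fit))["x", ])
```

The t value is simply the estimate divided by its standard error, which is why a small standard error relative to the estimate leads to a small p-value.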

Example (9/13)
> summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-16.413  -2.792  -0.821   1.803  31.753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.3988     1.8200  -0.769    0.444
x             0.1454     0.0119  12.218   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared:  0.6213,  Adjusted R-squared:  0.6171
F-statistic: 149.3 on 1 and 91 DF,  p-value: < 2.2e-16

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, simply defined by

$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$

In general, the higher the R-squared, the better the model fits your data.

It should be noted that you are not required to understand how R-squared, multiple R-squared, adjusted R-squared and their tests are calculated. However, if you are interested, please read Casella’s book, Chapters 11-12.

Example (10/13)

Prediction

If a new Audi A4 has 175 horsepower, what selling price does the model predict for it?

> # Prediction

> x_i <- 175
> y_hat_i <- beta1 * x_i + beta0
>
> plot(df$Horsepower, df$Price,
+      xlab = "Horsepower",
+      ylab = "Price")
> y_hat <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat, lty = 2, col = 4)
> points(x_i, y_hat_i, col = 2, pch = 9)
> legend(75,
+        50,
+        lty = c(2, NA),
+        pch = c(NA, 9),
+        col = c(4, 2),
+        c("Regression line", "New Audi A4"))
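The same prediction can be obtained with the built-in predict() function instead of computing beta1 * x_i + beta0 by hand. This sketch is an addition to the slides. It refits the model with the column names Price and Horsepower (rather than y and x) so that the new data frame can use the same column name, and the names fit, new_car, price_hat and price_int are illustrative.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93
fit <- lm(Price ~ Horsepower, data = df)

# The new observation must use the same column name as the predictor
new_car <- data.frame(Horsepower = 175)

# Point prediction
price_hat <- predict(fit, newdata = new_car)
print(price_hat)

# Point prediction together with a 95% prediction interval
price_int <- predict(fit, newdata = new_car,
                     interval = "prediction", level = 0.95)
print(price_int)
```

With the coefficients shown earlier, the point prediction is about 24.04. Prices in Cars93 are recorded in thousands of US dollars. The prediction interval shows how uncertain a single predicted price is.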

Example (11/13)

> attach(df)

> pairs(

+ data.frame(

+ MPG.city,

+ MPG.highway,

+ EngineSize,

+ Horsepower,

+ Fuel.tank.capacity,

+ Length,

+ Width,

+ Rear.seat.room,

+ Luggage.room

+ )

+ )

> detach(df)

Let’s look at several variables of the cars at once
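Before fitting a model on these variables, it can help to check for missing values and to put numbers on the pairwise relationships shown in the scatter-plot matrix. The sketch below is an addition to the slides; the names vars, na_counts, n_complete and cor_mat are illustrative.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93
vars <- c("MPG.city", "MPG.highway", "EngineSize", "Horsepower",
          "Fuel.tank.capacity", "Length", "Width",
          "Rear.seat.room", "Luggage.room")

# Number of missing values in each variable
na_counts <- colSums(is.na(df[, vars]))
print(na_counts)

# Number of cars with no missing value in Price or any of these variables
n_complete <- sum(complete.cases(df[, c("Price", vars)]))
print(n_complete)

# Pairwise correlations, computed on the complete rows only
cor_mat <- cor(df[, vars], use = "complete.obs")
print(round(cor_mat, 2))
```

By default lm() drops rows that contain missing values, so a model using all of these variables is fitted on fewer than 93 cars.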

Example (12/13)

> attach(df)

> model.multiple <-
+   lm(
+     Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room
+   )

> detach(df)

> model.multiple$coefficients

       (Intercept)           MPG.city        MPG.highway         EngineSize         Horsepower Fuel.tank.capacity             Length
        59.1474034          0.2363122         -0.3766282          1.8048313          0.1290087          0.6154648          0.1150924

             Width     Rear.seat.room       Luggage.room
        -1.3785983          0.1206144          0.2735771

Estimate parameters of a multiple linear

regression model by using R function

Example (13/13)

> summary(model.multiple)

Call:
lm(formula = Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room)

Residuals:
     Min       1Q   Median       3Q      Max
-11.7444  -3.7098  -0.2932   2.9824  28.7627

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        59.14740   27.51934   2.149  0.03497 *
MPG.city            0.23631    0.44678   0.529  0.59848
MPG.highway        -0.37663    0.44106  -0.854  0.39598
EngineSize          1.80483    1.85233   0.974  0.33314
Horsepower          0.12901    0.02576   5.008 3.78e-06 ***
Fuel.tank.capacity  0.61546    0.50620   1.216  0.22801
Length              0.11509    0.11504   1.000  0.32044
Width              -1.37860    0.49336  -2.794  0.00666 **
Rear.seat.room      0.12061    0.33957   0.355  0.72348
Luggage.room        0.27358    0.39166   0.699  0.48711
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.868 on 72 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.6914,  Adjusted R-squared:  0.6528
F-statistic: 17.92 on 9 and 72 DF,  p-value: 3.547e-15
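The R-squared and adjusted R-squared values of the multiple regression can be reproduced from the residuals. This optional sketch is an addition to the slides. It assumes the usual definitions R-squared = 1 - SSE/SST and adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations actually used and p is the number of predictors. The names fit_multi, y_used, sse, sst, r2 and adj_r2 are illustrative.

```r
library(MASS)  # provides the Cars93 data set

df <- Cars93
fit_multi <- lm(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower +
                  Fuel.tank.capacity + Length + Width +
                  Rear.seat.room + Luggage.room, data = df)

# Response values of the rows that lm() actually used
# (rows with missing values are dropped by default)
y_used <- model.response(model.frame(fit_multi))
n <- length(y_used)
p <- length(coef(fit_multi)) - 1   # number of predictors, without intercept

sse <- sum(residuals(fit_multi)^2)        # unexplained variation
sst <- sum((y_used - mean(y_used))^2)     # total variation

r2 <- 1 - sse / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(n = n, p = p, R2 = r2, adjR2 = adj_r2))
```

Adjusted R-squared penalises additional predictors, so it is the more suitable of the two when comparing models with different numbers of predictors.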

Part II:

Exercises

Exercise 1/3

a) Download and import the “Housing.csv” data

b) Scatter plot the two variables price and house size

c) Take price as the response variable and house size as the predictor variable.

Implement your own simple linear regression model – calculate the parameters’

values directly [Hint: check the lecture slides]

d) Plot the data points and the “predicting line”. Does this line fit the data well?

Exercise 2/3

a) Now use the R built-in linear regression function “lm()” to generate the simple linear
regression model for the “Housing.csv” data

– Still examine the two variables: price and house size

– Do you get the same intercept and slope values?

b) Use R built-in function “predict()” to get the predicted price value when the house
size is 7000 sq ft [Hint: use ‘help(predict)’ or Google to learn how to pass the

parameters]

c) Calculate the sum of squared errors (SSE)

[Hint: you can get SSE by using the predicted price values and the real price values]

Exercise 3/3

a) Build a multiple linear regression model based on all predictors for the “Housing.csv”

data by using ‘lm’ function

b) What is the SSE now?

– How much does it differ from the previous SSE value?

– Do more predictor variables give better results?

c) Can we remove some predictor variables for the multiple linear regression? If so,

which variables do you think can be removed?

Part III:

Advanced Exercises

Advanced Exercise 1/2

a) Generate two series of random numbers, each with 1000 numbers drawn from 𝑈(0,100)

b) Build a simple linear regression model based on these two series

– Do you think it is an appropriate model?

c) What is the R-squared value for this trained model? Why do we need this value?

Advanced Exercise 2/2

For a multiple linear regression, please complete the following three tasks:

a) Estimate the parameters by using the least squared errors (LSE) method

b) Estimate the parameters by using the maximum likelihood method

c) Implement your multiple linear regression models based on the “Housing.csv” data

and compare your results with the results by using the R function lm

Thank You

bchen@lincoln.ac.uk

dabdalhafeth@lincoln.ac.uk

jhua8590@gmail.com
