Predictive Modelling for House Prices
Version: 1.0
Language: R 3.3.0 and Jupyter notebook
Libraries used:
• car
• ggplot2
• grid
• gridExtra
• RColorBrewer
• reshape2
• scales
• stats
Introduction
This notebook contains the results of the data analysis performed on a set of house sales data from a residential real estate market. The aim of the data analysis is to build a linear regression model from the real estate transaction data that can be used to predict the sales price of a house.
The first section of the notebook shows the exploratory data analysis (EDA) performed to explore and understand the data. It looks at each attribute (variable) in the data to understand the nature and distribution of the attribute values. It also examines the correlation between the variables through visual analysis. A summary at the end highlights the key findings of the EDA.
The second section shows the development of the linear regression model. It details the process used to build the model and shows the model at key points in the development process. The final model is then presented along with an analysis and interpretation of the model. This section concludes with the results of using the model to predict house prices for the data in the development dataset.
The final section provides the details of the model to enable it to be rebuilt. In addition to the model itself, it includes the functions used to transform the data and run the model.
Two datasets were provided for the assignment – training.csv and dev.csv. The exploratory data analysis and the model building were done using the training.csv dataset; the dev.csv dataset was only used to test the generated model.

Load the libraries used in the notebook
In [1]:
library(ggplot2)
library(reshape2)
library(car)
library(stats)
library(scales)
library(grid)
library(gridExtra)
library(RColorBrewer)

Loading required package: carData

Exploratory Data Analysis
Overview of the Training Dataset
In [2]:
# Load the training dataset
houseData <- read.csv("training.csv")
In [3]:
# Display the dimensions
cat("The housing dataset has", dim(houseData)[1], "records, each with", dim(houseData)[2],
    "attributes. The structure is:\n\n")

# Display the structure
str(houseData)

cat("\nThe first few and last few records in the dataset are:")
# Inspect the first few records
head(houseData)
# And the last few
tail(houseData)

cat("\nBasic statistics for each attribute are:")
# Statistical summary
summary(houseData)

cat("The numbers of unique values for each attribute are:")
apply(houseData, 2, function(x) length(unique(x)))

The housing dataset has 10000 records, each with 11 attributes. The structure is:

'data.frame':	10000 obs. of  11 variables:
 $ id         : num  5.54e+09 2.03e+09 2.03e+09 9.48e+09 2.86e+09 ...
 $ price      : int  211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
 $ bedrooms   : int  4 3 3 4 5 2 3 2 3 4 ...
 $ bathrooms  : num  1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
 $ sqft_living: int  2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
 $ sqft_lot   : int  9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
 $ waterfront : int  0 0 0 0 0 0 0 0 0 0 ...
 $ condition  : int  3 4 3 3 3 3 3 3 3 3 ...
 $ grade      : int  7 7 11 8 9 7 6 7 9 10 ...
 $ yr_built   : int  1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
 $ zipcode    : int  98168 98038 98102 98103 98117 98115 98146 98136 98038 98004 ...

The first few and last few records in the dataset are:

id          price    bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
5537200043   211000  4         1.00       2100         9200      0           3          7      1959      98168
2025700080   265000  3         2.50       1530         6000      0           4          7      1991      98038
2025049111  1440000  3         3.50       3870         3819      0           3          11     2002      98102
9482700075   800000  4         3.50       2370         3302      0           3          8      1926      98103
2856102105  1059500  5         3.25       3230         3825      0           3          9      2014      98117
3364900375   750000  2         1.00       1620         6120      0           3          7      1951      98115

       id          price    bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
9995   8945100320   136500  3         1.50       1420         8580      0           3          6      1962      98023
9996   3832500230   245000  4         2.25       2140         8800      0           4          7      1963      98032
9997   7351200050  1335000  4         1.75       2300         13342     1           3          7      1934      98125
9998   2301400325   760000  3         2.00       1810         4500      0           4          7      1906      98117
9999   1201500010   833000  4         2.50       2190         12690     0           3          8      1973      98033
10000  3709600190   370000  4         2.50       2130         4750      0           3          8      2009      98058

Basic statistics for each attribute are:

       id                price            bedrooms        bathrooms
 Min.   :1.000e+06   Min.   :  78000   Min.   : 0.000   Min.   :0.000
 1st Qu.:2.126e+09   1st Qu.: 320000   1st Qu.: 3.000   1st Qu.:1.750
 Median :3.905e+09   Median : 450000   Median : 3.000   Median :2.250
 Mean   :4.591e+09   Mean   : 541434   Mean   : 3.373   Mean   :2.113
 3rd Qu.:7.304e+09   3rd Qu.: 649950   3rd Qu.: 4.000   3rd Qu.:2.500
 Max.   :9.842e+09   Max.   :6885000   Max.   :10.000   Max.   :8.000
  sqft_living       sqft_lot        waterfront       condition
 Min.   :  370   Min.   :   520   Min.   :0.0000   Min.   :1.000
 1st Qu.: 1430   1st Qu.:  5058   1st Qu.:0.0000   1st Qu.:3.000
 Median : 1920   Median :  7620   Median :0.0000   Median :3.000
 Mean   : 2080   Mean   : 14947   Mean   :0.0077   Mean   :3.407
 3rd Qu.: 2545   3rd Qu.: 10642   3rd Qu.:0.0000   3rd Qu.:4.000
 Max.   :13540   Max.   :982998   Max.   :1.0000   Max.   :5.000
     grade           yr_built       zipcode
 Min.   : 3.000   Min.   :1900   Min.   :98001
 1st Qu.: 7.000   1st Qu.:1951   1st Qu.:98033
 Median : 7.000   Median :1975   Median :98065
 Mean   : 7.657   Mean   :1971   Mean   :98078
 3rd Qu.: 8.000   3rd Qu.:1997   3rd Qu.:98117
 Max.   :13.000   Max.   :2015   Max.   :98199

The numbers of unique values for each attribute are:

id           9964
price        2526
bedrooms     11
bathrooms    26
sqft_living  744
sqft_lot     5712
waterfront   2
condition    5
grade        11
yr_built     116
zipcode      70

Summary of Attributes
The following table identifies which attributes are numerical and whether they are continuous or discrete, and which are categorical and whether they are nominal or ordinal. It includes some initial observations about the ranges and common values of the attributes.

Attribute    | Type        | Sub-type   | Comments
id           | Categorical | Nominal    | Contains duplicates - are these duplicate records or multiple sales of the same house? Can be dropped for the analysis.
price        | Numerical   | Continuous | Target variable - values range from 78000 to 6885000. Probably has outliers, especially high ones.
bedrooms     | Numerical   | Discrete   | Values range from 0 to 10 - the majority are 3 and 4.
bathrooms    | Numerical   | Discrete   | Has 26 values ranging from 0 to 8 - 2.5 is the most common value. Includes fractions (.25, .5 and .75).
sqft_living  | Numerical   | Continuous | Ranges from 370 to 13540 - could have outliers.
sqft_lot     | Numerical   | Continuous | Ranges from 520 to 982998 - could have extreme outliers.
waterfront   | Categorical | Nominal    | Only has two values - the majority are 0; only 77 are 1.
condition    | Categorical | Ordinal    | Has 5 values - range is 1 - 5.
grade        | Categorical | Ordinal    | Has 13 potential values (range 1 - 13), but 1 and 2 are not used in the training data. The majority are 7 and 8.
yr_built     | Numerical   | Discrete   | Values range from 1900 to 2015.
zipcode      | Categorical | Nominal    | Has 70 values between 98001 and 98199.

Check the Duplicate IDs
Display the records with duplicate ids to determine whether these are duplicate records.
In [4]:
duplicates <- aggregate(houseData$id, list(houseData$id), NROW)
duplicates <- houseData[houseData$id %in% duplicates[duplicates$x > 1,"Group.1"],]
head(duplicates[order(duplicates$id),],20)

      id          price   bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
8449  7200179     175000  2         1.00       840          12750     0           3          6      1925      98055
9679  7200179     150000  2         1.00       840          12750     0           3          6      1925      98055
9012  643300040   719521  4         1.75       1920         9500      0           4          7      1966      98006
9674  643300040   481000  4         1.75       1920         9500      0           4          7      1966      98006
614   1139600270  300000  3         2.75       2090         9620      0           3          8      1987      98023
5810  1139600270  310000  3         2.75       2090         9620      0           3          8      1987      98023
2683  1446403850  212000  2         1.00       790          7153      0           4          6      1944      98168
4643  1446403850  118125  2         1.00       790          7153      0           4          6      1944      98168
2032  1721801010  302100  3         1.00       1790         6120      0           3          6      1937      98146
3303  1721801010  225000  3         1.00       1790         6120      0           3          6      1937      98146
4681  1901600090  359000  5         1.75       1940         6654      0           4          7      1953      98166
8112  1901600090  390000  5         1.75       1940         6654      0           4          7      1953      98166
5124  1995200200  313950  3         1.00       1510         6083      0           4          6      1940      98115
7588  1995200200  415000  3         1.00       1510         6083      0           4          6      1940      98115
518   2023049218  445000  2         1.00       930          7740      0           1          5      1932      98148
7854  2023049218  105500  2         1.00       930          7740      0           1          5      1932      98148
2058  2143700830  370000  4         2.50       2100         19680     0           3          6      1914      98055
5516  2143700830  207000  4         2.50       2100         19680     0           3          6      1914      98055
4966  2560801222  309950  3         2.25       1990         6350      0           3          7      1967      98198
7762  2560801222  180000  3         2.25       1990         6350      0           3          7      1967      98198

Within each pair the records are identical except for the price, so they probably represent the same house being sold more than once rather than duplicated rows.
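As a cross-check on the aggregate-based lookup, repeated ids can also be flagged with base R's duplicated(). This is only a sketch: the data frame `toy` and its values are made up for illustration and are not taken from training.csv.

```r
# Toy data (hypothetical values, not from training.csv)
toy <- data.frame(
  id    = c(101, 102, 101, 103, 102, 104),
  price = c(175000, 300000, 150000, 420000, 310000, 260000)
)

# TRUE for every row whose id occurs more than once; scanning from both
# ends is needed so that the first occurrence is flagged as well
is_dup <- duplicated(toy$id) | duplicated(toy$id, fromLast = TRUE)
dup_rows <- toy[is_dup, ]
dup_rows <- dup_rows[order(dup_rows$id), ]

# How many distinct ids repeat, and does each repeated id have differing prices?
n_repeated_ids <- length(unique(dup_rows$id))
price_differs  <- tapply(dup_rows$price, dup_rows$id,
                         function(p) length(unique(p)) > 1)
print(dup_rows)
```

If `price_differs` is TRUE for every repeated id, the repeats are distinct sales rather than exact duplicate rows.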
The ID is unlikely to be useful for further analysis, so remove it.
In [5]:
# Remove the ID
houseData <- houseData[,-1]

Investigate Distribution of Each Variable
In [6]:
attach(houseData)

View the variable distributions using boxplots
In [7]:
# Generate box plots of all variables except the two categorical/nominal ones
m1 <- melt(as.data.frame(houseData[,c(-6,-10)]))
ggplot(m1,aes(x = variable,y = value)) +
    facet_wrap(~variable, scales="free") +
    geom_boxplot() +
    scale_y_continuous(labels=function (n) {format(n, scientific=FALSE)})

No id variables; using all as measure variables

View the variable distributions using histograms and bar charts
In [8]:
# Plot a histogram or bar chart of each variable
par(mfrow = c(4,3))
hist(price)
hist(sqft_living)
hist(sqft_lot)
hist(yr_built)
plot(as.factor(bedrooms),main="Bar Chart of bedrooms")
plot(as.factor(bathrooms),main="Bar Chart of bathrooms")
plot(as.factor(waterfront), main="Bar Chart of waterfront")
plot(as.factor(condition), main="Bar Chart of condition")
plot(as.factor(grade), main="Bar Chart of grade")
# plot zipcode on a separate row
par(fig=c(0,1,0,0.30),ps=10,new=TRUE)
barplot(sort(table(zipcode)),las=2,main="Bar Chart of zipcode")

These graphs show:
• Price, sqft_living and sqft_lot all have large positive skews
• The most common number of bathrooms is 2.5; other common values are 1 and 1.75
• There are very few waterfront properties
• Most properties have condition of 3 or 4
• Most properties have grade of 7 or 8
• Few properties were built between 1900 and 1940; there is a fairly even spread from 1940 to 2000 and an increase in properties built between 2000 and 2010
• The number of properties sold by zip code varies from about 25 to about 250

Take a closer look at some interesting features
Replot price, sqft_living and sqft_lot using a log scale to see if these variables have a log-normal distribution.
In [9]:
# Set some colours using Colorbrewer
gg.colour <- brewer.pal(12,"Paired")[12]
gg.fill <- brewer.pal(12,"Paired")[11]

# Re-plot some of the charts using log scales to counteract the skew
p1 <- ggplot(aes(x=price), data=houseData) +
    geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
    scale_x_log10(labels=comma,breaks=c(100000,300000,1000000,3000000,10000000))
p2 <- ggplot(aes(x=sqft_living), data=houseData) +
    geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
    scale_x_log10(labels=comma,breaks=c(500,1000,2000,5000,10000))
p3 <- ggplot(aes(x=sqft_lot), data=houseData) +
    geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
    scale_x_log10(labels=comma,breaks=c(1000,4000,10000,40000,100000,500000,100000))
grid.arrange(p1, p2, p3, ncol=1, nrow=3)

These graphs show:
• The logs of price and sqft_living are (nearly) normally distributed.
• The log of sqft_lot is not quite normal. The majority of lot sizes are between 4000 and 10,000 sqft, with a few outliers > 100,000 sqft
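
One way to put a number on how much a log transform reduces the skew is to compare the sample skewness before and after. The sketch below is illustrative only: it uses simulated log-normal values as a stand-in for a variable such as price (the parameters are arbitrary assumptions), and a hand-written moment-based `skewness` helper, since base R does not provide one.

```r
set.seed(42)
# Simulated stand-in for a right-skewed variable such as price
# (meanlog and sdlog are arbitrary, not estimated from training.csv)
x <- rlnorm(5000, meanlog = 13, sdlog = 0.5)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / (mean(d^2)^1.5)
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
cat("Skewness of raw values:  ", round(skew_raw, 2), "\n")
cat("Skewness of log10 values:", round(skew_log, 2), "\n")
```

A skewness near zero after the transform is consistent with, though not proof of, a log-normal distribution.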

There was a drop in sales of houses built during the 1930s and an increase in sales of houses built during the 2000s. The following graphs provide a closer look at sales for these years.
In [10]:
par(mfrow=c(2,1))
plot(as.factor(yr_built[yr_built >= 1920 & yr_built < 1950]),
     main = "Sales by Year Built, 1920-1950", col="lightblue")
plot(as.factor(yr_built[yr_built >= 1990]),
     main = "Sales by Year Built, Post 1990", col="lightblue")

There were very few houses sold that were built between 1933 and 1936 – these were depression years, so maybe fewer houses were built. Then there was an increase in sales for houses built during WWII (so this data is probably not for a European city) and a spike during the post-war years of 1947 and 1948.
There was a jump in the number of houses sold that were built in 2003 to 2008. These houses were between six and twelve years old when they were sold. Fewer houses built between 2009 and 2013 were sold. These figures suggest that owners of new homes tend to keep the house for at least six years before reselling. The sales volume of houses built in 2014 and 2015 shows that about 2.5% of sales were for new houses.
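The counts behind these observations can be tabulated by grouping yr_built into bands. The following is a minimal sketch on a made-up vector (`yr_built_toy` is hypothetical, and the band edges are chosen here only to mirror the periods discussed above); the same calls could be applied to the real yr_built column.

```r
# Hypothetical yr_built values for illustration only
yr_built_toy <- c(1931, 1947, 1948, 1975, 2003, 2005, 2008, 2011, 2014, 2015)

# cut() uses right-closed intervals by default, e.g. (2002, 2008]
bands <- cut(yr_built_toy,
             breaks = c(1899, 1939, 1949, 2002, 2008, 2013, 2015),
             labels = c("1900-1939", "1940-1949", "1950-2002",
                        "2003-2008", "2009-2013", "2014-2015"))
band_counts <- table(bands)

# Share of records for houses built in 2014 or later
new_share <- mean(yr_built_toy >= 2014)

print(band_counts)
cat("Share built in 2014 or later:", new_share, "\n")
```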

One possible anomaly in the data is that there are houses with no bedrooms and/or no bathrooms. Take a closer look at these records.
In [11]:
houseData[bedrooms == 0 | bathrooms == 0,]

      price   bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
37    139950  0         0.00       844          4269      0           4          7      1913      98001
987   235000  0         0.00       1470         4800      0           3          7      1996      98065
1295  355000  0         0.00       2460         8049      0           3          8      1990      98031
3523  265000  0         0.75       384          213444    0           3          4      2003      98070
4110  380000  0         0.00       1470         979       0           3          8      2006      98133
7081  339950  0         2.50       2290         8319      0           3          8      1985      98042
8099  280000  1         0.00       600          24501     0           2          3      1950      98045

Many of these look very strange. The first record describes a house with no bedrooms and no bathrooms but is in above average condition and of average grade. Record 7081 describes a house with no bedrooms but 2.5 bathrooms, so would suit a very clean person who doesn’t sleep! Most (if not all) of these are likely to contain erroneous data.
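A flag column makes it easy to inspect such records, or to exclude them later if that turns out to be appropriate. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical); whether the suspect rows should actually be dropped is a modelling decision that is not settled at this point in the analysis.

```r
# Toy records (hypothetical values) using the dataset's column names
toy <- data.frame(
  bedrooms    = c(3, 0, 0, 1, 4),
  bathrooms   = c(2.5, 0, 2.5, 0, 1.75),
  sqft_living = c(1900, 1470, 2290, 600, 2300)
)

# Flag rows with no bedrooms or no bathrooms; referring to toy$... explicitly
# avoids depending on a previously attach()ed copy of the data
suspect <- toy$bedrooms == 0 | toy$bathrooms == 0
toy$suspect <- suspect

# Rows that would remain if the suspect records were excluded
clean <- toy[!suspect, ]
print(toy)
```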

Investigate Pairs of Variables
Correlation Plot Function
This is the DIY correlation plot provided in the tutorial
In [12]:
# DIY correlation plot
# http://stackoverflow.com/questions/31709982/how-to-plot-in-r-a-correlogram-on-top-of-a-correlation-matrix
# there’s some truth to the quote that modern programming is often stitching together pieces from SO

colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
## colorRamp() returns a function which takes as an argument a number
## on [0,1] and returns a color in the gradient in colorRange
myColorRampFunc <- colorRamp(colorRange)

panel.cor <- function(w, z, ...) {
    correlation <- cor(w, z)

    ## because the func needs [0,1] and cor gives [-1,1], we need to shift and scale it
    col <- rgb(myColorRampFunc((1 + correlation) / 2 ) / 255 )

    ## take the square root so that the circle's area (not its diameter)
    ## is proportional to |correlation|, avoiding visual bias
    radius <- sqrt(abs(correlation))
    radians <- seq(0, 2*pi, len = 50) # 50 is arbitrary
    x <- radius * cos(radians)
    y <- radius * sin(radians)
    ## make them full loops
    x <- c(x, tail(x,n=1))
    y <- c(y, tail(y,n=1))

    ## trick: "don't create a new plot" thing by following the
    ## advice here: http://www.r-bloggers.com/multiple-y-axis-in-a-r-plot/
    par(new=TRUE)
    plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1)
    polygon(x, y, border=col, col=col)
}

# usage e.g.:
# pairs(mtcars, upper.panel = panel.cor)

Scatterplot Matrix
1) Plot the variables using a scatterplot matrix to visualise the correlations between variables.
In [13]:
pairs(houseData[sample.int(nrow(houseData),1000),], lower.panel=panel.cor)

The correlation matrix shows:
• Price is strongly correlated with sqft_living and grade, and has a weaker correlation with bathrooms, bedrooms and waterfront
• Bedrooms, bathrooms, sqft_living, grade and yr_built are all correlated with each other - in particular both bathrooms and grade are highly correlated with sqft_living
• There is quite a strong negative correlation between condition and yr_built
• There is very little correlation between the land size (sqft_lot) and the price
• There is a correlation between zipcode and yr_built

There is also a weak correlation between price and whether or not the property overlooks the waterfront.
2) How does looking at just these waterfront properties change the correlations?
In [14]:
pairs(houseData[houseData$waterfront == 1,-6], lower.panel=panel.cor)

The correlations are similar, but a few are stronger:
• yr_built and price
• bedrooms with bathrooms and grade
• grade and yr_built

The negative correlation between condition and yr_built is weaker.
There is a new correlation between zipcode and price - so the zipcode appears to have an effect on the price of waterfront properties, but not on other properties.
3) Take a closer look at the correlations between price, bedrooms, bathrooms, sqft_living and grade
In [15]:
scatterplotMatrix(~price+bedrooms+bathrooms+sqft_living+grade,data=houseData)

The scatterplot matrix shows many of these relationships are non-linear - in particular the relationships between price and the other variables. The relationship between bedrooms and the other variables looks to be linear for houses with between one and four bedrooms but changes for houses with more than four bedrooms.

Correlation Coefficients
Show the correlation coefficients for all pairs of variables
In [16]:
options(digits=4)
cor(houseData)

             price     bedrooms   bathrooms  sqft_living  sqft_lot  waterfront  condition  grade    yr_built  zipcode
price         1.00000   0.323447   0.52505    0.69585      0.09395   0.294854    0.03340    0.6688   0.06105  -0.04954
bedrooms      0.32345   1.000000   0.52173    0.59094      0.03469  -0.003382    0.02870    0.3671   0.14510  -0.14563
bathrooms     0.52505   0.521726   1.00000    0.75076      0.08911   0.075340   -0.12710    0.6701   0.50991  -0.19424
sqft_living   0.69585   0.590939   0.75076    1.00000      0.18653   0.119910   -0.06342    0.7651   0.32164  -0.19173
sqft_lot      0.09395   0.034691   0.08911    0.18653      1.00000   0.026782   -0.01656    0.1172   0.05836  -0.13530
waterfront    0.29485  -0.003382   0.07534    0.11991      0.02678   1.000000    0.01523    0.0854  -0.01816   0.02621
condition     0.03340   0.028699  -0.12710   -0.06342     -0.01656   0.015235    1.00000   -0.1503  -0.36385   0.01226
grade         0.66883   0.367060   0.67009    0.76515      0.11717   0.085398   -0.15031    1.0000   0.45182  -0.17989
yr_built      0.06105   0.145100   0.50991    0.32164      0.05836  -0.018161   -0.36385    0.4518   1.00000  -0.34943
zipcode      -0.04954  -0.145632  -0.19424   -0.19173     -0.13530   0.026214    0.01226   -0.1799  -0.34943   1.00000

The top positive correlations are between:
• sqft_living and grade
• sqft_living and bathrooms
• sqft_living and price
• grade and bathrooms
• grade and price

The largest negative correlations are between yr_built and condition (-0.36) and between yr_built and zipcode (-0.35). The first suggests that older houses are generally in better condition than newer houses.

Re-format Data for Further Analysis
Factorise the categorical variables for the rest of the analysis
In [17]:
houseData$waterfront <- as.factor(houseData$waterfront)
houseData$condition <- as.factor(houseData$condition)
houseData$grade <- as.factor(houseData$grade)
houseData$zipcode <- as.factor(houseData$zipcode)
str(houseData)

'data.frame':	10000 obs. of  10 variables:
 $ price      : int  211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
 $ bedrooms   : int  4 3 3 4 5 2 3 2 3 4 ...
 $ bathrooms  : num  1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
 $ sqft_living: int  2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
 $ sqft_lot   : int  9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
 $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ condition  : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 3 3 3 3 3 3 3 ...
 $ grade      : Factor w/ 11 levels "3","4","5","6",..: 5 5 9 6 7 5 4 5 7 8 ...
 $ yr_built   : int  1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
 $ zipcode    : Factor w/ 70 levels "98001","98002",..: 65 24 42 43 52 50 61 59 24 4 ...

Melt the data for visualisation and remove some outliers
In [18]:
m1 <- melt(as.data.frame(houseData))
m1 <- m1[!(m1$variable == "sqft_living" & m1$value > 10000),]
m1 <- m1[!(m1$variable == "sqft_lot" & m1$value > 100000),]

Using waterfront, condition, grade, zipcode as id variables

Show how Grade is Related to other Variables
The grade is correlated with several of the numeric variables. These box plots show how each numeric variable is distributed by grade.
In [19]:
ggplot(m1,aes(x = grade,y = value)) +
    facet_wrap(~variable, scales="free") +
    geom_boxplot() +
    scale_y_continuous(labels=function (n) {format(n, scientific=FALSE)})

High grade houses:
• are higher priced than lower grade houses
• have more bedrooms and bathrooms
• are larger and on larger properties – but the lowest grade houses are also on larger properties
• tend to be newer houses

Show how Condition is Related to other Variables
The condition of the house is negatively correlated to the age of the house. These boxplots investigate this correlation and check for any relationships to other variables.
In [20]:
ggplot(m1,aes(x = condition,y = value)) +
    facet_wrap(~variable, scales="free") +
    geom_boxplot() +
    scale_y_continuous(labels=function (n) {format(n, scientific=FALSE)})

Houses in the best condition are similar to houses in average condition, but tend to be older and so are probably older houses that have been renovated to a very high standard.
Houses in poor condition, compared to those in average condition:
• Sell for slightly less
• Have fewer bedrooms and bathrooms
• Are smaller but on similar sized properties
• Are older houses

Show how the Zipcode is Related to other Variables
Show a boxplot of each numeric variable by zipcode. Zipcodes are coloured according to whether at least one property with the zipcode is a waterfront property (red – a waterfront zipcode, blue – not a waterfront zipcode). Outliers are not shown to reduce graph clutter.
Code to separate the legends from the plots was copied from: http://stackoverflow.com/questions/12539348/ggplot-separate-legend-and-plot
In [21]:
wf.zipcodes <- houseData$zipcode %in% unique(as.character(houseData$zipcode[waterfront==1]))

p1 <- ggplot(houseData,aes(x = reorder(zipcode,price,median),y = price)) +
    geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
    scale_y_continuous(limits=c(0,4000000), labels=comma) +
    scale_fill_discrete(name="Waterfront Zipcode", labels=c("No","Yes")) +
    theme(legend.position = "bottom", legend.box = "horizontal") +
    xlab("Zipcode (ordered by median price)") +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

# Extract the legend from p1 so it can be drawn once, below the plots
tmp <- ggplot_gtable(ggplot_build(p1))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]

p2 <- ggplot(houseData,aes(x = reorder(zipcode,yr_built,median),y = yr_built)) +
    geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
    scale_fill_discrete(guide="none") +
    xlab("Zipcode (ordered by median yr_built)") +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

grid.arrange(p1+theme(legend.position = 'none'), p2, legend,
             ncol=1, nrow=3, heights=c(3/7,3/7,1/7))

p3 <- ggplot(houseData,aes(x = reorder(zipcode,bedrooms,mean),y = bedrooms)) +
    geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
    scale_y_continuous(limits=c(0,7),breaks=seq(0,7)) +
    scale_fill_discrete(guide="none") +
    xlab("Zipcode (ordered by mean bedrooms)") +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

p4 <- ggplot(houseData,aes(x = reorder(zipcode,bathrooms,mean),y = bathrooms)) +
    geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
    scale_y_continuous(limits=c(0,5),breaks=seq(0,5)) +
    scale_fill_discrete(guide="none") +
    xlab("Zipcode (ordered by mean bathrooms)") +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

grid.arrange(p3, p4, legend, ncol=1, nrow=3, heights=c(3/7,3/7,1/7))

p5 <- ggplot(houseData,aes(x = reorder(zipcode,sqft_living,median),y = sqft_living)) +
    geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
    scale_y_continuous(limits=c(0,7000),breaks=seq(0,7000,1000)) +
    scale_fill_discrete(guide="none") +
    xlab("Zipcode (ordered by median sqft_living)") +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

p6 <- ggplot(houseData,aes(x = reorder(zipcode,sqft_lot,median),y = sqft_lot)) +
    geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
    scale_y_continuous(limits=c(0,100000),breaks=seq(0,100000,10000), labels=comma) +
    scale_fill_discrete(guide="none") +
    xlab("Zipcode (ordered by median sqft_lot)") +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))

grid.arrange(p5, p6, legend, ncol=1, nrow=3, heights=c(3/7,3/7,1/7))

Warning message:
“Removed 5 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 5 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 11 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 14 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 8 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 214 rows containing non-finite values (stat_boxplot).”

The above graphs show zipcodes are related to the other attributes in the following ways:
• There are a few high-priced zipcodes. The highest price zipcode contains waterfront properties.
• Quite a few zipcodes have predominately newer houses, a few have predominately "middle-aged" houses and a few have a range of older and newer houses.
• There is no real difference in number of bedrooms by zipcode
• Most zipcodes have mainly houses with more than one bathroom, but a few zipcodes have quite a few one bathroom houses (the lower hinge of the IQR is one).
• A few zipcodes have large houses and some very large properties
• Several of the zipcodes with larger houses have waterfront properties
• One zipcode (98039) stands out as having the highest prices, the most bedrooms and bathrooms and the largest houses. It has waterfront properties.
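
The "waterfront zipcode" flag used to colour these boxplots marks every property whose zipcode contains at least one waterfront sale. A minimal sketch of that step on made-up data follows (`toy`, `wf_zips` and the values are hypothetical); it refers to the columns explicitly rather than through an attached copy of the data frame.

```r
# Toy data (hypothetical values) using the dataset's column names
toy <- data.frame(
  zipcode    = factor(c("98001", "98001", "98039", "98039", "98070", "98103")),
  waterfront = c(0, 0, 1, 0, 1, 0)
)

# Zipcodes that contain at least one waterfront property
wf_zips <- unique(as.character(toy$zipcode[toy$waterfront == 1]))

# Flag every property in one of those zipcodes, waterfront or not
toy$wf_zipcode <- toy$zipcode %in% wf_zips
print(toy)
```

Note that the second 98039 record is flagged even though it is not itself a waterfront property, which is the intended behaviour of a zipcode-level flag.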
Waterfront Properties
The correlation matrices showed there are differences between how waterfront properties and non-waterfront properties are related to the other variables. These differences can be seen in the following charts.
Firstly, the differences in the numeric variable distributions between waterfront and non-waterfront properties are shown in the following density graphs.
In [22]:
p1 <- ggplot(aes(x=price),data = houseData) +
    geom_density(aes(fill = as.factor(waterfront))) +
    ggtitle('Price and Waterfront') +
    scale_x_continuous(labels=comma) +
    scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p2 <- ggplot(aes(x=sqft_living),data = houseData) +
    geom_density(aes(fill = as.factor(waterfront))) +
    ggtitle('House Size and Waterfront') +
    scale_x_continuous(limits=c(0,10000)) +
    scale_y_continuous(labels=comma) +
    scale_fill_discrete(guide="none")
p3 <- ggplot(aes(x=sqft_lot),data = houseData) +
    geom_density(aes(fill = as.factor(waterfront))) +
    ggtitle('Lot Size and Waterfront') +
    scale_x_continuous(limits=c(0,100000)) +
    scale_fill_discrete(guide="none")
p4 <- ggplot(aes(x=yr_built),data = houseData) +
    geom_density(aes(fill = as.factor(waterfront))) +
    ggtitle('Year Built and Waterfront') +
    scale_fill_discrete(guide="none")

tmp <- ggplot_gtable(ggplot_build(p1))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]

grid.arrange(p1+theme(legend.position = 'none'), p2, p3, p4, legend,
             ncol=2, nrow=3, heights=c(3/7,3/7,1/7))

Warning message:
“Removed 1 rows containing non-finite values (stat_density).”
Warning message:
“Removed 214 rows containing non-finite values (stat_density).”

Properties overlooking the waterfront have some quite different characteristics from other properties:
• they are more expensive than non-waterfront ones and have a wider price range
• they have a much wider range of both house size and plot size
• they are slightly older than non-waterfront ones

The following graphs show how waterfront properties differ from non-waterfront ones in terms of the number of bathrooms, condition and grade; and which zipcodes contain waterfront properties.
In [23]:
p1 <- ggplot(aes(x=bathrooms, fill=waterfront),data = houseData) +
    facet_wrap(~waterfront, scales="free") +
    geom_bar() +
    ggtitle('Bathrooms and Waterfront') +
    theme(legend.position = "bottom", legend.box = "horizontal") +
    scale_x_continuous(limits=c(0,6)) +
    scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p2 <- ggplot(aes(x=condition, fill=waterfront),data = houseData) +
    facet_wrap(~waterfront, scales="free") +
    geom_bar() +
    ggtitle('Condition and Waterfront') +
    theme(legend.position = "bottom", legend.box = "horizontal") +
    scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p3 <- ggplot(aes(x=grade, fill=waterfront),data = houseData) +
    facet_wrap(~waterfront, scales="free") +
    geom_bar() +
    ggtitle('Grade and Waterfront') +
    theme(legend.position = "bottom", legend.box = "horizontal") +
    scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p4 <- ggplot(aes(x=zipcode, fill=waterfront),data = houseData[houseData$waterfront == 1,]) +
    geom_bar(fill="#00BFC4") +
    ggtitle('Zipcodes with Waterfront Properties') +
    scale_y_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12)) +
    theme(axis.text.x = element_text(angle = 90, vjust=0.5))

grid.arrange(p1, p2, ncol=1, nrow=2)
grid.arrange(p3, p4, ncol=1, nrow=2)

#tmp <- ggplot_gtable(ggplot_build(p5))
#leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
#legend <- tmp$grobs[[leg]]
#grid.arrange(legend, p5+theme(legend.position = 'none'), p6, ncol=1, nrow=3, heights=c(1/7,3/7,3/7))

Warning message:
“Removed 2 rows containing non-finite values (stat_count).”
Warning message:
“Removed 3 rows containing missing values (geom_bar).”

Additional characteristics of waterfront properties compared to non-waterfront ones are:
• These houses have more bathrooms
• There is a higher proportion of houses in very good condition - but also in poor condition
• There is a higher proportion of high grade properties

The last plot shows that only 23 of the zipcodes had sales of waterfront properties and only six of these had more than three sales of waterfront properties.

How have House Features Changed over Time?
Look at the relationship between year built and bedrooms, bathrooms, grade and condition. Plot stacked bar charts showing proportions to see how these have changed over time. Years are grouped into decades to reduce the chart clutter.
In [24]:
p1 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = as.factor(bedrooms)),
             data = houseData[bedrooms >= 1 & bedrooms < 8,]) +
    geom_bar(position="fill") +
    ggtitle('Bedrooms and Year Built') +
    scale_fill_brewer(palette="Set3",name="bedrooms") +
    xlab("Decade Built")
p2 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = as.factor(bathrooms)),
             data = houseData[bathrooms >= 1 & bathrooms <= 3.5,]) +
    geom_bar(position="fill") +
    ggtitle('Bathrooms and Year Built') +
    scale_fill_brewer(palette="Set3",name="bathrooms") +
    xlab("Decade Built")
p3 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = grade),
             data = houseData[grade >= 4 & grade < 13,]) +
    geom_bar(position="fill") +
    ggtitle('Grade and Year Built') +
    scale_fill_brewer(palette="Set3",name="grade") +
    xlab("Decade Built")
p4 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = condition),data = houseData) +
    geom_bar(position="fill") +
    ggtitle('Condition and Year Built') +
    scale_fill_brewer(palette="Set3",name="condition") +
    xlab("Decade Built")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

These charts show:
• Two bedroom houses were common before the 1950s but are rare more recently, with three and four bedrooms becoming more common
• Most houses built before 1960 had only one bathroom; two and a half bathrooms is usual for recent houses
• Older houses tend to be of lower grade
• The houses in poor condition tend to be older - but the houses in really good condition are also older. Most newer houses are in average condition

Association between Bedrooms, House Size and Price
The earlier scatterplot matrices show that the relationships between bedrooms and house size, and between bedrooms and price, are different for houses with more than four bedrooms than for those with four or fewer bedrooms. Does plotting all three of these variables together show more about these relationships?
In [25]:
ggplot(aes(x=price/1000,y=sqft_living,colour=as.factor(bedrooms)),
       data = houseData[houseData$price < 2000000 & houseData$sqft_living < 5000,]) +
    geom_point() +
    scale_colour_brewer(name="bedrooms",palette="Set3")

This plot shows that the number of bedrooms, the house size and the price increase together for houses with between 1 and 4 bedrooms. Houses with more than 4 bedrooms appear to have similar prices and sizes to 4 bedroom houses.

Summary
Overview
The provided house sales data has 10000 records with 11 attributes for each record. The provided descriptions for each attribute and some additional notes are:
1. Id: Unique ID for each home sold
2. Price: Price of each home sold
3. Bedroom: #bedrooms
   • Values range from 0-10
4. Bathrooms: #bathrooms, where .5 accounts for a bathroom with a toilet but no shower
   • Values range from 0-8 and include .25 and .75, as well as .5
5. Sqft_living: Square footage of the apartment's interior living space
6. Sqft_lot: Square footage of the land space
7. Waterfront: A binary variable indicating whether the property was overlooking the waterfront or not. 1's represent a waterfront property, 0's represent a non-waterfront property
8. Condition: An index from 1 to 5 on the condition of the apartment, 1 - lowest, 5 - highest
9. Grade: An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high-quality level of construction and design.
   • Grades 1 and 2 are not present in the training data
10. Yr_built: The year the house was initially built.
   • Values range from 1900 to 2015
11. Zipcode: What postcode area the house is in

Analysis of Each Variable
Some houses appear to have been sold more than once during the year.
Price, sqft_living and sqft_lot have very skewed distributions. Plotting the log values of these variables shows a more normal distribution.
There were very few houses sold that were built between 1933 and 1936 - these were depression years so maybe fewer houses were built. Then there was an increase in sales for houses built during WWII and a spike during the post-war years of 1947 and 1948.
A spike in the sales of houses built between 2003 and 2008 was observed, probably indicating that people who bought new homes then are now selling them.

Analysis of Pairs of Variables
As might be expected, the sale price of a house appears to be determined by a range of factors including the size of the house, the number of bedrooms and bathrooms, the grade and the year the house was built. These variables are all inter-related, with larger houses having more bedrooms and bathrooms, and the size and the number of bedrooms and bathrooms all contributing to the grade. Two specific observations are:
• The variables most highly correlated with the price are the size (sqft_living) and the grade of the house.
 • The correlation between bedrooms and the other variables decreases for houses with more than four bedrooms.
What is a little surprising is that the condition of the house and the size of the lot do not appear to have a big effect on the price. A closer look at the condition showed that being in poor condition does lead to a lower price, but there are not many houses in poor condition. Being in better than average condition does not lead to a better price.
The condition and the year built are negatively correlated, implying that older houses are generally in better condition than newer houses. However, the houses in the worst condition are also older houses. The majority of newer houses are in average condition.
There are several differences between properties that overlook the waterfront and other properties. Waterfront properties are usually higher-priced, larger, in better condition and of higher grade, but as only 77 waterfront properties were sold some of these differences may not generalise to other sets of data.
An unexpected observation is that there appears to be a slight correlation between the number of bathrooms, whether or not the property overlooks the waterfront and the zipcode.
Build the Linear Regression Model¶
Prepare the Data Frame¶
Re-load the data frame and factorise the categorical variables
In [26]:
houseData <- read.csv("training.csv")
# Drop the first column (the id)
houseData <- houseData[,-1]
# Treat the categorical variables as factors
houseData$waterfront <- as.factor(houseData$waterfront)
houseData$condition <- as.factor(houseData$condition)
houseData$grade <- as.factor(houseData$grade)
houseData$zipcode <- as.factor(houseData$zipcode)
str(houseData)

'data.frame': 10000 obs. of 10 variables:
 $ price : int 211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
 $ bedrooms : int 4 3 3 4 5 2 3 2 3 4 ...
 $ bathrooms : num 1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
 $ sqft_living: int 2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
 $ sqft_lot : int 9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
 $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ condition : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 3 3 3 3 3 3 3 ...
 $ grade : Factor w/ 11 levels "3","4","5","6",..: 5 5 9 6 7 5 4 5 7 8 ...
 $ yr_built : int 1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
 $ zipcode : Factor w/ 70 levels "98001","98002",..: 65 24 42 43 52 50 61 59 24 4 ...

Define some Functions¶
These functions are used during the model building to evaluate the model accuracy.
Function to Calculate Model Accuracy Statistics¶
Name: Model.Accuracy
Input parameters:
• predicted - a vector of predictions
• target - a vector containing the target values for the predictions
• df - the degrees of freedom
• p - the number of parameters excluding the intercept
Return Value:
A list containing:
• rsquared - the R-Squared value calculated from the predicted and target values
• rse - the residual standard error
• f.stat - the F-statistic
Description:
Calculate the TSS and RSS as:
• TSS: $\sum_{i=1}^n (y_i - \bar y)^2$
• RSS: $\sum_{i=1}^n (\hat y_i - y_i)^2$
Calculate the statistics according to the following formulae:
• R-Squared value: $R^2 = 1 - \frac{RSS}{TSS}$
• Residual standard error: $\sqrt{\frac{1}{df}RSS}$
• F-statistic: $\frac{(TSS - RSS)/p}{RSS / df}$
In [27]:
Model.Accuracy <- function(predicted, target, df, p) {
    rss <- 0
    tss <- 0
    target.mean <- mean(target)
    # Accumulate the residual and total sums of squares
    for (i in seq_along(predicted)) {
        rss <- rss + (predicted[i]-target[i])^2
        tss <- tss + (target[i]-target.mean)^2
    }
    rsquared <- 1 - rss/tss
    rse <- sqrt(rss/df)
    f.stat <- ((tss-rss)/p) / (rss/df)
    return(list(rsquared=rsquared,rse=rse,f.stat=f.stat))
}

Function to Calculate RMSE¶
Name: RMSE
Input parameters:
• predicted - a vector of predictions
• target - a vector containing the target values for the predictions
Return Value:
The RMSE value calculated from the predicted and target values
Description:
Calculate the RMSE value: $RMSE = \sqrt {\sum_{i=1}^n (\hat y_i - y_i)^2 / n}$
In [28]:
RMSE <- function(predicted, target) {
    se <- 0
    # Accumulate the squared errors
    for (i in seq_along(predicted)) {
        se <- se + (predicted[i]-target[i])^2
    }
    return (sqrt(se/length(predicted)))
}

First Model¶
Try fitting all variables to see what appears to be important
In [29]:
fit1 <- lm(price ~ ., data=houseData)
summary(fit1)

Call:
lm(formula = price ~ ., data = houseData)

Residuals:
Min 1Q Median 3Q Max
-1247108 -65534 265 55635 1899005

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.43e+06 2.00e+05 7.14 9.7e-13 ***
bedrooms -1.14e+04 2.29e+03 -5.00 5.8e-07 ***
bathrooms 2.52e+04 3.63e+03 6.93 4.5e-12 ***
sqft_living 1.47e+02 3.68e+00 39.95 < 2e-16 ***
sqft_lot 2.18e-01 4.45e-02 4.89 1.0e-06 ***
waterfront1 8.85e+05 1.83e+04 48.26 < 2e-16 ***
condition2 6.57e+04 3.96e+04 1.66 0.09686 .
condition3 7.83e+04 3.63e+04 2.16 0.03093 *
condition4 1.02e+05 3.63e+04 2.82 0.00484 **
condition5 1.36e+05 3.66e+04 3.73 0.00019 ***
grade4 -8.21e+04 1.17e+05 -0.70 0.48363
grade5 -1.22e+05 1.10e+05 -1.10 0.26962
grade6 -1.24e+05 1.10e+05 -1.13 0.25869
grade7 -1.14e+05 1.10e+05 -1.04 0.29907
grade8 -7.40e+04 1.10e+05 -0.67 0.50073
grade9 2.52e+04 1.10e+05 0.23 0.81882
grade10 1.72e+05 1.10e+05 1.56 0.11876
grade11 4.05e+05 1.11e+05 3.65 0.00026 ***
grade12 8.20e+05 1.13e+05 7.24 4.7e-13 ***
grade13 2.14e+06 1.37e+05 15.65 < 2e-16 ***
yr_built -7.28e+02 8.42e+01 -8.64 < 2e-16 ***
zipcode98002 7.90e+03 2.02e+04 0.39 0.69608
zipcode98003 -1.16e+03 1.84e+04 -0.06 0.94982
zipcode98004 7.37e+05 1.77e+04 41.71 < 2e-16 ***
zipcode98005 2.78e+05 2.04e+04 13.65 < 2e-16 ***
zipcode98006 2.66e+05 1.60e+04 16.61 < 2e-16 ***
zipcode98007 2.42e+05 2.28e+04 10.65 < 2e-16 ***
zipcode98008 2.64e+05 1.82e+04 14.50 < 2e-16 ***
zipcode98010 5.40e+04 2.75e+04 1.96 0.04946 *
zipcode98011 1.41e+05 2.05e+04 6.90 5.7e-12 ***
zipcode98014 9.20e+04 2.34e+04 3.93 8.4e-05 ***
zipcode98019 9.84e+04 2.16e+04 4.56 5.3e-06 ***
zipcode98022 1.96e+04 1.99e+04 0.99 0.32381
zipcode98023 -3.11e+04 1.55e+04 -2.00 0.04582 *
zipcode98024 1.45e+05 2.61e+04 5.56 2.7e-08 ***
zipcode98027 1.61e+05 1.62e+04 9.97 < 2e-16 ***
zipcode98028 1.19e+05 1.90e+04 6.23 4.7e-10 ***
zipcode98029 2.08e+05 1.73e+04 12.08 < 2e-16 ***
zipcode98030 1.27e+04 1.82e+04 0.69 0.48711
zipcode98031 2.17e+04 1.89e+04 1.15 0.25183
zipcode98032 -7.44e+03 2.37e+04 -0.31 0.75365
zipcode98033 3.78e+05 1.66e+04 22.84 < 2e-16 ***
zipcode98034 2.08e+05 1.53e+04 13.62 < 2e-16 ***
zipcode98038 3.81e+04 1.55e+04 2.46 0.01385 *
zipcode98039 1.29e+06 3.46e+04 37.26 < 2e-16 ***
zipcode98040 5.23e+05 1.86e+04 28.15 < 2e-16 ***
zipcode98042 9.66e+03 1.52e+04 0.64 0.52537
zipcode98045 1.08e+05 1.85e+04 5.84 5.5e-09 ***
zipcode98052 2.44e+05 1.56e+04 15.66 < 2e-16 ***
zipcode98053 2.13e+05 1.62e+04 13.14 < 2e-16 ***
zipcode98055 3.99e+04 1.82e+04 2.19 0.02880 *
zipcode98056 8.91e+04 1.65e+04 5.39 7.2e-08 ***
zipcode98058 2.05e+04 1.58e+04 1.30 0.19482
zipcode98059 8.15e+04 1.60e+04 5.10 3.5e-07 ***
zipcode98065 1.14e+05 1.74e+04 6.55 6.1e-11 ***
zipcode98070 -2.98e+04 2.47e+04 -1.21 0.22745
zipcode98072 1.58e+05 1.91e+04 8.31 < 2e-16 ***
zipcode98074 1.66e+05 1.62e+04 10.26 < 2e-16 ***
zipcode98075 1.65e+05 1.70e+04 9.70 < 2e-16 ***
zipcode98077 1.11e+05 2.03e+04 5.46 4.9e-08 ***
zipcode98092 -1.03e+04 1.70e+04 -0.60 0.54601
zipcode98102 4.06e+05 2.60e+04 15.62 < 2e-16 ***
zipcode98103 3.19e+05 1.55e+04 20.59 < 2e-16 ***
zipcode98105 4.75e+05 1.89e+04 25.17 < 2e-16 ***
zipcode98106 1.03e+05 1.76e+04 5.85 5.2e-09 ***
zipcode98107 3.31e+05 1.80e+04 18.39 < 2e-16 ***
zipcode98108 9.75e+04 2.03e+04 4.81 1.5e-06 ***
zipcode98109 4.39e+05 2.58e+04 17.03 < 2e-16 ***
zipcode98112 5.86e+05 1.89e+04 31.05 < 2e-16 ***
zipcode98115 3.31e+05 1.55e+04 21.29 < 2e-16 ***
zipcode98116 2.92e+05 1.77e+04 16.49 < 2e-16 ***
zipcode98117 2.97e+05 1.56e+04 19.09 < 2e-16 ***
zipcode98118 1.47e+05 1.58e+04 9.30 < 2e-16 ***
zipcode98119 5.13e+05 2.18e+04 23.53 < 2e-16 ***
zipcode98122 3.39e+05 1.84e+04 18.46 < 2e-16 ***
zipcode98125 1.98e+05 1.61e+04 12.28 < 2e-16 ***
zipcode98126 1.96e+05 1.71e+04 11.51 < 2e-16 ***
zipcode98133 1.42e+05 1.56e+04 9.09 < 2e-16 ***
zipcode98136 2.63e+05 1.91e+04 13.77 < 2e-16 ***
zipcode98144 2.84e+05 1.75e+04 16.18 < 2e-16 ***
zipcode98146 1.04e+05 1.80e+04 5.77 8.0e-09 ***
zipcode98148 7.37e+04 3.21e+04 2.30 0.02162 *
zipcode98155 1.43e+05 1.62e+04 8.78 < 2e-16 ***
zipcode98166 8.00e+04 1.91e+04 4.19 2.8e-05 ***
zipcode98168 3.63e+04 1.84e+04 1.97 0.04896 *
zipcode98177 2.60e+05 1.86e+04 13.95 < 2e-16 ***
zipcode98178 4.04e+04 1.88e+04 2.14 0.03205 *
zipcode98188 2.88e+04 2.16e+04 1.34 0.18112
zipcode98198 9.73e+03 1.80e+04 0.54 0.58896
zipcode98199 3.99e+05 1.80e+04 22.17 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 154000 on 9910 degrees of freedom
Multiple R-squared: 0.827, Adjusted R-squared: 0.826
F-statistic: 533 on 89 and 9910 DF, p-value: <2e-16

The adjusted R-squared ($R^2$) value indicates this model explains 82.6% of the variation in house prices.
The F-statistic of 533 has a p-value < 2e-16, so the null hypothesis (the model explains nothing) is rejected and the model is useful.
The p-values for the coefficients show all the non-factor variables are significant at the 0.05 level. The factors all have levels that are significant, but some levels of condition, grade and zipcode are not.
Check the residuals using the plot function
In [30]:
par(mfrow=c(2,2))
plot(fit1)

The model plots show:
• Residuals vs Fitted - the residuals are reasonably evenly distributed around zero, but they funnel out as the fitted value increases. This model violates the assumption of homoscedasticity: the variance of the error terms changes along the regression line
• Normal Q-Q - the residuals deviate significantly from the dashed line, indicating the residuals are not normally distributed
• Scale-Location - the chart shows the model violates the assumption of equal variance
• Residuals vs Leverage - the chart shows there are some possibly influential outliers, however they appear to cancel each other out
Use step to remove unimportant variables
In [31]:
step(fit1)

Start: AIC=238963
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront +
condition + grade + yr_built + zipcode

Df Sum of Sq RSS AIC
<none> 2.35e+14 238963
- sqft_lot 1 5.66e+11 2.35e+14 238985
- bedrooms 1 5.92e+11 2.35e+14 238986
- bathrooms 1 1.14e+12 2.36e+14 239009
- yr_built 1 1.77e+12 2.36e+14 239036
- condition 4 2.52e+12 2.37e+14 239062
- sqft_living 1 3.78e+13 2.72e+14 240454
- waterfront 1 5.51e+13 2.90e+14 241072
- grade 10 6.40e+13 2.99e+14 241355
- zipcode 69 2.05e+14 4.40e+14 245116

Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + condition + grade + yr_built + zipcode, data = houseData)

Coefficients:
(Intercept) bedrooms bathrooms sqft_living sqft_lot
1.43e+06 -1.14e+04 2.52e+04 1.47e+02 2.18e-01
waterfront1 condition2 condition3 condition4 condition5
8.85e+05 6.57e+04 7.83e+04 1.02e+05 1.36e+05
grade4 grade5 grade6 grade7 grade8
-8.21e+04 -1.22e+05 -1.24e+05 -1.14e+05 -7.40e+04
grade9 grade10 grade11 grade12 grade13
2.52e+04 1.72e+05 4.05e+05 8.20e+05 2.14e+06
yr_built zipcode98002 zipcode98003 zipcode98004 zipcode98005
-7.28e+02 7.90e+03 -1.16e+03 7.37e+05 2.78e+05
zipcode98006 zipcode98007 zipcode98008 zipcode98010 zipcode98011
2.66e+05 2.42e+05 2.64e+05 5.40e+04 1.41e+05
zipcode98014 zipcode98019 zipcode98022 zipcode98023 zipcode98024
9.20e+04 9.84e+04 1.96e+04 -3.11e+04 1.45e+05
zipcode98027 zipcode98028 zipcode98029 zipcode98030 zipcode98031
1.61e+05 1.19e+05 2.08e+05 1.27e+04 2.17e+04
zipcode98032 zipcode98033 zipcode98034 zipcode98038 zipcode98039
-7.44e+03 3.78e+05 2.08e+05 3.81e+04 1.29e+06
zipcode98040 zipcode98042 zipcode98045 zipcode98052 zipcode98053
5.23e+05 9.66e+03 1.08e+05 2.44e+05 2.13e+05
zipcode98055 zipcode98056 zipcode98058 zipcode98059 zipcode98065
3.99e+04 8.91e+04 2.05e+04 8.15e+04 1.14e+05
zipcode98070 zipcode98072 zipcode98074 zipcode98075 zipcode98077
-2.98e+04 1.58e+05 1.66e+05 1.65e+05 1.11e+05
zipcode98092 zipcode98102 zipcode98103 zipcode98105 zipcode98106
-1.03e+04 4.06e+05 3.19e+05 4.75e+05 1.03e+05
zipcode98107 zipcode98108 zipcode98109 zipcode98112 zipcode98115
3.31e+05 9.75e+04 4.39e+05 5.86e+05 3.31e+05
zipcode98116 zipcode98117 zipcode98118 zipcode98119 zipcode98122
2.92e+05 2.97e+05 1.47e+05 5.13e+05 3.39e+05
zipcode98125 zipcode98126 zipcode98133 zipcode98136 zipcode98144
1.98e+05 1.96e+05 1.42e+05 2.63e+05 2.84e+05
zipcode98146 zipcode98148 zipcode98155 zipcode98166 zipcode98168
1.04e+05 7.37e+04 1.43e+05 8.00e+04 3.63e+04
zipcode98177 zipcode98178 zipcode98188 zipcode98198 zipcode98199
2.60e+05 4.04e+04 2.88e+04 9.73e+03 3.99e+05

Running step in the backward direction has not removed any variables: dropping any one of them would increase the AIC, so all of them are kept in the model
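The backward selection used above can be sketched on a small dataset. This is a minimal illustration using R's built-in mtcars data rather than the house sales data, and the choice of predictors is arbitrary. It shows that step compares models on AIC (not on coefficient p-values), and that drop1 reports the AIC that would result from deleting each term in turn.

```r
# Illustrative only: built-in mtcars data, not the house sales data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# drop1() lists the AIC obtained by deleting each term in turn;
# the <none> row is the AIC of the current model.
print(drop1(full))

# step() repeats this comparison, removing the term whose deletion lowers
# the AIC the most, and stops when no single deletion lowers it further.
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))

# extractAIC() returns c(edf, AIC); this is the criterion step() uses.
aic.full <- extractAIC(full)[2]
aic.reduced <- extractAIC(reduced)[2]
cat("AIC full:", aic.full, " AIC reduced:", aic.reduced, "\n")
```

If no deletion lowers the AIC, as happened with fit1, step returns the starting model unchanged.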

Account for the Heteroscedasticity¶
The second model corrects the heteroscedasticity seen in the first model by using a log transformation of the response variable. It also uses log transformations of the sqft_living and sqft_lot predictor variables as the distributions of these variables were skewed.
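Why a log transformation of the response can stabilise the variance may be easier to see on simulated data. The sketch below is an assumption-laden illustration, not part of the original analysis: the variables size and price are simulated with multiplicative noise, and spread.ratio is a hypothetical helper that compares the residual spread in the upper and lower halves of the fitted values.

```r
# Illustrative only: simulated data with multiplicative noise, not the house data.
set.seed(1)
n <- 2000
size <- runif(n, 500, 5000)                       # hypothetical predictor
price <- exp(11 + 0.0004 * size + rnorm(n, sd = 0.2))

fit.raw <- lm(price ~ size)        # response on the original scale
fit.log <- lm(log(price) ~ size)   # response on the log scale

# Hypothetical helper: ratio of the residual standard deviation for
# observations with above-median fitted values to that for the rest.
# A ratio well above 1 indicates the residuals funnel out.
spread.ratio <- function(fit) {
  f <- fitted(fit)
  r <- resid(fit)
  upper <- f > median(f)
  sd(r[upper]) / sd(r[!upper])
}

cat("Spread ratio, raw price:", spread.ratio(fit.raw), "\n")
cat("Spread ratio, log price:", spread.ratio(fit.log), "\n")
```

A ratio close to 1 for the log model corresponds to a flatter Scale-Location plot.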
In [32]:
fit2 <- lm(log(price) ~ . + log(sqft_living) + log(sqft_lot), data=houseData)
summary(fit2)

Call:
lm(formula = log(price) ~ . + log(sqft_living) + log(sqft_lot),
data = houseData)

Residuals:
Min 1Q Median 3Q Max
-1.0465 -0.1022 0.0025 0.1039 1.1231

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.03e+01 2.96e-01 34.88 < 2e-16 ***
bedrooms -1.75e-02 2.86e-03 -6.14 8.7e-10 ***
bathrooms 3.76e-02 4.51e-03 8.35 < 2e-16 ***
sqft_living 5.55e-05 9.32e-06 5.95 2.7e-09 ***
sqft_lot -2.90e-08 7.27e-08 -0.40 0.69020
waterfront1 6.56e-01 2.25e-02 29.18 < 2e-16 ***
condition2 1.09e-01 4.84e-02 2.25 0.02464 *
condition3 2.69e-01 4.44e-02 6.06 1.4e-09 ***
condition4 2.99e-01 4.44e-02 6.74 1.7e-11 ***
condition5 3.61e-01 4.48e-02 8.05 9.1e-16 ***
grade4 -4.70e-01 1.43e-01 -3.28 0.00105 **
grade5 -5.05e-01 1.35e-01 -3.73 0.00019 ***
grade6 -4.30e-01 1.35e-01 -3.20 0.00140 **
grade7 -3.35e-01 1.35e-01 -2.48 0.01301 *
grade8 -2.14e-01 1.35e-01 -1.58 0.11317
grade9 -6.80e-02 1.35e-01 -0.50 0.61470
grade10 3.00e-02 1.35e-01 0.22 0.82452
grade11 1.61e-01 1.36e-01 1.19 0.23583
grade12 2.94e-01 1.39e-01 2.12 0.03370 *
grade13 4.35e-01 1.68e-01 2.60 0.00946 **
yr_built -5.69e-04 1.10e-04 -5.18 2.3e-07 ***
zipcode98002 -1.56e-03 2.48e-02 -0.06 0.94997
zipcode98003 3.20e-02 2.25e-02 1.42 0.15562
zipcode98004 1.15e+00 2.16e-02 52.96 < 2e-16 ***
zipcode98005 7.15e-01 2.49e-02 28.67 < 2e-16 ***
zipcode98006 6.86e-01 1.96e-02 34.93 < 2e-16 ***
zipcode98007 6.75e-01 2.79e-02 24.18 < 2e-16 ***
zipcode98008 6.83e-01 2.23e-02 30.65 < 2e-16 ***
zipcode98010 2.14e-01 3.37e-02 6.35 2.2e-10 ***
zipcode98011 4.71e-01 2.51e-02 18.77 < 2e-16 ***
zipcode98014 3.18e-01 2.87e-02 11.11 < 2e-16 ***
zipcode98019 3.40e-01 2.64e-02 12.85 < 2e-16 ***
zipcode98022 8.68e-02 2.43e-02 3.56 0.00037 ***
zipcode98023 -4.45e-03 1.90e-02 -0.23 0.81534
zipcode98024 4.31e-01 3.19e-02 13.50 < 2e-16 ***
zipcode98027 5.29e-01 1.98e-02 26.69 < 2e-16 ***
zipcode98028 4.23e-01 2.33e-02 18.17 < 2e-16 ***
zipcode98029 6.28e-01 2.12e-02 29.65 < 2e-16 ***
zipcode98030 9.51e-02 2.23e-02 4.27 2.0e-05 ***
zipcode98031 9.53e-02 2.31e-02 4.12 3.8e-05 ***
zipcode98032 -8.42e-03 2.90e-02 -0.29 0.77153
zipcode98033 8.20e-01 2.03e-02 40.42 < 2e-16 ***
zipcode98034 5.80e-01 1.87e-02 31.00 < 2e-16 ***
zipcode98038 2.12e-01 1.89e-02 11.20 < 2e-16 ***
zipcode98039 1.28e+00 4.24e-02 30.18 < 2e-16 ***
zipcode98040 9.22e-01 2.28e-02 40.52 < 2e-16 ***
zipcode98042 9.70e-02 1.86e-02 5.21 1.9e-07 ***
zipcode98045 3.45e-01 2.27e-02 15.17 < 2e-16 ***
zipcode98052 6.61e-01 1.91e-02 34.67 < 2e-16 ***
zipcode98053 6.02e-01 1.98e-02 30.41 < 2e-16 ***
zipcode98055 1.79e-01 2.24e-02 8.02 1.2e-15 ***
zipcode98056 3.45e-01 2.03e-02 17.05 < 2e-16 ***
zipcode98058 1.66e-01 1.93e-02 8.58 < 2e-16 ***
zipcode98059 3.70e-01 1.96e-02 18.89 < 2e-16 ***
zipcode98065 4.71e-01 2.14e-02 22.01 < 2e-16 ***
zipcode98070 2.88e-01 3.03e-02 9.50 < 2e-16 ***
zipcode98072 4.77e-01 2.34e-02 20.44 < 2e-16 ***
zipcode98074 5.63e-01 1.98e-02 28.43 < 2e-16 ***
zipcode98075 5.89e-01 2.08e-02 28.27 < 2e-16 ***
zipcode98077 4.27e-01 2.50e-02 17.11 < 2e-16 ***
zipcode98092 5.04e-02 2.08e-02 2.43 0.01529 *
zipcode98102 9.81e-01 3.24e-02 30.29 < 2e-16 ***
zipcode98103 8.62e-01 1.97e-02 43.80 < 2e-16 ***
zipcode98105 9.97e-01 2.36e-02 42.26 < 2e-16 ***
zipcode98106 3.56e-01 2.18e-02 16.39 < 2e-16 ***
zipcode98107 8.91e-01 2.27e-02 39.32 < 2e-16 ***
zipcode98108 3.70e-01 2.50e-02 14.77 < 2e-16 ***
zipcode98109 9.94e-01 3.21e-02 30.97 < 2e-16 ***
zipcode98112 1.07e+00 2.37e-02 45.08 < 2e-16 ***
zipcode98115 8.57e-01 1.94e-02 44.16 < 2e-16 ***
zipcode98116 8.06e-01 2.20e-02 36.59 < 2e-16 ***
zipcode98117 8.35e-01 1.95e-02 42.76 < 2e-16 ***
zipcode98118 4.90e-01 1.97e-02 24.90 < 2e-16 ***
zipcode98119 1.05e+00 2.73e-02 38.50 < 2e-16 ***
zipcode98122 8.70e-01 2.32e-02 37.49 < 2e-16 ***
zipcode98125 5.99e-01 1.99e-02 30.18 < 2e-16 ***
zipcode98126 6.10e-01 2.12e-02 28.81 < 2e-16 ***
zipcode98133 4.64e-01 1.93e-02 24.07 < 2e-16 ***
zipcode98136 7.48e-01 2.36e-02 31.66 < 2e-16 ***
zipcode98144 7.32e-01 2.20e-02 33.25 < 2e-16 ***
zipcode98146 3.11e-01 2.21e-02 14.04 < 2e-16 ***
zipcode98148 1.84e-01 3.93e-02 4.68 3.0e-06 ***
zipcode98155 4.57e-01 1.99e-02 22.97 < 2e-16 ***
zipcode98166 3.57e-01 2.33e-02 15.31 < 2e-16 ***
zipcode98168 4.41e-02 2.26e-02 1.95 0.05084 .
zipcode98177 6.61e-01 2.29e-02 28.92 < 2e-16 ***
zipcode98178 1.64e-01 2.31e-02 7.12 1.2e-12 ***
zipcode98188 1.16e-01 2.64e-02 4.41 1.0e-05 ***
zipcode98198 9.07e-02 2.21e-02 4.11 4.0e-05 ***
zipcode98199 9.16e-01 2.24e-02 40.95 < 2e-16 ***
log(sqft_living) 3.40e-01 1.97e-02 17.28 < 2e-16 ***
log(sqft_lot) 6.76e-02 4.11e-03 16.45 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.188 on 9908 degrees of freedom
Multiple R-squared: 0.876, Adjusted R-squared: 0.874
F-statistic: 766 on 91 and 9908 DF, p-value: <2e-16

From the summary of the model:
• The F-statistic of 766 is significant at the 0.001 level, so this model is also useful.
• The adjusted $R^2$ value shows the model explains 87.4% of the variance in log(price), which suggests that using logs helps to explain more of the variation.
• The p-value for sqft_lot (0.69) is above the 0.05 significance level, so the null hypothesis (this variable does not help explain the variance) cannot be rejected. Run step to remove it and anything else that does not improve the model
Run step to remove unnecessary variables
In [33]:
fit2 <- step(fit2)
summary(fit2)

Start: AIC=-33306
log(price) ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot)

Df Sum of Sq RSS AIC
- sqft_lot 1 0 351 -33308
<none> 351 -33306
- yr_built 1 1 352 -33281
- sqft_living 1 1 352 -33272
- bedrooms 1 1 353 -33270
- bathrooms 1 2 354 -33238
- condition 4 9 361 -33047
- log(sqft_lot) 1 10 361 -33038
- log(sqft_living) 1 11 362 -33011
- waterfront 1 30 381 -32484
- grade 10 59 410 -31768
- zipcode 69 622 973 -23256

Step: AIC=-33308
log(price) ~ bedrooms + bathrooms + sqft_living + waterfront +
condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot)

Df Sum of Sq RSS AIC
<none> 351 -33308
- yr_built 1 1 352 -33281
- sqft_living 1 1 352 -33274
- bedrooms 1 1 353 -33272
- bathrooms 1 2 354 -33240
- condition 4 9 361 -33049
- log(sqft_living) 1 11 362 -33010
- log(sqft_lot) 1 16 368 -32851
- waterfront 1 30 381 -32484
- grade 10 59 411 -31766
- zipcode 69 631 982 -23160

Call:
lm(formula = log(price) ~ bedrooms + bathrooms + sqft_living +
waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot), data = houseData)

Residuals:
Min 1Q Median 3Q Max
-1.0464 -0.1022 0.0026 0.1039 1.1229

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.03e+01 2.91e-01 35.56 < 2e-16 ***
bedrooms -1.75e-02 2.85e-03 -6.12 9.4e-10 ***
bathrooms 3.75e-02 4.50e-03 8.34 < 2e-16 ***
sqft_living 5.53e-05 9.32e-06 5.94 2.9e-09 ***
waterfront1 6.56e-01 2.25e-02 29.21 < 2e-16 ***
condition2 1.09e-01 4.84e-02 2.25 0.02464 *
condition3 2.69e-01 4.44e-02 6.06 1.4e-09 ***
condition4 2.99e-01 4.44e-02 6.74 1.7e-11 ***
condition5 3.61e-01 4.48e-02 8.05 9.1e-16 ***
grade4 -4.71e-01 1.43e-01 -3.28 0.00103 **
grade5 -5.06e-01 1.35e-01 -3.74 0.00018 ***
grade6 -4.31e-01 1.35e-01 -3.20 0.00136 **
grade7 -3.36e-01 1.35e-01 -2.49 0.01269 *
grade8 -2.15e-01 1.35e-01 -1.59 0.11120
grade9 -6.90e-02 1.35e-01 -0.51 0.60947
grade10 2.91e-02 1.35e-01 0.22 0.82950
grade11 1.60e-01 1.36e-01 1.18 0.23822
grade12 2.93e-01 1.39e-01 2.12 0.03438 *
grade13 4.35e-01 1.68e-01 2.60 0.00947 **
yr_built -5.77e-04 1.08e-04 -5.34 9.7e-08 ***
zipcode98002 -1.83e-03 2.48e-02 -0.07 0.94113
zipcode98003 3.20e-02 2.25e-02 1.42 0.15577
zipcode98004 1.15e+00 2.16e-02 52.96 < 2e-16 ***
zipcode98005 7.15e-01 2.49e-02 28.68 < 2e-16 ***
zipcode98006 6.86e-01 1.96e-02 34.93 < 2e-16 ***
zipcode98007 6.74e-01 2.79e-02 24.18 < 2e-16 ***
zipcode98008 6.83e-01 2.23e-02 30.65 < 2e-16 ***
zipcode98010 2.14e-01 3.37e-02 6.35 2.2e-10 ***
zipcode98011 4.71e-01 2.51e-02 18.77 < 2e-16 ***
zipcode98014 3.17e-01 2.86e-02 11.11 < 2e-16 ***
zipcode98019 3.39e-01 2.64e-02 12.84 < 2e-16 ***
zipcode98022 8.58e-02 2.42e-02 3.54 0.00040 ***
zipcode98023 -4.45e-03 1.90e-02 -0.23 0.81518
zipcode98024 4.31e-01 3.19e-02 13.49 < 2e-16 ***
zipcode98027 5.29e-01 1.98e-02 26.70 < 2e-16 ***
zipcode98028 4.23e-01 2.33e-02 18.18 < 2e-16 ***
zipcode98029 6.28e-01 2.12e-02 29.65 < 2e-16 ***
zipcode98030 9.51e-02 2.23e-02 4.27 2.0e-05 ***
zipcode98031 9.53e-02 2.31e-02 4.12 3.9e-05 ***
zipcode98032 -8.50e-03 2.90e-02 -0.29 0.76943
zipcode98033 8.20e-01 2.03e-02 40.42 < 2e-16 ***
zipcode98034 5.80e-01 1.87e-02 31.00 < 2e-16 ***
zipcode98038 2.12e-01 1.89e-02 11.19 < 2e-16 ***
zipcode98039 1.28e+00 4.24e-02 30.19 < 2e-16 ***
zipcode98040 9.23e-01 2.28e-02 40.52 < 2e-16 ***
zipcode98042 9.69e-02 1.86e-02 5.21 1.9e-07 ***
zipcode98045 3.45e-01 2.27e-02 15.17 < 2e-16 ***
zipcode98052 6.61e-01 1.91e-02 34.67 < 2e-16 ***
zipcode98053 6.02e-01 1.98e-02 30.41 < 2e-16 ***
zipcode98055 1.79e-01 2.24e-02 8.01 1.3e-15 ***
zipcode98056 3.45e-01 2.03e-02 17.04 < 2e-16 ***
zipcode98058 1.66e-01 1.93e-02 8.57 < 2e-16 ***
zipcode98059 3.70e-01 1.96e-02 18.89 < 2e-16 ***
zipcode98065 4.70e-01 2.14e-02 22.02 < 2e-16 ***
zipcode98070 2.87e-01 3.02e-02 9.49 < 2e-16 ***
zipcode98072 4.77e-01 2.33e-02 20.45 < 2e-16 ***
zipcode98074 5.63e-01 1.98e-02 28.44 < 2e-16 ***
zipcode98075 5.89e-01 2.08e-02 28.28 < 2e-16 ***
zipcode98077 4.27e-01 2.50e-02 17.12 < 2e-16 ***
zipcode98092 5.03e-02 2.08e-02 2.42 0.01551 *
zipcode98102 9.80e-01 3.22e-02 30.41 < 2e-16 ***
zipcode98103 8.61e-01 1.95e-02 44.21 < 2e-16 ***
zipcode98105 9.96e-01 2.35e-02 42.45 < 2e-16 ***
zipcode98106 3.56e-01 2.17e-02 16.40 < 2e-16 ***
zipcode98107 8.90e-01 2.25e-02 39.60 < 2e-16 ***
zipcode98108 3.69e-01 2.50e-02 14.78 < 2e-16 ***
zipcode98109 9.92e-01 3.19e-02 31.08 < 2e-16 ***
zipcode98112 1.07e+00 2.35e-02 45.30 < 2e-16 ***
zipcode98115 8.56e-01 1.93e-02 44.33 < 2e-16 ***
zipcode98116 8.06e-01 2.19e-02 36.72 < 2e-16 ***
zipcode98117 8.34e-01 1.94e-02 42.99 < 2e-16 ***
zipcode98118 4.90e-01 1.96e-02 24.96 < 2e-16 ***
zipcode98119 1.05e+00 2.71e-02 38.68 < 2e-16 ***
zipcode98122 8.69e-01 2.30e-02 37.78 < 2e-16 ***
zipcode98125 5.99e-01 1.98e-02 30.20 < 2e-16 ***
zipcode98126 6.10e-01 2.11e-02 28.87 < 2e-16 ***
zipcode98133 4.63e-01 1.92e-02 24.09 < 2e-16 ***
zipcode98136 7.47e-01 2.36e-02 31.73 < 2e-16 ***
zipcode98144 7.31e-01 2.19e-02 33.44 < 2e-16 ***
zipcode98146 3.10e-01 2.21e-02 14.03 < 2e-16 ***
zipcode98148 1.84e-01 3.93e-02 4.67 3.0e-06 ***
zipcode98155 4.57e-01 1.99e-02 22.96 < 2e-16 ***
zipcode98166 3.57e-01 2.33e-02 15.31 < 2e-16 ***
zipcode98168 4.40e-02 2.26e-02 1.95 0.05136 .
zipcode98177 6.61e-01 2.29e-02 28.92 < 2e-16 ***
zipcode98178 1.64e-01 2.31e-02 7.11 1.3e-12 ***
zipcode98188 1.16e-01 2.64e-02 4.41 1.1e-05 ***
zipcode98198 9.06e-02 2.21e-02 4.11 4.1e-05 ***
zipcode98199 9.16e-01 2.23e-02 41.07 < 2e-16 ***
log(sqft_living) 3.40e-01 1.96e-02 17.37 < 2e-16 ***
log(sqft_lot) 6.65e-02 3.08e-03 21.57 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.188 on 9909 degrees of freedom
Multiple R-squared: 0.876, Adjusted R-squared: 0.874
F-statistic: 775 on 90 and 9909 DF, p-value: <2e-16

Step has removed sqft_lot, but nothing else
The adjusted $R^2$ value is still 87.4%. This $R^2$ value is calculated on the predicted log of the price, so calculate the $R^2$ for the prices themselves using the Model.Accuracy function defined above
In [34]:
# Back-transform the fitted log prices to prices
houseData.predict <- exp(fit2$fitted.values)
# Only rsquared is used here; it does not depend on the df and p arguments
cat("R-Squared:",Model.Accuracy(houseData.predict,houseData$price,9889,110)$rsquared)

R-Squared: 0.8686

That has lowered $R^2$ slightly to 86.9%, but that is still an improvement of about 4 percentage points on the first model (82.7%).
Check the residuals using the plot function
In [35]:
par(mfrow=c(2,2))
plot(fit2)

That has improved the residual plots. The residuals are more evenly distributed and the Scale-Location plot shows the variance is nearly equal. The Normal Q-Q plot shows that the residuals are not quite normally distributed, though.
Estimating the log of price rather than price directly, and using the logs of sqft_living and sqft_lot, has helped meet the linear regression assumptions and improved the $R^2$ value by about 4 percentage points.
Correlation and Interaction¶
What about correlation between the variables?
Add some interaction terms for the correlated variables: sqft_living, bathrooms and grade; the negatively correlated yr_built and condition; and the correlated waterfront and zipcode
In [36]:
fit3 <- lm(log(price) ~ bedrooms + bathrooms + sqft_living +
    waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
    log(sqft_lot) + sqft_living:grade + sqft_living:bathrooms + grade:bathrooms +
    yr_built:condition + waterfront:zipcode, data = houseData)
summary(fit3)

Call:
lm(formula = log(price) ~ bedrooms + bathrooms + sqft_living +
waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot) + sqft_living:grade + sqft_living:bathrooms +
grade:bathrooms + yr_built:condition + waterfront:zipcode,
data = houseData)

Residuals:
Min 1Q Median 3Q Max
-1.0209 -0.0976 0.0024 0.1009 1.1332

Coefficients: (48 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.16e+00 5.28e+00 -0.98 0.32902
bedrooms -1.64e-02 2.83e-03 -5.78 7.8e-09 ***
bathrooms 1.02e-01 1.55e-01 0.66 0.51230
sqft_living 8.15e-04 3.61e-03 0.23 0.82161
waterfront1 5.10e-01 8.47e-02 6.03 1.7e-09 ***
condition2 1.19e+01 5.14e+00 2.31 0.02104 *
condition3 1.29e+01 4.87e+00 2.64 0.00838 **
condition4 1.59e+01 4.88e+00 3.26 0.00110 **
condition5 1.58e+01 4.90e+00 3.21 0.00131 **
grade4 2.93e-01 2.06e+00 0.14 0.88673
grade5 1.38e-01 2.05e+00 0.07 0.94624
grade6 1.99e-01 2.05e+00 0.10 0.92281
grade7 1.65e-01 2.05e+00 0.08 0.93579
grade8 1.90e-01 2.05e+00 0.09 0.92597
grade9 3.44e-01 2.05e+00 0.17 0.86647
grade10 3.69e-01 2.05e+00 0.18 0.85689
grade11 5.73e-01 2.05e+00 0.28 0.77943
grade12 1.27e+00 2.04e+00 0.62 0.53434
grade13 1.60e+00 2.15e+00 0.74 0.45700
yr_built 6.46e-03 2.52e-03 2.56 0.01043 *
zipcode98002 -1.16e-03 2.44e-02 -0.05 0.96218
zipcode98003 3.84e-02 2.22e-02 1.73 0.08348 .
zipcode98004 1.15e+00 2.14e-02 53.97 < 2e-16 ***
zipcode98005 7.27e-01 2.46e-02 29.54 < 2e-16 ***
zipcode98006 6.96e-01 1.95e-02 35.80 < 2e-16 ***
zipcode98007 6.94e-01 2.75e-02 25.21 < 2e-16 ***
zipcode98008 6.91e-01 2.21e-02 31.30 < 2e-16 ***
zipcode98010 2.19e-01 3.32e-02 6.60 4.2e-11 ***
zipcode98011 4.76e-01 2.47e-02 19.23 < 2e-16 ***
zipcode98014 3.16e-01 2.82e-02 11.20 < 2e-16 ***
zipcode98019 3.35e-01 2.60e-02 12.88 < 2e-16 ***
zipcode98022 8.97e-02 2.39e-02 3.75 0.00018 ***
zipcode98023 2.65e-03 1.88e-02 0.14 0.88815
zipcode98024 4.37e-01 3.15e-02 13.87 < 2e-16 ***
zipcode98027 5.35e-01 1.96e-02 27.32 < 2e-16 ***
zipcode98028 4.30e-01 2.30e-02 18.71 < 2e-16 ***
zipcode98029 6.41e-01 2.09e-02 30.69 < 2e-16 ***
zipcode98030 9.90e-02 2.20e-02 4.50 6.7e-06 ***
zipcode98031 1.04e-01 2.28e-02 4.58 4.8e-06 ***
zipcode98032 -4.17e-04 2.86e-02 -0.01 0.98837
zipcode98033 8.33e-01 2.01e-02 41.52 < 2e-16 ***
zipcode98034 5.87e-01 1.85e-02 31.74 < 2e-16 ***
zipcode98038 2.10e-01 1.87e-02 11.23 < 2e-16 ***
zipcode98039 1.32e+00 4.36e-02 30.40 < 2e-16 ***
zipcode98040 9.39e-01 2.29e-02 41.08 < 2e-16 ***
zipcode98042 1.04e-01 1.84e-02 5.69 1.3e-08 ***
zipcode98045 3.49e-01 2.24e-02 15.55 < 2e-16 ***
zipcode98052 6.73e-01 1.88e-02 35.72 < 2e-16 ***
zipcode98053 6.05e-01 1.95e-02 30.99 < 2e-16 ***
zipcode98055 1.86e-01 2.20e-02 8.46 < 2e-16 ***
zipcode98056 3.51e-01 2.01e-02 17.43 < 2e-16 ***
zipcode98058 1.74e-01 1.91e-02 9.14 < 2e-16 ***
zipcode98059 3.71e-01 1.93e-02 19.19 < 2e-16 ***
zipcode98065 4.68e-01 2.11e-02 22.20 < 2e-16 ***
zipcode98070 3.75e-01 3.23e-02 11.59 < 2e-16 ***
zipcode98072 4.89e-01 2.30e-02 21.24 < 2e-16 ***
zipcode98074 5.63e-01 1.96e-02 28.68 < 2e-16 ***
zipcode98075 5.83e-01 2.07e-02 28.17 < 2e-16 ***
zipcode98077 4.38e-01 2.47e-02 17.78 < 2e-16 ***
zipcode98092 5.62e-02 2.05e-02 2.75 0.00604 **
zipcode98102 1.01e+00 3.20e-02 31.57 < 2e-16 ***
zipcode98103 8.75e-01 1.93e-02 45.39 < 2e-16 ***
zipcode98105 1.01e+00 2.33e-02 43.29 < 2e-16 ***
zipcode98106 3.67e-01 2.14e-02 17.14 < 2e-16 ***
zipcode98107 9.05e-01 2.22e-02 40.73 < 2e-16 ***
zipcode98108 3.77e-01 2.46e-02 15.31 < 2e-16 ***
zipcode98109 1.01e+00 3.15e-02 31.98 < 2e-16 ***
zipcode98112 1.08e+00 2.33e-02 46.42 < 2e-16 ***
zipcode98115 8.62e-01 1.91e-02 45.27 < 2e-16 ***
zipcode98116 8.17e-01 2.16e-02 37.77 < 2e-16 ***
zipcode98117 8.41e-01 1.91e-02 43.94 < 2e-16 ***
zipcode98118 4.99e-01 1.94e-02 25.70 < 2e-16 ***
zipcode98119 1.07e+00 2.68e-02 39.99 < 2e-16 ***
zipcode98122 8.91e-01 2.28e-02 39.16 < 2e-16 ***
zipcode98125 6.07e-01 1.96e-02 30.93 < 2e-16 ***
zipcode98126 6.18e-01 2.08e-02 29.66 < 2e-16 ***
zipcode98133 4.72e-01 1.90e-02 24.86 < 2e-16 ***
zipcode98136 7.54e-01 2.34e-02 32.20 < 2e-16 ***
zipcode98144 7.37e-01 2.17e-02 34.05 < 2e-16 ***
zipcode98146 3.20e-01 2.19e-02 14.61 < 2e-16 ***
zipcode98148 1.88e-01 3.87e-02 4.86 1.2e-06 ***
zipcode98155 4.59e-01 1.97e-02 23.31 < 2e-16 ***
zipcode98166 3.77e-01 2.34e-02 16.13 < 2e-16 ***
zipcode98168 5.46e-02 2.23e-02 2.45 0.01442 *
zipcode98177 6.68e-01 2.26e-02 29.62 < 2e-16 ***
zipcode98178 1.68e-01 2.29e-02 7.35 2.2e-13 ***
zipcode98188 1.20e-01 2.60e-02 4.63 3.7e-06 ***
zipcode98198 1.14e-01 2.20e-02 5.17 2.4e-07 ***
zipcode98199 9.28e-01 2.20e-02 42.20 < 2e-16 ***
log(sqft_living) 5.63e-01 4.25e-02 13.25 < 2e-16 ***
log(sqft_lot) 6.85e-02 3.10e-03 22.05 < 2e-16 ***
sqft_living:grade4 -1.53e-03 3.61e-03 -0.42 0.67172
sqft_living:grade5 -1.13e-03 3.61e-03 -0.31 0.75525
sqft_living:grade6 -1.05e-03 3.61e-03 -0.29 0.77201
sqft_living:grade7 -9.36e-04 3.61e-03 -0.26 0.79546
sqft_living:grade8 -8.49e-04 3.61e-03 -0.24 0.81407
sqft_living:grade9 -8.62e-04 3.61e-03 -0.24 0.81137
sqft_living:grade10 -8.66e-04 3.61e-03 -0.24 0.81034
sqft_living:grade11 -8.64e-04 3.61e-03 -0.24 0.81087
sqft_living:grade12 -9.17e-04 3.61e-03 -0.25 0.79942
sqft_living:grade13 -1.04e-03 3.66e-03 -0.29 0.77542
bathrooms:sqft_living 1.26e-05 5.11e-06 2.47 0.01350 *
bathrooms:grade4 1.17e-01 2.49e-01 0.47 0.63689
bathrooms:grade5 -2.94e-02 1.68e-01 -0.18 0.86062
bathrooms:grade6 -7.08e-02 1.54e-01 -0.46 0.64576
bathrooms:grade7 -7.86e-02 1.52e-01 -0.52 0.60606
bathrooms:grade8 -1.16e-01 1.52e-01 -0.77 0.44336
bathrooms:grade9 -1.14e-01 1.51e-01 -0.75 0.45119
bathrooms:grade10 -8.55e-02 1.50e-01 -0.57 0.56910
bathrooms:grade11 -1.12e-01 1.50e-01 -0.75 0.45387
bathrooms:grade12 -1.83e-01 1.53e-01 -1.20 0.23124
bathrooms:grade13 NA NA NA NA
condition2:yr_built -6.09e-03 2.66e-03 -2.29 0.02193 *
condition3:yr_built -6.52e-03 2.52e-03 -2.58 0.00978 **
condition4:yr_built -8.07e-03 2.52e-03 -3.20 0.00140 **
condition5:yr_built -7.96e-03 2.54e-03 -3.14 0.00171 **
waterfront1:zipcode98002 NA NA NA NA
waterfront1:zipcode98003 NA NA NA NA
waterfront1:zipcode98004 NA NA NA NA
waterfront1:zipcode98005 NA NA NA NA
waterfront1:zipcode98006 1.53e-01 1.57e-01 0.98 0.32861
waterfront1:zipcode98007 NA NA NA NA
waterfront1:zipcode98008 3.92e-01 1.62e-01 2.42 0.01562 *
waterfront1:zipcode98010 NA NA NA NA
waterfront1:zipcode98011 NA NA NA NA
waterfront1:zipcode98014 NA NA NA NA
waterfront1:zipcode98019 NA NA NA NA
waterfront1:zipcode98022 NA NA NA NA
waterfront1:zipcode98023 3.92e-01 2.04e-01 1.92 0.05482 .
waterfront1:zipcode98024 NA NA NA NA
waterfront1:zipcode98027 5.61e-02 1.59e-01 0.35 0.72441
waterfront1:zipcode98028 NA NA NA NA
waterfront1:zipcode98029 NA NA NA NA
waterfront1:zipcode98030 NA NA NA NA
waterfront1:zipcode98031 NA NA NA NA
waterfront1:zipcode98032 NA NA NA NA
waterfront1:zipcode98033 2.68e-01 1.60e-01 1.68 0.09343 .
waterfront1:zipcode98034 4.65e-01 1.57e-01 2.96 0.00305 **
waterfront1:zipcode98038 NA NA NA NA
waterfront1:zipcode98039 -1.91e-01 2.08e-01 -0.92 0.35956
waterfront1:zipcode98040 1.86e-01 1.12e-01 1.66 0.09635 .
waterfront1:zipcode98042 NA NA NA NA
waterfront1:zipcode98045 NA NA NA NA
waterfront1:zipcode98052 1.63e-01 2.06e-01 0.79 0.42780
waterfront1:zipcode98053 NA NA NA NA
waterfront1:zipcode98055 NA NA NA NA
waterfront1:zipcode98056 5.63e-01 1.57e-01 3.57 0.00035 ***
waterfront1:zipcode98058 NA NA NA NA
waterfront1:zipcode98059 NA NA NA NA
waterfront1:zipcode98065 NA NA NA NA
waterfront1:zipcode98070 -2.27e-01 1.04e-01 -2.18 0.02925 *
waterfront1:zipcode98072 NA NA NA NA
waterfront1:zipcode98074 4.68e-01 1.19e-01 3.92 9.0e-05 ***
waterfront1:zipcode98075 4.01e-01 1.15e-01 3.50 0.00047 ***
waterfront1:zipcode98077 NA NA NA NA
waterfront1:zipcode98092 NA NA NA NA
waterfront1:zipcode98102 NA NA NA NA
waterfront1:zipcode98103 NA NA NA NA
waterfront1:zipcode98105 -2.52e-02 1.39e-01 -0.18 0.85587
waterfront1:zipcode98106 NA NA NA NA
waterfront1:zipcode98107 NA NA NA NA
waterfront1:zipcode98108 NA NA NA NA
waterfront1:zipcode98109 NA NA NA NA
waterfront1:zipcode98112 NA NA NA NA
waterfront1:zipcode98115 NA NA NA NA
waterfront1:zipcode98116 NA NA NA NA
waterfront1:zipcode98117 NA NA NA NA
waterfront1:zipcode98118 5.94e-02 1.38e-01 0.43 0.66586
waterfront1:zipcode98119 NA NA NA NA
waterfront1:zipcode98122 NA NA NA NA
waterfront1:zipcode98125 3.90e-01 1.37e-01 2.84 0.00446 **
waterfront1:zipcode98126 NA NA NA NA
waterfront1:zipcode98133 NA NA NA NA
waterfront1:zipcode98136 1.47e-01 1.39e-01 1.06 0.28909
waterfront1:zipcode98144 4.39e-01 1.57e-01 2.80 0.00519 **
waterfront1:zipcode98146 -2.68e-02 1.57e-01 -0.17 0.86429
waterfront1:zipcode98148 NA NA NA NA
waterfront1:zipcode98155 3.21e-01 1.37e-01 2.34 0.01932 *
waterfront1:zipcode98166 -1.43e-01 1.16e-01 -1.24 0.21589
waterfront1:zipcode98168 NA NA NA NA
waterfront1:zipcode98177 NA NA NA NA
waterfront1:zipcode98178 4.44e-01 1.57e-01 2.82 0.00478 **
waterfront1:zipcode98188 NA NA NA NA
waterfront1:zipcode98198 NA NA NA NA
waterfront1:zipcode98199 NA NA NA NA
---
Signif.
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.185 on 9863 degrees of freedom
Multiple R-squared: 0.88, Adjusted R-squared: 0.878
F-statistic: 532 on 136 and 9863 DF, p-value: <2e-16

Several of these variables and interactions do not appear to be significant:
• bathrooms
• sqft_living
• grade
• sqft_living:grade
• bathrooms:grade
The bathrooms:grade interaction will be removed in the next iteration of the model. The sqft_living:grade interaction becomes more significant in the next iteration and so is retained. As bathrooms:sqft_living is significant and sqft_living:grade ends up being significant, sqft_living, grade and bathrooms need to remain in the model.
There are also many NAs for zipcodes that do not have any waterfront properties, and warnings about singularities: factors or interactions between factors that have only one record.
Check the $R^2$ value using the Model.Accuracy function and the residual plots
In [37]:
houseData.predict <- exp(fit3$fitted.values)
cat("R-Squared:",Model.Accuracy(houseData.predict,houseData$price,9863,136)$rsquared)
par(mfrow=c(2,2))
plot(fit3)

R-Squared: 0.885
Warning message:
“not plotting observations with leverage one: 513, 639, 8099”
Warning message:
“not plotting observations with leverage one: 513, 639, 8099”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”

The $R^2$ value has improved by another 2% and the residual plots still look good, although plotting them produced some warning messages.
Look at the records referenced in the warnings
In [38]:
predict.price <- houseData.predict[c(513, 639, 8099)]
cbind(predict.price,houseData[c(513, 639, 8099),])

     predict.price   price bedrooms bathrooms sqft_living sqft_lot waterfront condition grade yr_built zipcode
513         629000  629000        3      1.75        1460    12367          1         4     8     1970   98023
639        2200000 2200000        5      4.25        4640    22703          1         5     8     1952   98052
8099        280000  280000        1      0.00         600    24501          0         2     3     1950   98045

These are records for which the model's prediction exactly matches the sale price. Two of them are waterfront properties, and the third looks unusual as it has no bathrooms, so it may be an outlier.
Generate some new Variables¶
Generating a few new variables will allow the linear regression to better model some of the interactions. These variables will be added:
A Variable to handle the Waterfront:Zipcode Interaction¶
This variable allows the model to include the waterfront:zipcode interaction. It is set as follows:
• If the property is not a waterfront property, set wf.zipcode to the zipcode
• If the property is a waterfront property and the zipcode has more than two waterfront properties, set wf.zipcode to the zipcode with the characters "-1" appended
• The remaining waterfront properties are grouped according to whether the zipcode is a low, mid or high priced zipcode.
• During model evaluation, some of the waterfront properties in the individual zipcodes were found to be influencers. The zipcodes containing these properties (98074, 98118, 98125 and 98155) have been added to the groups to lessen the influence of these observations.
A Variable to Group Year Built by Decade¶
This variable groups yr_built by decade and is factorised to allow the regression to model changes over time that cannot be captured by a mathematical expression.
A Variable to Classify Houses by Bedrooms¶
This variable classifies houses into two groups by bedrooms:
1. Houses with four or fewer bedrooms
2. Houses with more than four bedrooms
Analysis of the waterfront properties by zipcode
In [39]:
# Count the waterfront properties in each zipcode
wf1 <- aggregate(houseData$zipcode[houseData$waterfront==1], list(houseData$zipcode[houseData$waterfront==1]), NROW)
names(wf1) <- c("zipcode","wf.count")
# Mean price of the waterfront properties in each zipcode
wf2 <- aggregate(houseData$price[houseData$waterfront==1], list(houseData$zipcode[houseData$waterfront==1]), mean)
names(wf2) <- c("zipcode","price.mean")
#wf3 <- aggregate(houseData$price[houseData$waterfront==1], list(houseData$zipcode[houseData$waterfront==1]), sd)
#names(wf3) <- c("zipcode","price.sd")
wf3 <- cbind(wf1,wf2["price.mean"])
#wf4 <- cbind(wf4,wf3["price.sd"])
# Zipcodes treated individually: more than two waterfront properties, excluding those found to be influencers
wf4 <- wf3[wf3$wf.count > 2 & !(wf3$zipcode %in% c(98125,98074,98155,98118)),]
cat("Individual waterfront zipcodes:\n")
print(wf4)
wf5 <- wf3[!(wf3$zipcode %in% wf4$zipcode),]
cat("\nLow-price waterfront zipcodes:\n")
print(wf5[wf5$price.mean <= 1000000,])
cat("\nMid-price waterfront zipcodes:\n")
print(wf5[wf5$price.mean > 1000000 & wf5$price.mean <= 2400000,])
cat("\nHigh-price waterfront zipcodes:\n")
print(wf5[wf5$price.mean > 2400000,])

Individual waterfront zipcodes:
zipcode wf.count price.mean
8 98040 7 3072143
11 98070 12 639633
13 98075 6 1854833
14 98105 3 3051667
17 98136 3 1165000
21 98166 6 1021417
23 98198 5 736400

Low-price waterfront zipcodes:
zipcode wf.count price.mean
3 98023 1 629000
19 98146 2 590750

Mid-price waterfront zipcodes:
zipcode wf.count price.mean
1 98006 2 1825500
4 98027 2 2400000
9 98052 1 2200000
12 98074 5 1996600
15 98118 3 1750167
16 98125 3 1281667
20 98155 3 1608333
22 98178 2 1400000

High-price waterfront zipcodes:
zipcode wf.count price.mean
2 98008 2 3422500
5 98033 2 4252900
6 98034 2 2477500
7 98039 1 3640900
10 98056 2 2615000
18 98144 2 2750000
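
The low, mid and high price bands used above can also be expressed with cut(). The following is a minimal sketch rather than the notebook's own code: the data frame wf.example and its band column are hypothetical names, and the three rows are copied from the output above purely for illustration.

```r
# Hypothetical sample: three zipcodes and their mean waterfront prices from the output above
wf.example <- data.frame(zipcode = c(98023, 98027, 98034),
                         price.mean = c(629000, 2400000, 2477500))

# Bands: low <= 1,000,000 < mid <= 2,400,000 < high
# (cut() intervals are closed on the right by default, matching the <= comparisons)
wf.example$band <- cut(wf.example$price.mean,
                       breaks = c(-Inf, 1000000, 2400000, Inf),
                       labels = c("wf-low", "wf-mid", "wf-high"))
print(wf.example)
```

Note that 98027, with a mean price of exactly 2400000, falls in the mid band because the upper bound is inclusive.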

Function to Generate New Variables¶
Name: GenerateVariables
Input parameters:
• data – a dataframe containing the house price data
Return value:
• The modified dataframe with the extra variables added
Description
Adds the following variables to the dataframe:
• wf.zipcode – combines the zipcode and waterfront variables into a single variable
• decade – a factor representing the decade in which the house was built
• bedroom.class – a factor indicating whether the house has no more than four bedrooms or more than four bedrooms
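
As a quick check of how the decade and bedroom.class codes come out, the sketch below applies the same expressions used in the function to a few made-up values. The sample vectors are assumptions for illustration and are not taken from the dataset.

```r
# Made-up sample values, not taken from the dataset
yr_built <- c(1899, 1900, 1959, 2014, 2025)
bedrooms <- c(2, 4, 5, 3, 7)

# Decade code: 0 for anything before 1910, 11 for 2010 onwards,
# otherwise the decade counted from 1900 (for example 1959 gives 5)
decade <- ifelse(yr_built < 1900, 0,
                 ifelse(yr_built > 2019, 11, trunc(yr_built / 10) - 190))

# Bedroom class: four or fewer bedrooms versus five or more
bedroom.class <- ifelse(bedrooms < 5, "4minus", "5plus")

print(data.frame(yr_built, decade, bedrooms, bedroom.class))
```

One consequence of the expression is that houses built before 1900 share code 0 with those built from 1900 to 1909, and houses built after 2019 share code 11 with those built from 2010 to 2019.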
In [40]:
GenerateVariables <- function(data) {
    # Make a copy of the dataframe
    newdata <- data

    # Generate the wf.zipcode variable
    ## Zipcodes treated individually
    wfzipcodes <- c(98040,98070,98075,98105,98136,98166,98198)
    ## Define the zipcode groups
    wfziplow <- c(98023,98146)
    wfzipmid <- c(98006,98027,98052,98074,98118,98125,98155,98178)
    wfziphigh <- c(98008,98033,98034,98039,98056,98144)
    ## Create the new variable
    newdata$wf.zipcode <- as.character(newdata$zipcode)
    ## Use a logical index rather than row names, so the result does not depend on the row names being 1..n
    wf.individual <- newdata$waterfront == 1 & newdata$zipcode %in% wfzipcodes
    newdata$wf.zipcode[wf.individual] <- paste(newdata$zipcode[wf.individual],"1",sep="-")
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfziplow] <- "wf-low"
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfzipmid] <- "wf-mid"
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfziphigh] <- "wf-high"
    newdata$wf.zipcode <- as.factor(newdata$wf.zipcode)

    # Generate the decade variable
    newdata$decade <- as.factor(ifelse(newdata$yr_built < 1900, 0, ifelse(newdata$yr_built > 2019, 11, trunc(newdata$yr_built/10)-190)))

# Generate the bedroom.class variable
    newdata$bedroom.class <- as.factor(ifelse(newdata$bedrooms < 5, "4minus", "5plus"))

    # Return the modified dataframe
    return(newdata)
}
Generate the New Variables¶
In [41]:
houseData <- GenerateVariables(houseData)
str(houseData)

'data.frame': 10000 obs. of 13 variables:
 $ price : int 211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
 $ bedrooms : int 4 3 3 4 5 2 3 2 3 4 ...
 $ bathrooms : num 1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
 $ sqft_living : int 2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
 $ sqft_lot : int 9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
 $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ condition : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 3 3 3 3 3 3 3 ...
 $ grade : Factor w/ 11 levels "3","4","5","6",..: 5 5 9 6 7 5 4 5 7 8 ...
 $ yr_built : int 1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
 $ zipcode : Factor w/ 70 levels "98001","98002",..: 65 24 42 43 52 50 61 59 24 4 ...
 $ wf.zipcode : Factor w/ 80 levels "98001","98002",..: 71 24 45 46 56 54 66 63 24 4 ...
 $ decade : Factor w/ 12 levels "0","1","2","3",..: 6 10 11 3 12 6 8 11 8 10 ...
 $ bedroom.class: Factor w/ 2 levels "4minus","5plus": 1 1 1 1 2 1 1 1 1 1 ...

The Final Model¶
This model adds in the new variables, removes the bathrooms:grade interaction and adds two other interactions. The resulting set of terms in the model captures the following observations from the exploratory data analysis:
1. The influence the number of bathrooms, condition, grade and age of the house have on the sale price.
2. The influence of the size of the house and lot on the sale price, using log transformations of the sqft_living and sqft_lot variables.
3. The influence of the zipcode, and of whether or not the property overlooks the waterfront, on the sale price, using the generated variable wf.zipcode. This variable also captures the correlation between zipcode and waterfront.
4. The non-linear influence of the age of the property on the price, using the generated variable decade.
5. The difference between having four or fewer bedrooms and having more than four, using the generated variable bedroom.class.
6. The correlation between the size and grade of the house, using the sqft_living:grade interaction.
7. The correlation between the number of bathrooms and the size of the house, using the bathrooms:log(sqft_living) interaction.
8. The correlation between the condition and age of the house, using the condition:yr_built interaction.
9. The correlation between the size and age of the house, using the sqft_living:decade interaction.
10. The correlation between the number of bathrooms, zipcode and waterfront, using the bathrooms:wf.zipcode interaction.
In [42]:
house.model <- lm(log(price) ~ bathrooms + condition + grade + yr_built +
                      log(sqft_living) + log(sqft_lot) + wf.zipcode + decade + bedroom.class +
                      sqft_living:grade + bathrooms:log(sqft_living) + condition:yr_built +
                      sqft_living:decade + bathrooms:wf.zipcode,
                  data=houseData)
summary(house.model)

Call:
lm(formula = log(price) ~ bathrooms + condition + grade + yr_built + log(sqft_living) + log(sqft_lot) + wf.zipcode + decade + bedroom.class + sqft_living:grade + bathrooms:log(sqft_living) + condition:yr_built + sqft_living:decade + bathrooms:wf.zipcode, data = houseData)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9481 -0.0946  0.0023  0.0986  1.0992

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.84e+00 5.31e+00 -1.10 0.27185
bathrooms -1.79e-01 8.38e-02 -2.14 0.03266 *
condition2 1.36e+01 5.08e+00 2.69 0.00725 **
condition3 1.57e+01 4.82e+00 3.25 0.00116 **
condition4 1.66e+01 4.82e+00 3.43 0.00060 ***
condition5 1.63e+01 4.85e+00 3.37 0.00075 ***
grade4 -1.03e+00 1.87e+00 -0.55 0.58004
grade5 -1.26e+00 1.86e+00 -0.68 0.49743
grade6 -1.22e+00 1.86e+00 -0.66 0.51077
grade7 -1.27e+00 1.86e+00 -0.68 0.49567
grade8 -1.27e+00 1.86e+00 -0.68 0.49479
grade9 -1.12e+00 1.86e+00 -0.60 0.54606
grade10 -1.06e+00 1.86e+00 -0.57 0.56974
grade11 -9.20e-01 1.86e+00 -0.49 0.62181
grade12 -6.49e-01 1.87e+00 -0.35 0.72821
grade13 -5.76e-01 1.91e+00 -0.30 0.76298
yr_built 7.72e-03 2.58e-03 3.00 0.00274 **
log(sqft_living) 4.69e-01 3.54e-02 13.26 < 2e-16 *** log(sqft_lot) 8.04e-02 3.29e-03 24.44 < 2e-16 *** wf.zipcode98002 3.00e-02 7.24e-02 0.41 0.67860 wf.zipcode98003 1.86e-01 7.30e-02 2.55 0.01087 * wf.zipcode98004 1.47e+00 6.66e-02 22.07 < 2e-16 *** wf.zipcode98005 1.14e+00 8.63e-02 13.20 < 2e-16 *** wf.zipcode98006 8.16e-01 6.52e-02 12.52 < 2e-16 *** wf.zipcode98007 9.95e-01 1.02e-01 9.76 < 2e-16 *** wf.zipcode98008 9.11e-01 7.72e-02 11.80 < 2e-16 *** wf.zipcode98010 4.09e-01 1.14e-01 3.58 0.00035 *** wf.zipcode98011 6.98e-01 1.09e-01 6.40 1.6e-10 *** wf.zipcode98014 5.75e-01 7.91e-02 7.27 3.9e-13 *** wf.zipcode98019 4.97e-01 9.17e-02 5.42 6.2e-08 *** wf.zipcode98022 1.67e-01 7.76e-02 2.15 0.03129 * wf.zipcode98023 1.93e-01 6.28e-02 3.08 0.00209 ** wf.zipcode98024 5.80e-01 8.23e-02 7.05 2.0e-12 *** wf.zipcode98027 7.03e-01 6.75e-02 10.42 < 2e-16 *** wf.zipcode98028 5.96e-01 8.85e-02 6.74 1.7e-11 *** wf.zipcode98029 9.00e-01 8.87e-02 10.14 < 2e-16 *** wf.zipcode98030 2.21e-01 8.03e-02 2.76 0.00585 ** wf.zipcode98031 3.27e-01 7.80e-02 4.20 2.7e-05 *** wf.zipcode98032 1.78e-01 9.15e-02 1.94 0.05228 . 
wf.zipcode98033 1.01e+00 6.41e-02 15.68 < 2e-16 *** wf.zipcode98034 7.50e-01 6.21e-02 12.07 < 2e-16 *** wf.zipcode98038 3.14e-01 7.96e-02 3.95 8.0e-05 *** wf.zipcode98039 1.63e+00 1.24e-01 13.15 < 2e-16 *** wf.zipcode98040 1.23e+00 7.74e-02 15.93 < 2e-16 *** wf.zipcode98040-1 1.90e+00 1.85e-01 10.25 < 2e-16 *** wf.zipcode98042 2.32e-01 6.51e-02 3.56 0.00037 *** wf.zipcode98045 6.27e-01 7.02e-02 8.93 < 2e-16 *** wf.zipcode98052 9.18e-01 6.92e-02 13.26 < 2e-16 *** wf.zipcode98053 8.54e-01 7.13e-02 11.98 < 2e-16 *** wf.zipcode98055 3.37e-01 6.48e-02 5.20 2.0e-07 *** wf.zipcode98056 4.31e-01 6.30e-02 6.84 8.7e-12 *** wf.zipcode98058 3.12e-01 6.33e-02 4.94 8.1e-07 *** wf.zipcode98059 4.27e-01 6.49e-02 6.58 4.9e-11 *** wf.zipcode98065 6.25e-01 7.19e-02 8.70 < 2e-16 *** wf.zipcode98070 5.69e-01 8.55e-02 6.66 2.9e-11 *** wf.zipcode98070-1 8.37e-01 1.48e-01 5.65 1.7e-08 *** wf.zipcode98072 7.08e-01 8.28e-02 8.54 < 2e-16 *** wf.zipcode98074 8.37e-01 7.44e-02 11.25 < 2e-16 *** wf.zipcode98075 9.47e-01 7.94e-02 11.94 < 2e-16 *** wf.zipcode98075-1 1.72e+00 2.34e-01 7.35 2.1e-13 *** wf.zipcode98077 6.94e-01 8.52e-02 8.14 4.6e-16 *** wf.zipcode98092 2.39e-01 7.43e-02 3.21 0.00132 ** wf.zipcode98102 1.22e+00 1.08e-01 11.34 < 2e-16 *** wf.zipcode98103 1.05e+00 5.82e-02 18.09 < 2e-16 *** wf.zipcode98105 1.08e+00 6.99e-02 15.41 < 2e-16 *** wf.zipcode98105-1 1.73e+00 4.29e-01 4.05 5.2e-05 *** wf.zipcode98106 5.57e-01 6.30e-02 8.84 < 2e-16 *** wf.zipcode98107 1.09e+00 6.59e-02 16.48 < 2e-16 *** wf.zipcode98108 5.66e-01 7.29e-02 7.75 9.9e-15 *** wf.zipcode98109 1.40e+00 8.77e-02 15.93 < 2e-16 *** wf.zipcode98112 1.17e+00 7.20e-02 16.24 < 2e-16 *** wf.zipcode98115 1.01e+00 5.83e-02 17.38 < 2e-16 *** wf.zipcode98116 1.05e+00 6.59e-02 15.96 < 2e-16 *** wf.zipcode98117 1.07e+00 5.86e-02 18.32 < 2e-16 *** wf.zipcode98118 6.43e-01 5.80e-02 11.10 < 2e-16 *** wf.zipcode98119 1.26e+00 7.73e-02 16.29 < 2e-16 *** wf.zipcode98122 1.01e+00 6.79e-02 14.90 < 2e-16 *** wf.zipcode98125 7.91e-01 6.00e-02 
13.19 < 2e-16 *** wf.zipcode98126 7.39e-01 6.07e-02 12.18 < 2e-16 *** wf.zipcode98133 6.93e-01 5.91e-02 11.73 < 2e-16 *** wf.zipcode98136 9.55e-01 6.79e-02 14.07 < 2e-16 *** wf.zipcode98136-1 2.14e+00 3.15e-01 6.81 1.0e-11 *** wf.zipcode98144 8.02e-01 6.63e-02 12.10 < 2e-16 *** wf.zipcode98146 3.03e-01 6.37e-02 4.76 2.0e-06 *** wf.zipcode98148 3.49e-01 1.08e-01 3.24 0.00120 ** wf.zipcode98155 6.29e-01 6.16e-02 10.22 < 2e-16 *** wf.zipcode98166 3.71e-01 7.01e-02 5.30 1.2e-07 *** wf.zipcode98166-1 1.36e+00 2.60e-01 5.23 1.7e-07 *** wf.zipcode98168 1.24e-01 6.71e-02 1.85 0.06447 . wf.zipcode98177 8.09e-01 6.81e-02 11.88 < 2e-16 *** wf.zipcode98178 2.73e-01 6.75e-02 4.04 5.3e-05 *** wf.zipcode98188 2.58e-01 7.10e-02 3.64 0.00027 *** wf.zipcode98198 1.54e-01 6.83e-02 2.26 0.02392 * wf.zipcode98198-1 4.55e-01 2.45e-01 1.85 0.06411 . wf.zipcode98199 1.12e+00 6.74e-02 16.58 < 2e-16 *** wf.zipcodewf-high 1.64e+00 2.44e-01 6.70 2.2e-11 *** wf.zipcodewf-low 2.05e+00 3.21e-01 6.40 1.6e-10 *** wf.zipcodewf-mid 1.73e+00 1.40e-01 12.36 < 2e-16 *** decade1 -3.87e-02 3.45e-02 -1.12 0.26102 decade2 2.21e-02 3.50e-02 0.63 0.52763 decade3 9.27e-02 4.21e-02 2.20 0.02782 * decade4 -2.61e-02 4.11e-02 -0.63 0.52550 decade5 -5.26e-02 4.46e-02 -1.18 0.23757 decade6 -6.43e-02 5.07e-02 -1.27 0.20537 decade7 9.47e-03 5.61e-02 0.17 0.86601 decade8 7.36e-02 6.08e-02 1.21 0.22594 decade9 8.08e-02 6.73e-02 1.20 0.22979 decade10 1.95e-02 7.25e-02 0.27 0.78760 decade11 7.78e-03 7.88e-02 0.10 0.92135 bedroom.class5plus -2.55e-02 7.24e-03 -3.52 0.00043 *** grade3:sqft_living -1.77e-03 3.32e-03 -0.53 0.59497 grade4:sqft_living -5.98e-04 1.89e-04 -3.16 0.00156 ** grade5:sqft_living -2.54e-04 5.98e-05 -4.25 2.2e-05 *** grade6:sqft_living -1.88e-04 3.49e-05 -5.39 7.3e-08 *** grade7:sqft_living -6.90e-05 2.66e-05 -2.59 0.00953 ** grade8:sqft_living -9.10e-06 2.43e-05 -0.37 0.70817 grade9:sqft_living -1.70e-05 2.23e-05 -0.76 0.44615 grade10:sqft_living -3.74e-06 2.16e-05 -0.17 0.86260 grade11:sqft_living 
-5.39e-06 2.34e-05 -0.23 0.81785 grade12:sqft_living -1.75e-05 2.69e-05 -0.65 0.51429 grade13:sqft_living -6.94e-06 5.91e-05 -0.12 0.90651 bathrooms:log(sqft_living) 3.78e-02 1.06e-02 3.56 0.00037 *** condition2:yr_built -6.99e-03 2.62e-03 -2.66 0.00772 ** condition3:yr_built -7.94e-03 2.49e-03 -3.19 0.00145 ** condition4:yr_built -8.38e-03 2.50e-03 -3.36 0.00078 *** condition5:yr_built -8.23e-03 2.51e-03 -3.29 0.00102 ** decade1:sqft_living 2.21e-05 1.68e-05 1.32 0.18664 decade2:sqft_living -2.39e-06 1.61e-05 -0.15 0.88171 decade3:sqft_living -3.22e-05 1.80e-05 -1.79 0.07360 . decade4:sqft_living 4.81e-06 1.63e-05 0.30 0.76762 decade5:sqft_living 3.00e-06 1.49e-05 0.20 0.84017 decade6:sqft_living -6.71e-06 1.51e-05 -0.44 0.65637 decade7:sqft_living -4.58e-05 1.52e-05 -3.02 0.00257 ** decade8:sqft_living -5.89e-05 1.48e-05 -3.98 6.8e-05 *** decade9:sqft_living -5.51e-05 1.47e-05 -3.76 0.00017 *** decade10:sqft_living -1.36e-05 1.40e-05 -0.97 0.33081 decade11:sqft_living 1.71e-05 1.58e-05 1.08 0.28132 bathrooms:wf.zipcode98002 -9.23e-03 3.57e-02 -0.26 0.79578 bathrooms:wf.zipcode98003 -6.81e-02 3.50e-02 -1.95 0.05146 . bathrooms:wf.zipcode98004 -1.44e-01 2.92e-02 -4.92 8.6e-07 *** bathrooms:wf.zipcode98005 -1.74e-01 3.68e-02 -4.72 2.4e-06 *** bathrooms:wf.zipcode98006 -5.55e-02 2.87e-02 -1.93 0.05361 . bathrooms:wf.zipcode98007 -1.32e-01 4.49e-02 -2.93 0.00336 ** bathrooms:wf.zipcode98008 -9.65e-02 3.62e-02 -2.66 0.00772 ** bathrooms:wf.zipcode98010 -1.08e-01 5.41e-02 -1.99 0.04688 * bathrooms:wf.zipcode98011 -1.01e-01 4.87e-02 -2.08 0.03732 * bathrooms:wf.zipcode98014 -1.28e-01 3.62e-02 -3.53 0.00042 *** bathrooms:wf.zipcode98019 -8.38e-02 4.13e-02 -2.03 0.04258 * bathrooms:wf.zipcode98022 -4.53e-02 3.89e-02 -1.17 0.24392 bathrooms:wf.zipcode98023 -8.62e-02 2.96e-02 -2.91 0.00360 ** bathrooms:wf.zipcode98024 -8.05e-02 3.69e-02 -2.18 0.02916 * bathrooms:wf.zipcode98027 -8.02e-02 2.94e-02 -2.73 0.00640 ** bathrooms:wf.zipcode98028 -7.74e-02 4.07e-02 -1.90 0.05741 . 
bathrooms:wf.zipcode98029 -1.20e-01 3.62e-02 -3.32 0.00089 *** bathrooms:wf.zipcode98030 -6.59e-02 3.64e-02 -1.81 0.06996 . bathrooms:wf.zipcode98031 -1.07e-01 3.64e-02 -2.95 0.00321 ** bathrooms:wf.zipcode98032 -8.15e-02 4.88e-02 -1.67 0.09451 . bathrooms:wf.zipcode98033 -8.37e-02 2.91e-02 -2.88 0.00400 ** bathrooms:wf.zipcode98034 -7.19e-02 2.90e-02 -2.48 0.01326 * bathrooms:wf.zipcode98038 -6.19e-02 3.46e-02 -1.79 0.07362 . bathrooms:wf.zipcode98039 -1.34e-01 4.20e-02 -3.18 0.00145 ** bathrooms:wf.zipcode98040 -1.22e-01 3.21e-02 -3.80 0.00014 *** bathrooms:wf.zipcode98040-1 -1.06e-01 5.39e-02 -1.96 0.05039 . bathrooms:wf.zipcode98042 -6.62e-02 3.00e-02 -2.21 0.02736 * bathrooms:wf.zipcode98045 -1.40e-01 3.25e-02 -4.30 1.7e-05 *** bathrooms:wf.zipcode98052 -1.10e-01 3.09e-02 -3.56 0.00038 *** bathrooms:wf.zipcode98053 -1.16e-01 3.08e-02 -3.76 0.00017 *** bathrooms:wf.zipcode98055 -7.54e-02 3.07e-02 -2.45 0.01411 * bathrooms:wf.zipcode98056 -4.02e-02 2.96e-02 -1.36 0.17410 bathrooms:wf.zipcode98058 -6.20e-02 2.97e-02 -2.09 0.03669 * bathrooms:wf.zipcode98059 -3.74e-02 2.92e-02 -1.28 0.19964 bathrooms:wf.zipcode98065 -8.57e-02 3.08e-02 -2.78 0.00543 ** bathrooms:wf.zipcode98070 -1.09e-01 4.03e-02 -2.69 0.00705 ** bathrooms:wf.zipcode98070-1 -9.04e-02 7.29e-02 -1.24 0.21517 bathrooms:wf.zipcode98072 -9.63e-02 3.75e-02 -2.57 0.01028 * bathrooms:wf.zipcode98074 -1.19e-01 3.22e-02 -3.71 0.00021 *** bathrooms:wf.zipcode98075 -1.55e-01 3.21e-02 -4.83 1.4e-06 *** bathrooms:wf.zipcode98075-1 -8.63e-02 8.83e-02 -0.98 0.32859 bathrooms:wf.zipcode98077 -1.11e-01 3.66e-02 -3.03 0.00245 ** bathrooms:wf.zipcode98092 -9.06e-02 3.37e-02 -2.69 0.00721 ** bathrooms:wf.zipcode98102 -1.10e-01 4.58e-02 -2.40 0.01627 * bathrooms:wf.zipcode98103 -9.37e-02 2.75e-02 -3.41 0.00065 *** bathrooms:wf.zipcode98105 -4.58e-02 3.20e-02 -1.43 0.15293 bathrooms:wf.zipcode98105-1 -1.24e-01 1.20e-01 -1.03 0.30103 bathrooms:wf.zipcode98106 -9.88e-02 3.05e-02 -3.24 0.00119 ** bathrooms:wf.zipcode98107 
-9.27e-02 3.07e-02 -3.02 0.00253 ** bathrooms:wf.zipcode98108 -9.57e-02 3.62e-02 -2.65 0.00815 ** bathrooms:wf.zipcode98109 -1.90e-01 3.84e-02 -4.96 7.3e-07 *** bathrooms:wf.zipcode98112 -5.72e-02 3.08e-02 -1.86 0.06337 . bathrooms:wf.zipcode98115 -7.97e-02 2.78e-02 -2.86 0.00421 ** bathrooms:wf.zipcode98116 -1.18e-01 3.09e-02 -3.82 0.00014 *** bathrooms:wf.zipcode98117 -1.24e-01 2.82e-02 -4.40 1.1e-05 *** bathrooms:wf.zipcode98118 -7.12e-02 2.82e-02 -2.52 0.01161 * bathrooms:wf.zipcode98119 -9.69e-02 3.33e-02 -2.91 0.00365 ** bathrooms:wf.zipcode98122 -6.48e-02 3.14e-02 -2.07 0.03890 * bathrooms:wf.zipcode98125 -9.07e-02 2.90e-02 -3.13 0.00177 ** bathrooms:wf.zipcode98126 -6.07e-02 3.04e-02 -1.99 0.04609 * bathrooms:wf.zipcode98133 -1.13e-01 2.89e-02 -3.92 8.8e-05 *** bathrooms:wf.zipcode98136 -1.05e-01 3.25e-02 -3.24 0.00122 ** bathrooms:wf.zipcode98136-1 -3.66e-01 1.41e-01 -2.60 0.00925 ** bathrooms:wf.zipcode98144 -4.00e-02 3.03e-02 -1.32 0.18595 bathrooms:wf.zipcode98146 2.73e-02 3.28e-02 0.83 0.40519 bathrooms:wf.zipcode98148 -8.12e-02 5.73e-02 -1.42 0.15629 bathrooms:wf.zipcode98155 -8.05e-02 3.02e-02 -2.66 0.00778 ** bathrooms:wf.zipcode98166 1.40e-02 3.45e-02 0.41 0.68478 bathrooms:wf.zipcode98166-1 -2.32e-01 9.18e-02 -2.53 0.01133 * bathrooms:wf.zipcode98168 -1.64e-02 3.83e-02 -0.43 0.66848 bathrooms:wf.zipcode98177 -6.55e-02 3.11e-02 -2.11 0.03514 * bathrooms:wf.zipcode98178 -4.68e-02 3.42e-02 -1.37 0.17174 bathrooms:wf.zipcode98188 -6.24e-02 3.43e-02 -1.82 0.06856 . bathrooms:wf.zipcode98198 -6.55e-03 3.40e-02 -0.19 0.84720 bathrooms:wf.zipcode98198-1 8.95e-02 1.10e-01 0.81 0.41733 bathrooms:wf.zipcode98199 -9.87e-02 3.04e-02 -3.24 0.00118 ** bathrooms:wf.zipcodewf-high -5.81e-02 6.90e-02 -0.84 0.39978 bathrooms:wf.zipcodewf-low -6.86e-01 1.73e-01 -3.98 7.1e-05 *** bathrooms:wf.zipcodewf-mid -1.61e-01 4.82e-02 -3.34 0.00084 *** --- Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.181 on 9784 degrees of freedom
Multiple R-squared: 0.886, Adjusted R-squared: 0.884
F-statistic: 355 on 215 and 9784 DF, p-value: <2e-16

Model Accuracy¶
As the linear regression model predicts the log of the house price, the prediction results need to be transformed back to house prices using the exp function before assessing the model accuracy.
Convert the predicted values to the house price and calculate the RSE, $R^2$ and F-statistic values from the results.
In [43]:
# Transform the predicted values to house prices
houseData.predict <- exp(house.model$fitted.values)
# Calculate the accuracy statistics
houseModel.accuracy <- Model.Accuracy(houseData.predict,houseData$price,9784,215)
# Print the accuracy statistics
cat("\nModel accuracy after converting the predicted log(price) to price:\n")
cat("\nDegrees of freedom: 9784\nModel parameters: 215 plus intercept")
cat("\nResidual standard error:",houseModel.accuracy$rse)
cat("\nPercentage error:",houseModel.accuracy$rse*100/mean(houseData$price),"%")
cat("\nR-Squared:",houseModel.accuracy$rsquared)
# Use the F-statistic calculated on the price scale for the p-value, rather than the log-scale value of 355
cat("\nF-statistic:",houseModel.accuracy$f.stat,"; p-value:",pf(houseModel.accuracy$f.stat,215,9784,lower.tail=FALSE))

Model accuracy after converting the predicted log(price) to price:

Degrees of freedom: 9784
Model parameters: 215 plus intercept
Residual standard error: 121593
Percentage error: 22.46 %
R-Squared: 0.8934
F-statistic: 381.2 ; p-value: 0

Model Analysis¶
Residuals: These are centred around zero, the 1st and 3rd quartiles are at almost equal distances from zero, and the minimum and maximum differ only slightly in magnitude. This indicates an even distribution of the residuals.
Residual Standard Error: The residual standard error, or the estimated standard deviation of the residuals, is 121592.8, which is 22.5% of the mean house price.
$R^2$: The adjusted $R^2$ of 0.884 indicates the model explains 88.4% of the variation in the log of the house price. After converting the regression results to the house price, the $R^2$ value increases to 0.893, so the model explains 89.3% of the variation in the house price.
F-statistic: The F-statistic calculated using the house prices is 381, which indicates a strong relationship between the predictor and response variables. The p-value associated with this is very small, meaning the null hypothesis (the model is not useful) can be rejected. There are 215 terms (plus the intercept) in the model, leaving 9784 degrees of freedom from the sample size of 10,000.
Coefficients: All the numerical variable coefficients have small p-values, so they are not zero at the 5% significance level. Not all the coefficients for all levels of the categorical variables are significant, but all categorical variables apart from grade and decade have several levels with significant coefficients. Grade and decade are required in the model as there are interaction terms that include these variables.
Some observations from the values of the coefficients:
 • The yr_built, log(sqft_living) and log(sqft_lot) variables and the interaction bathrooms:log(sqft_living) all have positive coefficients, meaning that increases in these values result in a higher house price.
 • The negative coefficient for the bedroom.class5plus factor shows that having 5 or more bedrooms lowers the house price once the effects of the correlated variables have been taken into account.
 • The condition factor coefficients increase for the better conditioned houses, reflecting the higher price for houses in better condition seen in the EDA section.
 • The negative correlation between condition and yr_built is modelled by the decreasing coefficients for the condition:yr_built factors.
 • The correlation between bathrooms and zipcode discovered during the EDA is modelled by the different coefficients for the bathrooms:wf.zipcode interaction categories.
 • As none of the grade category coefficients are significant, the correlation between grade and house price observed in the EDA is explained by the interaction between grade and sqft_living - in other words the grade has very little effect on the house price. The size of the house has the effect; the grade is largely determined by the size of the house.
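
Because the response is log(price), a coefficient b on a term that is not part of any interaction corresponds to multiplying the predicted price by exp(b). As a worked example, the sketch below converts the bedroom.class5plus estimate from the model summary into a percentage change. The names b and pct.change are illustrative only.

```r
# Estimate for bedroom.class5plus, copied from the model summary
b <- -2.55e-02

# Moving from "4minus" to "5plus" multiplies the predicted price by exp(b),
# with all other terms held fixed
pct.change <- (exp(b) - 1) * 100
cat("Approximate price change for 5 or more bedrooms:", round(pct.change, 2), "%\n")
```

This works out to a reduction of roughly 2.5% in the predicted price. The same conversion does not apply directly to variables such as bathrooms or sqft_living, whose effect is spread over several interaction terms.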
In [44]:
par(mfrow=c(2,2))
plot(house.model)

Warning message:
“not plotting observations with leverage one: 6061, 8099”
Warning message:
“not plotting observations with leverage one: 6061, 8099”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”

The model plots show:
• Residuals vs Fitted: the residuals are evenly distributed around zero with no funnelling, so the model meets the assumption of homoscedasticity (the variance of the error terms is constant along the regression line).
• Normal Q-Q: the residuals deviate slightly from the dashed line, indicating the residuals have close to a normal distribution.
• Scale-Location: the variance of the residuals is reasonably constant.
• Residuals vs Leverage: there are some possibly influential outliers, however they are generally away from the Cook's distance line.
Display the records for the unplotted observations
In [45]:
# Check the records in the warnings
predict.price <- houseData.predict[c(6061, 8099)]
cbind(predict.price,houseData[c(6061, 8099),])

     predict.price  price bedrooms bathrooms sqft_living sqft_lot waterfront condition grade yr_built zipcode wf.zipcode decade bedroom.class
6061        262000 262000        1      0.75         520    12981          0         5     3     1920   98022      98022      2        4minus
8099        280000 280000        1      0.00         600    24501          0         2     3     1950   98045      98045      5        4minus

The model has predicted the unplotted observations exactly and so may be overfitting at these points. A leverage of one means the fitted value is forced to equal the observed value, which usually happens when a term in the model is estimated from very few records. Both records are grade 3, so they are probably being fitted exactly by the grade and sqft_living:grade terms.
Check for Influential Outliers¶
Use the code supplied in the tutorial to check for influential outlier points
In [46]:
outlierTest(house.model, cutoff=0.05, digits = 1)

     rstudent unadjusted p-value Bonferonni p
518     6.507          8.048e-11    8.047e-07
609    -5.289          1.258e-07    1.257e-03
4114   -5.087          3.713e-07    3.713e-03
2091    5.061          4.253e-07    4.253e-03
1642    5.060          4.274e-07    4.273e-03
6864   -5.034          4.880e-07    4.879e-03
9352   -5.013          5.444e-07    5.443e-03
7951   -4.960          7.165e-07    7.163e-03
3438    4.956          7.302e-07    7.300e-03
4313    4.912          9.165e-07    9.164e-03

The outlier test has reported several outliers, so generate an influence plot to see if these are influential outliers.
In [47]:
influencePlot(house.model, scale=5, id.method="noteworthy", main="Influence Plot", sub="Circle size is proportional to Cook's Distance" )

Warning message in plot.window(...):
“"id.method" is not a graphical parameter”
Warning message in plot.xy(xy, type, ...):
“"id.method" is not a graphical parameter”
Warning message in axis(side = side, at = at, labels = labels, ...):
“"id.method" is not a graphical parameter”
Warning message in axis(side = side, at = at, labels = labels, ...):
“"id.method" is not a graphical parameter”
Warning message in box(...):
“"id.method" is not a graphical parameter”
Warning message in title(...):
“"id.method" is not a graphical parameter”
Warning message in plot.xy(xy.coords(x, y), type = type, ...):
“"id.method" is not a graphical parameter”

     StudRes     Hat    CookD
518    6.507 0.12561 0.028041
609   -5.289 0.01682 0.002209
5493   2.048 0.80005 0.077655
6061     NaN 1.00000      NaN
8099     NaN 1.00000      NaN

The influence plot shows quite a few influential points and has reported three of note:
• 518 has a large studentized residual and is also an outlier. The large studentized residual means the model has made a poor prediction for this sample.
• 5493 and 6061 have large Hat values and so are significantly influencing the model. Record 8099 also has a Hat value of one.
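
A Hat value of one with a NaN studentized residual, as seen for records 6061 and 8099, is what happens when a model term is determined by a single record. The sketch below reproduces this on simulated data. It is an illustration only: the data frame toy and its columns are made up and are unrelated to the house data.

```r
# Toy data: factor level "c" has a single record
set.seed(1)
toy <- data.frame(y = rnorm(7),
                  g = factor(c("a", "a", "a", "b", "b", "b", "c")))
fit <- lm(y ~ g, data = toy)

# Hat values: 1/3 for each record in "a" and "b", and 1 for the lone "c" record
h <- hatvalues(fit)
print(round(h, 3))

# The lone record is fitted exactly, so its residual is zero
print(round(residuals(fit)[7], 10))
```

The exact fit for the lone record says nothing about how well the model would predict a new record at that factor level.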
Display the outlier and influencer records

Using the Model to Predict Prices¶
Function to prepare the data for price prediction¶
Name: Prepare.Data
Input parameters:
• data - a dataframe that contains the test data. It should have the same structure as the training dataset.
Return Value:
• A dataframe containing the prepared data
Description: This function reformats the input dataframe to the format expected by the model.
• Make a working copy of the data and remove the id and price columns, if they exist
• Change any grades of 1 or 2 to 3. These grades were not in the training data and the model will not run if the data contains these grades
• Factorise the categorical variables
• Create the generated variables wf.zipcode, decade and bedroom.class
In [48]:
Prepare.Data <- function(data) {
    # Remove the id and price columns
    newdata <- data
    newdata$id <- NULL
    newdata$price <- NULL

    # The model will fail if any grades of 1 or 2 are present, so change these to 3
    newdata$grade[newdata$grade < 3] <- 3

    # Set the factors
    newdata$waterfront <- as.factor(newdata$waterfront)
    newdata$condition <- as.factor(newdata$condition)
    newdata$grade <- as.factor(newdata$grade)
    newdata$zipcode <- as.factor(newdata$zipcode)

    # Generate the new variables
    newdata <- GenerateVariables(newdata)

    # Return the prepared dataframe
    return(newdata)
}

Predict the House Prices for the Development Dataset¶
In [49]:
houseDev <- read.csv("dev.csv")
houseDev2 <- Prepare.Data(houseDev)
houseDev.predict <- exp(predict(house.model,houseDev2,type="response"))
print(head(cbind(PredictedPrice=houseDev.predict, ActualPrice=houseDev$price),20))

   PredictedPrice ActualPrice
1         1158734     1146800
2          674695      950000
3          894635      850000
4          670569      599000
5          298266      255000
6          324204      280000
7          710348      715000
8          478907      550000
9          649825     1080000
10         497252      499000
11         269746      252350
12         287739      276900
13         742353      850000
14         327048      302495
15         352188      390000
16         797501      699000
17         654435      450000
18         581266      460000
19         280000      280000
20         231590      279000

The first twenty predicted prices are displayed along with the actual sale prices. Most of these are reasonably close, but the model made poor predictions for records 2, 9 and 17.

RMSE for Development Predictions¶
In [50]:
dev.rmse <- RMSE(houseDev.predict,houseDev$price)
cat("RMSE for development predictions is:",dev.rmse,"; which is",dev.rmse*100/mean(houseDev$price),"% of the mean house price.")

RMSE for development predictions is: 114025 ; which is 20.65 % of the mean house price.

The root mean squared error for the price prediction of the development data is $114025, which is 20.7% of the mean house price. This is slightly better than expected, as the percentage error of the model based on the RSE is 22.5%.

Check the uncertainty of the expected value of predictions¶
In [51]:
print(head(exp(predict(house.model,newdata=houseDev2,interval="confidence")),20))

       fit     lwr     upr
1  1158734 1061532 1264836
2   674695  649397  700979
3   894635  850748  940786
4   670569  642353  700023
5   298266  273907  324792
6   324204  313039  335768
7   710348  685306  736305
8   478907  463961  494334
9   649825  615232  686363
10  497252  480154  514959
11  269746  262285  277418
12  287739  274402  301724
13  742353  689114  799705
14  327048  314975  339583
15  352188  338916  365980
16  797501  773052  822724
17  654435  615885  695398
18  581266  533773  632986
19  280000  196355  399277
20  231590  220246  243519

The confidence intervals are generally about 10% of the fitted value wide. These intervals describe the uncertainty in the expected value of the prediction for houses with the given attributes, not the spread of individual sale prices, so they do not imply that individual predictions will be within 10% of the actual sale price.
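The difference between a confidence interval for the expected value and a prediction interval for an individual observation can be seen on simulated data. The sketch below is illustrative only: the data, the names (sim, sim.model, new.houses, conf.int, pred.int) and the coefficients are invented, and only the log-linear form mirrors the house model.

```r
set.seed(1)

# Simulated data: log(price) is linear in log(sqft) plus noise (made-up coefficients)
sim <- data.frame(sqft = runif(200, 500, 4000))
sim$price <- exp(8 + 0.6 * log(sim$sqft) + rnorm(200, sd = 0.2))

sim.model <- lm(log(price) ~ log(sqft), data = sim)
new.houses <- data.frame(sqft = c(1000, 2000, 3000))

# Both intervals are computed on the log scale and back-transformed with exp()
conf.int <- exp(predict(sim.model, newdata = new.houses, interval = "confidence"))
pred.int <- exp(predict(sim.model, newdata = new.houses, interval = "prediction"))

# Ratio of upper to lower limit for each kind of interval
print(cbind(fit        = conf.int[, "fit"],
            conf.ratio = conf.int[, "upr"] / conf.int[, "lwr"],
            pred.ratio = pred.int[, "upr"] / pred.int[, "lwr"]))
```

The prediction interval is always the wider of the two because it adds the residual variance to the uncertainty in the fitted mean. Only the prediction interval indicates how far an individual sale price may fall from its prediction.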
Check the uncertainty around the individual predictions
In [52]:
print(head(exp(predict(house.model,newdata=houseDev2,interval="prediction")),20))

       fit    lwr     upr
1  1158734 803970 1670042
2   674695 472172  964084
3   894635 625157 1280273
4   670569 469030  958707
5   298266 207066  429635
6   324204 226962  463110
7   710348 497243 1014783
8   478907 335367  683881
9   649825 453795  930535
10  497252 348107  710297
11  269746 188955  385080
12  287739 201145  411611
13  742353 516585 1066790
14  327048 228893  467294
15  352188 246467  503259
16  797501 558500 1138779
17  654435 456574  938041
18  581266 403530  837288
19  280000 169514  462499
20  231590 161834  331415

The prediction intervals are wide compared to the fitted values (in most rows the upper limit is about twice the lower limit), indicating a high level of uncertainty around each prediction.

Steps to Rebuild Model¶
The following steps can be run to prepare the training data, build the model and run the model to predict house prices. They include copies of the functions created and used for building and testing the model.

1. Prepare the dataframe¶
Below is a copy of the GenerateVariables function used to add the generated variables to the dataframe
In [53]:
GenerateVariables <- function(data) {
    # Make a copy of the dataframe
    newdata <- data

    # Generate the wf.zipcode variable
    ## Zipcodes treated individually
    wfzipcodes <- c(98040,98070,98075,98105,98136,98166,98198)
    ## Define the zipcode groups
    wfziplow <- c(98023,98146)
    wfzipmid <- c(98006,98027,98052,98074,98118,98125,98155,98178)
    wfziphigh <- c(98008,98033,98034,98039,98056,98144)
    ## Create the new variable
    newdata$wf.zipcode <- as.character(newdata$zipcode)
    ## Logical indexing is used so the result does not depend on the row names
    wf.single <- newdata$waterfront == 1 & newdata$zipcode %in% wfzipcodes
    newdata$wf.zipcode[wf.single] <- paste(newdata$zipcode[wf.single],"1",sep="-")
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfziplow] <- "wf-low"
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfzipmid] <- "wf-mid"
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfziphigh] <- "wf-high"
    newdata$wf.zipcode <- as.factor(newdata$wf.zipcode)

    # Generate the decade variable
    newdata$decade <- as.factor(ifelse(newdata$yr_built < 1900, 0,
                                ifelse(newdata$yr_built > 2019, 11,
                                       trunc(newdata$yr_built/10)-190)))

    # Generate the bedroom.class variable
    newdata$bedroom.class <- as.factor(ifelse(newdata$bedrooms < 5, "4minus", "5plus"))

    # Return the modified dataframe
    return(newdata)
}

The following code can be used to prepare the training dataframe for building the model. The required steps are:
1. Read the training data
2. Remove the id variable
3. Set the factors
4. Generate the new variables
In [54]:
houseData <- read.csv("training.csv")
houseData <- houseData[,-1]
houseData$waterfront <- as.factor(houseData$waterfront)
houseData$condition <- as.factor(houseData$condition)
houseData$grade <- as.factor(houseData$grade)
houseData$zipcode <- as.factor(houseData$zipcode)
houseData <- GenerateVariables(houseData)

2. Build the model¶
A copy of the code for the final model is shown below
In [55]:
house.model <- lm(log(price) ~ bathrooms + condition + grade + yr_built + log(sqft_living) +
                  log(sqft_lot) + wf.zipcode + decade + bedroom.class + sqft_living:grade +
                  bathrooms:log(sqft_living) + condition:yr_built + sqft_living:decade +
                  bathrooms:wf.zipcode, data=houseData)
summary(house.model)

Call:
lm(formula = log(price) ~ bathrooms + condition + grade + yr_built +
    log(sqft_living) + log(sqft_lot) + wf.zipcode + decade + bedroom.class +
    sqft_living:grade + bathrooms:log(sqft_living) + condition:yr_built +
    sqft_living:decade + bathrooms:wf.zipcode, data = houseData)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9481 -0.0946  0.0023  0.0986  1.0992

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.84e+00 5.31e+00 -1.10 0.27185
bathrooms -1.79e-01 8.38e-02 -2.14 0.03266 *
condition2 1.36e+01 5.08e+00 2.69 0.00725 **
condition3 1.57e+01 4.82e+00 3.25 0.00116 **
condition4 1.66e+01 4.82e+00 3.43 0.00060 ***
condition5 1.63e+01 4.85e+00 3.37 0.00075 ***
grade4 -1.03e+00 1.87e+00 -0.55 0.58004
grade5 -1.26e+00 1.86e+00 -0.68 0.49743
grade6 -1.22e+00 1.86e+00 -0.66 0.51077
grade7 -1.27e+00 1.86e+00 -0.68 0.49567
grade8 -1.27e+00 1.86e+00 -0.68 0.49479
grade9 -1.12e+00 1.86e+00 -0.60 0.54606
grade10 -1.06e+00 1.86e+00 -0.57 0.56974
grade11 -9.20e-01 1.86e+00 -0.49 0.62181
grade12 -6.49e-01 1.87e+00 -0.35 0.72821
grade13 -5.76e-01 1.91e+00 -0.30 0.76298
yr_built 7.72e-03 2.58e-03 3.00 0.00274 **
log(sqft_living) 4.69e-01 3.54e-02 13.26 < 2e-16 ***
log(sqft_lot) 8.04e-02 3.29e-03 24.44 < 2e-16 ***
wf.zipcode98002 3.00e-02 7.24e-02 0.41 0.67860
wf.zipcode98003 1.86e-01 7.30e-02 2.55 0.01087 *
wf.zipcode98004 1.47e+00 6.66e-02 22.07 < 2e-16 ***
wf.zipcode98005 1.14e+00 8.63e-02 13.20 < 2e-16 ***
wf.zipcode98006 8.16e-01 6.52e-02 12.52 < 2e-16 ***
wf.zipcode98007 9.95e-01 1.02e-01 9.76 < 2e-16 ***
wf.zipcode98008 9.11e-01 7.72e-02 11.80 < 2e-16 ***
wf.zipcode98010 4.09e-01 1.14e-01 3.58 0.00035 ***
wf.zipcode98011 6.98e-01 1.09e-01 6.40 1.6e-10 ***
wf.zipcode98014 5.75e-01 7.91e-02 7.27 3.9e-13 ***
wf.zipcode98019 4.97e-01 9.17e-02 5.42 6.2e-08 ***
wf.zipcode98022 1.67e-01 7.76e-02 2.15 0.03129 *
wf.zipcode98023 1.93e-01 6.28e-02 3.08 0.00209 **
wf.zipcode98024 5.80e-01 8.23e-02 7.05 2.0e-12 ***
wf.zipcode98027 7.03e-01 6.75e-02 10.42 < 2e-16 ***
wf.zipcode98028 5.96e-01 8.85e-02 6.74 1.7e-11 ***
wf.zipcode98029 9.00e-01 8.87e-02 10.14 < 2e-16 ***
wf.zipcode98030 2.21e-01 8.03e-02 2.76 0.00585 **
wf.zipcode98031 3.27e-01 7.80e-02 4.20 2.7e-05 ***
wf.zipcode98032 1.78e-01 9.15e-02 1.94 0.05228 .
wf.zipcode98033 1.01e+00 6.41e-02 15.68 < 2e-16 ***
wf.zipcode98034 7.50e-01 6.21e-02 12.07 < 2e-16 ***
wf.zipcode98038 3.14e-01 7.96e-02 3.95 8.0e-05 ***
wf.zipcode98039 1.63e+00 1.24e-01 13.15 < 2e-16 ***
wf.zipcode98040 1.23e+00 7.74e-02 15.93 < 2e-16 ***
wf.zipcode98040-1 1.90e+00 1.85e-01 10.25 < 2e-16 ***
wf.zipcode98042 2.32e-01 6.51e-02 3.56 0.00037 ***
wf.zipcode98045 6.27e-01 7.02e-02 8.93 < 2e-16 ***
wf.zipcode98052 9.18e-01 6.92e-02 13.26 < 2e-16 ***
wf.zipcode98053 8.54e-01 7.13e-02 11.98 < 2e-16 ***
wf.zipcode98055 3.37e-01 6.48e-02 5.20 2.0e-07 ***
wf.zipcode98056 4.31e-01 6.30e-02 6.84 8.7e-12 ***
wf.zipcode98058 3.12e-01 6.33e-02 4.94 8.1e-07 ***
wf.zipcode98059 4.27e-01 6.49e-02 6.58 4.9e-11 ***
wf.zipcode98065 6.25e-01 7.19e-02 8.70 < 2e-16 ***
wf.zipcode98070 5.69e-01 8.55e-02 6.66 2.9e-11 ***
wf.zipcode98070-1 8.37e-01 1.48e-01 5.65 1.7e-08 ***
wf.zipcode98072 7.08e-01 8.28e-02 8.54 < 2e-16 ***
wf.zipcode98074 8.37e-01 7.44e-02 11.25 < 2e-16 ***
wf.zipcode98075 9.47e-01 7.94e-02 11.94 < 2e-16 ***
wf.zipcode98075-1 1.72e+00 2.34e-01 7.35 2.1e-13 ***
wf.zipcode98077 6.94e-01 8.52e-02 8.14 4.6e-16 ***
wf.zipcode98092 2.39e-01 7.43e-02 3.21 0.00132 **
wf.zipcode98102 1.22e+00 1.08e-01 11.34 < 2e-16 ***
wf.zipcode98103 1.05e+00 5.82e-02 18.09 < 2e-16 ***
wf.zipcode98105 1.08e+00 6.99e-02 15.41 < 2e-16 ***
wf.zipcode98105-1 1.73e+00 4.29e-01 4.05 5.2e-05 ***
wf.zipcode98106 5.57e-01 6.30e-02 8.84 < 2e-16 ***
wf.zipcode98107 1.09e+00 6.59e-02 16.48 < 2e-16 ***
wf.zipcode98108 5.66e-01 7.29e-02 7.75 9.9e-15 ***
wf.zipcode98109 1.40e+00 8.77e-02 15.93 < 2e-16 ***
wf.zipcode98112 1.17e+00 7.20e-02 16.24 < 2e-16 ***
wf.zipcode98115 1.01e+00 5.83e-02 17.38 < 2e-16 ***
wf.zipcode98116 1.05e+00 6.59e-02 15.96 < 2e-16 ***
wf.zipcode98117 1.07e+00 5.86e-02 18.32 < 2e-16 ***
wf.zipcode98118 6.43e-01 5.80e-02 11.10 < 2e-16 ***
wf.zipcode98119 1.26e+00 7.73e-02 16.29 < 2e-16 ***
wf.zipcode98122 1.01e+00 6.79e-02 14.90 < 2e-16 ***
wf.zipcode98125 7.91e-01 6.00e-02 13.19 < 2e-16 ***
wf.zipcode98126 7.39e-01 6.07e-02 12.18 < 2e-16 ***
wf.zipcode98133 6.93e-01 5.91e-02 11.73 < 2e-16 ***
wf.zipcode98136 9.55e-01 6.79e-02 14.07 < 2e-16 ***
wf.zipcode98136-1 2.14e+00 3.15e-01 6.81 1.0e-11 ***
wf.zipcode98144 8.02e-01 6.63e-02 12.10 < 2e-16 ***
wf.zipcode98146 3.03e-01 6.37e-02 4.76 2.0e-06 ***
wf.zipcode98148 3.49e-01 1.08e-01 3.24 0.00120 **
wf.zipcode98155 6.29e-01 6.16e-02 10.22 < 2e-16 ***
wf.zipcode98166 3.71e-01 7.01e-02 5.30 1.2e-07 ***
wf.zipcode98166-1 1.36e+00 2.60e-01 5.23 1.7e-07 ***
wf.zipcode98168 1.24e-01 6.71e-02 1.85 0.06447 .
wf.zipcode98177 8.09e-01 6.81e-02 11.88 < 2e-16 ***
wf.zipcode98178 2.73e-01 6.75e-02 4.04 5.3e-05 ***
wf.zipcode98188 2.58e-01 7.10e-02 3.64 0.00027 ***
wf.zipcode98198 1.54e-01 6.83e-02 2.26 0.02392 *
wf.zipcode98198-1 4.55e-01 2.45e-01 1.85 0.06411 .
wf.zipcode98199 1.12e+00 6.74e-02 16.58 < 2e-16 ***
wf.zipcodewf-high 1.64e+00 2.44e-01 6.70 2.2e-11 ***
wf.zipcodewf-low 2.05e+00 3.21e-01 6.40 1.6e-10 ***
wf.zipcodewf-mid 1.73e+00 1.40e-01 12.36 < 2e-16 ***
decade1 -3.87e-02 3.45e-02 -1.12 0.26102
decade2 2.21e-02 3.50e-02 0.63 0.52763
decade3 9.27e-02 4.21e-02 2.20 0.02782 *
decade4 -2.61e-02 4.11e-02 -0.63 0.52550
decade5 -5.26e-02 4.46e-02 -1.18 0.23757
decade6 -6.43e-02 5.07e-02 -1.27 0.20537
decade7 9.47e-03 5.61e-02 0.17 0.86601
decade8 7.36e-02 6.08e-02 1.21 0.22594
decade9 8.08e-02 6.73e-02 1.20 0.22979
decade10 1.95e-02 7.25e-02 0.27 0.78760
decade11 7.78e-03 7.88e-02 0.10 0.92135
bedroom.class5plus -2.55e-02 7.24e-03 -3.52 0.00043 ***
grade3:sqft_living -1.77e-03 3.32e-03 -0.53 0.59497
grade4:sqft_living -5.98e-04 1.89e-04 -3.16 0.00156 **
grade5:sqft_living -2.54e-04 5.98e-05 -4.25 2.2e-05 ***
grade6:sqft_living -1.88e-04 3.49e-05 -5.39 7.3e-08 ***
grade7:sqft_living -6.90e-05 2.66e-05 -2.59 0.00953 **
grade8:sqft_living -9.10e-06 2.43e-05 -0.37 0.70817
grade9:sqft_living -1.70e-05 2.23e-05 -0.76 0.44615
grade10:sqft_living -3.74e-06 2.16e-05 -0.17 0.86260
grade11:sqft_living -5.39e-06 2.34e-05 -0.23 0.81785
grade12:sqft_living -1.75e-05 2.69e-05 -0.65 0.51429
grade13:sqft_living -6.94e-06 5.91e-05 -0.12 0.90651
bathrooms:log(sqft_living) 3.78e-02 1.06e-02 3.56 0.00037 ***
condition2:yr_built -6.99e-03 2.62e-03 -2.66 0.00772 **
condition3:yr_built -7.94e-03 2.49e-03 -3.19 0.00145 **
condition4:yr_built -8.38e-03 2.50e-03 -3.36 0.00078 ***
condition5:yr_built -8.23e-03 2.51e-03 -3.29 0.00102 **
decade1:sqft_living 2.21e-05 1.68e-05 1.32 0.18664
decade2:sqft_living -2.39e-06 1.61e-05 -0.15 0.88171
decade3:sqft_living -3.22e-05 1.80e-05 -1.79 0.07360 .
decade4:sqft_living 4.81e-06 1.63e-05 0.30 0.76762
decade5:sqft_living 3.00e-06 1.49e-05 0.20 0.84017
decade6:sqft_living -6.71e-06 1.51e-05 -0.44 0.65637
decade7:sqft_living -4.58e-05 1.52e-05 -3.02 0.00257 **
decade8:sqft_living -5.89e-05 1.48e-05 -3.98 6.8e-05 ***
decade9:sqft_living -5.51e-05 1.47e-05 -3.76 0.00017 ***
decade10:sqft_living -1.36e-05 1.40e-05 -0.97 0.33081
decade11:sqft_living 1.71e-05 1.58e-05 1.08 0.28132
bathrooms:wf.zipcode98002 -9.23e-03 3.57e-02 -0.26 0.79578
bathrooms:wf.zipcode98003 -6.81e-02 3.50e-02 -1.95 0.05146 .
bathrooms:wf.zipcode98004 -1.44e-01 2.92e-02 -4.92 8.6e-07 ***
bathrooms:wf.zipcode98005 -1.74e-01 3.68e-02 -4.72 2.4e-06 ***
bathrooms:wf.zipcode98006 -5.55e-02 2.87e-02 -1.93 0.05361 .
bathrooms:wf.zipcode98007 -1.32e-01 4.49e-02 -2.93 0.00336 **
bathrooms:wf.zipcode98008 -9.65e-02 3.62e-02 -2.66 0.00772 **
bathrooms:wf.zipcode98010 -1.08e-01 5.41e-02 -1.99 0.04688 *
bathrooms:wf.zipcode98011 -1.01e-01 4.87e-02 -2.08 0.03732 *
bathrooms:wf.zipcode98014 -1.28e-01 3.62e-02 -3.53 0.00042 ***
bathrooms:wf.zipcode98019 -8.38e-02 4.13e-02 -2.03 0.04258 *
bathrooms:wf.zipcode98022 -4.53e-02 3.89e-02 -1.17 0.24392
bathrooms:wf.zipcode98023 -8.62e-02 2.96e-02 -2.91 0.00360 **
bathrooms:wf.zipcode98024 -8.05e-02 3.69e-02 -2.18 0.02916 *
bathrooms:wf.zipcode98027 -8.02e-02 2.94e-02 -2.73 0.00640 **
bathrooms:wf.zipcode98028 -7.74e-02 4.07e-02 -1.90 0.05741 .
bathrooms:wf.zipcode98029 -1.20e-01 3.62e-02 -3.32 0.00089 ***
bathrooms:wf.zipcode98030 -6.59e-02 3.64e-02 -1.81 0.06996 .
bathrooms:wf.zipcode98031 -1.07e-01 3.64e-02 -2.95 0.00321 **
bathrooms:wf.zipcode98032 -8.15e-02 4.88e-02 -1.67 0.09451 .
bathrooms:wf.zipcode98033 -8.37e-02 2.91e-02 -2.88 0.00400 **
bathrooms:wf.zipcode98034 -7.19e-02 2.90e-02 -2.48 0.01326 *
bathrooms:wf.zipcode98038 -6.19e-02 3.46e-02 -1.79 0.07362 .
bathrooms:wf.zipcode98039 -1.34e-01 4.20e-02 -3.18 0.00145 **
bathrooms:wf.zipcode98040 -1.22e-01 3.21e-02 -3.80 0.00014 ***
bathrooms:wf.zipcode98040-1 -1.06e-01 5.39e-02 -1.96 0.05039 .
bathrooms:wf.zipcode98042 -6.62e-02 3.00e-02 -2.21 0.02736 *
bathrooms:wf.zipcode98045 -1.40e-01 3.25e-02 -4.30 1.7e-05 ***
bathrooms:wf.zipcode98052 -1.10e-01 3.09e-02 -3.56 0.00038 ***
bathrooms:wf.zipcode98053 -1.16e-01 3.08e-02 -3.76 0.00017 ***
bathrooms:wf.zipcode98055 -7.54e-02 3.07e-02 -2.45 0.01411 *
bathrooms:wf.zipcode98056 -4.02e-02 2.96e-02 -1.36 0.17410
bathrooms:wf.zipcode98058 -6.20e-02 2.97e-02 -2.09 0.03669 *
bathrooms:wf.zipcode98059 -3.74e-02 2.92e-02 -1.28 0.19964
bathrooms:wf.zipcode98065 -8.57e-02 3.08e-02 -2.78 0.00543 **
bathrooms:wf.zipcode98070 -1.09e-01 4.03e-02 -2.69 0.00705 **
bathrooms:wf.zipcode98070-1 -9.04e-02 7.29e-02 -1.24 0.21517
bathrooms:wf.zipcode98072 -9.63e-02 3.75e-02 -2.57 0.01028 *
bathrooms:wf.zipcode98074 -1.19e-01 3.22e-02 -3.71 0.00021 ***
bathrooms:wf.zipcode98075 -1.55e-01 3.21e-02 -4.83 1.4e-06 ***
bathrooms:wf.zipcode98075-1 -8.63e-02 8.83e-02 -0.98 0.32859
bathrooms:wf.zipcode98077 -1.11e-01 3.66e-02 -3.03 0.00245 **
bathrooms:wf.zipcode98092 -9.06e-02 3.37e-02 -2.69 0.00721 **
bathrooms:wf.zipcode98102 -1.10e-01 4.58e-02 -2.40 0.01627 *
bathrooms:wf.zipcode98103 -9.37e-02 2.75e-02 -3.41 0.00065 ***
bathrooms:wf.zipcode98105 -4.58e-02 3.20e-02 -1.43 0.15293
bathrooms:wf.zipcode98105-1 -1.24e-01 1.20e-01 -1.03 0.30103
bathrooms:wf.zipcode98106 -9.88e-02 3.05e-02 -3.24 0.00119 **
bathrooms:wf.zipcode98107 -9.27e-02 3.07e-02 -3.02 0.00253 **
bathrooms:wf.zipcode98108 -9.57e-02 3.62e-02 -2.65 0.00815 **
bathrooms:wf.zipcode98109 -1.90e-01 3.84e-02 -4.96 7.3e-07 ***
bathrooms:wf.zipcode98112 -5.72e-02 3.08e-02 -1.86 0.06337 .
bathrooms:wf.zipcode98115 -7.97e-02 2.78e-02 -2.86 0.00421 **
bathrooms:wf.zipcode98116 -1.18e-01 3.09e-02 -3.82 0.00014 ***
bathrooms:wf.zipcode98117 -1.24e-01 2.82e-02 -4.40 1.1e-05 ***
bathrooms:wf.zipcode98118 -7.12e-02 2.82e-02 -2.52 0.01161 *
bathrooms:wf.zipcode98119 -9.69e-02 3.33e-02 -2.91 0.00365 **
bathrooms:wf.zipcode98122 -6.48e-02 3.14e-02 -2.07 0.03890 *
bathrooms:wf.zipcode98125 -9.07e-02 2.90e-02 -3.13 0.00177 **
bathrooms:wf.zipcode98126 -6.07e-02 3.04e-02 -1.99 0.04609 *
bathrooms:wf.zipcode98133 -1.13e-01 2.89e-02 -3.92 8.8e-05 ***
bathrooms:wf.zipcode98136 -1.05e-01 3.25e-02 -3.24 0.00122 **
bathrooms:wf.zipcode98136-1 -3.66e-01 1.41e-01 -2.60 0.00925 **
bathrooms:wf.zipcode98144 -4.00e-02 3.03e-02 -1.32 0.18595
bathrooms:wf.zipcode98146 2.73e-02 3.28e-02 0.83 0.40519
bathrooms:wf.zipcode98148 -8.12e-02 5.73e-02 -1.42 0.15629
bathrooms:wf.zipcode98155 -8.05e-02 3.02e-02 -2.66 0.00778 **
bathrooms:wf.zipcode98166 1.40e-02 3.45e-02 0.41 0.68478
bathrooms:wf.zipcode98166-1 -2.32e-01 9.18e-02 -2.53 0.01133 *
bathrooms:wf.zipcode98168 -1.64e-02 3.83e-02 -0.43 0.66848
bathrooms:wf.zipcode98177 -6.55e-02 3.11e-02 -2.11 0.03514 *
bathrooms:wf.zipcode98178 -4.68e-02 3.42e-02 -1.37 0.17174
bathrooms:wf.zipcode98188 -6.24e-02 3.43e-02 -1.82 0.06856 .
bathrooms:wf.zipcode98198 -6.55e-03 3.40e-02 -0.19 0.84720
bathrooms:wf.zipcode98198-1 8.95e-02 1.10e-01 0.81 0.41733
bathrooms:wf.zipcode98199 -9.87e-02 3.04e-02 -3.24 0.00118 **
bathrooms:wf.zipcodewf-high -5.81e-02 6.90e-02 -0.84 0.39978
bathrooms:wf.zipcodewf-low -6.86e-01 1.73e-01 -3.98 7.1e-05 ***
bathrooms:wf.zipcodewf-mid -1.61e-01 4.82e-02 -3.34 0.00084 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.181 on 9784 degrees of freedom
Multiple R-squared: 0.886, Adjusted R-squared: 0.884
F-statistic: 355 on 215 and 9784 DF, p-value: <2e-16

3. Run the model¶
The following code is based on the steps used to predict prices for the development data.
A copy of the function to prepare the dataframe
In [56]:
Prepare.Data <- function(data) {
    # Remove the id and price columns
    newdata <- data
    newdata$id <- NULL
    newdata$price <- NULL

    # The model will fail if any grades of 1 or 2 are present, so change these to 3
    newdata$grade[newdata$grade < 3] <- 3

    # Set the factors
    newdata$waterfront <- as.factor(newdata$waterfront)
    newdata$condition <- as.factor(newdata$condition)
    newdata$grade <- as.factor(newdata$grade)
    newdata$zipcode <- as.factor(newdata$zipcode)

    # Generate the new variables
    newdata <- GenerateVariables(newdata)

    # Return the prepared dataframe
    return(newdata)
}

A copy of the function to calculate the RMSE
In [57]:
RMSE <- function(predicted, target) {
    # Accumulate the squared errors, then take the root of their mean
    se <- 0
    for (i in seq_along(predicted)) {
        se <- se + (predicted[i]-target[i])^2
    }
    return (sqrt(se/length(predicted)))
}

The following code will prepare the test data and generate the price predictions. Replace the file name with the name of the file containing the test set.
Note: The exp function is applied to the predictions because the model estimates the log of the house price.
In [58]:
houseTest <- read.csv("dev.csv")
houseTest2 <- Prepare.Data(houseTest)
houseTest.predict <- exp(predict(house.model,houseTest2,type="response"))
cat("RMSE for the test predictions is:", RMSE(houseTest.predict,houseTest$price))

RMSE for the test predictions is: 114025
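The loop in the RMSE function can also be written as a single vectorised expression. The version below is a sketch of an equivalent calculation. The name RMSE.vectorised and the sample values are illustrative and are not part of the original notebook.

```r
# Vectorised root mean squared error: square root of the mean squared difference
RMSE.vectorised <- function(predicted, target) {
  # Guard against silently recycling vectors of different lengths
  stopifnot(length(predicted) == length(target))
  sqrt(mean((predicted - target)^2))
}

# Small made-up example: the errors are 0, 0 and -2
example.rmse <- RMSE.vectorised(c(1, 2, 3), c(1, 2, 5))
print(example.rmse)
```

For vectors of equal length this gives the same result as the loop, and the length check makes a mismatch between predictions and targets fail loudly instead of producing a misleading value.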