Predictive Modelling for House Prices¶
Version: 1.0
Language: R 3.3.0 and Jupyter notebook
Libraries used:
• car
• ggplot2
• grid
• gridExtra
• RColorBrewer
• reshape2
• scales
• stats
Introduction¶
This notebook contains the results of the data analysis performed on a set of house sales data from a residential real estate market. The aim of the data analysis is to build a linear regression model from the real estate transaction data that can be used to predict the sales price of a house.
The first section of the notebook shows the exploratory data analysis (EDA) performed to explore and understand the data. It looks at each attribute (variable) in the data to understand the nature and distribution of the attribute values. It also examines the correlation between the variables through visual analysis. A summary at the end highlights the key findings of the EDA.
The second section shows the development of the linear regression model. It details the process used to build the model and shows the model at key points in the development process. The final model is then presented along with an analysis and interpretation of the model. This section concludes with the results of using the model to predict house prices for the data in the development dataset.
The final section provides the details of the model to enable it to be rebuilt. In addition to the model itself, it includes the functions used to transform the data and run the model.
Two datasets were provided for the assignment – training.csv and dev.csv. The exploratory data analysis and the model building were done using the training.csv dataset; the dev.csv dataset was only used to test the generated model.
Load the libraries used in the notebook
In [1]:
library(ggplot2)
library(reshape2)
library(car)
library(stats)
library(scales)
library(grid)
library(gridExtra)
library(RColorBrewer)
Loading required package: carData
Exploratory Data Analysis¶
Overview of the Training Dataset¶
In [2]:
# Load the training dataset
houseData <- read.csv("training.csv")
In [3]:
# Display the dimensions
cat("The housing dataset has", dim(houseData)[1], "records, each with", dim(houseData)[2],
"attributes. The structure is:\n\n")
# Display the structure
str(houseData)
cat("\nThe first few and last few records in the dataset are:")
# Inspect the first few records
head(houseData)
# And the last few
tail(houseData)
cat("\nBasic statistics for each attribute are:")
# Statistical summary
summary(houseData)
cat("The numbers of unique values for each attribute are:")
apply(houseData, 2, function(x) length(unique(x)))
The housing dataset has 10000 records, each with 11 attributes. The structure is:
'data.frame': 10000 obs. of 11 variables:
$ id : num 5.54e+09 2.03e+09 2.03e+09 9.48e+09 2.86e+09 ...
$ price : int 211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
$ bedrooms : int 4 3 3 4 5 2 3 2 3 4 ...
$ bathrooms : num 1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
$ sqft_living: int 2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
$ sqft_lot : int 9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 4 3 3 3 3 3 3 3 3 ...
$ grade : int 7 7 11 8 9 7 6 7 9 10 ...
$ yr_built : int 1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
$ zipcode : int 98168 98038 98102 98103 98117 98115 98146 98136 98038 98004 ...
The first few and last few records in the dataset are:
id          price    bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
5537200043  211000   4         1.00       2100         9200      0           3          7      1959      98168
2025700080  265000   3         2.50       1530         6000      0           4          7      1991      98038
2025049111  1440000  3         3.50       3870         3819      0           3          11     2002      98102
9482700075  800000   4         3.50       2370         3302      0           3          8      1926      98103
2856102105  1059500  5         3.25       3230         3825      0           3          9      2014      98117
3364900375  750000   2         1.00       1620         6120      0           3          7      1951      98115

row    id          price    bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
9995   8945100320  136500   3         1.50       1420         8580      0           3          6      1962      98023
9996   3832500230  245000   4         2.25       2140         8800      0           4          7      1963      98032
9997   7351200050  1335000  4         1.75       2300         13342     1           3          7      1934      98125
9998   2301400325  760000   3         2.00       1810         4500      0           4          7      1906      98117
9999   1201500010  833000   4         2.50       2190         12690     0           3          8      1973      98033
10000  3709600190  370000   4         2.50       2130         4750      0           3          8      2009      98058
Basic statistics for each attribute are:
id price bedrooms bathrooms
Min. :1.000e+06 Min. : 78000 Min. : 0.000 Min. :0.000
1st Qu.:2.126e+09 1st Qu.: 320000 1st Qu.: 3.000 1st Qu.:1.750
Median :3.905e+09 Median : 450000 Median : 3.000 Median :2.250
Mean :4.591e+09 Mean : 541434 Mean : 3.373 Mean :2.113
3rd Qu.:7.304e+09 3rd Qu.: 649950 3rd Qu.: 4.000 3rd Qu.:2.500
Max. :9.842e+09 Max. :6885000 Max. :10.000 Max. :8.000
sqft_living sqft_lot waterfront condition
Min. : 370 Min. : 520 Min. :0.0000 Min. :1.000
1st Qu.: 1430 1st Qu.: 5058 1st Qu.:0.0000 1st Qu.:3.000
Median : 1920 Median : 7620 Median :0.0000 Median :3.000
Mean : 2080 Mean : 14947 Mean :0.0077 Mean :3.407
3rd Qu.: 2545 3rd Qu.: 10642 3rd Qu.:0.0000 3rd Qu.:4.000
Max. :13540 Max. :982998 Max. :1.0000 Max. :5.000
grade yr_built zipcode
Min. : 3.000 Min. :1900 Min. :98001
1st Qu.: 7.000 1st Qu.:1951 1st Qu.:98033
Median : 7.000 Median :1975 Median :98065
Mean : 7.657 Mean :1971 Mean :98078
3rd Qu.: 8.000 3rd Qu.:1997 3rd Qu.:98117
Max. :13.000 Max. :2015 Max. :98199
The numbers of unique values for each attribute are:
id           9964
price        2526
bedrooms     11
bathrooms    26
sqft_living  744
sqft_lot     5712
waterfront   2
condition    5
grade        11
yr_built     116
zipcode      70
Summary of Attributes¶
The following table identifies which attributes are numerical and whether they are continuous or discrete, and which are categorical and whether they are nominal or ordinal. It includes some initial observations about the ranges and common values of the attributes.
Attribute    Type         Sub-type    Comments
id           Categorical  Nominal     Contains duplicates - are these duplicate records or multiple sales of the same house? Can be dropped for the analysis.
price        Numerical    Continuous  Target variable - values range from 78000 to 6885000. Probably has outliers - especially high ones.
bedrooms     Numerical    Discrete    Values range from 0 to 10 - the majority are 3 and 4.
bathrooms    Numerical    Discrete    Has 26 values ranging from 0 to 8, including fractions (.25, .5 and .75) - most houses have 2.5.
sqft_living  Numerical    Continuous  Ranges from 370 to 13540 - could have outliers.
sqft_lot     Numerical    Continuous  Ranges from 520 to 982998 - could have extreme outliers.
waterfront   Categorical  Nominal     Only has two values - the majority are 0; only 77 are 1.
condition    Categorical  Ordinal     Has 5 values - range is 1 - 5.
grade        Categorical  Ordinal     Has 13 potential values - range 1 - 13, but 1 and 2 are not used in the training data. The majority are 7 and 8.
yr_built     Numerical    Discrete    Values range from 1900 to 2015.
zipcode      Categorical  Nominal     Has 70 values between 98001 and 98199.
Check the Duplicate IDs¶
Display the records with duplicate ids to determine whether these are duplicate records
In [4]:
duplicates <- aggregate(houseData$id, list(houseData$id), NROW)
duplicates <- houseData[houseData$id %in% duplicates[duplicates$x > 1,"Group.1"],]
head(duplicates[order(duplicates$id),],20)
row    id          price    bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
8449   7200179     175000   2         1.00       840          12750     0           3          6      1925      98055
9679   7200179     150000   2         1.00       840          12750     0           3          6      1925      98055
9012   643300040   719521   4         1.75       1920         9500      0           4          7      1966      98006
9674   643300040   481000   4         1.75       1920         9500      0           4          7      1966      98006
614    1139600270  300000   3         2.75       2090         9620      0           3          8      1987      98023
5810   1139600270  310000   3         2.75       2090         9620      0           3          8      1987      98023
2683   1446403850  212000   2         1.00       790          7153      0           4          6      1944      98168
4643   1446403850  118125   2         1.00       790          7153      0           4          6      1944      98168
2032   1721801010  302100   3         1.00       1790         6120      0           3          6      1937      98146
3303   1721801010  225000   3         1.00       1790         6120      0           3          6      1937      98146
4681   1901600090  359000   5         1.75       1940         6654      0           4          7      1953      98166
8112   1901600090  390000   5         1.75       1940         6654      0           4          7      1953      98166
5124   1995200200  313950   3         1.00       1510         6083      0           4          6      1940      98115
7588   1995200200  415000   3         1.00       1510         6083      0           4          6      1940      98115
518    2023049218  445000   2         1.00       930          7740      0           1          5      1932      98148
7854   2023049218  105500   2         1.00       930          7740      0           1          5      1932      98148
2058   2143700830  370000   4         2.50       2100         19680     0           3          6      1914      98055
5516   2143700830  207000   4         2.50       2100         19680     0           3          6      1914      98055
4966   2560801222  309950   3         2.25       1990         6350      0           3          7      1967      98198
7762   2560801222  180000   3         2.25       1990         6350      0           3          7      1967      98198
The prices are different – so they may represent the same house being sold more than once.
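The duplicate check above can also be sketched more compactly with base R's `duplicated()`. The example below is only an illustration on a small made-up data frame (the `sales` object and all of its values are hypothetical, not taken from training.csv). It flags repeated ids and then counts the distinct prices for each repeated id, which is one way to tell a probable resale from a record that was entered twice.

```r
# Hypothetical toy data: ids 101 and 103 each appear twice
sales <- data.frame(
  id    = c(101, 102, 101, 103, 103),
  price = c(175000, 300000, 150000, 250000, 250000)
)

# Flag every row whose id occurs more than once (first and later occurrences)
is.dup <- duplicated(sales$id) | duplicated(sales$id, fromLast = TRUE)
dups <- sales[is.dup, ]
dups <- dups[order(dups$id), ]

# For each repeated id, count the distinct prices:
# more than one suggests a resale rather than a duplicated record
n.prices <- sapply(split(dups$price, dups$id), function(p) length(unique(p)))
print(n.prices)
```

In this toy data, id 101 has two different prices while id 103 has the same price twice, so only the latter would look like a true duplicate record.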
The ID is unlikely to be useful for further analysis, so remove it.
In [5]:
# Remove the ID
houseData <- houseData[,-1]
Investigate Distribution of Each Variable¶
In [6]:
attach(houseData)
View the variable distributions using boxplots¶
In [7]:
# Generate box plots of all variables except the two categorical/nominal ones
m1 <- melt(as.data.frame(houseData[,c(-6,-10)]))
ggplot(m1,aes(x = variable,y = value)) +
facet_wrap(~variable, scales="free") +
geom_boxplot() +
scale_y_continuous(labels=function (n) {format(n, scientific=FALSE)})
No id variables; using all as measure variables

View the variable distributions using histograms and bar charts¶
In [8]:
# Plot a histogram or bar chart of each variable
par(mfrow = c(4,3))
hist(price)
hist(sqft_living)
hist(sqft_lot)
hist(yr_built)
plot(as.factor(bedrooms),main="Bar Chart of bedrooms")
plot(as.factor(bathrooms),main="Bar Chart of bathrooms")
plot(as.factor(waterfront), main="Bar Chart of waterfront")
plot(as.factor(condition), main="Bar Chart of condition")
plot(as.factor(grade), main="Bar Chart of grade")
# plot zipcode on a separate row
par(fig=c(0,1,0,0.30),ps=10,new=TRUE)
barplot(sort(table(zipcode)),las=2,main="Bar Chart of zipcode")

These graphs show:
• Price, sqft_living and sqft_lot all have large positive skews
• Most houses have 2.5 bathrooms, other common values are 1 and 1.75.
• There are very few waterfront properties
• Most properties have condition of 3 or 4
• Most properties have grade of 7 or 8
• Few properties built between 1900 and 1940, fairly even spread from 1940 to 2000, an increase in properties built between 2000 and 2010
• The numbers of properties sold by zip code varies from about 25 to about 250
Take a closer look at some interesting features¶
Replot price, sqft_living and sqft_lot using a log scale to see if these variables have a log-normal distribution
In [9]:
# Set some colours using Colorbrewer
gg.colour <- brewer.pal(12,"Paired")[12]
gg.fill <- brewer.pal(12,"Paired")[11]
# Re-plot some of the charts using log scales to counteract the skew
p1 <- ggplot(aes(x=price), data=houseData) +
geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
scale_x_log10(labels=comma,breaks=c(100000,300000,1000000,3000000,10000000))
p2 <- ggplot(aes(x=sqft_living), data=houseData) +
geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
scale_x_log10(labels=comma,breaks=c(500,1000,2000,5000,10000))
p3 <- ggplot(aes(x=sqft_lot), data=houseData) +
geom_histogram(bins=20, colour=gg.colour, fill=gg.fill) +
scale_x_log10(labels=comma,breaks=c(1000,4000,10000,40000,100000,500000,1000000))
grid.arrange(p1, p2, p3, ncol=1, nrow=3)

These graphs show:
• The log of price and sqft_living are (nearly) normally distributed.
• The log of the sqft_lot is not quite normal. The majority of lot sizes are between 4000 and 10,000 sqft, with a few outliers > 100,000 sqft
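The visual impression that a log scale removes the skew can be backed up with a number. The sketch below is an illustration only: it uses simulated log-normal data rather than the house data, and the `skewness` helper is a hand-written assumption (base R does not provide one). It compares the sample skewness before and after taking logs; a value near zero is what a symmetric distribution would give.

```r
set.seed(42)
# Hypothetical right-skewed sample standing in for a variable such as price
x <- rlnorm(5000, meanlog = 13, sdlog = 0.5)

# Sample skewness (third standardised moment); hand-written helper
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew.raw <- skewness(x)
skew.log <- skewness(log10(x))
cat("Skewness of raw values:  ", round(skew.raw, 2), "\n")
cat("Skewness of log10 values:", round(skew.log, 2), "\n")
```

Applied to the real columns, the same helper could be called as `skewness(houseData$price)` and `skewness(log10(houseData$price))`.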
There was a drop in sales of houses built during the 1930’s and an increase in sales of houses built during the 2000’s. The following graphs provide a closer look at sales for these years.
In [10]:
par(mfrow=c(2,1))
plot(as.factor(yr_built[yr_built >= 1920 & yr_built < 1950]),
main = "Sales by Year Built, 1920-1950", col="lightblue")
plot(as.factor(yr_built[yr_built >= 1990]),
main = "Sales by Year Built, Post 1990", col="lightblue")

There were very few houses sold that were built between 1933 and 1936; these were depression years, so maybe fewer houses were built. Then there was an increase in sales for houses built during WWII (so this data is probably not for a European city) and a spike during the post-war years of 1947 and 1948.
There was a jump in the number of houses sold that were built in 2003 to 2008. These houses were between six and twelve years old when they were sold. Fewer houses built between 2009 and 2013 were sold. These figures indicate that owners of new homes tend to own the house for at least six years before reselling. The sales volume of houses built in 2014 and 2015 shows that about 2.5% of sales were for new houses.
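A share such as the proportion of new houses can be computed directly from the year-built column. The following is a minimal sketch on made-up values (the `yr_built` vector here is hypothetical and much shorter than the real column), relying on the fact that the mean of a logical vector is the proportion of `TRUE` values.

```r
# Hypothetical year-built values; the real figures come from training.csv
yr_built <- c(1959, 1991, 2002, 1926, 2014, 1951, 1979, 2005, 2015, 1990)

# mean() of a logical vector gives the proportion of TRUE values
share.new <- mean(yr_built >= 2014)
cat(sprintf("%.1f%% of sales were houses built in 2014 or later\n", 100 * share.new))
```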
One possible anomaly in the data is that there are houses with no bedrooms and/or no bathrooms. Take a closer look at these records.
In [11]:
houseData[bedrooms == 0 | bathrooms == 0,]
row    price    bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition  grade  yr_built  zipcode
37     139950   0         0.00       844          4269      0           4          7      1913      98001
987    235000   0         0.00       1470         4800      0           3          7      1996      98065
1295   355000   0         0.00       2460         8049      0           3          8      1990      98031
3523   265000   0         0.75       384          213444    0           3          4      2003      98070
4110   380000   0         0.00       1470         979       0           3          8      2006      98133
7081   339950   0         2.50       2290         8319      0           3          8      1985      98042
8099   280000   1         0.00       600          24501     0           2          3      1950      98045
Many of these look very strange. The first record describes a house with no bedrooms and no bathrooms but is in above average condition and of average grade. Record 7081 describes a house with no bedrooms but 2.5 bathrooms, so would suit a very clean person who doesn’t sleep! Most (if not all) of these are likely to contain erroneous data.
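If these records are later judged to be errors, they can be separated from the rest with a logical flag. The sketch below is an assumption about how that might be done, shown on a tiny made-up data frame (`homes` and its values are hypothetical). Whether to actually drop such rows is a modelling decision that this section does not make.

```r
# Hypothetical toy data mirroring the kinds of records shown above
homes <- data.frame(
  price     = c(139950, 339950, 280000, 450000),
  bedrooms  = c(0, 0, 1, 3),
  bathrooms = c(0, 2.5, 0, 2.25)
)

# Flag rows with no bedrooms or no bathrooms
suspect <- homes$bedrooms == 0 | homes$bathrooms == 0
cat(sum(suspect), "of", nrow(homes), "records flagged\n")

# Keep the flagged rows for inspection and build a cleaned copy without them
flagged <- homes[suspect, ]
cleaned <- homes[!suspect, ]
```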
Investigate Pairs of Variables¶
Correlation Plot Function¶
This is the DIY correlation plot provided in the tutorial
In [12]:
# DIY correlation plot
# http://stackoverflow.com/questions/31709982/how-to-plot-in-r-a-correlogram-on-top-of-a-correlation-matrix
# there’s some truth to the quote that modern programming is often stitching together pieces from SO
colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
## colorRamp() returns a function which takes as an argument a number
## on [0,1] and returns a color in the gradient in colorRange
myColorRampFunc <- colorRamp(colorRange)
panel.cor <- function(w, z, ...) {
correlation <- cor(w, z)
## because the func needs [0,1] and cor gives [-1,1], we need to shift and scale it
col <- rgb(myColorRampFunc((1 + correlation) / 2 ) / 255 )
## take the square root so the circle's area (not its radius) scales with |correlation|
radius <- sqrt(abs(correlation))
radians <- seq(0, 2*pi, len = 50) # 50 is arbitrary
x <- radius * cos(radians)
y <- radius * sin(radians)
## make them full loops
x <- c(x, tail(x,n=1))
y <- c(y, tail(y,n=1))
## trick: "don't create a new plot" thing by following the
## advice here: http://www.r-bloggers.com/multiple-y-axis-in-a-r-plot/
## par(new=TRUE) makes the next plot() call draw on the existing panel
par(new=TRUE)
plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1)
polygon(x, y, border=col, col=col)
}
# usage e.g.:
# pairs(mtcars, upper.panel = panel.cor)
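To make the two mappings inside `panel.cor` concrete, the sketch below evaluates them outside the plotting code. The helper names `cor.colour` and `cor.radius` are introduced here purely for illustration and are not part of the original function. The first rescales a correlation from [-1, 1] to [0, 1] and looks up the gradient colour; the second returns the radius whose circle area is proportional to the absolute correlation.

```r
colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
ramp <- colorRamp(colorRange)

# Map a correlation in [-1, 1] onto [0, 1], then onto a hex colour
cor.colour <- function(r) rgb(ramp((1 + r) / 2) / 255)

# Radius chosen so that the circle's area is proportional to |r|
cor.radius <- function(r) sqrt(abs(r))

r <- c(-1, 0, 0.25, 1)
print(data.frame(r = r, colour = cor.colour(r), radius = cor.radius(r)))
```

A correlation of -1 maps to the dark red end of the gradient, 0 to white and +1 to the dark blue end, while a correlation of 0.25 is drawn with half the maximum radius.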
Scatterplot Matrix¶
1) Plot the variables using a scatterplot matrix to visualise the correlations between variables.
In [13]:
pairs(houseData[sample.int(nrow(houseData),1000),], lower.panel=panel.cor)

The correlation matrix shows:
• Price is strongly correlated to sqft_living and grade, has a weaker correlation to bathrooms, bedrooms and waterfront
• Bedrooms, bathrooms, sqft_living, grade and yr_built are all correlated with each other - in particular both bathrooms and grade are highly correlated to sqft_living
• There is quite a strong negative correlation between condition and yr_built
• There is very little correlation between the land size (sqft_lot) and the price
• There is a correlation between zipcode and yr_built
There is also a weak correlation between price and whether or not the property overlooks the waterfront.
2) How does looking at just these waterfront properties change the correlations?
In [14]:
pairs(houseData[houseData$waterfront == 1,-6], lower.panel=panel.cor)

The correlations are similar, but a few are stronger:
• yr_built and price
• bedrooms with bathrooms and grade
• grade and yr_built
The negative correlation between condition and yr_built is weaker
There is a new correlation between zipcode and price - so the zipcode appears to have an effect on the price of waterfront properties, but not on other properties.
3) Take a closer look at the correlations between price, bedrooms, bathrooms, sqft_living and grade
In [15]:
scatterplotMatrix(~price+bedrooms+bathrooms+sqft_living+grade,data=houseData)

The scatterplot matrix shows many of these relationships are non-linear - in particular the relationships between price and the other variables.
The relationship between bedrooms and the other variables looks to be linear for houses with between one and four bedrooms but changes for houses with more than four bedrooms.
Correlation Coefficients¶
Show the correlation coefficients for all pairs of variables
In [16]:
options(digits=4)
cor(houseData)
variable        price   bedrooms  bathrooms  sqft_living  sqft_lot  waterfront  condition    grade  yr_built   zipcode
price         1.00000   0.323447    0.52505      0.69585   0.09395    0.294854    0.03340   0.6688   0.06105  -0.04954
bedrooms      0.32345   1.000000    0.52173      0.59094   0.03469   -0.003382    0.02870   0.3671   0.14510  -0.14563
bathrooms     0.52505   0.521726    1.00000      0.75076   0.08911    0.075340   -0.12710   0.6701   0.50991  -0.19424
sqft_living   0.69585   0.590939    0.75076      1.00000   0.18653    0.119910   -0.06342   0.7651   0.32164  -0.19173
sqft_lot      0.09395   0.034691    0.08911      0.18653   1.00000    0.026782   -0.01656   0.1172   0.05836  -0.13530
waterfront    0.29485  -0.003382    0.07534      0.11991   0.02678    1.000000    0.01523   0.0854  -0.01816   0.02621
condition     0.03340   0.028699   -0.12710     -0.06342  -0.01656    0.015235    1.00000  -0.1503  -0.36385   0.01226
grade         0.66883   0.367060    0.67009      0.76515   0.11717    0.085398   -0.15031   1.0000   0.45182  -0.17989
yr_built      0.06105   0.145100    0.50991      0.32164   0.05836   -0.018161   -0.36385   0.4518   1.00000  -0.34943
zipcode      -0.04954  -0.145632   -0.19424     -0.19173  -0.13530    0.026214    0.01226  -0.1799  -0.34943   1.00000
The top positive correlations are between:
• sqft_living and grade
• sqft_living and bathrooms
• sqft_living and price
• grade and bathrooms
• grade and price
The only significant negative correlation is between yr_built and condition - so older houses are generally in better condition than newer houses
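Picking the strongest pairs out of a correlation matrix by eye is error-prone, so a ranked list can help. The sketch below shows one possible approach, illustrated on R's built-in `mtcars` data rather than the house data (the choice of columns and the `pairs.df` name are assumptions for the example). It keeps each pair once by reading only the upper triangle of the matrix and then sorts by absolute correlation.

```r
# Illustration on the built-in mtcars data, not the house data
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs.df <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Order by absolute correlation, strongest first
pairs.df <- pairs.df[order(-abs(pairs.df$r)), ]
print(head(pairs.df, 5), row.names = FALSE)
```

The same steps should work on the numeric house variables by replacing the first line with `cm <- cor(houseData)`, provided it is run before the categorical columns are converted to factors.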
Re-format Data for Further Analysis¶
Factorise the categorical variables for the rest of the analysis
In [17]:
houseData$waterfront <- as.factor(houseData$waterfront)
houseData$condition <- as.factor(houseData$condition)
houseData$grade <- as.factor(houseData$grade)
houseData$zipcode <- as.factor(houseData$zipcode)
str(houseData)
'data.frame': 10000 obs. of 10 variables:
$ price : int 211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
$ bedrooms : int 4 3 3 4 5 2 3 2 3 4 ...
$ bathrooms : num 1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
$ sqft_living: int 2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
$ sqft_lot : int 9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
$ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ condition : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 3 3 3 3 3 3 3 ...
$ grade : Factor w/ 11 levels "3","4","5","6",..: 5 5 9 6 7 5 4 5 7 8 ...
$ yr_built : int 1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 65 24 42 43 52 50 61 59 24 4 ...
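When several columns need the same conversion, the four assignments above can be collapsed into one step. The following is a minimal sketch on a made-up data frame (`homes`, its values and the `cat.cols` vector are hypothetical), using `lapply()` over the selected columns.

```r
# Hypothetical toy data with integer-coded categorical columns
homes <- data.frame(
  price      = c(211000, 265000, 1440000),
  waterfront = c(0, 0, 1),
  condition  = c(3, 4, 3),
  zipcode    = c(98168, 98038, 98102)
)

# Columns to treat as categorical
cat.cols <- c("waterfront", "condition", "zipcode")

# Convert all of them in one step
homes[cat.cols] <- lapply(homes[cat.cols], as.factor)
str(homes)
```

For the ordinal attributes (condition and grade), `factor(x, ordered = TRUE)` is an alternative, although ordered factors use different default contrasts in `lm`, so the plain factors used in this notebook are the simpler choice.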
Melt the data for visualisation and remove some outliers
In [18]:
m1 <- melt(as.data.frame(houseData))
m1 <- m1[!(m1$variable == "sqft_living" & m1$value > 10000),]
m1 <- m1[!(m1$variable == "sqft_lot" & m1$value > 100000),]
Using waterfront, condition, grade, zipcode as id variables
Show how Grade is Related to other Variables¶
The grade is correlated with several of the numeric variables. These box plots show how each numeric variable is distributed by grade.
In [19]:
ggplot(m1,aes(x = grade,y = value)) +
facet_wrap(~variable, scales="free") +
geom_boxplot() +
scale_y_continuous(labels=function (n) {format(n, scientific=FALSE)})

High grade houses:
• are higher priced than lower grade houses
• have more bedrooms and bathrooms
• are larger and on larger properties – but the lowest grade houses are also on larger properties
• tend to be newer houses
Show how Condition is Related to other Variables¶
The condition of the house is negatively correlated to the age of the house. These boxplots investigate this correlation and check for any relationships to other variables.
In [20]:
ggplot(m1,aes(x = condition,y = value)) +
facet_wrap(~variable, scales="free") +
geom_boxplot() +
scale_y_continuous(labels=function (n) {format(n, scientific=FALSE)})

Houses in the best condition are similar to houses in average condition, but tend to be older and so are probably older houses that have been renovated to a very high standard.
Houses in poor condition, compared to those in average condition:
• Sell for slightly less
• Have fewer bedrooms and bathrooms
• Are smaller but on similar sized properties
• Are older houses
Show how the Zipcode is Related to other Variables¶
Show a boxplot of each numeric variable by zipcode. Zipcodes are coloured according to whether at least one property with the zipcode is a waterfront property (red – a waterfront zipcode, blue – not a waterfront zipcode). Outliers are not shown to reduce graph clutter.
Code to separate the legends from the plots was copied from: http://stackoverflow.com/questions/12539348/ggplot-separate-legend-and-plot
In [21]:
wf.zipcodes <- houseData$zipcode %in% unique(as.character(houseData$zipcode[waterfront==1]))
p1 <- ggplot(houseData,aes(x = reorder(zipcode,price,median),y = price)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
scale_y_continuous(limits=c(0,4000000), labels=comma) +
scale_fill_discrete(name="Waterfront Zipcode", labels=c("No","Yes")) +
theme(legend.position = "bottom", legend.box = "horizontal") +
xlab("Zipcode (ordered by median price)") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
tmp <- ggplot_gtable(ggplot_build(p1))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
p2 <- ggplot(houseData,aes(x = reorder(zipcode,yr_built,median),y = yr_built)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
scale_fill_discrete(guide="none") +
xlab("Zipcode (ordered by median yr_built)") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
grid.arrange(p1+theme(legend.position = 'none'), p2, legend, ncol=1, nrow=3, heights=c(3/7,3/7,1/7))
p3 <- ggplot(houseData,aes(x = reorder(zipcode,bedrooms,mean),y = bedrooms)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
scale_y_continuous(limits=c(0,7),breaks=seq(0,7)) +
scale_fill_discrete(guide="none") +
xlab("Zipcode (ordered by mean bedrooms)") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
p4 <- ggplot(houseData,aes(x = reorder(zipcode,bathrooms,mean),y = bathrooms)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
scale_y_continuous(limits=c(0,5),breaks=seq(0,5)) +
scale_fill_discrete(guide="none") +
xlab("Zipcode (ordered by mean bathrooms)") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
grid.arrange(p3, p4, legend, ncol=1, nrow=3, heights=c(3/7,3/7,1/7))
p5 <- ggplot(houseData,aes(x = reorder(zipcode,sqft_living,median),y = sqft_living)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
scale_y_continuous(limits=c(0,7000),breaks=seq(0,7000,1000)) +
scale_fill_discrete(guide="none") +
xlab("Zipcode (ordered by median sqft_living)") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
p6 <- ggplot(houseData,aes(x = reorder(zipcode,sqft_lot,median),y = sqft_lot)) +
geom_boxplot(outlier.shape=NA, mapping=aes(fill=wf.zipcodes)) +
scale_y_continuous(limits=c(0,100000),breaks=seq(0,100000,10000), labels=comma) +
scale_fill_discrete(guide="none") +
xlab("Zipcode (ordered by median sqft_lot)") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size=6))
grid.arrange(p5, p6, legend, ncol=1, nrow=3, heights=c(3/7,3/7,1/7))
Warning message:
“Removed 5 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 5 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 11 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 14 rows containing non-finite values (stat_boxplot).”

Warning message:
“Removed 8 rows containing non-finite values (stat_boxplot).”
Warning message:
“Removed 214 rows containing non-finite values (stat_boxplot).”


The above graphs show zipcodes are related to the other attributes in the following ways:
• There are a few high-priced zipcodes. The highest price zipcode contains waterfront properties.
• Quite a few zipcodes have predominately newer houses, a few have predominately "middle-aged" houses and a few have a range of older and newer houses.
• There is no real difference in number of bedrooms by zipcode
• Most zipcodes have mainly houses with more than one bathroom, but a few zipcodes have quite a few one bathroom houses (the lower hinge of the IQR is one).
• A few zipcodes have large houses and some very large properties
• Several of the zipcodes with larger houses have waterfront properties
• One zipcode (98039) stands out as having the highest prices, the most bedrooms and bathrooms and the largest houses. It has waterfront properties.
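The ranking of zipcodes by price that the box plots show visually can also be tabulated. The sketch below is an illustration on made-up data (the `homes` data frame, its zipcodes and its prices are hypothetical), using `aggregate()` to compute the median price per zipcode and sorting the result.

```r
# Hypothetical toy data; zipcodes and prices are made up for illustration
homes <- data.frame(
  zipcode = factor(c("98001", "98001", "98039", "98039", "98103", "98103")),
  price   = c(200000, 260000, 1500000, 2100000, 520000, 610000)
)

# Median price per zipcode, highest first
med <- aggregate(price ~ zipcode, data = homes, FUN = median)
med <- med[order(-med$price), ]
print(med, row.names = FALSE)
```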
Waterfront Properties¶
The correlation matrices showed there are differences between how waterfront properties and non-waterfront properties are related to the other variables. These differences can be seen in the following charts.
Firstly, the differences of the numeric variable distributions between waterfront and non-waterfront properties are shown in the following density graphs.
In [22]:
p1 <- ggplot(aes(x=price),data = houseData) +
geom_density(aes(fill = as.factor(waterfront))) +
ggtitle('Price and Waterfront') +
scale_x_continuous(labels=comma) +
scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p2 <- ggplot(aes(x=sqft_living),data = houseData) +
geom_density(aes(fill = as.factor(waterfront))) +
ggtitle('House Size and Waterfront') +
scale_x_continuous(limits=c(0,10000)) +
scale_y_continuous(labels=comma) +
scale_fill_discrete(guide="none")
p3 <- ggplot(aes(x=sqft_lot),data = houseData) +
geom_density(aes(fill = as.factor(waterfront))) +
ggtitle('Lot Size and Waterfront') +
scale_x_continuous(limits=c(0,100000)) +
scale_fill_discrete(guide="none")
p4 <- ggplot(aes(x=yr_built),data = houseData) +
geom_density(aes(fill = as.factor(waterfront))) +
ggtitle('Year Built and Waterfront') +
scale_fill_discrete(guide="none")
tmp <- ggplot_gtable(ggplot_build(p1))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
grid.arrange(p1+theme(legend.position = 'none'), p2, p3, p4, legend, ncol=2, nrow=3, heights=c(3/7,3/7,1/7))
Warning message:
“Removed 1 rows containing non-finite values (stat_density).”
Warning message:
“Removed 214 rows containing non-finite values (stat_density).”

Properties overlooking the waterfront have some quite different characteristics from other properties:
• they are more expensive than non-waterfront ones and have a wider price range
• they have a much wider range of both house size and plot size
• they are slightly older than non-waterfront ones
The following graphs show how waterfront properties differ from non-waterfront ones in terms of the number of bathrooms, condition and grade; and which zipcodes contain waterfront properties.
In [23]:
p1 <- ggplot(aes(x=bathrooms, fill=waterfront),data = houseData) +
facet_wrap(~waterfront, scales="free") +
geom_bar() +
ggtitle('Bathrooms and Waterfront') +
theme(legend.position = "bottom", legend.box = "horizontal") +
scale_x_continuous(limits=c(0,6)) +
scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p2 <- ggplot(aes(x=condition, fill=waterfront),data = houseData) +
facet_wrap(~waterfront, scales="free") +
geom_bar() +
ggtitle('Condition and Waterfront') +
theme(legend.position = "bottom", legend.box = "horizontal") +
scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p3 <- ggplot(aes(x=grade, fill=waterfront),data = houseData) +
facet_wrap(~waterfront, scales="free") +
geom_bar() +
ggtitle('Grade and Waterfront') +
theme(legend.position = "bottom", legend.box = "horizontal") +
scale_fill_discrete(name="Waterfront", labels=c("No","Yes"))
p4 <- ggplot(aes(x=zipcode, fill=waterfront),data = houseData[houseData$waterfront == 1,]) +
geom_bar(fill="#00BFC4") +
ggtitle('Zipcodes with Waterfront Properties') +
scale_y_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12)) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5))
grid.arrange(p1, p2, ncol=1, nrow=2)
grid.arrange(p3, p4, ncol=1, nrow=2)
#tmp <- ggplot_gtable(ggplot_build(p5))
#leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
#legend <- tmp$grobs[[leg]]
#grid.arrange(legend, p5+theme(legend.position = 'none'), p6, ncol=1, nrow=3, heights=c(1/7,3/7,3/7))
Warning message:
“Removed 2 rows containing non-finite values (stat_count).”
Warning message:
“Removed 3 rows containing missing values (geom_bar).”


Additional characteristics of waterfront properties compared to non-waterfront ones are:
• These houses have more bathrooms
• There is a higher proportion of houses in very good condition - but also in poor condition
• There is a higher proportion of high grade properties
The last plot shows that only 23 of the zipcodes had sales of waterfront properties and only six of these had more than three sales of waterfront properties.
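Counts like these can be read off a frequency table rather than a chart. The sketch below shows one way to do so on made-up data (the `homes` data frame and its values are hypothetical). It counts with `> 0` rather than taking the table's length because, when zipcode is a factor as in this notebook, `table()` also lists levels with zero waterfront sales.

```r
# Hypothetical toy data: zipcode and waterfront flag for eight sales
homes <- data.frame(
  zipcode    = c("98070", "98070", "98070", "98070", "98040", "98040", "98001", "98166"),
  waterfront = c(1, 1, 1, 1, 1, 0, 0, 1)
)

# Waterfront sales per zipcode
wf.counts <- table(homes$zipcode[homes$waterfront == 1])

n.zip  <- sum(wf.counts > 0)   # zipcodes with at least one waterfront sale
n.many <- sum(wf.counts > 3)   # zipcodes with more than three
cat(n.zip, "zipcodes with waterfront sales;", n.many, "with more than three\n")
```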
How have House Features Changed over Time?¶
Look at the relationship between year built and bedrooms, bathrooms, grade and condition. Plot stacked bar charts of proportions to see how these have changed over time. Years are grouped into decades to reduce the chart clutter.
In [24]:
p1 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = as.factor(bedrooms)),data = houseData[bedrooms >= 1 & bedrooms < 8,]) +
geom_bar(position="fill") +
ggtitle('Bedrooms and Year Built') +
scale_fill_brewer(palette="Set3",name="bedrooms") +
xlab("Decade Built")
p2 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = as.factor(bathrooms)),data = houseData[bathrooms >= 1 & bathrooms <= 3.5,]) +
geom_bar(position="fill") +
ggtitle('Bathrooms and Year Built') +
scale_fill_brewer(palette="Set3",name="bathrooms") +
xlab("Decade Built")
p3 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = grade),data = houseData[grade >= 4 & grade < 13,]) +
geom_bar(position="fill") +
ggtitle('Grade and Year Built') +
scale_fill_brewer(palette="Set3",name="grade") +
xlab("Decade Built")
p4 <- ggplot(aes(x=(trunc((yr_built)/10))*10, fill = condition),data = houseData) +
geom_bar(position="fill") +
ggtitle('Condition and Year Built') +
scale_fill_brewer(palette="Set3",name="condition") +
xlab("Decade Built")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

These charts show:
• Two bedroom houses were common before the 1950's but are rare more recently, with three and four bedrooms becoming more common
• Most houses built before 1960 had only one bathroom; 2.5 bathrooms is usual for recent houses
• Older houses tend to be of lower grade
• All the houses in poor condition tend to be older - but the houses in really good condition are also older. Most newer houses are in average condition
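The proportions behind the stacked bars can be computed as a table as well. The sketch below is an illustration on made-up data (the `homes` data frame and its values are hypothetical). It uses the same decade truncation as the plotting code, then `prop.table()` with `margin = 1` so that each decade's row sums to one.

```r
# Hypothetical toy data: year built and bedroom count for eight houses
homes <- data.frame(
  yr_built = c(1942, 1948, 1947, 1995, 1999, 2003, 2006, 2008),
  bedrooms = c(2, 2, 3, 3, 4, 3, 4, 4)
)

# Truncate each year to the start of its decade, e.g. 1948 -> 1940
homes$decade <- trunc(homes$yr_built / 10) * 10

# Row-wise proportions: share of each bedroom count within a decade
shares <- prop.table(table(homes$decade, homes$bedrooms), margin = 1)
print(round(shares, 2))
```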
Association between Bedrooms, House Size and Price¶
The earlier scatterplot matrices show the relationship between bedrooms and house size, and bedrooms and price is different for houses with more than four bedrooms than for those with four or fewer bedrooms. Does plotting all three of these variables together show more about these relationships?
In [25]:
ggplot(aes(x=price/1000,y=sqft_living,colour=as.factor(bedrooms)),
data = houseData[houseData$price < 2000000 & houseData$sqft_living < 5000,]) +
geom_point() +
scale_colour_brewer(name="bedrooms",palette="Set3")

This plot shows that the number of bedrooms, the house size and the price increase together for houses between 1 and 4 bedrooms. Houses with more than 4 bedrooms appear to have similar prices and sizes to 4 bedroom houses.
Summary¶
Overview¶
The provided house sales data has 10000 records with 11 attributes for each record. The provided descriptions for each attribute and some additional notes are:
1. Id: Unique ID for each home sold
2. Price: Price of each home sold
3. Bedroom: #bedrooms
• Values range from 0-10
4. Bathrooms: #bathrooms, where .5 accounts for a bathroom with a toilet but no shower
• Values range from 0-8 and include .25 and .75, as well as .5
5. Sqft_living: Square footage of the apartments interior living space
6. Sqft_lot: Square footage of the land space
7. Waterfront: A binary variable indicating whether the property was overlooking the waterfront or not. 1’s represent a waterfront property, 0’s represent a non-waterfront property
8. Condition: An index from 1 to 5 on the condition of the apartment, 1 - lowest, 5 - highest
9. Grade: An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high-quality level of construction and design.
• Grades 1 and 2 are not present in the training data
10. Yr_built: The year the house was initially built.
• Values range from 1900 to 2015
11. Zipcode: What postcode area the house is in
Analysis of Each Variable¶
Some houses appear to have been sold more than once during the year.
Price, sqft_living and sqft_lot have very skewed distributions. Plotting the log values of these variables shows a more normal distribution.
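The effect of a log transformation on a right-skewed variable can be sketched with simulated data. The values below are drawn from a log-normal distribution as a stand-in for price (an assumption, not the real data), and skewness is a small helper defined here rather than a function from the notebook.

```r
set.seed(1)
# Simulated right-skewed values standing in for price (assumption, not the real data)
x <- rlnorm(5000, meanlog = 13, sdlog = 0.5)

# Sample skewness: third standardised moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

cat("skewness of raw values:", round(skewness(x), 2), "\n")
cat("skewness of log values:", round(skewness(log(x)), 2), "\n")
```

A skewness near zero for the logged values is consistent with the more symmetric histograms seen in the EDA.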
There were very few houses sold that were built between 1933 and 1936 - these were depression years so maybe fewer houses were built. Then there was an increase in sales for houses built during WWII and a spike during the post-war years of 1947 and 1948.
A spike in the sales of houses built between 2003 and 2008 was observed, probably indicating that people who bought new homes then are now selling them.
Analysis of Pairs of Variables¶
As might be expected, the sale price of a house appears to be determined by a range of factors including the size of the house, the number of bedrooms and bathrooms, the grade and the year the house was built. These variables are all inter-related, with larger houses having more bedrooms and bathrooms, and the size and the number of bedrooms and bathrooms all contributing to the grade. Two specific observations are:
• The variables most highly correlated with the price are sqft_living (the size of the house) and grade.
• The correlation between bedrooms and the other variables decreases for houses with more than four bedrooms.
What is a little surprising is that the condition of the house and the size of the lot do not appear to have a big effect on the price. A closer look at the condition showed that being in poor condition does lead to a lower price, but there are not many houses in poor condition. Being in better than average condition does not lead to a better price.
The condition and the year the house was built are negatively correlated, implying that older houses are generally in better condition than newer houses. However, the houses in the worst condition are also older houses. The majority of newer houses are in average condition.
There are several differences between properties that overlook the waterfront and other properties. Waterfront properties are usually higher-priced, larger, in better condition and higher grade houses, but as only 77 waterfront properties were sold some of these differences may not generalise to other sets of data.
An unexpected observation is that there appears to be a slight correlation between the number of bathrooms, whether or not the property overlooks the waterfront and the zipcode.
Build the Linear Regression Model¶
Prepare the Data Frame¶
Re-load the data frame and factorise the categorical variables
In [26]:
houseData <- read.csv("training.csv")
houseData <- houseData[,-1]
houseData$waterfront <- as.factor(houseData$waterfront)
houseData$condition <- as.factor(houseData$condition)
houseData$grade <- as.factor(houseData$grade)
houseData$zipcode <- as.factor(houseData$zipcode)
str(houseData)
'data.frame': 10000 obs. of 10 variables:
$ price : int 211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
$ bedrooms : int 4 3 3 4 5 2 3 2 3 4 ...
$ bathrooms : num 1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
$ sqft_living: int 2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
$ sqft_lot : int 9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
$ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ condition : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 3 3 3 3 3 3 3 ...
$ grade : Factor w/ 11 levels "3","4","5","6",..: 5 5 9 6 7 5 4 5 7 8 ...
$ yr_built : int 1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 65 24 42 43 52 50 61 59 24 4 ...
Define some Functions¶
These functions are used during the model building to evaluate the model accuracy.
Function to Calculate Model Accuracy Statistics¶
Name: Model.Accuracy
Input parameters:
• predicted - a vector of predictions
• target - a vector containing the target values for the predictions
• df - the degrees of freedom
• p - the number of parameters excluding the intercept
Return Value:
A list containing:
• rsquared - the R-Squared value calculated from the predicted and target values
• rse - the residual standard error
• f.stat - the F-statistic
Description:
Calculate the TSS and RSS as:
• TSS: $\sum_{i=1}^n (y_i - \bar y)^2$
• RSS: $\sum_{i=1}^n (\hat y_i - y_i)^2$
Calculate the statistics according to the following formulae:
• R-Squared value: $R^2 = 1 - \frac{RSS}{TSS}$
• Residual standard error: $RSE = \sqrt{\frac{1}{df}RSS}$
• F-statistic: $F = \frac{(TSS - RSS)/p}{RSS / df}$
In [27]:
Model.Accuracy <- function(predicted, target, df, p) {
    # Residual and total sums of squares
    rss <- sum((predicted - target)^2)
    tss <- sum((target - mean(target))^2)
    rsquared <- 1 - rss / tss                  # proportion of variance explained
    rse <- sqrt(rss / df)                      # residual standard error
    f.stat <- ((tss - rss) / p) / (rss / df)   # F-statistic
    return(list(rsquared = rsquared, rse = rse, f.stat = f.stat))
}
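The formulae above can be checked against the values that summary reports for a fitted model. The sketch below uses the built-in mtcars data purely as a stand-in (it is not part of the house price analysis) and computes the three statistics directly, so it does not depend on any other cell.

```r
# Fit a small model on the built-in mtcars data (used here only as a stand-in)
fit <- lm(mpg ~ wt + hp, data = mtcars)
predicted <- fitted(fit)
target <- mtcars$mpg

p  <- length(coef(fit)) - 1   # number of predictors, excluding the intercept
df <- length(target) - p - 1  # residual degrees of freedom

rss <- sum((predicted - target)^2)
tss <- sum((target - mean(target))^2)

rsquared <- 1 - rss / tss
rse      <- sqrt(rss / df)
f.stat   <- ((tss - rss) / p) / (rss / df)

# Compare with the values reported by summary()
s <- summary(fit)
print(c(rsquared = rsquared, rse = rse, f.stat = f.stat))
print(c(rsquared = s$r.squared, rse = s$sigma,
        f.stat = unname(s$fstatistic["value"])))
```

The two printed vectors should agree, which confirms that df is the residual degrees of freedom and p excludes the intercept.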
Function to Calculate RMSE¶
Name: RMSE
Input parameters:
• predicted - a vector of predictions
• target - a vector containing the target values for the predictions
Return Value:
The RMSE value calculated from the predicted and target values
Description:
Calculate the RMSE value: $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (\hat y_i - y_i)^2}$
In [28]:
RMSE <- function(predicted, target) {
    # Root of the mean squared difference between predictions and targets
    return(sqrt(mean((predicted - target)^2)))
}
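As a quick worked example of the RMSE formula, the sketch below uses three made-up predictions and targets (illustrative values only) and computes the value directly.

```r
predicted <- c(2, 4, 6)
target    <- c(1, 4, 8)

# Squared errors are 1, 0 and 4, so the mean squared error is 5/3
rmse <- sqrt(mean((predicted - target)^2))
print(rmse)
```

The result equals sqrt(5/3), and the RMSE function defined above should return the same value for these inputs.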
First Model¶
Try fitting all variables to see what appears to be important
In [29]:
fit1 <- lm(price ~ ., data=houseData)
summary(fit1)
Call:
lm(formula = price ~ ., data = houseData)
Residuals:
Min 1Q Median 3Q Max
-1247108 -65534 265 55635 1899005
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.43e+06 2.00e+05 7.14 9.7e-13 ***
bedrooms -1.14e+04 2.29e+03 -5.00 5.8e-07 ***
bathrooms 2.52e+04 3.63e+03 6.93 4.5e-12 ***
sqft_living 1.47e+02 3.68e+00 39.95 < 2e-16 ***
sqft_lot 2.18e-01 4.45e-02 4.89 1.0e-06 ***
waterfront1 8.85e+05 1.83e+04 48.26 < 2e-16 ***
condition2 6.57e+04 3.96e+04 1.66 0.09686 .
condition3 7.83e+04 3.63e+04 2.16 0.03093 *
condition4 1.02e+05 3.63e+04 2.82 0.00484 **
condition5 1.36e+05 3.66e+04 3.73 0.00019 ***
grade4 -8.21e+04 1.17e+05 -0.70 0.48363
grade5 -1.22e+05 1.10e+05 -1.10 0.26962
grade6 -1.24e+05 1.10e+05 -1.13 0.25869
grade7 -1.14e+05 1.10e+05 -1.04 0.29907
grade8 -7.40e+04 1.10e+05 -0.67 0.50073
grade9 2.52e+04 1.10e+05 0.23 0.81882
grade10 1.72e+05 1.10e+05 1.56 0.11876
grade11 4.05e+05 1.11e+05 3.65 0.00026 ***
grade12 8.20e+05 1.13e+05 7.24 4.7e-13 ***
grade13 2.14e+06 1.37e+05 15.65 < 2e-16 ***
yr_built -7.28e+02 8.42e+01 -8.64 < 2e-16 ***
zipcode98002 7.90e+03 2.02e+04 0.39 0.69608
zipcode98003 -1.16e+03 1.84e+04 -0.06 0.94982
zipcode98004 7.37e+05 1.77e+04 41.71 < 2e-16 ***
zipcode98005 2.78e+05 2.04e+04 13.65 < 2e-16 ***
zipcode98006 2.66e+05 1.60e+04 16.61 < 2e-16 ***
zipcode98007 2.42e+05 2.28e+04 10.65 < 2e-16 ***
zipcode98008 2.64e+05 1.82e+04 14.50 < 2e-16 ***
zipcode98010 5.40e+04 2.75e+04 1.96 0.04946 *
zipcode98011 1.41e+05 2.05e+04 6.90 5.7e-12 ***
zipcode98014 9.20e+04 2.34e+04 3.93 8.4e-05 ***
zipcode98019 9.84e+04 2.16e+04 4.56 5.3e-06 ***
zipcode98022 1.96e+04 1.99e+04 0.99 0.32381
zipcode98023 -3.11e+04 1.55e+04 -2.00 0.04582 *
zipcode98024 1.45e+05 2.61e+04 5.56 2.7e-08 ***
zipcode98027 1.61e+05 1.62e+04 9.97 < 2e-16 ***
zipcode98028 1.19e+05 1.90e+04 6.23 4.7e-10 ***
zipcode98029 2.08e+05 1.73e+04 12.08 < 2e-16 ***
zipcode98030 1.27e+04 1.82e+04 0.69 0.48711
zipcode98031 2.17e+04 1.89e+04 1.15 0.25183
zipcode98032 -7.44e+03 2.37e+04 -0.31 0.75365
zipcode98033 3.78e+05 1.66e+04 22.84 < 2e-16 ***
zipcode98034 2.08e+05 1.53e+04 13.62 < 2e-16 ***
zipcode98038 3.81e+04 1.55e+04 2.46 0.01385 *
zipcode98039 1.29e+06 3.46e+04 37.26 < 2e-16 ***
zipcode98040 5.23e+05 1.86e+04 28.15 < 2e-16 ***
zipcode98042 9.66e+03 1.52e+04 0.64 0.52537
zipcode98045 1.08e+05 1.85e+04 5.84 5.5e-09 ***
zipcode98052 2.44e+05 1.56e+04 15.66 < 2e-16 ***
zipcode98053 2.13e+05 1.62e+04 13.14 < 2e-16 ***
zipcode98055 3.99e+04 1.82e+04 2.19 0.02880 *
zipcode98056 8.91e+04 1.65e+04 5.39 7.2e-08 ***
zipcode98058 2.05e+04 1.58e+04 1.30 0.19482
zipcode98059 8.15e+04 1.60e+04 5.10 3.5e-07 ***
zipcode98065 1.14e+05 1.74e+04 6.55 6.1e-11 ***
zipcode98070 -2.98e+04 2.47e+04 -1.21 0.22745
zipcode98072 1.58e+05 1.91e+04 8.31 < 2e-16 ***
zipcode98074 1.66e+05 1.62e+04 10.26 < 2e-16 ***
zipcode98075 1.65e+05 1.70e+04 9.70 < 2e-16 ***
zipcode98077 1.11e+05 2.03e+04 5.46 4.9e-08 ***
zipcode98092 -1.03e+04 1.70e+04 -0.60 0.54601
zipcode98102 4.06e+05 2.60e+04 15.62 < 2e-16 ***
zipcode98103 3.19e+05 1.55e+04 20.59 < 2e-16 ***
zipcode98105 4.75e+05 1.89e+04 25.17 < 2e-16 ***
zipcode98106 1.03e+05 1.76e+04 5.85 5.2e-09 ***
zipcode98107 3.31e+05 1.80e+04 18.39 < 2e-16 ***
zipcode98108 9.75e+04 2.03e+04 4.81 1.5e-06 ***
zipcode98109 4.39e+05 2.58e+04 17.03 < 2e-16 ***
zipcode98112 5.86e+05 1.89e+04 31.05 < 2e-16 ***
zipcode98115 3.31e+05 1.55e+04 21.29 < 2e-16 ***
zipcode98116 2.92e+05 1.77e+04 16.49 < 2e-16 ***
zipcode98117 2.97e+05 1.56e+04 19.09 < 2e-16 ***
zipcode98118 1.47e+05 1.58e+04 9.30 < 2e-16 ***
zipcode98119 5.13e+05 2.18e+04 23.53 < 2e-16 ***
zipcode98122 3.39e+05 1.84e+04 18.46 < 2e-16 ***
zipcode98125 1.98e+05 1.61e+04 12.28 < 2e-16 ***
zipcode98126 1.96e+05 1.71e+04 11.51 < 2e-16 ***
zipcode98133 1.42e+05 1.56e+04 9.09 < 2e-16 ***
zipcode98136 2.63e+05 1.91e+04 13.77 < 2e-16 ***
zipcode98144 2.84e+05 1.75e+04 16.18 < 2e-16 ***
zipcode98146 1.04e+05 1.80e+04 5.77 8.0e-09 ***
zipcode98148 7.37e+04 3.21e+04 2.30 0.02162 *
zipcode98155 1.43e+05 1.62e+04 8.78 < 2e-16 ***
zipcode98166 8.00e+04 1.91e+04 4.19 2.8e-05 ***
zipcode98168 3.63e+04 1.84e+04 1.97 0.04896 *
zipcode98177 2.60e+05 1.86e+04 13.95 < 2e-16 ***
zipcode98178 4.04e+04 1.88e+04 2.14 0.03205 *
zipcode98188 2.88e+04 2.16e+04 1.34 0.18112
zipcode98198 9.73e+03 1.80e+04 0.54 0.58896
zipcode98199 3.99e+05 1.80e+04 22.17 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 154000 on 9910 degrees of freedom
Multiple R-squared: 0.827, Adjusted R-squared: 0.826
F-statistic: 533 on 89 and 9910 DF, p-value: <2e-16
The adjusted R-squared ($R^2$) value indicates this model explains 82.6% of the variation in house prices.
The F-statistic of 533 has a p-value < 2e-16, so the null hypothesis (that the model explains nothing) is rejected and the model is useful.
The p-values for the coefficients show all the non-factor variables are significant at the 0.05 level. Each factor has levels that are significant, but some levels of condition, grade and zipcode are not.
Check the residuals using the plot function
In [30]:
par(mfrow=c(2,2))
plot(fit1)

The model plots show:
• Residual vs Fitted - shows the residuals reasonably evenly distributed around zero, but they funnel out as the fitted value increases. This model violates the assumption of homoscedasticity - the error terms change along the regression line
• Normal Q-Q - the residuals deviate significantly from the dashed line, indicating the residuals are not normally distributed
• Scale-Location - The chart shows the model violates the assumption of equal variance
• Residuals vs Leverage - The chart shows there are some possibly influential outliers, however they appear to cancel each other out
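The funnel shape described above can also be checked numerically. The following is a minimal sketch of a Breusch-Pagan style check done by hand (the studentised n * R-squared form), run on simulated data whose error spread grows with x. The data and variable names are assumptions for illustration; the car package loaded at the top of the notebook provides ncvTest for a similar purpose.

```r
set.seed(42)
# Simulated data whose error spread grows with x (an assumption for illustration)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the fitted values
sq.res  <- residuals(fit)^2
fit.val <- fitted(fit)
aux <- lm(sq.res ~ fit.val)

# Lagrange multiplier statistic n * R^2, compared with chi-squared on 1 df
lm.stat <- n * summary(aux)$r.squared
p.value <- pchisq(lm.stat, df = 1, lower.tail = FALSE)
cat("LM statistic:", round(lm.stat, 2), " p-value:", signif(p.value, 3), "\n")
```

A small p-value indicates that the residual variance changes with the fitted value, which is the pattern the Residuals vs Fitted and Scale-Location charts show visually.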
Use step to remove unimportant variables
In [31]:
step(fit1)
Start: AIC=238963
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront +
condition + grade + yr_built + zipcode
Df Sum of Sq RSS AIC
- sqft_lot 1 5.66e+11 2.35e+14 238985
- bedrooms 1 5.92e+11 2.35e+14 238986
- bathrooms 1 1.14e+12 2.36e+14 239009
- yr_built 1 1.77e+12 2.36e+14 239036
- condition 4 2.52e+12 2.37e+14 239062
- sqft_living 1 3.78e+13 2.72e+14 240454
- waterfront 1 5.51e+13 2.90e+14 241072
- grade 10 6.40e+13 2.99e+14 241355
- zipcode 69 2.05e+14 4.40e+14 245116
Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + condition + grade + yr_built + zipcode, data = houseData)
Coefficients:
(Intercept) bedrooms bathrooms sqft_living sqft_lot
1.43e+06 -1.14e+04 2.52e+04 1.47e+02 2.18e-01
waterfront1 condition2 condition3 condition4 condition5
8.85e+05 6.57e+04 7.83e+04 1.02e+05 1.36e+05
grade4 grade5 grade6 grade7 grade8
-8.21e+04 -1.22e+05 -1.24e+05 -1.14e+05 -7.40e+04
grade9 grade10 grade11 grade12 grade13
2.52e+04 1.72e+05 4.05e+05 8.20e+05 2.14e+06
yr_built zipcode98002 zipcode98003 zipcode98004 zipcode98005
-7.28e+02 7.90e+03 -1.16e+03 7.37e+05 2.78e+05
zipcode98006 zipcode98007 zipcode98008 zipcode98010 zipcode98011
2.66e+05 2.42e+05 2.64e+05 5.40e+04 1.41e+05
zipcode98014 zipcode98019 zipcode98022 zipcode98023 zipcode98024
9.20e+04 9.84e+04 1.96e+04 -3.11e+04 1.45e+05
zipcode98027 zipcode98028 zipcode98029 zipcode98030 zipcode98031
1.61e+05 1.19e+05 2.08e+05 1.27e+04 2.17e+04
zipcode98032 zipcode98033 zipcode98034 zipcode98038 zipcode98039
-7.44e+03 3.78e+05 2.08e+05 3.81e+04 1.29e+06
zipcode98040 zipcode98042 zipcode98045 zipcode98052 zipcode98053
5.23e+05 9.66e+03 1.08e+05 2.44e+05 2.13e+05
zipcode98055 zipcode98056 zipcode98058 zipcode98059 zipcode98065
3.99e+04 8.91e+04 2.05e+04 8.15e+04 1.14e+05
zipcode98070 zipcode98072 zipcode98074 zipcode98075 zipcode98077
-2.98e+04 1.58e+05 1.66e+05 1.65e+05 1.11e+05
zipcode98092 zipcode98102 zipcode98103 zipcode98105 zipcode98106
-1.03e+04 4.06e+05 3.19e+05 4.75e+05 1.03e+05
zipcode98107 zipcode98108 zipcode98109 zipcode98112 zipcode98115
3.31e+05 9.75e+04 4.39e+05 5.86e+05 3.31e+05
zipcode98116 zipcode98117 zipcode98118 zipcode98119 zipcode98122
2.92e+05 2.97e+05 1.47e+05 5.13e+05 3.39e+05
zipcode98125 zipcode98126 zipcode98133 zipcode98136 zipcode98144
1.98e+05 1.96e+05 1.42e+05 2.63e+05 2.84e+05
zipcode98146 zipcode98148 zipcode98155 zipcode98166 zipcode98168
1.04e+05 7.37e+04 1.43e+05 8.00e+04 3.63e+04
zipcode98177 zipcode98178 zipcode98188 zipcode98198 zipcode98199
2.60e+05 4.04e+04 2.88e+04 9.73e+03 3.99e+05
Running step in the backward direction has not removed any variables: dropping any one of them would increase the AIC, so all are retained
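Note that step compares models by AIC rather than by coefficient p-values. A minimal sketch of that behaviour on simulated data follows. The data frame d and its columns are made up for illustration: y depends on x1 only and x2 is pure noise.

```r
set.seed(7)
# Simulated data: y depends on x1 only; x2 is pure noise (assumption for illustration)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)
full <- lm(y ~ x1 + x2, data = d)

# drop1 reports the AIC after removing each term, alongside an F-test
print(drop1(full, test = "F"))

# step keeps a term whenever dropping it would raise the AIC
reduced <- step(full, trace = 0)
print(formula(reduced))
```

The strong predictor x1 is always kept here. Whether x2 is dropped depends on the sample, because AIC can retain a noise variable by chance.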
Account for the Heteroscedasticity¶
The second model corrects the heteroscedasticity seen in the first model by using a log transformation of the response variable. It also uses log transformations of the sqft_living and sqft_lot predictor variables as the distributions of these variables were skewed.
In [32]:
fit2 <- lm(log(price) ~ . + log(sqft_living) + log(sqft_lot), data=houseData)
summary(fit2)
Call:
lm(formula = log(price) ~ . + log(sqft_living) + log(sqft_lot),
data = houseData)
Residuals:
Min 1Q Median 3Q Max
-1.0465 -0.1022 0.0025 0.1039 1.1231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.03e+01 2.96e-01 34.88 < 2e-16 ***
bedrooms -1.75e-02 2.86e-03 -6.14 8.7e-10 ***
bathrooms 3.76e-02 4.51e-03 8.35 < 2e-16 ***
sqft_living 5.55e-05 9.32e-06 5.95 2.7e-09 ***
sqft_lot -2.90e-08 7.27e-08 -0.40 0.69020
waterfront1 6.56e-01 2.25e-02 29.18 < 2e-16 ***
condition2 1.09e-01 4.84e-02 2.25 0.02464 *
condition3 2.69e-01 4.44e-02 6.06 1.4e-09 ***
condition4 2.99e-01 4.44e-02 6.74 1.7e-11 ***
condition5 3.61e-01 4.48e-02 8.05 9.1e-16 ***
grade4 -4.70e-01 1.43e-01 -3.28 0.00105 **
grade5 -5.05e-01 1.35e-01 -3.73 0.00019 ***
grade6 -4.30e-01 1.35e-01 -3.20 0.00140 **
grade7 -3.35e-01 1.35e-01 -2.48 0.01301 *
grade8 -2.14e-01 1.35e-01 -1.58 0.11317
grade9 -6.80e-02 1.35e-01 -0.50 0.61470
grade10 3.00e-02 1.35e-01 0.22 0.82452
grade11 1.61e-01 1.36e-01 1.19 0.23583
grade12 2.94e-01 1.39e-01 2.12 0.03370 *
grade13 4.35e-01 1.68e-01 2.60 0.00946 **
yr_built -5.69e-04 1.10e-04 -5.18 2.3e-07 ***
zipcode98002 -1.56e-03 2.48e-02 -0.06 0.94997
zipcode98003 3.20e-02 2.25e-02 1.42 0.15562
zipcode98004 1.15e+00 2.16e-02 52.96 < 2e-16 ***
zipcode98005 7.15e-01 2.49e-02 28.67 < 2e-16 ***
zipcode98006 6.86e-01 1.96e-02 34.93 < 2e-16 ***
zipcode98007 6.75e-01 2.79e-02 24.18 < 2e-16 ***
zipcode98008 6.83e-01 2.23e-02 30.65 < 2e-16 ***
zipcode98010 2.14e-01 3.37e-02 6.35 2.2e-10 ***
zipcode98011 4.71e-01 2.51e-02 18.77 < 2e-16 ***
zipcode98014 3.18e-01 2.87e-02 11.11 < 2e-16 ***
zipcode98019 3.40e-01 2.64e-02 12.85 < 2e-16 ***
zipcode98022 8.68e-02 2.43e-02 3.56 0.00037 ***
zipcode98023 -4.45e-03 1.90e-02 -0.23 0.81534
zipcode98024 4.31e-01 3.19e-02 13.50 < 2e-16 ***
zipcode98027 5.29e-01 1.98e-02 26.69 < 2e-16 ***
zipcode98028 4.23e-01 2.33e-02 18.17 < 2e-16 ***
zipcode98029 6.28e-01 2.12e-02 29.65 < 2e-16 ***
zipcode98030 9.51e-02 2.23e-02 4.27 2.0e-05 ***
zipcode98031 9.53e-02 2.31e-02 4.12 3.8e-05 ***
zipcode98032 -8.42e-03 2.90e-02 -0.29 0.77153
zipcode98033 8.20e-01 2.03e-02 40.42 < 2e-16 ***
zipcode98034 5.80e-01 1.87e-02 31.00 < 2e-16 ***
zipcode98038 2.12e-01 1.89e-02 11.20 < 2e-16 ***
zipcode98039 1.28e+00 4.24e-02 30.18 < 2e-16 ***
zipcode98040 9.22e-01 2.28e-02 40.52 < 2e-16 ***
zipcode98042 9.70e-02 1.86e-02 5.21 1.9e-07 ***
zipcode98045 3.45e-01 2.27e-02 15.17 < 2e-16 ***
zipcode98052 6.61e-01 1.91e-02 34.67 < 2e-16 ***
zipcode98053 6.02e-01 1.98e-02 30.41 < 2e-16 ***
zipcode98055 1.79e-01 2.24e-02 8.02 1.2e-15 ***
zipcode98056 3.45e-01 2.03e-02 17.05 < 2e-16 ***
zipcode98058 1.66e-01 1.93e-02 8.58 < 2e-16 ***
zipcode98059 3.70e-01 1.96e-02 18.89 < 2e-16 ***
zipcode98065 4.71e-01 2.14e-02 22.01 < 2e-16 ***
zipcode98070 2.88e-01 3.03e-02 9.50 < 2e-16 ***
zipcode98072 4.77e-01 2.34e-02 20.44 < 2e-16 ***
zipcode98074 5.63e-01 1.98e-02 28.43 < 2e-16 ***
zipcode98075 5.89e-01 2.08e-02 28.27 < 2e-16 ***
zipcode98077 4.27e-01 2.50e-02 17.11 < 2e-16 ***
zipcode98092 5.04e-02 2.08e-02 2.43 0.01529 *
zipcode98102 9.81e-01 3.24e-02 30.29 < 2e-16 ***
zipcode98103 8.62e-01 1.97e-02 43.80 < 2e-16 ***
zipcode98105 9.97e-01 2.36e-02 42.26 < 2e-16 ***
zipcode98106 3.56e-01 2.18e-02 16.39 < 2e-16 ***
zipcode98107 8.91e-01 2.27e-02 39.32 < 2e-16 ***
zipcode98108 3.70e-01 2.50e-02 14.77 < 2e-16 ***
zipcode98109 9.94e-01 3.21e-02 30.97 < 2e-16 ***
zipcode98112 1.07e+00 2.37e-02 45.08 < 2e-16 ***
zipcode98115 8.57e-01 1.94e-02 44.16 < 2e-16 ***
zipcode98116 8.06e-01 2.20e-02 36.59 < 2e-16 ***
zipcode98117 8.35e-01 1.95e-02 42.76 < 2e-16 ***
zipcode98118 4.90e-01 1.97e-02 24.90 < 2e-16 ***
zipcode98119 1.05e+00 2.73e-02 38.50 < 2e-16 ***
zipcode98122 8.70e-01 2.32e-02 37.49 < 2e-16 ***
zipcode98125 5.99e-01 1.99e-02 30.18 < 2e-16 ***
zipcode98126 6.10e-01 2.12e-02 28.81 < 2e-16 ***
zipcode98133 4.64e-01 1.93e-02 24.07 < 2e-16 ***
zipcode98136 7.48e-01 2.36e-02 31.66 < 2e-16 ***
zipcode98144 7.32e-01 2.20e-02 33.25 < 2e-16 ***
zipcode98146 3.11e-01 2.21e-02 14.04 < 2e-16 ***
zipcode98148 1.84e-01 3.93e-02 4.68 3.0e-06 ***
zipcode98155 4.57e-01 1.99e-02 22.97 < 2e-16 ***
zipcode98166 3.57e-01 2.33e-02 15.31 < 2e-16 ***
zipcode98168 4.41e-02 2.26e-02 1.95 0.05084 .
zipcode98177 6.61e-01 2.29e-02 28.92 < 2e-16 ***
zipcode98178 1.64e-01 2.31e-02 7.12 1.2e-12 ***
zipcode98188 1.16e-01 2.64e-02 4.41 1.0e-05 ***
zipcode98198 9.07e-02 2.21e-02 4.11 4.0e-05 ***
zipcode98199 9.16e-01 2.24e-02 40.95 < 2e-16 ***
log(sqft_living) 3.40e-01 1.97e-02 17.28 < 2e-16 ***
log(sqft_lot) 6.76e-02 4.11e-03 16.45 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.188 on 9908 degrees of freedom
Multiple R-squared: 0.876, Adjusted R-squared: 0.874
F-statistic: 766 on 91 and 9908 DF, p-value: <2e-16
From the summary of the model:
• The F-statistic of 766 is significant at the 0.001 level, so this model is also useful.
• The adjusted $R^2$ value shows the model explains 87.4% of the variance in log(price), which suggests the log transformations help the model explain more of the variation.
• The p-value for sqft_lot (0.69) is above the 0.05 significance level, so the null hypothesis (this variable does not help explain the variance) cannot be rejected. Run step to remove it and anything else that is not significant.
Run step to remove unnecessary variables
In [33]:
fit2 <- step(fit2)
summary(fit2)
Start: AIC=-33306
log(price) ~ bedrooms + bathrooms + sqft_living + sqft_lot +
waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot)
Df Sum of Sq RSS AIC
- sqft_lot 1 0 351 -33308
- yr_built 1 1 352 -33281
- sqft_living 1 1 352 -33272
- bedrooms 1 1 353 -33270
- bathrooms 1 2 354 -33238
- condition 4 9 361 -33047
- log(sqft_lot) 1 10 361 -33038
- log(sqft_living) 1 11 362 -33011
- waterfront 1 30 381 -32484
- grade 10 59 410 -31768
- zipcode 69 622 973 -23256
Step: AIC=-33308
log(price) ~ bedrooms + bathrooms + sqft_living + waterfront +
condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot)
Df Sum of Sq RSS AIC
- yr_built 1 1 352 -33281
- sqft_living 1 1 352 -33274
- bedrooms 1 1 353 -33272
- bathrooms 1 2 354 -33240
- condition 4 9 361 -33049
- log(sqft_living) 1 11 362 -33010
- log(sqft_lot) 1 16 368 -32851
- waterfront 1 30 381 -32484
- grade 10 59 411 -31766
- zipcode 69 631 982 -23160
Call:
lm(formula = log(price) ~ bedrooms + bathrooms + sqft_living +
waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot), data = houseData)
Residuals:
Min 1Q Median 3Q Max
-1.0464 -0.1022 0.0026 0.1039 1.1229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.03e+01 2.91e-01 35.56 < 2e-16 ***
bedrooms -1.75e-02 2.85e-03 -6.12 9.4e-10 ***
bathrooms 3.75e-02 4.50e-03 8.34 < 2e-16 ***
sqft_living 5.53e-05 9.32e-06 5.94 2.9e-09 ***
waterfront1 6.56e-01 2.25e-02 29.21 < 2e-16 ***
condition2 1.09e-01 4.84e-02 2.25 0.02464 *
condition3 2.69e-01 4.44e-02 6.06 1.4e-09 ***
condition4 2.99e-01 4.44e-02 6.74 1.7e-11 ***
condition5 3.61e-01 4.48e-02 8.05 9.1e-16 ***
grade4 -4.71e-01 1.43e-01 -3.28 0.00103 **
grade5 -5.06e-01 1.35e-01 -3.74 0.00018 ***
grade6 -4.31e-01 1.35e-01 -3.20 0.00136 **
grade7 -3.36e-01 1.35e-01 -2.49 0.01269 *
grade8 -2.15e-01 1.35e-01 -1.59 0.11120
grade9 -6.90e-02 1.35e-01 -0.51 0.60947
grade10 2.91e-02 1.35e-01 0.22 0.82950
grade11 1.60e-01 1.36e-01 1.18 0.23822
grade12 2.93e-01 1.39e-01 2.12 0.03438 *
grade13 4.35e-01 1.68e-01 2.60 0.00947 **
yr_built -5.77e-04 1.08e-04 -5.34 9.7e-08 ***
zipcode98002 -1.83e-03 2.48e-02 -0.07 0.94113
zipcode98003 3.20e-02 2.25e-02 1.42 0.15577
zipcode98004 1.15e+00 2.16e-02 52.96 < 2e-16 ***
zipcode98005 7.15e-01 2.49e-02 28.68 < 2e-16 ***
zipcode98006 6.86e-01 1.96e-02 34.93 < 2e-16 ***
zipcode98007 6.74e-01 2.79e-02 24.18 < 2e-16 ***
zipcode98008 6.83e-01 2.23e-02 30.65 < 2e-16 ***
zipcode98010 2.14e-01 3.37e-02 6.35 2.2e-10 ***
zipcode98011 4.71e-01 2.51e-02 18.77 < 2e-16 ***
zipcode98014 3.17e-01 2.86e-02 11.11 < 2e-16 ***
zipcode98019 3.39e-01 2.64e-02 12.84 < 2e-16 ***
zipcode98022 8.58e-02 2.42e-02 3.54 0.00040 ***
zipcode98023 -4.45e-03 1.90e-02 -0.23 0.81518
zipcode98024 4.31e-01 3.19e-02 13.49 < 2e-16 ***
zipcode98027 5.29e-01 1.98e-02 26.70 < 2e-16 ***
zipcode98028 4.23e-01 2.33e-02 18.18 < 2e-16 ***
zipcode98029 6.28e-01 2.12e-02 29.65 < 2e-16 ***
zipcode98030 9.51e-02 2.23e-02 4.27 2.0e-05 ***
zipcode98031 9.53e-02 2.31e-02 4.12 3.9e-05 ***
zipcode98032 -8.50e-03 2.90e-02 -0.29 0.76943
zipcode98033 8.20e-01 2.03e-02 40.42 < 2e-16 ***
zipcode98034 5.80e-01 1.87e-02 31.00 < 2e-16 ***
zipcode98038 2.12e-01 1.89e-02 11.19 < 2e-16 ***
zipcode98039 1.28e+00 4.24e-02 30.19 < 2e-16 ***
zipcode98040 9.23e-01 2.28e-02 40.52 < 2e-16 ***
zipcode98042 9.69e-02 1.86e-02 5.21 1.9e-07 ***
zipcode98045 3.45e-01 2.27e-02 15.17 < 2e-16 ***
zipcode98052 6.61e-01 1.91e-02 34.67 < 2e-16 ***
zipcode98053 6.02e-01 1.98e-02 30.41 < 2e-16 ***
zipcode98055 1.79e-01 2.24e-02 8.01 1.3e-15 ***
zipcode98056 3.45e-01 2.03e-02 17.04 < 2e-16 ***
zipcode98058 1.66e-01 1.93e-02 8.57 < 2e-16 ***
zipcode98059 3.70e-01 1.96e-02 18.89 < 2e-16 ***
zipcode98065 4.70e-01 2.14e-02 22.02 < 2e-16 ***
zipcode98070 2.87e-01 3.02e-02 9.49 < 2e-16 ***
zipcode98072 4.77e-01 2.33e-02 20.45 < 2e-16 ***
zipcode98074 5.63e-01 1.98e-02 28.44 < 2e-16 ***
zipcode98075 5.89e-01 2.08e-02 28.28 < 2e-16 ***
zipcode98077 4.27e-01 2.50e-02 17.12 < 2e-16 ***
zipcode98092 5.03e-02 2.08e-02 2.42 0.01551 *
zipcode98102 9.80e-01 3.22e-02 30.41 < 2e-16 ***
zipcode98103 8.61e-01 1.95e-02 44.21 < 2e-16 ***
zipcode98105 9.96e-01 2.35e-02 42.45 < 2e-16 ***
zipcode98106 3.56e-01 2.17e-02 16.40 < 2e-16 ***
zipcode98107 8.90e-01 2.25e-02 39.60 < 2e-16 ***
zipcode98108 3.69e-01 2.50e-02 14.78 < 2e-16 ***
zipcode98109 9.92e-01 3.19e-02 31.08 < 2e-16 ***
zipcode98112 1.07e+00 2.35e-02 45.30 < 2e-16 ***
zipcode98115 8.56e-01 1.93e-02 44.33 < 2e-16 ***
zipcode98116 8.06e-01 2.19e-02 36.72 < 2e-16 ***
zipcode98117 8.34e-01 1.94e-02 42.99 < 2e-16 ***
zipcode98118 4.90e-01 1.96e-02 24.96 < 2e-16 ***
zipcode98119 1.05e+00 2.71e-02 38.68 < 2e-16 ***
zipcode98122 8.69e-01 2.30e-02 37.78 < 2e-16 ***
zipcode98125 5.99e-01 1.98e-02 30.20 < 2e-16 ***
zipcode98126 6.10e-01 2.11e-02 28.87 < 2e-16 ***
zipcode98133 4.63e-01 1.92e-02 24.09 < 2e-16 ***
zipcode98136 7.47e-01 2.36e-02 31.73 < 2e-16 ***
zipcode98144 7.31e-01 2.19e-02 33.44 < 2e-16 ***
zipcode98146 3.10e-01 2.21e-02 14.03 < 2e-16 ***
zipcode98148 1.84e-01 3.93e-02 4.67 3.0e-06 ***
zipcode98155 4.57e-01 1.99e-02 22.96 < 2e-16 ***
zipcode98166 3.57e-01 2.33e-02 15.31 < 2e-16 ***
zipcode98168 4.40e-02 2.26e-02 1.95 0.05136 .
zipcode98177 6.61e-01 2.29e-02 28.92 < 2e-16 ***
zipcode98178 1.64e-01 2.31e-02 7.11 1.3e-12 ***
zipcode98188 1.16e-01 2.64e-02 4.41 1.1e-05 ***
zipcode98198 9.06e-02 2.21e-02 4.11 4.1e-05 ***
zipcode98199 9.16e-01 2.23e-02 41.07 < 2e-16 ***
log(sqft_living) 3.40e-01 1.96e-02 17.37 < 2e-16 ***
log(sqft_lot) 6.65e-02 3.08e-03 21.57 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.188 on 9909 degrees of freedom
Multiple R-squared: 0.876, Adjusted R-squared: 0.874
F-statistic: 775 on 90 and 9909 DF, p-value: <2e-16
Step has removed sqft_lot, but nothing else
The $R^2$ value is still 87.4%.
The $R^2$ value is calculated on the predicted log of the price, so calculate the $R^2$ for the prices using the Model.Accuracy function defined above
In [34]:
houseData.predict <- exp(fit2$fitted.values)
cat("R-Squared:",Model.Accuracy(houseData.predict,houseData$price,fit2$df.residual,length(coef(fit2))-1)$rsquared)
R-Squared: 0.8686
That has lowered $R^2$ slightly to 86.9%, but that is still about 4 percentage points higher than the first model (82.7%).
Check the residuals using the plot function
In [35]:
par(mfrow=c(2,2))
plot(fit2)

That has improved the residual plots. The residuals are more evenly distributed and the Scale-Location chart shows the variance is nearly equal. The Normal Q-Q plot shows that the residuals are not quite normally distributed, though.
Estimating the log of price rather than price directly and using the log of sqft_living and sqft_lot has helped meet the linear regression assumptions and improved the $R^2$ value by about 4 percentage points.
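Because the response is now log(price), each coefficient acts multiplicatively on the price. The sketch below shows the arithmetic for two estimates copied from the fit2 summary above (waterfront1 and log(sqft_living)). It is an illustration of how such coefficients are usually read, with variable names chosen here, and it considers the log(sqft_living) term alone, ignoring the separate untransformed sqft_living term.

```r
# Estimates copied from the fit2 summary output above
coef.waterfront <- 0.656   # waterfront1
coef.log.sqft   <- 0.340   # log(sqft_living)

# A dummy variable's coefficient b multiplies price by exp(b)
waterfront.pct <- (exp(coef.waterfront) - 1) * 100

# For a log-log term, a 10% increase in the predictor scales price by 1.10^b
sqft.pct <- (1.10^coef.log.sqft - 1) * 100

cat("waterfront effect on price: +", round(waterfront.pct, 1), "%\n", sep = "")
cat("10% more living space:      +", round(sqft.pct, 1), "%\n", sep = "")
```

Both figures hold the other variables in the model fixed.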
Correlation and Interaction¶
What about correlation between the variables?
Add some interaction terms for the correlated variables: sqft_living, bathrooms and grade. Also add terms for the negative correlation between yr_built and condition and for the correlation between waterfront and zipcode.
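In an R formula, a:b adds only the interaction between a and b, while a*b is shorthand for a + b + a:b. When two factors are interacted and some combinations of levels never occur in the data, lm cannot estimate those terms and reports them as not defined because of singularities. A minimal sketch with a made-up data frame (toy and its values are hypothetical) in which one zipcode has no waterfront sales:

```r
# Hypothetical toy data: no waterfront sales were recorded in zipcode "B"
toy <- data.frame(
  price      = c(300, 320, 500, 520, 280, 290, 310, 305),
  waterfront = factor(c(0, 0, 1, 1, 0, 0, 0, 0)),
  zipcode    = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
)

# Main effects plus the waterfront-by-zipcode interaction
fit <- lm(price ~ waterfront + zipcode + waterfront:zipcode, data = toy)

# The interaction column is all zeros because the (1, B) cell is empty,
# so lm cannot estimate it and reports NA
print(coef(fit))
```

With many zipcodes and few waterfront properties, a large number of such empty cells is to be expected.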
In [36]:
fit3 <- lm(log(price) ~ bedrooms + bathrooms + sqft_living + waterfront + condition + grade +
yr_built + zipcode + log(sqft_living) + log(sqft_lot) + sqft_living:grade +
sqft_living:bathrooms + grade:bathrooms + yr_built:condition + waterfront:zipcode,
data = houseData)
summary(fit3)
Call:
lm(formula = log(price) ~ bedrooms + bathrooms + sqft_living +
waterfront + condition + grade + yr_built + zipcode + log(sqft_living) +
log(sqft_lot) + sqft_living:grade + sqft_living:bathrooms +
grade:bathrooms + yr_built:condition + waterfront:zipcode,
data = houseData)
Residuals:
Min 1Q Median 3Q Max
-1.0209 -0.0976 0.0024 0.1009 1.1332
Coefficients: (48 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.16e+00 5.28e+00 -0.98 0.32902
bedrooms -1.64e-02 2.83e-03 -5.78 7.8e-09 ***
bathrooms 1.02e-01 1.55e-01 0.66 0.51230
sqft_living 8.15e-04 3.61e-03 0.23 0.82161
waterfront1 5.10e-01 8.47e-02 6.03 1.7e-09 ***
condition2 1.19e+01 5.14e+00 2.31 0.02104 *
condition3 1.29e+01 4.87e+00 2.64 0.00838 **
condition4 1.59e+01 4.88e+00 3.26 0.00110 **
condition5 1.58e+01 4.90e+00 3.21 0.00131 **
grade4 2.93e-01 2.06e+00 0.14 0.88673
grade5 1.38e-01 2.05e+00 0.07 0.94624
grade6 1.99e-01 2.05e+00 0.10 0.92281
grade7 1.65e-01 2.05e+00 0.08 0.93579
grade8 1.90e-01 2.05e+00 0.09 0.92597
grade9 3.44e-01 2.05e+00 0.17 0.86647
grade10 3.69e-01 2.05e+00 0.18 0.85689
grade11 5.73e-01 2.05e+00 0.28 0.77943
grade12 1.27e+00 2.04e+00 0.62 0.53434
grade13 1.60e+00 2.15e+00 0.74 0.45700
yr_built 6.46e-03 2.52e-03 2.56 0.01043 *
zipcode98002 -1.16e-03 2.44e-02 -0.05 0.96218
zipcode98003 3.84e-02 2.22e-02 1.73 0.08348 .
zipcode98004 1.15e+00 2.14e-02 53.97 < 2e-16 ***
zipcode98005 7.27e-01 2.46e-02 29.54 < 2e-16 ***
zipcode98006 6.96e-01 1.95e-02 35.80 < 2e-16 ***
zipcode98007 6.94e-01 2.75e-02 25.21 < 2e-16 ***
zipcode98008 6.91e-01 2.21e-02 31.30 < 2e-16 ***
zipcode98010 2.19e-01 3.32e-02 6.60 4.2e-11 ***
zipcode98011 4.76e-01 2.47e-02 19.23 < 2e-16 ***
zipcode98014 3.16e-01 2.82e-02 11.20 < 2e-16 ***
zipcode98019 3.35e-01 2.60e-02 12.88 < 2e-16 ***
zipcode98022 8.97e-02 2.39e-02 3.75 0.00018 ***
zipcode98023 2.65e-03 1.88e-02 0.14 0.88815
zipcode98024 4.37e-01 3.15e-02 13.87 < 2e-16 ***
zipcode98027 5.35e-01 1.96e-02 27.32 < 2e-16 ***
zipcode98028 4.30e-01 2.30e-02 18.71 < 2e-16 ***
zipcode98029 6.41e-01 2.09e-02 30.69 < 2e-16 ***
zipcode98030 9.90e-02 2.20e-02 4.50 6.7e-06 ***
zipcode98031 1.04e-01 2.28e-02 4.58 4.8e-06 ***
zipcode98032 -4.17e-04 2.86e-02 -0.01 0.98837
zipcode98033 8.33e-01 2.01e-02 41.52 < 2e-16 ***
zipcode98034 5.87e-01 1.85e-02 31.74 < 2e-16 ***
zipcode98038 2.10e-01 1.87e-02 11.23 < 2e-16 ***
zipcode98039 1.32e+00 4.36e-02 30.40 < 2e-16 ***
zipcode98040 9.39e-01 2.29e-02 41.08 < 2e-16 ***
zipcode98042 1.04e-01 1.84e-02 5.69 1.3e-08 ***
zipcode98045 3.49e-01 2.24e-02 15.55 < 2e-16 ***
zipcode98052 6.73e-01 1.88e-02 35.72 < 2e-16 ***
zipcode98053 6.05e-01 1.95e-02 30.99 < 2e-16 ***
zipcode98055 1.86e-01 2.20e-02 8.46 < 2e-16 ***
zipcode98056 3.51e-01 2.01e-02 17.43 < 2e-16 ***
zipcode98058 1.74e-01 1.91e-02 9.14 < 2e-16 ***
zipcode98059 3.71e-01 1.93e-02 19.19 < 2e-16 ***
zipcode98065 4.68e-01 2.11e-02 22.20 < 2e-16 ***
zipcode98070 3.75e-01 3.23e-02 11.59 < 2e-16 ***
zipcode98072 4.89e-01 2.30e-02 21.24 < 2e-16 ***
zipcode98074 5.63e-01 1.96e-02 28.68 < 2e-16 ***
zipcode98075 5.83e-01 2.07e-02 28.17 < 2e-16 ***
zipcode98077 4.38e-01 2.47e-02 17.78 < 2e-16 ***
zipcode98092 5.62e-02 2.05e-02 2.75 0.00604 **
zipcode98102 1.01e+00 3.20e-02 31.57 < 2e-16 ***
zipcode98103 8.75e-01 1.93e-02 45.39 < 2e-16 ***
zipcode98105 1.01e+00 2.33e-02 43.29 < 2e-16 ***
zipcode98106 3.67e-01 2.14e-02 17.14 < 2e-16 ***
zipcode98107 9.05e-01 2.22e-02 40.73 < 2e-16 ***
zipcode98108 3.77e-01 2.46e-02 15.31 < 2e-16 ***
zipcode98109 1.01e+00 3.15e-02 31.98 < 2e-16 ***
zipcode98112 1.08e+00 2.33e-02 46.42 < 2e-16 ***
zipcode98115 8.62e-01 1.91e-02 45.27 < 2e-16 ***
zipcode98116 8.17e-01 2.16e-02 37.77 < 2e-16 ***
zipcode98117 8.41e-01 1.91e-02 43.94 < 2e-16 ***
zipcode98118 4.99e-01 1.94e-02 25.70 < 2e-16 ***
zipcode98119 1.07e+00 2.68e-02 39.99 < 2e-16 ***
zipcode98122 8.91e-01 2.28e-02 39.16 < 2e-16 ***
zipcode98125 6.07e-01 1.96e-02 30.93 < 2e-16 ***
zipcode98126 6.18e-01 2.08e-02 29.66 < 2e-16 ***
zipcode98133 4.72e-01 1.90e-02 24.86 < 2e-16 ***
zipcode98136 7.54e-01 2.34e-02 32.20 < 2e-16 ***
zipcode98144 7.37e-01 2.17e-02 34.05 < 2e-16 ***
zipcode98146 3.20e-01 2.19e-02 14.61 < 2e-16 ***
zipcode98148 1.88e-01 3.87e-02 4.86 1.2e-06 ***
zipcode98155 4.59e-01 1.97e-02 23.31 < 2e-16 ***
zipcode98166 3.77e-01 2.34e-02 16.13 < 2e-16 ***
zipcode98168 5.46e-02 2.23e-02 2.45 0.01442 *
zipcode98177 6.68e-01 2.26e-02 29.62 < 2e-16 ***
zipcode98178 1.68e-01 2.29e-02 7.35 2.2e-13 ***
zipcode98188 1.20e-01 2.60e-02 4.63 3.7e-06 ***
zipcode98198 1.14e-01 2.20e-02 5.17 2.4e-07 ***
zipcode98199 9.28e-01 2.20e-02 42.20 < 2e-16 ***
log(sqft_living) 5.63e-01 4.25e-02 13.25 < 2e-16 ***
log(sqft_lot) 6.85e-02 3.10e-03 22.05 < 2e-16 ***
sqft_living:grade4 -1.53e-03 3.61e-03 -0.42 0.67172
sqft_living:grade5 -1.13e-03 3.61e-03 -0.31 0.75525
sqft_living:grade6 -1.05e-03 3.61e-03 -0.29 0.77201
sqft_living:grade7 -9.36e-04 3.61e-03 -0.26 0.79546
sqft_living:grade8 -8.49e-04 3.61e-03 -0.24 0.81407
sqft_living:grade9 -8.62e-04 3.61e-03 -0.24 0.81137
sqft_living:grade10 -8.66e-04 3.61e-03 -0.24 0.81034
sqft_living:grade11 -8.64e-04 3.61e-03 -0.24 0.81087
sqft_living:grade12 -9.17e-04 3.61e-03 -0.25 0.79942
sqft_living:grade13 -1.04e-03 3.66e-03 -0.29 0.77542
bathrooms:sqft_living 1.26e-05 5.11e-06 2.47 0.01350 *
bathrooms:grade4 1.17e-01 2.49e-01 0.47 0.63689
bathrooms:grade5 -2.94e-02 1.68e-01 -0.18 0.86062
bathrooms:grade6 -7.08e-02 1.54e-01 -0.46 0.64576
bathrooms:grade7 -7.86e-02 1.52e-01 -0.52 0.60606
bathrooms:grade8 -1.16e-01 1.52e-01 -0.77 0.44336
bathrooms:grade9 -1.14e-01 1.51e-01 -0.75 0.45119
bathrooms:grade10 -8.55e-02 1.50e-01 -0.57 0.56910
bathrooms:grade11 -1.12e-01 1.50e-01 -0.75 0.45387
bathrooms:grade12 -1.83e-01 1.53e-01 -1.20 0.23124
bathrooms:grade13 NA NA NA NA
condition2:yr_built -6.09e-03 2.66e-03 -2.29 0.02193 *
condition3:yr_built -6.52e-03 2.52e-03 -2.58 0.00978 **
condition4:yr_built -8.07e-03 2.52e-03 -3.20 0.00140 **
condition5:yr_built -7.96e-03 2.54e-03 -3.14 0.00171 **
waterfront1:zipcode98002 NA NA NA NA
waterfront1:zipcode98003 NA NA NA NA
waterfront1:zipcode98004 NA NA NA NA
waterfront1:zipcode98005 NA NA NA NA
waterfront1:zipcode98006 1.53e-01 1.57e-01 0.98 0.32861
waterfront1:zipcode98007 NA NA NA NA
waterfront1:zipcode98008 3.92e-01 1.62e-01 2.42 0.01562 *
waterfront1:zipcode98010 NA NA NA NA
waterfront1:zipcode98011 NA NA NA NA
waterfront1:zipcode98014 NA NA NA NA
waterfront1:zipcode98019 NA NA NA NA
waterfront1:zipcode98022 NA NA NA NA
waterfront1:zipcode98023 3.92e-01 2.04e-01 1.92 0.05482 .
waterfront1:zipcode98024 NA NA NA NA
waterfront1:zipcode98027 5.61e-02 1.59e-01 0.35 0.72441
waterfront1:zipcode98028 NA NA NA NA
waterfront1:zipcode98029 NA NA NA NA
waterfront1:zipcode98030 NA NA NA NA
waterfront1:zipcode98031 NA NA NA NA
waterfront1:zipcode98032 NA NA NA NA
waterfront1:zipcode98033 2.68e-01 1.60e-01 1.68 0.09343 .
waterfront1:zipcode98034 4.65e-01 1.57e-01 2.96 0.00305 **
waterfront1:zipcode98038 NA NA NA NA
waterfront1:zipcode98039 -1.91e-01 2.08e-01 -0.92 0.35956
waterfront1:zipcode98040 1.86e-01 1.12e-01 1.66 0.09635 .
waterfront1:zipcode98042 NA NA NA NA
waterfront1:zipcode98045 NA NA NA NA
waterfront1:zipcode98052 1.63e-01 2.06e-01 0.79 0.42780
waterfront1:zipcode98053 NA NA NA NA
waterfront1:zipcode98055 NA NA NA NA
waterfront1:zipcode98056 5.63e-01 1.57e-01 3.57 0.00035 ***
waterfront1:zipcode98058 NA NA NA NA
waterfront1:zipcode98059 NA NA NA NA
waterfront1:zipcode98065 NA NA NA NA
waterfront1:zipcode98070 -2.27e-01 1.04e-01 -2.18 0.02925 *
waterfront1:zipcode98072 NA NA NA NA
waterfront1:zipcode98074 4.68e-01 1.19e-01 3.92 9.0e-05 ***
waterfront1:zipcode98075 4.01e-01 1.15e-01 3.50 0.00047 ***
waterfront1:zipcode98077 NA NA NA NA
waterfront1:zipcode98092 NA NA NA NA
waterfront1:zipcode98102 NA NA NA NA
waterfront1:zipcode98103 NA NA NA NA
waterfront1:zipcode98105 -2.52e-02 1.39e-01 -0.18 0.85587
waterfront1:zipcode98106 NA NA NA NA
waterfront1:zipcode98107 NA NA NA NA
waterfront1:zipcode98108 NA NA NA NA
waterfront1:zipcode98109 NA NA NA NA
waterfront1:zipcode98112 NA NA NA NA
waterfront1:zipcode98115 NA NA NA NA
waterfront1:zipcode98116 NA NA NA NA
waterfront1:zipcode98117 NA NA NA NA
waterfront1:zipcode98118 5.94e-02 1.38e-01 0.43 0.66586
waterfront1:zipcode98119 NA NA NA NA
waterfront1:zipcode98122 NA NA NA NA
waterfront1:zipcode98125 3.90e-01 1.37e-01 2.84 0.00446 **
waterfront1:zipcode98126 NA NA NA NA
waterfront1:zipcode98133 NA NA NA NA
waterfront1:zipcode98136 1.47e-01 1.39e-01 1.06 0.28909
waterfront1:zipcode98144 4.39e-01 1.57e-01 2.80 0.00519 **
waterfront1:zipcode98146 -2.68e-02 1.57e-01 -0.17 0.86429
waterfront1:zipcode98148 NA NA NA NA
waterfront1:zipcode98155 3.21e-01 1.37e-01 2.34 0.01932 *
waterfront1:zipcode98166 -1.43e-01 1.16e-01 -1.24 0.21589
waterfront1:zipcode98168 NA NA NA NA
waterfront1:zipcode98177 NA NA NA NA
waterfront1:zipcode98178 4.44e-01 1.57e-01 2.82 0.00478 **
waterfront1:zipcode98188 NA NA NA NA
waterfront1:zipcode98198 NA NA NA NA
waterfront1:zipcode98199 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.185 on 9863 degrees of freedom
Multiple R-squared: 0.88, Adjusted R-squared: 0.878
F-statistic: 532 on 136 and 9863 DF, p-value: <2e-16
Several of these variables and interactions look like they are not significant:
• bathrooms
• sqft_living
• grade
• sqft_living:grade
• bathrooms:grade
The bathrooms:grade interaction will be removed in the next iteration of the model. The sqft_living:grade interaction becomes more significant in the next iteration and so is retained.
As bathrooms:sqft_living is significant and sqft_living:grade ends up being significant, sqft_living, grade and bathrooms need to remain in the model.
There are also many NA coefficients, caused by zipcodes that do not have any waterfront properties, and warnings about singularities, caused by factors or interactions between factors that have only one record.
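The NA coefficients can be reproduced on a small scale. The sketch below uses a made-up data frame (the names toy and toy.fit and all values are illustrative, not taken from the house data) in which one zipcode has no waterfront records, so the corresponding interaction column is all zeros and lm cannot estimate it.

```r
# Toy data (hypothetical): zipcode "B" has no waterfront properties
toy <- data.frame(
  price      = c(300, 320, 500, 540, 280, 290, 310, 305),
  waterfront = factor(c(0, 0, 1, 1, 0, 0, 0, 0)),
  zipcode    = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
)

toy.fit <- lm(price ~ waterfront * zipcode, data = toy)

# The waterfront1:zipcodeB coefficient is NA because no row has
# waterfront == 1 and zipcode == "B", so its design column is all zeros
print(coef(toy.fit))

# Count the observations in each waterfront/zipcode cell
print(table(toy$waterfront, toy$zipcode))
```

The cell count table is a quick way to see which interaction levels have no data before fitting the model.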
Check the $R^2$ value using the Model.Accuracy function and the residual plots
In [37]:
houseData.predict <- exp(fit3$fitted.values)
cat("R-Squared:",Model.Accuracy(houseData.predict,houseData$price,9863,136)$rsquared)
par(mfrow=c(2,2))
plot(fit3)
R-Squared: 0.885
Warning message:
“not plotting observations with leverage one:
  513, 639, 8099”
Warning message:
“not plotting observations with leverage one:
  513, 639, 8099”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”

The $R^2$ value has improved by another 2% and the residual plots still look good, but plotting them produced a couple of warning messages. Look at the records referenced in the warnings.
In [38]:
predict.price <- houseData.predict[c(513, 639, 8099)]
cbind(predict.price,houseData[c(513, 639, 8099),])
     predict.price   price bedrooms bathrooms sqft_living sqft_lot waterfront condition grade yr_built zipcode
513         629000  629000        3      1.75        1460    12367          1         4     8     1970   98023
639        2200000 2200000        5      4.25        4640    22703          1         5     8     1952   98052
8099        280000  280000        1      0.00         600    24501          0         2     3     1950   98045
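These are records for which the model has made a completely accurate prediction, which is expected for observations with a leverage of one because each of them fully determines its own fitted value. Two of these are waterfront properties, and the third looks unusual because it has no bathrooms, so it may be an outlier.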
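One way an observation can end up with a leverage of one is by being the only record in a factor level or interaction cell. The sketch below illustrates this with a made-up data frame (toy, toy.fit and the values are hypothetical); it is not a diagnosis of the three records above.

```r
# Toy data (hypothetical): level "C" of the factor has a single record
toy <- data.frame(
  y = c(10, 12, 11, 20, 22, 21, 35),
  g = factor(c("A", "A", "A", "B", "B", "B", "C"))
)
toy.fit <- lm(y ~ g, data = toy)

# Leverage (hat value) of each observation; the lone "C" record has leverage 1
lev <- hatvalues(toy.fit)
print(round(lev, 3))

# Its fitted value equals the observed value, so its residual is zero
print(round(residuals(toy.fit), 3))
```

Because the coefficient for level "C" is estimated from that single record alone, the fit passes through it exactly, which is why such points carry no information about how well the model predicts.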
Generate some new Variables¶
Generating new variables will allow the linear regression to better model some of the interactions. The following three variables will be added:
A Variable to handle the Waterfront:Zipcode Interaction¶
This variable allows the model to include the waterfront:zipcode interaction. It is set as follows:
• If the property is not a waterfront property, set wf.zipcode to the zipcode
• If the property is a waterfront property and the zipcode has more than two waterfront properties, then set wf.zipcode to the zipcode appended with the characters "-1"
• The remaining waterfront properties are grouped according to whether the zipcode is a low-, mid- or high-priced zipcode.
• During model evaluation, some of the waterfront properties in the individual zipcodes were found to be influential observations. The zipcodes containing these properties have been added to the groups to lessen the influence of these observations.
A Variable to Group Year Built by Decade¶
This variable groups yr_built by decade and is factorised to allow the regression to model changes over time that cannot be captured by a mathematical expression.
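As a small worked example of the decade mapping, the sketch below applies the same expression that the GenerateVariables function uses to a handful of made-up build years (the yr_built values here are chosen only to cover the edge cases).

```r
# Example build years (hypothetical values chosen to cover the edge cases)
yr_built <- c(1895, 1900, 1909, 1910, 1959, 2005, 2014)

# Same expression as in GenerateVariables: years before 1900 and the 1900s
# both map to 0, later decades map to 1, 2, ... and years after 2019 to 11
decade <- ifelse(yr_built < 1900, 0,
                 ifelse(yr_built > 2019, 11, trunc(yr_built / 10) - 190))

print(data.frame(yr_built, decade))
```

Note that houses built before 1900 share decade 0 with those built from 1900 to 1909, and any year after 2019 is folded into decade 11 along with the 2010s.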
A Variable to Classify Houses by Bedrooms¶
This variable classifies houses into two groups by number of bedrooms:
1. Houses with four or fewer bedrooms
2. Houses with more than four bedrooms
Analysis of the waterfront properties by zipcode
In [39]:
wf1 <- aggregate(houseData$zipcode[houseData$waterfront==1], list(houseData$zipcode[houseData$waterfront==1]), NROW)
names(wf1) <- c("zipcode","wf.count")
wf2 <- aggregate(houseData$price[houseData$waterfront==1], list(houseData$zipcode[houseData$waterfront==1]), mean)
names(wf2) <- c("zipcode","price.mean")
#wf3 <- aggregate(houseData$price[houseData$waterfront==1], list(houseData$zipcode[houseData$waterfront==1]), sd)
#names(wf3) <- c("zipcode","price.sd")
wf3 <- cbind(wf1,wf2["price.mean"])
#wf4 <- cbind(wf4,wf3["price.sd"])
wf4 <- wf3[wf3$wf.count > 2 & !(wf3$zipcode %in% c(98125,98074,98155,98118)),]
cat("Individual waterfront zipcodes:\n")
print(wf4)
wf5 <- wf3[!(wf3$zipcode %in% wf4$zipcode),]
cat("\nLow-price waterfront zipcodes:\n")
print(wf5[wf5$price.mean <= 1000000,])
cat("\nMid-price waterfront zipcodes:\n")
print(wf5[wf5$price.mean > 1000000 & wf5$price.mean <= 2400000,])
cat("\nHigh-price waterfront zipcodes:\n")
print(wf5[wf5$price.mean > 2400000,])
Individual waterfront zipcodes:
zipcode wf.count price.mean
8 98040 7 3072143
11 98070 12 639633
13 98075 6 1854833
14 98105 3 3051667
17 98136 3 1165000
21 98166 6 1021417
23 98198 5 736400
Low-price waterfront zipcodes:
zipcode wf.count price.mean
3 98023 1 629000
19 98146 2 590750
Mid-price waterfront zipcodes:
zipcode wf.count price.mean
1 98006 2 1825500
4 98027 2 2400000
9 98052 1 2200000
12 98074 5 1996600
15 98118 3 1750167
16 98125 3 1281667
20 98155 3 1608333
22 98178 2 1400000
High-price waterfront zipcodes:
zipcode wf.count price.mean
2 98008 2 3422500
5 98033 2 4252900
6 98034 2 2477500
7 98039 1 3640900
10 98056 2 2615000
18 98144 2 2750000
Function to Generate New Variables¶
Name: GenerateVariables
Input parameters:
• data – a dataframe containing the house price data
Return value:
• The modified dataframe with the extra variables added
Description
Adds the following variables to the dataframe:
• wf.zipcode – combines the zipcode and waterfront variables into a single variable
• decade – a factor representing the decade in which the house was built
• bedroom.class – a factor indicating whether the house has no more than four bedrooms or more than four bedrooms
In [40]:
GenerateVariables <- function(data) {
    # Make a copy of the dataframe
    newdata <- data
    # Generate the wf.zipcode variable
    ## Zipcodes treated individually
    wfzipcodes <- c(98040,98070,98075,98105,98136,98166,98198)
    ## Define the zipcode groups
    wfziplow <- c(98023,98146)
    wfzipmid <- c(98006,98027,98052,98074,98118,98125,98155,98178)
    wfziphigh <- c(98008,98033,98034,98039,98056,98144)
    ## Create the new variable
    newdata$wf.zipcode <- as.character(newdata$zipcode)
    ## Flag the waterfront properties; a logical mask is used instead of row
    ## names so the function also works when the row names are not 1..n
    is.wf <- newdata$waterfront == 1
    wf.own <- is.wf & newdata$zipcode %in% wfzipcodes
    newdata$wf.zipcode[wf.own] <- paste(newdata$zipcode[wf.own],"1",sep="-")
    newdata$wf.zipcode[is.wf & newdata$zipcode %in% wfziplow] <- "wf-low"
    newdata$wf.zipcode[is.wf & newdata$zipcode %in% wfzipmid] <- "wf-mid"
    newdata$wf.zipcode[is.wf & newdata$zipcode %in% wfziphigh] <- "wf-high"
    newdata$wf.zipcode <- as.factor(newdata$wf.zipcode)
    # Generate the decade variable (0 = built before 1910, 11 = built 2010 or later)
    newdata$decade <- as.factor(ifelse(newdata$yr_built < 1900, 0,
        ifelse(newdata$yr_built > 2019, 11, trunc(newdata$yr_built/10)-190)))
    # Generate the bedroom.class variable
    newdata$bedroom.class <- as.factor(ifelse(newdata$bedrooms < 5, "4minus", "5plus"))
    # Return the modified dataframe
    return(newdata)
}
Generate the New Variables¶
In [41]:
houseData <- GenerateVariables(houseData)
str(houseData)
'data.frame': 10000 obs. of 13 variables:
$ price : int 211000 265000 1440000 800000 1059500 750000 229000 271115 428000 1240000 ...
$ bedrooms : int 4 3 3 4 5 2 3 2 3 4 ...
$ bathrooms : num 1 2.5 3.5 3.5 3.25 1 1.5 1.5 2.25 3.5 ...
$ sqft_living : int 2100 1530 3870 2370 3230 1620 1200 830 2600 3820 ...
$ sqft_lot : int 9200 6000 3819 3302 3825 6120 5000 1325 15000 13224 ...
$ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ condition : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 3 3 3 3 3 3 3 ...
$ grade : Factor w/ 11 levels "3","4","5","6",..: 5 5 9 6 7 5 4 5 7 8 ...
$ yr_built : int 1959 1991 2002 1926 2014 1951 1979 2005 1978 1990 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 65 24 42 43 52 50 61 59 24 4 ...
$ wf.zipcode : Factor w/ 80 levels "98001","98002",..: 71 24 45 46 56 54 66 63 24 4 ...
$ decade : Factor w/ 12 levels "0","1","2","3",..: 6 10 11 3 12 6 8 11 8 10 ...
$ bedroom.class: Factor w/ 2 levels "4minus","5plus": 1 1 1 1 2 1 1 1 1 1 ...
The Final Model¶
This model adds in the new variables, removes the bathrooms:grade interaction and adds two other interactions. The resulting set of terms in the model captures the following observations from the exploratory data analysis:
1. The influence the number of bathrooms, condition, grade and age of the house have on the sale price.
2. The influence of the size of the house and lot on the sale price using log transformations of the sqft_living and sqft_lot variables.
3. The influence that the zipcode and whether or not the property overlooks the waterfront has on house price using the generated variable wf.zipcode. This variable also captures the correlation between zipcode and waterfront.
4. The non-linear influence of the age of the property on the price using the generated variable decade.
5. The difference between having four or fewer bedrooms and having more than four bedrooms using the generated variable bedroom.class.
6. The correlation between the size and grade of the house using the sqft_living:grade interaction.
7. The correlation between the number of bathrooms and the size of the house using the bathrooms:log(sqft_living) interaction.
8. The correlation between the condition and age of the house using the condition:yr_built interaction.
9. The correlation between the size and age of the house using the sqft_living:decade interaction.
10. The correlation between the number of bathrooms, zipcode and waterfront using the bathrooms:wf.zipcode interaction.
In [42]:
house.model <- lm(log(price) ~ bathrooms + condition + grade + yr_built +
log(sqft_living) + log(sqft_lot) + wf.zipcode + decade + bedroom.class +
sqft_living:grade + bathrooms:log(sqft_living) + condition:yr_built +
sqft_living:decade + bathrooms:wf.zipcode,
data=houseData)
summary(house.model)
Call:
lm(formula = log(price) ~ bathrooms + condition + grade + yr_built +
log(sqft_living) + log(sqft_lot) + wf.zipcode + decade +
bedroom.class + sqft_living:grade + bathrooms:log(sqft_living) +
condition:yr_built + sqft_living:decade + bathrooms:wf.zipcode,
data = houseData)
Residuals:
Min 1Q Median 3Q Max
-0.9481 -0.0946 0.0023 0.0986 1.0992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.84e+00 5.31e+00 -1.10 0.27185
bathrooms -1.79e-01 8.38e-02 -2.14 0.03266 *
condition2 1.36e+01 5.08e+00 2.69 0.00725 **
condition3 1.57e+01 4.82e+00 3.25 0.00116 **
condition4 1.66e+01 4.82e+00 3.43 0.00060 ***
condition5 1.63e+01 4.85e+00 3.37 0.00075 ***
grade4 -1.03e+00 1.87e+00 -0.55 0.58004
grade5 -1.26e+00 1.86e+00 -0.68 0.49743
grade6 -1.22e+00 1.86e+00 -0.66 0.51077
grade7 -1.27e+00 1.86e+00 -0.68 0.49567
grade8 -1.27e+00 1.86e+00 -0.68 0.49479
grade9 -1.12e+00 1.86e+00 -0.60 0.54606
grade10 -1.06e+00 1.86e+00 -0.57 0.56974
grade11 -9.20e-01 1.86e+00 -0.49 0.62181
grade12 -6.49e-01 1.87e+00 -0.35 0.72821
grade13 -5.76e-01 1.91e+00 -0.30 0.76298
yr_built 7.72e-03 2.58e-03 3.00 0.00274 **
log(sqft_living) 4.69e-01 3.54e-02 13.26 < 2e-16 ***
log(sqft_lot) 8.04e-02 3.29e-03 24.44 < 2e-16 ***
wf.zipcode98002 3.00e-02 7.24e-02 0.41 0.67860
wf.zipcode98003 1.86e-01 7.30e-02 2.55 0.01087 *
wf.zipcode98004 1.47e+00 6.66e-02 22.07 < 2e-16 ***
wf.zipcode98005 1.14e+00 8.63e-02 13.20 < 2e-16 ***
wf.zipcode98006 8.16e-01 6.52e-02 12.52 < 2e-16 ***
wf.zipcode98007 9.95e-01 1.02e-01 9.76 < 2e-16 ***
wf.zipcode98008 9.11e-01 7.72e-02 11.80 < 2e-16 ***
wf.zipcode98010 4.09e-01 1.14e-01 3.58 0.00035 ***
wf.zipcode98011 6.98e-01 1.09e-01 6.40 1.6e-10 ***
wf.zipcode98014 5.75e-01 7.91e-02 7.27 3.9e-13 ***
wf.zipcode98019 4.97e-01 9.17e-02 5.42 6.2e-08 ***
wf.zipcode98022 1.67e-01 7.76e-02 2.15 0.03129 *
wf.zipcode98023 1.93e-01 6.28e-02 3.08 0.00209 **
wf.zipcode98024 5.80e-01 8.23e-02 7.05 2.0e-12 ***
wf.zipcode98027 7.03e-01 6.75e-02 10.42 < 2e-16 ***
wf.zipcode98028 5.96e-01 8.85e-02 6.74 1.7e-11 ***
wf.zipcode98029 9.00e-01 8.87e-02 10.14 < 2e-16 ***
wf.zipcode98030 2.21e-01 8.03e-02 2.76 0.00585 **
wf.zipcode98031 3.27e-01 7.80e-02 4.20 2.7e-05 ***
wf.zipcode98032 1.78e-01 9.15e-02 1.94 0.05228 .
wf.zipcode98033 1.01e+00 6.41e-02 15.68 < 2e-16 ***
wf.zipcode98034 7.50e-01 6.21e-02 12.07 < 2e-16 ***
wf.zipcode98038 3.14e-01 7.96e-02 3.95 8.0e-05 ***
wf.zipcode98039 1.63e+00 1.24e-01 13.15 < 2e-16 ***
wf.zipcode98040 1.23e+00 7.74e-02 15.93 < 2e-16 ***
wf.zipcode98040-1 1.90e+00 1.85e-01 10.25 < 2e-16 ***
wf.zipcode98042 2.32e-01 6.51e-02 3.56 0.00037 ***
wf.zipcode98045 6.27e-01 7.02e-02 8.93 < 2e-16 ***
wf.zipcode98052 9.18e-01 6.92e-02 13.26 < 2e-16 ***
wf.zipcode98053 8.54e-01 7.13e-02 11.98 < 2e-16 ***
wf.zipcode98055 3.37e-01 6.48e-02 5.20 2.0e-07 ***
wf.zipcode98056 4.31e-01 6.30e-02 6.84 8.7e-12 ***
wf.zipcode98058 3.12e-01 6.33e-02 4.94 8.1e-07 ***
wf.zipcode98059 4.27e-01 6.49e-02 6.58 4.9e-11 ***
wf.zipcode98065 6.25e-01 7.19e-02 8.70 < 2e-16 ***
wf.zipcode98070 5.69e-01 8.55e-02 6.66 2.9e-11 ***
wf.zipcode98070-1 8.37e-01 1.48e-01 5.65 1.7e-08 ***
wf.zipcode98072 7.08e-01 8.28e-02 8.54 < 2e-16 ***
wf.zipcode98074 8.37e-01 7.44e-02 11.25 < 2e-16 ***
wf.zipcode98075 9.47e-01 7.94e-02 11.94 < 2e-16 ***
wf.zipcode98075-1 1.72e+00 2.34e-01 7.35 2.1e-13 ***
wf.zipcode98077 6.94e-01 8.52e-02 8.14 4.6e-16 ***
wf.zipcode98092 2.39e-01 7.43e-02 3.21 0.00132 **
wf.zipcode98102 1.22e+00 1.08e-01 11.34 < 2e-16 ***
wf.zipcode98103 1.05e+00 5.82e-02 18.09 < 2e-16 ***
wf.zipcode98105 1.08e+00 6.99e-02 15.41 < 2e-16 ***
wf.zipcode98105-1 1.73e+00 4.29e-01 4.05 5.2e-05 ***
wf.zipcode98106 5.57e-01 6.30e-02 8.84 < 2e-16 ***
wf.zipcode98107 1.09e+00 6.59e-02 16.48 < 2e-16 ***
wf.zipcode98108 5.66e-01 7.29e-02 7.75 9.9e-15 ***
wf.zipcode98109 1.40e+00 8.77e-02 15.93 < 2e-16 ***
wf.zipcode98112 1.17e+00 7.20e-02 16.24 < 2e-16 ***
wf.zipcode98115 1.01e+00 5.83e-02 17.38 < 2e-16 ***
wf.zipcode98116 1.05e+00 6.59e-02 15.96 < 2e-16 ***
wf.zipcode98117 1.07e+00 5.86e-02 18.32 < 2e-16 ***
wf.zipcode98118 6.43e-01 5.80e-02 11.10 < 2e-16 ***
wf.zipcode98119 1.26e+00 7.73e-02 16.29 < 2e-16 ***
wf.zipcode98122 1.01e+00 6.79e-02 14.90 < 2e-16 ***
wf.zipcode98125 7.91e-01 6.00e-02 13.19 < 2e-16 ***
wf.zipcode98126 7.39e-01 6.07e-02 12.18 < 2e-16 ***
wf.zipcode98133 6.93e-01 5.91e-02 11.73 < 2e-16 ***
wf.zipcode98136 9.55e-01 6.79e-02 14.07 < 2e-16 ***
wf.zipcode98136-1 2.14e+00 3.15e-01 6.81 1.0e-11 ***
wf.zipcode98144 8.02e-01 6.63e-02 12.10 < 2e-16 ***
wf.zipcode98146 3.03e-01 6.37e-02 4.76 2.0e-06 ***
wf.zipcode98148 3.49e-01 1.08e-01 3.24 0.00120 **
wf.zipcode98155 6.29e-01 6.16e-02 10.22 < 2e-16 ***
wf.zipcode98166 3.71e-01 7.01e-02 5.30 1.2e-07 ***
wf.zipcode98166-1 1.36e+00 2.60e-01 5.23 1.7e-07 ***
wf.zipcode98168 1.24e-01 6.71e-02 1.85 0.06447 .
wf.zipcode98177 8.09e-01 6.81e-02 11.88 < 2e-16 ***
wf.zipcode98178 2.73e-01 6.75e-02 4.04 5.3e-05 ***
wf.zipcode98188 2.58e-01 7.10e-02 3.64 0.00027 ***
wf.zipcode98198 1.54e-01 6.83e-02 2.26 0.02392 *
wf.zipcode98198-1 4.55e-01 2.45e-01 1.85 0.06411 .
wf.zipcode98199 1.12e+00 6.74e-02 16.58 < 2e-16 ***
wf.zipcodewf-high 1.64e+00 2.44e-01 6.70 2.2e-11 ***
wf.zipcodewf-low 2.05e+00 3.21e-01 6.40 1.6e-10 ***
wf.zipcodewf-mid 1.73e+00 1.40e-01 12.36 < 2e-16 ***
decade1 -3.87e-02 3.45e-02 -1.12 0.26102
decade2 2.21e-02 3.50e-02 0.63 0.52763
decade3 9.27e-02 4.21e-02 2.20 0.02782 *
decade4 -2.61e-02 4.11e-02 -0.63 0.52550
decade5 -5.26e-02 4.46e-02 -1.18 0.23757
decade6 -6.43e-02 5.07e-02 -1.27 0.20537
decade7 9.47e-03 5.61e-02 0.17 0.86601
decade8 7.36e-02 6.08e-02 1.21 0.22594
decade9 8.08e-02 6.73e-02 1.20 0.22979
decade10 1.95e-02 7.25e-02 0.27 0.78760
decade11 7.78e-03 7.88e-02 0.10 0.92135
bedroom.class5plus -2.55e-02 7.24e-03 -3.52 0.00043 ***
grade3:sqft_living -1.77e-03 3.32e-03 -0.53 0.59497
grade4:sqft_living -5.98e-04 1.89e-04 -3.16 0.00156 **
grade5:sqft_living -2.54e-04 5.98e-05 -4.25 2.2e-05 ***
grade6:sqft_living -1.88e-04 3.49e-05 -5.39 7.3e-08 ***
grade7:sqft_living -6.90e-05 2.66e-05 -2.59 0.00953 **
grade8:sqft_living -9.10e-06 2.43e-05 -0.37 0.70817
grade9:sqft_living -1.70e-05 2.23e-05 -0.76 0.44615
grade10:sqft_living -3.74e-06 2.16e-05 -0.17 0.86260
grade11:sqft_living -5.39e-06 2.34e-05 -0.23 0.81785
grade12:sqft_living -1.75e-05 2.69e-05 -0.65 0.51429
grade13:sqft_living -6.94e-06 5.91e-05 -0.12 0.90651
bathrooms:log(sqft_living) 3.78e-02 1.06e-02 3.56 0.00037 ***
condition2:yr_built -6.99e-03 2.62e-03 -2.66 0.00772 **
condition3:yr_built -7.94e-03 2.49e-03 -3.19 0.00145 **
condition4:yr_built -8.38e-03 2.50e-03 -3.36 0.00078 ***
condition5:yr_built -8.23e-03 2.51e-03 -3.29 0.00102 **
decade1:sqft_living 2.21e-05 1.68e-05 1.32 0.18664
decade2:sqft_living -2.39e-06 1.61e-05 -0.15 0.88171
decade3:sqft_living -3.22e-05 1.80e-05 -1.79 0.07360 .
decade4:sqft_living 4.81e-06 1.63e-05 0.30 0.76762
decade5:sqft_living 3.00e-06 1.49e-05 0.20 0.84017
decade6:sqft_living -6.71e-06 1.51e-05 -0.44 0.65637
decade7:sqft_living -4.58e-05 1.52e-05 -3.02 0.00257 **
decade8:sqft_living -5.89e-05 1.48e-05 -3.98 6.8e-05 ***
decade9:sqft_living -5.51e-05 1.47e-05 -3.76 0.00017 ***
decade10:sqft_living -1.36e-05 1.40e-05 -0.97 0.33081
decade11:sqft_living 1.71e-05 1.58e-05 1.08 0.28132
bathrooms:wf.zipcode98002 -9.23e-03 3.57e-02 -0.26 0.79578
bathrooms:wf.zipcode98003 -6.81e-02 3.50e-02 -1.95 0.05146 .
bathrooms:wf.zipcode98004 -1.44e-01 2.92e-02 -4.92 8.6e-07 ***
bathrooms:wf.zipcode98005 -1.74e-01 3.68e-02 -4.72 2.4e-06 ***
bathrooms:wf.zipcode98006 -5.55e-02 2.87e-02 -1.93 0.05361 .
bathrooms:wf.zipcode98007 -1.32e-01 4.49e-02 -2.93 0.00336 **
bathrooms:wf.zipcode98008 -9.65e-02 3.62e-02 -2.66 0.00772 **
bathrooms:wf.zipcode98010 -1.08e-01 5.41e-02 -1.99 0.04688 *
bathrooms:wf.zipcode98011 -1.01e-01 4.87e-02 -2.08 0.03732 *
bathrooms:wf.zipcode98014 -1.28e-01 3.62e-02 -3.53 0.00042 ***
bathrooms:wf.zipcode98019 -8.38e-02 4.13e-02 -2.03 0.04258 *
bathrooms:wf.zipcode98022 -4.53e-02 3.89e-02 -1.17 0.24392
bathrooms:wf.zipcode98023 -8.62e-02 2.96e-02 -2.91 0.00360 **
bathrooms:wf.zipcode98024 -8.05e-02 3.69e-02 -2.18 0.02916 *
bathrooms:wf.zipcode98027 -8.02e-02 2.94e-02 -2.73 0.00640 **
bathrooms:wf.zipcode98028 -7.74e-02 4.07e-02 -1.90 0.05741 .
bathrooms:wf.zipcode98029 -1.20e-01 3.62e-02 -3.32 0.00089 ***
bathrooms:wf.zipcode98030 -6.59e-02 3.64e-02 -1.81 0.06996 .
bathrooms:wf.zipcode98031 -1.07e-01 3.64e-02 -2.95 0.00321 **
bathrooms:wf.zipcode98032 -8.15e-02 4.88e-02 -1.67 0.09451 .
bathrooms:wf.zipcode98033 -8.37e-02 2.91e-02 -2.88 0.00400 **
bathrooms:wf.zipcode98034 -7.19e-02 2.90e-02 -2.48 0.01326 *
bathrooms:wf.zipcode98038 -6.19e-02 3.46e-02 -1.79 0.07362 .
bathrooms:wf.zipcode98039 -1.34e-01 4.20e-02 -3.18 0.00145 **
bathrooms:wf.zipcode98040 -1.22e-01 3.21e-02 -3.80 0.00014 ***
bathrooms:wf.zipcode98040-1 -1.06e-01 5.39e-02 -1.96 0.05039 .
bathrooms:wf.zipcode98042 -6.62e-02 3.00e-02 -2.21 0.02736 *
bathrooms:wf.zipcode98045 -1.40e-01 3.25e-02 -4.30 1.7e-05 ***
bathrooms:wf.zipcode98052 -1.10e-01 3.09e-02 -3.56 0.00038 ***
bathrooms:wf.zipcode98053 -1.16e-01 3.08e-02 -3.76 0.00017 ***
bathrooms:wf.zipcode98055 -7.54e-02 3.07e-02 -2.45 0.01411 *
bathrooms:wf.zipcode98056 -4.02e-02 2.96e-02 -1.36 0.17410
bathrooms:wf.zipcode98058 -6.20e-02 2.97e-02 -2.09 0.03669 *
bathrooms:wf.zipcode98059 -3.74e-02 2.92e-02 -1.28 0.19964
bathrooms:wf.zipcode98065 -8.57e-02 3.08e-02 -2.78 0.00543 **
bathrooms:wf.zipcode98070 -1.09e-01 4.03e-02 -2.69 0.00705 **
bathrooms:wf.zipcode98070-1 -9.04e-02 7.29e-02 -1.24 0.21517
bathrooms:wf.zipcode98072 -9.63e-02 3.75e-02 -2.57 0.01028 *
bathrooms:wf.zipcode98074 -1.19e-01 3.22e-02 -3.71 0.00021 ***
bathrooms:wf.zipcode98075 -1.55e-01 3.21e-02 -4.83 1.4e-06 ***
bathrooms:wf.zipcode98075-1 -8.63e-02 8.83e-02 -0.98 0.32859
bathrooms:wf.zipcode98077 -1.11e-01 3.66e-02 -3.03 0.00245 **
bathrooms:wf.zipcode98092 -9.06e-02 3.37e-02 -2.69 0.00721 **
bathrooms:wf.zipcode98102 -1.10e-01 4.58e-02 -2.40 0.01627 *
bathrooms:wf.zipcode98103 -9.37e-02 2.75e-02 -3.41 0.00065 ***
bathrooms:wf.zipcode98105 -4.58e-02 3.20e-02 -1.43 0.15293
bathrooms:wf.zipcode98105-1 -1.24e-01 1.20e-01 -1.03 0.30103
bathrooms:wf.zipcode98106 -9.88e-02 3.05e-02 -3.24 0.00119 **
bathrooms:wf.zipcode98107 -9.27e-02 3.07e-02 -3.02 0.00253 **
bathrooms:wf.zipcode98108 -9.57e-02 3.62e-02 -2.65 0.00815 **
bathrooms:wf.zipcode98109 -1.90e-01 3.84e-02 -4.96 7.3e-07 ***
bathrooms:wf.zipcode98112 -5.72e-02 3.08e-02 -1.86 0.06337 .
bathrooms:wf.zipcode98115 -7.97e-02 2.78e-02 -2.86 0.00421 **
bathrooms:wf.zipcode98116 -1.18e-01 3.09e-02 -3.82 0.00014 ***
bathrooms:wf.zipcode98117 -1.24e-01 2.82e-02 -4.40 1.1e-05 ***
bathrooms:wf.zipcode98118 -7.12e-02 2.82e-02 -2.52 0.01161 *
bathrooms:wf.zipcode98119 -9.69e-02 3.33e-02 -2.91 0.00365 **
bathrooms:wf.zipcode98122 -6.48e-02 3.14e-02 -2.07 0.03890 *
bathrooms:wf.zipcode98125 -9.07e-02 2.90e-02 -3.13 0.00177 **
bathrooms:wf.zipcode98126 -6.07e-02 3.04e-02 -1.99 0.04609 *
bathrooms:wf.zipcode98133 -1.13e-01 2.89e-02 -3.92 8.8e-05 ***
bathrooms:wf.zipcode98136 -1.05e-01 3.25e-02 -3.24 0.00122 **
bathrooms:wf.zipcode98136-1 -3.66e-01 1.41e-01 -2.60 0.00925 **
bathrooms:wf.zipcode98144 -4.00e-02 3.03e-02 -1.32 0.18595
bathrooms:wf.zipcode98146 2.73e-02 3.28e-02 0.83 0.40519
bathrooms:wf.zipcode98148 -8.12e-02 5.73e-02 -1.42 0.15629
bathrooms:wf.zipcode98155 -8.05e-02 3.02e-02 -2.66 0.00778 **
bathrooms:wf.zipcode98166 1.40e-02 3.45e-02 0.41 0.68478
bathrooms:wf.zipcode98166-1 -2.32e-01 9.18e-02 -2.53 0.01133 *
bathrooms:wf.zipcode98168 -1.64e-02 3.83e-02 -0.43 0.66848
bathrooms:wf.zipcode98177 -6.55e-02 3.11e-02 -2.11 0.03514 *
bathrooms:wf.zipcode98178 -4.68e-02 3.42e-02 -1.37 0.17174
bathrooms:wf.zipcode98188 -6.24e-02 3.43e-02 -1.82 0.06856 .
bathrooms:wf.zipcode98198 -6.55e-03 3.40e-02 -0.19 0.84720
bathrooms:wf.zipcode98198-1 8.95e-02 1.10e-01 0.81 0.41733
bathrooms:wf.zipcode98199 -9.87e-02 3.04e-02 -3.24 0.00118 **
bathrooms:wf.zipcodewf-high -5.81e-02 6.90e-02 -0.84 0.39978
bathrooms:wf.zipcodewf-low -6.86e-01 1.73e-01 -3.98 7.1e-05 ***
bathrooms:wf.zipcodewf-mid -1.61e-01 4.82e-02 -3.34 0.00084 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.181 on 9784 degrees of freedom
Multiple R-squared: 0.886, Adjusted R-squared: 0.884
F-statistic: 355 on 215 and 9784 DF, p-value: <2e-16
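The summary above contains a grade3:sqft_living row, whereas the earlier model's sqft_living:grade rows started at grade4. A likely explanation is that sqft_living is no longer a main effect (only log(sqft_living) is), so R gives every grade level its own slope. The sketch below shows this behaviour of model.matrix on a made-up data frame; toy, size and the values are hypothetical.

```r
# Toy data (hypothetical) with a three-level factor and a numeric size
toy <- data.frame(
  size  = c(1000, 1500, 2000, 2500, 3000, 3500),
  grade = factor(c(3, 3, 4, 4, 5, 5))
)

# With size as a main effect, the interaction adds slopes for every level
# except the baseline level
with.main <- colnames(model.matrix(~ grade + size + size:grade, data = toy))
print(with.main)

# Without the size main effect (log(size) is used instead), every level
# gets its own slope, including the baseline level
without.main <- colnames(model.matrix(~ grade + log(size) + size:grade, data = toy))
print(without.main)
```

In the second design matrix the per-level slopes are each measured from zero rather than relative to a baseline slope, which changes how their p-values should be read.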
Model Accuracy¶
As the linear regression model predicts the log of the house price, the prediction results need to be transformed back to house prices using the exp function before assessing the model accuracy.
Convert the predicted values to the house price and calculate the RSE, $R^2$ and F-statistic values from the results.
In [43]:
# Transform the predicted values to house prices
houseData.predict <- exp(house.model$fitted.values)
# Calculate the accuracy statistics
houseModel.accuracy <- Model.Accuracy(houseData.predict,houseData$price,9784,215)
# Print the accuracy statistics
cat("\nModel accuracy after converting the predicted log(price) to price:\n")
cat("\nDegrees of freedom: 9784\nModel parameters: 215 plus intercept")
cat("\nResidual standard error:",houseModel.accuracy$rse)
cat("\nPercentage error:",houseModel.accuracy$rse*100/mean(houseData$price),"%")
cat("\nR-Squared:",houseModel.accuracy$rsquared)
cat("\nF-statistic:",houseModel.accuracy$f.stat,"; p-value:",pf(houseModel.accuracy$f.stat,215,9784,lower.tail=FALSE))
Model accuracy after converting the predicted log(price) to price:
Degrees of freedom: 9784
Model parameters: 215 plus intercept
Residual standard error: 121593
Percentage error: 22.46 %
R-Squared: 0.8934
F-statistic: 381.2 ; p-value: 0
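The statistics above can be computed from the back-transformed predictions along the following lines. This is a minimal sketch using the standard formulas; the helper AccuracySketch and the toy data are hypothetical, and the notebook's own Model.Accuracy function (defined earlier) may be implemented differently.

```r
# Hypothetical helper: accuracy statistics on the original price scale.
# df is the residual degrees of freedom, p the number of model parameters
# excluding the intercept.
AccuracySketch <- function(predicted, actual, df, p) {
  rss <- sum((actual - predicted)^2)        # residual sum of squares
  tss <- sum((actual - mean(actual))^2)     # total sum of squares
  list(rse      = sqrt(rss / df),           # residual standard error
       rsquared = 1 - rss / tss,            # proportion of variance explained
       f.stat   = ((tss - rss) / p) / (rss / df))
}

# Toy example (made-up prices): fit log(price), back-transform with exp()
toy <- data.frame(price = c(200, 260, 310, 420, 500, 640),
                  sqft  = c(900, 1200, 1500, 2000, 2400, 3000))
toy.fit <- lm(log(price) ~ log(sqft), data = toy)
toy.pred <- exp(toy.fit$fitted.values)

acc <- AccuracySketch(toy.pred, toy$price, df = toy.fit$df.residual, p = 1)
print(acc)
```

As a consistency check on the reported figures, an $R^2$ of 0.8934 with 215 parameters and 9784 degrees of freedom gives an F-statistic of roughly 381 under these formulas, in line with the output above.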
Model Analysis¶
Residuals: These are centred around zero (median 0.0023), the 1st and 3rd quartiles are almost equal distances from zero (-0.0946 and 0.0986), and there is only a small difference between the magnitudes of the minimum and maximum (-0.9481 and 1.0992). This indicates an even distribution of the residuals.
Residual Standard Error: The residual standard error, or the estimated standard deviation of the residuals is 121592.8, which is 22.5% of the mean house price.
$R^2$: The adjusted $R^2$ of 0.884 indicates the model explains 88.4% of the variation in the log of the house price. After converting the regression results to the house price, the $R^2$ value increases to 0.893, showing the model explains 89.3% of the variation in the house price.
F-statistic: The F-statistic calculated using the house prices is 381, which indicates a strong relationship between the predictor and response variables. The p-value associated with this is very small, meaning the null hypothesis (the model is not useful) can be rejected. There are 215 terms (plus the intercept) in the model, leaving 9784 degrees of freedom from the sample size of 10,000.
Coefficients: All the numerical variable coefficients have small p-values, so they are not zero at the 5% significance level. Not all the coefficients for the levels of the categorical variables are significant, but all categorical variables apart from grade and decade have several levels with significant coefficients. Grade and decade are required in the model as there are interaction terms that include these variables.
Some observations from the values of the coefficients:
• The yr_built, log(sqft_living) and log(sqft_lot) variables and the interaction bathrooms:log(sqft_living) all have positive coefficients, meaning that increases in these values result in a higher house price.
• The negative coefficient for the bedroom.class5plus factor shows that having 5 or more bedrooms lowers the house price once the effects of the correlated variables have been taken into account.
• The condition factor coefficients generally increase for the better conditioned houses (condition5 is slightly below condition4), reflecting the higher price for houses in better condition seen in the EDA section.
• The negative correlation between condition and yr_built is modelled by the generally decreasing coefficients for the condition:yr_built factors.
• The correlation between bathrooms and zipcode discovered during the EDA is modelled by the different coefficients for the bathrooms:wf.zipcode interaction categories.
• As none of the grade category coefficients are significant, the correlation between grade and house price observed in the EDA is explained by the interaction between grade and sqft_living - in other words the grade has very little effect on the house price. The size of the house has the effect; the grade is largely determined by the size of the house.
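Because the response is log(price), the coefficients can be read as approximate percentage effects. The sketch below works through two terms that appear in the model only once (bedroom.class5plus and log(sqft_lot)), using coefficient values copied from the summary; the variable names are illustrative, and the calculation assumes all other terms are held fixed.

```r
# Coefficients copied from the final model summary
coef.bedroom5plus <- -2.55e-02   # bedroom.class5plus
coef.log.lot      <-  8.04e-02   # log(sqft_lot)

# A factor coefficient b in a log(price) model multiplies the price by exp(b)
pct.bedroom <- (exp(coef.bedroom5plus) - 1) * 100

# A log-log coefficient b acts as an elasticity: scaling sqft_lot by a
# factor k multiplies the price by k^b
pct.double.lot <- (2^coef.log.lot - 1) * 100

cat("Effect of 5+ bedrooms:", round(pct.bedroom, 1), "%\n")
cat("Effect of doubling the lot size:", round(pct.double.lot, 1), "%\n")
```

On these figures, having five or more bedrooms corresponds to a predicted price about 2.5% lower, and doubling the lot size to a predicted price a little under 6% higher. The same reading does not apply directly to bathrooms, grade or sqft_living, because those variables also appear in interaction terms.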
In [44]:
par(mfrow=c(2,2))
plot(house.model)
Warning message:
“not plotting observations with leverage one:
  6061, 8099”
Warning message:
“not plotting observations with leverage one:
  6061, 8099”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”
Warning message in sqrt(crit * p * (1 - hh)/hh):
“NaNs produced”

The model plots show:
• Residuals vs Fitted - the residuals are evenly distributed around zero with no funnelling, so the model meets the assumption of homoscedasticity: the variance of the error terms is constant along the regression line.
• Normal Q-Q - the residuals deviate only slightly from the dashed line, indicating they are close to normally distributed.
• Scale-Location - the variance of the residuals is reasonably constant.
• Residuals vs Leverage - there are some possibly influential outliers, but they generally lie away from the Cook's distance contour lines.
Display the records for the unplotted observations
In [45]:
# Check the records in the warnings
predict.price <- houseData.predict[c(6061, 8099)]
cbind(predict.price,houseData[c(6061, 8099),])
     predict.price  price bedrooms bathrooms sqft_living sqft_lot waterfront condition grade yr_built zipcode wf.zipcode decade bedroom.class
6061        262000 262000        1      0.75         520    12981          0         5     3     1920   98022      98022      2        4minus
8099        280000 280000        1      0.00         600    24501          0         2     3     1950   98045      98045      5        4minus
Both observations have a leverage of one, which forces the fitted value to equal the observed value. The model therefore reproduces these two prices exactly, so it is effectively overfitting at these points.
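One common way for an observation to end up with leverage one is to be the only record at some factor level, so that a coefficient is fitted to that record alone. Whether this is the cause for records 6061 and 8099 has not been confirmed here. The sketch below only demonstrates the effect on simulated data, and all names in it are illustrative.

```r
set.seed(42)

# Simulated data: 19 records at level "a" and a single record at level "b"
toy <- data.frame(x = rnorm(20),
                  g = factor(c(rep("a", 19), "b")))
toy$y <- 1 + 2 * toy$x + rnorm(20)

toy.model <- lm(y ~ x + g, data = toy)

# The lone "b" record gets its own coefficient, so its hat value is 1
# and its fitted value reproduces the observed value
toy.hat <- hatvalues(toy.model)
toy.res <- residuals(toy.model)
print(round(c(hat = unname(toy.hat[20]), residual = unname(toy.res[20])), 6))
```

The same reasoning applies to any model term that is determined by a single record.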
Check for Influential Outliers¶
Use the code supplied in the tutorial to check for influential outlier points
In [46]:
outlierTest(house.model, cutoff=0.05, digits = 1)
rstudent unadjusted p-value Bonferonni p
518 6.507 8.048e-11 8.047e-07
609 -5.289 1.258e-07 1.257e-03
4114 -5.087 3.713e-07 3.713e-03
2091 5.061 4.253e-07 4.253e-03
1642 5.060 4.274e-07 4.273e-03
6864 -5.034 4.880e-07 4.879e-03
9352 -5.013 5.444e-07 5.443e-03
7951 -4.960 7.165e-07 7.163e-03
3438 4.956 7.302e-07 7.300e-03
4313 4.912 9.165e-07 9.164e-03
The outlier test has reported several outliers, so generate an influence plot to see if these are influential outliers.
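The Bonferroni p-value in the last column is the unadjusted p-value multiplied by the number of observations tested, capped at 1. As a rough check, the first row can be approximately reproduced under an assumed count of 9998 tested observations. That count is an inference, not taken from the output: it is the 10000 implied by the model's degrees of freedom, less the two leverage-one records, which have no studentized residual.

```r
# Assumed number of observations tested (an inference, see above)
n.tested <- 10000 - 2

# Unadjusted p-value reported for observation 518
p.unadjusted <- 8.048e-11

# Bonferroni adjustment: multiply by the number of tests and cap at 1
p.bonferroni <- min(1, p.unadjusted * n.tested)
cat("Bonferroni p for 518:", signif(p.bonferroni, 4), "\n")
```

The result is close to the reported 8.047e-07; the small difference comes from the rounding of the unadjusted p-value in the printed table.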
In [47]:
influencePlot(house.model, scale=5, id=list(method="noteworthy"), main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
     StudRes     Hat    CookD
518    6.507 0.12561 0.028041
609   -5.289 0.01682 0.002209
5493   2.048 0.80005 0.077655
6061     NaN 1.00000      NaN
8099     NaN 1.00000      NaN

The influence plot shows quite a few influential points and has reported five of them:
• 518 and 609 have large studentized residuals and were also flagged by the outlier test. A large studentized residual means the model has made a poor prediction for that record.
• 5493, 6061 and 8099 have large Hat values and so are significantly influencing the model. 6061 and 8099 are the leverage-one records examined above.
Display the outlier and influencer records
Using the Model to Predict Prices¶
Function to prepare the data for price prediction¶
Name: Prepare.Data
Input parameters:
• data - a dataframe that contains the test data. It should be in the same structure as the training dataset.
Return Value:
• A dataframe containing the prepared data
Description:
This function reformats the input dataframe to the format expected by the model.
• Make a working copy of the data and remove the id and price columns, if they exist
• Change any grades of 1 or 2 to 3. These grades were not in the training data and the model will not run if the data contains these grades
• Factorise the categorical variables
• Create the generated variables wf.zipcode, decade and bedroom.class
In [48]:
Prepare.Data <- function(data) {
    # Remove the id and price columns
    newdata <- data
    newdata$id <- NULL
    newdata$price <- NULL
    # The model will fail if any grades of 1 or 2 are present, so change these to 3
    newdata$grade[newdata$grade < 3] <- 3
    # Set the factors
    newdata$waterfront <- as.factor(newdata$waterfront)
    newdata$condition <- as.factor(newdata$condition)
    newdata$grade <- as.factor(newdata$grade)
    newdata$zipcode <- as.factor(newdata$zipcode)
    # Generate the new variables
    newdata <- GenerateVariables(newdata)
    # Return the prepared dataframe
    return(newdata)
}
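Prediction with a fitted lm fails when the new data contains a factor level that was not in the training data, which is why grades 1 and 2 are recoded. A check along the following lines could flag such levels before calling predict. This is a sketch only: FindUnseenLevels is a hypothetical helper that is not part of the notebook, and the two small data frames are made up.

```r
# Hypothetical helper: for each factor column in the training data, list
# the values in the new data that the training data never contained
FindUnseenLevels <- function(train, new) {
    factor.cols <- names(train)[sapply(train, is.factor)]
    factor.cols <- intersect(factor.cols, names(new))
    unseen <- lapply(factor.cols, function(col) {
        setdiff(unique(as.character(new[[col]])), levels(train[[col]]))
    })
    names(unseen) <- factor.cols
    # Keep only the columns that actually have unseen values
    unseen[sapply(unseen, length) > 0]
}

# Made-up example data
toy.train <- data.frame(grade = factor(c(3, 4, 5)),
                        zipcode = factor(c(98001, 98002, 98003)))
toy.new <- data.frame(grade = factor(c(3, 5)),
                      zipcode = factor(c(98001, 98999)))

unseen.levels <- FindUnseenLevels(toy.train, toy.new)
print(unseen.levels)
```

In this made-up example only the zipcode column is reported, because every grade in the new data also appears in the training data.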
Predict the House Prices for the Development Dataset¶
In [49]:
houseDev <- read.csv("dev.csv")
houseDev2 <- Prepare.Data(houseDev)
houseDev.predict <- exp(predict(house.model,houseDev2,type="response"))
print(head(cbind(PredictedPrice=houseDev.predict, ActualPrice=houseDev$price),20))
PredictedPrice ActualPrice
1 1158734 1146800
2 674695 950000
3 894635 850000
4 670569 599000
5 298266 255000
6 324204 280000
7 710348 715000
8 478907 550000
9 649825 1080000
10 497252 499000
11 269746 252350
12 287739 276900
13 742353 850000
14 327048 302495
15 352188 390000
16 797501 699000
17 654435 450000
18 581266 460000
19 280000 280000
20 231590 279000
The first twenty predicted prices are displayed along with the actual sale price. Most of these are reasonably close, but the model made poor predictions for records 2, 9 and 17.
RMSE for Development Predictions¶
In [50]:
dev.rmse <- RMSE(houseDev.predict,houseDev$price)
cat("RMSE for development predictions is:",dev.rmse,"; which is",dev.rmse*100/mean(houseDev$price),"% of the mean house price.")
RMSE for development predictions is: 114025 ; which is 20.65 % of the mean house price.
The root mean squared error for the price prediction of the development data is $114,025, which is 20.7% of the mean house price. This is slightly better than expected, as the percentage error of the model based on the RSE is 22.5%.
Check the uncertainty of the expected value of predictions¶
In [51]:
print(head(exp(predict(house.model,newdata=houseDev2,interval="confidence")),20))
fit lwr upr
1 1158734 1061532 1264836
2 674695 649397 700979
3 894635 850748 940786
4 670569 642353 700023
5 298266 273907 324792
6 324204 313039 335768
7 710348 685306 736305
8 478907 463961 494334
9 649825 615232 686363
10 497252 480154 514959
11 269746 262285 277418
12 287739 274402 301724
13 742353 689114 799705
14 327048 314975 339583
15 352188 338916 365980
16 797501 773052 822724
17 654435 615885 695398
18 581266 533773 632986
19 280000 196355 399277
20 231590 220246 243519
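The confidence intervals are generally narrow, with a width of around 10% of the fitted value, so the expected (mean) price for houses with a given set of attributes is estimated fairly precisely. This does not mean that individual predictions will be within 10% of the actual sale price; that uncertainty is described by the prediction intervals. Record 19 is an exception, with a much wider interval (196355 to 399277).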
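A confidence interval describes the uncertainty in the expected (mean) price for a given set of attributes, while a prediction interval describes the uncertainty for an individual sale. The sketch below illustrates the difference on simulated data. The data and names are made up and unrelated to the house price dataset.

```r
set.seed(1)

# Simulated log-linear data
toy <- data.frame(size = runif(200, 500, 4000))
toy$price <- exp(11 + 0.0004 * toy$size + rnorm(200, sd = 0.2))

toy.model <- lm(log(price) ~ size, data = toy)
new.house <- data.frame(size = c(1000, 2500))

# Both intervals are computed on the log scale, then back-transformed
conf.int <- exp(predict(toy.model, newdata = new.house, interval = "confidence"))
pred.int <- exp(predict(toy.model, newdata = new.house, interval = "prediction"))

# Width of each interval as a fraction of the fitted value
conf.width <- (conf.int[, "upr"] - conf.int[, "lwr"]) / conf.int[, "fit"]
pred.width <- (pred.int[, "upr"] - pred.int[, "lwr"]) / pred.int[, "fit"]
print(round(cbind(conf.width, pred.width), 3))
```

The prediction interval is always the wider of the two, because it adds the residual variance of a single observation to the uncertainty in the fitted mean.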
Check the uncertainty around the individual predictions
In [52]:
print(head(exp(predict(house.model,newdata=houseDev2,interval="prediction")),20))
fit lwr upr
1 1158734 803970 1670042
2 674695 472172 964084
3 894635 625157 1280273
4 670569 469030 958707
5 298266 207066 429635
6 324204 226962 463110
7 710348 497243 1014783
8 478907 335367 683881
9 649825 453795 930535
10 497252 348107 710297
11 269746 188955 385080
12 287739 201145 411611
13 742353 516585 1066790
14 327048 228893 467294
15 352188 246467 503259
16 797501 558500 1138779
17 654435 456574 938041
18 581266 403530 837288
19 280000 169514 462499
20 231590 161834 331415
The prediction intervals are wide compared to the fitted values, typically running from about 30% below to about 43% above the fit, so there is a high level of uncertainty around each individual prediction.
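The width of these intervals follows largely from the residual standard error of 0.181 reported in the model summary. As a rough check, the multiplicative bounds can be approximated as shown below. The approximation ignores the uncertainty in the coefficient estimates and uses a normal quantile, and the variable names are illustrative.

```r
rse <- 0.181          # residual standard error on the log(price) scale
z <- qnorm(0.975)     # two-sided 95% normal quantile

# On the price scale the bounds are multiplicative factors of the fit
lower.factor <- exp(-z * rse)
upper.factor <- exp(z * rse)
cat("Approximate 95% prediction bounds:",
    round(lower.factor, 2), "to", round(upper.factor, 2),
    "times the fitted value\n")
```

For record 2, for example, the interval of 472172 to 964084 around a fit of 674695 corresponds to factors of about 0.70 and 1.43, in line with this approximation.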
Steps to Rebuild Model¶
The following steps prepare the training data, build the model and run it to predict house prices.
Copies of the functions created and used for building and testing the model are included.
1. Prepare the dataframe¶
Below is a copy of the GenerateVariables function used to add the generated variables to the dataframe
In [53]:
GenerateVariables <- function(data) {
    # Make a copy of the dataframe
    newdata <- data
    # Generate the wf.zipcode variable
    ## Zipcodes treated individually
    wfzipcodes <- c(98040,98070,98075,98105,98136,98166,98198)
    ## Define the zipcode groups
    wfziplow <- c(98023,98146)
    wfzipmid <- c(98006,98027,98052,98074,98118,98125,98155,98178)
    wfziphigh <- c(98008,98033,98034,98039,98056,98144)
    ## Create the new variable
    newdata$wf.zipcode <- as.character(newdata$zipcode)
    ## Select rows with a logical mask rather than by row name, so the
    ## function still works when the row names are not 1..n
    wf.rows <- newdata$waterfront == 1 & newdata$zipcode %in% wfzipcodes
    newdata$wf.zipcode[wf.rows] <- paste(newdata$zipcode[wf.rows],"1",sep="-")
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfziplow] <- "wf-low"
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfzipmid] <- "wf-mid"
    newdata$wf.zipcode[newdata$waterfront == 1 & newdata$zipcode %in% wfziphigh] <- "wf-high"
    newdata$wf.zipcode <- as.factor(newdata$wf.zipcode)
    # Generate the decade variable
    newdata$decade <- as.factor(ifelse(newdata$yr_built < 1900, 0,
                                ifelse(newdata$yr_built > 2019, 11, trunc(newdata$yr_built/10)-190)))
    # Generate the bedroom.class variable
    newdata$bedroom.class <- as.factor(ifelse(newdata$bedrooms < 5, "4minus", "5plus"))
    # Return the modified dataframe
    return(newdata)
}
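As a quick illustration of the generated variables, the decade and bedroom.class rules from GenerateVariables can be applied to a few made-up values. The vectors below are illustrative and are not taken from the dataset.

```r
# Made-up build years, including the boundary cases
yr_built <- c(1899, 1900, 1920, 1950, 2015, 2020)

# Same rule as in GenerateVariables: decades are counted from 1900, with
# anything before 1900 mapped to 0 and anything after 2019 mapped to 11
decade <- ifelse(yr_built < 1900, 0,
                 ifelse(yr_built > 2019, 11, trunc(yr_built / 10) - 190))

# Made-up bedroom counts
bedrooms <- c(1, 4, 5, 8)
bedroom.class <- ifelse(bedrooms < 5, "4minus", "5plus")

print(data.frame(yr_built, decade))
print(data.frame(bedrooms, bedroom.class))
```

Years from 2010 to 2019 already map to 11, so the upper cap places any later years in the same category rather than creating a level the model has not seen.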
The following code can be used to prepare the training dataframe for building the model. The required steps are:
1. Read the training data
2. Remove the id variable
3. Set the factors
4. Generate the new variables
In [54]:
houseData <- read.csv("training.csv")
houseData <- houseData[,-1]
houseData$waterfront <- as.factor(houseData$waterfront)
houseData$condition <- as.factor(houseData$condition)
houseData$grade <- as.factor(houseData$grade)
houseData$zipcode <- as.factor(houseData$zipcode)
houseData <- GenerateVariables(houseData)
2. Build the model¶
A copy of the code for the final model is shown below
In [55]:
house.model <- lm(log(price) ~ bathrooms + condition + grade + yr_built +
                      log(sqft_living) + log(sqft_lot) + wf.zipcode + decade + bedroom.class +
                      sqft_living:grade + bathrooms:log(sqft_living) + condition:yr_built +
                      sqft_living:decade + bathrooms:wf.zipcode,
                  data=houseData)
summary(house.model)
Call:
lm(formula = log(price) ~ bathrooms + condition + grade + yr_built +
log(sqft_living) + log(sqft_lot) + wf.zipcode + decade +
bedroom.class + sqft_living:grade + bathrooms:log(sqft_living) +
condition:yr_built + sqft_living:decade + bathrooms:wf.zipcode,
data = houseData)
Residuals:
Min 1Q Median 3Q Max
-0.9481 -0.0946 0.0023 0.0986 1.0992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.84e+00 5.31e+00 -1.10 0.27185
bathrooms -1.79e-01 8.38e-02 -2.14 0.03266 *
condition2 1.36e+01 5.08e+00 2.69 0.00725 **
condition3 1.57e+01 4.82e+00 3.25 0.00116 **
condition4 1.66e+01 4.82e+00 3.43 0.00060 ***
condition5 1.63e+01 4.85e+00 3.37 0.00075 ***
grade4 -1.03e+00 1.87e+00 -0.55 0.58004
grade5 -1.26e+00 1.86e+00 -0.68 0.49743
grade6 -1.22e+00 1.86e+00 -0.66 0.51077
grade7 -1.27e+00 1.86e+00 -0.68 0.49567
grade8 -1.27e+00 1.86e+00 -0.68 0.49479
grade9 -1.12e+00 1.86e+00 -0.60 0.54606
grade10 -1.06e+00 1.86e+00 -0.57 0.56974
grade11 -9.20e-01 1.86e+00 -0.49 0.62181
grade12 -6.49e-01 1.87e+00 -0.35 0.72821
grade13 -5.76e-01 1.91e+00 -0.30 0.76298
yr_built 7.72e-03 2.58e-03 3.00 0.00274 **
log(sqft_living) 4.69e-01 3.54e-02 13.26 < 2e-16 ***
log(sqft_lot) 8.04e-02 3.29e-03 24.44 < 2e-16 ***
wf.zipcode98002 3.00e-02 7.24e-02 0.41 0.67860
wf.zipcode98003 1.86e-01 7.30e-02 2.55 0.01087 *
wf.zipcode98004 1.47e+00 6.66e-02 22.07 < 2e-16 ***
wf.zipcode98005 1.14e+00 8.63e-02 13.20 < 2e-16 ***
wf.zipcode98006 8.16e-01 6.52e-02 12.52 < 2e-16 ***
wf.zipcode98007 9.95e-01 1.02e-01 9.76 < 2e-16 ***
wf.zipcode98008 9.11e-01 7.72e-02 11.80 < 2e-16 ***
wf.zipcode98010 4.09e-01 1.14e-01 3.58 0.00035 ***
wf.zipcode98011 6.98e-01 1.09e-01 6.40 1.6e-10 ***
wf.zipcode98014 5.75e-01 7.91e-02 7.27 3.9e-13 ***
wf.zipcode98019 4.97e-01 9.17e-02 5.42 6.2e-08 ***
wf.zipcode98022 1.67e-01 7.76e-02 2.15 0.03129 *
wf.zipcode98023 1.93e-01 6.28e-02 3.08 0.00209 **
wf.zipcode98024 5.80e-01 8.23e-02 7.05 2.0e-12 ***
wf.zipcode98027 7.03e-01 6.75e-02 10.42 < 2e-16 ***
wf.zipcode98028 5.96e-01 8.85e-02 6.74 1.7e-11 ***
wf.zipcode98029 9.00e-01 8.87e-02 10.14 < 2e-16 ***
wf.zipcode98030 2.21e-01 8.03e-02 2.76 0.00585 **
wf.zipcode98031 3.27e-01 7.80e-02 4.20 2.7e-05 ***
wf.zipcode98032 1.78e-01 9.15e-02 1.94 0.05228 .
wf.zipcode98033 1.01e+00 6.41e-02 15.68 < 2e-16 ***
wf.zipcode98034 7.50e-01 6.21e-02 12.07 < 2e-16 ***
wf.zipcode98038 3.14e-01 7.96e-02 3.95 8.0e-05 ***
wf.zipcode98039 1.63e+00 1.24e-01 13.15 < 2e-16 ***
wf.zipcode98040 1.23e+00 7.74e-02 15.93 < 2e-16 ***
wf.zipcode98040-1 1.90e+00 1.85e-01 10.25 < 2e-16 ***
wf.zipcode98042 2.32e-01 6.51e-02 3.56 0.00037 ***
wf.zipcode98045 6.27e-01 7.02e-02 8.93 < 2e-16 ***
wf.zipcode98052 9.18e-01 6.92e-02 13.26 < 2e-16 ***
wf.zipcode98053 8.54e-01 7.13e-02 11.98 < 2e-16 ***
wf.zipcode98055 3.37e-01 6.48e-02 5.20 2.0e-07 ***
wf.zipcode98056 4.31e-01 6.30e-02 6.84 8.7e-12 ***
wf.zipcode98058 3.12e-01 6.33e-02 4.94 8.1e-07 ***
wf.zipcode98059 4.27e-01 6.49e-02 6.58 4.9e-11 ***
wf.zipcode98065 6.25e-01 7.19e-02 8.70 < 2e-16 ***
wf.zipcode98070 5.69e-01 8.55e-02 6.66 2.9e-11 ***
wf.zipcode98070-1 8.37e-01 1.48e-01 5.65 1.7e-08 ***
wf.zipcode98072 7.08e-01 8.28e-02 8.54 < 2e-16 ***
wf.zipcode98074 8.37e-01 7.44e-02 11.25 < 2e-16 ***
wf.zipcode98075 9.47e-01 7.94e-02 11.94 < 2e-16 ***
wf.zipcode98075-1 1.72e+00 2.34e-01 7.35 2.1e-13 ***
wf.zipcode98077 6.94e-01 8.52e-02 8.14 4.6e-16 ***
wf.zipcode98092 2.39e-01 7.43e-02 3.21 0.00132 **
wf.zipcode98102 1.22e+00 1.08e-01 11.34 < 2e-16 ***
wf.zipcode98103 1.05e+00 5.82e-02 18.09 < 2e-16 ***
wf.zipcode98105 1.08e+00 6.99e-02 15.41 < 2e-16 ***
wf.zipcode98105-1 1.73e+00 4.29e-01 4.05 5.2e-05 ***
wf.zipcode98106 5.57e-01 6.30e-02 8.84 < 2e-16 ***
wf.zipcode98107 1.09e+00 6.59e-02 16.48 < 2e-16 ***
wf.zipcode98108 5.66e-01 7.29e-02 7.75 9.9e-15 ***
wf.zipcode98109 1.40e+00 8.77e-02 15.93 < 2e-16 ***
wf.zipcode98112 1.17e+00 7.20e-02 16.24 < 2e-16 ***
wf.zipcode98115 1.01e+00 5.83e-02 17.38 < 2e-16 ***
wf.zipcode98116 1.05e+00 6.59e-02 15.96 < 2e-16 ***
wf.zipcode98117 1.07e+00 5.86e-02 18.32 < 2e-16 ***
wf.zipcode98118 6.43e-01 5.80e-02 11.10 < 2e-16 ***
wf.zipcode98119 1.26e+00 7.73e-02 16.29 < 2e-16 ***
wf.zipcode98122 1.01e+00 6.79e-02 14.90 < 2e-16 ***
wf.zipcode98125 7.91e-01 6.00e-02 13.19 < 2e-16 ***
wf.zipcode98126 7.39e-01 6.07e-02 12.18 < 2e-16 ***
wf.zipcode98133 6.93e-01 5.91e-02 11.73 < 2e-16 ***
wf.zipcode98136 9.55e-01 6.79e-02 14.07 < 2e-16 ***
wf.zipcode98136-1 2.14e+00 3.15e-01 6.81 1.0e-11 ***
wf.zipcode98144 8.02e-01 6.63e-02 12.10 < 2e-16 ***
wf.zipcode98146 3.03e-01 6.37e-02 4.76 2.0e-06 ***
wf.zipcode98148 3.49e-01 1.08e-01 3.24 0.00120 **
wf.zipcode98155 6.29e-01 6.16e-02 10.22 < 2e-16 ***
wf.zipcode98166 3.71e-01 7.01e-02 5.30 1.2e-07 ***
wf.zipcode98166-1 1.36e+00 2.60e-01 5.23 1.7e-07 ***
wf.zipcode98168 1.24e-01 6.71e-02 1.85 0.06447 .
wf.zipcode98177 8.09e-01 6.81e-02 11.88 < 2e-16 ***
wf.zipcode98178 2.73e-01 6.75e-02 4.04 5.3e-05 ***
wf.zipcode98188 2.58e-01 7.10e-02 3.64 0.00027 ***
wf.zipcode98198 1.54e-01 6.83e-02 2.26 0.02392 *
wf.zipcode98198-1 4.55e-01 2.45e-01 1.85 0.06411 .
wf.zipcode98199 1.12e+00 6.74e-02 16.58 < 2e-16 ***
wf.zipcodewf-high 1.64e+00 2.44e-01 6.70 2.2e-11 ***
wf.zipcodewf-low 2.05e+00 3.21e-01 6.40 1.6e-10 ***
wf.zipcodewf-mid 1.73e+00 1.40e-01 12.36 < 2e-16 ***
decade1 -3.87e-02 3.45e-02 -1.12 0.26102
decade2 2.21e-02 3.50e-02 0.63 0.52763
decade3 9.27e-02 4.21e-02 2.20 0.02782 *
decade4 -2.61e-02 4.11e-02 -0.63 0.52550
decade5 -5.26e-02 4.46e-02 -1.18 0.23757
decade6 -6.43e-02 5.07e-02 -1.27 0.20537
decade7 9.47e-03 5.61e-02 0.17 0.86601
decade8 7.36e-02 6.08e-02 1.21 0.22594
decade9 8.08e-02 6.73e-02 1.20 0.22979
decade10 1.95e-02 7.25e-02 0.27 0.78760
decade11 7.78e-03 7.88e-02 0.10 0.92135
bedroom.class5plus -2.55e-02 7.24e-03 -3.52 0.00043 ***
grade3:sqft_living -1.77e-03 3.32e-03 -0.53 0.59497
grade4:sqft_living -5.98e-04 1.89e-04 -3.16 0.00156 **
grade5:sqft_living -2.54e-04 5.98e-05 -4.25 2.2e-05 ***
grade6:sqft_living -1.88e-04 3.49e-05 -5.39 7.3e-08 ***
grade7:sqft_living -6.90e-05 2.66e-05 -2.59 0.00953 **
grade8:sqft_living -9.10e-06 2.43e-05 -0.37 0.70817
grade9:sqft_living -1.70e-05 2.23e-05 -0.76 0.44615
grade10:sqft_living -3.74e-06 2.16e-05 -0.17 0.86260
grade11:sqft_living -5.39e-06 2.34e-05 -0.23 0.81785
grade12:sqft_living -1.75e-05 2.69e-05 -0.65 0.51429
grade13:sqft_living -6.94e-06 5.91e-05 -0.12 0.90651
bathrooms:log(sqft_living) 3.78e-02 1.06e-02 3.56 0.00037 ***
condition2:yr_built -6.99e-03 2.62e-03 -2.66 0.00772 **
condition3:yr_built -7.94e-03 2.49e-03 -3.19 0.00145 **
condition4:yr_built -8.38e-03 2.50e-03 -3.36 0.00078 ***
condition5:yr_built -8.23e-03 2.51e-03 -3.29 0.00102 **
decade1:sqft_living 2.21e-05 1.68e-05 1.32 0.18664
decade2:sqft_living -2.39e-06 1.61e-05 -0.15 0.88171
decade3:sqft_living -3.22e-05 1.80e-05 -1.79 0.07360 .
decade4:sqft_living 4.81e-06 1.63e-05 0.30 0.76762
decade5:sqft_living 3.00e-06 1.49e-05 0.20 0.84017
decade6:sqft_living -6.71e-06 1.51e-05 -0.44 0.65637
decade7:sqft_living -4.58e-05 1.52e-05 -3.02 0.00257 **
decade8:sqft_living -5.89e-05 1.48e-05 -3.98 6.8e-05 ***
decade9:sqft_living -5.51e-05 1.47e-05 -3.76 0.00017 ***
decade10:sqft_living -1.36e-05 1.40e-05 -0.97 0.33081
decade11:sqft_living 1.71e-05 1.58e-05 1.08 0.28132
bathrooms:wf.zipcode98002 -9.23e-03 3.57e-02 -0.26 0.79578
bathrooms:wf.zipcode98003 -6.81e-02 3.50e-02 -1.95 0.05146 .
bathrooms:wf.zipcode98004 -1.44e-01 2.92e-02 -4.92 8.6e-07 ***
bathrooms:wf.zipcode98005 -1.74e-01 3.68e-02 -4.72 2.4e-06 ***
bathrooms:wf.zipcode98006 -5.55e-02 2.87e-02 -1.93 0.05361 .
bathrooms:wf.zipcode98007 -1.32e-01 4.49e-02 -2.93 0.00336 **
bathrooms:wf.zipcode98008 -9.65e-02 3.62e-02 -2.66 0.00772 **
bathrooms:wf.zipcode98010 -1.08e-01 5.41e-02 -1.99 0.04688 *
bathrooms:wf.zipcode98011 -1.01e-01 4.87e-02 -2.08 0.03732 *
bathrooms:wf.zipcode98014 -1.28e-01 3.62e-02 -3.53 0.00042 ***
bathrooms:wf.zipcode98019 -8.38e-02 4.13e-02 -2.03 0.04258 *
bathrooms:wf.zipcode98022 -4.53e-02 3.89e-02 -1.17 0.24392
bathrooms:wf.zipcode98023 -8.62e-02 2.96e-02 -2.91 0.00360 **
bathrooms:wf.zipcode98024 -8.05e-02 3.69e-02 -2.18 0.02916 *
bathrooms:wf.zipcode98027 -8.02e-02 2.94e-02 -2.73 0.00640 **
bathrooms:wf.zipcode98028 -7.74e-02 4.07e-02 -1.90 0.05741 .
bathrooms:wf.zipcode98029 -1.20e-01 3.62e-02 -3.32 0.00089 ***
bathrooms:wf.zipcode98030 -6.59e-02 3.64e-02 -1.81 0.06996 .
bathrooms:wf.zipcode98031 -1.07e-01 3.64e-02 -2.95 0.00321 **
bathrooms:wf.zipcode98032 -8.15e-02 4.88e-02 -1.67 0.09451 .
bathrooms:wf.zipcode98033 -8.37e-02 2.91e-02 -2.88 0.00400 **
bathrooms:wf.zipcode98034 -7.19e-02 2.90e-02 -2.48 0.01326 *
bathrooms:wf.zipcode98038 -6.19e-02 3.46e-02 -1.79 0.07362 .
bathrooms:wf.zipcode98039 -1.34e-01 4.20e-02 -3.18 0.00145 **
bathrooms:wf.zipcode98040 -1.22e-01 3.21e-02 -3.80 0.00014 ***
bathrooms:wf.zipcode98040-1 -1.06e-01 5.39e-02 -1.96 0.05039 .
bathrooms:wf.zipcode98042 -6.62e-02 3.00e-02 -2.21 0.02736 *
bathrooms:wf.zipcode98045 -1.40e-01 3.25e-02 -4.30 1.7e-05 ***
bathrooms:wf.zipcode98052 -1.10e-01 3.09e-02 -3.56 0.00038 ***
bathrooms:wf.zipcode98053 -1.16e-01 3.08e-02 -3.76 0.00017 ***
bathrooms:wf.zipcode98055 -7.54e-02 3.07e-02 -2.45 0.01411 *
bathrooms:wf.zipcode98056 -4.02e-02 2.96e-02 -1.36 0.17410
bathrooms:wf.zipcode98058 -6.20e-02 2.97e-02 -2.09 0.03669 *
bathrooms:wf.zipcode98059 -3.74e-02 2.92e-02 -1.28 0.19964
bathrooms:wf.zipcode98065 -8.57e-02 3.08e-02 -2.78 0.00543 **
bathrooms:wf.zipcode98070 -1.09e-01 4.03e-02 -2.69 0.00705 **
bathrooms:wf.zipcode98070-1 -9.04e-02 7.29e-02 -1.24 0.21517
bathrooms:wf.zipcode98072 -9.63e-02 3.75e-02 -2.57 0.01028 *
bathrooms:wf.zipcode98074 -1.19e-01 3.22e-02 -3.71 0.00021 ***
bathrooms:wf.zipcode98075 -1.55e-01 3.21e-02 -4.83 1.4e-06 ***
bathrooms:wf.zipcode98075-1 -8.63e-02 8.83e-02 -0.98 0.32859
bathrooms:wf.zipcode98077 -1.11e-01 3.66e-02 -3.03 0.00245 **
bathrooms:wf.zipcode98092 -9.06e-02 3.37e-02 -2.69 0.00721 **
bathrooms:wf.zipcode98102 -1.10e-01 4.58e-02 -2.40 0.01627 *
bathrooms:wf.zipcode98103 -9.37e-02 2.75e-02 -3.41 0.00065 ***
bathrooms:wf.zipcode98105 -4.58e-02 3.20e-02 -1.43 0.15293
bathrooms:wf.zipcode98105-1 -1.24e-01 1.20e-01 -1.03 0.30103
bathrooms:wf.zipcode98106 -9.88e-02 3.05e-02 -3.24 0.00119 **
bathrooms:wf.zipcode98107 -9.27e-02 3.07e-02 -3.02 0.00253 **
bathrooms:wf.zipcode98108 -9.57e-02 3.62e-02 -2.65 0.00815 **
bathrooms:wf.zipcode98109 -1.90e-01 3.84e-02 -4.96 7.3e-07 ***
bathrooms:wf.zipcode98112 -5.72e-02 3.08e-02 -1.86 0.06337 .
bathrooms:wf.zipcode98115 -7.97e-02 2.78e-02 -2.86 0.00421 **
bathrooms:wf.zipcode98116 -1.18e-01 3.09e-02 -3.82 0.00014 ***
bathrooms:wf.zipcode98117 -1.24e-01 2.82e-02 -4.40 1.1e-05 ***
bathrooms:wf.zipcode98118 -7.12e-02 2.82e-02 -2.52 0.01161 *
bathrooms:wf.zipcode98119 -9.69e-02 3.33e-02 -2.91 0.00365 **
bathrooms:wf.zipcode98122 -6.48e-02 3.14e-02 -2.07 0.03890 *
bathrooms:wf.zipcode98125 -9.07e-02 2.90e-02 -3.13 0.00177 **
bathrooms:wf.zipcode98126 -6.07e-02 3.04e-02 -1.99 0.04609 *
bathrooms:wf.zipcode98133 -1.13e-01 2.89e-02 -3.92 8.8e-05 ***
bathrooms:wf.zipcode98136 -1.05e-01 3.25e-02 -3.24 0.00122 **
bathrooms:wf.zipcode98136-1 -3.66e-01 1.41e-01 -2.60 0.00925 **
bathrooms:wf.zipcode98144 -4.00e-02 3.03e-02 -1.32 0.18595
bathrooms:wf.zipcode98146 2.73e-02 3.28e-02 0.83 0.40519
bathrooms:wf.zipcode98148 -8.12e-02 5.73e-02 -1.42 0.15629
bathrooms:wf.zipcode98155 -8.05e-02 3.02e-02 -2.66 0.00778 **
bathrooms:wf.zipcode98166 1.40e-02 3.45e-02 0.41 0.68478
bathrooms:wf.zipcode98166-1 -2.32e-01 9.18e-02 -2.53 0.01133 *
bathrooms:wf.zipcode98168 -1.64e-02 3.83e-02 -0.43 0.66848
bathrooms:wf.zipcode98177 -6.55e-02 3.11e-02 -2.11 0.03514 *
bathrooms:wf.zipcode98178 -4.68e-02 3.42e-02 -1.37 0.17174
bathrooms:wf.zipcode98188 -6.24e-02 3.43e-02 -1.82 0.06856 .
bathrooms:wf.zipcode98198 -6.55e-03 3.40e-02 -0.19 0.84720
bathrooms:wf.zipcode98198-1 8.95e-02 1.10e-01 0.81 0.41733
bathrooms:wf.zipcode98199 -9.87e-02 3.04e-02 -3.24 0.00118 **
bathrooms:wf.zipcodewf-high -5.81e-02 6.90e-02 -0.84 0.39978
bathrooms:wf.zipcodewf-low -6.86e-01 1.73e-01 -3.98 7.1e-05 ***
bathrooms:wf.zipcodewf-mid -1.61e-01 4.82e-02 -3.34 0.00084 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.181 on 9784 degrees of freedom
Multiple R-squared: 0.886, Adjusted R-squared: 0.884
F-statistic: 355 on 215 and 9784 DF, p-value: <2e-16
3. Run the model¶
The following code is based on the steps used to predict prices for the development data.
A copy of the function to prepare the dataframe
In [56]:
Prepare.Data <- function(data) {
    # Remove the id and price columns
    newdata <- data
    newdata$id <- NULL
    newdata$price <- NULL
    # The model will fail if any grades of 1 or 2 are present, so change these to 3
    newdata$grade[newdata$grade < 3] <- 3
    # Set the factors
    newdata$waterfront <- as.factor(newdata$waterfront)
    newdata$condition <- as.factor(newdata$condition)
    newdata$grade <- as.factor(newdata$grade)
    newdata$zipcode <- as.factor(newdata$zipcode)
    # Generate the new variables
    newdata <- GenerateVariables(newdata)
    # Return the prepared dataframe
    return(newdata)
}
A copy of the function to calculate the RMSE
In [57]:
RMSE <- function(predicted, target) {
    # Root of the mean squared difference between predictions and targets
    return(sqrt(mean((predicted - target)^2)))
}
The following code will prepare the test data and generate the price predictions. Replace the file name with the name of the file containing the test set.
Note: The exp function is applied to the predictions as the model estimates the log of the house price.
In [58]:
houseTest <- read.csv("dev.csv")
houseTest2 <- Prepare.Data(houseTest)
houseTest.predict <- exp(predict(house.model,houseTest2,type="response"))
cat("RMSE for the test predictions is:", RMSE(houseTest.predict,houseTest$price))
RMSE for the test predictions is: 114025