Assignment
Lex Comber
January 2020

Contents

Part 1: Description of the Task
1. Overview
2. Packages
3. Data
4. Modelling and an initial OLS Regression model
5. Data Pre-processing
6. A refined OLS Regression initial model
7. Summary of the task

Part 2: Random Forest methods, example and illustration
1. Random Forests - overview and background
2. Random Forests - implementation in R
4. Data
5. Worked Example

References
Part 1: Description of the Task

1. Overview
The assignment is a 2500 word project report due before 2pm on Thursday 12th March (Week 20) and represents 100% of the module mark.
Your task is to construct a predictive Random Forest (RF) model of AirBnb pricing using a number of factors related to the rental (bedrooms, reviews, etc.) and the location of the property. The RF model will then be applied to data of potential listings in order to suggest the rental price, and the RF price predictions will be compared to those from a standard linear model. There are further details throughout this Assignment description document, so please read it carefully and thoroughly.
The report should contain the following sections:
1. Introduction: describe the problem, the study aims and the potential advantages of the RF approach to be used (10 marks)
2. Methods: describe the data, provide a description of the RF model and any data pre-processing, etc that will need to be undertaken (10 marks)
3. Results: describe the Random Forest model, its tuning, evaluation and its application. This includes the tuning results (is it a good model?) and the results of the RF model prediction. Compare the RF prediction results with the OLS predictions (30 marks)
4. Discussion: critically evaluate the model, the results and the method (limitations, assumptions, etc.), linking back to the literature, and suggest any areas of future / further work (30 marks)
Up to 20 marks will be awarded for presentational clarity, correct and consistent referencing and critical reflection. NB The indication of marks in this assignment is a guide to where your effort should be directed; students should remember that the assignments will be marked using the standard School of Geography marking criteria for Masters assignments.
Please read through this assignment carefully before starting.

In terms of R coding:
• you will be given code to clean data and construct a standard OLS regression model of AirBnb rental price;
• you will then construct a Random Forest model of rental price, applying a tuning grid to determine the best model;
• finally, you will then apply this model to some data in order to predict rental prices for specific potential AirBnb properties given some information about them (imagine you are advising a large property owner about which properties in their portfolio to put on the market);
In terms of the project assignment, you will then describe what you have done, critically evaluate the results and the approach, and suggest potential next steps and areas for further work.
2. Packages
You will likely need the following packages installed and loaded. You may have used many of them in previous practical sessions.
library(tidyverse)
library(rgdal)
library(sf)
library(tmap)
library(randomForest)
library(ranger)
library(caret)
library(scales)
3. Data
All the data needed for this practical has been stored on the VLE for you to download. The project data is of AirBnb lettings in Manchester in November 2019. These were downloaded from http://insideairbnb.com/get-the-data.html.
There are 3 files in the assignment data zip file:
• AirBnb listing data table (“listings.csv”) with data to use to construct the model;
• a data table of new potential AirBnb listings ("potential_rentals.RData") to test the model;
• an RData file of the counties in Georgia, USA ("georgia.RData")
You should load and examine the listings data:
The listings dataset includes a number of attributes that are of potential interest to a number of different analyses. However, for this task, variables such as price, beds, bedrooms, bathrooms, property type, room type, reviewer scores, etc. could be considered.

# read the data
listings = as_tibble(read.csv("listings.csv", stringsAsFactors = F))
str(listings)
listings[, c("id", "price", "beds", "property_type", "bedrooms", "bathrooms", "room_type")]

## # A tibble: 4,848 x 7
##        id price    beds property_type bedrooms bathrooms room_type
## 1   68951 $65.00      6 House                        1   Entire home/apt
## 2   85109 $60.00      1 Apartment                    1.5 Private room
## 3  157612 $34.00      2 Loft                 2       1   Entire home/apt
## 4  159189 $55.00      2 House                1       1   Shared room
## 5  163826 $950.00     1 Apartment            1       1   Private room
## 6  283495 $60.00      1 House                1       1   Private room
## 7  299194 $50.00      1 Chalet               1       1.5 Entire home/apt
## 8  303111 $22.00      1 Townhouse            1       1   Private room
## 9  310742 $37.00      1 Apartment            1       1   Private room
## 10 332580 $48.00      1 Apartment            1       1   Private room
## # … with 4,838 more rows
4. Modelling and an initial OLS Regression model
For this assignment task, you will construct a model of the AirBnb rental price (price in the data) using the following predictor variables:
• number of people the rental accommodates (accommodates);
• number of bathrooms (bathrooms);
• whether there is a cleaning fee or not (cleaning_fee - this variable will have to be manipulated);
• whether the property type is a House or Other (the property_type variable will need to be manipulated);
• whether the room type is Private or Shared (the room_type variable will need to be manipulated);
• the distance to the city centre.
Formally you will train the predictive model of the price:
$$Price = \beta_0 + \beta_1 Accommodates + \beta_2 Bathrooms + \beta_3 CleaningFee + \beta_4 PropertyTypeHouse + \beta_5 PropertyTypeOther + \beta_6 RoomTypePrivate + \beta_7 RoomTypeShared + \beta_8 CityDistance$$
You will use this to predict the price for the potential properties in the “potential_rentals.RData” file.
An initial standard linear regression model of this can be constructed after some cleaning of the price variable. The code below transforms and cleans the two variables relating to price (price and cleaning_fee), after defining a function that converts dollars to numbers:
# convert prices to numbers
dollar_to_number = function(x) {
  x = gsub("[\\$]", "", x)
  x = gsub(",", "", x)
  x = as.numeric(x)
  x
}
listings$price = dollar_to_number(listings$price)
listings$cleaning_fee = dollar_to_number(listings$cleaning_fee)
# convert cleaning fee to a binary
listings$cleaning_fee = (listings$cleaning_fee > 0) + 0
listings$cleaning_fee[is.na(listings$cleaning_fee)] = 0
We can examine the outputs:
listings %>%
  select(price, accommodates, bathrooms, cleaning_fee, property_type, room_type) %>%
  drop_na()

## # A tibble: 4,846 x 6
##    price accommodates bathrooms cleaning_fee property_type room_type
## 1     65            4       1              1 House         Entire home/apt
## 2     60            2       1.5            0 Apartment     Private room
## 3     34            3       1              1 Loft          Entire home/apt
## 4     55            2       1              0 House         Shared room
## 5    950            2       1              0 Apartment     Private room
## 6     60            2       1              0 House         Private room
## 7     50            4       1.5            1 Chalet        Entire home/apt
## 8     22            2       1              0 Townhouse     Private room
## 9     37            1       1              0 Apartment     Private room
## 10    48            2       1              1 Apartment     Private room
## # … with 4,836 more rows
And fit a model:
summary(lm(price ~ accommodates + bathrooms + cleaning_fee +
             factor(property_type) + factor(room_type),
           data = listings[!is.na(listings$price),]))
This is a very poor model for many reasons: weak fit, variables with seemingly little relationship to price, too many room and property types, etc. The end result is that it is very difficult to be confident that a predictive model could be constructed from this data: the data and the regression model are not much use at this stage, and both need to be thought about. This is done in the next section.
5. Data Pre-processing
It is obvious that the listings data are messy, need to be cleaned and new variables need to be created:
• data describing prices need to be converted to numbers (as above), and the distribution of the target variable (price) should be checked for normality and transformed if necessary;
• the categorical variables such as property types and room types should be reduced to fewer categories and converted to binary variables (i.e. in 1 or 0 form);
• gaps in the data (either NA or missing data) need to be filled or the records removed;
• the distance to the city centre needs to be included.
First let's go back to the rental price variable, price in the data, and examine it as in Figure 1.

# convert price to numbers
listings$price = dollar_to_number(listings$price)
hist(listings$price, col = "salmon", breaks = 150)

This has a classic skewed distribution and needs to be transformed. The usual routes are logs or square roots. Logs look to be a good fit - see Figure 2. We can log the rental prices.

hist(log(listings$price), col = "cornflowerblue")

And get rid of any records that have a rental price of zero (which would generate infinite logs!):

listings = listings[listings$price > 0,]
listings$log_price = log(listings$price)
summary(listings$log_price)

##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 2.079   3.401  3.970 4.016   4.489 9.421

Next we can reduce some of the type variables. Here it looks like Apartment and House are the types that are of most interest. The rest can be put into Other:

index = listings$property_type == "Apartment" | listings$property_type == "House"
listings$property_type[!index] = "Other"

Then we can convert some of the variables to binary variables, including cleaning_fee as before, as well as some others that might potentially be interesting:
Figure 1: The raw AirBnb price data.

Figure 2: The log of the AirBnb price data.
listings$cleaning_fee = dollar_to_number(listings$cleaning_fee)
# convert to binary variables
listings$cleaning_fee = (listings$cleaning_fee > 0) + 0
listings$cleaning_fee[is.na(listings$cleaning_fee)] = 0
# others
listings$property_type_House = (listings$property_type == "House") + 0
listings$property_type_Other = (listings$property_type == "Other") + 0
listings$room_type_Private_room = (listings$room_type == "Private room") + 0
listings$room_type_Shared_room = (listings$room_type == "Shared room") + 0
It is also possible to fill gaps (empty values or NAs) in the data by applying the median value:

listings$bathrooms[is.na(listings$bathrooms)] = median(listings$bathrooms, na.rm = T)
listings$beds[is.na(listings$beds)] = median(listings$beds, na.rm = T)

Finally the distance to the city centre can be calculated. The code below creates a point sf layer using the latitude and longitude of Manchester city centre:

manc_cc = st_as_sf(
  data.frame(city = "manchester",
             longitude = -2.23743,
             latitude = 53.48095),
  coords = c("longitude", "latitude"), crs = 4326)

Then the distance in km to the listings observations can be calculated from their locations:

listings$ccdist <- as.vector(
  st_distance(st_as_sf(listings, coords = c("longitude", "latitude"), crs = 4326),
              manc_cc))/1000

And the variables can be allocated to a new data table and checked:

data_anal = listings[, c("log_price", "accommodates", "beds", "bathrooms",
                         "cleaning_fee", "property_type_House", "property_type_Other",
                         "room_type_Private_room", "room_type_Shared_room", "ccdist")]
summary(data_anal)

##    log_price     accommodates         beds          bathrooms
##  Min.   :2.079   Min.   : 1.000   Min.   : 0.000   Min.   :0.000
##  1st Qu.:3.401   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.:1.000
##  Median :3.970   Median : 2.000   Median : 1.000   Median :1.000
##  Mean   :4.016   Mean   : 3.393   Mean   : 1.956   Mean   :1.323
##  3rd Qu.:4.489   3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:1.500
##  Max.   :9.421   Max.   :30.000   Max.   :20.000   Max.   :8.000
##   cleaning_fee   property_type_House property_type_Other room_type_Private_room
##  Min.   :0.000   Min.   :0.0000      Min.   :0.0000      Min.   :0.0000
##  1st Qu.:0.000   1st Qu.:0.0000      1st Qu.:0.0000      1st Qu.:0.0000
##  Median :1.000   Median :0.0000      Median :0.0000      Median :0.0000
##  Mean   :0.618   Mean   :0.3846      Mean   :0.2113      Mean   :0.4909
##  3rd Qu.:1.000   3rd Qu.:1.0000      3rd Qu.:0.0000      3rd Qu.:1.0000
##  Max.   :1.000   Max.   :1.0000      Max.   :1.0000      Max.   :1.0000
##  room_type_Shared_room     ccdist
##  Min.   :0.00000       Min.   : 0.01362
##  1st Qu.:0.00000       1st Qu.: 1.85816
##  Median :0.00000       Median : 3.83137
##  Mean   :0.01052       Mean   : 5.40314
##  3rd Qu.:0.00000       3rd Qu.: 6.97050
##  Max.   :1.00000       Max.   :33.50063
6. A refined OLS Regression initial model
Having made some decisions about the data attributes and done some pre-processing, a second OLS regression model can be fitted, this time using the log of rental price as the target variable. The code below fits the model and examines the results:
reg.mod = as.formula(log_price ~ accommodates + beds + bathrooms +
                       cleaning_fee + property_type_House + property_type_Other +
                       room_type_Private_room + room_type_Shared_room + ccdist)
m = lm(reg.mod, data = data_anal)
summary(m)

##
## Call:
## lm(formula = reg.mod, data = data_anal)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9573 -0.3125 -0.0639  0.2240  5.0746
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)              3.973292   0.027010 147.104  < 2e-16 ***
## accommodates             0.109807   0.005819  18.872  < 2e-16 ***
## beds                     0.013848   0.007644   1.812   0.0701 .
## bathrooms                0.065108   0.013776   4.726 2.35e-06 ***
## cleaning_fee            -0.094154   0.016205  -5.810 6.64e-09 ***
## property_type_House     -0.123511   0.019228  -6.424 1.46e-10 ***
## property_type_Other      0.091248   0.020419   4.469 8.05e-06 ***
## room_type_Private_room  -0.588929   0.019328 -30.470  < 2e-16 ***
## room_type_Shared_room   -0.766737   0.074486 -10.294  < 2e-16 ***
## ccdist                  -0.011071   0.001581  -7.001 2.89e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5158 on 4836 degrees of freedom
## Multiple R-squared:  0.5376, Adjusted R-squared:  0.5368
## F-statistic: 624.8 on 9 and 4836 DF,  p-value: < 2.2e-16
This is a much improved model. There is obviously some collinearity between variables - one would expect beds, bedrooms and accommodates to be explaining the same variation, for example - but as this is a predictive model rather than an inferential one (for process understanding) this is less important.
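One quick way to gauge that collinearity - a sketch rather than part of the required workflow, and assuming the car package is installed - is to compute variance inflation factors for the fitted model:

# sketch: variance inflation factors for the refined OLS model (assumes the car package)
# values above about 5-10 would suggest problematic collinearity
library(car)
vif(m)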
The relative importance of each predictor variable can also be examined
varImp(m, scale = FALSE)
##                          Overall
## accommodates           18.871820
## beds                    1.811603
## bathrooms               4.726187
## cleaning_fee            5.810241
## property_type_House     6.423528
## property_type_Other     4.468692
## room_type_Private_room 30.470300
## room_type_Shared_room  10.293665
## ccdist                  7.000975
And the model predictions can be compared with the actual, observed logged price values, as in Figure 3.
ggplot(data.frame(Observed = data_anal$log_price, Predicted = m$fitted.values),
       aes(x = Observed, y = Predicted)) +
  geom_point(size = 1, alpha = 0.5) +
  geom_smooth(method = "lm", col = "red")
Figure 3: Predicted (modelled) values of logged AirBnb rental prices against observed ones.
Then, the model can be applied to the test data to make predictions of the likely AirBnb rental price for potential new AirBnb properties. Load the potential rentals listings and examine them:
load("potential_rentals.RData")
data.frame(potential_rentals)

##    ID accommodates beds bathrooms cleaning_fee property_type_House
## 1   1            4    2         1            1                   0
## 2   2            2    1         1            0                   0
## 3   3            2    1         1            1                   0
## 4   4            2    1         1            1                   0
## 5   5            6    4         2            1                   0
## 6   6            2    1         1            1                   0
## 7   7            6    2         1            1                   0
## 8   8            2    1         1            1                   1
## 9   9            2    1         1            1                   0
## 10 10            4    2         1            0                   0
##    property_type_Other room_type_Private_room room_type_Shared_room ccdist
## 1                    0                      0                     0    9.2
## 2                    0                      1                     0    5.4
## 3                    1                      1                     0    1.3
## 4                    1                      1                     0    1.0
## 5                    0                      0                     0    0.5
## 6                    1                      0                     0    3.9
## 7                    1                      0                     0    0.1
## 8                    0                      1                     0    1.5
## 9                    1                      0                     0    1.4
## 10                   1                      0                     0    0.3
This is a data table of 10 properties that the owners are thinking of putting on the market with AirBnb. The regression model above can be used to predict the market rate, using the predict function and the antilog function exp to turn the result back to dollars:
pred = exp(predict(m, newdata = potential_rentals))
pred

##         1         2         3         4         5         6         7         8
##  74.38954  37.45533  39.08095  39.21097 111.95008  68.42777 112.27222  31.45825
##         9        10
##  70.34817  98.81523

And with a bit of cleaning, and rounding to the nearest $5, the recommended prices would be as below:

data.frame(ID = potential_rentals$ID,
           `OLS Price` = paste0("$", round(pred/5)*5))

##    ID OLS.Price
## 1   1       $75
## 2   2       $35
## 3   3       $40
## 4   4       $40
## 5   5      $110
## 6   6       $70
## 7   7      $110
## 8   8       $30
## 9   9       $70
## 10 10      $100
7. Summary of the task
Your task is to create a tuned Random Forest model, use it to predict prices for these potential AirBnb properties and compare the results with the OLS ones generated in sub-section 6 above.
So for the assignment task you will:
• split the data into training and validation subsets;
• create and tune a Random Forest model with the training subset;
• evaluate the model using the validation subset;
• apply the model to the new properties data to predict their price;
• compare the results with the OLS prediction;
• write up the assignment in the way suggested in the Part 1 Overview.
An illustration of doing these using a Random Forest model is provided in Part 2.
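As an indication of the first step only (not a model answer), a split of the Part 1 data_anal table into training and validation subsets might look something like the sketch below, assuming data_anal and the caret package are available:

# a minimal sketch of the train/validation split (assumes data_anal from Section 5)
library(caret)
set.seed(123)  # for reproducibility
train.index = as.vector(createDataPartition(data_anal$log_price, p = 0.7, list = FALSE))
train_anal = data_anal[train.index, ]   # training subset
test_anal  = data_anal[-train.index, ]  # validation subset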
Part 2: Random Forest methods, example and illustration

1. Random Forests - overview and background
The second part of this document describes the construction of a Random Forest model. Random Forests (Breiman 2001) were introduced in Week 17 and implemented using the caret package, which provided a wrapper for the implementation of Random Forests in the randomForest package.
Random Forests are an extension of regression trees, which are a kind of decision tree. Decision trees have a hierarchical structure and use a series of binary rules that seek to partition the data in order to determine or predict an outcome. If you have played the games "20 Questions" or "Guess Who" (https://en.wikipedia.org/wiki/Guess_Who%3F) then you will have constructed your own decision tree to get to the answer. Decision Trees have a Root Node at the top which performs the first split. In Guess Who my strategy was always to try to split the potential solutions by asking whether the person wore anything on their head (i.e. to include both glasses and hats - my kids still hate me!). From the root node, branches connect to other nodes, with all branches relating to a yes or no answer, connecting to any subsequent intermediary nodes (and associated divisions of potential outcomes) before connecting to Terminal Nodes with the predicted outcomes.
Regression trees are a type of decision tree that create supervised learning predictive models. There are many methodologies for constructing regression trees but one of the oldest is the classification and regression tree (CART) approach developed by Breiman (1984). Regression trees generate results that are easily interpreted: the associations between co-variates (predictor variables) and the target variable are clear and evident, and it is clear which variables are important in the model through the nodes that reduce the SSE (sum of squared errors) the most.
However, there can be a number of problems that the user should be aware of. The main issue is that they are sensitive to the initial data split.

Typically the initial partitions at the top of the tree will be similar, but with differences lower down. This is because later (deeper) nodes tend to overfit to specific sample data attributes in order to further partition the data. As a result, samples that are only slightly different can result in variable models and differences in predicted values. More formally, this high variance problem causes model instability: the results and predictions are sensitive to the initial training data sample. This means that single regression trees can have poor predictive accuracy.
In order to overcome the high variance problem of single regression trees described above, Bootstrap aggregating (bagging) was proposed by Breiman (1996) to improve regression tree performance. Bagging, as the name suggests, generates multiple models with the same parameters and averages the results from multiple trees. This reduces the chance of over-fitting as might arise with a single model and improves model prediction. The bootstrap sample in bagging will on average contain about 63% of the training data, with about 37% left out of the bootstrapped sample. This is the out-of-bag (OOB) sample, which is used to determine the model's accuracy through a cross-validation process.
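The 63% / 37% figures follow from sampling with replacement: with n draws, the chance that a given observation is never selected is (1 - 1/n)^n, which tends to about 0.37 as n grows. A quick simulation sketch (illustrative only) confirms this:

# sketch: average fraction of observations that appear in a bootstrap sample
n = 1000
frac_in_bag = replicate(500, length(unique(sample(1:n, replace = TRUE))) / n)
mean(frac_in_bag)  # approximately 0.632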
There are three steps to Bagging:
1. Create a number of samples from the training data. These are termed Bootstrapped samples because they are repeatedly sampled from a training set, before the model is computed from them. These samples contain slightly different data but with the same distribution properties as the full dataset.
2. For each bootstrap sample create (train) a regression tree.
3. Determine the average predictions from each tree, to generate an overall average predicted value.
So bagging regression trees results in a predictive model that overcomes the problems of high variance, and thus reduced predictive power, in a single tree model.
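To make these three steps concrete, a minimal hand-rolled sketch of bagged regression trees is shown below. It uses the built-in mtcars data and the rpart package purely as an illustration - it is not part of the assignment workflow:

# illustration only: bagging regression trees by hand with rpart and mtcars
library(rpart)
set.seed(123)
n_bags = 100
preds = matrix(NA, nrow = nrow(mtcars), ncol = n_bags)
for (b in 1:n_bags) {
  # 1. bootstrap sample of the training data
  idx = sample(1:nrow(mtcars), replace = TRUE)
  # 2. train a regression tree on the bootstrap sample
  tree_b = rpart(mpg ~ ., data = mtcars[idx, ])
  # 3. store this tree's predictions
  preds[, b] = predict(tree_b, newdata = mtcars)
}
# the bagged prediction is the average prediction across all trees
bagged_pred = rowMeans(preds)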
However, there can still be problems with bagging regression trees, which may exhibit tree correlation (i.e. the trees in the bagging process are not completely independent of each other because all the predictors are considered at every split of every tree). As a result, trees from different bootstrap samples will have a similar structure to each other, and this can prevent bagging from optimally reducing the variance of the prediction and the performance of the model.
Random Forests (Breiman 2001) seek to reduce tree correlation. They build large collections of decorrelated trees by adding randomness to the tree construction process. They do this by using a bootstrap process and by split-variable randomization. The bootstrap is similar to bagging, repeatedly resampling from the original sample; trees that are grown from a bootstrap resampled data set are more likely to be decorrelated. The split-variable randomization introduces random splits and noise to the response, limiting the search for the split variable to a random subset of the variables.
The basic regression random forest algorithm proceeds as follows:
1. Select the number of trees (`ntrees`)
2. for i in `ntrees` do
3. | Create a bootstrap sample of the original data
4. | Grow a regression tree to the bootstrapped data
5. | for each split do
6. | | Randomly select `m` variables from `p` possible variables
7. | | Identify the best variable/split among `m`
8. | | Split into two child nodes
9. | end
10. end
2. Random Forests - implementation in R
The case for using Random Forests is developed in considerable detail above: they generate stable results and are able to capture the nuance in the often non-linear relationships between variables. Random Forests were specifically developed to overcome the problems of high variance and reduced predictive power associated with regression trees and bagged regression trees.
Here the caret implementation is not used: it is slow and its tuning options are limited. There are a greater number of parameters that can be tuned in the ranger C++ implementation of Breiman's random forest algorithm than in the caret implementation.
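For orientation, the sketch below shows a minimal ranger call with the main tuning arguments used later in this example (num.trees, mtry, min.node.size and sample.fraction). The mpg ~ . formula and the built-in mtcars data are placeholders for illustration, not the assignment data:

library(ranger)
# illustration only: a minimal ranger call showing the main tuning arguments
rf.sketch <- ranger(
  formula         = mpg ~ .,   # placeholder formula
  data            = mtcars,    # placeholder data
  num.trees       = 500,       # number of trees grown
  mtry            = 3,         # number of variables tried at each split
  min.node.size   = 5,         # minimum terminal node size
  sample.fraction = 0.8,       # fraction of the data sampled for each tree
  seed            = 123
)
rf.sketch$prediction.error     # the out-of-bag (OOB) error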
4. Data
The Random Forest will be illustrated using the georgia data. This is a spatial dataset but we will use the attribute table to generate an RF model of median income (MedInc).
load("georgia.RData")
qtm(georgia, "MedInc", frame = F)
The target variable can be converted to 1000s of dollars:
georgia$MedInc = georgia$MedInc/1000

5. Worked Example
First load the packages:
library(ranger)
library(randomForest)
library(caret)
And then create training and validation data splits. The createDataPartition function from caret ensures that the target variable has the same distribution in the training and validation (test) splits:
set.seed(123) # reproducibility
train.index = createDataPartition(georgia$MedInc, p = 0.7, list = F)
summary(georgia$MedInc[train.index])

##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 23.85   29.87  34.33 37.10   41.20 77.53

summary(georgia$MedInc[-train.index])

##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 23.46   29.57  33.74 37.27   41.45 81.60

train_x = st_drop_geometry(georgia)[train.index,]
test_x = st_drop_geometry(georgia)[-train.index,]
Then we can create an initial model using the RF implementation in the randomForest package. Here we want to determine an appropriate number of trees, as shown in Figure 4:
reg.mod = MedInc ~ PctRural + PctBach + PctEld + PctFB + PctPov + PctBlack
rf1 <- randomForest(
  formula = reg.mod,
  ntree = 500,
  data = train_x
)
# number of trees with lowest error
which.min(rf1$mse)
## [1] 328
# plot!
plot(rf1)
Figure 4: The error with different numbers of trees.
So for this data 500 trees seems sensible. If the value (the tree number with the lowest error) was near to 500, this might suggest that the number of trees needs to be increased. In this case the ntree parameter would have to be increased to, say, 1000 and, if that did not work, 5000, etc. Increasing the number of trees increases processing time.
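For example (a sketch only), the model above could be refitted with a larger forest like this:

# refit with more trees if the error minimum sits near the current ntree value
rf1 <- randomForest(formula = reg.mod, ntree = 1000, data = train_x)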
Now we can set up a tuning grid to evaluate different parameter values that are passed to the RF, but this time using the ranger implementation. You should examine the help for ranger.
params <- expand.grid(
  mtry = c(3:6),  # the max value should be equal to the number of predictors
  node_size = seq(3, 15, by = 2),
  samp_size = c(.65, 0.7, 0.8, 0.9, 1)
)
# have a look!
dim(params)
## [1] 140 3
head(params)
##   mtry node_size samp_size
## 1    3         3      0.65
## 2    4         3      0.65
## 3    5         3      0.65
## 4    6         3      0.65
## 5    3         5      0.65
## 6    4         5      0.65
tail(params)
##     mtry node_size samp_size
## 135    5        13         1
## 136    6        13         1
## 137    3        15         1
## 138    4        15         1
## 139    5        15         1
## 140    6        15         1
Now a loop can be set up that passes each combination of parameters in turn to the RF algorithm, with the error results saved off!
# define a vector to save the results of each iteration of the loop
rf.grid = vector()
# now run the loop
for(i in 1:nrow(params)) {
  # create the model
  rf.i <- ranger(
    formula         = reg.mod,
    data            = train_x,
    num.trees       = 500,
    mtry            = params$mtry[i],
    min.node.size   = params$node_size[i],
    sample.fraction = params$samp_size[i],
    seed            = 123
  )
  # add the OOB error to rf.grid
  rf.grid <- c(rf.grid, sqrt(rf.i$prediction.error))
  # to see progress
  if (i%%10 == 0) cat(i, "\t")
}

## 10   20   30   40   50   60   70   80   90   100   110   120   130   140
Now the results can be inspected and the best performing combination of parameters extracted using which.min:
# add the results to params
params$OOB = rf.grid
params[which.min(params$OOB),]

##    mtry node_size samp_size      OOB
## 57    3         3       0.8 4.914253
These can be assigned to best_vals and passed to a final model:
best_vals = unlist(params[which.min(params$OOB),])
rfFit = ranger(
  formula         = reg.mod,
  data            = train_x,
  num.trees       = 500,
  mtry            = best_vals[1],
  min.node.size   = best_vals[2],
  sample.fraction = best_vals[3],
  seed            = 123,
  importance      = "impurity"
)
And the final model can be evaluated by using it to predict median income values for the test data:
pred.rf = predict(rfFit, data = test_x)$predictions
and the model accuracy evaluated using the postResample function from the caret package:
postResample(pred = pred.rf, obs = test_x$MedInc)

##      RMSE  Rsquared       MAE
## 5.9837680 0.7702331 3.7402415
The R squared (R²) value tells us that the final model explains about 77% of the variation in median income.
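As a pointer towards the comparison required in the assignment, the same training and test split can be used to evaluate an OLS model of MedInc with postResample. This is a sketch added for illustration, not part of the original worked example:

# sketch: compare the RF accuracy above with an OLS model fitted to the same split
ols.fit = lm(reg.mod, data = train_x)
pred.ols = predict(ols.fit, newdata = test_x)
postResample(pred = pred.ols, obs = test_x$MedInc)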
We can also plot the variable importance, as in Figure 5, which shows how the variables are contributing to the model:

data.frame(name = names(rfFit$variable.importance),
           value = rescale(rfFit$variable.importance, c(0,100))) %>%
  arrange(desc(value)) %>%
  ggplot(aes(reorder(name, value), value)) +
  geom_col() +
  coord_flip() +
  xlab("") +
  theme(axis.text.y = element_text(size = 7))

And of course predicted and observed data can be compared graphically as before:

data.frame(Predicted = pred.rf, Observed = test_x$MedInc) %>%
  ggplot(aes(x = Observed, y = Predicted)) +
  geom_point(size = 1, alpha = 0.5) +
  geom_smooth(method = "lm", col = "red")

Finally, the RF model can be used to predict median income for unknown or future observations, given estimates of the input variables:

# create a data.frame
test_df = data.frame(PctRural = seq(10, 50, 10),
                     PctBach = seq(10, 50, 10),
                     PctEld = seq(10, 50, 10),
                     PctFB = seq(10, 50, 10),
                     PctPov = seq(10, 50, 10),
                     PctBlack = seq(10, 50, 10))
# predict median income ($1000s)
round(predict(rfFit, data = test_df)$predictions, 3)
## [1] 44.423 36.366 32.259 31.963 32.193
Figure 5: The variable importance in the RF model of median income.
References
Breiman, Leo. 1984. Classification and Regression Trees. Chapman & Hall/CRC.

———. 1996. "Bagging Predictors." Machine Learning 24 (2). Springer: 123–40.

———. 2001. "Random Forests." Machine Learning 45 (1). Springer: 5–32.