Regression Trees, Bagged Trees and Random Forests
Lex Comber January 2020
Contents

1. Overview
   Packages
   Data
2. Decision Trees: finding structure in the data
3. Regression Trees
4. Bagging: Bootstrap aggregating
5. Random Forests
6. Summary
References
1. Overview
This additional practical provides some technical background to the Assignment. It will support your understanding in developing your analyses for the Assignment. It describes a sequence of developments in thinking that has led to an evolution of decision tree models, starting with Classification and Regression Trees, then Bootstrap Aggregation (Bagging) and finally Random Forests. Each of these seeks to identify structure in the data in order to create a model.
This supporting practical should be undertaken AFTER doing the Week 17 practical and BEFORE starting the Assignment.
Packages
You will need to load the following packages:
library(sf)
library(tidyverse)
library(tmap)
library(rpart)
library(rpart.plot)
library(visNetwork)
library(caret)
library(randomForest)
library(ranger)
Data
You will need to load the following RData file.
load("Assignment_prep.RData")
oa_sf

The oa_sf object that is loaded is a spatial dataset of Output Areas (OAs) for Liverpool. OAs are census areas, designed to contain typically ~300 people. The oa_sf data table contains the following UK 2011 population census attributes for each OA: unemployment (unmplyd) as a proxy for economic wellbeing; the percentages of the population aged under 16, 16-24, 25-44, 45-64 and over 65 years as lifestage indicators (u16, u25, u45, u65 and o65 respectively); the percentages of one person households (OnePH), one family households (OneFamH) and people with a bachelors degree (Degree); the percentages of very broad ethnic groups (White, Mixed, Asian, Black and Other); and the percentage of the OA containing greenspace (gs_area). The unemployment and age data were from the 2011 UK population census (https://www.nomisweb.co.uk) and the greenspace proportions were extracted from the Ordnance Survey Open Greenspace layer (https://www.ordnancesurvey.co.uk/opendatadownload/products.html). The spatial frameworks were from the EDINA data library (https://borders.ukdataservice.ac.uk) and the OA data are projected to the OSGB projection (EPSG 27700).
The models in this practical will all use unmplyd (the unemployment percentage) as the target variable.

The code below selects variables from oa_sf to create anal_data and then uses the createDataPartition function from the caret package to generate some training and testing data. These will be used to construct and evaluate the different tree-based learning models of unmplyd (the proportion of the population who are unemployed).
oa_sf %>% st_drop_geometry() %>%
  select(-c(code, Easting, Northing)) -> anal_data
# create training and testing data
set.seed(1234) # for reproducibility
train.index = createDataPartition(anal_data$unmplyd, p = 0.7, list = F)
# split into training and validation (testing) subsets:
train_x = anal_data[train.index,]
test_x = anal_data[-train.index,]

You can examine the distributions of the target variable in the two subsets:

summary(anal_data$unmplyd[train.index])
summary(anal_data$unmplyd[-train.index])

2. Decision Trees: finding structure in the data

Data explorations can reveal information about data distributions in 1D (e.g. histograms), 2D (e.g. correlations between variables) and 3D (e.g. correlations between variables across groups), with some refinement using polar plots. However, these visual tools, partitioning and correlations provide only limited information on data structure: it is difficult to determine the multivariate interactions and relationships between variables.

One of the ways of generating understandings of such structure is through decision trees. These have a hierarchical structure and use a series of binary rules to try to partition the data, in order to determine or predict an outcome. If you have played the games "20 Questions" or "Guess Who" (https://en.wikipedia.org/wiki/Guess_Who%3F) then you will have constructed your own Decision Tree to get to the answer. In Guess Who, you ask questions that have a yes or no answer. My strategy was always to try to split the potential solutions by asking complex questions: "Does your person wear anything on their head, including glasses, hats, scarves, earrings?" – my children hate me!

In a similar way, Decision Trees have a Root Node which performs the first split. From the root node, branches connect to other nodes, with each branch relating to a yes or no answer, leading to any subsequent intermediary nodes (and associated divisions of potential outcomes) before connecting to Terminal Nodes with the predicted outcomes.

3. Regression Trees

Regression trees are a type of decision tree that create supervised learning predictive models. There are many approaches for constructing regression trees but one of the oldest is known as the classification and regression tree (CART) approach developed by Breiman (1984).

The CART approach partitions the data into subsets and then fits a simple model (a constant) to each member of the subset. This binary partitioning is repeated and successive partitions are undertaken on the different predictor variables. The aim of this recursive partitioning is to generate multiple subsets such that the members of each of the final subsets are as homogeneous as possible. Prediction is then done using the average response value of all observations in that final subset, and the results describe the series of rules that were used to create the predictions / subsets. This process can be undertaken for both continuous variables using regression trees and categorical variables using classification trees, and CART models are typically visualised as a binary tree.
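As a small self-contained illustration of this "predict with the terminal-node mean" idea, the sketch below uses the built-in mtcars data rather than the practical's data (so the object names here are illustrative assumptions, not from the Assignment):

library(rpart)
# fit a small regression tree
toy <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova",
             control = list(minsplit = 10))
# the terminal node each observation falls into
leaf <- toy$where
# the mean response within each terminal node ...
leaf_means <- tapply(mtcars$mpg, leaf, mean)
# ... is exactly what predict() returns for a regression tree
all.equal(as.numeric(leaf_means[as.character(leaf)]),
          as.numeric(predict(toy)))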
We will use the example of creating predictive models of unemployment (unmplyd) using the oa_sf data. Formally, we have a continuous response variable $Y$ and a number of input variables $X_n$, and the recursive partitioning generates $i$ predicted outcomes or regions ($R_1, \dots, R_i$). The model predicts $Y$ with a constant $c_m$ for each region $R_m$:

$$\hat{f}(X) = \sum_{m=1}^{i} c_m I(X_n \in R_m)$$

A regression tree can be fitted using the rpart package:

set.seed(124)
m1 <- rpart(
  formula = unmplyd ~ .,
  data = train_x,
  method = "anova"
)

Note the anova value passed to the method argument when fitting a regression tree. You should examine the help for rpart to see what the method parameter options are. The regression tree can be examined directly:

m1
Or using a graphic, as in Figure 1, generated using rpart.plot:

rpart.plot(m1, box.palette = "Or")
Figure 1: An rpart.plot of the regression tree.
The output of rpart.plot by default shows the percentage of the data in each node and the average unmplyd value for that branch. Each node in the plot has 2 values: the mean value of the target variable for the observations in that split and the percentage of the training data observations in the node. In this case, the tree contains 9 internal nodes and 10 terminal nodes, and indicates that only 7 out of the 15 predictor variables are being used to create the model.
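One quick way to confirm which predictors the fitted tree actually uses (a small check, not part of the original practical):

# variables appearing as splits in m1 (terminal nodes are labelled "<leaf>")
setdiff(unique(as.character(m1$frame$var)), "<leaf>")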
A more refined visualisation can be generated using the visTree function in the visNetwork package, as in the figure generated by the code below. This shows some of the characteristics of the tree in more detail, particularly the proportions of the training data that are partitioned at each branch, through the branch thickness. This makes it much easier to understand the importance of the different factors associated with the response variable, unmplyd in this case (Figure 2).
visTree(m1, legend = FALSE, collapse = TRUE, direction = 'LR')

Figure 2: A visTree plot of the regression tree.

This generates an interactive plot whose components (end points and branches) can be interrogated.
Together these visualizations and the model printout show that we start with 1111 observations in the train dataset, and that the first variable the tree splits on is whether or not the OAC class is Cosmopolitans, Suburbanites or Urbanites. One subsequent branch from this goes straight to a terminal node; the other is split by thresholds applied to the percentage with a degree, then one family households and the percentage Black.

This structure is determined by the rpart function. It identifies the variable and threshold values that reduce the SSE – the Sum of Squares Error – in this case the difference between the observed values of unmplyd and the potential group mean of the branch.
So what has happened here? The predictor variables have been partitioned in a top-down, greedy way – i.e. any partitions made are not revisited, for example because of the results of later partitions. What the model does is search the entire training data set and consider each distinct value for all of the predictor variables, with the aim of finding the predictor and value for that predictor that partitions the data into two regions ($R_1$ and $R_2$). It evaluates potential partitions by the degree to which they minimise the overall sum of squares error, SSE. Having identified this initial split, the partitioning process is repeated until some stopping criterion is reached. The result is usually a deep and complex tree.

Table 1: Optimal tuning information.

    CP      nsplit  rel error  xerror  xstd
    0.3725  0       1.0000     1.0011  0.0580
    0.0650  1       0.6275     0.6294  0.0455
    0.0270  3       0.4974     0.5456  0.0394
    0.0243  5       0.4434     0.5353  0.0390
    0.0204  6       0.4191     0.5132  0.0365
    0.0160  7       0.3987     0.4972  0.0351
    0.0147  8       0.3827     0.4887  0.0341
    0.0100  9       0.3680     0.4790  0.0328
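As an illustration of this exhaustive search, here is a minimal sketch of finding the single best split for one numeric predictor. This shows only the idea; rpart's actual implementation also handles categorical predictors and is far more efficient:

best_split <- function(x, y) {
  # candidate thresholds: each distinct value of x (dropping the smallest so
  # that both regions are non-empty)
  candidates <- sort(unique(x))[-1]
  sse <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(threshold = candidates[which.min(sse)], sse = min(sse))
}
# e.g. the best single split of unemployment on the Degree variable
best_split(train_x$Degree, train_x$unmplyd)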
Typically, what we would like is to strike an optimal balance between the depth and complexity of the tree and the prediction performance. This is done by pruning the original tree to find an optimal subtree, using a cost complexity parameter $\alpha$ that penalizes the objective function of minimising the SSE over the tree's terminal nodes $T$:

$$\text{minimize} \{ SSE + \alpha |T| \}$$

The result is a pruned tree with the lowest penalized error for a given value of $\alpha$. Smaller values of $\alpha$ typically result in more complex models (i.e. larger trees) and larger values in smaller trees. Best practice is to evaluate the effects of multiple $\alpha$ values and use cross-validation to determine the optimal subtree. However, the rpart function does this automatically: it applies a range of $\alpha$ values and undertakes a 10-fold cross-validation to determine the error for a given $\alpha$ value computed on the hold-out validation data. These can be examined:
plotcp(m1)
This shows the error (y-axis) against the cost complexity (x-axis), and in this case there are diminishing returns after 7 terminal nodes (upper x-axis), with a horizontal line through the point $|T| = 7$. Breiman (1984) suggested in practice using the smallest tree within 1 standard error of the minimum cross-validation error, so here a tree with 7 terminal nodes could be used with similar results.
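A minimal sketch of applying this 1-SE rule to m1's cptable and pruning with rpart's prune() function follows. This is illustrative only; the practical instead refines the tree by re-fitting rpart with tuned parameters below:

cptab <- m1$cptable
min.row <- which.min(cptab[, "xerror"])                         # lowest cross-validated error
threshold <- cptab[min.row, "xerror"] + cptab[min.row, "xstd"]  # minimum error + 1 SE
best.row <- min(which(cptab[, "xerror"] <= threshold))          # simplest tree within 1 SE
m1_1se <- prune(m1, cp = cptab[best.row, "CP"])                 # prune to that cp value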
Thus rpart undertakes some parameter tuning, with an optimal subtree of 9 splits, 10 terminal nodes, and a cross-validated error of around 0.48. Information about the optimal prunings can be examined in the cptable output of rpart:
round(m1$cptable, 4)
Having established the model and considered the cost complexity parameter ($\alpha$), further fine tuning of two rpart parameters can be undertaken. The first is the minimum number of data points required to attempt a split before a terminal node is created (the minsplit parameter), which has a default of 20. The second is the maximum number of internal nodes between the root and terminal nodes (the maxdepth parameter), which has a default of 30. The control argument can be used to pass a list of parameter values, and even ranges of parameter values. The code below specifies a minsplit of 30 and a maxdepth of 12:
set.seed(123)
m2 <- rpart(
  formula = unmplyd ~ .,
  data = train_x,
  method = "anova",
  control = list(minsplit = 30, maxdepth = 12)
)
m2$cptable
This does not change much! More usefully, a range of parameter values should be evaluated. The code below creates a data frame of 189 combinations of the minsplit and maxdepth parameters:

params <- expand.grid(
  minsplit = seq(10, 30, 1),
  maxdepth = seq(4, 12, 1)
)
dim(params)

## [1] 189 2

head(params)

##   minsplit maxdepth
## 1       10        4
## 2       11        4
## 3       12        4
## 4       13        4
## 5       14        4
## 6       15        4

Each combination of these can be passed to rpart in a loop to generate a table of results:
param_test <- matrix(nrow = 0, ncol = 4)
colnames(param_test) = c("minsplit", "maxdepth", "cp", "xerror")
for (i in 1:nrow(params)) {
  # get values for row i in params
  minsplit.i <- params$minsplit[i]
  maxdepth.i <- params$maxdepth[i]
  # create the model
  mod.i <- rpart(
    formula = unmplyd ~ .,
    data = train_x,
    method = "anova",
    control = list(minsplit = minsplit.i, maxdepth = maxdepth.i)
  )
  # extract the optimal complexity parameter
  min <- which.min(mod.i$cptable[, "xerror"])
  cp <- mod.i$cptable[min, "CP"]
  # get minimum error
  xerror <- mod.i$cptable[min, "xerror"]
  res.i = c(minsplit.i, maxdepth.i, cp, xerror)
  param_test = rbind(param_test, res.i)
  # uncomment for progress
  # if (i%%10 == 0) cat(i, "\t")
}
param_test = data.frame(param_test)
The table of results can be ordered and displayed:
head(param_test[order(param_test$xerror),])

##           minsplit maxdepth   cp    xerror
## res.i.43        11        6 0.01 0.4553936
## res.i.66        13        7 0.01 0.4588747
## res.i.94        20        8 0.01 0.4628424
## res.i.74        21        7 0.01 0.4652252
## res.i.139       23       10 0.01 0.4691769
## res.i.88        14        8 0.01 0.4708061
In this case the results suggest that a model parameterised in a slightly different way would improve the model. The code snippet below creates a model using these parameters and uses it to make predictions for the test_x data created at the start of this practical:
set.seed(123)
# assign the best to a vector
vals = param_test[order(param_test$xerror)[1],]
m3 <- rpart(
  formula = unmplyd ~ .,
  data = train_x,
  method = "anova",
  control = list(minsplit = vals[1], maxdepth = vals[2], cp = vals[3])
)
pred <- predict(m3, newdata = test_x)
RMSE(pred = pred, obs = test_x$unmplyd)

## [1] 4.676102
Here the optimal model predicts the proportion of unemployment in any given OA from the census variables to within about 4.68% of the actual value (the RMSE).
Regression trees generate results that are easily interpreted: the associations between the covariates (predictor variables) and the target variable are clear. It is also clear which variables are important in the model, through which variables and nodes reduce the SSE the most.
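rpart also stores its own variable importance scores (the summed reduction in SSE attributed to each variable, including surrogate splits), so a quick numeric check of which variables drive the single-tree model m3 is:

# importance scores stored in the fitted rpart object
round(m3$variable.importance, 1)
# or rescaled to 0-100, as in the importance plots used later in this practical
round(100 * m3$variable.importance / max(m3$variable.importance), 1)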
However, there are a number of problems that the user should be aware of. The main issue is that single trees are sensitive to the initial data split. Try running the code below, using different values of the p parameter passed to createDataPartition (e.g. 0.6, 0.65, 0.7, 0.75, 0.8), to see the impact that differences in the split can have on the model (look at the internal and terminal nodes):
train.index = createDataPartition(anal_data$unmplyd, p = 0.7, list = F)
train_x2 = anal_data[train.index,]
test_x2 = anal_data[-train.index,]
m.temp <- rpart(
  formula = unmplyd ~ .,
  data = train_x2,
  method = "anova"
)
m.temp$cptable
rpart.plot(m.temp)
Typically the initial partitions at the top of the tree will be similar but with differences lower down. This is because later (deeper) nodes tend to overfit to specific sample data attributes in order to further partition the data. As a result, samples that are only slightly different can result in variable models and differences in predicted values.
More formally, this high variance problem causes model instability: the results and predictions are sensitive to the initial training data sample. This means that single trees can have poor predictive accuracy.
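One way to see this instability directly (a short illustrative sketch, with an arbitrary seed) is to refit the default tree on several random 70% partitions and record the number of terminal nodes and the test RMSE each time:

set.seed(42)
sapply(1:5, function(i) {
  idx <- createDataPartition(anal_data$unmplyd, p = 0.7, list = FALSE)
  m <- rpart(unmplyd ~ ., data = anal_data[idx, ], method = "anova")
  c(leaves = sum(m$frame$var == "<leaf>"),   # tree size
    rmse = RMSE(predict(m, anal_data[-idx, ]), anal_data$unmplyd[-idx]))
})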
4. Bagging: Bootstrap aggregating
In order to overcome the high variance problem of single regression trees described above, Bootstrap aggregating (bagging) was proposed by Breiman (1996) to improve regression tree performance. Bagging, as the name suggests, generates multiple models with the same parameters and averages the results from multiple trees. This reduces the chance of over-fitting, as might arise with a single model, and improves model prediction. Each bootstrap sample in bagging will on average contain about 63% of the training data, with about 37% left out of the bootstrapped sample. This is the out-of-bag (OOB) sample, which is used to determine the model's accuracy through a cross-validation process.
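The ~63% / ~37% split can be checked with a small simulation (not part of the practical): the expected proportion of unique observations in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which tends to 1 - 1/e, approximately 0.632:

set.seed(1)
n <- nrow(train_x)
# average proportion of unique training observations across 1000 bootstrap samples
mean(replicate(1000, length(unique(sample(n, n, replace = TRUE))) / n))
## ~ 0.632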
There are three steps to Bagging (a minimal hand-rolled sketch of these steps is given after the list):
1. Create a number of samples from the training data. These are termed Bootstrapped samples because they are repeatedly sampled from the training set before the model is computed from them. These samples contain slightly different data but with the same distribution properties as the full dataset.
2. For each bootstrap sample create (train) a regression tree.
3. Determine the average predictions from each tree, to generate an overall average predicted value.
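To make these three steps concrete, here is a minimal hand-rolled sketch of bagging with rpart. It is for illustration only: the number of bags and the seed are arbitrary choices, and the caret treebag model used next is the approach the practical actually follows.

set.seed(99)
n_bags <- 100
# steps 1 and 2: a bootstrap sample and a regression tree fitted to it,
# returning that tree's predictions for the test data
boot_preds <- sapply(1:n_bags, function(b) {
  idx <- sample(nrow(train_x), replace = TRUE)
  tree.b <- rpart(unmplyd ~ ., data = train_x[idx, ], method = "anova")
  predict(tree.b, newdata = test_x)
})
# step 3: average the predictions across the trees and assess accuracy
bagged_pred <- rowMeans(boot_preds)
RMSE(bagged_pred, test_x$unmplyd)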
The code below illustrates bagging using the caret package. This undertakes cross-validation to generate a robust estimate of the true expected test error and allows the variable importance to be assessed across the bagged trees. Here, a cross-validated bagged model is constructed (with the number of folds set to 20 in trainControl):

ctrl <- trainControl(method = "cv", number = 20)
# CV bagged model
bagged_cv <- train(
  form = unmplyd ~ .,
  data = train_x,
  method = "treebag",
  trControl = ctrl,
  importance = TRUE
)
The results and the contribution of different variables can be examined, as in Figure 3:

bagged_cv

tmp = varImp(bagged_cv)
data.frame(name = rownames(tmp$importance),
           value = tmp$importance$Overall) %>%
  arrange(desc(value)) %>%
  ggplot(aes(reorder(name, value), value)) +
  geom_col(fill = "thistle") +
  coord_flip() +
  xlab("") + theme_bw()

This shows that the cross-validated RMSE is about 4.8% (it was 4.7% for the tuned single tree) and the importance of the different predictor variables. Variable importance describes the total amount that the SSE is decreased by splits over a given predictor, averaged over all trees. The variables associated with the largest decreases in SSE are considered most important, with the importance value describing the relative mean decrease in SSE compared to the most important variable on a scale of 0-100. If the model is compared to the out-of-sample test set, the error estimate is very close to the cross-validated one, and the error has been reduced to about 4.5%:

pred <- predict(bagged_cv, test_x)
RMSE(pred, test_x$unmplyd)

## [1] 4.518856

In the next section bagging with random forests is used to reduce this error further.
Figure 3: The importance of the different variables under a bagging approach.
5. Random Forests
The preceding sections have shown how bagging regression trees results in a predictive model that overcomes the problems of high variance, and thus reduced predictive power, in a single tree model. However, there can be problems with bagging regression trees, which may exhibit tree correlation (i.e. the trees in the bagging process are not completely independent of each other because all the predictors are considered at every split of every tree).

As a result, trees from different bootstrap samples will have a similar structure to each other, and this can prevent bagging from optimally reducing the variance of the predictions and the performance of the model.

Random Forests (Breiman 2001) seek to reduce tree correlation. They build large collections of decorrelated trees by adding randomness to the tree construction process. They do this by using a bootstrap process and by split-variable randomization. The bootstrapping is similar to bagging: each tree is grown from a resampled version of the original training data, and trees grown from bootstrap resampled data sets are more likely to be decorrelated. The split-variable randomization then introduces randomness into the splits, limiting the search for each split variable to a random subset of the variables.
The basic regression random forest algorithm proceeds as follows:
1.  Select the number of trees (`ntrees`)
2.  for i in `ntrees` do
3.  |  Create a bootstrap sample of the original data
4.  |  Grow a regression tree on the bootstrapped data
5.  |  for each split do
6.  |  |  Randomly select `m` variables from the `p` possible variables
7.  |  |  Identify the best variable/split among `m`
8.  |  |  Split into two child nodes
9.  |  end
10. end
The code below fits a default random forest using the randomForest package:

# set random seed for reproducibility
set.seed(789)
# default RF model
m4 <- randomForest(
  formula = unmplyd ~ .,
  data = train_x
)
m4

##
## Call:
##  randomForest(formula = unmplyd ~ ., data = train_x)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
##
##           Mean of squared residuals: 18.87316
##                     % Var explained: 65.73

pred <- predict(m4, test_x)
RMSE(pred, test_x$unmplyd)

## [1] 3.93401
The default random forest grows 500 trees and, in this case, tries variables/3 = 5 randomly selected predictor variables at each split. Plotting the error rate, as in Figure 4, shows that it stabilizes at around 250 trees and slowly decreases up to around 300 trees.
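As a quick aside, the "5 variables tried at each split" reported above is simply the randomForest regression default of mtry = floor(p/3). A minimal check, assuming train_x contains the 15 predictors described earlier plus unmplyd:

p <- ncol(train_x) - 1  # number of predictor variables
floor(p / 3)
## [1] 5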
ggplot(data.frame(error = m4$mse), aes(y = error, x = 1:500)) +
  geom_line() +
  xlab("trees") + theme_bw()
Figure 4: Error rates for different tree numbers.
The error rate in the plot above is based on the OOB sample error (see m4$mse), and we can determine the number of trees providing the lowest error rate: in this case the 219th, with an average unemployment rate error of about 4.3%.
which.min(m4$mse)
## [1] 219

sqrt(m4$mse[which.min(m4$mse)])
## [1] 4.327556
We can measure predictive accuracy without using the OOB samples by splitting the training data to create a second training and validation set (passed to randomForest as xtest and ytest):
# create training and validation data
# set random seed for reproducibility
set.seed(789)
train.index = createDataPartition(anal_data$unmplyd, p = 0.7, list = F)
train_x2 = anal_data[train.index,]
test_x2 = anal_data[-train.index,]
x_test <- train_x2[setdiff(names(train_x2), "unmplyd")]
y_test <- train_x2$unmplyd
rf_comp <- randomForest(
  formula = unmplyd ~ .,
  data = train_x2,
  xtest = x_test,
  ytest = y_test
)
# extract OOB & validation errors
oob <- sqrt(rf_comp$mse)
validation <- sqrt(rf_comp$test$mse)
The OOB and new split error rates can be visualised as in Figure 5:
tibble(
  `Out of Bag Error` = oob,
  `Test error` = validation,
  ntrees = 1:rf_comp$ntree) %>%
  gather(Metric, RMSE, -ntrees) %>%
  ggplot(aes(ntrees, RMSE, color = Metric)) +
  geom_line() +
  xlab("Number of trees")
Figure 5: OOB and new split error rates.
This demonstrates the power of random forests: the RMSE is reduced to well below 2.5% without any tuning, which is much lower than the RMSE achieved with the fully-tuned bagging model above. The model can potentially be improved further by additional tuning of the random forest model.
Random forests have a small number of tuning parameters. The main or initial consideration in tuning is to determine the number of candidate variables to select from at each split, and there are some further parameters to explore:
• ntree defines the number of trees; the aim is to have enough trees to stabilize the error but not so many as to make the model inefficient;
• mtry is the number of variables to randomly sample at each split. A common heuristic is to start with 5 values evenly spaced across the range from 2 to p (where p is the number of variables);
• sampsize indicates the number of samples to train on (the default is 63.25% of the training data). Reducing this can reduce the training time but may introduce bias, and increasing it can risk over-fitting due to increased variance. Typically values in the range 60-80% are investigated;
• nodesize is the minimum number of samples in the terminal nodes and this defines the complexity of the trees. Smaller values allow deeper, more complex trees to be grown and larger values result in shallower trees. Again this is a further bias-variance trade-off, as deeper trees risk over-fitting (due to higher variance) and shallower trees risk reduced predictive power because of increased bias;
• maxnodes sets the maximum number of terminal nodes, as a further method to control tree complexity.

As before, we can set up combinations of some of these parameters to be evaluated, and then use a loop to create a model for each parameter combination and evaluate it.
params <- expand.grid(
  mtry = seq(2, 7),
  node_size = seq(3, 9, by = 2),
  samp_size = c(.60, .65, .70, .75, .80)
)
dim(params)

## [1] 120 3

You will have noticed that the code above creating the RF models with randomForest is a bit slow. To overcome this and to evaluate the 120 combinations of parameters in params, the ranger function can be used. This is a C++ implementation of Breiman's random forest algorithm that runs much faster than the randomForest implementation in R. The code below will still take some time to run!
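As a rough, optional check of that speed claim (timings are machine-dependent, and these single fits are a sketch rather than part of the practical), the two implementations can be timed on one default 500-tree fit before running the full grid:

# time one 500-tree fit with each package (results will vary by machine)
system.time(randomForest(unmplyd ~ ., data = train_x, ntree = 500))
system.time(ranger(unmplyd ~ ., data = train_x, num.trees = 500))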
res.vec = vector()
for(i in 1:nrow(params)) {
  # create the model
  m.i <- ranger(
    formula = unmplyd ~ .,
    data = train_x,
    num.trees = 500,
    mtry = params$mtry[i],
    min.node.size = params$node_size[i],
    sample.fraction = params$samp_size[i],
    seed = 123
  )
  # add OOB error to res.vec
  res.vec <- c(res.vec, sqrt(m.i$prediction.error))
  # to see progress
  if (i%%10 == 0) cat(i, "\t")
}
## 10   20   30   40   50   60   70   80   90   100   110   120
The best performing parameter combinations can be examined:

params$OOB = res.vec
head(params[order(params$OOB),])
##     mtry node_size samp_size      OOB
## 27     4         3      0.65 4.371547
## 28     5         3      0.65 4.380909
## 81     4         5      0.75 4.384461
## 34     5         5      0.65 4.385704
## 33     4         5      0.65 4.386955
## 100    5         3      0.80 4.387234
The best random forest model uses mtry = 4, a terminal node size of 3 observations, and a sample size of 65%. We can repeat this model a number of times to get a better expectation of the error rate; in this case the expected OOB error (RMSE) ranges between roughly 4.37% and 4.43%. The results are shown in Figure 6:
RMSE.vec <- vector()
for(i in 1:100) {
  m.i <- ranger(
    formula = unmplyd ~ .,
    data = train_x,
    num.trees = 500,
    mtry = 5,
    min.node.size = 3,
    sample.fraction = .65,
    importance = 'impurity'
  )
  RMSE.vec <- c(RMSE.vec, sqrt(m.i$prediction.error))
  if (i%%10 == 0) cat(i, "\t")
}
## 10   20   30   40   50   60   70   80   90   100
Note how the importance parameter was set to 'impurity'. This allows variable importance to be assessed, as a measure of the decrease in MSE each time a variable is used for a node split in a tree. The predictive accuracy error remaining after a node split is the node impurity, and variables that reduce this are considered to be more important than those that do not. Thus, the reduction in MSE for each variable across all the trees is accumulated, and the variable with the greatest accumulated impact is considered the most important, or impactful. The distribution of the repeated errors is plotted in Figure 6, and in Figure 7 we can see that Degree has the greatest impact in reducing MSE across the trees, followed by OAC_class, Black, etc.

ggplot(data.frame(rmse = RMSE.vec), aes(x = rmse)) +
  geom_histogram(bins = 15, fill = "tomato2", col = "grey")

Figure 6: Random Forest expected error rates.

data.frame(name = names(m.i$variable.importance),
           value = m.i$variable.importance) %>%
  arrange(desc(value)) %>%
  ggplot(aes(reorder(name, value), value)) +
  geom_col(fill = "dodgerblue") +
  coord_flip() +
  xlab("")

Figure 7: Variables associated with the reduction in random forest errors.

6. Summary

This section has covered a lot of ground, from trees, to bagging, to forests. What you will have noticed is how the different tree structures can be used to understand the structure of the data. In this case, the examples have all sought to model the unmplyd (unemployment) variable in the OA census data.

The Regression Tree approach using a single tree generates outputs with structures and components that are easy to implement and interpret. Trees can handle many different types of predictor variables (sparse, skewed, continuous, categorical, etc.) without pre-processing. The user does not have to specify the form of the predictors' relationship to the response (as is required, for example, in a linear regression model). The model begins with the whole dataset and then seeks to identify variables, and values of those variables, with which to partition the data. However, single trees have two particular weaknesses. First, they can result in model instability, where slight changes in the training data can radically change the tree structure. And second, because of this sensitivity to the initial data and the tendency to over-fit lower down the tree, they can have sub-optimal predictive performance.
To address the problems of single trees in Classification and Regression Tree approaches, researchers developed ensemble methods that combine many trees into one model: Bootstrap Aggregation or Bagging approaches. These work by combining the predictions from many models. Bagging is a general approach that uses bootstrapping in conjunction with regression (or classification) models to construct an ensemble. Each model in the ensemble is used to generate a prediction for a new sample, and these predictions are then averaged to give the bagged model's prediction. The idea is that, by aggregating over many versions of the data, the approach reduces the variance in the predictions and results in more stable predictions. However, the results are less easy to interpret. A sound approach is to start with CART to provide initial parameters for Bagging approaches.

There are further problems with Bagging. Although the bootstrap samples introduce a random component into the tree building process, and therefore into the distribution of trees and predicted values for each sample, the trees in Bagging are not completely independent: all of the original predictors are considered at every split of every tree. This tree correlation prevents Bagging from optimally reducing the variance of the predicted values. The aim of Random Forests is to de-correlate the trees by adding randomness to the tree construction through random split selection and adding noise to the response. This ensemble of many independent, strong learning trees results in robust improvements in error rates. But the ensemble nature of Random Forests makes it difficult (impossible?) to understand the relationships between predictor and response variables. Again, a sound approach is to start with CART approaches as the base learner in Random Forest applications.

The key thing that these approaches support is the development of an understanding of the data and how the data properties relate to the process of interest. The importance of this links back to many of the points made in the first lecture of this module: it is critical to examine how different factors, attributes and variables inform on variations in the process of interest. Do the variables sufficiently describe the process (error rates)? Are there additional data or variables that could be included? That is, to take a critical approach to data science.
References
Breiman, Leo. 1984. Classification and Regression Trees. Chapman and Hall/CRC.

Breiman, Leo. 1996. "Bagging Predictors." Machine Learning 24 (2): 123–40.

Breiman, Leo. 2001. "Random Forests." Machine Learning 45 (1): 5–32.