
STAT318 — Data Mining
Dr
University of Canterbury, Christchurch
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Tree-Based Methods
In this section we consider tree-based methods for regression and classification.
These methods partition the feature (predictor) space into a number of regions, and then fit a simple model (for example, a constant) in each region.
The set of splitting rules used to partition the feature space can be summarized in a tree, so these approaches are known as decision tree methods.
We will focus on the Classification and Regression Trees (CART) approach.
Decision trees are great because they are:
• simple;
• easy to interpret (at least for small trees);
• scale well to large training data sets (divide-and-conquer operation); and
• the basis for state-of-the-art supervised learning methods (random forests and boosting).
Decision trees also have some drawbacks, but we’ll get to these later.

Decision Tree
A decision tree is a recursive partition of the feature space that can be displayed in a flowchart-like structure consisting of nodes and branches.
1 Root Node: A node that has no incoming branches.
2 Internal Node: A node that has exactly one incoming branch and at least two
outgoing branches.
3 Terminal (Leaf) Node: A node with exactly one incoming branch and no outgoing branches.
Internal nodes have an associated query (splitting rule) and by convention, we proceed to the left descendent node (left daughter) if we answer ‘yes’ to the query.
Here is an over-simplified decision tree for a medical triage system to illustrate the structure of a tree.
[Figure: the triage tree. The root node (1) asks 'Does the patient have difficulty breathing?'. Answering yes leads to node 2 ('send to hospital'); answering no leads to node 3, which asks 'Is the patient bleeding?'. Yes leads to node 6 ('send to hospital') and no leads to node 7 ('stay at home').]
To classify a patient we answer the queries and follow the branches. For example, a patient with no difficulty breathing and no bleeding is told to stay at home: node 1 → node 3 → node 7 = stay at home. A patient with difficulty breathing is told to go to hospital: node 1 → node 2 = send to hospital.
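To make the flowchart structure concrete, here is a minimal R sketch (my own illustration, not part of the course notes); the function and argument names are made up.

```r
# Hypothetical sketch: the triage tree written as nested queries.
# By convention, answering 'yes' to a query means taking the left branch.
triage <- function(difficulty_breathing, bleeding) {
  if (difficulty_breathing) {       # node 1 -> node 2
    "send to hospital"
  } else if (bleeding) {            # node 1 -> node 3 -> node 6
    "send to hospital"
  } else {                          # node 1 -> node 3 -> node 7
    "stay at home"
  }
}

triage(difficulty_breathing = TRUE,  bleeding = FALSE)   # "send to hospital"
triage(difficulty_breathing = FALSE, bleeding = FALSE)   # "stay at home"
```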

Splitting Rules
Qualitative Feature:
1 Multiway split: let K be the number of values the ith feature, Xi, takes in node m. Using Xi, split node m into K descendent nodes — one node for each value.
2 Binary split: the split has the form Xi ∈ S, where S is a subset of the values the ith feature takes.
Quantitative Feature:
1 Axis-parallel splits: a split using the ith feature, Xi, has the form Xi ≤ s, where s ∈ R.
2 Oblique splits: the split uses a linear combination of features and has the form
a0 + Σ_{i=1}^{p} ai Xi ≤ 0, where a0, a1, ..., ap ∈ R.
Consider a feature Car Type with three values: Sports, Family, Luxury. A multiway split and an equivalent series of binary splits are shown below.
[Figure: a multiway split of Car Type into Sports, Family and Luxury, and an equivalent series of binary splits: first Car Type in {Sports, Family} versus Luxury, then Sports versus Family.]
If the feature has many values, multiway splits tend to fragment the data too quickly, leaving insufficient training data for further splitting of the descendent nodes. Noting that a multiway split can be formed using a series of binary splits (see above), binary splits are preferred. CART only uses binary splits for qualitative features.
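As a concrete check of the counting, here is a small R sketch (my own helper, not course code) that enumerates the 2^(K−1) − 1 distinct binary splits of a qualitative feature:

```r
# Sketch: enumerate the distinct binary splits Xj in S of a qualitative feature.
binary_splits <- function(values) {
  K <- length(values)
  splits <- list()
  for (size in 1:(K - 1)) {
    for (s in combn(values, size, simplify = FALSE)) {
      # Keep only subsets containing the first level, so a subset and its
      # complement (which define the same split) are not both listed.
      if (values[1] %in% s) splits[[length(splits) + 1]] <- s
    }
  }
  splits
}

binary_splits(c("Sports", "Family", "Luxury"))
# {Sports}, {Sports, Family}, {Sports, Luxury}: 2^(3-1) - 1 = 3 possible splits
```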

Oblique and axis-parallel splits
[Figure: a two-class data set in the (X1, X2) plane. Left panel: a single oblique split separates the two classes. Right panel: axis-parallel splits produce a more complex, staircase-like decision boundary.]
, University of Canterbury 2020
STAT318 — Data Mining ,5 / 42
In this example, we have a two-class classification problem (green or yellow) with two continuous predictors (X1 and X2). The classes are linearly separable and an oblique split of the form a0 + a1 X1 + a2 X2 ≤ 0 separates the classes perfectly. However, in general, finding an optimal oblique split can be computationally expensive as there are many possibilities to consider. Oblique trees are also more difficult to interpret because each split is a linear combination of the predictors, rather than a split involving a single predictor.

CART uses axis-parallel splits (Xi ≤ s) for continuous predictors. Axis-parallel splits can produce complex decision boundary structure (see the right figure) when compared with oblique splits, but they are easier to fit. We will not be considering oblique splits any further in this course.
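The contrast between the two split types can be seen with a small simulation. This is my own sketch using the rpart package (assumed installed), not course code: the class boundary is oblique, so CART needs a staircase of axis-parallel splits rather than a single split.

```r
# Sketch: CART approximating an oblique boundary with axis-parallel splits.
library(rpart)
set.seed(1)
n  <- 400
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
class <- factor(ifelse(x1 + x2 > 0, "green", "yellow"))   # oblique boundary
fit <- rpart(class ~ x1 + x2, data = data.frame(x1, x2, class))
fit   # printing the tree shows several splits on x1 and x2, not a single split
```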

Regression Trees: Hitters data set
[Figure: scatter plot of Hits against Years for the Hitters training data; players with low salary are coloured blue and those with high salary yellow.]
To motivate regression trees, we consider a simple example. The aim here is to predict a Baseball player’s salary using:
• Hits: the number of hits in the previous year; and
• Years: the number of years playing in the major leagues.
The response variable Salary has been log transformed so it's more 'bell shaped' (Salary is measured in thousands of dollars). Players with low income are coloured blue and those with high income yellow.
Can you explain the player with a high income (yellow), zero hits and only one year playing in the major league?
How do we partition the training data?
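For readers who want to recreate the figure, here is a hedged sketch assuming the ISLR package (which provides a version of the Hitters data); the colour threshold is my own choice, not taken from the slides.

```r
# Sketch: recreate the Hits versus Years scatter plot, coloured by salary.
library(ISLR)
Hitters <- na.omit(Hitters)                 # drop players with missing Salary
high <- log(Hitters$Salary) > median(log(Hitters$Salary))   # assumed threshold
plot(Hitters$Years, Hitters$Hits,
     col = ifelse(high, "gold", "blue"),
     xlab = "Years", ylab = "Hits")
```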

Decision Tree: Hitters data set
[Figure: the partition of the (Years, Hits) feature space into regions R1, R2 and R3 using the splits Years = 4.5 and Hits = 117.5.]
A partition of the feature space for the Hitters training data. The three regions can be written as follows:
R1 = {X : Years < 4.5}
R2 = {X : Years ≥ 4.5, Hits < 117.5}
R3 = {X : Years ≥ 4.5, Hits ≥ 117.5}

We read {X : Years < 4.5} as 'the set of training data points X such that Years is less than 4.5'.

Decision Tree: Hitters data set

[Figure: the regression tree corresponding to this partition. The root split is Years < 4.5 (predicted log salary 5.11 in the left leaf); the right branch splits on Hits < 117.5, giving predicted log salaries of 6.00 and 6.74.]

The regression tree corresponding to the partition on the previous slide. The lengths of the branches are proportional to the decrease in RSS after the split is made. That is, longer branches correspond to greater reductions in RSS. (Note: this is something that the R tree() function does; not all functions plot trees in this way.)

The mean response of the training observations at each terminal node is used for making predictions. For example, a player with 6 years experience and 80 hits has a predicted log salary of 6.00 (since Salary is in thousands of dollars, that's a salary of about 1,000 × e^6 ≈ $400,000).

Observations:
• Years is the most important predictor (it's at the top of the tree and it reduces the RSS the most).
• Experienced players with a large number of hits get larger salaries.
• Simple, easy to display, and easy to explain to a non-expert.

Now we need to learn how to fit one!

Tree Building Process (Regression)

We have N training data observations {xi, yi}_{i=1}^{N}, where xi is a vector containing p predictors and yi is its response value.

Suppose we have partitioned the feature space into M distinct regions, R1, R2, ..., RM, and model the response as a constant, cm, in the mth region:

f(x) = Σ_{m=1}^{M} cm I(x ∈ Rm).

To minimize the sum of squares Σ (yi − f(xi))², we choose cm = ave(yi : xi ∈ Rm).

There are many possible partitions that can be formed. CART uses partitions consisting of nested boxes. These partitions are simple to form and the resulting model is easy to interpret.

Tree Building Process (Regression)

It is usually computationally infeasible to find a partition that minimizes the sum of squares. CART uses a top-down, greedy approach called binary recursive partitioning.

1 Top-Down: Start at the root node with all the data and then recursively split the feature space, where each split defines two new branches further down the tree.

2 Greedy: Make a series of locally optimal splits, rather than looking ahead and picking a split that will produce a better tree at some future step.

[Figure: a region of the feature space divided into two parts, labelled A and B, illustrating how the data are divided following a split.]

After each split, we end up with two independent training sets (A and B). These sets can be treated separately in parallel for further splitting. A series of locally optimal splits (greedy) does not guarantee a globally optimal tree.

Tree Building Process (Regression)

Starting with all the training data, consider splitting on the jth feature using a split point s. This defines two regions:

R1 = {X : Xj ≤ s} and R2 = {X : Xj > s}.
We seek the splitting feature j and the split point s that minimizes
min_{j,s} [ Σ_{xi ∈ R1} (yi − ȳR1)² + Σ_{xi ∈ R2} (yi − ȳR2)² ],

where ȳRm is the mean in the mth region.
If ȳR1 ≠ ȳR2, the split reduces the RSS; otherwise the RSS is unchanged. That is, the split cannot increase the RSS.
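To make this concrete, here is a from-scratch R sketch (my own illustration, not course code) that scans one numeric feature for the split point minimising the two-region RSS; scanning all p features in this way gives the locally optimal split.

```r
# Sketch: best axis-parallel split on a single numeric feature by RSS.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between points
  rss <- sapply(candidates, function(s) {
    left  <- y[x <= s]
    right <- y[x >  s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = candidates[which.min(rss)], rss = min(rss))
}

set.seed(1)
x <- runif(20)
y <- ifelse(x < 0.4, 2, 8) + rnorm(20, sd = 0.5)
best_split(x, y)   # the chosen split point should be near 0.4
```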
For N points in R^d, there are d(N − 1) possible splits to consider. For example, in R² with N = 4 there are six possible splits:
[Figure: four training points in the (X1, X2) plane with the six candidate splits marked, three of the form X1 < s and three of the form X2 < s.]
The splitting value s is taken as the midpoint between the two points. (Qualitative features are covered on the next page.)

Tree Building Process (Regression)

The best split can be found quickly by scanning through all the features. Having found the best split, we partition the training data into two regions and repeat the splitting process on these two regions.

The splitting process is repeated until a stopping condition is satisfied, for example, until no terminal node contains more than five observations.

Qualitative Features

CART considers all 2^(K−1) − 1 possible binary splits for qualitative features, where K is the number of different values the feature takes in the node. For the Car Type example above, there are three possible splits to consider. Let Xj = Car Type, then the three possible splits are:

Xj ∈ {Sports}, Xj ∈ {Sports, Family} and Xj ∈ {Sports, Luxury}.

The corresponding regions are:

R1 = {X : Xj ∈ {Sports}} and R2 = {X : Xj ∈ {Family, Luxury}};
R1 = {X : Xj ∈ {Sports, Family}} and R2 = {X : Xj ∈ {Luxury}};
R1 = {X : Xj ∈ {Sports, Luxury}} and R2 = {X : Xj ∈ {Family}}.

Recursive Binary Partition

[Figure: a partition of the (X1, X2) plane into regions whose boxes are not nested.]

This partition was not formed using recursive binary partitioning (the boxes are not nested). CART will not form a partition like this.

Recursive Binary Partition

[Figure: a recursive binary partition of the (X1, X2) plane using the splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3 and X2 ≤ t4, giving regions R1 to R5; the corresponding tree; and a perspective plot of the fitted surface.]

This is an example of a CART partition, the corresponding tree and a plot of the model (a piecewise-constant surface). Another simple regression example with one predictor is illustrated below.

[Figure: a one-predictor example with the single split x < 4. The corresponding tree has two leaves, predicting one constant (y = 2) on one side of the split and another (y = 8) on the other, so the fitted model is a step function in x.]

Pruning a Tree

The tree building process may produce good predictions on the training data, but is likely to overfit the training data.

A smaller tree might give lower variance and better interpretation at the cost of increased bias.

Rather than growing a small tree, it is better to grow a large tree and prune it back into a smaller tree.

Deep bushy trees localize the training data to relatively small regions Rm (in a similar way to KNN using a relatively small value of k). This suggests deep trees have low bias and high variance. Pruning the deep bushy tree aims to reduce the variance and reduce overfitting.

Another strategy would be to grow the tree until the reduction in RSS becomes sufficiently small. However, a seemingly worthless split early on may be followed by a useful split that reduces the RSS substantially. It is good practice to grow a large tree and then prune it back into a smaller tree.

Cost-Complexity Pruning

Let T0 denote a fully grown tree. A subtree T ⊂ T0 is any tree that can be obtained by collapsing any number of internal nodes. The cost complexity criterion is

Cα(T) = Σ_{m=1}^{|T|} Σ_{xi ∈ Rm} (yi − ȳRm)² + α|T|,

where |T| is the number of terminal nodes of tree T. For each α ≥ 0, there is a unique subtree Tα ⊂ T0 that minimizes Cα(T).

The idea here is to find a sequence of best trees of varying complexities (by changing α). Then, we choose the correct complexity for the problem at hand. The penalty parameter α governs the trade-off between tree size (complexity) and its goodness-of-fit to the training data, where large α values select smaller trees.
• To find Tα, we collapse the two terminal nodes (to their parent node) that yields the smallest gain in

RSS = Σ_{m=1}^{|T|} Σ_{xi ∈ Rm} (yi − ȳRm)².

We continue this collapsing process to produce a sequence of trees until we reach the root node. This sequence of trees must contain Tα (we will not prove this result).

• We then find the best α (the one that gives the lowest testing error) using cross-validation.

Tree Algorithm

1 Use recursive binary partitioning to grow a large tree on the training data.

2 Apply cost-complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.

3 Use K-fold cross-validation to choose α. For k = 1, 2, ..., K, do
(a) Repeat steps (1) and (2) on the training data excluding the kth fold.
(b) Evaluate the mean squared testing error of the kth fold, as a function of α.
Average the results and choose α to minimize the average error.

4 Return the subtree from step (2) that corresponds to the chosen value of α.

Step 2 gives us the best trees as a function of α: a nested sequence of subtrees running from the full tree (α = 0) down to the root node, with complexity decreasing as α increases. Step 3 uses cross-validation to find the best α (tree size). The tree from step 2 of that size is then selected.

Fully Grown Tree: Hitters data

[Figure: the fully grown regression tree for the Hitters data, with 12 terminal nodes and splits on Years, Hits, RBI, Putouts, Walks and Runs.]

This is the fully grown regression tree for the Hitters data from earlier in this section. Observations:

• There were 12 terminal nodes in the tree.
• Splits further down the tree did not reduce the RSS as much as splits higher up (as expected).
• The most useful predictors were Years, RBI (number of runs batted in) and Hits.
• There are 19 predictors in the Hitters data and the tree only used 6 of them to model the response (automatic feature selection; we didn't need to specify the most useful predictors to fit the model). Trees only use the predictors that matter (in terms of reducing the RSS) and discard the rest.
• Are we over-fitting the training data?

Testing Error: Hitters data

[Figure: training, cross-validation and test mean squared error plotted against tree size (2 to 10 terminal nodes).]

The training MSE decreases as tree size (complexity) increases (as expected, because larger trees have lower bias than smaller ones). The CV error is a reasonable approximation to the testing error and shows the characteristic U shape. For this problem, a tree size of three fits the data well. The pruned tree was given earlier.
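In R, steps 1–4 of the algorithm can be carried out with the tree package, much as in the ISLR lab. The sketch below assumes the ISLR and tree packages are installed and is not part of the course notes.

```r
# Sketch: grow a large regression tree, then use cross-validation to choose
# the cost-complexity penalty (reported as tree size) and prune.
library(ISLR)
library(tree)
Hitters <- na.omit(Hitters)
Hitters$Salary <- log(Hitters$Salary)        # work with log salary, as above

big.tree <- tree(Salary ~ ., data = Hitters) # step 1: grow a large tree
cv.out   <- cv.tree(big.tree)                # steps 2-3: pruning sequence + CV
plot(cv.out$size, cv.out$dev, type = "b",
     xlab = "Tree size", ylab = "CV deviance")
pruned <- prune.tree(big.tree, best = 3)     # step 4: subtree of the chosen size
plot(pruned); text(pruned, pretty = 0)
```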
Classification Trees

Classification trees are very similar to regression trees. The feature space is subdivided using recursive binary partitioning, but the sum of squares error is not used to choose the best split and the average is not used for predictions at terminal nodes.

In a node m (region Rm of the partition) with Nm observations, denote the fraction of observations in class k by

p̂mk = (1/Nm) Σ_{xi ∈ Rm} I(yi = k).

We classify an observation in a terminal node m to the majority class in node m:

k(m) = argmax_k (p̂mk).

Impurity Measures

Misclassification Error: I(m) = 1 − p̂mk(m) = 1 − max_k {p̂mk}.

Gini Index: I(m) = Σ_{k=1}^{K} p̂mk (1 − p̂mk).

Cross-Entropy (Deviance): I(m) = − Σ_{k=1}^{K} p̂mk log2(p̂mk) (with 0 log2(0) = 0).

A small impurity value indicates that a node contains predominantly observations from one class (we want nodes to be pure, or close to it). The choice of impurity measure doesn't really have an impact on the performance of a classification tree. CART uses the Gini index, so we'll focus on this one. For a two-class classification problem we have I(m) = 2p(1 − p), where p is the proportion in the first (or second) class.

Impurity Measures: Two-Class Problem

[Figure: misclassification error, Gini index and cross-entropy plotted against the class proportion p for a two-class problem; all three are maximised at p = 0.5.]

Impurity measures attain their maximum value when each class is equally represented at a node.

The best split in node m is the one that maximizes the reduction in impurity

Δ(I) = I(m) − (N_{2m}/Nm) I(2m) − (N_{2m+1}/Nm) I(2m + 1),

where I(m) is the impurity and Nm is the number of observations in node m (nodes 2m and 2m + 1 are the left and right daughters of node m).

Note: it is not always possible to split an impure node. In such cases the node is declared terminal and observations are classified to the majority class.

Consider a binary classification problem y ∈ {Yes, No} with 100 training observations in node m. The node has been split, giving two daughter nodes with the following structure:

node m: Yes = 50, No = 50
left daughter (2m): Yes = 25, No = 50
right daughter (2m + 1): Yes = 25, No = 0

That is, 25 of the Yes's from node m go to the left daughter after the split and the remaining 25 go right. The change in impurity is calculated as follows:

I(m) = 2(1/2)(1 − 1/2) = 1/2 (parent node)
I(2m) = 2(2/3)(1 − 2/3) = 4/9 (left daughter)
I(2m + 1) = 0 (right daughter).

The change in impurity is

Δ(I) = 1/2 − (3/4)(4/9) − (1/4)(0) = 1/6.
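The worked example is easy to verify in R; this is my own helper, not course code.

```r
# Sketch: Gini index and the reduction in impurity for the example above.
gini <- function(counts) {
  p <- counts / sum(counts)
  sum(p * (1 - p))
}

parent <- c(Yes = 50, No = 50)
left   <- c(Yes = 25, No = 50)
right  <- c(Yes = 25, No = 0)

gini(parent) -
  sum(left)  / sum(parent) * gini(left) -
  sum(right) / sum(parent) * gini(right)
# 0.1666... = 1/6, matching the calculation above
```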
Classification Tree: Heart data

[Figure: the fully grown classification tree for the Heart data, with splits on predictors such as Thal, Ca, MaxHR, ChestPain, Chol, RestBP, Age, Sex, Slope, Oldpeak and RestECG, and terminal nodes labelled Yes or No.]

The fully grown classification tree for the Heart data. The Heart data contains 303 patients with 13 predictors. The outcome Yes indicates the presence of heart disease and No indicates no heart disease.

Notice that some internal nodes have been split into terminal nodes with the same class label (both Yes or both No). How can this happen? Here is an example that illustrates this: a node with Yes = 10, No = 2 (classified Yes) is split into a left daughter with Yes = 7, No = 0 (Yes) and a right daughter with Yes = 3, No = 2 (Yes).

Hence, it is useful to report the estimated probability of being in a particular class. In the example above, the left node is pure so we are confident about the predicted Yes. However, for the right node, the estimated probability of Yes is 3/5 (not much better than flipping a coin).

Classification Tree: Heart data

[Figure: the pruned classification tree for the Heart data, using only the predictors Thal, Ca, MaxHR and ChestPain, alongside a plot of training, cross-validation and test error against tree size (5 to 15 terminal nodes).]

Cost-complexity pruning gives a simple tree with six terminal nodes. Of the 13 predictors, only four were included in the final model.

Missing Values

Deficient Observations: training or testing observations that have missing feature values (not missing response values).

If deficient observations are present in the training set, the best split is found using the known feature values. To make predictions for deficient observations, surrogate splits are used.

Let Xj < s be the splitting rule in node m, called the primary split. A surrogate split uses a different feature Xi (i ≠ j) to split node m in a way that best approximates the distribution of training observations after the primary split. If training observation x goes to the left daughter after the primary split, it should also go left using the surrogate split, etc.

Surrogate splits are used to decide whether to go left or right at a node when the splitting feature is missing. For example, consider passing x = (x1, x2, x3) = (missing, 4, 12) through a tree with the query x1 < 2. We cannot answer this query because we don't have an x1 value. If the surrogate split is x3 < 14, we proceed to the left daughter node. Otherwise, we go to the right daughter. In either case, a decision is made and we can continue moving through the tree.

Surrogate splits are only used if they perform better than the blind rule, which sends x to the daughter node with the most training observations.
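The rpart package implements CART-style surrogate splits, so this behaviour can be demonstrated directly. The data below are simulated for illustration (not course data), and the exact surrogate chosen will depend on the simulation.

```r
# Sketch: a surrogate split handling a missing value at prediction time.
library(rpart)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.3)            # correlated feature: a useful surrogate
y  <- ifelse(x1 < 0.5, 2, 8) + rnorm(n, sd = 0.5)
train <- data.frame(y, x1, x3)

fit <- rpart(y ~ x1 + x3, data = train)
summary(fit)                              # lists primary and surrogate splits

new <- data.frame(x1 = NA_real_, x3 = 0.9)
predict(fit, newdata = new)               # x1 is missing, so a surrogate decides
```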
Pros and Cons of Decision Trees

Small trees are simple, easy to explain and can be displayed graphically.

Trees can easily handle categorical predictors without having to create dummy variables and can deal with missing values.

Trees are invariant under all monotone transformations of the predictors in the training data.

Trees perform variable selection and complexity reduction automatically.

Tree-growing algorithms scale well to large training data sets.

Unfortunately, trees generally have poor predictive accuracy when compared with other methods. However, by aggregating trees, their predictive performance can be greatly improved.

[Handwritten annotations, transcribed approximately: a single fully grown tree is accurate on the training data, with low bias but high variance. Idea: average the predictions of many trees to keep the bias low while reducing the variance. We only have one training set, so we build many trees by bootstrapping: draw B resamples of the training cases by sampling with replacement (sampling with replacement is essential, otherwise the B samples would be identical), grow a deep tree on each resample, and aggregate the predictions (the mean for regression, the majority vote for classification). This is bootstrap aggregation, or bagging. Each resample leaves out some of the original training cases, which can be used to estimate the test error.]

Bagging

Deep trees suffer from high variance (small changes in the training data can dramatically change the tree produced), but tend to have low bias.

Bootstrap aggregation (bagging) is a general purpose approach for reducing the variance of a statistical learning technique (it is particularly useful for deep trees because of their high variance and low bias).

Given a set of n independent observations X1, X2, ..., Xn with common variance σ², the mean of these observations X̄ has variance σ²/n. Averaging a set of observations reduces variance.

The sample mean (from a random sample) X̄ = n⁻¹ Σ_{i=1}^{n} Xi is a normally distributed random variable (Central Limit Theorem) and its expected value is

E(X̄) = E(n⁻¹(X1 + X2 + ... + Xn)) = n⁻¹(E(X1) + E(X2) + ... + E(Xn)) = n⁻¹(μ + μ + ... + μ) = μ.

Hence, X̄ is an unbiased estimate of the population mean μ and its variance is

V(X̄) = V(n⁻¹(X1 + X2 + ... + Xn)) = n⁻²(V(X1) + V(X2) + ... + V(Xn)) (independence) = n⁻²(σ² + σ² + ... + σ²) = σ²/n,

which decreases with increasing sample size (precision increases with sample size).

Idea: Build an ensemble of bushy trees using different training data and average their predictions to reduce variance.

Bagging

Unfortunately, we usually only have one training data set. However, we can generate B different bootstrapped training data sets.

Bagging: Train our tree on the bth bootstrapped training set to get Tb(x). The prediction is given by

f̂bag(x) = (1/B) Σ_{b=1}^{B} Tb(x) (regression).

For classification, we record the prediction made by each tree and then take a majority vote.

We want our trees to have low bias and high variance, so we do not prune trees in the ensemble. Pruning trees increases the pairwise correlation between trees, which reduces the effect of averaging. Essentially, we want each tree in the ensemble to have low bias (big and bushy trees) and for each tree to make mistakes on different training observations.

Bagging: Example

[Figure: the original classification tree and eleven trees grown on bootstrap samples (b = 1, ..., 11) for a two-class problem, from The Elements of Statistical Learning, Hastie, Tibshirani and Friedman (2009).]

This is a two-class classification problem y ∈ {0, 1} with p = 5 features x1, x2, ..., x5. Clearly, each bootstrapped training set produces a different tree structure. The root node splits also use several different features. However, x1 appears to be a dominant feature because it is used to split the root node in 7 of the 11 trees.

To classify an x using this ensemble of 11 trees, we pass x through each tree and record each tree's prediction. If more than five trees classify x as 0, then x is class 0. Otherwise, x is class 1.

Random Forests

Bagged trees tend to look quite similar to each other, particularly if there are only a few strong predictors. Hence, predictions from the bagged trees will be highly (positively) correlated.

Unfortunately, averaging highly correlated quantities does not reduce variance as much as it would for uncorrelated quantities.

Random forests provide a potential improvement over bagged trees by attempting to de-correlate trees in the ensemble.
If B identically distributed (same σ², but not necessarily independent) random variables Xi have positive pairwise correlation

ρ = cov(Xi, Xj)/(σ_Xi σ_Xj) = cov(Xi, Xj)/σ² (identically distributed, so the same σ),

the variance of the sample mean is

V((1/B) Σ_{i=1}^{B} Xi) = (1/B²) (Σ_{i=1}^{B} V(Xi) + 2 Σ_{i<j} cov(Xi, Xj)) = (1/B²) (Bσ² + B(B − 1)ρσ²) = ρσ² + (1 − ρ)σ²/B.

When ρ > 0 this is larger than σ²/B, so averaging highly correlated trees reduces the variance less than averaging nearly uncorrelated trees.
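To close the section, here is a sketch in the style of the ISLR lab (assuming the randomForest and MASS packages are installed; the Boston data and tuning choices are illustrative, not from the course notes). Bagging is a random forest that considers all p predictors at each split, while the random forest considers only a random subset, de-correlating the trees.

```r
# Sketch: bagging versus a random forest on the Boston housing data.
library(randomForest)
library(MASS)                                   # Boston data
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)

bag <- randomForest(medv ~ ., data = Boston, subset = train,
                    mtry = 13, ntree = 500)     # all 13 predictors: bagging
rf  <- randomForest(medv ~ ., data = Boston, subset = train,
                    ntree = 500)                # default mtry (p/3): random forest

test <- Boston[-train, ]
mean((predict(bag, newdata = test) - test$medv)^2)   # bagging test MSE
mean((predict(rf,  newdata = test) - test$medv)^2)   # random forest test MSE
```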