
STAT318 — Data Mining
Jiří Moravec
University of Canterbury, Christchurch
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.



Tree-Based Methods:
Classification and Regression Trees (CART)
Bagging (bootstrap aggregating)
Random Forests
BART and MARS

Prerequisites
1 Validation set to compare method performance
2 Cross-Validation to tune methods and prevent overfitting
3 Bootstrap to estimate confidence intervals
Validation: The reason we train a model is to predict future data; we want a model that minimizes error on future data, but we only have training data. If we randomly divide the available data into training and test subsets, we can estimate the test error and hence how the model will perform on future data, at the cost of a smaller training sample and thus increased error. A problem with this approach is that the estimated test error depends on the random partitioning of the dataset.
Cross-Validation: Very flexible methods can fit any pattern in the data, including random noise. Fortunately, many such methods have tuning parameters that limit their flexibility: the degree of a polynomial, the size of a regression tree, the number of non-zero coefficients in ridge or lasso regression. By splitting the training data into k folds, training on all but one fold, testing on the held-out fold, and averaging the test errors, we can estimate the method's average performance.
Bootstrap: The bootstrap is random sampling with replacement; it simulates the data sampling/generating process. By perturbing the data we can estimate standard errors, and by summarizing across bootstrap samples we can compensate for the high variability of a training method.
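As a rough illustration of all three ideas (a minimal base-R sketch, not part of the original slides; the data frame dat is a placeholder):

    set.seed(1)
    n <- nrow(dat)                        # 'dat' is a placeholder data frame

    ## Validation set: one random split into training and test subsets
    train_id <- sample(n, size = floor(n / 2))
    train    <- dat[train_id, ]
    test     <- dat[-train_id, ]

    ## k-fold cross-validation: assign each observation to one of k folds
    k     <- 5
    folds <- sample(rep(1:k, length.out = n))
    ## for each fold j: fit on dat[folds != j, ], compute the error on dat[folds == j, ]

    ## Bootstrap: resample n observations with replacement
    boot_id  <- sample(n, size = n, replace = TRUE)
    boot_dat <- dat[boot_id, ]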

Tree-Based Methods
In this section we consider tree-based methods for regression and classification.
These methods partition the feature (predictor) space into a number of regions, and then fit a simple model (for example, a constant) in each region.
The set of splitting rules used to partition the feature space can be summarized in a tree, so these approaches are known as decision tree methods.
We will focus on the Classification and Regression Trees (CART) approach.

Advantages/Disadvantages
Advantages
easy to interpret
nice graphical representation
scale well
Form a basis for state-of-the-art supervised learning methods:
ensembles of decision trees, often known as "random forests", have been the most successful general-purpose algorithm in modern times (Howard and Bowles, 2012)
Disadvantages
lower predictive accuracy than many other modern supervised learning methods
non-robust/high variability (solved by bagging and random forests)

Decision Tree
A decision tree is a recursive partition of the feature space that can be displayed in a flowchart-like structure consisting of nodes and branches.
1 Root Node: A node that has no incoming branches.
2 Internal Node: A node that has exactly one incoming branch and at least two outgoing branches.
3 Terminal (Leaf) Node: A node with exactly one incoming branch and no outgoing branches.
Internal nodes have an associated query (splitting rule) and by convention, we proceed to the left descendent node (left daughter) if we answer ‘yes’ to the query.

Decision Tree – Simplified Example
[Figure: a simple decision tree. Node 1 (root) asks "Does the patient have difficulty breathing?"; answering 'yes' leads to node 2 (send to hospital). Answering 'no' leads to node 3, which asks "Is the patient bleeding?"; 'yes' leads to node 6 (send to hospital) and 'no' leads to node 7 (stay at home).]
To classify a patient we answer the queries and follow the branches. For example, a patient with no difficulty breathing and no bleeding is told to stay at home:
node 1 → node 3 → node 7 = stay at home. A patient with difficulty breathing is told to go to hospital:
node 1 → node 2 = send to hospital.
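Written as code, the flowchart is just nested if/else statements (a trivial sketch of this particular tree; the function name triage is made up):

    ## The simplified decision tree as nested if/else statements.
    triage <- function(difficulty_breathing, bleeding) {
      if (difficulty_breathing) {
        "send to hospital"     # node 1 -> node 2
      } else if (bleeding) {
        "send to hospital"     # node 1 -> node 3 -> node 6
      } else {
        "stay at home"         # node 1 -> node 3 -> node 7
      }
    }

    triage(difficulty_breathing = FALSE, bleeding = FALSE)   # "stay at home"
    triage(difficulty_breathing = TRUE,  bleeding = FALSE)   # "send to hospital"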

Regression Trees – Model
1 Divide the predictor space into M distinct, non-overlapping regions R1, . . . , RM.
2 For every xi ∈ Rm, the prediction is the constant cm = mean(yi : xi ∈ Rm); i.e., the response is modelled as a constant cm in each region Rm.
Or more formally:
f(X) = Σ_{m=1}^{M} cm I(X ∈ Rm),
where I(·) is an indicator function that is 1 if its argument is true and 0 otherwise, and
cm = mean(yi : xi ∈ Rm).
The RSS is then calculated as:
RSS = Σ_{m=1}^{M} Σ_{xi ∈ Rm} (yi − f(xi))².
The regions R1, . . . , RM are chosen to minimize the RSS.
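For a fixed partition this model is easy to write down in code (a minimal base-R sketch; x and y are placeholder vectors and the single split point 4.5 is arbitrary):

    ## Piecewise-constant fit for a fixed one-dimensional partition.
    region <- cut(x, breaks = c(-Inf, 4.5, Inf))   # assign each x_i to a region R_m
    c_m    <- tapply(y, region, mean)              # constant c_m fitted in each region
    f_hat  <- c_m[as.integer(region)]              # f(x_i) for every training point
    RSS    <- sum((y - f_hat)^2)                   # residual sum of squares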

Regression Trees – Hitters data set
[Figure: scatter plot of the Hitters training data, Years (horizontal axis) against Hits (vertical axis), with points coloured by salary.]
Jiˇr ́ı Moravec, University of Canterbury 2022
STAT318 — Data Mining 9 / 58
To motivate regression trees, we consider a simple example. The aim here is to predict a Baseball player’s salary using:
• Hits: the number of hits in the previous year; and
• Years: the number of years playing in the major leagues.
The response variable Salary has been log transformed so it is more 'bell shaped' (Salary is measured in thousands of dollars). Players with low income are coloured blue and those with high income yellow.
Can you explain the player with a high income (yellow), zero hits and only one year playing in the major league?
How do we partition the training data?
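How the splits are chosen is developed over the next few slides; as a preview, a tree like this can be fitted in R (a sketch only, assuming the Hitters data from the ISLR package and the tree package used in the ISLR labs; rpart is a common alternative):

    library(ISLR)    # provides the Hitters data
    library(tree)    # CART-style regression trees

    Hitters <- na.omit(Hitters)          # drop players with missing Salary
    fit <- tree(log(Salary) ~ Years + Hits, data = Hitters)

    plot(fit)
    text(fit, pretty = 0)                # draw the fitted splits
    predict(fit, newdata = data.frame(Years = 6, Hits = 80))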

Regression Trees – Hitters data set
A partition of the feature space for the Hitters training data. The three regions can be written as follows:
{X : Years < 4.5}
{X : Years ≥ 4.5, Hits < 117.5}
{X : Years ≥ 4.5, Hits ≥ 117.5}
We read {X : Years < 4.5} as 'the set of training data points X such that Years is less than 4.5'.

Regression Trees – Hitters data set
[Figure: the regression tree for the partition above, with the split Years < 4.5 at the root.]
The regression tree corresponding to the partition on the previous slide. The lengths of the branches are proportional to the decrease in RSS after the split is made. That is, longer branches correspond to greater reductions in RSS. (Note: this is something that the tree R function does; not all functions plot trees in this way.)
The mean response of the training observations at each terminal node is used for making predictions. For example, a player with 6 years experience and 80 hits has a predicted log salary of 6 (that is, a salary of $10^6, i.e. 1 million USD).
Observations:
• Years is the most important predictor (it is at the top of the tree and it reduces the RSS the most).
• Experienced players with a large number of hits get larger salaries.
• Simple, easy to display, and easy to explain to a non-expert.
Now we need to learn how to fit one!

Regression Trees – Building Process
It is usually computationally infeasible to find a partition that minimizes the sum of squares. CART uses a top-down, greedy approach called recursive binary partitioning.
1 Top-Down: Start at the root node with all the data and then recursively split the feature space, where each split defines two new branches further down the tree.
2 Greedy: Make a series of locally optimal splits, rather than looking ahead and picking a split that will produce a better tree at some future step.
[Figure: how the training data are divided following a split.] After each split, we end up with two independent training sets (A and B). These sets can be treated separately (in parallel) for further splitting. A series of locally optimal splits (greedy) does not guarantee a globally optimal tree.

Regression Trees – Building Process
Starting with all the training data, consider splitting on the jth feature using a split point s. This defines two regions:
R1 = {X : Xj ≤ s} and R2 = {X : Xj > s}.
We seek the splitting feature j and the split point s that solve
min_{j,s} [ Σ_{xi ∈ R1} (yi − ȳR1)² + Σ_{xi ∈ R2} (yi − ȳR2)² ],
where ȳRm is the mean response in the m-th region.
If ȳR1 ≠ ȳR2, the split reduces the RSS; otherwise the RSS is unchanged. That is, a split cannot increase the RSS.
For N points in ℝ^d, there are d(N − 1) possible splits to consider. For example, in ℝ² with N = 4 there are six possible splits.
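To make the exhaustive search concrete, here is a rough base-R sketch for a single numeric feature (not the CART implementation; scanning d features just repeats this for each column):

    ## Greedy search for the best single split on one numeric feature.
    ## Candidate split points are midpoints between adjacent sorted values.
    best_split <- function(x, y) {
      u      <- sort(unique(x))
      s_cand <- (u[-1] + u[-length(u)]) / 2
      rss    <- sapply(s_cand, function(s) {
        yl <- y[x <= s]
        yr <- y[x >  s]
        sum((yl - mean(yl))^2) + sum((yr - mean(yr))^2)
      })
      list(split = s_cand[which.min(rss)], rss = min(rss))
    }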
[Figure: the six possible splits on X1 and X2 for the four points, with the split value s placed between neighbouring points.]
The splitting value s is taken as the midpoint between the two neighbouring points. (Qualitative features are covered below.)

Regression Trees – Building Process
The best split can be found quickly by scanning through all the features. Having found the best split, we partition the training data into two regions and repeat the splitting process on these two regions.
The splitting process is repeated until a stopping condition is satisfied, for example, until no terminal node contains more than five observations.

Qualitative Features
CART considers all 2^(K−1) − 1 possible binary splits for qualitative features, where K is the number of different values the feature takes in the node. For the Car Type example on Slide 4, there are three possible splits to consider. Let Xj = Car Type, then the three possible splits are: Xj ∈ {Sports}, Xj ∈ {Sports, Family} and Xj ∈ {Sports, Luxury}. The corresponding regions are:
R1 = {X : Xj ∈ {Sports}} and R2 = {X : Xj ∈ {Family, Luxury}};
R1 = {X : Xj ∈ {Sports, Family}} and R2 = {X : Xj ∈ {Luxury}};
R1 = {X : Xj ∈ {Sports, Luxury}} and R2 = {X : Xj ∈ {Family}}.

Recursive Binary Partition
[Figure: a partition of the feature space into boxes that are not nested.]
This partition was not formed using recursive binary partitioning (the boxes are not nested). CART will not form a partition like this.

Recursive Binary Partition
[Figure: a CART partition with splits X1 ≤ t1 and X2 ≤ t2, the corresponding tree, and a plot of the fitted model.]
This is an example of a CART partition, the corresponding tree and a plot of the model (a piecewise-constant surface). Another simple regression example with one predictor, a single split at x < 4, and its corresponding tree is also illustrated in the lecture figures.

Pruning a Tree
The tree building process may produce good predictions on the training data, but is likely to overfit the training data.
A smaller tree might give lower variance and better interpretation at the cost of increased bias.
Rather than growing a small tree, it is better to grow a large tree and prune it back into a smaller tree.
Deep bushy trees localize the training data to relatively small regions Rm (in a similar way to KNN using a relatively small value of k). This suggests deep trees have low bias and high variance. Pruning the deep bushy tree aims to reduce the variance and reduce overfitting.
Another strategy would be to grow the tree until the reduction in RSS becomes sufficiently small. However, a seemingly worthless split early on may be followed by a useful split that reduces the RSS substantially. It is good practice to grow a large tree and then prune it back into a smaller tree.

Cost-Complexity Pruning
Let T0 denote a fully grown tree. A subtree T ⊂ T0 is any tree that can be obtained by collapsing any number of internal nodes. The cost-complexity criterion is
Cα(T) = Σ_{m=1}^{|T|} Σ_{xi ∈ Rm} (yi − ȳRm)² + α|T| = RSS_T + α|T|,
where |T| is the number of terminal nodes of tree T. For each α ≥ 0, there is a unique subtree Tα ⊂ T0 that minimizes Cα(T).
The idea here is to find a sequence of best trees of varying complexities (by changing α). Then, we choose the correct complexity for the problem at hand. The penalty parameter α governs the trade-off between tree size (complexity) and its goodness-of-fit to the training data, where large α values select smaller trees.
• To find Tα, we collapse the two terminal nodes (to their parent node) that yield the smallest increase in
RSS = Σ_{m=1}^{|T|} Σ_{xi ∈ Rm} (yi − ȳRm)².
We continue this collapsing process to produce a sequence of trees until we reach the root node. This sequence of trees must contain Tα (we will not prove this result).
• We then find the best α (the one that gives the lowest testing error) using cross-validation.

Cost-Complexity Pruning
The series of trees for α does not have to be sequential! Trees of a certain size can be "skipped" if their improvement in RSS is small enough. Let RSSi be the residual sum of squares of the tree of size i, then
Cα(Ti) = RSSi + αi.
We choose the smaller tree of size i over the larger tree of size i + k if Cα(Ti) < Cα(Ti+k). Since the larger tree fits the training data at least as well (RSSi > RSS(i+k)), the smaller tree is selected when
∆RSS_{i,(i+k)} = RSSi − RSS(i+k) < kα.

Cost-Complexity Pruning – Example
We have a series of trees of sizes t = 1, . . . , 7 with the following RSS and ∆RSS values:
size    1    2    3    4    5    6    7
RSS    21   13    8    5    3    1    0
∆RSS    8    5    3    2    2    1    –
Which trees will we pick for a particular α, say α = 3, α = 2 and α = 1? Remember that we choose the smaller tree for a particular α if ∆RSS_{i,(i+1)} < α.
The space is ridged, but not very bumpy. For relatively large ranges of α, we might get only a single tree. In this example, tree 5 would be skipped in favour of tree 4 or tree 6.
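As a quick check of the example (a small base-R sketch using the RSS values from the table; the selection rule is just 'minimize Cα(Ti) = RSSi + αi'):

    ## Which tree size minimizes the cost-complexity criterion for a given alpha?
    size <- 1:7
    RSS  <- c(21, 13, 8, 5, 3, 1, 0)

    select_size <- function(alpha) size[which.min(RSS + alpha * size)]

    sapply(c(3, 2, 1), select_size)   # returns 3 4 6
    ## which.min breaks ties in favour of the smaller tree;
    ## note that size 5 is never selected: it is skipped in favour of 4 or 6.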
Tree Algorithm
1 Use recursive binary partitioning to grow a large tree on the training data.
2 Apply cost-complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3 Use K-fold cross-validation to choose α. For k = 1, 2, . . . , K, do
(a) Repeat steps (1) and (2) on the training data excluding the kth fold.
(b) Evaluate the mean squared testing error of the kth fold, as a function of α.
Average the results and choose α to minimize the average error.
4 Return the subtree from step (2) that corresponds to the chosen value of α.
[Figure: the sequence of best subtrees from step (2), with α decreasing (complexity increasing) up to the full tree at α = 0.]
Step 2 gives us the best trees as a function of α. Step 3 uses cross-validation to find the best α (tree size). The tree from step 2 of that size is then selected.
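In practice the whole algorithm is run through a package (a sketch only, assuming the tree package and the fit object from the earlier Hitters example; rpart with printcp()/prune() is the usual alternative):

    set.seed(2)
    cv_fit <- cv.tree(fit, FUN = prune.tree)   # cost-complexity pruning + K-fold CV
    plot(cv_fit$size, cv_fit$dev, type = "b",
         xlab = "Tree size", ylab = "CV deviance")

    best_size <- cv_fit$size[which.min(cv_fit$dev)]
    pruned    <- prune.tree(fit, best = best_size)   # subtree of the chosen size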
Fully Grown Tree: Hitters data
[Figure: the fully grown regression tree for the Hitters data, with 12 terminal nodes and splits on variables such as Years, RBI, Hits, Walks, Runs and PutOuts.]
This is the fully grown regression tree for the Hitters data from earlier in this section. Observations:
• There were 12 terminal nodes in the tree.
• Splits further down the tree did not reduce the RSS as much as splits higher up (as expected).
• The most useful predictors were Years, RBI (number of runs batted in) and Hits.
• There are 19 predictors in the Hitters data and the tree only used 6 of them to model the response (automatic feature selection; we didn't need to specify the most useful predictors to fit the model). Trees only use the predictors that matter (in terms of reducing the RSS) and discard the rest.
• Are we over-fitting the training data?

Testing Error: Hitters data
[Figure: training, cross-validation and test mean squared error plotted against tree size for the Hitters data.]
The training MSE decreases as tree size (complexity) increases (as expected, because larger trees have lower bias than smaller ones). The CV error is a reasonable approximation to the testing error and shows the characteristic U shape. For this problem, a tree size of three fits the data well. The pruned tree was given on slide 8.

Classification Trees
Classification trees are very similar to regression trees. The feature space is subdivided using recursive binary partitioning, but the RSS is not used to choose the best split, and the average response cm for all xi ∈ Rm is not used for predictions at terminal nodes.
Instead, we use the classification error rate and the most commonly occurring class. Define p̂mk as the proportion of (training) observations of class k in the m-th region:
p̂mk = (1/Nm) Σ_{xi ∈ Rm} I(yi = k).
Then the most commonly occurring class is argmax_k(p̂mk) and the classification error rate is
E = 1 − max_k(p̂mk).

Impurity Measures
It turns out that the classification error rate is not sufficiently sensitive for tree growing. Instead, we use either the Gini index or the entropy.
Classification error rate (misclassification): E = 1 − max_k(p̂mk),
Gini index: G = Σ_{k=1}^{K} p̂mk (1 − p̂mk),
Entropy (cross-entropy, deviance): D = − Σ_{k=1}^{K} p̂mk log2(p̂mk), with 0 log2(0) = 0.

Impurity Measures: Two-Class Problem
[Figure: misclassification error, Gini index and entropy plotted against the proportion of observations in one class for a two-class problem.]
Impurity measures attain their maximum value when each class is equally represented at a node. The Gini index and entropy are more sensitive to (im)purity than the misclassification error rate.

The best split in node m is the one that maximizes the reduction in impurity
∆(I) = I(m) − (N_{2m}/Nm) I(2m) − (N_{2m+1}/Nm) I(2m + 1),
where I(m) is the impurity and Nm is the number of observations in node m (node m has daughters 2m and 2m + 1).
Note: it is not always possible to split an impure node. In such cases the node is declared terminal and observations are classified to the majority class.
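A tiny sketch of these impurity measures in base R (p is the vector of class proportions p̂mk at a node; these are not package functions):

    ## Node impurity as a function of the class proportions p (summing to 1).
    misclass <- function(p) 1 - max(p)
    gini     <- function(p) sum(p * (1 - p))
    entropy  <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))   # 0 * log2(0) = 0

    gini(c(0.5, 0.5))   # 0.5, the maximum for a two-class node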
Consider a binary classification problem y ∈ {Yes, No} with 100 training observations in node m. The node has been split, giving two daughter nodes with the following structure:
Parent node m: Yes = 50, No = 50
Left daughter (2m): Yes = 25, No = 50
Right daughter (2m + 1): Yes = 25, No = 0
That is, 25 of the Yes' from node m go to the left daughter after the split and the remaining 25 go right. Using the Gini index, the impurities are:
I(m) = 2(1/2)(1 − 1/2) = 1/2 (Parent Node)
I(2m) = 2(2/3)(1 − 2/3) = 4/9 (Left Daughter)
I(2m + 1) = 0 (Right Daughter).
The change in impurity is
∆(I) = 1/2 − (3/4)(4/9) − (1/4)(0) = 1/6.

Classification Tree: Heart data
[Figure: a classification tree fitted to the Heart data, with splits on predictors such as Thal, Ca, MaxHR, Chol, Slope, Oldpeak, RestECG, RestBP, ChestPain and Age, and terminal nodes labelled Yes/No.]
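A classification tree like this can be grown with rpart (a sketch, assuming the Heart data have been downloaded from the ISLR book website into Heart.csv with the binary response AHD, as in the ISLR labs; rpart splits on the Gini index by default):

    library(rpart)

    Heart <- read.csv("Heart.csv", stringsAsFactors = TRUE)   # placeholder path

    fit_heart <- rpart(AHD ~ ., data = Heart, method = "class")
    plot(fit_heart)
    text(fit_heart, use.n = TRUE)   # label nodes with class counts
    printcp(fit_heart)              # cost-complexity table for pruning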