
SAS Viya for Learners – Visual Statistics
Decision Trees

Objectives
• Describe how decision trees partition data in SAS Visual Statistics.
• Describe how predictions are formulated for a decision tree.
• Explain variable selection methods for decision trees.
• Identify the tree variable selection methods that are available in SAS Visual Statistics.
As seen in the previous section, regressions, as parametric models, assume a specific association structure between the inputs and the target. By contrast, decision trees, as predictive algorithms, assume no particular association structure. They simply seek to isolate concentrations of cases with like-valued target measurements.
Decision trees are similar to other modeling methods that are described in this course. Cases are scored using prediction rules. A split-search algorithm facilitates predictor variable selection.
Useful predictions depend, in part, on a well-formulated model. Good formulation primarily consists of preventing the inclusion of redundant and irrelevant predictors (input variables) in the model. Predictor variable selection becomes more complicated with large data: there are usually many predictors (columns) to consider and many pieces of information (rows) about those predictors to process. This adds to the requirements for any model's input search method: it must eliminate redundancies and irrelevancies, and it must also be extremely efficient. (The input search methods available in the decision tree algorithm in Visual Statistics are described in this section.)
The simple prediction problem described below illustrates each of these model essentials.

Model Essentials: Decision Trees
• Predict cases. → Prediction rules
• Select useful predictors. → Split search
Consider a data set with two predictors and a binary target. The predictors, x1 and x2, locate each case in the unit square. The target outcome is represented by a color: yellow is the primary outcome, and blue is the secondary outcome. The analysis goal is to predict the outcome based on a case's location in the unit square.
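For concreteness, data of this shape can be simulated in a few lines. Below is a minimal Python sketch, assuming NumPy is available; the region that drives the yellow/blue outcome is invented purely for illustration and is not the data set used in the course.

import numpy as np

rng = np.random.default_rng(42)
n = 100

# Two predictors locate each case in the unit square.
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)

# Invented signal: cases in the lower-left region are mostly yellow
# (the primary outcome); cases elsewhere are mostly blue.
p_primary = np.where((x2 < 0.51) & (x1 < 0.52), 0.8, 0.3)
target = np.where(rng.uniform(size=n) < p_primary, "yellow", "blue")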
To predict cases, decision trees use rules that involve the values of the predictor variables.
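For example, here is a minimal Python sketch of what such prediction rules look like in code. The split points (x2 at 0.51, x1 at 0.52 and 0.63) echo the example tree in the figures that follow; the leaf decisions and all estimates other than the figure's 0.70 are invented for illustration.

def classify(x1, x2):
    """Apply the tree's prediction rules to one case.

    Returns a (decision, estimate) pair, where the estimate is the
    primary-target (yellow) proportion in the leaf that the case reaches.
    """
    if x2 < 0.51:                    # root-node rule
        if x1 < 0.52:                # interior-node rule, lower branch
            return "yellow", 0.70    # leaf estimate shown in the figure
        return "blue", 0.35          # invented estimate
    if x1 < 0.63:                    # interior-node rule, upper branch
        return "blue", 0.25          # invented estimate
    return "yellow", 0.80            # invented estimate

decision, estimate = classify(0.30, 0.40)   # -> ("yellow", 0.70)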
Decision Tree Prediction Rules
[Figure: decision tree prediction rules. A tree diagram with a root split on x2 (<0.51 versus ≥0.51) and interior splits on x1 (at 0.52 and 0.63) is shown beside the corresponding partition of the unit square.]

The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent decision rules, and the lines order the rules. The first rule, at the base (top) of the tree, is named the root node. Subsequent rules are named interior nodes. Nodes with only one connection are leaf nodes. To score a new case, examine the associated input variable values. Then apply the rules that are defined by the decision tree.

[Figure: the path from the root node through an interior node to a single leaf; the leaf reports a decision and an estimate (Estimate = 0.70).]

The input variables' values of a new case eventually lead to a single leaf in the tree. A tree leaf provides a decision (for example, classify as yellow) and an estimate (for example, the primary-target proportion).

To select useful predictors, trees use a split-search algorithm. Decision trees confront the "curse of dimensionality" by ignoring irrelevant predictors.

Understanding the split-search algorithm for building trees enables you to better use the Tree tool and interpret your results. The description presented here assumes a binary target, but the algorithm for interval targets is similar.

The first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available training data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each categorical input level. The averages serve the same role as the unique interval input values in the discussion that follows.

For a selected input and fixed split point, two groups are generated. Cases with input values less than the split point are said to branch left. Cases with input values greater than the split point are said to branch right. The groups, combined with the target outcomes, form a 2x2 contingency table, where the columns specify branch direction (left or right) and the rows specify target value (0 or 1). An information gain statistic, based on the entropy of the root node and the entropy of the data in each partition of the split, can be used to quantify the separation of counts in the table's columns. Large values for the gain statistic suggest that the proportion of zeros and ones in the left branch differs from the proportion in the right branch. A large difference in outcome proportions indicates a good split. An example of this calculation is given below.

The split-search diagnostic used in Visual Statistics depends on the method that is used to grow or train the tree.
Under Interactive training, a chi-square log-worth-based statistic is used to evaluate splits. The default split-search method under autonomous tree growth is based on an information gain statistic; an example of calculating the gain under the default method is given below. The Rapid Growth functionality combines k-means clustering with the gain statistic to grow the tree. A gain ratio statistic is used for split evaluation when the Split Best option is used in combination with the Rapid Growth property.

[Figure: split search on input x1. A candidate split point divides the unit square into left and right groups, which are summarized with the target outcomes in a contingency table.]

[Figure: the partition of x1 with the maximum gain is selected: max gain(x1) = 0.0112 at the split point 0.52.]

The best split for a predictor is the split that yields the highest information gain. For the gain calculation example, assume that there are 100 total observations and a 50/50 split of yellow/blue in the training data. Also, there are 52 observations to the left of the 0.52 split point and 48 observations to the right of the split in the diagram shown above. Based on this and the numbers given in the table, gain can be formulated as follows:

Entropy(Total) = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
Entropy(Left) = −0.53 log2(0.53) − 0.47 log2(0.47) ≈ 0.997
Entropy(Right) = −0.42 log2(0.42) − 0.58 log2(0.58) ≈ 0.98
Gain = 1 − (52/100)(0.997) − (48/100)(0.98) ≈ 0.0112

The partitioning process is repeated for every input in the training data. Again, the optimal split for the input is the one that maximizes the gain function.

[Figure: the split search is repeated for input x2, yielding max gain(x2) = 0.0273 versus max gain(x1) = 0.0112.]

After you determine the best split for every input, the tree algorithm compares each best split's corresponding gain. The split with the highest gain is regarded as best. The best split rule is used to partition the data.

[Figure: a partition rule is created from the best partition across all inputs (the split on x2), and the process is then repeated in each subset.]
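The arithmetic above is easy to verify. Below is a minimal Python sketch of the gain calculation for the 0.52 split on x1, using only the standard library. This illustrates the gain formula and is not SAS Visual Statistics output; note that the slide's 0.0112 results from the rounded entropies (0.997 and 0.98), while full precision gives about 0.0102.

import math

def binary_entropy(p):
    """Entropy (in bits) of a binary outcome with class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Counts quoted in the example: 100 cases, 52 branch left of x1 = 0.52.
n_total, n_left, n_right = 100, 52, 48
h_total = binary_entropy(0.50)   # 50/50 yellow/blue overall -> 1.0
h_left = binary_entropy(0.53)    # ≈ 0.997
h_right = binary_entropy(0.42)   # ≈ 0.98

gain = h_total - (n_left / n_total) * h_left - (n_right / n_total) * h_right
print(f"gain = {gain:.4f}")      # ≈ 0.0102; the slide's 0.0112 uses rounded entropies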
[Figure: within one of the new leaves, the split search repeats: max gain(x1) = 0.0203 versus max gain(x2) = 0.019.]

The split search is repeated within each new leaf. Gain statistics are compared as before. The resulting partition of the predictor variable space is known as the maximal tree. Under the default settings, development of the maximal tree is based exclusively on statistical measures of gain on the data.

[Figure: a second partition rule is created, and the process repeats until it forms a maximal tree.]

4.1 Decision Trees in SAS Visual Statistics

Objectives
• Describe decision tree variable roles in SAS Visual Statistics.
• Identify the decision tree properties in SAS Visual Statistics.
• Cultivate a decision tree.
• Assess decision tree performance.

Decision Trees in SAS Visual Statistics
• There is only one response variable. It can be either a category or a measure. (Both classification trees and regression trees can be created.)
• There can be multiple predictors.
• Both category and measure predictors are accommodated. (No interaction terms are allowed.)
• Using Interactive mode, you can manually train and prune a decision tree.
• You can derive a leaf ID. This ID can be used in other models that are featured in the SAS Visual Statistics functionality.

The decision tree in SAS Visual Statistics uses a modified version of the C4.5 algorithm.

Note: To enter Interactive mode, right-click a tree and select Enter Interactive Mode.

Note: One difference between trees and the other modeling algorithms that are presented in this course is that decision trees are available in SAS Visual Analytics without adding SAS Visual Statistics. However, SAS Visual Statistics does augment the decision tree functionality, and some decision tree default settings are modified with the Visual Statistics addition.

Categorical-valued and interval-valued response (target) variables are accommodated in the SAS Visual Statistics decision tree model. Although multilevel categorical target variables are allowed, one level is chosen as the event level, and the other levels are combined into the non-event category.
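As a simple illustration of that collapsing step, the sketch below maps a hypothetical three-level target onto an event/non-event pair. The level names and the chosen event level are invented; SAS Visual Statistics performs this internally when you set Event level.

# Hypothetical multilevel target; "gold" is chosen as the event level.
raw_target = ["gold", "silver", "bronze", "gold", "silver"]
event_level = "gold"

# Every non-event level is collapsed into a single non-event category.
binary_target = ["event" if value == event_level else "non-event"
                 for value in raw_target]
# -> ['event', 'non-event', 'non-event', 'event', 'non-event']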
For binary target variables, changing the event level does not affect the hierarchical structure of the decision tree. It does change the assessment plots (confusion matrix, lift, ROC, and misclassification) that are generated for each event level. To do model comparisons (for example, between a logistic regression and a decision tree), make sure that your models target the same outcome.

Note: For a measure response variable, choose whether to bin the response variable in the Options pane. This determines whether a classification tree or a regression tree is created: bin the response variable to create a classification tree, or keep it unmodified to create a regression tree.

Decision Tree Roles
• Response – only one measure or category variable
• Predictors – assign any number of measure and category variables
• Partition ID – only one partition variable

Decision Tree Options
• Decision Tree – Event level, Autotune, Missing assignment, Minimum value, Growth strategy, Maximum branches, Maximum levels, Leaf size, Bin response variable, Response bins, Predictor bins, Bin method, Rapid growth, Prune with validation data, Reuse predictors, Number of bins, Prediction cutoff, Statistic percentile, Tolerance
• Model Display – Plot layout, (General) Statistic to show, (Decision Tree / Icicle Plot) Statistic to show, Legend visibility, Plot type, Plot to show, Confusion matrix legend visibility

Decision Tree
• Event level – enables you to specify the event level. Select Choose to open the Select Event Level window, where you can select the event level that you want to model: select the appropriate radio button and click OK. By default, the event levels are sorted alphabetically. Make sure that you are modeling for the event of interest.
• Autotune – enables you to specify the hyperparameters that control the autotuning algorithm. The hyperparameters determine how long the algorithm can run, how many times the algorithm can run, and how many model evaluations are allowed. The autotuning algorithm selects the Maximum levels and Predictor bins values that produce the best model.
Note: For more detailed information about autotuning, see "Autotuning" in SAS® Visual Analytics 8.5: Working with SAS® Visual Data Mining and Machine Learning (https://go.documentation.sas.com/?docsetId=vaobjdmml&docsetTarget=n1ot6nwcbwp4jmn1r7vks7d8g3ri.htm&docsetVersion=8.5&locale=en#n1usfp24hnj2vpn1aj279bcqot5e) in the online documentation.
• Missing assignment – enables you to specify how observations with missing values are included in the model.
  – None – Observations with missing values are excluded from the model.
  – Use in search – If the number of observations with missing values is greater than or equal to Minimum value, then missing values are considered a unique measurement level and are included in the model.
  – As machine smallest – Missing interval values are set to the smallest possible machine value, and missing category values are treated as a unique measurement level.
  – Popular – Observations with missing values are assigned to the sub-node with the most observations.
  – Similar – Observations with missing values are assigned to the node that is considered most similar by a chi-square test for category responses or an F test for measure responses.
Note: The default method for handling missing values varies across model types in SAS Visual Statistics. In contrast to other models, the default for decision trees is Use in search.
• Minimum value – specifies the minimum number of observations that are allowed to have missing values before missing values are treated as a distinct category level. This option is used only when Missing assignment is set to Use in search.
• Growth strategy – specifies the parameters that are used to create the decision tree.
  – Custom – enables you to select the values.
  – Basic – specifies a simple tree with a maximum of two branches per split and a maximum of six levels.
  – Advanced – specifies a complex tree with a maximum of four branches per split and a maximum of six levels.
  – Modeling – specifies a tree with the default options of SAS Visual Statistics 7.1.
• Maximum branches – specifies the maximum number of branches that are allowed when you split a node. The default is 2 and the maximum is 10.
• Maximum levels – specifies the maximum depth of the decision tree. The default is 6 and the maximum is 20.
• Leaf size – specifies the minimum number of observations that are allowed in a leaf node. The default is 5.
• Bin response variable – specifies whether a measure response variable is binned. When the variable is binned, a classification tree is created. Otherwise, a regression tree is created.
• Response bins – specifies the number of bins that are used to categorize a measure response variable.
• Predictor bins – specifies the number of bins that are used to categorize a predictor that is a continuous variable. The default is 20. Smaller values lead to a simpler tree that might be less accurate. Larger values lead to a more complex model that takes longer to train and might be overfit.
• Bin method –
• Rapid growth – enables you to use the information gain ratio and k-means fast search methods for decision tree growth. Also, when this option is enabled, bin ordering is ignored. When the option is disabled, the information gain and greedy search methods are used, which generally produces a larger tree and requires more time to create. Also, when disabled, bin order