
EM623-Week6

Carlo Lipizzi
clipizzi@stevens.edu

SSE

2016

Machine Learning and Data Mining
Decision Trees: definitions, algorithms, applications, optimizations and implementation using R/Rattle

Machine learning and our focus

• Like human learning from past experiences
• A computer does not have “experiences”
• A computer system learns from data, which represent some “past experiences” of an application domain
• Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g.: approved or not-approved, high-risk or low-risk

• The task is commonly called: Supervised learning,
classification, or inductive learning

2

The data and the goal

• Data: A set of data records (also called examples, instances or
cases) described by

– k attributes: A1, A2, … Ak.
– a class: Each example is labelled with a pre-defined class
• Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances

3

An example: data (loan application)

[Table: loan application records; class attribute: Approved or not]

4

An example: the learning task

• Learn a classification model from the data
• Use the model to classify future loan applications into
– Yes (approved) and
– No (not approved)

• What is the class for the following case/instance?

5

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of data, the task is to establish the existence of classes or clusters in the data

6

Classification by Decision Tree

• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution

• Decision tree generation consists of two phases
– Tree construction

• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes

– Tree pruning
• Identify and remove branches that reflect noise or outliers

• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

7

Decision Trees
• One of the simplest and most successful forms of machine learning
• Takes as input a vector of attribute values. Returns a single output value (a decision)

Decision – To wait for a table at a restaurant or not?
Goal – To come up with a function that gives a boolean output, WillWait

Attributes:
• Alternate: Whether there is a suitable alternative restaurant nearby
• Bar: Whether the restaurant has a comfortable waiting lounge
• Fri/Sat: Is it a Friday or Saturday?
• Hungry: Whether we are hungry
• Patrons: How many people are in the restaurant (values are None, Some, Full)
• Price: The restaurant’s price range ($, $$, $$$)
• Raining: Whether it is raining outside
• Type: The kind of restaurant (French, Italian, Thai, Burger)
• Reservation: Whether we made a reservation
• WaitEstimate: The wait estimated by the waiter (0-10, 10-30, 30-60, or >60 minutes)

8

Decision Tree – Example

9

Decision Tree

• It represents a human-like thinking pattern: for many problems we take different attributes into consideration one by one and arrive at a conclusion

• A decision tree reaches a conclusion by performing a series of
tests

• Each internal node in the tree corresponds to a test of the value
of an attribute

• The branches from the nodes represent possible values of the
attributes

• Each leaf node represents the final value to be returned by the
function

• Decision trees are popular for pattern recognition because the
models they produce are easier to understand

10

• All paths:
– start at the root node
– end at a leaf

• Each path represents a decision rule:
– a joining (AND) of all the tests along that path
– separate paths that result in the same class are disjunctions (ORs)

• All paths are mutually exclusive:
– for any one case, only one path will be followed
– false decisions on the left branch
– true decisions on the right branch
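For example, one root-to-leaf path in the restaurant tree could be read as the rule: IF Patrons = Full AND Hungry = Yes AND WaitEstimate = 10-30 THEN WillWait = Yes (a hypothetical path, shown only to illustrate the AND-ing of tests along a path).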

11

Decision Tree

12

From Tree to Rules

[Figure: a decision tree and the decision rules read off its root-to-leaf paths; image not recoverable from the extraction]

Algorithms

[Figure: bar chart from the Rexer Analytics Data Miner Survey showing usage frequency (Most of the time / Often / Sometimes / Rarely) for: Regression, Decision trees, Cluster analysis, Time series, Text mining, Ensemble models, Factor analysis, Neural nets, Random forests, Association rules, Bayesian, Support vector machines (SVM), Anomaly detection, Proprietary algorithms, Rule induction, Social network analysis, Uplift modeling, Survival analysis, Link analysis, Genetic algorithms, MARS]

• Regression, decision trees, and cluster analysis continue to form a triad of core algorithms for most data miners. This has been consistent since the first Data Miner Survey in 2007

• The average respondent reports typically using 12 algorithms. People with more years of experience use more algorithms, and consultants use more algorithms (13) than people working in other settings (11)

• The number of algorithms used varies by the labels people use to describe themselves, with Data Miners (14) and Data Scientists (14) using the most, and Software Developers (9) and Programmers (8) the fewest

Question: What algorithms / analytic methods do you TYPICALLY use? (Select all that apply)

13

Source: Rexer Analytics, 2013

Decision trees – Binary decision trees

• Classification of an input vector is done by scanning the tree beginning at the root node and ending at a leaf
• Each node of the tree computes an inequality (e.g. BMI < 24, yes or no) based on a single input variable
• Each leaf is assigned to a particular class

[Figure: binary decision tree with BMI < 24 tests at the nodes and Yes/No branches leading to class leaves]

14

Decision Trees

• Decision trees can be:
– Classification trees, when the predicted outcome is the class to which the data belongs
– Regression trees, when the predicted outcome can be considered a real number (e.g.: the price of a house, or a patient’s length of stay in a hospital)

• The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al.

• Decision trees can be used as either supervised or unsupervised learning tools:
– when in supervised mode, they can be used to create models for future predictions
– when in unsupervised mode, they are pure classifiers

15

Decision trees – Classification and regression trees (CART)

• CLASSIFICATION AND REGRESSION TREES (CART) are binary decision trees, which split on a single variable at each node

• The CART algorithm recursively goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node

16

Trees families

• ID3, or Iterative Dichotomizer, was the first of three Decision Tree implementations developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106)

• C4.5, Quinlan’s next iteration. The new features (versus ID3) are: (i) accepts both continuous and discrete features; (ii) handles incomplete data points; (iii) solves the over-fitting problem by a (very clever) bottom-up technique usually known as “pruning”; and (iv) different weights can be applied to the features that comprise the training data

• CART (Classification And Regression Tree) is often used as a generic acronym for the term Decision Tree. The CART implementation is very similar to C4.5

• CHAID (chi-square automatic interaction detector) is a non-binary decision tree. The decision or split made at each node is still based on a single variable, but can result in multiple branches. The split search algorithm is designed for categorical variables

17

C4.5 and CART

• C4.5 trees differ from CART in several aspects, e.g.:
• Tests:
– CART: always binary
– C4.5: any number of branches
• Test selection criterion:
– CART: diversity index (Gini)
– C4.5: information-based criteria
• Pruning:
– CART: cross-validated using a cost-complexity model
– C4.5: single pass based on binomial confidence limits

18

Chi-Squared Automatic Interaction Detector (CHAID)

• Continuous variables must be grouped into a finite number of bins to create categories
– A reasonable number of “equal population bins” can be created for use with CHAID
– ex. If there are 1000 samples, creating 10 equal population bins would result in 10 bins, each containing 100 samples

• A Chi2 value is computed for each variable and used to determine the best variable to split on

19

Plan for Construction of a Tree

• Selection of the splits
• Deciding when a node is a terminal node (i.e. not to split it any further)
• Assigning a class to each terminal node

20

Impurity of a Node

• Need a measure of impurity of a node to help decide on how to split a node, or which node to split
• The measure should be at a maximum when a node is equally divided amongst all classes
• The impurity should be zero if the node is all one class

21

Impurity

[Figure: three example groups illustrating a very impure group, a less impure group, and minimum impurity]

22

Information Gain

• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned
• Information gain tells us how important a given attribute of the feature vectors is
• We will use it to decide the ordering of attributes in the nodes of a decision tree

23

Information Gain

• Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples

24

Measures of Impurity

• Misclassification Rate
• Information, or Entropy
• Gini Index

In practice the first is not used for the following reasons:
• Situations can occur where no split improves the misclassification rate
• The misclassification rate can be equal when one option is clearly better for the next step

25

Entropy

• Is a measure of information or uncertainty of a random variable
• Entropy comes from information theory. The higher the uncertainty, the higher the entropy, and the more the information content
• For example, if we toss a coin which always falls on heads, we gain no information by tossing it, so zero entropy. However, if we toss a fair coin, we are unsure of the outcome, so we get some information out of tossing it

26

Entropy

• Entropy = -Σi pi log2(pi), where pi is the probability of class i, computed as the proportion of class i in the set
• For a fair coin: H(Fair) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1
• For a biased coin (always heads): H(Biased) = -1 log2(1) = 0

27

Entropy

• What is the entropy of a group in which all examples belong to the same class?
– entropy = -1 log2(1) = 0: minimum impurity, not a good training set for learning
• What is the entropy of a group with 50% in either class?
– entropy = -0.5 log2(0.5) - 0.5 log2(0.5) = 1: maximum impurity, a good training set for learning

28

2-Class Cases: Gini Index

• This is the most widely used measure of impurity (at least by CART)
• The Gini* index is: i(p) = Σj≠i pi pj = 1 - Σj pj²

*: Corrado Gini (1884-1965) developed the Gini coefficient, a measure of the income inequality in a society

29

Scaled Impurity functions

[Figure: misclassification rate, Gini index and information (entropy) plotted against p1, scaled to a common maximum for comparison]

30

Tree Impurity

• We define the impurity of a tree to be the sum over all terminal nodes of the impurity of a node multiplied by the proportion of cases that reach that node of the tree
• Example: impurity of a tree with one single node, with both A and B having 400 cases, using the Gini Index:
– Proportions of the two cases = 0.5
– Therefore Gini Index = 1 - (0.5)² - (0.5)² = 0.5

31

Decision Tree – Exercise

Using the dataset above (available at the Dataset page in the course website as “Salaries.csv”), construct a classification and regression tree to classify Salary, based on the other variables. Use Excel first

32

Decision Tree – Exercise – Hints 1

Possible Splits

Possible Criteria: Salary <= $45,000 vs > $45,000

33

Decision Tree – Exercise – Hints 2

Possible Execution

34

Selection of Splits

• We select the split that most decreases the Gini Index (or gives the best entropy reduction). This is done over all possible places for a split and all possible variables to split on

• We keep splitting until the terminal nodes have very few cases or are all pure

• This may lead to large trees that will then need to be pruned

35
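To make the split-selection rule concrete, here is a minimal R sketch (not part of the original deck) that computes the entropy and Gini impurity of a set of class labels and the impurity decrease of a candidate split. The function names and the six-row salary example are invented for illustration; the threshold echoes the $45,000 criterion from the earlier exercise:

# Entropy of a vector of class labels: -sum(p * log2(p))
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Gini index of a vector of class labels: 1 - sum(p^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease of a candidate split, weighted by child sizes;
# works with either measure (gini by default)
split_gain <- function(y, left, impurity = gini) {
  n <- length(y)
  impurity(y) - (sum(left) / n) * impurity(y[left]) -
    (sum(!left) / n) * impurity(y[!left])
}

# Toy example: does the split Salary <= 45000 separate the classes?
salary <- c(30000, 38000, 42000, 50000, 61000, 75000)
class  <- c("Low", "Low", "Low", "High", "High", "High")
split_gain(class, salary <= 45000)   # 0.5: this split removes all Gini impurity

The best split is the one with the largest gain over all candidate variables and thresholds, exactly as described in the bullets above.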

Overfitting

• Overfitting is a scenario where the decision tree algorithm will
generate a large tree when there is actually no pattern to be
found. This is a common problem with all types of learners

• Reasons for overfitting
• Some attributes have little meaning
• Number of attributes too high
• Small training data set

36

Combating Overfitting – Pruning

• It works by eliminating nodes that are not clearly relevant
• One of the most commonly used ways to determine the relevance of a branch is to measure the gain in entropy reduction it provides compared to the previous/other branches

• Pruning is cutting the branches that provide a relatively minor contribution

37

Supervised learning process: two steps

• Learning (training): Learn a model using the training data
• Testing: Test the model using unseen test data to assess the model accuracy

38

Supervised learning process: Three steps

• A third step can be added, splitting the original dataset into 3 parts: Training, Validation and Testing

• This is the default for Rattle

39
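For reference, a minimal sketch of such a three-way partition in plain R (outside Rattle). The 70/15/15 proportions match Rattle's default partition; the data frame and object names are placeholders:

# Randomly assign each row to one of three partitions (approx. 70/15/15)
dataset <- iris   # placeholder: any data frame works here
set.seed(42)
idx <- sample(c("train", "validate", "test"), nrow(dataset),
              replace = TRUE, prob = c(0.70, 0.15, 0.15))
train    <- dataset[idx == "train", ]
validate <- dataset[idx == "validate", ]
test     <- dataset[idx == "test", ]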

Learning in Data Mining

• Given
– a data set D
– a task T
– a performance measure M
– a computer system is said to learn from D to perform the task T if, after learning, the system’s performance on T improves as measured by M

• In other words, the learned model helps the system to perform T
better as compared to no learning

40

Fundamental assumption of learning

Assumption: The distribution of training examples is identical to the
distribution of test examples (including future unseen examples)

• In practice, this assumption is often violated to a certain degree
• Strong violations will clearly result in poor classification accuracy
• To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data

41

Model Evaluation

• Evaluation metrics: How can we measure accuracy?
• Use a validation/test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier’s accuracy:

– Holdout method, random subsampling
– Cross-validation
– Bootstrap

• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves

42

Classifier Evaluation Metrics: Confusion Matrix

Actual class\Predicted class | yes                  | no
yes                          | True Positives (TP)  | False Negatives (FN)
no                           | False Positives (FP) | True Negatives (TN)

• TP and TN are the correctly predicted tuples
• May have extra rows/columns to provide totals

Example of Confusion Matrix:

Actual class\Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes           |               6954 |                46 |  7000
buy_computer = no            |                412 |              2588 |  3000
Total                        |               7366 |              2634 | 10000

43

Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
– Accuracy = (TP + TN)/All

• Error rate: misclassification rate, or 1 - accuracy
– Error rate = (FP + FN)/All

• Sensitivity: True Positive recognition rate
– Sensitivity = TP/P

• Specificity: True Negative recognition rate
– Specificity = TN/N

• Class Imbalance Problem: one class may be rare, e.g. fraud or HIV-positive, leaving a significant majority of the negative class and a minority of the positive class

A\P | Y  | N  |
Y   | TP | FN | P
N   | FP | TN | N
    | P’ | N’ | All
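Worked example, using the buy_computer confusion matrix from the previous slide: Accuracy = (6954 + 2588)/10000 = 95.42% and Error rate = (46 + 412)/10000 = 4.58%.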

44

Classifier Evaluation Metrics:
Precision and Recall, and F-measures

• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive

• Recall: completeness – what % of positive tuples did the
classifier label as positive?

• Perfect score is 1.0
• F measure (F1 or F-score): harmonic mean of precision and recall

45

precision = TP / (TP + FP)

recall = TP / (TP + FN)

F = 2 × precision × recall / (precision + recall)
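These formulas translate directly into R. A minimal sketch (the function names are illustrative; the counts in the usage lines are the TP/FP/FN values from the cancer example on the next slide):

# Metrics from raw confusion-matrix counts
precision <- function(tp, fp) tp / (tp + fp)        # exactness
recall    <- function(tp, fn) tp / (tp + fn)        # completeness
f1        <- function(p, r)   2 * p * r / (p + r)   # harmonic mean

p <- precision(90, 140)   # 0.3913
r <- recall(90, 210)      # 0.30
f1(p, r)                  # 0.3396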

Classifier Evaluation Metrics: Example

– Precision = ?? Recall = ??

Actual Class\Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                 |           90 |         210 |   300 | 30.00 (sensitivity)
cancer = no                  |          140 |        9560 |  9700 | 98.56 (specificity)
Total                        |          230 |        9770 | 10000 | 96.40 (accuracy)
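For reference: precision = 90/(90 + 140) ≈ 39.13% and recall = 90/(90 + 210) = 30.00%, matching the sensitivity reported in the table.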

46

precision = TP / (TP + FP)        recall = TP / (TP + FN)

Model Selection: ROC Curves

• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Diagonal line: for every TP, equally likely to encounter FP
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
• Vertical axis represents the true positive rate
• Horizontal axis represents the false positive rate

47
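As a companion to this slide, a minimal R sketch of plotting a ROC curve and computing its area. It assumes the third-party pROC package is available, and the label/score vectors are invented for illustration:

library(pROC)   # assumed installed: install.packages("pROC")

labels <- c(1, 1, 0, 1, 0, 0, 1, 0)                      # actual classes
scores <- c(0.9, 0.8, 0.7, 0.65, 0.4, 0.3, 0.75, 0.2)    # predicted P(positive)

roc_obj <- roc(response = labels, predictor = scores)
auc(roc_obj)    # area under the curve: 0.5 = diagonal (random), 1.0 = perfect
plot(roc_obj)   # true positive rate vs false positive rate trade-off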

Information from R/Rattle

• Rpart (the R algorithm used by Rattle to generate the trees) provides a text view of the tree, containing for each split:

node), split, n, loss, yval, (yprob)

where node) is the node number, split is the name/condition of the split, n is the number of entities at the node, loss is the number of entities incorrectly classified at that node, yval is the default classification for the node, and yprob contains the distribution of classes in that node. The following is an example:

1) root 256 41 No (0.83984375 0.16015625)

• “CP” is the threshold complexity parameter, an argument used to control the maximum size of the tree by selecting a value of the cost complexity parameter. It is monotonic with the number of splits: the smaller the value of CP, the more complex the tree (the greater the number of splits)

48

Information from R/Rattle

• For a regression tree, the relative error (rel error) is the average deviance of the current tree divided by the average deviance of the null tree

• The cross-validation error (xerror) is based on a 10-fold cross-validation and is measured relative to the deviance of the null model. The cross-validation error is greater than the relative error

Decision Tree – Exercise

Using the dataset “weather.csv”, construct a classification and regression tree to classify “RainTomorrow”, based on the other variables:

• Consider Date as an “Ident” variable
• Ignore the variable RISK_MM
• Consider first a 70/30 partition (no validation set)
• Use first Complexity = 0, Min. Split = 2, Min. Bucket = 1, and then use the model metrics to prune the tree by tuning the above parameters (a sketch in R follows below)
• Explain the results

49
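For those working outside the Rattle GUI, here is a minimal R sketch of the same exercise using rpart directly. The file and column names (weather.csv, Date, RISK_MM, RainTomorrow) and the parameter values come from the exercise above; the object names, the random seed, and the cp = 0.02 pruning value are illustrative placeholders:

library(rpart)

# Load the data; factors are needed for a classification tree
weather <- read.csv("weather.csv", stringsAsFactors = TRUE)
weather$Date    <- NULL   # identifier ("Ident") variable, not a predictor
weather$RISK_MM <- NULL   # ignored, as the exercise requires

# 70/30 training/testing partition (no validation set)
set.seed(42)
train_idx <- sample(nrow(weather), round(0.7 * nrow(weather)))
train <- weather[train_idx, ]
test  <- weather[-train_idx, ]

# Deliberately overgrown tree: Complexity = 0, Min. Split = 2, Min. Bucket = 1
fit <- rpart(RainTomorrow ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

printcp(fit)                      # CP table with rel error and xerror per split
pruned <- prune(fit, cp = 0.02)   # replace 0.02 with the CP at the lowest xerror

# Compare test-set confusion matrices for the full and pruned trees
table(actual = test$RainTomorrow,
      predicted = predict(pruned, test, type = "class"))

Growing the tree with cp = 0 and then pruning back to the CP value with the lowest cross-validation error is the same overgrow-then-prune workflow the exercise asks you to carry out through the Rattle interface.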