408216 Data Mining and Knowledge Engineering
Lecture 3
Data Mining Algorithms for Classification
Examine the algorithms used in popular classification schemes
Naïve Bayes
Nearest Neighbour
Decision Trees
Neural Networks
Naïve Bayes
Uses Bayes' theorem for reasoning
Each data feature contributes to a portfolio of evidence
Assumes that all data features are statistically independent of each other – not always true
Models developed with Naïve Bayes have good explanatory power, just as with Decision Trees
Robust technique as it can deal with unknown/missing values
Two types of scenarios for missing values:
Missing values in the training dataset:
Can be handled by simply not including the missing values in the frequency counts used for the probability estimates.
Missing values in the instance to be classified
This needs care: if just one of the attribute values is missing, the corresponding conditional probability can be set to 1 for ALL class outcomes (equivalent to leaving that attribute out of the calculation), so the result is not biased towards any class.
One of the most efficient classification techniques as it makes only one pass through training data
Naïve Bayes works well in practice even though features may not be statistically independent – see the counting-based sketch below
In cases where a large number of features depend on each other, accuracy can drop substantially and a more advanced model, the Bayesian Network, should be used
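A minimal counting-based sketch of Naïve Bayes for nominal attributes (illustrative Python, not the lecture's or Weka's implementation; the function names and add-one smoothing are assumptions). It shows the single pass over the training data and the missing-value handling described above: None values are skipped both when counting and when classifying, so no class is favoured.

```python
from collections import defaultdict

def train_naive_bayes(rows, labels):
    """One pass over the training data: count class frequencies and
    attribute-value frequencies per class; missing values (None) are skipped."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)              # key: (class, attribute index, value)
    distinct = defaultdict(set)                  # distinct values seen per attribute
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for i, value in enumerate(row):
            if value is None:                    # missing value in training data: not counted
                continue
            value_counts[(label, i, value)] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct

def classify(row, class_counts, value_counts, distinct):
    """Pick the class with the highest (unnormalised) posterior, using
    add-one (Laplace) smoothing to avoid zero probabilities."""
    n = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, c_count in class_counts.items():
        score = c_count / n                      # prior P(c)
        for i, value in enumerate(row):
            if value is None:                    # missing attribute in the instance to classify:
                continue                         # skipped for every class, so no class is favoured
            score *= (value_counts[(c, i, value)] + 1) / (c_count + len(distinct[i]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny illustrative dataset (hypothetical, not the lecture's example)
X = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "hot"]]
y = ["no", "no", "yes", "yes"]
model = train_naive_bayes(X, y)
print(classify(["rainy", None], *model))         # prints "yes"; the missing attribute is ignored
```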
Bayesian Networks
P(HeartDisease = "No" | Exercise = "No", Diet = "Healthy")
= 1 − P(HeartDisease = "Yes" | Exercise = "No", Diet = "Healthy") = 1 − 0.55 = 0.45
P(HeartBurn = "Yes" | Diet = "Healthy") = 0.2
In general more accurate than Naïve Bayes but less efficient, because the structure of the network must also be learned – this can be done with an algorithm such as K2 (see reference at end)
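A tiny sketch of how a Bayesian network stores its parameters: each node keeps a conditional probability table (CPT), P(node | parents), keyed by the values of its parents. The numbers mirror the heart-disease example above; the variable names and the partially filled table are illustrative assumptions.

```python
p_heart_disease_yes = {                  # P(HeartDisease="Yes" | Exercise, Diet)
    ("no", "Healthy"): 0.55,
    # ...entries for the other parent-value combinations would complete the table
}
p_heart_burn_yes = {"Healthy": 0.2}      # P(HeartBurn="Yes" | Diet)

p_yes = p_heart_disease_yes[("no", "Healthy")]
p_no = 1 - p_yes                         # complement rule: 1 - 0.55 = 0.45
print(p_no, p_heart_burn_yes["Healthy"])
```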
Decision Trees
Normal procedure: top-down, in a recursive divide-and-conquer fashion (see the sketch after these steps)
First: an attribute is selected for the root node and a branch is created for each possible attribute value
Then: the instances are split into subsets (one for each branch extending from the node)
Finally: the procedure is repeated recursively for each branch, using only the instances that reach that branch
Process stops if all instances have the same class
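A minimal sketch of the recursive divide-and-conquer procedure just described, for nominal attributes (illustrative Python, not ID3's actual code). The attribute-selection step is a placeholder here; a purity-based choice such as information gain is sketched after the entropy formula below.

```python
from collections import Counter

def best_attribute(instances, attributes):
    """Placeholder: a purity-based criterion such as information gain
    (sketched after the entropy formula below) would normally go here."""
    return attributes[0]

def build_tree(instances, attributes, min_instances=2):
    """instances: list of (attribute-dict, class-label) pairs."""
    labels = [label for _, label in instances]
    # Stop: all instances share one class, no attributes left, or node too small
    if len(set(labels)) == 1 or not attributes or len(instances) < min_instances:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class
    attr = best_attribute(instances, attributes)
    node = {attr: {}}
    for value in {row[attr] for row, _ in instances}:        # one branch per value
        subset = [(row, lab) for row, lab in instances if row[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node[attr][value] = build_tree(subset, remaining, min_instances)
    return node

# Tiny illustrative run (hypothetical data, not the lecture's weather dataset)
data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "don't play"),
        ({"outlook": "rainy", "windy": "no"}, "don't play")]
print(build_tree(data, ["outlook", "windy"], min_instances=1))
```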
Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the “purest” nodes
Popular purity criterion: information gain
Information gain increases with the average purity of the subsets that an attribute produces
Strategy: choose attribute that results in greatest information gain
Information is measured in bits
Given a probability distribution, the info required to predict an event is the distribution’s entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Formula for computing the entropy:
entropy(p1, p2, …, pn) = – p1 log2(p1) – p2 log2(p2) – … – pn log2(pn)
Where p1, p2, …, pn denote the probability of occurrence of outcomes for class 1, 2, …, n respectively
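A small Python sketch of the entropy formula above and of information gain computed from it (helper names are illustrative; instances are (attribute-dict, class-label) pairs as in the tree sketch above).

```python
from math import log2
from collections import Counter

def entropy(labels):
    """entropy(p1, ..., pn) = -sum(pi * log2(pi)), measured in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(instances, attr):
    """Entropy of the parent node minus the weighted average entropy of the
    subsets produced by splitting on attr."""
    labels = [label for _, label in instances]
    n = len(instances)
    remainder = 0.0
    for value in {row[attr] for row, _ in instances}:
        subset = [lab for row, lab in instances if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A perfectly pure split yields the maximum possible gain (here: 1 bit)
data = [({"outlook": "sunny"}, "no"), ({"outlook": "rainy"}, "yes"),
        ({"outlook": "sunny"}, "no"), ({"outlook": "rainy"}, "yes")]
print(entropy([lab for _, lab in data]))      # 1.0 bit
print(information_gain(data, "outlook"))      # 1.0 bit
```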
Note: not all leaves need to be pure
Some leaf nodes will have instances belonging to more than one class – in this case the leaf node is labelled with the majority class
Splitting stops when data can’t be split any further – when we run out of attributes to split on or the number of instances at the node to be split is less than a specified minimum number
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many data sets
Information Gain (IG) is a good method of identifying splits in the tree, but it can cause problems for features with high cardinality
In the weather example the Day ID perfectly correlates to the right play decision in historical data
This means that Day ID has the highest IG value
However, it has no power whatsoever to predict decisions, hence no generalization capability!
The solution in this case is to use Gain Ratio – see the sketch below
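A sketch of Gain Ratio, which divides information gain by the "split information" (the entropy of the partition the attribute induces), penalising high-cardinality attributes such as a Day ID. It reuses the entropy and information_gain helpers sketched after the entropy formula above; names are illustrative.

```python
from math import log2
from collections import Counter

def split_information(instances, attr):
    """Entropy of the partition induced by attr (how many and how even the branches are)."""
    n = len(instances)
    sizes = Counter(row[attr] for row, _ in instances).values()
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(instances, attr):
    """Information gain normalised by split information (uses information_gain
    from the sketch above)."""
    si = split_information(instances, attr)
    return information_gain(instances, attr) / si if si > 0 else 0.0
```

For a Day-ID-style attribute that puts every instance in its own branch, the split information is log2(n), so the otherwise maximal information gain is scaled down and the attribute no longer wins automatically.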
Practical issues in classification:
Underfitting and Overfitting
Missing Values
Costs of Classification
500 circular and 500 triangular data points.
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
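A small sketch (assuming NumPy is available; not part of the lecture material) of how a dataset like this could be generated: points are labelled by whether their distance from the origin falls inside the ring.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.5, 1.5, size=(20000, 2))              # random points in a square
radius = np.sqrt((candidates ** 2).sum(axis=1))
circular = candidates[(radius >= 0.5) & (radius <= 1.0)][:500]    # inside the ring
triangular = candidates[(radius < 0.5) | (radius > 1.0)][:500]    # outside the ring
X = np.vstack([circular, triangular])
y = np.array(["circle"] * 500 + ["triangle"] * 500)
```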
Two problems that can arise with models developed with Data Mining are: Overfitting and Underfitting
Underfitting occurs when the model has not fully learned all the patterns in the data, resulting in poor prediction accuracy (test accuracy).
Underfitting is generally caused by the inability of the algorithm to find all patterns in the training dataset.
In the case of a Decision Tree method the tree developed is not of sufficient depth and size to learn all the patterns present in the data
With Overfitting the model learns the patterns in the training data very well, but it cannot predict newly arriving data well
In other words, accuracy on the training dataset is high but drops drastically on newly arriving data – training set accuracy >> test set accuracy
In the case of the Decision tree method the tree developed is too detailed (too large in size)
Overfitting is generally caused by the following (see the sketch after this list):
Noise (errors in assigning class labels) in the training dataset
Lack of sufficient data to capture certain types of patterns
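A sketch of how overfitting shows up in practice, assuming scikit-learn is available (the dataset and parameters are illustrative, not from the lecture): as the maximum tree depth grows, training accuracy keeps rising while test accuracy stops improving, i.e. training set accuracy >> test set accuracy. The flip_y parameter injects label noise, one of the causes listed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% noisy class labels (flip_y) to encourage overfitting
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 10, None):           # None lets the tree grow to its full size
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```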
Overfitting due to noise: the decision boundary is distorted by a noise point
Overfitting due to insufficient examples: a lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region – the insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Overfitting results in decision trees that are more complex than necessary
Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
Need other ways of estimating the error, e.g. using a separate validation set
Post-pruning
Grow decision tree to its entirety
Trim the nodes of the decision tree in a bottom-up fashion
If generalization error improves after trimming, replace sub-tree by a leaf node.
Class label of leaf node is determined from majority class of instances in the sub-tree (see the sketch below)
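A sketch of bottom-up post-pruning applied to the dict-based tree from the earlier sketch (illustrative Python, not C4.5's actual pruning): each sub-tree is tentatively replaced by a majority-class leaf, and the replacement is kept only if the error estimated on a held-out validation set does not get worse.

```python
from collections import Counter

def predict(tree, row):
    """Follow the branches of the dict-based tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row.get(attr))        # unseen value falls through to None
    return tree

def errors(tree, dataset):
    """Number of misclassified (row, label) pairs."""
    return sum(predict(tree, row) != label for row, label in dataset)

def prune(tree, training, validation):
    """Bottom-up: prune the children first, then try replacing this sub-tree by a leaf."""
    if not isinstance(tree, dict) or not training:  # already a leaf, or no data reached here
        return tree
    attr = next(iter(tree))
    for value in list(tree[attr]):
        sub_tr = [(r, l) for r, l in training if r.get(attr) == value]
        sub_va = [(r, l) for r, l in validation if r.get(attr) == value]
        tree[attr][value] = prune(tree[attr][value], sub_tr, sub_va)
    leaf = Counter(l for _, l in training).most_common(1)[0][0]   # majority class of the sub-tree
    # Replace the sub-tree if the estimated generalization error does not get worse
    if errors(leaf, validation) <= errors(tree, validation):
        return leaf
    return tree
```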
Algorithm for top-down induction of decision trees (“ID3”) was developed by Ross Quinlan
ID3 only deals with nominal attributes
Led to development of C4.5, which can deal with numeric attributes, missing values, and noisy data
Similar approach: CART
There are many other attribute selection criteria! (but they make almost no difference to the accuracy of the result)
References:
Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (3rd edition), Elsevier, 2011, Chapter 4
Use of kernel functions for estimating probabilities for Naïve Bayes:
George H. John and Pat Langley, Estimating Continuous Distributions in Bayesian Classifiers, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995 – available from Google Scholar or the AUT eLibrary
K2 algorithm for learning the structure of a Bayesian network from training data:
G. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning, 9 (1992), 309–347