
408216 Data Mining and Knowledge Engineering

Lecture 3
Data Mining Algorithms for Classification


Examine the algorithms used in popular classification schemes

Naïve Bayes
Nearest Neighbour
Decision Trees
Neural Networks

Uses Bayes' theorem for reasoning
Each data feature contributes to a portfolio of evidence
Assumes that all data features are statistically independent of each other – not always true

Models developed with Naïve Bayes have good explanatory power, just as with Decision Trees

Robust technique as it can deal with unknown/missing values
Two types of scenarios for missing values:

Missing values in the training dataset:
Can be handled by simply not including the instance in the frequency counts for the attribute whose value is missing.
Missing values in the instance to be classified
This needs more care: if one of the attribute values is missing, the corresponding conditional probability can be set to 1 for ALL class outcomes (equivalently, the attribute is simply omitted from the calculation), so the result is not biased towards any class.

One of the most efficient classification techniques as it makes only one pass through the training data (see the sketch below)
Naïve Bayes works well in practice even though features may not be statistically independent
In cases where a large number of features are dependent on each other, accuracy can drop substantially and a more advanced model, the Bayesian Network, should be used
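A minimal Python sketch of the points above, assuming nominal attributes stored as dictionaries and None marking a missing value; the add-one (Laplace-style) correction for unseen values is an illustrative addition, not something the slides prescribe:

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """One pass over the training data: count classes and attribute values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)          # (class, attribute) -> value counts
    for row, label in zip(rows, labels):
        for attr, value in row.items():
            if value is not None:                # missing training value: simply not counted
                value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(row, class_counts, value_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for cls, count in class_counts.items():
        score = count / total                    # prior P(class)
        for attr, value in row.items():
            if value is None:                    # missing value: factor of 1 for every class
                continue
            counts = value_counts[(cls, attr)]
            # add-one correction so unseen values do not zero out the product (illustrative)
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Tiny usage example with made-up weather-style data
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rainy", "windy": "yes"}]
labels = ["yes", "no"]
model = train(rows, labels)
print(classify({"outlook": "sunny", "windy": None}, *model))
```

Note that training really is a single pass over the data, and a missing value in the instance to be classified contributes a factor of 1 to every class score, exactly as described above.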

Bayesian Networks

P(HeartDisease=“No” | Exercise=“no”, Diet=“Healthy”)
    = 1 − P(HeartDisease=“Yes” | Exercise=“no”, Diet=“Healthy”) = 1 − 0.55 = 0.45
P(HeartBurn=“yes” | Diet=“Healthy”) = 0.2
In general more accurate than Naïve Bayes, but not as efficient because the structure of the network must also be learned – this can be done using an algorithm such as K2 (see reference at end)
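A minimal sketch of how the conditional probability tables behind these numbers could be stored and queried; the dependencies (HeartDisease on Exercise and Diet, HeartBurn on Diet) and the values 0.55 and 0.2 come from the slide, while the table layout and function names are illustrative:

```python
# CPT entry from the slide: P(HeartDisease="Yes" | Exercise="No", Diet="Healthy") = 0.55
p_hd_yes = {("No", "Healthy"): 0.55}       # other parent-value combinations omitted here

# CPT for P(HeartBurn="Yes" | Diet)
p_hb_yes = {"Healthy": 0.2}

def p_heart_disease(value, exercise, diet):
    """P(HeartDisease = value | exercise, diet) via the complement rule."""
    p_yes = p_hd_yes[(exercise, diet)]
    return p_yes if value == "Yes" else 1.0 - p_yes

print(p_heart_disease("No", "No", "Healthy"))   # 0.45, as computed on the slide
print(p_hb_yes["Healthy"])                      # 0.2
```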


Normal procedure: top-down, in a recursive divide-and-conquer fashion
First: an attribute is selected for the root node and a branch is created for each possible attribute value
Then: the instances are split into subsets (one for each branch extending from the node)
Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch
The process stops if all instances have the same class (a sketch of this recursion is given below)
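A minimal Python sketch of this recursive divide-and-conquer procedure, assuming each instance is a dictionary of nominal attribute values; the attribute chooser picks the "purest" split (lowest weighted entropy), anticipating the information-gain criterion discussed next:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split produces the purest subsets
    (lowest weighted entropy, i.e. the highest information gain)."""
    def weighted_entropy(attr):
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(label)
        return sum(len(s) / len(labels) * entropy_of(s) for s in subsets.values())
    return min(attributes, key=weighted_entropy)

def build_tree(rows, labels, attributes, min_instances=2):
    """Top-down, recursive divide-and-conquer induction as described above."""
    # Stop if all instances share one class, no attributes remain,
    # or too few instances are left to split on.
    if len(set(labels)) == 1 or not attributes or len(rows) < min_instances:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    attr = best_attribute(rows, labels, attributes)      # attribute for this node
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    # One branch per attribute value, built only from the instances that reach it.
    for value in set(row[attr] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[attr][value] = build_tree(list(sub_rows), list(sub_labels),
                                       remaining, min_instances)
    return tree

# Tiny usage example with made-up data
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
print(build_tree(rows, ["yes", "no", "yes"], ["outlook", "windy"], min_instances=1))
```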

Which is the best attribute?

The one which will result in the smallest tree
Heuristic: choose the attribute that produces the “purest” nodes
Popular purity criterion: information gain

Information gain increases with the average purity of the subsets that an attribute produces
Strategy: choose attribute that results in greatest information gain

Information is measured in bits

Given a probability distribution, the info required to predict an event is the distribution’s entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Formula for computing the entropy:

entropy(p1, p2, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn

where p1, p2, …, pn denote the probabilities of occurrence of the outcomes for classes 1, 2, …, n respectively
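A minimal sketch of the entropy computation; the 9/14 and 5/14 probabilities in the example are assumed to be the class proportions of the classic 14-instance weather data referred to later in the lecture:

```python
import math

def entropy(probabilities):
    """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn), in bits.
    Terms with p = 0 contribute nothing (the limit of p*log2(p) is 0)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Example: a node with a 9/14 vs 5/14 class split needs about 0.940 bits
# of information to predict the class of an instance reaching it.
print(entropy([9/14, 5/14]))   # ~0.940
```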

Note: not all leaves need to be pure
Some leaf nodes will have instances belonging to more than one class – in this case the leaf node is labelled with the majority class
Splitting stops when data can’t be split any further – when we run out of attributes to split on or the number of instances at the node to be split is less than a specified minimum number

Advantages:

Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many data sets

IG is a good method of identifying splits in the tree but it can cause problems for features with high cardinality
In the weather example the Day ID perfectly correlates with the right play decision in the historical data
This means that Day ID has the highest IG value
However, it has no power whatsoever to predict future decisions, hence no generalization capability!
The solution in this case is to use the Gain Ratio, which divides the information gain by the entropy of the split itself (see the sketch below)
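A minimal sketch contrasting information gain with gain ratio on a made-up toy example; the gain ratio here is computed as information gain divided by the split information (the entropy of the attribute's own value distribution), the standard C4.5 definition, which is an assumption beyond what the slide states:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(values, labels):
    """Information gain of splitting `labels` on the attribute `values`."""
    n = len(labels)
    subsets = {}
    for v, l in zip(values, labels):
        subsets.setdefault(v, []).append(l)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    split_info = entropy(values)    # large for high-cardinality attributes
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Toy illustration: a unique "Day ID" per instance vs. a genuine attribute.
play   = ["yes", "yes", "no", "no", "yes", "no"]
day_id = ["d1", "d2", "d3", "d4", "d5", "d6"]      # unique per instance
windy  = ["no", "no", "yes", "yes", "no", "yes"]

print(info_gain(day_id, play), info_gain(windy, play))    # both 1.0 bits
print(gain_ratio(day_id, play), gain_ratio(windy, play))  # ~0.39 vs 1.0
# Day ID has maximal information gain but a large split info,
# so its gain ratio is pulled well below that of the genuinely useful attribute.
```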

Underfitting and Overfitting

Missing Values

Costs of Classification

500 circular and 500 triangular data points.

Circular points:
0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
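A minimal sketch of how such a dataset could be generated; only the radius conditions come from the slide, and the rejection-sampling scheme over the square [-1, 1] × [-1, 1] is an assumption:

```python
import math
import random

def sample_point(circular):
    """Rejection-sample a 2-D point satisfying the class's radius condition."""
    while True:
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        r = math.sqrt(x1 ** 2 + x2 ** 2)
        if circular and 0.5 <= r <= 1:
            return (x1, x2, "circle")
        if not circular and (r < 0.5 or r > 1):
            return (x1, x2, "triangle")

# 500 circular and 500 triangular points, as described above
data = [sample_point(True) for _ in range(500)] + [sample_point(False) for _ in range(500)]
```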



Two problems that can arise with models developed with Data Mining are: Overfitting and Underfitting
Underfitting occurs when the model has not fully learned all the patterns in the data, resulting in poor prediction accuracy (test accuracy).
Underfitting is generally caused by the inability of the algorithm to find all patterns in the training dataset.
In the case of a Decision Tree method the tree developed is not of sufficient depth and size to learn all the patterns present in the data

With Overfitting, the model learns the patterns in the training data very well, but the learnt model cannot predict newly arriving data well
In other words, accuracy on the training dataset is high but drops drastically on newly arriving data – training set accuracy >> test set accuracy (a sketch illustrating this gap follows the list of causes below)
In the case of the Decision Tree method the tree developed is too detailed (too large in size)
Overfitting is generally caused by:

Noise (errors in assigning class labels) in the training dataset
Lack of sufficient data to capture certain types of patterns
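A minimal sketch, assuming scikit-learn and a generic built-in dataset (a hypothetical choice for illustration only), of how the training/test accuracy gap described above shows up as trees grow deeper:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # any labelled dataset would do
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, 10, None):                       # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))

# Very shallow trees underfit (both accuracies relatively low), while a fully
# grown tree overfits: training accuracy approaches 1.0 but test accuracy lags,
# i.e. training set accuracy >> test set accuracy.
```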

Overfitting

Decision boundary is distorted by noise point

Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region
Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

Overfitting results in decision trees that are more complex than necessary

Training error no longer provides a good estimate of how well the tree will perform on previously unseen records

Need new ways for estimating errors

Post-pruning

Grow the decision tree in its entirety
Trim the nodes of the decision tree in a bottom-up fashion
If the generalization error improves after trimming, replace the sub-tree by a leaf node
The class label of the leaf node is determined from the majority class of instances in the sub-tree (see the sketch below)
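A minimal sketch of this bottom-up pruning on the nested-dict trees produced by the earlier induction sketch; using a separate pruning (validation) set to estimate generalization error, measuring error over the whole pruning set rather than per node, and approximating the majority class by a majority over the sub-tree's leaves are all simplifying assumptions:

```python
from collections import Counter

def leaf_labels(tree):
    """All leaf class labels under a (sub)tree; a leaf is any non-dict label."""
    if not isinstance(tree, dict):
        return [tree]
    attr = next(iter(tree))
    return [lab for sub in tree[attr].values() for lab in leaf_labels(sub)]

def majority_label(tree):
    """Majority class over the leaves of a sub-tree (a proxy for its instances)."""
    return Counter(leaf_labels(tree)).most_common(1)[0][0]

def classify(tree, row):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row.get(attr), majority_label(tree))
    return tree

def errors(tree, rows, labels):
    return sum(classify(tree, r) != l for r, l in zip(rows, labels))

def prune(tree, prune_rows, prune_labels):
    """Trim nodes bottom-up; keep the replacement leaf if it does not
    increase the error on the pruning set."""
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    for value, sub in tree[attr].items():
        tree[attr][value] = prune(sub, prune_rows, prune_labels)
    leaf = majority_label(tree)          # candidate leaf: majority class of the sub-tree
    if errors(leaf, prune_rows, prune_labels) <= errors(tree, prune_rows, prune_labels):
        return leaf
    return tree
```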

Algorithm for top-down induction of decision trees (“ID3”) was developed by Ross Quinlan

ID3 only deals with nominal attributes
Led to development of C4.5, which can deal with numeric attributes, missing values, and noisy data
Similar approach: CART
There are many other attribute selection criteria! (but they make almost no difference to the accuracy of the result)

Ian Witten, Eibe Frank and Mark Hall, Data Mining: Practical Machine Learning Tools and Techniques (3rd edition), Elsevier, 2011, Chapter 4.

Use of kernel functions for estimating probabilities for Naïve Bayes:
George H. John and Pat Langley, Estimating Continuous Distributions in Bayesian Classifiers, available from Google Scholar or the AUT eLibrary.

K2 algorithm for learning the structure of a Bayesian network from training data:
G. Cooper and E. Herskovits, A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, 9 (1992), 309–347.