
Classification & Prediction: Introduction and Decision Trees

COMP3425/COMP8410 – Data Mining – Sem 1 2021

Table of contents

1. Introduction
2. Classification (Text: 8.1)
2.1. Two steps: Construct and Evaluate
2.2. Evaluation
3. Decision Tree Induction (Text: 8.2)
4. Basic, greedy, decision tree algorithm
5. Attribute Selection Methods (Text 8.2.2)
5.1. Information Gain
5.2. Information Gain for continuous-valued attributes
5.3. Gain Ratio
5.4. Gini index
5.5. Other Attribute Selection Methods
6. Overfitting and tree pruning (Text 8.2.3)
7. Enhancements to the Basic Algorithm (not in text)
8. Extracting Rules from Decision Trees (Text 8.4.2)
9. Practical Exercise: Decision trees

1. Introduction
Most of this material is derived from the text, Han, Kamber and Pei, Chapter 8, or the corresponding PowerPoint slides made available by the publisher. Where a source other than the text or its slides was used for the material, attribution is given. Unless otherwise stated, images are copyright of the publisher, Elsevier.

In this module of work we aim to introduce the data mining problems of classification and prediction, and the basic classification technique of decision trees, so that you can recognise a problem to which they may apply, can apply them to the problem, and can evaluate the quality of the results.

ACTION: You can check
out the first few minutes of this video for a brief description of what a
classification problem looks like: 


ACTION: You can watch this video lecture overview of the classification and prediction topic if you find it helpful, but everything it covers is also in the written notes.

Prerecorded lecture for classification and prediction

2. Classification (Text: 8.1)
Classification builds models that describe interesting classes of data. The models are called classifiers because, once built, the model may be used to classify unseen data. Sometimes the model itself is more important than its use in ongoing classification, because it provides a compact summary of the data that is explanatory for humans.

Most commonly classification is binary, that is, objects are determined to belong to a class or not. For example, taxpayers are classified as fraudulent, or not. However, the generalisation to classifying data into more than two classes is important.

Classification is often regarded as a machine-learning problem due to its origins in AI research, although data mining research has developed the scalability to handle large, disk-stored data sets.

Nowadays it is widely applied to problems in science, marketing, fraud detection, performance prediction, medical diagnosis, and fault diagnosis.

Classification

Used to predict categorical class labels (discrete or nominal) from unlabelled data.
Constructs (or learns) a classifier (or model)  from  training data that includes, for each example in the data,  data values as well as a pre-determined class label.
Uses the model to predict the class label for new, unseen, unlabelled data.

Classification vs Prediction

Although classifiers predict the values of unknown class labels, classification is usually distinguished from the problem of numerical prediction  (commonly called simply prediction) that  builds models  of  continuous-valued functions and so predicts unknown or missing numeric values. We will also study some popular prediction techniques.

Supervised Learning vs Unsupervised learning

Again, we see the AI influence in the language here, where supervised learning
refers to classification as we have defined it — where the training
data (observations, measurements, etc.) is accompanied by labels
indicating the class of the observations, and new data is classified
based on the training set. In this AI-oriented view of classification we
often talk about batch vs incremental learning.
The former is usually an unstated assumption for data mining. In the
latter case the labelled data becomes available to the  learning
algorithm in a sequence  and a working classifier developed
initially from a small amount of data must be continually updated to
account for new data.

On the other hand, for unsupervised learning there are no class labels in the training data and the learning algorithm must find some interesting classes, or classifications, with which to classify new data. This is commonly called clustering. We will study some popular clustering techniques later.

So classification can also be defined as supervised learning of categorical variables.

2.1. Two steps: Construct and Evaluate
Step 1:  Training phase or learning step: Build a model from the labelled training set.

Each tuple/sample/record/object/example/instance/feature vector  of the training  dataset  is assumed to belong to a predefined class, as determined by the class label attribute. Ideally, the tuples are a random sample from the full population of data.

The set of tuples used for model construction is the training set $D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_{|D|}, y_{|D|})\}$,

and each $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ where each $x_{ij}$ is an attribute value, and $y_i \in \{C_1, C_2, \ldots, C_m\}$, for some number of classes $m$, is the class label for $X_i$.

Commonly, each $X_i$ is assumed to belong to exactly one class.
In the very common special case of exactly 2 classes, i.e. binary learning, the training classes are called the positive examples, or P, and the negative examples, or N.
The model is represented as classification rules, decision trees, mathematical formulae, or a “black box”. The model can be viewed as a function $y = f(X)$ that can predict the class label $y$ for some unlabelled tuple $X$.
For classification models, the built model may be called a classifier.

Step 2: Use the model to classify unseen objects

Need to estimate the accuracy of the model

The known labels of a set of independent test samples are compared with the classified results for those same samples from the model
Accuracy is the proportion of test set samples that are correctly classified by the model

If the accuracy and all other evaluation measures are acceptable, apply the model to classify data objects whose class labels are not known.

Example:

The data classification process:

(a) Learning: Training data is analysed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules.

(b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new, unlabelled data tuples.
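As a concrete, non-textbook illustration of these two steps, here is a minimal Python sketch using scikit-learn (an assumption of this example, not a tool used in the text), with its bundled iris data standing in for any labelled data set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Any labelled data would do; iris is just a convenient stand-in.
X, y = load_iris(return_X_y=True)

# Step 1 (training phase): learn a classifier from labelled training tuples,
# holding out an independent test set for later evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (use/evaluate): accuracy is the proportion of test tuples whose
# predicted label matches the known label.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))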

2.2. Evaluation
Learning algorithms (or learners) that build models for classification and prediction are generally evaluated in the following ways. These criteria apply both when inventing new algorithms and assessing them against others, and when selecting an algorithm for its suitability to a particular learning problem. Once one or more algorithms have been selected and models built, these overarching principles may be revisited to choose which model, if any, to put into practice.

Accuracy, often measured on benchmark data sets so that learning algorithms can be compared with one another

Classifier accuracy: Predicting class label
Predictor accuracy: Guessing value of predicted attributes

Speed and complexity
Time to construct the model (training time)
Time to use the model (classification/prediction time)
Worst case or average case theoretical complexity

Scalability

Efficiency in handling disk-based databases
Potential for speed up by parallel computation

Robustness
Handling noise and outliers

Interpretability
Understanding and insight provided by the model

Other measures
goodness of rules
decision tree size
compactness or simplicity of model

3. Decision Tree Induction (Text: 8.2)
A
decision tree classifies labelled data by applying a sequence of
logical tests on attributes that partition the data into finer and finer
sets. The model that is learnt is a tree of logical tests.

The decision tree is a flowchart-like structure, where

each internal node, as well as the topmost root node, represents a test on an attribute; commonly the tests can have only two outcomes, in which case the tree is binary

each branch directed out and down from an internal node represents an outcome of the test

each leaf node (or terminal node) represents a decision and holds a class label

a path from the root to a leaf traces out the classification for a tuple

Decision tree induction is very popular for classification because:

relatively fast learning speed
convertible to simple and easy-to-understand classification rules
can use SQL queries to access databases during tree building
comparable classification accuracy with other methods

Exercise

Here is some data for a binary classification problem, with label buys_computer.

ACTION:  Consider what sequence of decisions you would propose to identify “who buys a computer”? Is your tree binary, or not?

And here is a decision tree model to classify the data.

ACTION:
Consider,  how does it differ from yours? Can you always 
design a tree that correctly classifies every object in the training
dataset?

4. Basic, greedy, decision tree algorithm

A typical basic algorithm follows. It is greedy (it makes each decision by optimising the immediate next step, and never backtracks to reconsider).

It is a  recursive, top-down divide-and-conquer approach to build a tree.

Attributes may be nominal, ordinal, or continuous.

At the start, all the training examples are at the root
At a node, a test attribute is selected on the basis of a heuristic or statistical measure (e.g., information gain)
Examples at the node are partitioned into sub-nodes based on the selected attribute
Recurse over the sub-nodes
Partitioning stops when:

All samples for a given node belong to the same class; or

There are no remaining attributes for further partitioning – majority voting is employed for labelling the leaf; or

There are no samples left

A slightly more generic version of the algorithm, given as pseudocode in the text, permits n-ary (multiway) trees and can discretise continuous attributes dynamically, according to local context in the tree.
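To make the recursion concrete, here is a minimal, illustrative Python sketch of such a greedy, top-down learner for nominal attributes. It is not the text's pseudocode: the tree representation (a nested dict) and the parameter names are assumptions, and the attribute_selection_method is passed in as the function select_attribute (for example, an information-gain chooser as in Section 5.1).

from collections import Counter

# Illustrative sketch only, not the textbook's algorithm.
# data is a list of (attribute_dict, class_label) pairs;
# attributes maps each remaining attribute name to its set of possible values;
# select_attribute(data, attribute_names) returns the chosen splitting attribute.

def majority_class(data):
    return Counter(label for _, label in data).most_common(1)[0][0]

def build_tree(data, attributes, select_attribute):
    labels = {label for _, label in data}
    if len(labels) == 1:                      # all samples belong to one class
        return labels.pop()
    if not attributes:                        # no attributes left: majority vote
        return majority_class(data)
    best = select_attribute(data, list(attributes))   # greedy choice, never revisited
    remaining = {a: v for a, v in attributes.items() if a != best}
    tree = {best: {}}
    for value in attributes[best]:            # one branch per known value (n-ary split)
        subset = [(row, label) for row, label in data if row[best] == value]
        if not subset:                        # no samples left: label with parent majority
            tree[best][value] = majority_class(data)
        else:
            tree[best][value] = build_tree(subset, remaining, select_attribute)
    return tree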

ACTION: Go back to the previous page and build a tree stepping through the algorithm as shown here. 

Is your tree any different this time? What attribute_selection_method did you use?

5. Attribute Selection Methods (Text 8.2.2)
Attribute selection methods are also called splitting rules.

Techniques to choose a splitting criterion comprising a splitting attribute together with a split point or splitting subset
Aim to have the partitions at each branch as pure as possible, i.e. ideally all examples in each sub-node belong to the same class.

Example

This figure shows three possibilities for partitioning tuples based on the splitting criterion, each with examples. Let A be the splitting attribute. (a) If A is nominal-valued, then one branch is grown for each known value of A. (b) If A is continuous-valued or ordinal, then two branches are grown, corresponding to A ≤ split_point and A > split_point. (c) If A is nominal and a binary tree must be produced, then the test is of the form A ∈ S_A, where S_A is the splitting subset for A.

Heuristics (or attribute selection measures) are used to choose the best splitting criterion.

Information gain, gain ratio and Gini index are the most popular.

Information gain:

biased towards multivalued attributes

Gain ratio:

tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:

biased towards multivalued attributes

has difficulty when number  of classes is large

tends to favour tests that result in equal-sized partitions and purity in both partitions

5.1. Information Gain
Information Gain

This was a very early method that sprang from AI research in ID3, and was refined further to become Gain Ratio in  C4.5.

It selects the attribute to split on with the highest information gain

Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, one of $m$ classes, where $C_{i,D}$ is the set of tuples in $D$ labelled with class $C_i$. It is estimated by $p_i = |C_{i,D}| / |D|$.

The expected information (entropy) needed to classify a tuple in $D$ is defined by

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

After using attribute $A$ to split $D$ into $v$ partitions $\{D_1, D_2, \ldots, D_v\}$, corresponding to the $v$ values of $A$, the information that is still needed to separate the classes is:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

Therefore, the information gained by branching on attribute $A$ is given by

$Gain(A) = Info(D) - Info_A(D)$

Example (continued from previous)

Consider 2 classes: Class P is buys_computer = “yes”. 
Class N is buys_computer = “no”

For some partition $D_j$ with $p$ examples of P and $n$ examples of N, let $Info(D_j)$ be written as $I(p, n)$.

Using the definition of $Info(D)$ from above, with 9 examples of P and 5 of N in $D$,

we have $Info(D) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

Now consider the first candidate partition, on age. The three age groups contain (2 P, 3 N), (4 P, 0 N) and (3 P, 2 N) examples respectively,

and so

$Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$

Therefore

$Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246$

Similarly, $Gain(income) = 0.029$, $Gain(student) = 0.151$ and $Gain(credit\_rating) = 0.048$.

So Gain(age) is the highest of these, and we split on age at the root.
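To check this arithmetic yourself, here is a small Python calculation; the per-group class counts are those of the buys_computer example above:

from math import log2

def I(p, n):
    # entropy of a partition with p examples of P and n examples of N
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

info_D = I(9, 5)                                # about 0.940

# (P, N) counts for the three age groups: youth, middle_aged, senior
age_groups = [(2, 3), (4, 0), (3, 2)]
size_D = sum(p + n for p, n in age_groups)      # 14
info_age = sum((p + n) / size_D * I(p, n) for p, n in age_groups)   # about 0.694

print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# prints roughly 0.94, 0.694 and 0.247 (the 0.246 above comes from rounding 0.940 - 0.694)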

5.2. Information Gain for continuous-valued attributes
Let attribute A be a continuous-valued attribute

To determine the best split point for A

Sort the values of  A in increasing order

Typically, the midpoint between each pair of adjacent values is considered as a possible split point:
$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$

Of these, the point with the minimum expected information requirement for A is selected as the split-point for A

Then split:

$D_1$ is the set of tuples in D satisfying A ≤ split-point, and $D_2$ is the set of tuples in D satisfying A > split-point

This method can also be used for
ordinal attributes with many values (where treating them  simply as
nominals may cause too much branching).
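A short Python sketch of this midpoint procedure follows; the age values and labels at the end are hypothetical, just to show the call.

from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_split_point(values, labels):
    # returns (split point, expected information requirement), minimising Info_A(D)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                   # equal adjacent values: no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2    # midpoint candidate
        left = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or info < best[1]:
            best = (split, info)
    return best

# hypothetical ages with buys_computer labels
print(best_split_point([23, 25, 32, 40, 45, 60], ["no", "no", "yes", "yes", "yes", "no"]))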

5.3. Gain Ratio
 

Used in C4.5 (a successor of ID3) to overcome bias towards attributes with many values
Normalises information gain

$SplitInfo_A(D)$ represents the potential information generated by splitting $D$ into $v$ partitions, corresponding to the $v$ outcomes of a test on $A$:

$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$

Now we define

$GainRatio(A) = Gain(A) / SplitInfo_A(D)$

 

Example (continued from previous):

Income splits $D$ into three partitions, of sizes 4 (low), 6 (medium) and 4 (high), so

$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$

$GainRatio(income) = 0.029 / 1.557 = 0.019$

The attribute with the maximum gain ratio is selected as the splitting attribute.
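The same calculation in a few lines of Python, using the partition sizes 4, 6 and 4 from the example:

from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return sum(-s / n * log2(s / n) for s in partition_sizes)

si = split_info([4, 6, 4])                  # about 1.557
print(round(si, 3), round(0.029 / si, 3))   # gain ratio for income, about 0.019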

5.4. Gini index
Gini index is used in CART and IBM IntelligentMiner decision tree learners

All attributes are assumed nominal

If a data set $D$ contains examples from $m$ classes, the Gini index, $Gini(D)$, measures the impurity of $D$ and is defined as

$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$, where $p_i$ is the relative frequency of class $C_i$ in $D$

If a data set $D$ is split on attribute $A$ into two subsets $D_1$ and $D_2$, the Gini index $Gini_A(D)$ is defined as the size-weighted sum of the impurity of each partition:

$Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$

To split a node in the tree:

Enumerate all the possible ways of splitting all the candidate attributes

The attribute split that provides the smallest $Gini_A(D)$ (i.e. the greatest purity) is chosen to split the node

Example (continued from previous)

D has 9 tuples in class buys_computer = “yes” and 5 in “no”.

Then $Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$

Now consider the attribute income.
Partition D into 10 objects in $D_1$ with income in {low, medium} and 4 objects in $D_2$ with income in {high}.

We have

$Gini_{income \in \{low,medium\}}(D) = \frac{10}{14}Gini(D_1) + \frac{4}{14}Gini(D_2) = 0.443$

Similarly, $Gini_{income \in \{low,high\}}(D)$ is 0.458 and $Gini_{income \in \{medium,high\}}(D)$ is 0.450.

Thus, we split on income with splitting subset {low, medium} (the other partition being {high}), since it has the lowest Gini index.

When attributes are continuous or ordinal, the method for selecting the midpoint between each pair of adjacent values (described earlier) may be used.
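Again, the arithmetic can be checked with a few lines of Python; the (yes, no) class counts below are those of the textbook example, with $D_1$ holding income in {low, medium} and $D_2$ holding income in {high}:

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

D, D1, D2 = (9, 5), (7, 3), (2, 2)          # (yes, no) counts
gini_D = gini(D)                                                   # about 0.459
gini_split = (sum(D1) * gini(D1) + sum(D2) * gini(D2)) / sum(D)    # about 0.443
print(round(gini_D, 3), round(gini_split, 3))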

5.5. Other Attribute Selection Methods
Of
course, researchers have experimented with many other attribute
selection methods that you might come across. Here are some of the most
well-known that you might like to look into further.

CHAID: a popular decision tree algorithm, measure based on χ2 test for independence

C-SEP: performs better than info. gain and gini index in certain cases

G-statistic: has a close approximation to χ2 distribution

MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
The best tree is the one that requires the fewest bits to both (1) encode the tree, and (2) encode the exceptions to the tree

Multivariate splits (partition based on multiple variable combinations), e.g.
CART: finds multivariate splits based on a linear combination of attributes.

6. Overfitting and tree pruning (Text 8.2.3)
If
you have consistently labelled data, and you allow attributes to be
re-used when growing the tree, and you stop growing when every training
tuple is classified, it is always possible to have a tree that describes
the training set with 100% accuracy. Such a tree typically has very
poor generalisation and is said to overfit the training data.

Overfitting: An induced tree may overfit the training data

Too many branches, some of which may reflect anomalies due to noise or outliers

Poor accuracy for unseen samples results, because the anomalies are modelled while new objects that are merely similar to the training data are not. 100% accuracy on the training data can therefore be a bad thing for accuracy on unseen data.

There are two typical approaches to avoid overfitting:

Prepruning: Stop tree construction early: do not split a node if this would result in the goodness measure falling below a threshold. However, it is difficult to choose an appropriate threshold.

Postpruning: Remove branches from a “fully grown” tree. This produces a sequence of progressively pruned trees. A set of data different from the training data is then used to decide which is the “best pruned tree”.
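If you would like to experiment with both styles outside Rattle, scikit-learn's decision tree learner (an assumption of this sketch, not a tool used in the text) supports pre-pruning thresholds and cost-complexity post-pruning:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)    # a stand-in labelled data set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via thresholds on depth and leaf size.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)

# Postpruning: grow fully, then prune by cost-complexity (larger ccp_alpha prunes more).
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, clf in [("prepruned", pre), ("postpruned", post)]:
    clf.fit(X_train, y_train)
    print(name, clf.get_n_leaves(), round(clf.score(X_test, y_test), 3))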

ACTION: Overfitting
is an important but rather qualitative, loosely-defined concept. If you
need more explanation of overfitting, refer to https://en.wikipedia.org/wiki/Overfitting

7. Enhancements to the Basic Algorithm (not in text)
These enhancements may be embedded within the decision tree induction algorithm

Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values

Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication

Continuous target variable
In this case the tree is called a regression tree, each leaf node is represented by the mean target value of the training objects that reach it, and the tree performs prediction (using that mean value) rather than classification.

Probabilistic classifier
Instead of majority voting to assign a class label to a leaf node, the proportion of training data objects of each class in the leaf node can be interpreted as the probability of each class, and these probabilities can be attached to the classification of unknown objects that fall into that node at use-time.

ACTION: There is a nice non-technical overview of decision trees that you might like to read if you need more:

Decision Trees Tutorial

8. Extracting Rules from Decision Trees (Text 8.4.2)
Decision
trees can become large and difficult to interpret. We look at how to
build a rule-based classifier by extracting IF-THEN rules from a
decision tree. In comparison with a decision tree, the IF-THEN rules may
be easier for humans to understand, particularly if the decision tree
is very large.

Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute–value pair along a path forms a conjunct in the rule antecedent; the leaf holds the class prediction

Example:

 

From the decision tree above, we can extract IF-THEN rules by tracing them from the root node to each leaf node:

Properties of extracted rules

Mutually exclusive
We cannot have rule conflicts here because no two rules will be triggered for the same tuple.
We have one rule per leaf, and any tuple can map to only one leaf.

Exhaustive:

There is one rule for each possible attribute–value combination, so this set of rules does not require a default rule.
The order of the rules is irrelevant.

* Rattle can generate rules from a trained decision tree.
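As an illustration (not Rattle's mechanism), the following Python sketch emits one IF-THEN rule per root-to-leaf path of a tree stored in the nested-dict form used by the build_tree sketch in Section 4; the example tree at the end has the shape of the buys_computer tree.

def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):                 # leaf: emit the completed rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {tree}"]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():        # one conjunct added per branch taken
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"age": {"youth": {"student": {"no": "no", "yes": "yes"}},
                "middle_aged": "yes",
                "senior": {"credit_rating": {"fair": "yes", "excellent": "no"}}}}
for rule in extract_rules(tree):
    print(rule)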

9. Practical Exercise: Decision trees
ACTION:
Do this practical exercise.  A video showing the use of the Rattle functions needed is provided to assist you.

 

Practical Exercise: Decision trees