COMP9417 Machine Learning and Data Mining Term 2, 2022
Acknowledgements
Material derived from slides for the book “Machine Learning” by T. Mitchell, McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides by A. Moore
http://www.cs.cmu.edu/~awm/tutorials
Material derived from slides by
http://www.cs.waikato.ac.nz/ml/weka
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
This lecture will enable you to describe two kinds of approach typically referred to in machine learning as nonparametric. The first is tree learning, and the second is known as “nearest neighbour”. Both of these can be applied to either classification or regression tasks. Following it you should be able to:
• describe the representation of tree-structured models
• reproduce the top-down decision tree induction (TDIDT) algorithm
• define node “impurity” measures for tree learning
• describe learning regression and model trees
• describe overfitting in terms of model complexity
• describe k-nearest neighbour for classification and regression
Introduction
What is nonparametric Machine Learning ?
Surprisingly difficult to define precisely: parametric vs. nonparametric
• Linear models for regression and classification
• Learning is finding good values for a fixed set of parameters
• The number of parameters is fixed by the features in the dataset (its dimensionality)
• Other types of models do not have parameters fixed
• Tree learning automatically selects which parameters to include or leave out
• Nearest Neighbour methods are “model-free” !
• Some more complex methods can also be viewed as nonparametric:
• Random Forests
• Deep Learning
• Probabilistic Programming • …
Introduction
Why use decision trees?
• Trees in some form are probably still the single most popular data mining tool
• Easy to understand
• Easy to implement
• Easy to use
• Computationally efficient (even on big data) to learn and run
• There are some drawbacks, though — e.g., can have high variance
• They do classification, i.e., predict a categorical output from categorical and/or real inputs
• Tree learning can also be used for predicting a real-valued output, i.e., they can do regression
Introduction
Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Introduction
Decision Tree for PlayTennis
Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes

Input: (Outlook = Rain, Temperature = Mild, Humidity = High, Wind = Weak) → Classification: Yes (via Outlook = Rain, then Wind = Weak).
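As an illustration, a tree like this can be represented and applied with a few lines of Python; the nested-dictionary representation and the function names below are my own, not from the lecture:

play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Internal nodes are {attribute: {value: subtree}}; leaves are class labels."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

example = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"}
print(classify(play_tennis_tree, example))   # prints Yes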
Introduction
Decision Trees
Decision tree representation:
• Each internal node tests an attribute (feature)
• Each branch corresponds to an attribute (feature) value or threshold
• Each leaf node assigns a classification value
How would we represent the following expressions ?
• ∧, ∨, XOR
• (A ∧ B) ∨ (C ∧ ¬D ∧ E)
Introduction
Decision Trees
X ∧ Y:
X = t:
| Y = t: true
| Y = f: no
X = f: no

X ∨ Y:
X = t: true
X = f:
| Y = t: true
| Y = f: no
Introduction
Decision Trees
X XOR Y:
X = t:
| Y = t: false
| Y = f: true
X = f:
| Y = t: true
| Y = f: false
Recall: linear classifiers are unable to represent XOR — a decision tree can, and the tree also provides a “compression” of the data.
So decision trees are, in some sense, non-linear classifiers (or regressors).
Introduction
Decision Trees
2 of 3 — a version of an “M of N” threshold function (true iff at least 2 of X, Y, Z are true):
X = t:
| Y = t: true
| Y = f:
| | Z = t: true
| | Z = f: false
X = f:
| Y = t:
| | Z = t: true
| | Z = f: false
| Y = f: false
In general, decision trees represent a disjunction of conjunctions of constraints, or tests, on the attribute values of instances.
The Basic Algorithm
Top-Down Induction of Decision Trees (TDIDT)
Main loop (also known as “recursive partitioning”):
1 Select A as the “best” decision attribute for the next node (“finding the best split”)
2 Assign A as decision attribute for node
3 For each value of A, create new descendant of node (“splitting the data”)
4 Sort training examples to leaf nodes
5 If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes
(A Python sketch of this recursion is given after the footnote below.)
Essentially, this is the top-level of the ID3 algorithm1 — the first efficient
symbolic Machine Learning algorithm.
ID3 / C4.5: Quinlan; CART: Breiman et al.
1See: Quinlan (1986).
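A minimal Python sketch of this recursion, assuming categorical features and a splitting criterion such as information gain passed in as the score argument; the function and parameter names are illustrative only, not taken from ID3 or C4.5 itself:

from collections import Counter

def tdidt(examples, labels, features, score):
    """Grow a decision tree top-down (recursive partitioning).
    examples: list of dicts mapping feature name -> value; labels: class labels;
    features: candidate features still available;
    score(parent_labels, child_label_lists): splitting criterion, higher is better."""
    # Stop: node is pure, or no features left -> leaf predicting the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def partition(feature):
        groups = {}
        for x, y in zip(examples, labels):
            groups.setdefault(x[feature], ([], []))
            groups[x[feature]][0].append(x)
            groups[x[feature]][1].append(y)
        return groups

    # 1. Select the "best" attribute for this node.
    best = max(features, key=lambda f: score(labels, [ys for _, ys in partition(f).values()]))
    # 2-4. One descendant per value of the chosen attribute; sort the examples down to it.
    remaining = [f for f in features if f != best]
    return {best: {value: tdidt(xs, ys, remaining, score)
                   for value, (xs, ys) in partition(best).items()}}

It would be called as tdidt(data, labels, feature_names, gain_function), where gain_function could be the information-gain helper sketched later in these notes.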
The Basic Algorithm
Which attribute is best?
A data set S contains [29+, 35-] examples. Two candidate attributes give the following class distributions in the child nodes:

Split on A1: t → [21+, 5-], f → [8+, 30-]
Split on A2: t → [18+, 33-], f → [11+, 2-]
Information Gain
[Figure: Entropy(S) as a function of the proportion p⊕ of positive examples (0.0 to 1.0) — zero for a pure sample and maximal at p⊕ = 0.5.]
Consider a 2-class distribution, where:
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Information Gain
Entropy measures the “impurity” of S
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖
A “pure” sample is one in which all examples are of the same class.
A decision tree node with low impurity means that the path from root to the node represents a combination of attribute-value tests with good classification accuracy.
Information Gain
Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)

Information theory: an optimal length code assigns −log2 p bits to a message having probability p.

So, the expected number of bits to encode ⊕ or ⊖ of a random member of S is:

p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

More generally, for a set of classes K: Entropy(S) = −Σ_{k∈K} p_k log2 p_k
Information Gain
Information Gain
• Gain(S, A) = expected reduction in entropy due to sorting on A
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

For the 64 examples above (p⊕ = 29/64, so Entropy(S) = 0.9936):
A1: t → [21+, 5-], f → [8+, 30-]
A2: t → [18+, 33-], f → [11+, 2-]
Information Gain
Information Gain
Gain(S, A1) = Entropy(S) − ( (|S_t|/|S|) Entropy(S_t) + (|S_f|/|S|) Entropy(S_f) )
            = 0.9936 − ( (26/64)(−(21/26) log2(21/26) − (5/26) log2(5/26))
                        + (38/64)(−(8/38) log2(8/38) − (30/38) log2(30/38)) )
            = 0.9936 − (0.2869 + 0.4408)
            = 0.2658
Information Gain
Information Gain
Gain(S, A2) = 0.9936 − (0.7464 + 0.1258) = 0.1214
Information Gain
Information Gain
So we choose A1, since it gives a larger expected reduction in entropy.
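These gain calculations can be reproduced with a short Python sketch. The helper names below are my own, and the printed values differ from the slide figures only in the last decimal place, because the slide rounds intermediate results:

import math

def entropy(pos, neg):
    """Two-class entropy of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def gain(parent, children):
    """Information gain of a split; `parent` and each child are (pos, neg) counts."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

print(entropy(29, 35))                         # ≈ 0.9936
print(gain((29, 35), [(21, 5), (8, 30)]))      # Gain(S, A1) ≈ 0.266
print(gain((29, 35), [(18, 33), (11, 2)]))     # Gain(S, A2) ≈ 0.121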
Information Gain
Attribute selection – impurity measures more generally
Estimate the probability of class k at node m of the tree as p̂_mk = |S_mk| / |S_m|.

Classify at node m by predicting the majority class, k(m) = argmax_k p̂_mk.

Misclassification error: 1 − p̂_mk(m)

Entropy for K class values: −Σ_{k=1}^{K} p̂_mk log p̂_mk

CART (Breiman et al. (1984)) uses the “Gini index”: Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)
Information Gain
Attribute selection – impurity measures more generally
Why not just use accuracy, or misclassification error ?
In practice, these are not found to work as well as the other measures:
Entropy and Gini index are more sensitive to changes in the node probabilities than misclassification error.
Entropy and Gini index are differentiable, but misclassification error is not (Hastie et al. (2009)).
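A small Python sketch comparing the three impurity measures on a two-class node as the positive-class probability p varies, illustrating the sensitivity point above (the function names are my own):

import math

def misclassification(p):
    """Error of always predicting the majority class when P(+) = p."""
    return min(p, 1 - p)

def entropy2(p):
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gini(p):
    return 2 * p * (1 - p)        # sum over both classes of p_k (1 - p_k)

for p in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(p, misclassification(p), round(entropy2(p), 3), round(gini(p), 3))
# Entropy and Gini fall smoothly as the node gets purer, while misclassification
# error changes only linearly and is not differentiable at p = 0.5.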
Tree Learning More Generally
TDIDT search space is all trees !
Applies greedy search to maximise information gain . . .
[Figure: greedy search through the space of possible trees — a 2-node tree is extended to a 3-node tree (splitting on, e.g., A2) and then to a 4-node tree (e.g., adding A4), choosing the highest-gain attribute at each step.]
Tree Learning More Generally
Inductive Bias of TDIDT
Note hypothesis space H is complete (contains all finite discrete-valued
functions w.r.t attributes)
• So H can represent the power set of instances X !
• Unbiased ? Not really … so what is the inductive bias ?
(Learning theory: H is the set of all possible models.)
• Preference for short trees, and for those with high information gain
attributes near the root
• Inductive bias is a preference for some hypotheses, rather than a
restriction of hypothesis space H
• An incomplete search of a complete hypothesis space versus a
complete search of an incomplete hypothesis space
• Occam’s razor: prefer the shortest hypothesis that fits the data
• Inductive bias: approximately, “prefer shortest tree”
Overfitting and How To Avoid It
Why does overfitting occur?
• Greedy search can make mistakes. It can end up in local minima — so a sub-optimal choice earlier might result in a better solution later (i.e., pick a test whose information gain is less than the best one)
• But there is also another kind of problem. Training error is an optimistic estimate of the true error of the model, and this optimism increases as the training error decreases
• Suppose we could quantify the “optimism” of a learning algorithm . . .
• Say we have two models h1 and h2 with training errors e1 and e2 and
optimism o1 and o2 (True error = Estimated error + Optimism)
• Let the true error of each be E1 = e1 + o1 and E2 = e2 + o2
• If e2 < e1 but o2 > o1, it is possible that E2 > E1 — i.e., h2, the model that looks better on the training data, is the one that overfits the training data
• So, a search method based purely on training data estimates may end up overfitting the training data
Overfitting and How To Avoid It
Overfitting in Decision Tree Learning
[Figure: accuracy on the training data and on the test data as a function of tree size (number of nodes, 0 to 100) — training accuracy keeps increasing while test accuracy peaks and then declines as the tree grows.]

One remedy: grow the full tree, then prune it back bottom-up (from the leaves towards the root), considering each decision node D for replacement by a leaf node, etc.
Overfitting and How To Avoid It
Avoiding Overfitting
How can we avoid overfitting?
For tree learning the answer is pruning
Two main approaches
Key point: the pessimistic error estimate at a node is always at least its observed training-set error.
• pre-pruning: stop growing when further data splits are not useful
• post-pruning: grow full tree, then remove sub-trees which may be overfitting
Post-pruning more common:
• can prune using a pessimistic estimate of error
• measure training set error
• adjust by a factor dependent on “confidence” hyperparameter
• or can use cross-validation to estimate error during pruning
Worked example (pessimistic error estimates). Consider a decision node D whose test has three outcomes (low, med, high) leading to three leaves covering 6, 2 and 6 training examples respectively.

For node D itself (treated as a single leaf): training set error f = 0.36, pessimistic estimate e = 0.46.

Sub-tree measures — pessimistic estimates at the three leaves:
low: e = 0.47, med: e = 0.72, high: e = 0.47

Combined proportionally over the 14 examples: (6/14)(0.47) + (2/14)(0.72) + (6/14)(0.47) ≈ 0.51.

Since 0.51 > 0.46, we replace the subtree at D by a leaf.
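A minimal Python sketch of the subtree-versus-leaf decision, using the numbers from the worked example above (the per-node pessimistic estimates are taken as given rather than derived):

def prune_decision(node_estimate, child_estimates, child_counts):
    """Decide whether to replace a subtree by a leaf.
    node_estimate: pessimistic error estimate if the node became a leaf;
    child_estimates / child_counts: the children's estimates and example counts.
    Returns True if the subtree should be pruned."""
    total = sum(child_counts)
    combined = sum(n / total * e for e, n in zip(child_estimates, child_counts))
    return combined > node_estimate

# Node D: leaf estimate 0.46 vs children 0.47 / 0.72 / 0.47 covering 6, 2 and 6
# examples -> combined estimate ≈ 0.51, so the subtree is pruned.
print(prune_decision(0.46, [0.47, 0.72, 0.47], [6, 2, 6]))   # True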
Overfitting and How To Avoid It
Avoiding Overfitting – Post-pruning
[Figure: accuracy on training data, on test data, and on test data during pruning, as a function of tree size (number of nodes).]

Effect of reduced-error pruning on tree learning to avoid overfitting.
Learning Non-linear Regression Models with Trees
Regression trees (“numeric prediction”)
• Differences to decision trees:
• Splitting criterion: minimizing intra-subset variation
• Pruning criterion: based on numeric error measure
• Leaf node predicts the average class value of the training instances reaching it
• Can approximate piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees
Learning Non-linear Regression Models with Trees
A Regression Tree and its Prediction Surface
[Figure: a regression tree making binary splits of the input space into regions R_i, and its prediction surface — each leaf predicts a constant, e.g. the mean of the y values in R_i.]
“Elements of Statistical Learning” Hastie, Tibshirani & Friedman (2001)
Learning Non-linear Regression Models with Trees
Regression Tree on sine dataset
[Figure: a regression tree fitted to 1-dimensional data generated as f(x) = sin x plus “noise” — the tree prediction is a piecewise constant approximation of the sine curve.]
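A figure of this kind can be reproduced with scikit-learn's DecisionTreeRegressor; this is only an illustration under assumed settings — the slide's exact data and tree parameters are not given:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)   # 1-dimensional inputs
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)      # f(x) = sin x plus noise

tree = DecisionTreeRegressor(max_depth=3)    # small depth -> few piecewise-constant segments
tree.fit(X, y)
print(tree.predict([[np.pi / 2], [np.pi], [3 * np.pi / 2]]))  # roughly [1, 0, -1]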
Learning Non-linear Regression Models with Trees
Regression Tree on CPU dataset
Learning Non-linear Regression Models with Trees
Tree learning as variance reduction
• Variance of a Boolean (i.e., Bernoulli) variable with success probability p is p(1 − p).
• Can interpret goal of tree learning as minimising the class variance in the leaves.
• In regression problems we can define the variance in the usual way:

Var(Y) = (1/|Y|) Σ_{y∈Y} (y − ȳ)²

If a split partitions the set of target values Y into mutually exclusive sets {Y1, . . . , Yl}, the weighted average variance is then

Var({Y1, . . . , Yl}) = Σ_{j=1}^{l} (|Yj|/|Y|) Var(Yj) = … = (1/|Y|) Σ_{y∈Y} y² − Σ_{j=1}^{l} (|Yj|/|Y|) ȳj²

The first term is constant for a given set Y, and so we want to maximise the weighted average of squared means in the children.
Learning Non-linear Regression Models with Trees
Learning a regression tree
Imagine you are a collector of vintage Hammond tonewheel organs. You have been monitoring an online auction site, from which you collected some data about interesting transactions:
#   Model  Condition  Leslie  Price
1.  B3     excellent  no      4513
2.  T202   fair       yes      625
3.  A100   good       no      1051
4.  T202   good       no       270
5.  M102   good       yes      870
6.  A100   excellent  no      1770
7.  T202   fair       no        99
8.  A100   good       yes     1900
9.  E112   fair       no        77
Learning Non-linear Regression Models with Trees
Learning a regression tree
From this data, you want to construct a regression tree that will help you determine a reasonable price for your next purchase.
There are three features, hence three possible splits:
Model = [A100, B3, E112, M102, T202]
[1051, 1770, 1900][4513][77][870][99, 270, 625]
Condition = [excellent, good, fair]
[1770, 4513][270, 870, 1051, 1900][77, 99, 625]
Leslie = [yes, no]
[625, 870, 1900] [77, 99, 270, 1051, 1770, 4513]
The means of the first split are 1574, 4513, 77, 870 and 331, and the weighted average of squared means is 3.21 · 106. The means of the second split are 3142, 1023 and 267, with weighted average of squared means 2.68 · 106; for the third split the means are 1132 and 1297, with weighted average of squared means 1.55 · 106. We therefore branch on Model at the top level. This gives us three single-instance leaves, as well as three A100s and three T202s.
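The three weighted averages of squared means quoted above can be checked with a few lines of Python (the helper name is my own):

def weighted_avg_squared_means(groups):
    """groups: list of lists of target values, one list per child of a candidate split."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * (sum(g) / len(g)) ** 2 for g in groups if g)

model = [[1051, 1770, 1900], [4513], [77], [870], [99, 270, 625]]
condition = [[1770, 4513], [270, 870, 1051, 1900], [77, 99, 625]]
leslie = [[625, 870, 1900], [77, 99, 270, 1051, 1770, 4513]]

for name, split in [("Model", model), ("Condition", condition), ("Leslie", leslie)]:
    print(name, round(weighted_avg_squared_means(split) / 1e6, 2))
# Model ≈ 3.21, Condition ≈ 2.68, Leslie ≈ 1.55 (in units of 10^6), so Model is chosen.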
Learning Non-linear Regression Models with Trees
Learning a regression tree
For the A100s we obtain the following splits:
Condition = [excellent, good, fair] [1770] [1051, 1900] []
Leslie = [yes, no] [1900] [1051, 1770]
Without going through the calculations we can see that the second split results in less variance (to handle the empty child, it is customary to set its variance equal to that of the parent). For the T202s the splits are as follows:
Condition = [excellent, good, fair] [] [270] [99, 625]
Leslie = [yes, no] [625] [99, 270]
Again we see that splitting on Leslie gives tighter clusters of values. The learned regression tree is depicted on the next slide.
Learning Non-linear Regression Models with Trees
A regression tree
Model = A100:
| Leslie = yes: f̂(x) = 1900
| Leslie = no: f̂(x) = 1411
Model = B3: f̂(x) = 4513
Model = E112: f̂(x) = 77
Model = M102: f̂(x) = 870
Model = T202:
| Leslie = yes: f̂(x) = 625
| Leslie = no: f̂(x) = 185
A regression tree learned from the Hammond organ dataset.
Learning Non-linear Regression Models with Trees
Model trees
(“local” regression models within regions of the input space)
• Like regression trees but with linear regression functions at each node
• Linear regression applied to instances that reach a node, after the full tree has been built
• Only a subset of the attributes is used for LR
• Attributes occurring in subtree (+maybe attributes occurring in path
to the root)
• Fast: overhead for Linear Regression (LR) computation is not large, because usually only a small subset of attributes is used in the tree
Learning Non-linear Regression Models with Trees
Two uses of features (Flach (2012))
Suppose we want to approximate y = cos πx on the interval −1 ≤ x ≤ 1. A linear approximation is not much use here, since the best fit would be y = 0.

However, if we split the x-axis into two intervals −1 ≤ x < 0 and 0 ≤ x ≤ 1, we could find reasonable linear approximations on each interval (on the right interval, for example, ŷ = −2x + 1).

We can achieve this by using x both as a splitting feature and as a regression variable (next slide): a small model tree gives a piecewise linear approximation of a non-linear target.
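A rough Python sketch of this idea, fitting a separate least-squares line on each side of x = 0 with numpy's polyfit; the fitted coefficients come out close to, but not exactly, lines such as ŷ = −2x + 1:

import numpy as np

x = np.linspace(-1, 1, 201)
y = np.cos(np.pi * x)

left, right = x < 0, x >= 0                    # x used once as a splitting feature ...
fit_left = np.polyfit(x[left], y[left], 1)     # ... and again as a regression variable
fit_right = np.polyfit(x[right], y[right], 1)  # each fit returns (slope, intercept)

def model_tree_predict(x0):
    slope, intercept = fit_left if x0 < 0 else fit_right
    return slope * x0 + intercept

print(fit_right)                  # roughly slope -2.4, intercept 1.2
print(model_tree_predict(0.5))    # close to cos(pi/2) = 0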
Learning Non-linear Regression Models with Trees
Model Tree on CPU dataset
Finding thresholds for numeric features

Problem: for a test of the form “feature ≥ t”, which threshold t should we pick?

Sort the n examples by their value on the feature, e.g. values 11, 14, 18, 21, 27, 32, 36 with classes ⊖ ⊖ ⊕ ⊕ ⊕ ⊖ ⊕. Candidate thresholds lie between adjacent values — e.g. the split “< 18 vs. ≥ 18”. Each candidate split is scored with the usual impurity measure; with d features and n examples there are O(n d) candidate splits to evaluate at a node (a sketch of this procedure follows below).
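A small Python sketch of threshold selection for a single numeric feature, scoring each candidate midpoint by information gain on the example values listed above (the helper names are my own):

import math

def entropy_of(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Return (gain, threshold) for the best midpoint between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    base = entropy_of([c for _, c in pairs])
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                # identical values: no split here
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v < t]
        right = [c for v, c in pairs if v >= t]
        gain = base - (len(left) * entropy_of(left) + len(right) * entropy_of(right)) / len(pairs)
        if best is None or gain > best[0]:
            best = (gain, t)
    return best

print(best_threshold([11, 14, 18, 21, 27, 32, 36],
                     ["-", "-", "+", "+", "+", "-", "+"]))
# ≈ (0.47, 16.0): the same partition as the "< 18 vs. >= 18" split described above.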
Nearest neighbour classification
Nearest neighbour classification
" model - free "
• Related to the simplest form of learning: rote learning or memorization
• Training instances are searched for instance that most closely resembles new or que
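A minimal k-nearest-neighbour classification sketch in Python, using Euclidean distance and a majority vote over the k closest training instances; this is purely illustrative, with made-up data:

import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of (feature_vector, label) pairs; query: a feature vector."""
    # Search the stored training instances for those that most closely resemble the query.
    neighbours = sorted(training, key=lambda xy: math.dist(query, xy[0]))[:k]
    # Majority vote over the k nearest neighbours.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

training = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"), ((5.0, 5.0), "no"), ((5.5, 4.5), "no")]
print(knn_classify((1.1, 0.9), training, k=3))   # prints yes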