Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 4: Information-based Learning Sections 4.4, 4.5
John D. Kelleher, Brian Mac Namee, and Aoife D’Arcy
Alternative Feature Selection Metrics
Handling Continuous Descriptive Features
Predicting Continuous Targets
Noisy Data, Overfitting and Tree Pruning
Model Ensembles
Boosting
Bagging
Alternative Feature Selection Metrics
Entropy-based information gain prefers features with many values.
One way of addressing this issue is to use information gain ratio which is computed by dividing the information gain of a feature by the amount of information used to determine the value of the feature:
\[
GR(d, \mathcal{D}) = \frac{IG(d, \mathcal{D})}{-\sum_{l \in levels(d)} P(d = l) \times \log_2\!\big(P(d = l)\big)} \tag{1}
\]
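As a concrete illustration, here is a minimal Python sketch of this computation (the helper names entropy, information_gain, and gain_ratio are illustrative, not taken from the book; feature and target values are assumed to be plain Python lists).

from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy (in bits) of a list of discrete values.
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature_vals, target_vals):
    # Entropy of the target minus the entropy remaining after splitting on the feature.
    n = len(target_vals)
    remainder = 0.0
    for level in set(feature_vals):
        part = [t for f, t in zip(feature_vals, target_vals) if f == level]
        remainder += (len(part) / n) * entropy(part)
    return entropy(target_vals) - remainder

def gain_ratio(feature_vals, target_vals):
    # Equation (1): information gain divided by the entropy of the feature itself.
    return information_gain(feature_vals, target_vals) / entropy(feature_vals)

Applied to the vegetation dataset used on the following slides, these helpers reproduce the values shown there, for example GR(STREAM, D) = 0.3060 / 0.9852 ≈ 0.3106.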
IG (STREAM, D) = 0.3060
IG (SLOPE, D) = 0.5774
IG (ELEVATION, D) = 0.8774
H (STREAM, D) = − Σ l ∈ {’true’, ’false’} P(STREAM = l) × log2(P(STREAM = l))
             = − ((4/7 × log2(4/7)) + (3/7 × log2(3/7))) = 0.9852 bits

H (SLOPE, D) = − Σ l ∈ {’flat’, ’moderate’, ’steep’} P(SLOPE = l) × log2(P(SLOPE = l))
             = − ((1/7 × log2(1/7)) + (1/7 × log2(1/7)) + (5/7 × log2(5/7))) = 1.1488 bits

H (ELEVATION, D) = − Σ l ∈ {’low’, ’medium’, ’high’, ’highest’} P(ELEVATION = l) × log2(P(ELEVATION = l))
                 = − ((1/7 × log2(1/7)) + (2/7 × log2(2/7)) + (3/7 × log2(3/7)) + (1/7 × log2(1/7))) = 1.8424 bits
GR (STREAM, D) = 0.3060 / 0.9852 = 0.3106
GR (SLOPE, D) = 0.5774 / 1.1488 = 0.5026
GR (ELEVATION, D) = 0.8774 / 1.8424 = 0.4762
Figure: The vegetation classification decision tree generated using information gain ratio. (Branch labels in the tree: flat/moderate/steep, low/medium/high, and true/false.)
Another commonly used measure of impurity is the Gini index:
\[
Gini(t, \mathcal{D}) = 1 - \sum_{l \in levels(t)} P(t = l)^2 \tag{2}
\]
The Gini index can be thought of as calculating how often you would misclassify an instance in the dataset if you classified it based on the distribution of classifications in the dataset.
Information gain can be calculated using the Gini index by replacing the entropy measure with the Gini index.
Gini (VEGETATION, D) = 1 − Σ l ∈ {’chapparal’, ’riparian’, ’conifer’} P(VEGETATION = l)²
                     = 1 − ((3/7)² + (2/7)² + (2/7)²) = 0.6531
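A matching sketch for the Gini index (again with illustrative helper names) simply swaps the impurity measure; computing information gain with it works exactly as with entropy.

from collections import Counter

def gini_index(values):
    # Equation (2): Gini impurity of a list of discrete values.
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def gini_gain(feature_vals, target_vals):
    # Information gain computed with the Gini index in place of entropy.
    n = len(target_vals)
    remainder = 0.0
    for level in set(feature_vals):
        part = [t for f, t in zip(feature_vals, target_vals) if f == level]
        remainder += (len(part) / n) * gini_index(part)
    return gini_index(target_vals) - remainder

# For the seven-instance vegetation dataset, gini_index of the target is
# 1 - ((3/7)**2 + (2/7)**2 + (2/7)**2) = 0.6531, matching the worked example above.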
Table: Partition sets (Part.), partition Gini index, remainder (Rem.), and information gain (Info. Gain) by feature.

Split by
Feature     Level        Part.  Instances               Gini Index  Rem.     Info. Gain
STREAM      'true'       D1     d2, d3, d6, d7          0.625       0.5476   0.1054
            'false'      D2     d1, d4, d5              0.4444
SLOPE       'flat'       D3     d5                      0           0.4      0.2531
            'moderate'   D4     d2                      0
            'steep'      D5     d1, d3, d4, d6, d7      0.56
ELEVATION   'low'        D6     d2                      0           0.3333   0.3198
            'medium'     D7     d3, d4                  0.5
            'high'       D8     d1, d5, d7              0.4444
            'highest'    D9     d6                      0
Handling Continuous Descriptive Features
The easiest way to handle continuous valued descriptive features is to turn them into boolean features by defining a threshold and using this threshold to partition the instances based on their value of the continuous descriptive feature.
How do we set the threshold?
1 The instances in the dataset are sorted according to the continuous feature values.
2 The adjacent instances in the ordering that have different classifications are then selected as possible threshold points.
3 The optimal threshold is found by computing the information gain for each of these classification transition boundaries and selecting the boundary with the highest information gain as the threshold.
Once a threshold has been set, the dynamically created boolean feature can compete with the other categorical features for selection as the splitting feature at that node.
This process can be repeated at each node as the tree grows.
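The procedure can be sketched as follows. This is a simplified illustration: the name best_threshold and the use of midpoints between adjacent values are assumptions consistent with the candidate thresholds shown on the next slides, and entropy duplicates the helper from the earlier sketch so the code is self-contained.

from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def best_threshold(feature_vals, target_vals):
    # Returns (threshold, information gain) for the best boolean split "feature >= threshold".
    pairs = sorted(zip(feature_vals, target_vals))            # 1. sort by the continuous feature
    n = len(pairs)
    h_total = entropy([t for _, t in pairs])
    best = (None, -1.0)
    for (v1, t1), (v2, t2) in zip(pairs, pairs[1:]):
        if t1 == t2:                                          # 2. keep only class-transition boundaries
            continue
        threshold = (v1 + v2) / 2                             # midpoint between the adjacent values
        below = [t for v, t in pairs if v < threshold]
        above = [t for v, t in pairs if v >= threshold]
        remainder = (len(below) / n) * entropy(below) + (len(above) / n) * entropy(above)
        gain = h_total - remainder
        if gain > best[1]:                                    # 3. keep the highest-gain boundary
            best = (threshold, gain)
    return best

On the ELEVATION values in the following table, the candidate thresholds are 750, 1,350, 2,250, and 4,175, and the best of these is 4,175, with an information gain of roughly 0.8631 bits.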
Table: Dataset for predicting the vegetation in an area with a continuous ELEVATION feature (measured in feet).

ID  STREAM  SLOPE     ELEVATION  VEGETATION
1   false   steep     3,900      chapparal
2   true    moderate  300        riparian
3   true    steep     1,500      riparian
4   false   steep     1,200      chapparal
5   false   flat      4,450      conifer
6   true    steep     5,000      conifer
7   true    steep     3,000      chapparal
Table: Dataset for predicting the vegetation in an area sorted by the continuous ELEVATION feature.

ID  STREAM  SLOPE     ELEVATION  VEGETATION
2   true    moderate  300        riparian
4   false   steep     1,200      chapparal
3   true    steep     1,500      riparian
7   true    steep     3,000      chapparal
1   false   steep     3,900      chapparal
5   false   flat      4,450      conifer
6   true    steep     5,000      conifer
Table: Partition sets (Part.), partition entropy, remainder (Rem.), and information gain (Info. Gain) for the candidate ELEVATION thresholds: ≥750, ≥1,350, ≥2,250 and ≥4,175.

Split by
Threshold  Part.  Instances                 Partition Entropy  Rem.     Info. Gain
≥750       D1     d2                        0.0                1.2507   0.3060
           D2     d4, d3, d7, d1, d5, d6    1.4591
≥1,350     D3     d2, d4                    1.0                1.3728   0.1839
           D4     d3, d7, d1, d5, d6        1.5219
≥2,250     D5     d2, d4, d3                0.9183             0.9650   0.5917
           D6     d7, d1, d5, d6            1.0
≥4,175     D7     d2, d4, d3, d7, d1        0.9710             0.6935   0.8631
           D8     d5, d6                    0.0
Figure: The vegetation classification decision tree after the dataset has been split using ELEVATION ≥ 4,175.
Figure: The decision tree that would be generated for the vegetation classification dataset listed in Table 3 [17] using information gain. (It contains splits on ELEVATION ≥ 4,175, STREAM, and ELEVATION ≥ 2,250.)
Predicting Continuous Targets
Regression trees are constructed so as to reduce the variance in the set of training examples at each of the leaf nodes in the tree.
We can do this by adapting the ID3 algorithm to use a measure of variance rather than a measure of classification impurity (entropy) when selecting the best attribute.
The impurity (variance) at a node can be calculated using the following equation:
\[
var(t, \mathcal{D}) = \frac{\sum_{i=1}^{n} \left(t_i - \bar{t}\right)^2}{n - 1} \tag{3}
\]
We select the feature to split on at a node by selecting the feature that minimizes the weighted variance across the resulting partitions:
\[
\mathbf{d}[best] = \operatorname*{arg\,min}_{d \in \mathbf{d}} \sum_{l \in levels(d)} \frac{|\mathcal{D}_{d=l}|}{|\mathcal{D}|} \times var(t, \mathcal{D}_{d=l}) \tag{4}
\]
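A minimal sketch of this selection rule (illustrative names; instances are assumed to be dictionaries mapping feature names to values):

from statistics import variance   # sample variance, i.e. divides by n - 1 as in Equation (3)

def weighted_variance(feature_vals, target_vals):
    # Equation (4): weighted sum of target variances across the partitions induced by the feature.
    n = len(target_vals)
    total = 0.0
    for level in set(feature_vals):
        part = [t for f, t in zip(feature_vals, target_vals) if f == level]
        total += (len(part) / n) * (variance(part) if len(part) > 1 else 0.0)
    return total

def select_feature(dataset, feature_names, target_name):
    # Pick the descriptive feature with the lowest weighted target variance.
    targets = [row[target_name] for row in dataset]
    return min(feature_names,
               key=lambda f: weighted_variance([row[f] for row in dataset], targets))

On the bike rentals dataset a few slides ahead, this selects SEASON (weighted variance ≈ 1,379,331) ahead of WORK DAY (≈ 2,551,813).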
Figure: (a) A set of instances on a continuous number line; (b), (c), and (d) depict some of the potential groupings that could be applied to these instances. (The panels are labelled Target, Underfitting, Goldilocks, and Overfitting.)
Table: A dataset listing the number of bike rentals per day.

ID  SEASON  WORK DAY  RENTALS
1   winter  false     800
2   winter  false     826
3   winter  true      900
4   spring  false     2,100
5   spring  true      4,740
6   spring  true      4,900
7   summer  false     3,000
8   summer  true      5,800
9   summer  true      6,200
10  autumn  false     2,910
11  autumn  false     2,880
12  autumn  true      2,820
Table: The partitioning of the dataset in Table 5 [25] based on the SEASON and WORK DAY features and the computation of the weighted variance for each partitioning.

Split by                                                                              Weighted
Feature    Level      Part.  Instances                  |D_d=l|/|D|  var(t, D_d=l)    Variance
SEASON     'winter'   D1     d1, d2, d3                 0.25         2,692            1,379,331.33
           'spring'   D2     d4, d5, d6                 0.25         2,472,533.33
           'summer'   D3     d7, d8, d9                 0.25         3,040,000
           'autumn'   D4     d10, d11, d12              0.25         2,100
WORK DAY   'true'     D5     d3, d5, d6, d8, d9, d12    0.50         4,026,346.67     2,551,813.33
           'false'    D6     d1, d2, d4, d7, d10, d11   0.50         1,077,280
Figure: The decision tree resulting from splitting the data in Table 5 [25] using the feature SEASON (with branches for winter, spring, summer, and autumn).
Figure: The final decision tree induced from the dataset in Table 5 [25]. To illustrate how the tree generates predictions, this tree lists the instances that ended up at each leaf node and the prediction (PRED.) made by each leaf node.
Noisy Data, Overfitting and Tree Pruning
In the case of a decision tree, over-fitting involves splitting the data on an irrelevant feature.
The likelihood of over-fitting occurring increases as a tree gets deeper because the resulting classifications are based on smaller and smaller subsets as the dataset is partitioned after each feature test in the path.
Pre-pruning: stop the recursive partitioning early. Pre-pruning is also known as forward pruning.
Common Pre-pruning Approaches
1 early stopping
2 χ2 pruning
Post-pruning: allow the algorithm to grow the tree as much as it likes and then prune the tree of the branches that cause over-fitting.
Common Post-pruning Approach
Using the validation set, evaluate the prediction accuracy achieved by both the fully grown tree and a pruned copy of the tree. If the pruned copy of the tree performs no worse than the fully grown tree, the node is a candidate for pruning (this is known as reduced error pruning).
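The rule can be sketched as follows. This is a simplified illustration rather than the book's pseudocode, with an assumed Node class (children maps feature levels to subtrees, prediction is the majority target of the training instances that reached the node) and instances represented as dictionaries.

class Node:
    def __init__(self, feature=None, children=None, prediction=None):
        self.feature = feature            # feature tested at this node (None for a leaf)
        self.children = children or {}    # feature level -> child Node
        self.prediction = prediction      # majority target level at this node

    def is_leaf(self):
        return not self.children

def classify(node, instance):
    if node.is_leaf():
        return node.prediction
    child = node.children.get(instance[node.feature])
    return classify(child, instance) if child is not None else node.prediction

def reduced_error_prune(node, validation_set, target):
    # Bottom-up: replace a subtree with a leaf whenever the leaf performs no worse
    # on the validation instances that reach that subtree.
    if node.is_leaf() or not validation_set:
        return node
    for level, child in node.children.items():
        reaching = [inst for inst in validation_set if inst[node.feature] == level]
        node.children[level] = reduced_error_prune(child, reaching, target)
    subtree_errors = sum(1 for inst in validation_set if classify(node, inst) != inst[target])
    leaf_errors = sum(1 for inst in validation_set if node.prediction != inst[target])
    return Node(prediction=node.prediction) if leaf_errors <= subtree_errors else node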
Figure: Misclassification rate on the training set and the validation set plotted against training iteration.
Table: An example validation set for the post-operative patient routing task.

ID  CORE-TEMP  STABLE-TEMP  GENDER  DECISION
1   high       true
2   low        true         female  icu
3   high       false        female  icu
4   high       false        male    icu
5   low        false        female  icu
6   low        true         male    icu
Figure: The decision tree for the post-operative patient routing task (it includes tests on GENDER and STABLE-TEMP).
Figure: The iterations of reduced error pruning for the decision tree in Figure 7 [34] using the validation set in Table 7 [33]. The subtree that is being considered for pruning in each iteration is highlighted in black. The prediction returned by each non-leaf node is listed in square brackets. The error rate for each node is given in round brackets.
Advantages of pruning:
Smaller trees are easier to interpret
Increased generalization accuracy when there is noise in the training data (noise dampening).
Model Ensembles
Rather than creating a single model, ensemble methods generate a set of models and then make predictions by aggregating the outputs of these models.
A prediction model that is composed of a set of models is called a model ensemble.
In order for this approach to work the models that are in the ensemble must be different from each other.
There are two standard approaches to creating ensembles:
1 boosting
2 bagging.
Boosting
Boosting works by iteratively creating models and adding them to the ensemble.
The iteration stops when a predefined number of models have been added.
When we use boosting, each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified.
This is done by incrementally adapting the dataset used to train the models. To do this we use a weighted dataset.
Weighted Dataset
Each instance has an associated weight wi ≥ 0, initially set to 1/n, where n is the number of instances in the dataset.
After each model is added to the ensemble it is tested on the training data and the weights of the instances the model gets correct are decreased and the weights of the instances the model gets incorrect are increased.
These weights are used as a distribution over which the dataset is sampled to create a replicated training set, where the number of times an instance is replicated is proportional to its weight.
During each training iteration the algorithm (a code sketch of these updates follows this list):
1 Induces a model and calculates the total error, ε, by summing the weights of the training instances for which the predictions made by the model are incorrect.
2 Increases the weights for the instances misclassified using
\[ \mathbf{w}[i] \leftarrow \mathbf{w}[i] \times \frac{1}{2 \times \epsilon} \tag{5} \]
3 Decreases the weights for the instances correctly classified using
\[ \mathbf{w}[i] \leftarrow \mathbf{w}[i] \times \frac{1}{2 \times (1 - \epsilon)} \tag{6} \]
4 Calculates a confidence factor, α, for the model such that α increases as ε decreases:
\[ \alpha = \frac{1}{2} \times \log_e\!\left(\frac{1 - \epsilon}{\epsilon}\right) \tag{7} \]
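A sketch of this bookkeeping for a single iteration is given below (update_weights is an illustrative name; predictions and targets are assumed to be aligned lists, and the surrounding loop that samples the replicated training set and induces each model is omitted).

from math import log

def update_weights(weights, predictions, targets):
    # Total error: the summed weight of the misclassified instances (assumes weights sum to 1).
    epsilon = sum(w for w, p, t in zip(weights, predictions, targets) if p != t)
    # Equations (5) and (6): increase misclassified weights, decrease correct ones.
    # The updated weights again sum to 1 (half on the misclassified instances, half on the rest).
    new_weights = [w / (2 * epsilon) if p != t else w / (2 * (1 - epsilon))
                   for w, p, t in zip(weights, predictions, targets)]
    # Equation (7): the model's confidence factor grows as its error shrinks.
    alpha = 0.5 * log((1 - epsilon) / epsilon)
    return new_weights, alpha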
Once the set of models has been created, the ensemble makes predictions using a weighted aggregate of the predictions made by the individual models.
The weights used in this aggregation are simply the confidence factors associated with each model.
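A minimal sketch of that weighted vote for a categorical target (ensemble_predict and the models' predict method are assumptions; the alphas are the confidence factors from Equation (7)):

from collections import defaultdict

def ensemble_predict(models, alphas, instance):
    # Each model votes for a target level with weight equal to its confidence factor.
    votes = defaultdict(float)
    for model, alpha in zip(models, alphas):
        votes[model.predict(instance)] += alpha
    return max(votes, key=votes.get)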
Bagging
When we use bagging (or bootstrap aggregating), each model in the ensemble is trained on a random sample of the dataset known as a bootstrap sample.
Each random sample is the same size as the dataset and sampling with replacement is used.
Consequently, every bootstrap sample will be missing some of the instances from the dataset, so each bootstrap sample will be different, and this means that models trained on different bootstrap samples will also be different.
When bagging is used with decision trees each bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset. This is known as subspace sampling.
The combination of bagging, subspace sampling, and decision trees is known as a random forest model.
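A sketch of this combination is given below; scikit-learn's DecisionTreeClassifier is used as the base learner purely for convenience (any decision tree inducer would do), and the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=10, n_subspace=None, rng=None):
    # Bagging + subspace sampling: each tree sees a bootstrap sample of the rows
    # and a randomly selected subset of the descriptive features (columns).
    rng = rng or np.random.default_rng()
    n, d = X.shape
    n_subspace = n_subspace or max(1, int(np.sqrt(d)))
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                     # same size as the dataset, with replacement
        cols = rng.choice(d, size=n_subspace, replace=False)  # random feature subspace
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, x):
    # Majority vote over the trees, each applied to its own feature subset.
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    return max(set(votes), key=votes.count)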
Figure: The process of creating a model ensemble using bagging and subspace sampling.
Which approach should we use? Bagging is simpler to implement and parallelize than boosting, and so may be better with respect to ease of use and training time. Empirical results indicate:
boosted decision tree ensembles were the best performing model of those tested for datasets containing up to 4,000 descriptive features.
random forest ensembles (based on bagging) performed better for datasets containing more than 4,000 features.
Summary
The decision tree model makes predictions based on sequences of tests on the descriptive feature values of a query instance.
The ID3 algorithm is a standard algorithm for inducing decision trees from a dataset.
Decision Trees: Advantages
interpretable.
able to handle both categorical and continuous descriptive features.
able to model interactions between descriptive features (an ability that is diminished if pre-pruning is employed).
relatively robust to the curse of dimensionality.
relatively robust to noise in the dataset if pruning is used.
Decision Trees: Potential Disadvantages