
Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 4: Information-based Learning Sections 4.4, 4.5


John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy

Alternative Feature Selection Metrics
Handling Continuous Descriptive Features
Predicting Continuous Targets
Noisy Data, Overfitting and Tree Pruning
Model Ensembles
    Boosting
    Bagging

Alternative Feature Selection Metrics

Entropy-based information gain prefers features with many values.
One way of addressing this issue is to use information gain ratio, which is computed by dividing the information gain of a feature by the amount of information used to determine the value of the feature:

GR(d, D) = \frac{IG(d, D)}{-\sum_{l \in levels(d)} P(d = l) \times \log_2(P(d = l))}   (1)

IG(STREAM, D) = 0.3060
IG(SLOPE, D) = 0.5774
IG(ELEVATION, D) = 0.8774

H(STREAM, D) = -\sum_{l \in \{'true', 'false'\}} P(STREAM = l) \times \log_2(P(STREAM = l))
             = -\left( (4/7 \times \log_2(4/7)) + (3/7 \times \log_2(3/7)) \right) = 0.9852 bits

H(SLOPE, D) = -\sum_{l \in \{'flat', 'moderate', 'steep'\}} P(SLOPE = l) \times \log_2(P(SLOPE = l))
            = -\left( (1/7 \times \log_2(1/7)) + (1/7 \times \log_2(1/7)) + (5/7 \times \log_2(5/7)) \right) = 1.1488 bits

H(ELEVATION, D) = -\sum_{l \in \{'low', 'medium', 'high', 'highest'\}} P(ELEVATION = l) \times \log_2(P(ELEVATION = l))
                = -\left( (1/7 \times \log_2(1/7)) + (2/7 \times \log_2(2/7)) + (3/7 \times \log_2(3/7)) + (1/7 \times \log_2(1/7)) \right) = 1.8424 bits

GR(STREAM, D) = 0.3060 / 0.9852 = 0.3106
GR(SLOPE, D) = 0.5774 / 1.1488 = 0.5026
GR(ELEVATION, D) = 0.8774 / 1.8424 = 0.4762
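To make the gain ratio calculation concrete, the following short Python sketch (our own illustration, not code from the book) reproduces the three values above from the feature level counts and the information gain figures already computed:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a distribution given as a list of level counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# Level counts for each descriptive feature in the 7-instance vegetation dataset,
# and the information gain values computed earlier in the slides.
level_counts = {"STREAM": [4, 3],            # 'true', 'false'
                "SLOPE": [1, 1, 5],          # 'flat', 'moderate', 'steep'
                "ELEVATION": [1, 2, 3, 1]}   # 'low', 'medium', 'high', 'highest'
info_gain = {"STREAM": 0.3060, "SLOPE": 0.5774, "ELEVATION": 0.8774}

for feature, counts in level_counts.items():
    h = entropy(counts)          # information needed to determine the feature's value
    gr = info_gain[feature] / h  # Equation (1)
    print(f"GR({feature}, D) = {info_gain[feature]:.4f} / {h:.4f} = {gr:.4f}")
# Prints 0.3106, 0.5026, and 0.4762, matching the values above.
```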

Figure: The vegetation classification decision tree generated using information gain ratio.

Another commonly used measure of impurity is the Gini index:
Gini(t, D) = 1 - \sum_{l \in levels(t)} P(t = l)^2   (2)
The Gini index can be thought of as calculating how often you would misclassify an instance in the dataset if you classified it based on the distribution of classifications in the dataset.
Information gain can be calculated using the Gini index by replacing the entropy measure with the Gini index.

Gini(VEGETATION, D) = 1 - \sum_{l \in \{'chapparal', 'riparian', 'conifer'\}} P(VEGETATION = l)^2
                    = 1 - \left( (3/7)^2 + (2/7)^2 + (2/7)^2 \right) = 0.6531
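As a quick check (our own sketch rather than the book's code), the Gini index of the VEGETATION target follows directly from the class counts:

```python
def gini(counts):
    """Gini index of a distribution given as a list of level counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Class counts in the 7-instance vegetation dataset:
# 3 'chapparal', 2 'riparian', 2 'conifer'.
print(gini([3, 2, 2]))   # 0.6531 (= 32/49), as in Equation (2) applied above
```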

Table: Partition sets (Part.), Gini index, remainder (Rem.), and information gain (Info. Gain) by feature, computed using the Gini index as the impurity measure.

Split by Feature  Level       Part.  Instances               Gini Index  Rem.    Info. Gain
STREAM            'true'      D1     d2, d3, d6, d7          0.625       0.5476  0.1054
                  'false'     D2     d1, d4, d5              0.4444
SLOPE             'flat'      D3     d5                      0.0         0.4     0.2531
                  'moderate'  D4     d2                      0.0
                  'steep'     D5     d1, d3, d4, d6, d7      0.56
ELEVATION         'low'       D6     d2                      0.0         0.3333  0.3198
                  'medium'    D7     d3, d4                  0.5
                  'high'      D8     d1, d5, d7              0.4444
                  'highest'   D9     d6                      0.0
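The remainder and information gain columns of this table can be reproduced with a short sketch (our own helper functions, not the book's code) that weights the Gini index of each partition by its relative size:

```python
def gini(counts):
    """Gini index of a distribution given as a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Class counts (chapparal, riparian, conifer) in each partition from the table above.
partitions = {
    "STREAM":    [[1, 2, 1], [2, 0, 1]],                        # D1, D2
    "SLOPE":     [[0, 0, 1], [0, 1, 0], [3, 1, 1]],             # D3, D4, D5
    "ELEVATION": [[0, 1, 0], [1, 1, 0], [2, 0, 1], [0, 0, 1]],  # D6, D7, D8, D9
}

gini_dataset = gini([3, 2, 2])                       # 0.6531 for the full dataset
for feature, parts in partitions.items():
    n = sum(sum(p) for p in parts)
    rem = sum(sum(p) / n * gini(p) for p in parts)   # weighted Gini of the partitions
    print(f"{feature}: Rem. = {rem:.4f}, Info. Gain = {gini_dataset - rem:.4f}")
# STREAM: 0.5476 / 0.1054, SLOPE: 0.4000 / 0.2531, ELEVATION: 0.3333 / 0.3197
# (the table shows 0.3198 for ELEVATION because it rounds intermediate values)
```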

Handling Continuous Descriptive Features

The easiest way to handle continuous-valued descriptive features is to turn them into Boolean features by defining a threshold and using this threshold to partition the instances based on their value of the continuous descriptive feature.
How do we set the threshold?

1 The instances in the dataset are sorted according to the continuous feature values.
2 The adjacent instances in the ordering that have different classifications are then selected as possible threshold points.
3 The optimal threshold is found by computing the information gain for each of these classification transition boundaries and selecting the boundary with the highest information gain as the threshold.
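A minimal sketch of steps 1 and 2 (our own code; the helper name candidate_thresholds is not from the book), applied to the continuous ELEVATION feature of the vegetation dataset used in this example:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

elevation = [3900, 300, 1500, 1200, 4450, 5000, 3000]
vegetation = ["chapparal", "riparian", "riparian", "chapparal",
              "conifer", "conifer", "chapparal"]
print(candidate_thresholds(elevation, vegetation))
# [750.0, 1350.0, 2250.0, 4175.0] -- the four thresholds evaluated below
```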

Once a threshold has been set, the dynamically created Boolean feature can compete with the other categorical features for selection as the splitting feature at that node.
This process can be repeated at each node as the tree grows.

Table: Dataset for predicting the vegetation in an area with a continuous ELEVATION feature (measured in feet).
ID  STREAM  SLOPE     ELEVATION  VEGETATION
1   false   steep         3 900  chapparal
2   true    moderate        300  riparian
3   true    steep         1 500  riparian
4   false   steep         1 200  chapparal
5   false   flat          4 450  conifer
6   true    steep         5 000  conifer
7   true    steep         3 000  chapparal

Table: Dataset for predicting the vegetation in an area sorted by the continuous ELEVATION feature.
ID  STREAM  SLOPE     ELEVATION  VEGETATION
2   true    moderate        300  riparian
4   false   steep         1 200  chapparal
3   true    steep         1 500  riparian
7   true    steep         3 000  chapparal
1   false   steep         3 900  chapparal
5   false   flat          4 450  conifer
6   true    steep         5 000  conifer

Table: Partition sets (Part.), entropy, remainder (Rem.), and information gain (Info. Gain) for the candidate ELEVATION thresholds: ≥750, ≥1 350, ≥2 250 and ≥4 175.
Split by Threshold  Part.  Instances                   Partition Entropy  Rem.    Info. Gain
≥750                D1     d2                          0.0                1.2507  0.3060
                    D2     d4, d3, d7, d1, d5, d6      1.4591
≥1 350              D3     d2, d4                      1.0                1.3728  0.1839
                    D4     d3, d7, d1, d5, d6          1.5219
≥2 250              D5     d2, d4, d3                  0.9183             0.9650  0.5917
                    D6     d7, d1, d5, d6              1.0
≥4 175              D7     d2, d4, d3, d7, d1          0.9710             0.6935  0.8631
                    D8     d5, d6                      0.0
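The entropy, remainder, and information gain values in this table can be verified with a short sketch (our own code, not the book's):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Sorted (ELEVATION, VEGETATION) pairs from the table above.
data = [(300, "riparian"), (1200, "chapparal"), (1500, "riparian"),
        (3000, "chapparal"), (3900, "chapparal"), (4450, "conifer"),
        (5000, "conifer")]

h_full = entropy([label for _, label in data])   # entropy of the whole dataset
for threshold in (750, 1350, 2250, 4175):
    below = [label for value, label in data if value < threshold]
    above = [label for value, label in data if value >= threshold]
    rem = (len(below) / len(data)) * entropy(below) \
        + (len(above) / len(data)) * entropy(above)
    print(f">= {threshold}: Rem. = {rem:.4f}, Info. Gain = {h_full - rem:.4f}")
# >= 750: 1.2507 / 0.3060,  >= 1350: 1.3728 / 0.1839,
# >= 2250: 0.9650 / 0.5917, >= 4175: 0.6935 / 0.8631
```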

Figure: The vegetation classification decision tree after the dataset has been split using ELEVATION ≥ 4 175.

Figure: The decision tree that would be generated for the vegetation classification dataset listed in Table 3 [17] using information gain.

Predicting Continuous Targets

Regression trees are constructed so as to reduce the variance in the set of training examples at each of the leaf nodes in the tree.
We can do this by adapting the ID3 algorithm to use a measure of variance rather than a measure of classification impurity (entropy) when selecting the best attribute.

The impurity (variance) at a node can be calculated using the following equation:

var(t, D) = \frac{\sum_{i=1}^{n} (t_i - \bar{t})^2}{n - 1}   (3)

We select the feature to split on at a node by selecting the feature that minimizes the weighted variance across the resulting partitions:

d[best] = \arg\min_{d \in \mathbf{d}} \sum_{l \in levels(d)} \frac{|D_{d=l}|}{|D|} \times var(t, D_{d=l})   (4)

Figure: (a) A set of instances on a continuous number line (the target grouping); (b), (c), and (d) depict some of the potential groupings that could be applied to these instances (underfitting, a "Goldilocks" grouping, and overfitting).

Table: A dataset listing the number of bike rentals per day.

ID  SEASON  WORK DAY  RENTALS
1   winter  false         800
2   winter  false         826
3   winter  true          900
4   spring  false       2 100
5   spring  true        4 740
6   spring  true        4 900
7   summer  false       3 000
8   summer  true        5 800
9   summer  true        6 200
10  autumn  false       2 910
11  autumn  false       2 880
12  autumn  true        2 820

Table: The partitioning of the dataset in Table 5 [25] based on the SEASON and WORK DAY features and the computation of the weighted variance for each partitioning.

Split by Feature  Level      Part.  Instances                   |D_{d=l}|/|D|  var(t, D)     Weighted Variance
SEASON            'winter'   D1     d1, d2, d3                  0.25               2 692     1 379 331.33
                  'spring'   D2     d4, d5, d6                  0.25           2 472 533.33
                  'summer'   D3     d7, d8, d9                  0.25           3 040 000
                  'autumn'   D4     d10, d11, d12               0.25               2 100
WORK DAY          'true'     D5     d3, d5, d6, d8, d9, d12     0.50           4 026 346.67  2 551 813.33
                  'false'    D6     d1, d2, d4, d7, d10, d11    0.50           1 077 280

Figure: The decision tree resulting from splitting the data in Table 5 [25] using the feature SEASON.

Figure: The final decision tree induced from the dataset in Table 5 [25]. To illustrate how the tree generates predictions, this tree lists the instances that ended up at each leaf node and the prediction (PRED.) made by each leaf node.
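The weighted variance values in the partitioning table above can be reproduced with the following sketch (our own code, using the sample variance of Equation (3)):

```python
from statistics import variance  # sample variance: divides by n - 1, as in Equation (3)

# RENTALS values from the bike rentals dataset, grouped by the partitions above.
partitions = {
    "SEASON": {"winter": [800, 826, 900], "spring": [2100, 4740, 4900],
               "summer": [3000, 5800, 6200], "autumn": [2910, 2880, 2820]},
    "WORK DAY": {"true": [900, 4740, 4900, 5800, 6200, 2820],
                 "false": [800, 826, 2100, 3000, 2910, 2880]},
}

for feature, parts in partitions.items():
    n = sum(len(p) for p in parts.values())
    weighted_var = sum(len(p) / n * variance(p) for p in parts.values())  # Equation (4)
    print(f"{feature}: weighted variance = {weighted_var:,.2f}")
# SEASON: 1,379,331.33 and WORK DAY: 2,551,813.33 -- so SEASON is chosen for the root split.
```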
Noisy Data, Overfitting and Tree Pruning

In the case of a decision tree, overfitting involves splitting the data on an irrelevant feature.
The likelihood of overfitting occurring increases as a tree gets deeper, because the resulting classifications are based on smaller and smaller subsets as the dataset is partitioned after each feature test in the path.

Pre-pruning: stop the recursive partitioning early. Pre-pruning is also known as forward pruning.
Common Pre-pruning Approaches
1 early stopping
2 χ2 pruning
Post-pruning: allow the algorithm to grow the tree as much as it likes and then prune the tree of the branches that cause overfitting.

Common Post-pruning Approach
Using the validation set, evaluate the prediction accuracy achieved by both the fully grown tree and the pruned copy of the tree. If the pruned copy of the tree performs no worse than the fully grown tree, the node is a candidate for pruning.

Figure: Misclassification rate on the training set and on the validation set plotted against training iteration.

Table: An example validation set for the post-operative patient routing task.

ID  CORE-TEMP  STABLE-TEMP
1   high       true
2   low        true
3   high       false
4   high       false
5   low        false
6   low        true

Figure: The decision tree for the post-operative patient routing task.

Figure: The iterations of reduced error pruning for the decision tree in Figure 7 [34] using the validation set in Table 7 [33]. The subtree that is being considered for pruning in each iteration is highlighted in black. The prediction returned by each non-leaf node is listed in square brackets. The error rate for each node is given in round brackets.

Advantages of pruning:
Smaller trees are easier to interpret.
Increased generalization accuracy when there is noise in the training data (noise dampening).
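To make the post-pruning procedure concrete, here is a minimal sketch of reduced error pruning (our own simplified code built on an assumed dict-based Node class; it is not the book's implementation):

```python
class Node:
    """A decision tree node: either a leaf or a test on one categorical feature."""
    def __init__(self, prediction, feature=None, children=None):
        self.prediction = prediction    # majority-class prediction at this node
        self.feature = feature          # feature tested at this node (None for a leaf)
        self.children = children or {}  # feature value -> child Node

    def predict(self, instance):
        if not self.children:
            return self.prediction
        child = self.children.get(instance[self.feature])
        return child.predict(instance) if child else self.prediction

def errors(node, validation):
    """Count the validation instances that the (sub)tree misclassifies."""
    return sum(node.predict(x) != y for x, y in validation)

def reduced_error_prune(node, validation):
    """Bottom-up reduced error pruning; `validation` holds the (instance, label)
    pairs that are routed to this node."""
    if not node.children:
        return node
    for value, child in node.children.items():
        routed = [(x, y) for x, y in validation if x[node.feature] == value]
        node.children[value] = reduced_error_prune(child, routed)
    pruned = Node(node.prediction)            # pruned copy: a majority-class leaf
    if errors(pruned, validation) <= errors(node, validation):
        return pruned                         # performs no worse on the validation set
    return node
```

Here each instance is a dict mapping feature names to values, and validation is a list of (instance, label) pairs; calling reduced_error_prune(root, validation) on the fully grown tree returns the pruned tree.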
Model Ensembles

Rather than creating a single model, ensemble methods generate a set of models and then make predictions by aggregating the outputs of these models.
A prediction model that is composed of a set of models is called a model ensemble.
In order for this approach to work, the models in the ensemble must be different from each other.

There are two standard approaches to creating ensembles:
1 boosting
2 bagging

Boosting
Boosting works by iteratively creating models and adding them to the ensemble. The iteration stops when a predefined number of models have been added.
When we use boosting, each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified.
This is done by incrementally adapting the dataset used to train the models. To do this we use a weighted dataset.

Weighted Dataset
Each instance has an associated weight w_i ≥ 0, initially set to 1/n, where n is the number of instances in the dataset.
After each model is added to the ensemble it is tested on the training data; the weights of the instances the model gets correct are decreased and the weights of the instances the model gets incorrect are increased.
These weights are used as a distribution over which the dataset is sampled to create a replicated training set, where the replication of an instance is proportional to its weight.

During each training iteration the algorithm:
1 Induces a model and calculates the total error, ε, by summing the weights of the training instances for which the predictions made by the model are incorrect.
2 Increases the weights for the instances misclassified using:
w[i] \leftarrow w[i] \times \left( \frac{1}{2 \times \epsilon} \right)   (5)
3 Decreases the weights for the instances correctly classified:
w[i] \leftarrow w[i] \times \left( \frac{1}{2 \times (1 - \epsilon)} \right)   (6)
4 Calculates a confidence factor, α, for the model such that α increases as ε decreases:
\alpha = \frac{1}{2} \times \log_e \left( \frac{1 - \epsilon}{\epsilon} \right)   (7)

Once the set of models has been created, the ensemble makes predictions using a weighted aggregate of the predictions made by the individual models.
The weights used in this aggregation are simply the confidence factors associated with each model.
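The weight updates and confidence factor in Equations (5)-(7) can be illustrated with a short sketch (our own code, not the book's):

```python
from math import log

def boosting_update(weights, correct):
    """One boosting iteration's weight update and confidence factor.

    weights -- current instance weights (assumed to sum to 1)
    correct -- one Boolean per instance: did the current model classify it correctly?
    """
    epsilon = sum(w for w, ok in zip(weights, correct) if not ok)   # total error
    new_weights = [w * (1 / (2 * epsilon)) if not ok                # Equation (5)
                   else w * (1 / (2 * (1 - epsilon)))               # Equation (6)
                   for w, ok in zip(weights, correct)]
    alpha = 0.5 * log((1 - epsilon) / epsilon)                      # Equation (7)
    return new_weights, alpha

# Four instances with uniform initial weights 1/n; the model misclassifies the first one.
weights, alpha = boosting_update([0.25, 0.25, 0.25, 0.25],
                                 [False, True, True, True])
print(weights, alpha)   # new weights ~ [0.5, 0.167, 0.167, 0.167], alpha ~ 0.55
```

Note that after the update the misclassified instances carry half of the total weight, so the weights still sum to 1.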
Bagging
When we use bagging (or bootstrap aggregating), each model in the ensemble is trained on a random sample of the dataset known as a bootstrap sample.
Each random sample is the same size as the dataset, and sampling with replacement is used.
Consequently, every bootstrap sample will be missing some of the instances from the dataset, so each bootstrap sample will be different, and this means that models trained on different bootstrap samples will also be different.

When bagging is used with decision trees, each bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset. This is known as subspace sampling.
The combination of bagging, subspace sampling, and decision trees is known as a random forest model.

Figure: The process of creating a model ensemble using bagging and subspace sampling.

Which approach should we use?
Bagging is simpler to implement and parallelize than boosting, and so may be better with respect to ease of use and training time.
Empirical results indicate:
boosted decision tree ensembles were the best performing model of those tested for datasets containing up to 4,000 descriptive features.
random forest ensembles (based on bagging) performed better for datasets containing more than 4,000 descriptive features.

Summary

The decision tree model makes predictions based on sequences of tests on the descriptive feature values of a query instance.
The ID3 algorithm is a standard algorithm for inducing decision trees from a dataset.

Decision Trees: Advantages
interpretable.
handle both categorical and continuous descriptive features.
can model the interactions between descriptive features (diminished if pre-pruning is employed).
relatively robust to the curse of dimensionality.
relatively robust to noise in the dataset if pruning is used.

Decision Trees: Potential Disadvantages