
Forests and other Ensemble Methods…

Gerhard Neumann

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Recap: Regression and Classification Trees

Grow a binary tree

• At each node, “split” the data into two “daughter” nodes.

• Splits are chosen using a splitting criterion.

• Bottom nodes are “terminal” nodes.

For regression:

• the predicted value at a node is the average response variable for all observations

in the node.

For classification:

• the predicted class is the most common class in the node (majority vote).

• Can also get an estimated probability of membership in each of the classes (see the sketch below)
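
As a minimal illustration (assuming scikit-learn and a toy dataset, neither of which is part of the lecture), a fitted decision tree returns the majority class with predict and the class proportions of the terminal node with predict_proba:

```python
# Minimal sketch: class prediction and class-membership probabilities
# with a decision tree (scikit-learn; dataset and parameters are illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(tree.predict(X[:3]))        # majority class in each terminal node
print(tree.predict_proba(X[:3]))  # class proportions in each terminal node
```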

Recap: A classification tree

Recap: A regression tree

Predict (log) prostate specific antigen from

• Log cancer volume

• Log prostate weight

Recap: Splitting criterion (that's the first mathy slide…)

• Regression: Minimum residual sum of squares

– RSS = Σ_{i ∈ left} (y_i − ȳ_L)² + Σ_{i ∈ right} (y_i − ȳ_R)², where ȳ_L and ȳ_R are the average label values in the left and right subtree

– Split such that the variance in the subtrees is minimized (a small sketch follows)
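
As a small illustrative sketch (not the lecture's code), the following picks the split threshold on a single feature that minimises the residual sum of squares of the two subtrees; the toy data are made up:

```python
# Find the split threshold on one feature that minimises the RSS of the
# left and right subtrees (brute force over all candidate thresholds).
import numpy as np

def best_rss_split(x, y):
    """Return (threshold, rss) of the best binary split of y along feature x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[1]:
            best = ((x[i - 1] + x[i]) / 2, rss)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = (x > 5).astype(float) + rng.normal(0, 0.1, 50)
print(best_rss_split(x, y))   # threshold should come out close to 5
```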

Recap: Splitting criterion (that's the second mathy slide…)

• Classification: Minimum entropy in the subtrees

– where H_L = −Σ_k p_{k,L} log p_{k,L} is the entropy in the left sub-tree

– and p_{k,L} is the proportion of class k in the left tree

• Entropy is a measure of uncertainty

– Split such that the class labels in the sub-trees are "pure" (see the sketch below)
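
A small illustrative sketch (not the lecture's code) of the entropy computation: the weighted entropy of a candidate split is lowest when the subtrees are pure.

```python
# Entropy of class labels and the weighted entropy of a candidate split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_entropy(left_labels, right_labels):
    n_l, n_r = len(left_labels), len(right_labels)
    n = n_l + n_r
    return n_l / n * entropy(left_labels) + n_r / n * entropy(right_labels)

print(entropy([0, 0, 1, 1]))           # 1.0 bit: maximally uncertain
print(split_entropy([0, 0], [1, 1]))   # 0.0: both subtrees are pure
```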

Bias-Variance Tradeoff

What we have seen is a general concept called the Bias-Variance Tradeoff

• Bias: What error would I still make even if I knew the optimal parameters of my model (assuming infinite data)?

• Variance: What additional error do I make because I only have limited data?

Recap: Model Selection

• We can evaluate a model using:

– Test-set method:

• High variance

• A large number of samples are "wasted" for evaluation

• Simple and fast

– K-Fold Cross-Validation:

• More robust – the score is averaged over the K folds

• Samples are used more

efficiently

• Slow (both schemes are sketched below)
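
Both evaluation schemes in a minimal scikit-learn sketch (the classifier and dataset are placeholders, not from the lecture):

```python
# Test-set method vs. K-fold cross-validation for evaluating one model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Test-set method: one split, simple and fast, but higher variance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print(model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: slower, but the averaged score is more robust
print(cross_val_score(model, X, y, cv=5).mean())
```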

Standard Model-Selection Loop

How do we select a model?

• For each model (algorithm + parameter setting)

– Evaluate using a test set or cross-validation

• Pick the best model

• Relearn using all the data

How do we evaluate the final performance?

• We cannot use the cross-validation score, as it has already been used for selecting the model. Bias!

• We need an independent test set that has not been used for training nor model-selection!

• Often, this bias is small and the requirement is ignored for simplicity (the full loop is sketched below)
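
A sketch of this loop with scikit-learn (dataset and parameter grid are illustrative): hold out an independent test set first, select the tree depth by cross-validation on the remainder, relearn the best model on all of that data (GridSearchCV does this with refit=True), then report the held-out score.

```python
# Model-selection loop with an untouched final test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Independent test set: never used for training or model selection
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8, None]},
                      cv=5)
search.fit(X_dev, y_dev)             # refit=True relearns the best model on all of X_dev

print(search.best_params_)           # the selected model
print(search.score(X_test, y_test))  # unbiased estimate of final performance
```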

Today's agenda

Ensemble Methods
• Ensemble = more than one classifier/regressor

• Why can that be useful?

• Bagging:
– Train multiple classifiers/regressors

– Reduce variance while bias stays the same

– E.g.: Random Forests

• Boosting
– Train multiple classifiers/regressors

– Adjust the importance of the training data

– Reduces bias and variance

Bagging Predictors

Key Idea: Use multiple trees to improve accuracy


How do we get variability in the trees?

Bagging (Bootstrap Aggregating)

Breiman, “Bagging Predictors”, Machine Learning, 1996.

Fit classification or regression models to bootstrap samples from the data

and combine by voting (classification) or averaging (regression).

Bagging (Bootstrap Aggregating)

• A bootstrap sample is chosen at random with replacement from the data. Some

observations end up in the bootstrap sample more than once, while others are not

included (“out of bag”).

• Bagging reduces the variance of the base learner but has limited effect on the bias.

– I.e., adding more trees does not cause overfitting: the more trees, the better (a code sketch follows this list)

• It’s most effective if we use strong base learners that have very little bias but high

variance (unstable). E.g. trees.

• Both bagging and boosting are examples of “ensemble learners” that were popular

in machine learning in the ‘90s.
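
As a concrete, hypothetical illustration of bagging, here is a minimal scikit-learn sketch: trees are fit to bootstrap samples and combined by voting, and the out-of-bag observations provide a free accuracy estimate. The dataset and numbers are placeholders; note that BaggingClassifier's first argument is named base_estimator in older scikit-learn releases.

```python
# Bagging trees vs. a single tree; out-of-bag score as a free estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           oob_score=True, random_state=0)

print(cross_val_score(single_tree, X, y, cv=5).mean())  # unstable base learner
print(cross_val_score(bagged, X, y, cv=5).mean())       # variance reduced by averaging

bagged.fit(X, y)
print(bagged.oob_score_)  # accuracy estimated on the "out of bag" observations
```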

Bagging CART

Random Forests


Grow a forest of many trees. (R default is 500)

Grow each tree on an independent bootstrap sample from the training data.

• Sample N cases at random with replacement.

At each node:

1. Select max_features variables at random out of all M possible variables

(independently for each node).

2. Find the best split on the selected max_features variables.

Grow the trees to maximum depth.

Vote/average the trees to get predictions for new data.
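
A minimal random forest sketch in Python/scikit-learn mirroring the recipe above; the dataset is a placeholder, n_estimators=500 matches the quoted R default, and max_features='sqrt' is one common choice for classification.

```python
# Random forest: bootstrap samples + random feature subsets at each node.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=500,     # size of the forest
                                max_features="sqrt",  # variables tried at each node
                                max_depth=None,       # grow trees to maximum depth
                                bootstrap=True,       # one bootstrap sample per tree
                                random_state=0)

print(cross_val_score(forest, X, y, cv=5).mean())     # votes averaged over 500 trees
```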

Why does that work?

Intuition: Why randomization?

• Increase variability of the single trees

• A single tree is less likely to over-specialize

• The trees are less likely to overfit

Random Regression Forests

[Figure: random regression forest predictions with 1, 10, and 50 trees]

By averaging many trees, the prediction becomes an almost continuous function!

Random Forests


Advantages

• Applicable to both regression and classification problems. Yes

• Handle categorical predictors naturally. Yes

• Computationally simple and quick to fit, even for large problems. Yes

• No formal distributional assumptions (non-parametric). Yes

• Can handle highly non-linear interactions and classification boundaries. Yes

• Automatic variable selection. Yes

• Very easy to interpret if the tree is small. No

Random Forests

Improve on CART with respect to:

• Accuracy – Random Forests is competitive with the best known machine learning

methods

• Instability – if we change the data a little, the individual trees may change but the

forest is relatively stable because it is a combination of many trees.

Random Forests and the Kinect


Use computer graphics to generate plenty of data

Shotton et al., "Real-Time Human Pose Recognition in Parts from Single Depth Images", CVPR 2011

Boosting

Main Idea

• Boosting refers to a general and provably effective method of producing a very

accurate classifier by combining rough and moderately inaccurate rules of

thumb.

• It is based on the observation that finding many rough rules of thumb can be a lot

easier than finding a single, highly accurate classifier.

• To begin, we define an algorithm for finding the rules of thumb, which we call a weak

learner.

• The boosting algorithm repeatedly calls this weak learner, each time feeding it a

different distribution over the training data (in Adaboost).

• Each call generates a weak classifier and we must combine all of these into a

single classifier that, hopefully, is much more accurate than any one of the rules.

Boosting Pseudo Code

Idea: given a weak learner, run it multiple times on (reweighted) training data, then let

the learned classifiers vote

On each iteration t:

• weight each training example by how incorrectly it was classified

• Learn a hypothesis h_t on the weighted data

• Compute a strength α_t for this hypothesis from its weighted error

Final classifier:

• A linear combination of the votes of the different classifiers, weighted by their strength: H(x) = sign(Σ_t α_t h_t(x)) (a compact sketch follows)
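
As an illustration of this pseudo code, here is a compact AdaBoost sketch (not the lecture's code): it assumes labels in {−1, +1}, uses scikit-learn decision stumps as the weak learner, and the function names and toy data are made up for the example.

```python
# Compact AdaBoost: reweight the samples each round, combine stumps by a
# weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                 # start with a uniform distribution
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()            # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # strength of this hypothesis
        w *= np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight correct
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(votes)
    return predict

# Toy usage: two classes labelled +1 / -1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
predict = adaboost(X, y)
print((predict(X) == y).mean())             # training accuracy of the ensemble
```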

Boosting Intuition


Adaboost

• The coefficient α_t = ½ ln((1 − ε_t) / ε_t) depends on the weighted error ε_t of the weak classifier

– Smaller error, higher coefficient

• The weights of the samples are:

– Increased by a factor of exp(α_t) if incorrectly classified

– Decreased by a factor of exp(−α_t) if correctly classified
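
Worked example (standard AdaBoost update, for illustration): with weighted error ε_t = 0.2, α_t = ½ ln(0.8/0.2) = ln 2 ≈ 0.69, so the weight of every misclassified sample is multiplied by exp(α_t) = 2 and the weight of every correctly classified sample by exp(−α_t) = 0.5, after which the weights are re-normalised to sum to one.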

Toy Example: Decision Stumps

Toy Example: Round 1

Toy Example: Round 2

Toy Example: Round 3

Toy Example: Final Hypothesis

Boosting Results – Digit Classification

Application – Face Detection

Training Data

• 5000 faces

– All frontal

• 300 million non-faces

– 9500 non-face images

Application – Detecting Faces

• Problem: find faces in photograph or movie

• Weak classifiers: detect light/dark rectangle patterns in the image (see the sketch below)

• Many clever tricks to make it very fast and accurate
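
A hedged sketch of the kind of weak feature behind this (Viola-Jones style): the difference in summed intensity between two rectangles, computed in constant time from an integral image. The window size and rectangle coordinates here are made up for illustration.

```python
# Two-rectangle feature computed from an integral image.
import numpy as np

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in img[r0:r1, c0:c1] using the integral image."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

rng = np.random.default_rng(0)
img = rng.uniform(size=(24, 24))   # a 24x24 detection window
ii = integral_image(img)

# Top half minus bottom half of a sub-window (light vs. dark rectangle)
feature = rect_sum(ii, 0, 4, 12, 20) - rect_sum(ii, 12, 4, 24, 20)
# A weak classifier thresholds this value; boosting picks the best such features
print(feature)
```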

Summary: What we have learned today

• Averaging over multiple trees reduces variance and improves performance

• Variability in the trees:

– Bagging

– Randomized splits

• Bagging:

– Resample data points

– Weight of each classifier is the same

– Only variance reduction

• Boosting:

– Reweights data points (modifies their distribution)

– Weight is dependent on classifier’s accuracy

– Both bias and variance are reduced – the learned rule becomes more complex with each iteration