
Trees and Forests…

Gerhard Neumann

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Step back… Bigger Picture

Supervised Learning: We want to learn the
relation from the input to the output data

• Regression: Continuous output labels

– Linear regression, Polynomial Regression,
kNN (?)

• Classification: Discrete / Nominal output
labels

– Logistic Regression, Decision Trees, Naive
Bayes, kNN

Recap… Bigger Picture

Unsupervised Learning

There are no "labels"; we want to learn the "structure" of the input data.

• Clustering: Find clusters (similar data points) in the data

– kMeans

• Dimensionality Reduction: How can we represent high-dimensional data with only a few variables?

– Not in this lecture…

Model Selection

For most of these algorithms, we have to solve a
model selection problem

• Choose complexity of the function class

– One of the hardest problems in machine learning

– Too complex: Overfitting

• Fit noise in the data

• Unspecified behavior between data points,
not enough data

– Too simple: Underfitting

• We cannot represent the underlying function

• Let's look at some examples…

Recap: Polynomial Regression

Instead of fitting a line, we can also fit a d-th order polynomial

• Still linear in the parameters beta

• We can use linear regression (see week 4)
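A minimal R sketch of this, assuming a toy data frame df with columns x and y (an illustrative example, reused in later sketches):

  # Sketch: a d-th order polynomial is still linear in its coefficients,
  # so it can be fitted with ordinary linear regression (lm).
  set.seed(1)
  df <- data.frame(x = runif(50))
  df$y <- sin(2 * pi * df$x) + rnorm(50, sd = 0.2)

  d <- 3                                          # polynomial order
  fit <- lm(y ~ poly(x, degree = d), data = df)   # linear regression on polynomial features
  summary(fit)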

Selecting the order of the polynomial

Model Selection for polynomial regression

• Evaluate the model's performance on:

– Test-set

– K-fold cross validation (see Semester A)

• The error on the training set is not an indication of a good fit!

• Overfitting:

– Training error goes down

– Test error goes up

• Underfitting:

– Training + Test error are high
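A minimal R sketch of k-fold cross-validation for choosing the polynomial order, reusing the toy df from the earlier sketch; the fold assignment and the range of orders tried are illustrative assumptions:

  # Sketch: 5-fold cross-validation over polynomial orders 1..8.
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(df)))   # random fold labels

  cv_error <- sapply(1:8, function(d) {
    mean(sapply(1:k, function(i) {
      train <- df[folds != i, ]
      test  <- df[folds == i, ]
      fit   <- lm(y ~ poly(x, degree = d), data = train)
      mean((test$y - predict(fit, newdata = test))^2)   # held-out error, not training error
    }))
  })
  which.min(cv_error)   # order with the lowest cross-validated error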

Recap: KNN

• Find the k-Nearest Neighbors.

• Use the class with the highest occurrence as the output label

• Different complexity with different k

– Small k: Complex decision boundaries, high complexity

– Large k: Much smoother decision boundaries, lower complexity

(Figure: kNN decision boundaries for k = 1 and k = 25)
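The effect of k can be reproduced with knn() from the class package; the 70/30 split of the iris data below is only an illustrative setup:

  # Sketch: k-NN with small and large k on the iris data.
  library(class)
  set.seed(1)
  idx   <- sample(nrow(iris), 0.7 * nrow(iris))
  train <- iris[idx, 1:4]
  test  <- iris[-idx, 1:4]
  cl    <- iris$Species[idx]

  pred_k1  <- knn(train, test, cl, k = 1)    # complex, wiggly decision boundary
  pred_k25 <- knn(train, test, cl, k = 25)   # much smoother boundary
  mean(pred_k1  == iris$Species[-idx])       # test accuracy, k = 1
  mean(pred_k25 == iris$Species[-idx])       # test accuracy, k = 25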

Model-Selection for k…

• Overfitting:

– Training accuracy goes up

– Test accuracy goes down

• Underfitting:

– Training + Test accuracy are low

Bias-Variance Tradeoff

What we have seen is a general concept called the Bias-Variance Tradeoff

• Bias: What is the error I would make even if I knew the optimal parameters of my model (assuming infinite data)?

• Variance: What is the error I make because I only have limited data?

Bias-Variance Tradeoff

Each dot represents a fitted model on a slight variation of the data
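One way to see the variance part in code, as a sketch on the toy df from above (the bootstrap resampling, polynomial orders, and query point are illustrative assumptions):

  # Sketch: prediction variance of a simple vs. a complex polynomial
  # across 200 slight variations (bootstrap resamples) of the data.
  set.seed(1)
  pred_simple  <- numeric(200)
  pred_complex <- numeric(200)
  for (b in 1:200) {
    boot <- df[sample(nrow(df), replace = TRUE), ]
    pred_simple[b]  <- predict(lm(y ~ poly(x, degree = 1), data = boot),
                               newdata = data.frame(x = 0.5))
    pred_complex[b] <- predict(lm(y ~ poly(x, degree = 9), data = boot),
                               newdata = data.frame(x = 0.5))
  }
  var(pred_simple)    # low variance, but the linear model is biased
  var(pred_complex)   # much higher variance: each refit gives a different answer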

Classification and Regression Trees (CART)

Today's agenda

Learn more about tree-based methods

• For regression: Regression Tree

• For classification: Decision Tree

• Almost the same algorithms!

Random Forest:

• Why use multiple trees?

• Why use randomization?

Good news for today… Only 2 slides of math

Classification and Regression Trees

Pioneers:

• Morgan and Sonquist (1963).

• Breiman, Friedman, Olshen, Stone (1984). CART

• Quinlan (1993). C4.5

Regression and Classification Trees

Grow a binary tree

• At each node, “split” the data into two “daughter” nodes.

• Splits are chosen using a splitting criterion.

• Bottom nodes are “terminal” nodes.

For regression:

• the predicted value at a node is the average response variable for all observations in the node.

For classification:

• the predicted class is the most common class in the node (majority vote).

• Can also get estimated probability of membership in each of the classes
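A tiny sketch of these prediction rules for a single terminal node; the vectors below are made-up examples of the observations that landed in that node:

  # Regression node: predict the average response of the node's observations.
  y_node <- c(2.1, 2.5, 1.9, 2.4)
  mean(y_node)

  # Classification node: predict the most common class (majority vote)
  # and report the class proportions as estimated membership probabilities.
  cl_node <- factor(c("a", "a", "b", "a"))
  names(which.max(table(cl_node)))
  table(cl_node) / length(cl_node)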

A classification tree

A regression tree

Predict (log) prostate specific antigen from

• Log cancer volume

• Log prostate weight

Splitting criterion (that's the first mathy slide…)

• Regression: Minimum residual sum of squares

– $\mathrm{RSS} = \sum_{i \in \text{left}} (y_i - \bar{y}_L)^2 + \sum_{i \in \text{right}} (y_i - \bar{y}_R)^2$, where $\bar{y}_L$ and $\bar{y}_R$ are the average label values in the left and right subtree

– Split such that the variance in the subtrees is minimized (see the sketch below)
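A minimal R sketch of this criterion for a single input variable, again on the toy df; the candidate thresholds and the helper name rss_split are illustrative assumptions:

  # Sketch: RSS of a split of (x, y) at a given threshold.
  rss_split <- function(x, y, threshold) {
    left  <- y[x <= threshold]
    right <- y[x >  threshold]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }

  # Try every observed x value (except the largest) as a candidate threshold
  # and keep the one with the smallest RSS.
  thresholds <- head(sort(unique(df$x)), -1)
  rss <- sapply(thresholds, function(t) rss_split(df$x, df$y, t))
  thresholds[which.min(rss)]   # best split point for this variable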

Splitting criterion (that's the second mathy slide…)

• Classification: Minimum entropy in the subtrees, i.e. minimize $N_L H_L + N_R H_R$

– where $H_L = -\sum_k p_{L,k} \log p_{L,k}$ is the entropy in the left subtree

– and $p_{L,k}$ is the proportion of class k in the left subtree (analogously for the right)

• Entropy is a measure of uncertainty

– Split such that the class labels in the subtrees are "pure" (see the sketch below)
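A matching R sketch of the entropy criterion; entropy() and split_entropy() are hypothetical helper names used only for illustration:

  # Sketch: entropy of a set of class labels, and the weighted entropy of a split.
  entropy <- function(labels) {
    p <- table(labels) / length(labels)   # class proportions p_k
    p <- p[p > 0]                         # avoid log(0)
    -sum(p * log2(p))
  }

  split_entropy <- function(x, labels, threshold) {
    left  <- labels[x <= threshold]
    right <- labels[x >  threshold]
    (length(left) * entropy(left) + length(right) * entropy(right)) / length(labels)
  }
  # The best split is the variable/threshold pair with the smallest split_entropy,
  # i.e. the one that leaves the class labels in the subtrees as pure as possible.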

Finding the best horizontal split


Finding the best vertical split

Creating the root node

Finding the best split in the left node

Building the regression tree…

Finding the best split in the right node…

Skipping some steps… final result

Classification Tree

When do we stop?

There are many stopping criteria; the 2 main ones are:

Stop if:

• Minimum number of samples per node

• Maximum depth

… has been reached

Both criteria again influence the complexity of the tree!

Controlling the tree complexity

Small number of samples per leaf:
• Tree is very sensitive to noise

Large number of samples per leaf:
• Tree not expressive enough
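In R this kind of complexity control is exposed by the rpart package; the parameter values below are illustrative choices, not the ones behind the plots in the slides:

  # Sketch: regression trees of different complexity with rpart.
  # minbucket = minimum number of samples per leaf, maxdepth = maximum tree depth.
  library(rpart)

  noisy_tree  <- rpart(y ~ x, data = df, method = "anova",
                       control = rpart.control(minsplit = 2, minbucket = 1,
                                               cp = 0, maxdepth = 30))   # very flexible, fits noise
  smooth_tree <- rpart(y ~ x, data = df, method = "anova",
                       control = rpart.control(minbucket = 20, maxdepth = 3))  # coarse, may underfit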

Model-Selection for Regression Trees

Evaluate error on test-set

• Overfitting for min_samples = 1

• Underfitting for min_samples > 2

Classification and Regression Trees

Advantages

• Applicable to both regression and classification problems.

• Handle categorical predictors naturally.

• Computationally simple and quick to fit, even for large problems.

• No formal distributional assumptions (non-parametric).

• Can handle highly non-linear interactions and classification boundaries.

• Automatic variable selection.

• Very easy to interpret if the tree is small.

Classification and Regression Trees (CART)

Disadvantages

• Accuracy – current methods, such as support vector machines and ensemble classifiers, often have 30% lower error rates than CART.

• Instability – if we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears.

Nowadays, we can do better! Random Forests!

Bagging Predictors

Key Idea: Use multiple trees to improve accuracy

How do we get variability in the trees?

Bagging (Bootstrap Aggregating)

Breiman, “Bagging Predictors”, Machine Learning, 1996.

Fit classification or regression models to bootstrap samples from the data and combine by voting (classification) or averaging (regression).

Bagging (Bootstrap Aggregating)

• A bootstrap sample is chosen at random with replacement from the data. Some observations end up in the bootstrap sample more than once, while others are not included ("out of bag").

• Bagging reduces the variance of the base learner but has limited effect on the bias.

– I.e. no overfitting: the more trees, the better

• It's most effective if we use strong base learners that have very little bias but high variance (unstable), e.g. trees.

• Both bagging and boosting are examples of "ensemble learners" that were popular in machine learning in the '90s.
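A sketch of bagging for regression trees, using rpart as the base learner on the toy df; the number of bootstrap samples B and the prediction grid are illustrative assumptions:

  # Sketch: fit one tree per bootstrap sample and average the predictions.
  library(rpart)

  B <- 100
  grid  <- data.frame(x = seq(0, 1, by = 0.01))
  preds <- matrix(NA, nrow = nrow(grid), ncol = B)

  for (b in 1:B) {
    boot <- df[sample(nrow(df), replace = TRUE), ]          # bootstrap sample
    tree <- rpart(y ~ x, data = boot, method = "anova",     # low-bias, high-variance base learner
                  control = rpart.control(minsplit = 2, cp = 0))
    preds[, b] <- predict(tree, newdata = grid)
  }
  bagged_prediction <- rowMeans(preds)   # averaging (voting for classification)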

Bagging CART

Random Forests

Random Forests

Grow a forest of many trees. (R default is 500)

Grow each tree on an independent bootstrap sample from the training data.

• Sample N cases at random with replacement.

At each node:

1. Select m variables at random out of all M possible variables (independently for each node).

2. Find the best split on the selected m variables.

Grow the trees to maximum depth (classification).

Vote/average the trees to get predictions for new data.
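With the randomForest package this whole recipe is a single call; the iris data is used here only as a stand-in example:

  # Sketch: random forest for classification.
  # ntree = 500 is the R default; mtry is the number m of variables tried per split.
  library(randomForest)

  rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
  print(rf)                            # includes the out-of-bag error estimate
  predict(rf, newdata = iris[1:5, ])   # majority vote over the 500 trees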

Why does that work?

Intuition: Why randomization?

• Increase variability of the single trees

• A single tree is less likely to over-specialize

• The trees are less likely to overfit

Random Regression Forests

(Figure: forest predictions for Num Trees = 1, 10, and 50)

We can represent almost continuous functions!

Random Forests

Advantages

• Applicable to both regression and classification problems. Yes

• Handle categorical predictors naturally. Yes

• Computationally simple and quick to fit, even for large problems. Yes

• No formal distributional assumptions (non-parametric). Yes

• Can handle highly non-linear interactions and classification boundaries. Yes

• Automatic variable selection. Yes

• Very easy to interpret if the tree is small. No

Random Forests

Improve on CART with respect to:

• Accuracy – Random Forests is competitive with the best known machine learning methods.

• Instability – if we change the data a little, the individual trees may change, but the forest is relatively stable because it is a combination of many trees.

Random Forests and the Kinect

Use computer graphics to generate plenty of data

Shotton et al., Real-Time Human Pose Recognition in Parts from a Single Depth Image, CVPR 2011

Summary: What we have learned today

• CART: Binary decision trees

• Can be efficiently used for classification and regression

• Complexity can be set by minimum samples per leaf

• Averaging over multiple trees reduces variance and improves performance

• Variability in the trees:

– Bootstrap

– Randomized splits