CSCI 567: Machine Learning
Lecture 8, Oct 27
Administrivia
• HW4 will be released in parts. All of it is due together in about 3 weeks.
• Part 1, on decision trees and ensemble methods, will be released first.
• Project details will be released early next week.
• Groups of 4 (start forming groups).
• Today’s plan:
• Decision trees
• Ensemble methods
Decision trees
Decision trees
• Introduction & definition
• Learning the parameters
• Measures of uncertainty
• Recursively learning the tree & some variants
We have seen different ML models for classification/regression:
• linear models, nonlinear models induced by kernels, neural networks
Decision tree is another popular one:
• nonlinear in general
• works for both classification and regression; we focus on classification
• one key advantage is good interpretability
• ensembles of trees are very effective
Decision trees
Many decisions are made based on some tree structure
Examples: medical treatment; salary in a company
Tree terminology
Input: x = (x1, x2)
Output: f(x) determined naturally by traversing the tree
• start from the root
• test at each node to decide which child to visit next
• finally the leaf gives the prediction f(x)
For example, f((θ1 − 1, θ2 + 1)) = B
Complex to formally write down, but easy to represent pictorially or as code.
Decision boundary
Parameters
Parameters to learn for a decision tree:
• The structure of the tree, such as the depth, #branches, #nodes, etc.
Some of these are considered as hyperparameters. The structure of a tree is
not fixed in advance, but learned from data.
• The test at each internal node:
Which feature(s) to test on? If the feature is continuous, what threshold
(θ1, θ2, . . .)?
• The value/prediction of the leaves (A, B, . . .)
Decision trees
• Introduction & definition
• Learning the parameters
• Measures of uncertainty
• Recursively learning the tree & some variants
Learning the parameters (optimization?)
So how do we learn all these parameters?
Empirical risk minimization (ERM): find the parameters that minimize some loss.
However, doing exact ERM is too expensive for trees.
• for T nodes, there are roughly (#features)^T possible decision trees (need to decide which
feature to use at each node).
• enumerating all these configurations to find the one that minimizes some loss is too computa-
tionally expensive.
• since most of the parameters are discrete (#branches, #nodes, feature at each node, etc.) can-
not really use gradient based approaches.
Instead, we turn to some greedy top-down approach.
A running example Russell & Norvig, AIMA
• predict whether a customer will wait to get a table at some restaurant
• 12 training examples
• 10 features (all discrete)
First step: How to build the root?
Which feature should we test at the root? Examples:
Which split is better?
• intuitively “patrons” is a better feature since it leads to “more certain” children
• how to quantify this intuition?
Decision trees
• Introduction & definition
• Learning the parameters
• Measures of uncertainty
• Recursively learning the tree & some variants
The uncertainty of a node should be a function of the distribution of the classes within the node.
Example: a node with 2 positive and 4 negative examples can be summarized by a distribution P
with P(Y = +1) = 1/3 and P(Y = −1) = 2/3
One classic measure of the uncertainty of a distribution is its (Shannon) entropy:
H(P) = −∑_k P(Y = k) log P(Y = k)
Measure of uncertainty of a node
H(P) = E_{Y∼P}[log(1/P(Y))] = ∑_k P(Y = k) log(1/P(Y = k)) = −∑_k P(Y = k) log P(Y = k)
• the base of log can be 2, e or 10
• always non-negative
• it’s the smallest codeword length to encode symbols drawn from P
• maximized if P is uniform (max = lnC): most uncertain case
• minimized if P focuses on one class (min = 0): most certain case
• e.g. P = (1, 0, . . . , 0)
• 0 log 0 is defined naturally as lim_{z→0⁺} z log z = 0
Properties of entropy
Examples of computing entropy
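To make the formula concrete, here is a minimal Python sketch (not from the original slides) that computes the entropy of a node from its labels; the example labels are the 2-positive/4-negative node above.

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Shannon entropy of the empirical class distribution of `labels`."""
    n = len(labels)
    h = 0.0
    for count in Counter(labels).values():
        p = count / n
        h -= p * math.log(p, base)   # terms with p = 0 never appear (0 log 0 := 0)
    return h

# A node with 2 positive and 4 negative examples: P = (1/3, 2/3)
print(entropy([+1, +1, -1, -1, -1, -1]))   # ≈ 0.918 (bits)
```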
Entropy in each child if root tests on “patrons”
So how good is choosing “patrons” overall?
Very naturally, we take the weighted average of entropy:
(2/12)·0 + (4/12)·0 + (6/12)·0.9 = 0.45
Suppose we split based on a discrete feature A. The uncertainty can be measured by the
conditional entropy:
H(Y | A) = ∑_a P(A = a) H(Y | A = a), where H(Y | A = a) = −∑_k P(Y = k | A = a) log P(Y = k | A = a)
i.e. ∑_a “fraction of examples at child A = a” × “entropy at child A = a”
Pick the feature that leads to the smallest conditional entropy.
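Continuing the sketch above (and reusing the hypothetical entropy helper from the previous snippet), the split criterion can be computed and used to pick the root feature like this:

```python
def conditional_entropy(feature_values, labels, base=2):
    """H(Y | A): weighted average of the children's entropies after splitting on A."""
    n = len(labels)
    h = 0.0
    for a in set(feature_values):
        child_labels = [y for v, y in zip(feature_values, labels) if v == a]
        h += (len(child_labels) / n) * entropy(child_labels, base)
    return h

def pick_root(columns, labels):
    """columns: dict mapping feature name -> list of values. Smallest H(Y | A) wins."""
    return min(columns, key=lambda name: conditional_entropy(columns[name], labels))
```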
Measure of uncertainty of a split
Each child of “type” is half positive and half negative, so each has entropy 1.
The conditional entropy is (2/12)·1 + (2/12)·1 + (4/12)·1 + (4/12)·1 = 1 > 0.45
So splitting with “patrons” is better than splitting with “type”.
In fact by similar calculation “patrons” is the best split among all features.
We are now done with building the root (this is also called a stump).
Deciding the root
Decision trees
• Introduction & definition
• Learning the parameters
• Measures of uncertainty
• Recursively learning the tree & some variants
Split each child in the same way.
• but no need to split children “none” and
“some”: they are pure already and will
be our leaves
• for “full”, repeat the procedure, focusing on those 6 examples
Repeat recursively
Putting it together
DecisionTreeLearning(Examples)
• if all Examples have the same class, return a leaf with this class
• else if Examples is empty, return a leaf with the majority class of the parent
• else:
  find the best feature A to split on (e.g. based on conditional entropy)
  Tree ← a root node with a test on A
  for each value a of A:
    Child ← DecisionTreeLearning(Examples with A = a)
    add Child to Tree as a new branch
• return Tree
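A minimal runnable Python sketch of this recursion, assuming discrete features and reusing the conditional_entropy helper from earlier (the data layout and function names are illustrative, not the course's reference code):

```python
from collections import Counter

def decision_tree_learning(examples, labels, features, parent_majority=None):
    """examples: list of dicts feature -> value; labels: parallel list of classes.
    Returns either a class label (leaf) or a nested dict {feature: {value: subtree}}."""
    if not examples:                                  # empty: majority class of the parent
        return parent_majority
    if len(set(labels)) == 1:                         # pure node: that single class
        return labels[0]
    if not features:                                  # no features left: majority class
        return Counter(labels).most_common(1)[0][0]

    majority = Counter(labels).most_common(1)[0][0]
    # best feature = smallest conditional entropy
    A = min(features, key=lambda f: conditional_entropy([e[f] for e in examples], labels))
    tree = {A: {}}
    for a in set(e[A] for e in examples):
        idx = [i for i, e in enumerate(examples) if e[A] == a]
        tree[A][a] = decision_tree_learning(
            [examples[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != A],
            parent_majority=majority,
        )
    return tree
```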
Popular decision tree algorithms (e.g. C4.5, CART, etc) are all based on this framework.
• replace entropy by the Gini impurity:
G(P) = ∑_k P(Y = k)(1 − P(Y = k))
meaning: how often a randomly chosen example would be incorrectly classified if we
predict according to another randomly picked example
• if a feature is continuous, we need to find a threshold that leads to minimum condi-
tional entropy or Gini impurity. Think about how to do it efficiently.
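One efficient way to do this (a hedged sketch, not necessarily how C4.5/CART implement it): sort the examples by the feature once, then sweep candidate thresholds between consecutive distinct values while maintaining running class counts, so the whole scan costs O(n log n). Gini impurity is used here; entropy works the same way.

```python
from collections import Counter

def gini(counts, n):
    """Gini impurity from class counts over n examples: sum_k p_k (1 - p_k)."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(values, labels):
    """Threshold for one continuous feature minimizing the weighted Gini impurity."""
    n = len(labels)
    order = sorted(range(n), key=lambda i: values[i])
    left, right = Counter(), Counter(labels)
    best_score, best_thr = float("inf"), None
    for k, i in enumerate(order[:-1], start=1):       # k examples go to the left child
        left[labels[i]] += 1
        right[labels[i]] -= 1
        v, v_next = values[i], values[order[k]]
        if v == v_next:                               # only split between distinct values
            continue
        score = k / n * gini(left, k) + (n - k) / n * gini(right, n - k)
        if score < best_score:
            best_score, best_thr = score, (v + v_next) / 2
    return best_thr
```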
If the dataset has no contradictions (i.e. no two examples with the same x but different y), the training error
of our decision tree algorithm is always zero, and hence the model can overfit.
To prevent overfitting:
• restrict the depth or #nodes (e.g. stop building the tree when the depth reaches some threshold).
• do not split a node if the #examples at the node is smaller than some threshold.
• other approaches as well, all make use of a validation set to tune these hyperparameters
You’ll explore this in HW4.
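For illustration, a hedged sketch of tuning these knobs with scikit-learn (the parameter grid is arbitrary and the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Restrict depth and minimum node size; cross-validation plays the role of the validation set
params = {"max_depth": [2, 4, 8, None], "min_samples_split": [2, 10, 50]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5).fit(X, y)
print(search.best_params_)
```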
Regularization
Ensemble methods
Acknowledgement
We borrow some of the content from Stanford's CS229 slides on Ensemble Methods:
https://cs229.stanford.edu/lectures-spring2022/cs229-boosting_slides.pdf
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Advantages:
• Can handle large datasets
• Can handle mixed predictors (continuous, discrete, qualitative)
• Can ignore redundant variables
• Can easily handle missing data
• Easy to interpret if small
Disadvantages:
• Prediction performance is often poor (because a single tree does not generalize well)
• Large trees are hard to interpret
Decision Trees Recap
Ensemble Methods for Decision Trees
Issue: A single decision tree is very unstable: small variations in the data can lead to
very different trees (since differences can propagate along the hierarchy).
They are high variance models, which can overfit.
But they have many advantages (e.g. very fast, robust to data variations).
Q: How can we lower the variance?
A: Let’s learn multiple trees!
How to ensure they don’t all just learn the same thing??
Key idea: Combine multiple classifiers to form a learner with better performance than any
of them individually (“wisdom of the crowd”)
Bagging (Breiman, 1996)
Bootstrap Aggregating: lowers variance
Ingredients:
Bootstrap sampling: get different splits/subsets of the data
Aggregating: by averaging
Procedure:
→ Get multiple random splits/subsets of the data
→ Train a given procedure (e.g. decision tree) on each subset
→ Average the predictions of all trees to make predictions on test data
Leads to estimations with reduced variance.
Collect T subsets each of some fixed size (say m) by sampling with replacement from
training data.
Let f_t(x) be the classifier (such as a decision tree) obtained by training on subset
t ∈ {1, . . . , T}. Then the aggregated classifier f_agg(x) is given by:
f_agg(x) = (1/T) ∑_{t=1}^T f_t(x) for regression,
f_agg(x) = Majority Vote{f_t(x)}_{t=1}^T for classification.
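A hedged Python sketch of this procedure, with decision trees as the base learner (any classifier with fit/predict would do; names like bagging_fit are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, m=None, seed=0):
    """Train T trees, each on a bootstrap sample of size m drawn with replacement."""
    rng = np.random.default_rng(seed)
    m = m or len(X)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=m)          # sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Aggregate by majority vote (classification); use the mean for regression."""
    votes = np.array([t.predict(X) for t in trees])    # shape (T, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```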
Why majority vote? “Wisdom of the crowd”
Suppose I ask each of you: “Will the stock market go up tomorrow?”
Suppose each of you has a 60% chance of being correct, and all of you make
independent predictions (probability of any 1 person being correct is independent of
probability of any one else being correct).
What is Probability(Majority vote of 100 people being correct)?
Let BinCDF(k, n, p) be the CDF at value k of the Binomial distribution corresponding to n trials,
each trial having probability p of success.
Probability(Majority vote of 100 people being correct) = 1 − BinCDF(50, 100, 0.6)
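A quick numerical check (assuming SciPy is available):

```python
from scipy.stats import binom

# Each of 100 independent voters is correct with probability 0.6;
# the majority is correct if at least 51 voters are correct.
print(1 - binom.cdf(50, 100, 0.6))   # ≈ 0.973
```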
Bagging: example
[Figure from ESL: decision trees grown on bootstrap samples b = 1, . . . , 5]
• Reduces overfitting (i.e., variance)
• Can work with any type of classifier (here focus on trees)
• Easy to parallelize (can train multiple trees in parallel)
• But loses interpretability compared to a single decision tree (true for all ensemble methods)
Bagging: summary
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Issue with bagging: Bagged trees are still too correlated
Each tree is trained on a large enough random sample of the data, so the trees often end up
not being sufficiently different.
How to decorrelate the trees further?
Simple technique: When growing a tree on a bootstrapped (i.e. subsampled) dataset,
before each split, select m ≤ p of the p input variables at random as candidates for
splitting.
When m = p → same as bagging
When m < p → random forests
m is a hyperparameter, tuned via a validation set.
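As an illustration (a hedged sketch using scikit-learn, where max_features plays the role of m; the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# m = p recovers bagging; m < p decorrelates the trees (a random forest)
for m in [20, 8, 4]:
    rf = RandomForestClassifier(n_estimators=100, max_features=m, random_state=0)
    print(m, cross_val_score(rf, X, y, cv=5).mean())
```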
Random forests
Random forests are very popular!
Wikipedia: Random forests are frequently used as "blackbox" models in businesses, as they
generate reasonable predictions across a wide range of data while requiring little
configuration.
• When you have a large number of features but only very few relevant ones:
Prob(selecting a relevant feature among the m selected features) is very small
• Lacks expressive power compared to other ensemble methods we’ll see next..
Random forests
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Recall that the bagged/random forest classifier is given by
f_agg(x) = sign((1/T) ∑_{t=1}^T f_t(x))
where each f_t, t = 1, . . . , T, belongs to the function class F (such as decision trees) and is trained
in parallel.
Instead of training the {f_t}_{t=1}^T in parallel, what if we sequentially learn which models to
use from the function class F so that together they are as accurate as possible?
More formally, what is the best classifier sign(h(x)), where
h(x) = ∑_{t=1}^T α_t f_t(x) for α_t ≥ 0 and f_t ∈ F?
Boosting is a way of doing this.
• is a meta-algorithm, which takes a base algorithm (classification algorithm, regres-
sion algorithm, ranking algorithm, etc) as input and boosts its accuracy
• main idea: combine weak “rules of thumb” (e.g. 51% accuracy) to form a highly
accurate predictor (e.g. 99% accuracy)
• works very well in practice (especially in combination with trees)
• has strong theoretical guarantees
We will continue to focus on binary classification.
Email spam detection:
• given a training set like:
– (“Want to make money fast? ...”, spam)
– (“Viterbi Research Gist ...”, not spam)
• first obtain a classifier by applying a base algorithm, which can be a rather sim-
ple/weak one, like decision stumps:
– e.g. contains the word “money” → spam
• reweigh the examples so that “difficult” ones get more attention
– e.g. spam that doesn’t contain the word “money”
• obtain another classifier by applying the same base algorithm:
– e.g. empty “to address” → spam
• repeat ...
• final classifier is the (weighted) majority vote of all weak classifiers
Boosting: example
A base algorithm A (also called a weak learning algorithm/oracle) takes a training set S
weighted by D as input, and outputs a classifier f ← A(S, D)
• this can be any off-the-shelf classification algorithm (e.g. decision trees, logistic
regression, neural nets, etc)
• many algorithms can deal with a weighted training set (e.g. for algorithm that mini-
mizes some loss, we can simply replace “total loss” by “weighted total loss”)
• even if it’s not obvious how to deal with weight directly, we can always resample
according to D to create a new unweighted dataset
Base algorithm
Boosting: Idea
The boosted predictor is of the form f_boost(x) = sign(h(x)), where
h(x) = ∑_{t=1}^T α_t f_t(x) for α_t ≥ 0 and f_t ∈ F.
The goal is to minimize ℓ(h(x), y) over the training data for some loss function ℓ.
Q: We know how to find the best predictor in F on some data, but how do we find the best weighted
combination h(x)?
A: Minimize the loss by a greedy approach, i.e. find α_t, f_t(x) one by one for t = 1, . . . , T.
Specifically, let h_t(x) = ∑_{τ=1}^t α_τ f_τ(x). Suppose we have found h_{t−1}(x); how do we find α_t, f_t(x)?
Find the α_t, f_t(x) which minimize the loss ℓ(h_t(x), y).
Different loss functions ℓ give different boosting algorithms:
ℓ(h(x), y) = (h(x) − y)²  → least squares boosting
ℓ(h(x), y) = exp(−h(x) y)  → AdaBoost
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Figure from PML
AdaBoost minimizes the exponential loss by a greedy approach, that is, it finds α_t, f_t(x) one by
one for t = 1, . . . , T.
Recall h_t(x) = ∑_{τ=1}^t α_τ f_τ(x). Suppose we have found h_{t−1}; what should f_t be? Greedily,
we want to find α_t, f_t(x) to minimize
∑_i exp(−y_i h_t(x_i)) = ∑_i exp(−y_i h_{t−1}(x_i)) exp(−y_i α_t f_t(x_i)) = const_t · ∑_i D_t(i) exp(−y_i α_t f_t(x_i))
where the last step is by defining the weights
D_t(i) = exp(−y_i h_{t−1}(x_i)) / const_t,
and const_t is a normalizing constant that makes ∑_i D_t(i) = 1.
So the goal becomes finding α_t and f_t(x) ∈ F that minimize
∑_i D_t(i) exp(−y_i α_t f_t(x_i)) = e^{α_t} ∑_{i: y_i ≠ f_t(x_i)} D_t(i) + e^{−α_t} ∑_{i: y_i = f_t(x_i)} D_t(i)
= ε_t e^{α_t} + (1 − ε_t) e^{−α_t}   (where ε_t = ∑_{i: y_i ≠ f_t(x_i)} D_t(i) is the weighted error of f_t)
= ε_t (e^{α_t} − e^{−α_t}) + e^{−α_t}
Therefore, we should find f_t(x) to minimize its weighted classification error ε_t (which is what
we intuitively expect the base algorithm to do).
When f_t(x) (and thus ε_t) is fixed, we then find α_t to minimize
ε_t (e^{α_t} − e^{−α_t}) + e^{−α_t}
Exercise: verify that the solution is given by α_t = (1/2) ln((1 − ε_t)/ε_t).
Hint: e^x is a convex function of x.
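For reference, a short derivation of this solution (set the derivative to zero; convexity guarantees it is the minimizer):

```latex
\frac{d}{d\alpha_t}\Big[\varepsilon_t e^{\alpha_t} + (1-\varepsilon_t)e^{-\alpha_t}\Big]
  = \varepsilon_t e^{\alpha_t} - (1-\varepsilon_t)e^{-\alpha_t} = 0
\;\Longrightarrow\; e^{2\alpha_t} = \frac{1-\varepsilon_t}{\varepsilon_t}
\;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}.
```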
How do we update the weights for the next step? The definition of D_{t+1}(i) is actually recursive:
D_{t+1}(i) ∝ exp(−y_i h_t(x_i)) = exp(−y_i h_{t−1}(x_i)) · exp(−y_i α_t f_t(x_i)) ∝ D_t(i) · exp(−y_i α_t f_t(x_i))
⟹ D_{t+1}(i) ∝ D_t(i) exp(−α_t y_i f_t(x_i)) = D_t(i) · e^{−α_t} if f_t(x_i) = y_i, and D_t(i) · e^{α_t} otherwise.
Given a training set S and a base algorithm A, initialize D_1 to be uniform.
For t = 1, . . . , T:
• obtain a weak classifier f_t(x) ← A(S, D_t)
• calculate the weight α_t of f_t(x) as
  α_t = (1/2) ln((1 − ε_t)/ε_t)   (α_t > 0 ⇔ ε_t < 0.5)
  where ε_t = ∑_{i: f_t(x_i) ≠ y_i} D_t(i) is the weighted error of f_t(x)
• update the distribution
  D_{t+1}(i) ∝ D_t(i) e^{−α_t y_i f_t(x_i)} = D_t(i) · e^{−α_t} if f_t(x_i) = y_i, and D_t(i) · e^{α_t} otherwise
Output the final classifier f_boost = sgn(∑_{t=1}^T α_t f_t(x))
AdaBoost: Full algorithm
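A hedged, self-contained Python sketch of this algorithm with depth-1 decision stumps as the base learner (scikit-learn trees accept per-example weights via sample_weight; the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; y must be a numpy array with values in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1: uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = f.predict(X)
        eps = D[pred != y].sum()            # weighted error ε_t
        if eps == 0 or eps >= 0.5:          # perfect, or no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                     # renormalize so that sum_i D_{t+1}(i) = 1
        stumps.append(f)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """f_boost(x) = sign( sum_t alpha_t f_t(x) )."""
    h = sum(a * f.predict(X) for f, a in zip(stumps, alphas))
    return np.sign(h)
```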
Put more weight on difficult-to-classify instances and less on those already handled well.
New weak learners are added sequentially, each focusing its training on the more difficult examples.
Adaboost: Example
10 data points in R2
The size of + or - indicates the weight, which starts from uniform D1
Toy Example
weak classifiers = vertical or horizontal half-planes
Base algorithm is decision stump:
Observe that no stump can predict very accurately for this dataset.
Round 1
Adaboost: Example
3 misclassified (circled): ε_1 = 0.3 → α_1 = 0.42.
D2 puts more weights on these misclassified points.
Round 2
Adaboost: Example
3 misclassified (circled): ε_2 = 0.21 → α_2 = 0.65.
D3 puts more weights on these misclassified points.
Round 3
Adaboost: Example
again 3 misclassified (circled): ε_3 = 0.14 → α_3 = 0.92.
All data points are now classified correctly, even though each weak classifier makes 3 mistakes.
Final Classifier
f_boost = sign(0.42 f_1 + 0.65 f_2 + 0.92 f_3)
Adaboost: Example