Introduction to Machine Learning
Random Forests
Prof. Kutty
Today’s Agenda
• Decision Trees
– (Shannon) Entropy
– Learning decision trees
• Ensemble Methods
– Bootstrap Sampling
– Bagging
– Random Forests
Decision Trees: Definition
Each internal node tests an attribute/feature xi
Each branch assigns a value xi = v or xi > v
Each leaf assigns a class (binary or discrete) ŷ
Goal: find the smallest decision tree that minimizes error
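To make this concrete, here is a minimal sketch (not from the lecture) of one way to represent and evaluate such a tree in Python; the Node fields and the tiny example tree are illustrative assumptions, not the lecture's notation.

# A minimal sketch of a decision tree as a recursive data structure.
class Node:
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature            # attribute x_i tested at an internal node
        self.children = children or {}    # one branch per value: {v: subtree}
        self.label = label                # class assigned at a leaf (None if internal)

def predict(node, example):
    """Follow branches x_i = v until a leaf is reached, then return its class."""
    if node.label is not None:            # leaf node
        return node.label
    value = example[node.feature]         # test the attribute at this node
    return predict(node.children[value], example)

# Tiny hypothetical tree: split on Outlook, then on Wind for rainy days.
tree = Node(feature="Outlook", children={
    "Sunny":    Node(label="No"),
    "Overcast": Node(label="Yes"),
    "Rain":     Node(feature="Wind", children={
        "Weak":   Node(label="Yes"),
        "Strong": Node(label="No"),
    }),
})

print(predict(tree, {"Outlook": "Rain", "Wind": "Strong"}))   # -> "No"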
Applications of Decision Trees
E.g., Risk stratification for 1-yr mortality in acute HF [Arenja, 2011]
Human interpretable
Algorithm:
• start with an empty tree
• split on a feature
• recurse
[PlayTennis training data (Day, Outlook, Temperature, Humidity, Wind, PlayTennis); shown in full on the next slide]
Learning Decision Trees
Improved Algorithm:
• start with an empty tree
• split on the best feature (the one that would lead to the smallest tree)
• recurse

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Entropy and Conditional Entropy
Entropy: uncertainty in the class label Y
(binary classification here; the same definition applies to K classes)
H(Y) = -\sum_{k=1}^{K} \Pr(Y = y_k) \log \Pr(Y = y_k)

Uncertainty in the class label Y, given that a specific feature X took a particular value v_j:
H(Y \mid X = v_j) = -\sum_{k=1}^{K} \Pr(Y = y_k \mid X = v_j) \log \Pr(Y = y_k \mid X = v_j)

Conditional Entropy: uncertainty in the class label Y if you knew the value of a specific feature X,
i.e., the weighted average of the uncertainty in the label Y given that X took on each of the values it could take:
H(Y \mid X) = \sum_{j} \Pr(X = v_j) \, H(Y \mid X = v_j)
Conditional Entropy
For a K-class problem:
H(Y \mid X = v_j) = -\sum_{k=1}^{K} \Pr(Y = y_k \mid X = v_j) \log \Pr(Y = y_k \mid X = v_j)
Goal: compute H(Y \mid X = v_j) for each value v_j that the feature X takes in the data.
[worked example on the slide: plug in the empirical class proportions for each feature value]

Conditional Entropy
H(Y \mid X) = \sum_{j} \Pr(X = v_j) \, H(Y \mid X = v_j)
Goal: compute H(Y \mid X) as the weighted average of the per-value entropies H(Y \mid X = v_j), weighted by \Pr(X = v_j).
[worked example on the slide]
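A minimal sketch (not from the slides) of these two quantities in Python, evaluated on the Wind feature of the PlayTennis table above; base-2 logarithms and the helper names are my own assumptions.

from collections import Counter
from math import log2

def entropy(labels):
    """H(Y): uncertainty in the class label, in bits (log base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(Y | X): weighted average of H(Y | X = v) over the values v of X."""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        total += (len(subset) / n) * entropy(subset)   # Pr(X = v) * H(Y | X = v)
    return total

# PlayTennis labels (D1..D14) and the Wind feature column
play = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
wind = ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
        "Weak","Weak","Weak","Strong","Strong","Weak","Strong"]

print(entropy(play))                     # ≈ 0.940
print(conditional_entropy(wind, play))   # ≈ 0.892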
Information Gain (aka mutual information)
IG(X, Y) = H(Y) - H(Y \mid X)
Intuitively, the amount of information we learn about the class label Y by observing a specific feature X.
• Suppose the feature X and class label Y are independent random variables. Then H(Y \mid X) = H(Y), so IG(X, Y) = 0.
• Suppose the class label Y is completely determined by the feature X. Then H(Y \mid X) = 0, so IG(X, Y) = H(Y).
Recall:
H(Y \mid X) = \sum_{j} \Pr(X = v_j) \, H(Y \mid X = v_j)
H(Y \mid X = v_j) = -\sum_{k=1}^{K} \Pr(Y = y_k \mid X = v_j) \log \Pr(Y = y_k \mid X = v_j)
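As a concrete check (not worked on this slide), plugging the Wind feature of the PlayTennis table into these definitions, with base-2 logarithms assumed:

H(Y) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) ≈ 0.940
H(Y \mid Wind = Weak) = -(6/8) \log_2(6/8) - (2/8) \log_2(2/8) ≈ 0.811
H(Y \mid Wind = Strong) = -(3/6) \log_2(3/6) - (3/6) \log_2(3/6) = 1.000
H(Y \mid Wind) = (8/14)(0.811) + (6/14)(1.000) ≈ 0.892
IG(Wind, Y) = 0.940 - 0.892 ≈ 0.048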
Learning Decision Trees
Goal: find the smallest decision tree that minimizes error
A Greedy Approach:
• start with an empty tree
• split on the "best" feature
• recurse
"best" feature = maximum Information Gain:
x_s = \arg\max_x IG(x, y)
    = \arg\max_x [H(y) - H(y \mid x)]
    = \arg\min_x H(y \mid x)
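A sketch of this greedy selection on the PlayTennis data (my own illustration, not from the slides): it ranks features by information gain, equivalently by smallest conditional entropy. Temperature is omitted to keep the snippet short, and the helper names are arbitrary.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    n = len(ys)
    total = 0.0
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        total += (len(sub) / n) * entropy(sub)
    return total

# PlayTennis data as feature columns (D1..D14)
data = {
    "Outlook":  ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                 "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal",
                 "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":     ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                 "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
}
y = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

# argmax_x IG(x, y) is the same as argmin_x H(y | x)
gains = {f: entropy(y) - conditional_entropy(xs, y) for f, xs in data.items()}
best = max(gains, key=gains.get)
print(gains)   # Outlook has the largest gain (≈ 0.246)
print(best)    # -> "Outlook"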
Learning Decision Trees
Goal: find the smallest decision tree that minimizes error
A Greedy Approach:
• start with an empty tree
• split on the "best" feature
• recurse
until when?
Learning Decision Trees: Stopping Criterion
1 When all records have the same label (assumes no noise)
2 If all records have identical features (no further splits possible)
3 If all attributes have zero IG
Ignores potential interaction between features (e.g., xor)
Why might this approach be a bad idea?
Example: XOR. With y = x1 XOR x2, H(y) = 1, but IG(x1, y) = 0 and IG(x2, y) = 0, so criterion 3 stops before making any split. The resulting tree predicts a single label and has 50% training error, even though splitting on x1 and then x2 would achieve the best possible (zero) training error.
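A quick numeric check of this XOR pathology (my own illustration, not from the slides), treating the four XOR rows as equally likely:

from math import log2

def H(ps):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in ps if p > 0)

# y = x1 XOR x2, all four (x1, x2) rows equally likely.
# Marginally, y is 0 half the time and 1 half the time:
H_y = H([0.5, 0.5])                                        # = 1 bit
# Conditioned on x1 = 0 (or x1 = 1), y is still split 50/50:
H_y_given_x1 = 0.5 * H([0.5, 0.5]) + 0.5 * H([0.5, 0.5])   # = 1 bit
print(H_y - H_y_given_x1)   # IG(x1, y) = 0: splitting on x1 alone looks useless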
Learning Decision Trees: Pseudocode
BuildTree(DS)
  if (y(i) == y) for all examples in DS
    return y
  elseif (x(i) == x) for all examples in DS
    return majority label
  xs = argmin_x H(y|x)
  for each value v of xs
    DSv = {examples in DS where xs = v}
    BuildTree(DSv)
[PlayTennis training data table, as above]
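A runnable version of the BuildTree pseudocode (my own sketch; function and variable names are not from the slides). It follows the same recursion: stop when the labels are pure or no informative split remains, otherwise split on the feature minimizing H(y|x).

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(examples, labels, feature):
    n = len(labels)
    total = 0.0
    for v in {ex[feature] for ex in examples}:
        sub = [y for ex, y in zip(examples, labels) if ex[feature] == v]
        total += (len(sub) / n) * entropy(sub)
    return total

def build_tree(examples, labels, features):
    # if all examples have the same label, return that label (leaf)
    if len(set(labels)) == 1:
        return labels[0]
    # if no features remain or all examples look identical, return the majority label
    if not features or all(ex == examples[0] for ex in examples):
        return Counter(labels).most_common(1)[0][0]
    # split on the feature that minimizes H(y | x), i.e., maximizes information gain
    xs = min(features, key=lambda f: conditional_entropy(examples, labels, f))
    tree = {xs: {}}
    for v in {ex[xs] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[xs] == v]
        sub_ex = [examples[i] for i in idx]
        sub_y = [labels[i] for i in idx]
        remaining = [f for f in features if f != xs]   # don't reuse the split feature
        tree[xs][v] = build_tree(sub_ex, sub_y, remaining)
    return tree

Called on the PlayTennis table, with each row given as a dict of feature values, the root split it chooses is Outlook, matching the information-gain ranking computed earlier.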
Learning Decision Trees
How do we avoid overfitting? → prefer a simpler tree
1 set a max depth
2 measure performance on validation data; if growing the tree results in worse performance, stop
3 post-prune: grow the entire/full tree and then greedily remove the nodes that affect validation error the least
Image from: http://www.eecs.wsu.edu/~cook/dm/lectures/l4/node17.html
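For instance, options 1 and 2 can be tried together with an off-the-shelf tree learner; a sketch assuming scikit-learn and a synthetic dataset (the depth values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_acc = None, 0.0
for depth in [1, 2, 3, 5, 10, None]:          # None = grow the full tree
    clf = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
    clf.fit(X_train, y_train)
    acc = clf.score(X_val, y_val)             # accuracy on held-out validation data
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)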
Sources of Error
• Our goal is to minimize generalization error
• Sources of error
– bias (structural error)
– variance (estimation error) – noise
Bias-Variance tradeoff
• to reduce bias, need larger hypothesis class F
• however, if we have a noisy/small dataset, this may increase variance
• Sources of noise:
– noisy labels
• e.g., hated the movie but clicked the 'like' button
– noisy features
• e.g., sensor error; typo in email (in the spam classification example)
Ensemble Methods
Goal: decrease variance without increasing bias
[PlayTennis training data table and the BuildTree pseudocode, as on the earlier slides]
Idea: average across multiple models to reduce variance.
Let's learn multiple decision trees and average their predictions. But how?
Idea 1: Train Decision Trees on Split Dataset
Split the training set into disjoint parts and train one decision tree on each part.
[PlayTennis training data table, partitioned across the trees]
Idea 2: Bootstrap Sampling
Pick n datapoints from the original dataset uniformly at random, with replacement.
[original dataset: the PlayTennis table]
[sampled dataset: a bootstrap sample of the same size; some rows repeat, others are left out]
Based on an example by
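A minimal sketch of drawing one bootstrap sample with NumPy (not from the slides); the "dataset" here is just the row labels D1–D14:

import numpy as np

rng = np.random.default_rng(0)
days = np.array([f"D{i}" for i in range(1, 15)])      # the 14 PlayTennis rows

# sample n = 14 indices uniformly at random, WITH replacement
idx = rng.choice(len(days), size=len(days), replace=True)
bootstrap_sample = days[idx]
print(bootstrap_sample)    # some days appear more than once, others not at all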
Bootstrap aggregating (bagging)
Draw B bootstrap samples Sn(1), Sn(2), ..., Sn(B) (B is a hyperparameter) and train one decision tree on each: DT(1), DT(2), ..., DT(B).
Classify by majority vote over DT(1), ..., DT(B).
As B → ∞, misclassification rate → 0
Assumptions:
• each decision tree has a misclassification rate of better than chance
• classifiers are independent
(the base classifiers need not be decision trees)
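A sketch of bagging by hand with scikit-learn trees (my own illustration; the synthetic dataset and B = 25 are arbitrary choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B, n = 25, len(X_train)
trees = []
for b in range(B):
    idx = rng.choice(n, size=n, replace=True)           # bootstrap sample Sn(b)
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# classify by majority vote over DT(1), ..., DT(B)
votes = np.stack([t.predict(X_test) for t in trees])    # shape (B, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)       # majority vote for 0/1 labels
print("bagged accuracy:", (majority == y_test).mean())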
Prediction error rarely goes to zero!
• Bagging reduces variance (estimation error)
– bias (structural error) may still remain
• Independence of classifiers is a strong assumption
Can we further decorrelate the Decision Trees learnt?
Random Forests
• Introduces two sources of randomness
– bagging
– random feature subset
• at each node, best split is chosen from random subset of k < d features
Slide credit
Random Forest Procedure
for b = 1, ..., B
  draw bootstrap sample Sn(b) of size n from Sn
  grow decision tree DT(b)
output ensemble {DT(1), ..., DT(B)}

[subprocedure for growing DT(b)]
until stopping criteria are met, recursively repeat the following steps for each node of the tree:
1. select k features at random from the d features
2. pick the best feature to split on (using IG)
3. split the node into children
Slide credit
Random Forest Procedure
for b = 1, ..., B
  draw bootstrap sample Sn(b) of size n from Sn
  grow decision tree DT(b)
output ensemble {DT(1), ..., DT(B)}

[subprocedure for growing DT(b)]
until stopping criteria are met, recursively repeat the following steps for each node of the tree:
1. select k features at random from the d features
2. pick the best feature to split on (using IG)
3. split the node into children

another option:
• compute IG for all features
• pick the top m
• select at random from these
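This procedure maps directly onto an off-the-shelf implementation; a sketch assuming scikit-learn, where n_estimators plays the role of B and max_features the role of k (parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = n_estimators bootstrap samples; k = max_features features tried per split
# ("sqrt" means k ≈ sqrt(d), a common default choice)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                criterion="entropy", bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))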