Introduction to Machine Learning
Random Forests
Prof. Kutty
Today’s Agenda
• Decision Trees
– (Shannon) Entropy
– Learning decision trees
• Ensemble Methods
– Bootstrap Sampling
– Bagging
– Random Forests
Decision Trees: Definition
Each internal node tests an attribute/feature xi
Each branch assigns a value xi = v or xi > v
Each leaf assigns a class (binary or discrete) ŷ
Goal: find the smallest decision tree that minimizes error
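To make this concrete, here is a minimal sketch (not from the lecture) of one way to represent and evaluate such a tree in Python; the Node fields and the tiny example tree are illustrative assumptions, not the lecture's notation.

# A minimal sketch of a decision tree as a recursive data structure.
class Node:
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature            # attribute x_i tested at an internal node
        self.children = children or {}    # one branch per value: {v: subtree}
        self.label = label                # class assigned at a leaf (None if internal)

def predict(node, example):
    """Follow branches x_i = v until a leaf is reached, then return its class."""
    if node.label is not None:            # leaf node
        return node.label
    value = example[node.feature]         # test the attribute at this node
    return predict(node.children[value], example)

# Tiny hypothetical tree: split on Outlook, then on Wind for rainy days.
tree = Node(feature="Outlook", children={
    "Sunny":    Node(label="No"),
    "Overcast": Node(label="Yes"),
    "Rain":     Node(feature="Wind", children={
        "Weak":   Node(label="Yes"),
        "Strong": Node(label="No"),
    }),
})

print(predict(tree, {"Outlook": "Rain", "Wind": "Strong"}))   # -> "No"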
Applications of Decision Trees
E.g., Risk stratification for 1-yr mortality in acute HF [Arenja, 2011]
Human interpretable
Algorithm:
• start with an empty tree
• split on a feature
• recurse
[PlayTennis training data (Day, Outlook, Temperature, Humidity, Wind, PlayTennis); shown in full on the next slide]
Learning Decision Trees
Improved Algorithm:
• start with an empty tree
• split on the best feature (the one that would lead to the smallest tree)
• recurse

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Entropy and Conditional Entropy
Entropy: uncertainty in the class label Y
(binary classification here; the same definition applies to K classes)
H(Y) = -\sum_{k=1}^{K} \Pr(Y = y_k) \log \Pr(Y = y_k)

Uncertainty in the class label Y, given that a specific feature X took a particular value v_j:
H(Y \mid X = v_j) = -\sum_{k=1}^{K} \Pr(Y = y_k \mid X = v_j) \log \Pr(Y = y_k \mid X = v_j)

Conditional Entropy: uncertainty in the class label Y if you knew the value of a specific feature X,
i.e., the weighted average of the uncertainty in the label Y given that X took on each of the values it could take:
H(Y \mid X) = \sum_{j} \Pr(X = v_j) \, H(Y \mid X = v_j)
Conditional Entropy
For a K-class problem:
H(Y \mid X = v_j) = -\sum_{k=1}^{K} \Pr(Y = y_k \mid X = v_j) \log \Pr(Y = y_k \mid X = v_j)
Goal: compute H(Y \mid X = v_j) for each value v_j that the feature X takes in the data.
[worked example on the slide: plug in the empirical class proportions for each feature value]

Conditional Entropy
H(Y \mid X) = \sum_{j} \Pr(X = v_j) \, H(Y \mid X = v_j)
Goal: compute H(Y \mid X) as the weighted average of the per-value entropies H(Y \mid X = v_j), weighted by \Pr(X = v_j).
[worked example on the slide]
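A minimal sketch (not from the slides) of these two quantities in Python, evaluated on the Wind feature of the PlayTennis table above; base-2 logarithms and the helper names are my own assumptions.

from collections import Counter
from math import log2

def entropy(labels):
    """H(Y): uncertainty in the class label, in bits (log base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(Y | X): weighted average of H(Y | X = v) over the values v of X."""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        total += (len(subset) / n) * entropy(subset)   # Pr(X = v) * H(Y | X = v)
    return total

# PlayTennis labels (D1..D14) and the Wind feature column
play = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
wind = ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
        "Weak","Weak","Weak","Strong","Strong","Weak","Strong"]

print(entropy(play))                     # ≈ 0.940
print(conditional_entropy(wind, play))   # ≈ 0.892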
Information Gain (aka mutual information)
IG(X, Y) = H(Y) - H(Y \mid X)
Intuitively, the amount of information we learn about the class label Y by observing a specific feature X.
• Suppose the feature X and class label Y are independent random variables. Then H(Y \mid X) = H(Y), so IG(X, Y) = 0.
• Suppose the class label Y is completely determined by the feature X. Then H(Y \mid X) = 0, so IG(X, Y) = H(Y).
Recall:
H(Y \mid X) = \sum_{j} \Pr(X = v_j) \, H(Y \mid X = v_j)
H(Y \mid X = v_j) = -\sum_{k=1}^{K} \Pr(Y = y_k \mid X = v_j) \log \Pr(Y = y_k \mid X = v_j)
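As a concrete check (not worked on this slide), plugging the Wind feature of the PlayTennis table into these definitions, with base-2 logarithms assumed:

H(Y) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) ≈ 0.940
H(Y \mid Wind = Weak) = -(6/8) \log_2(6/8) - (2/8) \log_2(2/8) ≈ 0.811
H(Y \mid Wind = Strong) = -(3/6) \log_2(3/6) - (3/6) \log_2(3/6) = 1.000
H(Y \mid Wind) = (8/14)(0.811) + (6/14)(1.000) ≈ 0.892
IG(Wind, Y) = 0.940 - 0.892 ≈ 0.048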
Learning Decision Trees
Goal: find the smallest decision tree that minimizes error
A Greedy Approach:
• start with an empty tree
• split on the "best" feature
• recurse
"best" feature = maximum Information Gain:
x_s = \arg\max_x IG(x, y)
    = \arg\max_x [H(y) - H(y \mid x)]
    = \arg\min_x H(y \mid x)
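A sketch of this greedy selection on the PlayTennis data (my own illustration, not from the slides): it ranks features by information gain, equivalently by smallest conditional entropy. Temperature is omitted to keep the snippet short, and the helper names are arbitrary.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    n = len(ys)
    total = 0.0
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        total += (len(sub) / n) * entropy(sub)
    return total

# PlayTennis data as feature columns (D1..D14)
data = {
    "Outlook":  ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                 "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal",
                 "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":     ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                 "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
}
y = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

# argmax_x IG(x, y) is the same as argmin_x H(y | x)
gains = {f: entropy(y) - conditional_entropy(xs, y) for f, xs in data.items()}
best = max(gains, key=gains.get)
print(gains)   # Outlook has the largest gain (≈ 0.246)
print(best)    # -> "Outlook"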
Learning Decision Trees
Goal: find the smallest decision tree that minimizes error
A Greedy Approach:
• start with an empty tree
• split on the "best" feature
• recurse
until when?
Learning Decision Trees: Stopping Criterion
1 When all records have the same label (assumes no noise)
2 If all records have identical features (no further splits possible)
3 If all attributes have zero IG
Ignores potential interaction between features (e.g., xor)
Why might this approach be a bad idea?
Example: XOR. With y = x1 XOR x2, H(y) = 1, but IG(x1, y) = 0 and IG(x2, y) = 0, so criterion 3 stops before making any split. The resulting tree predicts a single label and has 50% training error, even though splitting on x1 and then x2 would achieve the best possible (zero) training error.
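A quick numeric check of this XOR pathology (my own illustration, not from the slides), treating the four XOR rows as equally likely:

from math import log2

def H(ps):
    """Entropy of a distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in ps if p > 0)

# y = x1 XOR x2, all four (x1, x2) rows equally likely.
# Marginally, y is 0 half the time and 1 half the time:
H_y = H([0.5, 0.5])                                        # = 1 bit
# Conditioned on x1 = 0 (or x1 = 1), y is still split 50/50:
H_y_given_x1 = 0.5 * H([0.5, 0.5]) + 0.5 * H([0.5, 0.5])   # = 1 bit
print(H_y - H_y_given_x1)   # IG(x1, y) = 0: splitting on x1 alone looks useless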
Learning Decision Trees: Pseudocode
BuildTree(DS)
  if (y(i) == y) for all examples in DS
    return y
  elseif (x(i) == x) for all examples in DS
    return majority label
  xs = argmin_x H(y|x)
  for each value v of xs
    DSv = {examples in DS where xs = v}
    BuildTree(DSv)
[PlayTennis training data table, as above]
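A runnable version of the BuildTree pseudocode (my own sketch; function and variable names are not from the slides). It follows the same recursion: stop when the labels are pure or no informative split remains, otherwise split on the feature minimizing H(y|x).

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(examples, labels, feature):
    n = len(labels)
    total = 0.0
    for v in {ex[feature] for ex in examples}:
        sub = [y for ex, y in zip(examples, labels) if ex[feature] == v]
        total += (len(sub) / n) * entropy(sub)
    return total

def build_tree(examples, labels, features):
    # if all examples have the same label, return that label (leaf)
    if len(set(labels)) == 1:
        return labels[0]
    # if no features remain or all examples look identical, return the majority label
    if not features or all(ex == examples[0] for ex in examples):
        return Counter(labels).most_common(1)[0][0]
    # split on the feature that minimizes H(y | x), i.e., maximizes information gain
    xs = min(features, key=lambda f: conditional_entropy(examples, labels, f))
    tree = {xs: {}}
    for v in {ex[xs] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[xs] == v]
        sub_ex = [examples[i] for i in idx]
        sub_y = [labels[i] for i in idx]
        remaining = [f for f in features if f != xs]   # don't reuse the split feature
        tree[xs][v] = build_tree(sub_ex, sub_y, remaining)
    return tree

Called on the PlayTennis table, with each row given as a dict of feature values, the root split it chooses is Outlook, matching the information-gain ranking computed earlier.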
Learning Decision Trees
How do we avoid overfitting? → prefer a simpler tree
1 set a max depth
2 measure performance on validation data; if growing the tree results in worse performance, stop
3 post-prune: grow the entire/full tree and then greedily remove the nodes that affect validation error the least
Image from: http://www.eecs.wsu.edu/~cook/dm/lectures/l4/node17.html
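For instance, options 1 and 2 can be tried together with an off-the-shelf tree learner; a sketch assuming scikit-learn and a synthetic dataset (the depth values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_acc = None, 0.0
for depth in [1, 2, 3, 5, 10, None]:          # None = grow the full tree
    clf = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
    clf.fit(X_train, y_train)
    acc = clf.score(X_val, y_val)             # accuracy on held-out validation data
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)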
Sources of Error
• Our goal is to minimize generalization error
• Sources of error
– bias (structural error)
– variance (estimation error) – noise
Bias-Variance tradeoff
• to reduce bias, need larger hypothesis class F
• however, if we have a noisy/small dataset, this may increase variance
• Sources of noise:
– noisy labels
• e.g., hated the movie but clicked the 'like' button
– noisy features
• e.g., sensor error; typo in email (in the spam classification example)
Ensemble Methods
Goal: decrease variance without increasing bias
[PlayTennis training data table and the BuildTree pseudocode, as on the earlier slides]
Idea: average across multiple models to reduce variance.
Let's learn multiple decision trees and average their predictions. But how?
Idea 1: Train Decision Trees on Split Dataset
Split the training set into disjoint parts and train one decision tree on each part.
[PlayTennis training data table, partitioned across the trees]
Idea 2: Bootstrap Sampling
Pick n datapoints from the original dataset uniformly at random, with replacement.
[original dataset: the PlayTennis table]
[sampled dataset: a bootstrap sample of the same size; some rows repeat, others are left out]
Based on an example by
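A minimal sketch of drawing one bootstrap sample with NumPy (not from the slides); the "dataset" here is just the row labels D1–D14:

import numpy as np

rng = np.random.default_rng(0)
days = np.array([f"D{i}" for i in range(1, 15)])      # the 14 PlayTennis rows

# sample n = 14 indices uniformly at random, WITH replacement
idx = rng.choice(len(days), size=len(days), replace=True)
bootstrap_sample = days[idx]
print(bootstrap_sample)    # some days appear more than once, others not at all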
Bootstrap aggregating (bagging)
Draw B bootstrap samples Sn(1), Sn(2), ..., Sn(B) (B is a hyperparameter) and train one decision tree on each: DT(1), DT(2), ..., DT(B).
Classify by majority vote over DT(1), ..., DT(B).
As B → ∞, misclassification rate → 0
Assumptions:
• each decision tree has a misclassification rate of better than chance
• classifiers are independent
(the base classifiers need not be decision trees)
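A sketch of bagging by hand with scikit-learn trees (my own illustration; the synthetic dataset and B = 25 are arbitrary choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B, n = 25, len(X_train)
trees = []
for b in range(B):
    idx = rng.choice(n, size=n, replace=True)           # bootstrap sample Sn(b)
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# classify by majority vote over DT(1), ..., DT(B)
votes = np.stack([t.predict(X_test) for t in trees])    # shape (B, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)       # majority vote for 0/1 labels
print("bagged accuracy:", (majority == y_test).mean())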
Prediction error rarely goes to zero!
• Bagging reduces variance (estimation error)
– bias (structural error) may still remain
• Independence of classifiers is a strong assumption
Can we further decorrelate the Decision Trees learnt?
Random Forests
• Introduces two sources of randomness
– bagging
– random feature subset
• at each node, best split is chosen from random subset of k < d features
Slide credit
Random Forest Procedure
for b = 1, ..., B
  draw bootstrap sample Sn(b) of size n from Sn
  grow decision tree DT(b)
output ensemble {DT(1), ..., DT(B)}

[subprocedure for growing DT(b)]
until stopping criteria are met, recursively repeat the following steps for each node of the tree:
1. select k features at random from the d features
2. pick the best feature to split on (using IG)
3. split the node into children
Slide credit
Random Forest Procedure
for b = 1, ..., B
  draw bootstrap sample Sn(b) of size n from Sn
  grow decision tree DT(b)
output ensemble {DT(1), ..., DT(B)}

[subprocedure for growing DT(b)]
until stopping criteria are met, recursively repeat the following steps for each node of the tree:
1. select k features at random from the d features
2. pick the best feature to split on (using IG)
3. split the node into children

another option:
• compute IG for all features
• pick the top m
• select at random from these
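This procedure maps directly onto an off-the-shelf implementation; a sketch assuming scikit-learn, where n_estimators plays the role of B and max_features the role of k (parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = n_estimators bootstrap samples; k = max_features features tried per split
# ("sqrt" means k ≈ sqrt(d), a common default choice)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                criterion="entropy", bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))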