Introduction to Machine Learning Decision Trees
Prof. Kutty
Announcements
• Midterm Conflict forms are out; see announcement for deadline
• Project 1 is due on Wednesday
• Lecture 8 is available as an asynchronous recording in 3 parts
Today’s Agenda
• Recap: Linear Regression with Squared Loss
• Section 1: Feature Selection
• Section 2: Decision Trees
  – Interpretability
  – Expressive power
  – Learning decision trees
Recap: Linear Regression
Linear Regression
A linear regression function is simply a linear function of the feature vector:
$f(\bar{x}; \bar{\theta}, b) = \bar{\theta} \cdot \bar{x} + b$
Learning task:
Choose parameters in response to training set
$S_n = \{(\bar{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $\bar{x} \in \mathbb{R}^d$ and $y \in \mathbb{R}$
SGD with Squared Loss
Reminder: Squared loss $\mathrm{Loss}(z) = \frac{z^2}{2}$
k = 0, $\bar{\theta}^{(0)} = \bar{0}$
while convergence criteria are not met
    randomly shuffle points
    for i = 1, …, n
        $\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} - \eta_k \nabla_{\bar{\theta}} \mathrm{Loss}\!\left(y^{(i)} - \bar{\theta}^{(k)} \cdot \bar{x}^{(i)}\right)$
SGD with Squared Loss
k = 0, $\bar{\theta}^{(0)} = \bar{0}$
while convergence criteria are not met
    randomly shuffle points
    for i = 1, …, n
        $\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} + \eta_k \left(y^{(i)} - \bar{\theta}^{(k)} \cdot \bar{x}^{(i)}\right) \bar{x}^{(i)}$
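A minimal NumPy sketch of this update. The epoch count and the decaying step size $\eta_k = \eta_0/(1+k)$ stand in for the convergence check and learning-rate schedule, which the slide leaves unspecified; both are illustrative choices.

```python
import numpy as np

def sgd_squared_loss(X, y, n_epochs=100, eta0=0.01):
    """SGD for linear regression with squared loss (sketch).

    Implements the slide's update:
        theta^(k+1) = theta^(k) + eta_k * (y_i - theta^(k) . x_i) * x_i
    """
    n, d = X.shape
    theta = np.zeros(d)                       # theta^(0) = 0
    k = 0
    for _ in range(n_epochs):                 # stand-in for "while convergence criteria are not met"
        for i in np.random.permutation(n):    # randomly shuffle points
            k += 1
            eta_k = eta0 / (1 + k)            # assumed decaying step size
            theta += eta_k * (y[i] - theta @ X[i]) * X[i]
    return theta
```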
Exact Solution for Regression
The parameter value computed as
$\bar{\theta}^{*} = (X^T X)^{-1} X^T \bar{y}$
exactly minimizes the Empirical Risk with Squared Loss
$R_n(\bar{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left(y^{(i)} - \bar{\theta} \cdot \bar{x}^{(i)}\right)^2}{2}$
(set the gradient $\nabla_{\bar{\theta}} R_n(\bar{\theta})$ to $\bar{0}$ and solve for the optimal $\bar{\theta}$)
where
$X = [\bar{x}^{(1)}, \dots, \bar{x}^{(n)}]^T$   (dimension: n × d)
$\bar{y} = [y^{(1)}, \dots, y^{(n)}]^T$   (dimension: n × 1)
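A short NumPy sketch of this closed-form solution. It uses `np.linalg.solve` rather than an explicit matrix inverse for numerical stability, and assumes $X^T X$ is invertible.

```python
import numpy as np

def exact_least_squares(X, y):
    """Closed-form minimizer of empirical risk with squared loss:
    theta* = (X^T X)^{-1} X^T y  (assumes X^T X is invertible).
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```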
If an exact solution exists, why use SGD?
short answer: efficiency
rule of thumb:
• low-dimensional problems → the closed-form solution might work ok
• high-dimensional problems → SGD might work better
Ridge Regression: Closed Form Solution
$J_{n,\lambda}(\bar{\theta}) = \underbrace{\frac{\lambda}{2}\|\bar{\theta}\|^2}_{\text{penalty}} + \frac{1}{n}\sum_{i=1}^{n} \frac{\left(y^{(i)} - \bar{\theta} \cdot \bar{x}^{(i)}\right)^2}{2}$
Set $\nabla J_{n,\lambda}(\bar{\theta})\big|_{\bar{\theta}^{*}} = \bar{0}$ and solve:
$\bar{\theta}^{*} = (\lambda' I + X^T X)^{-1} X^T \bar{y}$ with $\lambda' = n\lambda$; invertible as long as $\lambda > 0$
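A small NumPy sketch of the ridge closed form; the argument `lam` plays the role of $\lambda'$ above and is just an illustrative parameter name.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution theta* = (lam * I + X^T X)^{-1} X^T y.
    Any lam > 0 makes the matrix invertible.
    """
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)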
Section 1: Feature Selection
Feature Selection
Motivation
• When you have few examples and a large number of features (i.e., d>>n) it becomes very easy to overfit your training data
• How can we remove uninformative features?
Different FS Approaches:
1. Filter
2. Wrapper
3. Embedded
Feature Selection
Embedded Methods:
Incorporate variable selection as part of the training process
L2 regularization:
$\min\ \|\bar{\theta}\|^2 + C \sum_{i=1}^{n} \xi_i$   s.t. $y^{(i)}(\bar{\theta} \cdot \bar{x}^{(i)} + b) \geq 1 - \xi_i$, $\xi_i \geq 0$ for $i = 1, \dots, n$
L1 regularization:
$\min\ \|\bar{\theta}\|_1 + C \sum_{i=1}^{n} \xi_i$   s.t. $y^{(i)}(\bar{\theta} \cdot \bar{x}^{(i)} + b) \geq 1 - \xi_i$, $\xi_i \geq 0$ for $i = 1, \dots, n$,   where $\|\bar{\theta}\|_1 = \sum_{j=1}^{d} |\theta_j|$
When C is sufficiently small, the L1-norm penalty will shrink some parameters to exactly zero → implicit (or embedded) feature selection
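A sketch of embedded selection with an L1 penalty, assuming scikit-learn is available. `LinearSVC` with `penalty="l1"` uses the squared hinge loss, which is close to (but not identical to) the formulation above; the synthetic data and the value of C are made up for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative data: 100 points, 20 features, only the first 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.where(X[:, 0] + 0.5 * X[:, 1] - X[:, 2] > 0, 1, -1)

# L1-penalized linear classifier: a small C drives many weights to exactly 0,
# so the surviving nonzero weights act as the selected features.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.05)
clf.fit(X, y)
selected = np.flatnonzero(clf.coef_)
print("selected feature indices:", selected)
```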
Feature Selection
Filter Approach:
• rank features according to some metric (independent of learning algorithm)
• filter out features that fall below a certain threshold
E.g., correlation with output (i.e., label)
Pearson's correlation:
$r_{x_j, y} = \dfrac{\sum_{i=1}^{n} (x_j^{(i)} - \tilde{x}_j)(y^{(i)} - \tilde{y})}{\sqrt{\sum_{i=1}^{n} (x_j^{(i)} - \tilde{x}_j)^2}\ \sqrt{\sum_{i=1}^{n} (y^{(i)} - \tilde{y})^2}}$
where $\tilde{x}_j$ and $\tilde{y}$ are the sample means
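A possible NumPy implementation of this filter: compute each feature's Pearson correlation with the label and keep features above a threshold (the threshold value is an arbitrary illustrative choice).

```python
import numpy as np

def pearson_filter(X, y, threshold=0.1):
    """Rank features by |Pearson correlation| with the label and
    return the indices that clear the threshold, plus the scores."""
    Xc = X - X.mean(axis=0)          # subtract sample means x~_j
    yc = y - y.mean()                # subtract sample mean y~
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.flatnonzero(np.abs(r) >= threshold), r
```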
Feature Selection
Wrapper Approach:
• utilizes learning algorithm to score subsets according to predictive power
• learning algorithm is “wrapped” in a search algorithm
Feature Selection
Filter Approach
• performed only once
• ignores the performance of the model
Wrapper Approach
• ability to take into account feature dependencies
• considers performance of model
• computationally expensive
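A sketch of one common wrapper, greedy forward selection, assuming scikit-learn is available. The wrapped learner (logistic regression) and the 5-fold cross-validation score are illustrative choices, not prescribed by the slides.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k):
    """Greedily grow a feature subset of size k, scoring each candidate
    subset by the cross-validated accuracy of the wrapped learner."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        scores = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)   # feature that helps the model most
        selected.append(best)
        remaining.remove(best)
    return selected
```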
Section 2: Decision Trees
Interpretability of Linear decision boundaries
• Problem: predict whether a target user likes a target movie
  – Features: percentage of comedic scenes, percentage of scenes where your favorite actor appears
  – Labels: like/dislike
what does $\bar{\theta}$ mean?
Interpretability and Kernels
Problem: can’t recover model parameters, classifier becomes a black box
image source: http://www.eric-kim.net
Kernel trick
• maps data to a higher-dimensional space in which there exists a separating hyperplane (this corresponds to a non-linear decision boundary in the original space)
• never have to explicitly compute feature mappings
Decision Trees for Classification
Human interpretable
Play quidditch?
Supervised Learning with Decision Trees
Day   Outlook    Temperature   Humidity   Wind     Play quidditch?
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No

• Feature space 𝒳
  – each feature vector x is in 𝒳, e.g. [Sunny, Hot, High, Weak]ᵀ
• Label space 𝒴 is discrete or continuous
  – e.g., 𝒴 = {Yes, No}
Construct a decision tree in response to the training set
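As a rough illustration (not part of the lecture), the training set above can be fed to scikit-learn's `DecisionTreeClassifier`. Note that the ordinal encoding and binary threshold splits used by scikit-learn differ from the multiway categorical splits drawn in lecture, so the learned tree will look somewhat different.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# The quidditch training set (D1-D14) from the slide.
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":        ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# scikit-learn trees need numeric inputs, so encode the categorical features.
enc = OrdinalEncoder()
X = enc.fit_transform(data.drop(columns="Play"))
tree = DecisionTreeClassifier(criterion="entropy").fit(X, data["Play"])
print(export_text(tree, feature_names=list(data.columns[:-1])))
```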
Decision Trees: Definition
Each internal node tests an attribute/feature xi
Human interpretable
Each branch assigns a value xi = v or xi > v
Each leaf assigns a class (binary or discrete) ŷ
Decision Trees
Hot Hot Hot Mild Cool Cool Cool Mild Cool Mild Mild Mild Hot Mild
High Weak No
Note 1: feature space could be continuous
Day D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14
Outlook Sunny Rain Rain Rain Overcast Rain Overcast Rain
Play quidditch
Temperature Humidity Wind PlayTennis
High Strong No
High Weak Yes
High Weak Yes
Normal Weak Yes
60 No4r5mal H99ig.5h No4r0mal No4r8mal No5r5mal H9i8gh No6r7mal H9i7gh
Str2o3ng Yes
W1ea5k No W1ea2k Yes We9ak Yes
Str3o0ng Yes Str4o0ng Yes We2ak Yes Str3o0ng No
Note 2: decision trees can represent classification and regression problems
Decision Trees: Example 1
Decision Trees: Example 2
Your turn!
Goal: find the smallest decision tree that minimizes error
Decision Trees: Challenge
Goal: find the smallest decision tree that minimizes error
• NP-hard in general
• Idea: use heuristics
– greedy approach to picking ‘best’ attribute to split on
Decision Trees: Greedy approach
Idea: Pick a split that leads to more certainty in classification
Choice 1: split on x1
Choice 2: split on x2
Measuring uncertainty: (Shannon) Entropy
Entropy of the training examples Sn
for binary classification
$\mathrm{Entropy}(S_n) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
where $p_{\oplus}$ is the proportion of positive examples and $p_{\ominus}$ is the proportion of negative examples
Note: assume by convention 0 log 0 = 0
(Shannon) Entropy for binary classification
Entropy of the training examples Sn
$\mathrm{Entropy}(S_n) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
in general (multi-class classification):
$\mathrm{Entropy}(S_n) = -\sum_{i} p_i \log_2 p_i$
where $p_i$ is the proportion of datapoints that belong in class $i$
Note: assume by convention 0 log 0 = 0
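A small Python sketch of this computation (the helper name `entropy` is just an illustrative choice):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of class labels (multi-class):
    Entropy(S) = -sum_i p_i * log2(p_i), with 0 log 0 treated as 0."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))   # counts > 0, so no 0 log 0 terms arise

# e.g. entropy(["Yes"] * 9 + ["No"] * 5) ≈ 0.940 for the 14-example quidditch training set
```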
Entropy and ‘peakyness’
[Two histograms of # occurrences over values 1–6: a flatter, more uncertain distribution has higher entropy; a peakier, less uncertain distribution has lower entropy]
Entropy and Conditional Entropy
uncertainty in the class label Y:
$H(Y) = -\sum_{i=1}^{k} \Pr(Y = y_i) \log \Pr(Y = y_i)$
uncertainty in the class label Y, given that a specific feature X took a particular value $x_j$:
$H(Y \mid X = x_j) = -\sum_{i=1}^{k} \Pr(Y = y_i \mid X = x_j) \log \Pr(Y = y_i \mid X = x_j)$
uncertainty in the class label Y, if you knew the value of a specific feature X (Conditional Entropy):
$H(Y \mid X) = \sum_{j} \Pr(X = x_j)\, H(Y \mid X = x_j)$
Information Gain (aka mutual information)
$IG(X, Y) = H(Y) - H(Y \mid X)$
Intuitively, the amount of information we learn about the class label Y from a specific feature X
• Suppose the feature X and the class label Y are independent random variables; then $H(Y \mid X) = H(Y)$, so $IG(X, Y) = 0$
• Greedy approach: split on the feature with the largest information gain
$\arg\max_{X} IG(X, Y) = \arg\max_{X} \left( H(Y) - H(Y \mid X) \right) = \arg\min_{X} H(Y \mid X)$
where
$H(Y \mid X) = \sum_{j} \Pr(X = x_j)\, H(Y \mid X = x_j)$
$H(Y \mid X = x_j) = -\sum_{i=1}^{k} \Pr(Y = y_i \mid X = x_j) \log \Pr(Y = y_i \mid X = x_j)$
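A sketch of conditional entropy and information gain in Python; the function names are illustrative, and the `entropy` helper repeats the one sketched earlier so the block runs on its own.

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i (same as the earlier sketch)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(feature, labels):
    """H(Y|X) = sum_j Pr(X = x_j) * H(Y | X = x_j)."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    total = len(labels)
    return sum(
        (np.sum(feature == v) / total) * entropy(labels[feature == v])
        for v in np.unique(feature)
    )

def information_gain(feature, labels):
    """IG(X, Y) = H(Y) - H(Y|X); maximizing IG is the same as minimizing H(Y|X)."""
    return entropy(labels) - conditional_entropy(feature, labels)
```

On the quidditch training set above, Outlook yields the largest information gain of the four features, so a greedy learner would test Outlook at the root.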