CMPUT 366 F20: Supervised Learning IV
James Wright & Vadim Bulitko
November 17, 2020
Lecture Outline
Decision trees
Linear regression
PM 7.3
Decision Trees
A (binary) decision tree is a tree in which:
every internal node is labeled with a condition (a Boolean function of an example)
every internal node has two children, one labeled true and one labeled false
every leaf node is labeled with a point estimate on the target or a probability distribution over possible target values
Decision Trees: An Example
Learning Decision Trees
How should an agent choose a decision tree?
Bias: which decision trees are preferable to others?
Search: how can we search the space of decision trees?
the search space is prohibitively large for a complete search
local search: choose features to branch on one by one
Tree Learner
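A minimal Python sketch of a greedy tree learner along these lines (not the textbook's exact pseudocode; the function names, the 0/1-error split scoring, and the most-frequent-value leaves are illustrative assumptions):

```python
# A minimal sketch of a greedy decision-tree learner, assuming Boolean
# conditions and 0/1 error (so leaves predict the most frequent target value).
from collections import Counter

def learn_tree(examples, conditions, target):
    targets = [target(e) for e in examples]
    # Stopping criteria: no more conditions, no more examples,
    # or all examples have the same label.
    if not conditions or not examples or len(set(targets)) == 1:
        return Counter(targets).most_common(1)[0][0] if targets else None
    # Greedy/myopic choice: pick the condition with the lowest training error
    # if it were the only split.
    def split_error(cond):
        err = 0
        for value in (True, False):
            ts = [target(e) for e in examples if cond(e) == value]
            if ts:
                err += len(ts) - Counter(ts).most_common(1)[0][1]
        return err
    best = min(conditions, key=split_error)
    rest = [c for c in conditions if c is not best]
    true_branch = learn_tree([e for e in examples if best(e)], rest, target)
    false_branch = learn_tree([e for e in examples if not best(e)], rest, target)
    return (best, true_branch, false_branch)

def predict(tree, example):
    # Leaves are point estimates; internal nodes are (condition, true, false).
    while isinstance(tree, tuple):
        cond, t_branch, f_branch = tree
        tree = t_branch if cond(example) else f_branch
    return tree
```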
Stopping Criterion
Basic criteria:
No more conditions Cs
No more examples Es
All examples have the same label
Additional criteria:
Minimum child size: do not split a node if there would be too few examples in one of the children
Minimum number of examples: do not split a node with too few examples
Sufficient improvement: do not split a node unless it improves some metric sufficiently
Maximum depth: do not split if the depth reaches a maximum
Tree Leaves
Point estimates to go into tree leaves:
Most frequent target value
Median target value
Mean target value
What point estimate optimally classifies the leaf’s examples?
depends on the error function (see Proposition 7.1 in PM):
0/1 error → most frequent target value
absolute error → median target value
squared error → mean target value
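As a quick illustration of these pairings, the snippet below computes all three point estimates for a small, made-up list of leaf target values using Python's statistics module:

```python
# Point estimates for a leaf's target values, paired with the error
# function each one minimizes (Proposition 7.1 in PM).
import statistics

leaf_targets = [1, 1, 2, 2, 2, 7]           # hypothetical leaf examples
mode = statistics.mode(leaf_targets)        # minimizes 0/1 error
median = statistics.median(leaf_targets)    # minimizes absolute error
mean = statistics.mean(leaf_targets)        # minimizes squared error
print(mode, median, mean)                   # 2, 2.0, 2.5
```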
Split Conditions
Boolean features can be used directly
non-Boolean features:
partition domain into subsets
one branch for each domain element
Choosing Split Conditions
Greedy/myopic selection:
choose the condition that would result in the lowest training error if it were the only split
Example: choose a split condition with the highest information gain
Decision Trees: An Example
Choosing Split Conditions: PM example 7.9
Log likelihood: log P(E|h) = log ∏_{e∈E} P(e|h) = ∑_{e∈E} log P(e|h)
want to maximize to get h_ML
Log loss: −(1/|E|) ∑_{e∈E} log P(e|h)
want to minimize to get h_ML
Our leaf predictors/hypotheses will be distributions:
P(reads), P(skips) = 1 − P(reads)
Will use the base-2 logarithm: log₂
Choosing Split Conditions: PM example 7.9
No split (a single leaf): the optimal prediction is P(reads) = P(skips) = 9/18 = 0.5
Log loss: −(1/|E|) ∑_{e∈E} log₂ P(e|h) = −(log₂ 0.5 + ··· + log₂ 0.5)/18 = −(18 log₂ 0.5)/18 = −log₂ 0.5 = 1
Choosing Split Conditions: PM example 7.9
Split on Author:
Subset of E when Author = known is {e1, e4, e5, e6, e9, e10, e12, e13, e14, e15, e16, e17}
for half of them the target is reads so the optimal prediction is P(reads) = P(skips) = 6/12 = 0.5
Subset of E when Author = unknown is {e2, e3, e7, e8, e11, e18}
for half of them the target is reads so the optimal prediction is P(reads) = P(skips) = 3/6 = 0.5
Log loss: −(1/|E|) ∑_{e∈E} log₂ P(e|h) = −(12 log₂(6/12) + 6 log₂(3/6))/18 = −(18 log₂ 0.5)/18 = −log₂ 0.5 = 1
Choosing Split Conditions: PM example 7.9
Split on Thread:
Subset of E when Thread = new is {e1, e2, e5, e8, e10, e12, e14, e15, e17, e18}
the optimal prediction is P(reads) = 7/10, P(skips) = 3/10
Subset of E when Thread = followup is {e3, e4, e6, e7, e9, e11, e13, e16}
the optimal prediction is P(reads) = 2/8, P(skips) = 6/8
Log loss: −(1/|E|) ∑_{e∈E} log₂ P(e|h) = −(3 log₂(3/10) + 7 log₂(7/10) + 2 log₂(2/8) + 6 log₂(6/8))/18 ≈ 0.85
Choosing Split Conditions: PM example 7.9
Split on Length:
Subset of E when Length = long is {e1, e3, e4, e6, e9, e10, e12}
the optimal prediction is P(reads) = 0/7, P(skips) = 7/7
Subset of E when Length = short is {e2, e5, e7, e8, e11, e13, e14, e15, e16, e17, e18}
the optimal prediction is P(reads) = 9/11, P(skips) = 2/11
Log loss: −(1/|E|) ∑_{e∈E} log₂ P(e|h) = −(7 log₂(7/7) + 9 log₂(9/11) + 2 log₂(2/11))/18 ≈ 0.417
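Splitting on Length gives the lowest log loss of the three candidate conditions, so the greedy learner branches on it first. The short sketch below reproduces the numbers above from the (reads, total) counts of each child; the helper name is an illustrative assumption.

```python
# Log loss (base 2, averaged over all 18 examples) of a candidate split,
# given (number that read, total) counts for each child node.
from math import log2

def split_log_loss(children, total=18):
    loss = 0.0
    for reads, n in children:
        p = reads / n                       # optimal leaf prediction P(reads)
        if reads:
            loss -= reads * log2(p)
        if n - reads:
            loss -= (n - reads) * log2(1 - p)
    return loss / total

print(split_log_loss([(9, 18)]))            # no split:  1.0
print(split_log_loss([(6, 12), (3, 6)]))    # Author:    1.0
print(split_log_loss([(7, 10), (2, 8)]))    # Thread:   ~0.85
print(split_log_loss([(0, 7), (9, 11)]))    # Length:   ~0.417
```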
Linear Regression
Linear regression is the problem of fitting a linear function to a set of training examples
both input and target features must be numeric
Linear function of the input features:
Ŷw(e) = w₀ + w₁X₁(e) + ··· + wₙXₙ(e) = ∑_{i=0}^{n} wᵢXᵢ(e), where X₀(e) = 1
Linear Regression: finding w
For some loss functions (e.g., squared error), linear regression has a closed-form solution for the w that minimizes the error
We can also use gradient descent
gradient descent is an iterative method to find the minimum of a function
for minimizing error: wᵢ ← wᵢ − η ∂error(E, w)/∂wᵢ
squared error: error(E) = ∑_{e∈E} (Y(e) − Ŷ(e))²
squared error for Ŷw: error(E, w) = ∑_{e∈E} (Y(e) − ∑_{i=0}^{n} wᵢXᵢ(e))²
∂error(E, w)/∂wᵢ = −2 ∑_{e∈E} (Y(e) − Ŷw(e)) Xᵢ(e)
Types of Gradient Descent
Full: update weights after a full sweep through the training data E:
wᵢ ← wᵢ − η ∂error(E, w)/∂wᵢ
for squared error: ∂error(E, w)/∂wᵢ = −2 ∑_{e∈E} (Y(e) − Ŷw(e)) Xᵢ(e)
guaranteed to converge to a global minimum of error for convex error functions and small enough η
Incremental: update weights after each training sample e:
wᵢ ← wᵢ − η ∂error(e, w)/∂wᵢ
for squared error: ∂error(e, w)/∂wᵢ = −2 (Y(e) − Ŷw(e)) Xᵢ(e)
no convergence guarantees as w changes between examples
can select examples at random: stochastic gradient descent
Batched: the error is computed over a batch of examples
batch = {e} → incremental gradient descent
batch = E → full gradient descent
Incremental Gradient Descent for Ŷw
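A minimal Python sketch of this idea for Ŷw with squared error; the toy data set, learning rate, and epoch count are illustrative assumptions, not part of the original slide.

```python
# Incremental (per-example) gradient descent for linear regression with
# squared error. X0(e) is handled by prepending a constant 1 feature.
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def incremental_gd(examples, n_features, eta=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (n_features + 1)           # w0 is the bias weight
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)              # random order -> stochastic GD
        for x, y in examples:
            x = [1.0] + list(x)            # X0(e) = 1
            err = y - predict(w, x)
            # wi <- wi - eta * d error(e, w)/d wi  =  wi + 2*eta*err*Xi(e)
            w = [wi + 2 * eta * err * xi for wi, xi in zip(w, x)]
    return w

# Toy usage: learn y = 1 + 2*x1 from a few (x, y) pairs (assumed data).
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
print(incremental_gd(data, n_features=1))
```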
Types of Error Functions for Gradient Descent
Can use gradient descent with any error function that:
is differentiable (almost) everywhere
absolute error is not differentiable at 0, but we can define it to be 0 there
the derivative gives useful information (i.e., is not 0 almost everywhere)
0/1 error does not work
Gradient descent plays an important role in Deep Learning
Linear Classification
For binary targets represented by {0, 1} and numeric input features, we can use a linear function to estimate the probability of the class
but we need to constrain the output to lie within [0, 1]
instead of outputting the result of the linear combination directly, send it through an activation function f : ℝ → [0, 1]:
Ŷw(e) = f(∑_{i=0}^{n} wᵢXᵢ(e))
Discuss why step₀(x) (1 when x ≥ 0, 0 when x < 0) is not a good activation function for gradient descent
Logistic Regression
A common activation function is the sigmoid or logistic function: f(x) = 1/(1 + e^(−x))
Differentiable: f′(x) = f(x)(1 − f(x))
[plot of the sigmoid function for x ∈ [−10, 10]]
Learning weights of a sigmoid-capped linear function is called logistic regression
Stochastic Gradient Descent for Logistic Regression (for Log Loss)
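A minimal Python sketch of stochastic gradient descent for logistic regression with log loss; it uses the fact that for a sigmoid output p = f(w·x), the per-example gradient of the (natural-log) log loss with respect to wᵢ is (p − Y(e))·Xᵢ(e). The toy AND data and hyperparameters are illustrative assumptions.

```python
# Stochastic gradient descent for logistic regression with log loss.
# For sigmoid output p = f(w . x), the per-example gradient of the log loss
# with respect to wi simplifies to (p - y) * xi.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(examples, n_features, eta=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (n_features + 1)           # w0 is the bias weight
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            x = [1.0] + list(x)            # X0(e) = 1
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi - eta * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Toy usage: learn the AND concept on {0,1}^2 (assumed data).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = sgd_logistic(data, n_features=2)
print([round(sigmoid(w[0] + w[1] * x1 + w[2] * x2), 2) for (x1, x2), _ in data])
```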
Back to Email Classification Example
Limits of Linear Classifiers
Can linear classifiers learn any binary concept?
A binary concept (i.e., two classes) is linearly separable if there exists a hyperplane separating class 0 from class 1
A linear classifier can learn any linearly separable concept (and only those)
What about the three concepts – OR, AND, XOR – below?
[plots of the OR, AND, and XOR training examples in the x₁–x₂ plane]
Try to linearly separate + examples from − examples for XOR. What does that say about creating a linear classifier for XOR?
Logical AND Expressed by a Linear Classifier
[plots over x₁, x₂ ∈ [0, 1]: −0.7 + 0.5x₁ + 0.5x₂; step₀(−0.7 + 0.5x₁ + 0.5x₂); sigmoid(−0.7 + 0.5x₁ + 0.5x₂) ≥ 0.5]
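A quick check that the weights shown above, (−0.7, 0.5, 0.5), implement logical AND when the linear output is thresholded at 0 (equivalently, sigmoid ≥ 0.5):

```python
# Verify that -0.7 + 0.5*x1 + 0.5*x2, thresholded at 0, computes AND.
def step0(z):
    return 1 if z >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        z = -0.7 + 0.5 * x1 + 0.5 * x2
        print(x1, x2, step0(z))   # 1 only when x1 = x2 = 1
```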
Logical XOR Cannot be Expressed by a Linear Classifier
Logical XOR Expressed by a Multi-layer Neural Network
Why Didn’t It Work?
h₁ = x₁ − x₂, h₂ = x₂ − x₁, y = h₁ + h₂
[plots of h₁, h₂, y, and y ≥ 0.5 over x₁, x₂ ∈ [0, 1]]
A Combination of Linear Functions is Linear
Need non-linearity
Can use a rectified linear unit (ReLU): f(x) = max{0, x}
[plot of ReLU for x ∈ [−10, 10]]
Logical XOR Expressed by a Non-Linear Multi-layer Neural Network
h₁ = x₁ − x₂, h₂ = x₂ − x₁, y = relu(h₁) + relu(h₂)
[plots of h₁, relu(h₁), h₂, relu(h₂), y, and y ≥ 0.5 over x₁, x₂ ∈ [0, 1]]
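To confirm the construction above, the snippet below evaluates y = relu(x₁ − x₂) + relu(x₂ − x₁) on the four Boolean inputs and thresholds it at 0.5; the output matches XOR, which no single linear unit can produce.

```python
# Verify that a two-unit ReLU network computes XOR:
#   h1 = x1 - x2,  h2 = x2 - x1,  y = relu(h1) + relu(h2),  predict y >= 0.5
def relu(z):
    return max(0.0, z)

for x1 in (0, 1):
    for x2 in (0, 1):
        y = relu(x1 - x2) + relu(x2 - x1)
        print(x1, x2, int(y >= 0.5))   # 1 exactly when x1 != x2
```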