
CMPUT 366 F20: Supervised Learning III
James Wright & Vadim Bulitko
November 5, 2020

Lecture Outline
Recap from Tuesday (PM 7.1-7.2)
Decision trees (PM 7.3)
Linear regression (PM 7.3)

Minimizing Cost
The learning algorithm chooses its hypothesis f based on:
1. its error (or loss) on the training data
2. some preference over the space of hypotheses (i.e., the bias)
Together they represent a cost function defined over the hypothesis space
the learning algorithm attempts to minimize this cost
Consider each of these in turn

0/1 Error
The 0/1 error for a dataset E of examples and hypothesis f is the number of examples for which the prediction was incorrect:
∑_{e∈E} { 1 when f(e) ≠ Y(e); 0 otherwise }
Works well for binary or categorical data
Does not work well for real-valued data:
usually cannot predict the exact value
0/1 error does not take the magnitude of the misprediction into account

Absolute Error
The absolute error for a dataset E of examples and hypothesis f is the sum of absolute distances between the prediction and the actual target:
∑_{e∈E} |f(e) − Y(e)|
Works well for cardinal and possibly ordinal data:
takes into account the magnitude of misprediction
Does not make sense for categorical data

Squared Error
The squared error for a dataset E of examples and hypothesis f is the sum of squared distances between the prediction and the actual target:
∑_{e∈E} (f(e) − Y(e))²
Works well for cardinal data
takes into account the magnitude of misprediction
large errors contribute non-linearly more than small errors
Does not make sense for categorical data

Worst-case Error
The worst-case error for a dataset E of examples and hypothesis f is the absolute distance between the prediction and the actual target in the worst case:
max_{e∈E} |f(e) − Y(e)|
Works well for cardinal data
takes into account the magnitude of misprediction, but only on a single sample (i.e., the worst case)
Does not make sense for categorical data

A Comparison
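As a concrete comparison, here is a minimal Python sketch computing the four error measures for one hypothesis on an illustrative dataset (the numbers below are made up for illustration, not from the course):

```python
# Minimal sketch: the four error measures from the previous slides.
targets     = [1.0, 2.0, 3.0, 4.0]   # Y(e) for each example e in E (illustrative)
predictions = [1.2, 1.8, 3.0, 5.5]   # f(e) for each example e in E (illustrative)

zero_one_error   = sum(1 for y, p in zip(targets, predictions) if p != y)
absolute_error   = sum(abs(p - y) for y, p in zip(targets, predictions))
squared_error    = sum((p - y) ** 2 for y, p in zip(targets, predictions))
worst_case_error = max(abs(p - y) for y, p in zip(targets, predictions))

print(zero_one_error, absolute_error, squared_error, worst_case_error)
# 0/1 error only counts mispredictions (3 here); the other three also
# measure how large the mispredictions are (about 1.9, 2.33, and 1.5).
```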

Norms
Given a vector x ∈ R^n, its p-norm is
∥x∥_p = ( ∑_{i=1}^{n} |x_i|^p )^{1/p}
Absolute error is the 1-norm of the difference between the true Y and the predicted Ŷ (viewed as vectors indexed by the examples in E):
∑_{e∈E} |Y(e) − Ŷ(e)| = ∥Y − Ŷ∥_1
Squared error is the square of the 2-norm of the difference between the true Y and the predicted Ŷ:
∑_{e∈E} (Y(e) − Ŷ(e))² = (∥Y − Ŷ∥_2)²
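A quick NumPy check of these two identities (the vectors are illustrative):

```python
import numpy as np

# Vectors of true and predicted target values over the dataset E (illustrative numbers).
Y     = np.array([1.0, 2.0, 3.0])
Y_hat = np.array([1.5, 1.0, 3.0])

diff = Y - Y_hat
# Absolute error equals the 1-norm of the difference vector.
assert np.isclose(np.sum(np.abs(diff)), np.linalg.norm(diff, ord=1))
# Squared error equals the square of the 2-norm of the difference vector.
assert np.isclose(np.sum(diff ** 2), np.linalg.norm(diff, ord=2) ** 2)
```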

Probabilistic Predictions
Rather than predicting exactly what a target value will be, many common algorithms predict a probability distribution over possible values
especially for classification tasks
One-hot encoding is the common data representation for this scheme:
target features of training examples have a single 1 for the true value
target values predicted by f are probabilities that sum to 1

Likelihood
The likelihood for a dataset E of examples and hypothesis f is the probability of independently observing the examples according to the probabilities assigned by the hypothesis:
P(E|f) = ∏_{e∈E} P(e|f)
This has a clear interpretation in a Bayesian sense
Numerical stability issues:
the product of probabilities shrinks exponentially
floating point underflows almost immediately

MAP & ML
We may want the hypothesis f which is most likely given the data observed:
f_MAP = argmax_f P(f|E)
the maximum a posteriori (MAP) hypothesis
Using Bayes' theorem:
f_MAP = argmax_f P(f|E) = argmax_f P(E|f)P(f) / P(E) = argmax_f P(E|f)P(f)
A hypothesis that maximizes P(E|f) is a maximum likelihood (ML) hypothesis
P(f) expresses our bias over hypotheses (independent of E)
Without the bias term P(f) (i.e., with a uniform prior), the MAP hypothesis coincides with the ML hypothesis

An Example: Medical Diagnosis: Cold or No Cold
Test is positive when cold is present 98% of the time:
P(positive|cold) = 0.98, P(negative|cold) = 0.02
Test is negative when cold is absent 97% of the time:
P(positive|¬cold) = 0.03, P(negative|¬cold) = 0.97
Prior probability of cold is 0.8%:
P(cold) = 0.008, P(¬cold) = 0.992
Suppose the test comes back positive: E = {e} = {positive}
What is f_ML?
f_ML = argmax_{f ∈ {cold, ¬cold}} P(E|f) = argmax_{f ∈ {cold, ¬cold}} P(positive|f)
P(positive|cold) = 0.98 > P(positive|¬cold) = 0.03, so f_ML = cold
What is f_MAP?
f_MAP = argmax_{f ∈ {cold, ¬cold}} P(f|E) = argmax_{f ∈ {cold, ¬cold}} P(E|f)P(f)
P(positive|cold)P(cold) = 0.98 · 0.008 ≈ 0.0078
P(positive|¬cold)P(¬cold) = 0.03 · 0.992 ≈ 0.0298, so f_MAP = ¬cold
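The same computation as a short sketch (the dictionary keys and variable names are illustrative):

```python
# Probabilities from the slide.
p_pos_given = {"cold": 0.98, "¬cold": 0.03}    # P(positive | f)
prior       = {"cold": 0.008, "¬cold": 0.992}  # P(f)

# ML hypothesis: maximize the likelihood P(E | f) of the positive test.
f_ml = max(p_pos_given, key=lambda f: p_pos_given[f])

# MAP hypothesis: maximize P(E | f) * P(f), which is proportional to P(f | E).
f_map = max(p_pos_given, key=lambda f: p_pos_given[f] * prior[f])

print(f_ml)   # cold    (0.98 > 0.03)
print(f_map)  # ¬cold   (0.03 * 0.992 ≈ 0.0298 > 0.98 * 0.008 ≈ 0.0078)
```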

Log-Likelihood
The log-likelihood for a dataset E of examples and hypothesis f is the log-probability of independently observing the examples according to the probabilities assigned by the hypothesis:
log P(E|f) = log ∏_{e∈E} P(e|f) = ∑_{e∈E} log P(e|f)
The underflow issue is remedied:
the sum of logs decreases only linearly in the number of examples, whereas the product of probabilities shrinks exponentially
Log is monotonic, so maximizing log-likelihood imposes the same preference over hypotheses as maximizing likelihood:
P(E|f₁) > P(E|f₂) ⇐⇒ log P(E|f₁) > log P(E|f₂)
Log loss is the negative of the log-likelihood divided by the number of examples: −( ∑_{e∈E} log P(e|f) ) / |E|
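A small illustration of the underflow problem and why working in log space avoids it (the per-example probability and example count are illustrative; natural logs are used here):

```python
import math

# Suppose the hypothesis assigns probability 0.9 to each of 10,000 independent examples.
p_per_example = 0.9
n_examples = 10_000

likelihood = p_per_example ** n_examples               # underflows to 0.0 in float64
log_likelihood = n_examples * math.log(p_per_example)  # about -1053.6, easily representable

print(likelihood)                     # 0.0
print(log_likelihood)                 # about -1053.6
print(-log_likelihood / n_examples)   # log loss, about 0.105
```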

Baseline Predictors
Constant
no input features: f(X) is the same for any X
Special case: majority class in classification tasks
f(X) gives the most common Y found in the training set, regardless of X
good performance on unbalanced classification data
poor performance on balanced classification data

Best Predictor with no Input Features
Suppose V is the multiset of Y(e) for e ∈ E
Then the best predictor with no input features is:
0/1 error: the most frequent value in V
absolute error: the median of V
squared error: the mean of V
worst-case error: (max V + min V) / 2
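A minimal sketch computing these four point estimates for an illustrative multiset V:

```python
from collections import Counter
from statistics import median, mean

# Multiset of target values Y(e) for e in E (illustrative numbers).
V = [2.0, 3.0, 3.0, 5.0, 10.0]

best_for_01_error       = Counter(V).most_common(1)[0][0]  # most frequent value: 3.0
best_for_absolute_error = median(V)                        # 3.0
best_for_squared_error  = mean(V)                          # 4.6
best_for_worst_case     = (max(V) + min(V)) / 2            # 6.0

print(best_for_01_error, best_for_absolute_error,
      best_for_squared_error, best_for_worst_case)
```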

Best Predictor with no Input Features: Binary Classifier
Suppose all Y(e) are 0s and 1s
n0 is the number of e ∈ E such that Y(e) = 0
n1 is the number of e ∈ E such that Y(e) = 1
Then the best predictor with no input features is given in the sketch below
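Following the recipe from the previous slide, a sketch of the optimal constant predictions in terms of n0 and n1 (the function name is illustrative; the prediction is a probability p of the value 1, and ties n0 = n1 are broken toward 0 for the first two error measures):

```python
def best_constant_predictions(n0: int, n1: int) -> dict:
    """Optimal constant prediction p = P(Y = 1) for a binary target, given
    n0 examples with Y = 0 and n1 examples with Y = 1 (a sketch following
    the previous slide; assumes n0 > 0 and n1 > 0)."""
    return {
        "0/1 error":        1.0 if n1 > n0 else 0.0,   # the most frequent value
        "absolute error":   1.0 if n1 > n0 else 0.0,   # the median of the 0s and 1s
        "squared error":    n1 / (n0 + n1),            # the mean (empirical frequency)
        "worst-case error": 0.5,                       # midpoint of the min and max values
    }

print(best_constant_predictions(n0=3, n1=7))
```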

Summary
Supervised learning is constructing a hypothesis function from training examples
maps from input features to target features
classification: discrete target features
regression: real-valued target features
Preferences among hypotheses beyond their training-data performance are called bias
an important component of learning
A cost function combines prediction error and bias
supervised learning aims to find a hypothesis which minimizes cost
different measures of prediction error for different cases
Trivial predictors can be used as baselines

Decision Trees
A (binary) decision tree is a tree in which:
every internal node is labeled with a condition (Boolean function of an example)
every internal node has two children, one labeled true and one labeled false
every leaf node is labeled with a point estimate on the target
a value of the target
a probability distribution over possible target values
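A minimal sketch of this structure in Python (the class and field names are illustrative, not taken from PM):

```python
from dataclasses import dataclass
from typing import Callable, Union

Example = dict  # an example maps feature names to values (illustrative representation)

@dataclass
class Leaf:
    point_estimate: object      # a target value or a distribution over target values

@dataclass
class Node:
    condition: Callable[[Example], bool]   # Boolean function of an example
    if_true: Union["Node", Leaf]           # child followed when the condition is true
    if_false: Union["Node", Leaf]          # child followed when the condition is false

def predict(tree: Union[Node, Leaf], example: Example):
    """Walk from the root to a leaf and return that leaf's point estimate."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.condition(example) else tree.if_false
    return tree.point_estimate
```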

Decision Trees: An Example

Learning Decision Trees
How should an agent choose a decision tree?
Bias: which decision trees are preferable to others?
Search: how can we search the space of decision trees?
the search space is prohibitively large for a complete search
local search: choose features to branch on one by one

Tree Learner
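A minimal recursive sketch in the spirit of PM's tree learner (Figure 7.7), not its exact pseudocode; it reuses the Leaf and Node classes from the sketch above, and the parameter names (point_estimate, split_error, target) are illustrative:

```python
def learn_tree(examples, conditions, point_estimate, split_error, target):
    """Greedy, recursive decision-tree learner (a sketch, not PM's exact code).

    examples:       list of examples (dicts mapping feature names to values)
    conditions:     Boolean functions of an example that can still be split on
    point_estimate: function from a list of examples to a leaf prediction
    split_error:    function scoring a (true_examples, false_examples) split; lower is better
    target:         function from an example to its target value
    """
    targets = [target(e) for e in examples]
    # Basic stopping criteria: no conditions left, or all examples share one label.
    if not conditions or len(set(targets)) <= 1:
        return Leaf(point_estimate(examples))

    # Greedily pick the condition whose split has the lowest error.
    def score(cond):
        true_es  = [e for e in examples if cond(e)]
        false_es = [e for e in examples if not cond(e)]
        return split_error(true_es, false_es)

    best = min(conditions, key=score)
    true_es  = [e for e in examples if best(e)]
    false_es = [e for e in examples if not best(e)]
    if not true_es or not false_es:          # the split does not separate the examples
        return Leaf(point_estimate(examples))

    remaining = [c for c in conditions if c is not best]
    return Node(condition=best,
                if_true=learn_tree(true_es, remaining, point_estimate, split_error, target),
                if_false=learn_tree(false_es, remaining, point_estimate, split_error, target))
```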

Stopping Criterion
Basic criteria:
No more conditions Cs
No more examples Es
All examples have the same label
Additional criteria:
Minimum child size: do not split a node if there would be too few examples in one of the children
Minimum number of examples: do not split a node with too few examples
Sufficient improvement: do not split a node unless it improves some metric sufficiently
Maximum depth: do not split if the depth reaches a maximum

Tree Leaves
Point estimates to go into tree leaves:
Most frequent target value
Median target value
Mean target value
What point estimate optimally fits the leaf's examples?
It depends on the error function (see Proposition 7.1 in PM):
0/1 error → most frequent target value
absolute error → median target value
squared error → mean target value

Split Conditions
Boolean features can be used directly
non-Boolean features:
partition domain into subsets
one branch for each domain element

Choosing Split Conditions
Greedy/myopic selection:
choose the condition that would result in the lowest training error if it were the only split
Example: choose a split condition with the highest information gain

Decision Trees: An Example

Choosing Split Conditions: PM example 7.9
Log-likelihood: log P(E|h) = log ∏_{e∈E} P(e|h) = ∑_{e∈E} log P(e|h)
want to maximize this to get h_ML
Log loss: −( ∑_{e∈E} log P(e|h) ) / |E|
want to minimize this to get h_ML
Our leaf predictors/hypotheses will be distributions:
P(reads), P(skips) = 1 − P(reads)
We will use the base-2 logarithm: log₂

Choosing Split Conditions: PM example 7.9
With a single leaf (no split), the optimal prediction is P(reads) = P(skips) = 9/18 = 0.5
Log loss: −( ∑_{e∈E} log₂ P(e|h) ) / |E| = −(log₂ 0.5 + ··· + log₂ 0.5) / 18 = −18 log₂ 0.5 / 18 = −log₂ 0.5 = 1

Choosing Split Conditions: PM example 7.9
Split on Author:
Subset of E when Author = known is
{e1, e4, e5, e6, e9, e10, e12, e13, e14, e15, e16, e17}
for half of them the target is reads so the optimal prediction is P(reads) = P(skips) = 6/12 = 0.5
Subset of E when Author = unknown is {e2, e3, e7, e8, e11, e18}
for half of them the target is reads, so the optimal prediction is P(reads) = P(skips) = 3/6 = 0.5
Log loss: −( ∑_{e∈E} log₂ P(e|h) ) / |E| = −(12 log₂(6/12) + 6 log₂(3/6)) / 18 = −18 log₂ 0.5 / 18 = −log₂ 0.5 = 1

Choosing Split Conditions: PM example 7.9
Split on Thread:
Subset of E when Thread = new is {e1, e2, e5, e8, e10, e12, e14, e15, e17, e18}
the optimal prediction is P(reads) = 7/10, P(skips) = 3/10
Subset of E when Thread = followup is {e3, e4, e6, e7, e9, e11, e13, e16}
the optimal prediction is P(reads) = 2/8, P(skips) = 6/8
Log loss: −( ∑_{e∈E} log₂ P(e|h) ) / |E| = −(3 log₂(3/10) + 7 log₂(7/10) + 2 log₂(2/8) + 6 log₂(6/8)) / 18 ≈ 0.85

Choosing Split Conditions: PM example 7.9
Split on Length:
Subset of E when Length = long is {e1, e3, e4, e6, e9, e10, e12}
the optimal prediction is P(reads) = 0/7, P(skips) = 7/7
Subset of E when Length = short is {e2, e5, e7, e8, e11, e13, e14, e15, e16, e17, e18}
the optimal prediction is P(reads) = 9/11, P(skips) = 2/11
Log loss: −( ∑_{e∈E} log₂ P(e|h) ) / |E| = −(7 log₂(7/7) + 9 log₂(9/11) + 2 log₂(2/11)) / 18 ≈ 0.417
Length gives the lowest log loss of the three features, so the greedy learner splits on Length first
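These numbers can be reproduced with a short sketch (the per-leaf counts are read off the slides above; the helper name is illustrative):

```python
from math import log2

def leaf_log_loss(n_reads, n_skips):
    """Sum over a leaf's examples of -log2 P(correct class), where P(reads)
    is set to the empirical frequency n_reads / (n_reads + n_skips)."""
    n = n_reads + n_skips
    total = 0.0
    if n_reads:
        total -= n_reads * log2(n_reads / n)
    if n_skips:
        total -= n_skips * log2(n_skips / n)
    return total

n_examples = 18
splits = {
    "no split": [(9, 9)],           # (reads, skips) counts in each leaf
    "Author":   [(6, 6), (3, 3)],
    "Thread":   [(7, 3), (2, 6)],
    "Length":   [(0, 7), (9, 2)],
}
for name, leaves in splits.items():
    loss = sum(leaf_log_loss(r, s) for r, s in leaves) / n_examples
    print(f"{name:9s} log loss = {loss:.3f}")
# no split 1.000, Author 1.000, Thread 0.850, Length 0.418 (the slide rounds this to 0.417)
```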

Linear Regression
Linear regression is the problem of fitting a linear function to a set of training examples
both input and target features must be numeric
Linear function of the input features:
Ŷ^w(e) = w_0 + w_1 X_1(e) + ··· + w_n X_n(e) = ∑_{i=0}^{n} w_i X_i(e)
(with the convention X_0(e) = 1, so that w_0 is the intercept)

Linear Regression: finding w
For some loss functions (e.g., squared error), linear regression has a closed-form solution for the w that minimizes the error
We can also use gradient descent
gradient descent is an iterative method to find the minimum of a function
for minimizing error: w_i ← w_i − η · ∂error(E, w)/∂w_i
squared error: error(E) = ∑_{e∈E} (Y(e) − Ŷ(e))²
squared error for Ŷ^w: error(E, w) = ∑_{e∈E} (Y(e) − ∑_{i=0}^{n} w_i X_i(e))²
∂error(E, w)/∂w_i = ∑_{e∈E} −2 (Y(e) − Ŷ^w(e)) X_i(e)

Types of Gradient Descent
Full: update weights after a full sweep through the training data E:
w_i ← w_i − η · ∂error(E, w)/∂w_i
for squared error: ∂error(E, w)/∂w_i = ∑_{e∈E} −2 (Y(e) − Ŷ^w(e)) X_i(e)
guaranteed to converge to a global minimum of the error for convex error functions and small enough η
Incremental: update weights after each training sample e:
w_i ← w_i − η · ∂error(e, w)/∂w_i
for squared error: ∂error(e, w)/∂w_i = −2 (Y(e) − Ŷ^w(e)) X_i(e)
no convergence guarantees, as w changes between examples
can select examples at random: stochastic gradient descent
Batched: the error is computed over a batch of examples
batch = {e} → incremental gradient descent
batch = E → full gradient descent

Incremental Gradient Descent for Yˆw
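A minimal sketch of the incremental update for linear regression with squared error, in the spirit of this slide rather than PM's exact pseudocode (Figure 7.8); it assumes each input vector starts with the constant feature X_0 = 1, and the function name and toy data are illustrative:

```python
import random

def incremental_gradient_descent(examples, n_features, eta=0.01, sweeps=100, seed=0):
    """Incremental gradient descent for linear regression with squared error (a sketch).

    examples: list of (x, y) pairs, where x is a list of n_features + 1 input
              values with x[0] == 1 (the constant feature) and y is the target.
    """
    rng = random.Random(seed)
    examples = list(examples)            # local copy so shuffling does not affect the caller
    w = [0.0] * (n_features + 1)
    for _ in range(sweeps):
        rng.shuffle(examples)            # random order: stochastic gradient descent
        for x, y in examples:
            y_hat = sum(wi * xi for wi, xi in zip(w, x))
            # w_i <- w_i - eta * d(error)/dw_i, where d(error)/dw_i = -2 (y - y_hat) x_i
            for i in range(len(w)):
                w[i] += eta * 2 * (y - y_hat) * x[i]
    return w

# Tiny illustrative dataset for the target y = 1 + 2 * x1 (x[0] is the constant feature).
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
print(incremental_gradient_descent(data, n_features=1, eta=0.05, sweeps=500))
# prints weights close to [1.0, 2.0]
```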

Types of Error Functions for Gradient Descent
Can use gradient descent with any error function that:
is differentiable (almost) everywhere
absolute error is not differentiable at 0, but we can define its derivative to be 0 there
has a derivative that gives useful information (i.e., is not 0 almost everywhere)
0/1 error does not work
Gradient descent plays an important role in Deep Learning

Linear Classification
For binary targets represented by {0, 1} and numeric input features, we can use a linear function to estimate the probability of the class
but we need to constrain the output to lie within [0, 1]
instead of outputting the result of the linear combination directly, send it through an activation function f : R → [0, 1]:
Ŷ^w(e) = f( ∑_{i=0}^{n} w_i X_i(e) )
Discuss why
step_0(x) = { 1 when x ≥ 0; 0 when x < 0 }
is not a good activation function for gradient descent

Logistic Regression
A common activation function is the sigmoid or logistic function:
f(x) = 1 / (1 + e^{−x})
Differentiable: f′(x) = f(x)(1 − f(x))
(plot of the sigmoid function for x from −10 to 10)
Learning the weights of a sigmoid-capped linear function is called logistic regression

Stochastic Gradient Descent for Logistic Regression (for Log Loss)
(a Python sketch appears below, after the Limits of Linear Classifiers slide)

Back to Email Classification Example

Limits of Linear Classifiers
Can linear classifiers learn any binary concept?
A binary concept (i.e., two classes) is linearly separable if there exists a hyperplane separating class 0 from class 1
A linear classifier can learn any linearly separable concept (and only those)
What about the three concepts – OR, AND, XOR – below:
(plots of labeled examples for OR, AND, and XOR)
Try to linearly separate + examples from − examples for XOR. What does it say about creating a linear classifier for XOR?
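As referenced on the Stochastic Gradient Descent for Logistic Regression slide above, a minimal sketch assuming targets in {0, 1}, the sigmoid activation, log loss (for which ∂error/∂w_i = (ŷ − y) · x_i), and a constant feature x_0 = 1; the function names and toy data are illustrative:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic_regression(examples, n_features, eta=0.1, steps=1000, seed=0):
    """Stochastic gradient descent for logistic regression with log loss (a sketch).

    examples: list of (x, y) pairs with x[0] == 1 (constant feature) and y in {0, 1}.
    For log loss, d(error)/dw_i = (y_hat - y) * x_i, so the update is
    w_i <- w_i + eta * (y - y_hat) * x_i.
    """
    rng = random.Random(seed)
    w = [0.0] * (n_features + 1)
    for _ in range(steps):
        x, y = rng.choice(examples)              # pick a random training example
        y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):
            w[i] += eta * (y - y_hat) * x[i]
    return w

# Tiny illustrative (linearly separable) dataset: class 1 iff x1 is large.
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
w = sgd_logistic_regression(data, n_features=1)
print([round(sigmoid(w[0] + w[1] * x1), 2) for x1 in (0.0, 1.0, 3.0, 4.0)])
# prints probabilities roughly near 0 for the first two inputs and near 1 for the last two
```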