CMPUT 366 F20: Supervised Learning II
James Wright & Vadim Bulitko
November 3, 2020
Lecture Outline
Supervised Learning
PM 7.1-7.2
How to Specify Desired Behaviour of an Agent
Desired action for each state → supervised learning
Desired goal state(s) → search
Good/bad feedback per action → reinforcement learning
Survival/death → evolution
Supervised Learning
A supervised learning problem consists of:
A set of input features X1, …, Xn
A set of target features Y1, …, Yk
A set of training examples whose Xi and Yj values are used by a supervised learning algorithm to construct a mapping f from input features X to target features Y
A set of test examples whose Xi are run through f to predict the corresponding Yj
Notation in PM: examples e ∈ E
Xi(e): the input feature i of example e
Yj(e): the output feature j of example e
Ŷm(e): the output feature m of e predicted by f
Types:
Classification: the Yi are discrete
Regression: the Yi are real-valued
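To make the notation concrete, here is a minimal Python sketch (the feature names and values are invented for illustration) of training examples, a hypothesis f, and a predicted target:

```python
# Hypothetical training examples: two input features (X1, X2), one target (Y1).
training_examples = [
    {"X1": 0.5, "X2": 1.0, "Y1": 2.0},
    {"X1": 1.5, "X2": 0.0, "Y1": 1.0},
]

def f(e):
    """A (deliberately naive) hypothesis mapping input features to a predicted Y1."""
    return e["X1"] + e["X2"]

# Test examples expose only their input features to the learner;
# f is used to predict the corresponding target, Y1_hat(e).
test_example = {"X1": 2.0, "X2": 1.0}
print(f(test_example))  # 3.0, the predicted Y1 for this example
```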
Regression Example
The goal is to predict the value of target Y based on features X
Both X and Y are real-valued
Exact values of both targets and features may not be in the training set
Example (figure below):
e8 is an interpolation problem: X is within the range of the training example values
e9 is an extrapolation problem: X is outside the range of the training example values
[Figure: Regression Example — training examples plotted as points, with input feature X (0 to 7) on the horizontal axis and target feature Y (0 to 8) on the vertical axis.]
Data Representation
For real-valued features, we typically record the feature values
For discrete features, there are multiple options:
Binary features: can code false, true as 0, 1 or −1, 1
Non-binary features: can record numeric values for each possible value
cardinal values: differences are meaningful
ordinal values: order is meaningful
categorical values: neither differences nor order are meaningful
Vector of indicator variables: one per feature value, exactly one is true (known as one-hot encoding), as sketched below
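A minimal sketch of one-hot encoding; the feature domain here is invented for illustration:

```python
def one_hot(value, domain):
    """Encode a categorical value as a vector of indicator variables:
    one per possible value, with exactly one set to 1."""
    return [1 if value == v else 0 for v in domain]

colors = ["red", "green", "blue"]  # categorical: neither order nor differences meaningful
print(one_hot("green", colors))    # [0, 1, 0]
```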
Classification Example
An agent wants to learn a person’s preference for the length of holidays
A holiday can be for 1, 2, 3, 4, 5, or 6 days
Two possible representations (sketched below): the number of days as a single numeric target, or six indicator variables Y1, …, Y6 with exactly one true
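A rough sketch of the two representations, assuming the target is the preferred holiday length in days:

```python
# Representation 1: a single numeric target, treating the length as cardinal.
example_cardinal = {"Y": 6}

# Representation 2: six indicator variables Y1..Y6, exactly one true (one-hot).
def holiday_one_hot(days):
    return {f"Y{d}": (1 if d == days else 0) for d in range(1, 7)}

print(holiday_one_hot(6))  # {'Y1': 0, 'Y2': 0, 'Y3': 0, 'Y4': 0, 'Y5': 0, 'Y6': 1}
```

Which representation is appropriate depends on how errors should be counted: the numeric representation treats predicting 5 instead of 6 as a small error, while under one-hot every wrong value is equally wrong.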
Generalization
What does it mean for supervised learning to perform well?
We want f to make correct predictions on all data:
find underlying patterns in the training data
encode these patterns in f
hope the patterns remain true for all data
We estimate generalization performance by evaluating f on a test set
The learning algorithm has access to features (X) of the test data but not the targets (Y)
Measuring performance only on training data is (usually) insufficient:
the learner can just memorize X(e) and Y(e)
it would be helpless on e outside of the training data
Generalization Example
Consider the following data set: {(i, e)} where i is a positive integer and e is true when i is even and false when i is odd
the training data is a collection of 10 such pairs where i is drawn randomly between 1 and 1000
a training sample (i, true) is called positive
a training sample (i, false) is called negative
Consider two binary classifiers P and N:
P classifies all positive samples in the training data as true and all others as false
N classifies all negative samples in the training data as false and all others as true
Both P and N accurately classify all training data
but are wrong on other data as they do not capture the notion of even/odd
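A minimal sketch of P and N; the random draw of training pairs is hypothetical:

```python
import random

random.seed(0)
train = [(i, i % 2 == 0) for i in random.sample(range(1, 1001), 10)]

positives = {i for i, label in train if label}
negatives = {i for i, label in train if not label}

def P(i):
    # true exactly on the positive training samples, false on everything else
    return i in positives

def N(i):
    # false exactly on the negative training samples, true on everything else
    return i not in negatives

# Both fit the training data perfectly...
print(all(P(i) == label and N(i) == label for i, label in train))  # True

# ...but on unseen data P says false to everything and N says true to everything.
unseen = [i for i in range(1, 1001) if i not in positives | negatives]
print(sum(P(i) == (i % 2 == 0) for i in unseen) / len(unseen))  # ~0.5
print(sum(N(i) == (i % 2 == 0) for i in unseen) / len(unseen))  # ~0.5
```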
Bias
The hypothesis is the function f that we learn
The hypothesis space is the set of possible hypotheses
The preference for one hypothesis over another beyond its performance on training data is called bias
without bias we would have no preference for any hypothesis beyond its training-data performance
training-data performance by itself is typically insufficient for good generalization (e.g., P and N classifiers)
example bias: preference for simple hypotheses (e.g., Ockham’s razor)
which bias works best for generalization is an empirical question
Learning as Search
Given training data, a hypothesis space, an error measurement and a bias, learning can be reduced to search
The learning process searches the hypothesis space trying to find the hypothesis that best fits the training data given the bias
the hypothesis space is too large to search exhaustively
local search (PM 4.7) methods are used instead:
start with a random hypothesis
attempt to improve it by tweaking it; sometimes make random tweaks
pick a new random hypothesis once in a while
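A minimal sketch of this idea, assuming a toy hypothesis space of constant predictors scored by squared error on the training data:

```python
import random

random.seed(0)
train_y = [2.0, 4.0, 5.0, 7.0]          # target values of the training examples

def cost(h):
    """Squared error of the constant hypothesis f(X) = h on the training data."""
    return sum((h - y) ** 2 for y in train_y)

h = random.uniform(0.0, 10.0)           # start with a random hypothesis
for _ in range(10000):
    tweak = h + random.gauss(0.0, 0.1)  # attempt a small random tweak
    if cost(tweak) < cost(h):           # keep the tweak only if it improves the cost
        h = tweak

print(h)  # approaches the mean of the targets, 4.5
```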
Minimizing Cost
The learning algorithm chooses its hypothesis f by combining:
1. its error (or loss) on the training data
2. some preference over the space of hypotheses (i.e., the bias)
Together they represent a cost function defined over the hypothesis space
the learning algorithm attempts to minimize the cost
We consider the two components in turn
0/1 Error
The 0/1 error for a dataset E of examples and hypothesis f is the number of examples for which the prediction was incorrect:

$$\sum_{e \in E} \begin{cases} 1 & \text{when } f(e) \neq Y(e) \\ 0 & \text{otherwise} \end{cases}$$
Works well for binary or categorical data
Does not work well for real-valued data:
usually cannot predict the exact value
0/1 error does not take magnitude of the misprediction into account
Absolute Error
The absolute error for a dataset E of examples and hypothesis f is the sum of absolute distances between the prediction and the actual target:
$$\sum_{e \in E} |f(e) - Y(e)|$$

Works well for cardinal and possibly ordinal data
takes into account the magnitude of misprediction
Does not make sense for categorical data
Squared Error
The squared error for a dataset E of examples and hypothesis f is the sum of squared distances between the prediction and the actual target:
$$\sum_{e \in E} (f(e) - Y(e))^2$$
Works well for cardinal data
takes into account the magnitude of misprediction
large errors contribute non-linearly more than small errors
Does not make sense for categorical data
Worst-case Error
The worst-case error for a dataset E of examples and hypothesis f is the absolute distance between the prediction and the actual target in the worst case:

$$\max_{e \in E} |f(e) - Y(e)|$$
Works well for cardinal data
takes into account the magnitude of misprediction but only on a single sample (i.e., the worst case)
Does not make sense for categorical data
A Comparison
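As a rough illustration of how the four error measures behave on the same data, here is a minimal sketch (the targets and predictions are invented):

```python
def zero_one(preds, targets):
    return sum(p != y for p, y in zip(preds, targets))

def absolute(preds, targets):
    return sum(abs(p - y) for p, y in zip(preds, targets))

def squared(preds, targets):
    return sum((p - y) ** 2 for p, y in zip(preds, targets))

def worst_case(preds, targets):
    return max(abs(p - y) for p, y in zip(preds, targets))

Y    = [1.0, 2.0, 3.0, 10.0]  # targets, with one outlier
Yhat = [1.0, 2.5, 3.0, 4.0]   # predictions of some hypothesis f

print(zero_one(Yhat, Y))    # 2     (two predictions are not exactly right)
print(absolute(Yhat, Y))    # 6.5
print(squared(Yhat, Y))     # 36.25 (dominated by the outlier)
print(worst_case(Yhat, Y))  # 6.0   (only the single worst sample matters)
```

Note how 0/1 error ignores the magnitude of mispredictions entirely, while squared error weights the large one non-linearly more than the small one.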
Norms
Given a vector x ∈ R^n, its p-norm is

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Absolute error is the 1-norm of the difference between true Y and predicted Ŷ:

$$\sum_{e \in E} |Y(e) - \hat{Y}(e)| = \sum_{e \in E} \|Y(e) - \hat{Y}(e)\|_1$$

Squared error is the square of the 2-norm of the difference between true Y and predicted Ŷ:

$$\sum_{e \in E} (Y(e) - \hat{Y}(e))^2 = \sum_{e \in E} \|Y(e) - \hat{Y}(e)\|_2^2$$
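A quick numpy check of this correspondence for a single example with vector-valued targets (the values are invented):

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])  # true target features Y(e)
y_hat = np.array([1.5, 2.0, 1.0])  # predicted target features Y_hat(e)

# absolute error contribution of e = 1-norm of the difference
print(np.sum(np.abs(y - y_hat)), np.linalg.norm(y - y_hat, ord=1))       # 2.5 2.5

# squared error contribution of e = square of the 2-norm of the difference
print(np.sum((y - y_hat) ** 2), np.linalg.norm(y - y_hat, ord=2) ** 2)   # 4.25 4.25
```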
Probabilistic Predictions
Rather than predicting exactly what a target value will be, many common algorithms predict a probability distribution over possible values
especially for classification tasks
One-hot encoding is the common data representation for this scheme:
target features of training examples have a single 1 for the true value
target values predicted by f are probabilities that sum to 1
Likelihood
The likelihood for a dataset E of examples and hypothesis f is the probability of independently observing the examples according to the probabilities assigned by the hypothesis:
$$P(E \mid f) = \prod_{e \in E} P(e \mid f)$$
This has a clear interpretation in a Bayesian sense
Numerical stability issues: the product of probabilities shrinks exponentially
floating point underflows almost immediately
MAP & ML
We may want the hypothesis f which is most likely given the data observed:

$$f_{MAP} = \arg\max_f P(f \mid E)$$

the maximum a posteriori (MAP) hypothesis. Using Bayes’ theorem:

$$f_{MAP} = \arg\max_f P(f \mid E) = \arg\max_f \frac{P(E \mid f)\,P(f)}{P(E)} = \arg\max_f P(E \mid f)\,P(f)$$
A hypothesis that maximizes P(E|f) is a maximum likelihood (ML) hypothesis
P(f) expresses our bias over hypotheses (independent of E)
Without the bias term P(f) (i.e., with a uniform prior), a MAP hypothesis is the same as an ML hypothesis
An Example: Medical Diagnosis: Cold or No Cold
The test is positive when a cold is present 98% of the time:
P(positive|cold) = 0.98, P(negative|cold) = 0.02
The test is negative when a cold is absent 97% of the time:
P(positive|¬cold) = 0.03, P(negative|¬cold) = 0.97
The prior probability of a cold is 0.8%:
P(cold) = 0.008, P(¬cold) = 0.992
Suppose the test comes back positive: E = {e} = {positive}
What is fML?
fML = arg max_{f∈{cold,¬cold}} P(E|f) = arg max_{f∈{cold,¬cold}} P(positive|f)
P(positive|cold) = 0.98 > P(positive|¬cold) = 0.03, so fML = cold
What is fMAP?
fMAP = arg max_{f∈{cold,¬cold}} P(f|E) = arg max_{f∈{cold,¬cold}} P(E|f)P(f)
P(positive|cold)P(cold) = 0.98 · 0.008 ≈ 0.0078 < P(positive|¬cold)P(¬cold) = 0.03 · 0.992 ≈ 0.0298, so fMAP = ¬cold
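A minimal sketch reproducing these calculations:

```python
likelihood = {"cold": 0.98,  "not_cold": 0.03}   # P(positive | f)
prior      = {"cold": 0.008, "not_cold": 0.992}  # P(f)

f_ml  = max(likelihood, key=likelihood.get)
f_map = max(likelihood, key=lambda f: likelihood[f] * prior[f])

print(f_ml)   # cold      (0.98 > 0.03)
print(f_map)  # not_cold  (0.98 * 0.008 = 0.00784 < 0.03 * 0.992 = 0.02976)
```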
Log-Likelihood
The log-likelihood for a dataset E of examples and hypothesis f is the log-probability of independently observing the examples according to the probabilities assigned by the hypothesis:
$$\log P(E \mid f) = \log \prod_{e \in E} P(e \mid f) = \sum_{e \in E} \log P(e \mid f)$$
The underflow issue is remedied:
sum of logs shrinks more slowly than product of probabilities
Log is monotonic so maximizing log-likelihood imposes the same preference over hypotheses as maximizing likelihood:
$$P(E \mid f_1) > P(E \mid f_2) \iff \log P(E \mid f_1) > \log P(E \mid f_2)$$
Log loss is the negative of the log-likelihood divided by the number of examples
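A minimal sketch of the underflow and its remedy, assuming a hypothesis that assigns probability 0.5 to each of 2000 independently observed examples:

```python
import math

probs = [0.5] * 2000  # P(e | f) for each example e

likelihood = 1.0
for p in probs:
    likelihood *= p
print(likelihood)  # 0.0 -- the product has underflowed to zero

log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)                # ~ -1386.29, perfectly representable
print(-log_likelihood / len(probs))  # log loss: ~0.693 (= ln 2)
```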
Baseline Predictors
Constant
no input features: f(X) is the same for any X
Special case: majority class in classification tasks
f(X) gives the most common Y found in the training set, regardless of X
good performance on unbalanced classification data
poor performance on balanced classification data
Best Predictor with no Input Features
Suppose V is the multiset of Y(e) for e ∈ E
Then the best predictor with no input features is:
0/1 error: the most frequent value in V
absolute error: the median of V
squared error: the mean of V
worst-case error: (max V + min V) / 2
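A minimal sketch that verifies these minimizers by brute force over a grid of candidate constants (the dataset is invented):

```python
from statistics import mean, median, mode

V = [1.0, 1.0, 2.0, 6.0, 10.0]  # multiset of target values Y(e)

errors = {
    "0/1":        lambda c: sum(c != y for y in V),
    "absolute":   lambda c: sum(abs(c - y) for y in V),
    "squared":    lambda c: sum((c - y) ** 2 for y in V),
    "worst-case": lambda c: max(abs(c - y) for y in V),
}

candidates = [i / 100 for i in range(0, 1001)]  # constants 0.00 .. 10.00
for name, err in errors.items():
    print(name, min(candidates, key=err))
# 0/1 1.0, absolute 2.0, squared 4.0, worst-case 5.5

# The predicted minimizers: mode, median, mean, and midpoint of min and max.
print(mode(V), median(V), mean(V), (max(V) + min(V)) / 2)  # 1.0 2.0 4.0 5.5
```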
Best Predictor with no Input Features: Binary Classifier
Suppose all Y(e) are 0s and 1s:
n0 is the number of e ∈ E such that Y(e) = 0
n1 is the number of e ∈ E such that Y(e) = 1
Then the best predictor with no input features is:
0/1 error: 0 if n0 > n1, else 1 (the most frequent value)
absolute error: 0 if n0 > n1, else 1 (the median)
squared error: n1 / (n0 + n1) (the mean)
worst-case error: 1/2 (the midpoint of min and max, when both classes are present)
Summary
Supervised learning is constructing a hypothesis function from training examples
maps from input features to target features
classification: discrete target features
regression: real-valued target features
Preferences among hypotheses beyond their training-data performance are called bias
an important component of learning
A cost function combines prediction error and bias
supervised learning aims to find a hypothesis which minimizes cost
different measures of prediction error suit different cases
Trivial predictors can be used as baselines