ANLY-601
Advanced Pattern Recognition
Spring 2018
L18 — Algorithm-Independent Stuff
Algorithm-Independent Issues
• Empirical Error Estimates
– Holdout
– Cross-validation
• Leave-one-out
• K-fold cross-validation
– Bootstrap estimates
Sampling Issues
• Suppose we have training and test datasets D_train and D_test, respectively.
• We pick some model for a discriminant function h(x; θ), where x is the input feature and θ is the set of parameters that specify h, such as
– class priors, parameters in class-conditional density models (means, covariance matrices), total covariance, weights in a neural network, radii in kernel density estimates …
• From the training data D_train we form estimates of these parameters, and hence an estimate of the discriminant function
$$\hat{h}(x) = h(x;\hat{\theta}) = h\big(x;\hat{\theta}(D_{train})\big)$$
• This discriminant function defines a classifier function, e.g. the function that returns the class label in {0,1} when given the input feature x:
$$\hat{f}(x, D_{train}) = \begin{cases} 1, & h(x;\hat{\theta}) \ge 0 \\ 0, & h(x;\hat{\theta}) < 0 \end{cases}$$
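As a concrete illustration, here is a minimal Python sketch of how an estimated discriminant induces a classifier. It assumes a two-class model with Gaussian class-conditional densities sharing the total covariance (one of the parameter choices listed above); the names fit_discriminant, h, and f_hat are hypothetical, not from the slides.

import numpy as np

def fit_discriminant(X, y):
    # Estimate theta-hat from D_train: class priors, class means,
    # and the total covariance (a sketch, assuming two classes labeled 0/1).
    theta = {}
    for c in (0, 1):
        Xc = X[y == c]
        theta[c] = {"prior": len(Xc) / len(X), "mean": Xc.mean(axis=0)}
    theta["cov"] = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # total covariance
    return theta

def h(x, theta):
    # Discriminant h(x; theta-hat): log posterior odds of class 1 vs class 0.
    Sinv = np.linalg.inv(theta["cov"])
    def score(c):
        d = x - theta[c]["mean"]
        return -0.5 * d @ Sinv @ d + np.log(theta[c]["prior"])
    return score(1) - score(0)

def f_hat(x, theta):
    # Classifier induced by the discriminant: 1 if h(x; theta-hat) >= 0, else 0.
    return 1 if h(x, theta) >= 0 else 0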
Sampling Issues
• Next we would like to know the error rate for the classifier: the probability that our classifier does not agree with the true class label l(x) on the next (independent and previously unseen) sample,
$$\mathcal{E}(D_{train}) = \int p(x)\, P\big(\hat{f} \ne l \mid x\big)\, dx$$
• However, we can't compute this, so we use another data set to estimate it by counting errors,
$$\hat{\mathcal{E}}\big(D_{train},\, D_{test}=\{(x_i, l_i),\ i = 1 \ldots N_{test}\}\big) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \mathbf{1}\big(\hat{f}(x_i, D_{train}) \ne l_i\big) = \frac{N_{error}}{N_{test}}$$
This is called the holdout estimate of the error rate.
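A short sketch of the holdout estimate above: count how often the trained classifier disagrees with the labels on the held-out set. It reuses the hypothetical f_hat and theta from the previous sketch.

import numpy as np

def holdout_error(f_hat, theta, X_test, y_test):
    # Empirical error rate: fraction of test samples where the classifier
    # disagrees with the true label, i.e. N_errors / N_test.
    errors = sum(f_hat(x, theta) != l for x, l in zip(X_test, y_test))
    return errors / len(y_test)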
Sampling Issues
• The estimated classifier $\hat{f}(x, D_{train})$ is a random variable, dependent on the particular training set.
• Its estimated error rate $\hat{\mathcal{E}}(D_{train}, D_{test})$ is a random variable, dependent on the particular test set.
• So how do we compare different classifiers on a given problem?
Empirical Error Rate Estimates
We estimate the performance of a classifier by counting errors on a
finite test set.
Suppose the true error rate is $\mathcal{E}$. Then the number of errors made on a
sample of N objects follows a binomial distribution
$$P(N_{errors}) = \binom{N}{N_{errors}}\, \mathcal{E}^{N_{errors}}\, (1-\mathcal{E})^{N - N_{errors}}$$
and the empirical error rate is
$$\hat{\mathcal{E}} = \frac{\#\ \text{of errors on test data}}{N} = \frac{N_{errors}}{N}$$
The average number of errors is $E[N_{errors}] = N\mathcal{E}$, so
$$E[\hat{\mathcal{E}}] = \mathcal{E}$$
The variance is
$$\mathrm{var}(\hat{\mathcal{E}}) = \frac{1}{N^2}\,\mathrm{var}(N_{errors}) = \frac{1}{N^2}\, N\,\mathcal{E}(1-\mathcal{E}) = \frac{\mathcal{E}(1-\mathcal{E})}{N}$$
and can be substantial for N relatively small, or $\mathcal{E}$ near ½.
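A quick numeric illustration of the variance formula (the error rate and test-set sizes below are made-up values, not from the slides):

import numpy as np

E = 0.10                            # assumed true error rate
for N in (30, 100, 1000):
    std = np.sqrt(E * (1 - E) / N)  # standard deviation of E-hat
    print(f"N = {N:5d}   std(E-hat) = {std:.3f}")
# For N = 30 the standard deviation (about 0.055) is over half the error
# rate itself, so a single holdout measurement can be badly off.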
Error Estimates
Problems with the holdout method:
• Usually we have only one dataset. We partition it into training (D_train) and
test (D_test) sets. This gives ONE measurement of the error rather than
the true error rate. Since the empirical error estimate is unbiased, it's clear that
$$E_{D_{test}}\big[\hat{\mathcal{E}}(D_{train}, D_{test})\big] = \mathcal{E}(D_{train})$$
• We'd like to use as much of the data as possible for training, as
this gives a more accurate (lower-variance) estimate of the
classifier.
Error Estimates
We’d like to use as much of the data as possible for training, as
this gives a more accurate (lower variance) estimator of the classifier.
One approach is leave-one-out.
Start with N data samples.
1. Choose one sample and remove it.
2. Design a classifier based on the remaining N-1 samples.
3. Test on the single removed sample.
This keeps nearly all of the data for training, but testing on a single sample increases the variance of the error rate estimate.
Cross Validation
Both the holdout and the leave-one-out estimates provide a single
measurement. We really want an average over datasets,
but we have only one dataset!
Solution
Generate many splits into training and test sets.
Measure the empirical error rate on each of these splits,
and average.
Leave One Out Cross-Validation
• Start with N data samples.
1. Choose one sample and remove it.
2. Design classifier based on remaining N-1 samples
3. Test on single removed sample.
Repeat 1-3 for all N different choices of the single-sample test
set and average the error rates over all splits (a code sketch follows below).
• Leave-one-out is expensive for any technique that requires iterative training
(neural networks, mixture model fitting) since you must learn N different models.
• However, leave-one-out is cheap for memory-based techniques like k-NN, kernel
methods etc.
• All classifiers have very similar training sets, each close to the total training set, so the bias of the error estimate $\hat{\mathcal{E}}(D_{train}, D_{test})$ is low.
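A minimal leave-one-out sketch using a 1-nearest-neighbor classifier, one of the memory-based methods mentioned above for which leave-one-out is cheap (no iterative retraining is needed):

import numpy as np

def loo_error_1nn(X, y):
    # Leave-one-out error rate for a 1-NN classifier.
    N = len(y)
    errors = 0
    for i in range(N):
        mask = np.arange(N) != i              # remove sample i
        X_train, y_train = X[mask], y[mask]   # "train" on the other N-1 samples
        j = np.argmin(np.linalg.norm(X_train - X[i], axis=1))
        errors += (y_train[j] != y[i])        # test on the removed sample
    return errors / N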
K-Fold Cross Validation
• Divide data into k disjoint sets of equal size N/k.
• Train the classifier k times, each time with a different set held out
to estimate the performance.
• Estimate the error rate as the mean of the rates measured on each of the
k splits; a sketch follows this list. (The reduction in the amount of training data biases the error
rate upward. The variance is lower than in leave-one-out.)
• Cross-validation (leave-one-out and k-fold) is useful for picking
hyper-parameters and architectures:
– Number of components in a Gaussian mixture model.
– Radius of kernel in kernel density estimates.
– Number of neighbors in k-NN.
– Number of layers and hidden neurons in a neural network.
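The sketch referenced in the list above: a plain k-fold cross-validation loop. The fit and predict arguments are hypothetical placeholders for whatever classifier is being validated.

import numpy as np

def kfold_error(X, y, k, fit, predict):
    # k-fold cross-validation estimate of the error rate.
    # fit(X, y) returns a model; predict(model, x) returns a label.
    N = len(y)
    idx = np.random.permutation(N)
    folds = np.array_split(idx, k)           # k disjoint sets of size ~N/k
    rates = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)      # hold the fold out, train on the rest
        model = fit(X[train], y[train])
        errs = sum(predict(model, x) != l for x, l in zip(X[fold], y[fold]))
        rates.append(errs / len(fold))
    return np.mean(rates)                    # mean error rate over the k splits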
Resampling
• Cross-validation attempts to approximate averages over training and test
sets. It is a means of ameliorating the variance of estimates due to limited
data set size.
It is one example of a resampling technique.
• The bootstrap, another resampling technique, allows even more data sets to
be averaged.
Bootstrap data set
– Start with our data set D of N samples.
– Randomly select N samples with replacement: select a sample at
random, copy it into the new dataset, and return the sample to the
original bucket of data. (On average, the replicate contains 0.632×N distinct samples; a code sketch follows below.)
– This generates independent datasets drawn from the empirical density
$$\hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)$$
(a sum of Dirac deltas at the observed samples).
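A sketch of generating one bootstrap replicate as described above; bootstrap_replicate is a hypothetical helper name.

import numpy as np

def bootstrap_replicate(X, y):
    # Draw N samples with replacement from the dataset D = (X, y).
    N = len(y)
    rng = np.random.default_rng()
    idx = rng.integers(0, N, size=N)    # indices sampled with replacement
    # On average only about (1 - 1/e) = 0.632 of the N samples are distinct.
    return X[idx], y[idx]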
Bootstrap Error Estimate
Generate B bootstrap datasets $D_b$, $b = 1, \ldots, B$. Train a classifier on
each of the bootstrap datasets; denote these classifiers $\hat{f}_b(x)$.
Evaluate each bootstrap classifier on the original complete data
set, less the samples present in that particular bootstrap training set:
$$\hat{\mathcal{E}}_{boot} = \frac{1}{B}\sum_{b=1}^{B} \hat{\mathcal{E}}\big(D_b,\, D \setminus D_b\big)$$
Since each bootstrap training set has, on average, only 0.632×N distinct samples, this error rate has a
bias similar to 2-fold cross-validation. The ".632 estimator" is
designed to alleviate this bias:
$$\hat{\mathcal{E}}_{.632} = 0.632\, \hat{\mathcal{E}}_{boot} + 0.368\, \hat{\mathcal{E}}(D_{train}, D_{train})$$
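A sketch of the bootstrap error estimate above: each of the B classifiers is scored on the samples left out of its own bootstrap training set. The fit and predict arguments are the same hypothetical interface used in the k-fold sketch.

import numpy as np

def bootstrap_error(X, y, B, fit, predict):
    # Average error of B bootstrap classifiers, each evaluated on the
    # original data minus the samples drawn into its bootstrap set.
    N = len(y)
    rng = np.random.default_rng()
    rates = []
    for b in range(B):
        idx = rng.integers(0, N, size=N)        # bootstrap replicate D_b
        out = np.setdiff1d(np.arange(N), idx)   # D \ D_b: samples not drawn
        if len(out) == 0:
            continue                            # vanishingly rare for reasonable N
        model = fit(X[idx], y[idx])
        errs = sum(predict(model, x) != l for x, l in zip(X[out], y[out]))
        rates.append(errs / len(out))
    return np.mean(rates)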
Bootstrap Aggregates
• Committee machines, or aggregates, use several component
classifiers and combine their outputs by voting to reach a final decision. If the errors
of the individual component classifiers are uncorrelated
(and this can take some work), then they may be expected to
cancel out during the voting.
• Bootstrap Aggregation – or bagging – constructs the
component classifiers by training on bootstrap replicates.
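A minimal bagging sketch under the same hypothetical fit/predict interface: train one component classifier per bootstrap replicate and take a majority vote for the final decision.

import numpy as np

def bagged_predict(X, y, x_new, B, fit, predict):
    # Majority vote of B component classifiers, each trained on a
    # bootstrap replicate of the data (assumes labels in {0, 1}).
    N = len(y)
    rng = np.random.default_rng()
    votes = []
    for b in range(B):
        idx = rng.integers(0, N, size=N)    # bootstrap replicate
        model = fit(X[idx], y[idx])
        votes.append(predict(model, x_new))
    return int(np.mean(votes) >= 0.5)       # majority vote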