
ANLY-601
Advanced Pattern Recognition

Spring 2018

L18 — Algorithm-Independent Stuff

Algorithm-Independent Issues

• Empirical Error Estimates

– Hold out

– Cross-validation

• Leave-one-out

• K-fold cross-validation

– Bootstrap estimates


Sampling Issues

• Suppose we have training and test datasets Dtrain and Dtest respectively.

• We pick some model for a discriminant function h(x; θ), where x is the input feature and θ is the set of parameters that specify h, such as

– class priors, parameters in class-conditional density models (means, covariance matrices), total covariance, weights in a neural network, radii in kernel density estimates …

From the training data Dtrain we form estimates of these parameters, and hence an estimate of the discriminant function

  $\hat h(x) \equiv h(x;\hat\theta) = h\big(x;\hat\theta(D_{\text{train}})\big)$

• This discriminant function defines a classifier function – e.g. the function that returns the class labels {0,1} when given the input feature x

  $\hat f(x, D_{\text{train}}) = \begin{cases} 1, & h(x;\hat\theta) \ge 0 \\ 0, & h(x;\hat\theta) < 0 \end{cases}$
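As a concrete illustration (not from the lecture), here is a minimal Python/NumPy sketch of this pipeline for one particular model choice: Gaussian class-conditional densities with a shared covariance, so that θ̂ = (priors, class means, pooled covariance) is estimated from Dtrain and h(x; θ̂) is the resulting log-odds discriminant. The function names and the model choice are assumptions for illustration only.

import numpy as np

def fit_theta(X_train, y_train):
    """Estimate theta-hat = (class priors, class means, pooled covariance)
    from D_train, assuming binary labels {0, 1} and Gaussian class-conditional
    densities with a shared covariance matrix."""
    theta = {}
    n, d = X_train.shape
    pooled = np.zeros((d, d))
    for c in (0, 1):
        Xc = X_train[y_train == c]
        theta[c] = {"prior": len(Xc) / n, "mean": Xc.mean(axis=0)}
        pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
    theta["cov"] = pooled / (n - 2)
    return theta

def h(x, theta):
    """Discriminant h(x; theta-hat): log posterior odds of class 1 versus class 0."""
    cov_inv = np.linalg.inv(theta["cov"])
    def log_score(c):
        diff = x - theta[c]["mean"]
        return -0.5 * diff @ cov_inv @ diff + np.log(theta[c]["prior"])
    return log_score(1) - log_score(0)

def f_hat(x, theta):
    """Classifier f-hat(x, D_train): 1 if h(x; theta-hat) >= 0, else 0."""
    return 1 if h(x, theta) >= 0 else 0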


Sampling Issues

• Next we would like to know the error rate for the classifier, the probability that our classifier does not agree with the true class label l(x) on the next (independent and previously not seen) sample

  $\mathcal{E}(D_{\text{train}}) = \int p(x)\, P\big(\hat f(x, D_{\text{train}}) \ne l(x) \mid x\big)\, dx$

• However we can't compute this, so we use another data set to estimate it by counting errors

  $\hat{\mathcal{E}}\big(D_{\text{train}},\ D_{\text{test}}=\{(x_i, l_i),\ i=1,\dots,N_{\text{test}}\}\big) \;=\; \frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}} \mathbb{1}\big(\hat f(x_i, D_{\text{train}}) \ne l_i\big) \;=\; \frac{N_{\text{errors}}}{N_{\text{test}}}$

this is called the holdout estimate of error rate.
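A minimal sketch (not from the lecture) of the holdout estimate: count disagreements between the classifier and the true labels on Dtest. The toy threshold classifier and the synthetic data are illustrative stand-ins for f̂(x, Dtrain) and a real test set.

import numpy as np

def holdout_error(classify, X_test, y_test):
    """Holdout estimate: fraction of test samples on which the classifier
    disagrees with the true label, i.e. N_errors / N_test."""
    n_errors = sum(classify(x) != y for x, y in zip(X_test, y_test))
    return n_errors / len(y_test)

# Illustrative usage with a toy threshold classifier on 1-D synthetic data;
# `classify` stands in for f-hat(x, D_train) fitted on a separate training set.
rng = np.random.default_rng(0)
x_test = np.concatenate([rng.normal(-1, 1, 100), rng.normal(+1, 1, 100)])
y_test = np.array([0] * 100 + [1] * 100)
classify = lambda x: 1 if x >= 0.0 else 0
print("holdout error estimate:", holdout_error(classify, x_test, y_test))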


Sampling Issues

• The estimated classifier $\hat f(x, D_{\text{train}})$ is a random variable dependent on the particular training set.

• Its estimated error rate $\hat{\mathcal{E}}(D_{\text{train}}, D_{\text{test}})$ is a random variable dependent on the particular test set.

• So how do we compare different classifiers on a given problem?


Empirical Error Rate Estimates

We estimate the performance of a classifier by counting errors on a finite test set:

  $\hat{\mathcal{E}} = \frac{\#\ \text{of errors on test data}}{N} = \frac{N_{\text{errors}}}{N}$

Suppose the true error rate is $\mathcal{E}$. Then the number of errors made on a sample of N objects follows a binomial distribution

  $P(N_{\text{errors}}) = \binom{N}{N_{\text{errors}}}\, \mathcal{E}^{N_{\text{errors}}}\, (1-\mathcal{E})^{N-N_{\text{errors}}}$

The average number of errors is $E[N_{\text{errors}}] = N\mathcal{E}$ so

  $E[\hat{\mathcal{E}}] = \mathcal{E}$

The variance is

  $\mathrm{var}(\hat{\mathcal{E}}) = \frac{1}{N^2}\,\mathrm{var}(N_{\text{errors}}) = \frac{1}{N^2}\, N\mathcal{E}(1-\mathcal{E}) = \frac{\mathcal{E}(1-\mathcal{E})}{N}$

and can be substantial for N relatively small, or $\mathcal{E}$ near ½.


Error Estimates

Problems with holdout method:

• Usually have only one dataset. Partition it into training (Dtrain ) and

test (Dtest). This gives ONE measurement of the error rather than

the true error rate.

Since the empirical error estimate is unbiased, it's clear that

  $E_{D_{\text{test}}}\big[\hat{\mathcal{E}}(D_{\text{train}}, D_{\text{test}})\big] = \mathcal{E}(D_{\text{train}})$

• We'd like to use as much of the data as possible for training, as this gives a more accurate (lower variance) estimator of the classifier.


Error Estimates

We’d like to use as much of the data as possible for training, as

this gives a more accurate (lower variance) estimator of the classifier.

One approach is leave-one-out.

Start with N data samples.

1. Choose one sample and remove it.

2. Design classifier based on remaining N-1 samples

3. Test on single removed sample.

but this increases the variance of the error rate estimate.


Cross Validation

Both the hold-out and the leave-one-out estimates provide a single

measurement. We really want an average over datasets,

but we have only one dataset!

Solution

Generate many splits into training and test sets.

Measure the empirical error rate on each of these splits,

and average.


Leave One Out Cross-Validation

• Start with N data samples.

1. Choose one sample and remove it.

2. Design classifier based on remaining N-1 samples

3. Test on single removed sample.

Repeat 1-3 for all N different choices of single-sample test

sets. Average the error rate of all splits.

• Leave-one-out is expensive for any technique that requires iterative training

(neural networks, mixture model fitting) since you must learn N different models.

• However, leave-one-out is cheap for memory-based techniques like k-NN, kernel

methods etc.

• All classifiers have very similar training sets – similar to total training set.

(Bias of the error estimate $\hat{\mathcal{E}}(D_{\text{train}}, D_{\text{test}})$ is low.)
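A minimal sketch (not from the lecture) of leave-one-out, using a 1-nearest-neighbour classifier precisely because leave-one-out is cheap for memory-based methods, as noted above. The data and function names are illustrative assumptions.

import numpy as np

def loo_error_1nn(X, y):
    """Leave-one-out error estimate for a 1-nearest-neighbour classifier:
    each sample is classified by its nearest neighbour among the other N-1 samples."""
    n_errors = 0
    for i in range(len(y)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # exclude the held-out sample itself
        n_errors += int(y[np.argmin(dists)] != y[i])
    return n_errors / len(y)

# Illustrative usage on synthetic two-class data:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print("leave-one-out error estimate:", loo_error_1nn(X, y))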


K-Fold Cross Validation
• Divide data into k disjoint sets of equal size N/k.

• Train the classifier k times, each time with a different set held out

to estimate the performance.

• Estimate error rate as mean of the rate measured for each of the

k splits. (Reduction in amount of training data biases the error

rate upward. Variance is lower than in leave-one-out.)

• Cross-validation (leave-one-out and k-fold) is useful for picking hyper-parameters and architectures (see the sketch after this list):

– Number of components in a Gaussian mixture model.

– Radius of kernel in kernel density estimates.

– Number of neighbors in k-NN.

– Number of layers and hidden neurons in a neural network.
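The sketch referenced above (illustrative, not from the lecture): 5-fold cross-validation used to pick the number of neighbours in a k-NN classifier on synthetic data. All names, the synthetic data, and the choice of base classifier are assumptions.

import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the majority label among the k nearest training samples (binary labels)."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return int(np.round(y_train[nearest].mean()))

def kfold_error(X, y, k_neighbors, n_folds=5, seed=0):
    """k-fold cross-validation: average the error rate measured on each held-out fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    fold_errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        wrong = sum(knn_predict(X[train], y[train], X[i], k_neighbors) != y[i] for i in fold)
        fold_errors.append(wrong / len(fold))
    return float(np.mean(fold_errors))

# Illustrative hyper-parameter selection: pick the number of neighbours by 5-fold CV
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(+1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
for k in (1, 3, 5, 9):
    print(f"k = {k}:  5-fold CV error = {kfold_error(X, y, k):.3f}")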


Resampling

• Cross-validation attempts to approximate averages over training and test

sets. It is a means of ameliorating the variance of estimates due to limited

data set size.

It is one example of a resampling technique.

• Bootstrap – another resampling technique – allows even more data sets to be averaged.

Bootstrap data set

– Start with our data set of N samples D.

– Randomly select N samples with replacement: select a sample at random, copy it into the new dataset, and return the sample to the original bucket of data. (On average, 0.632×N distinct samples.)

– Generates independent datasets drawn from the empirical density

  $\hat p_N(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$, where $\delta(\cdot)$ is the Dirac delta.
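An illustrative sketch (not from the lecture) of drawing a bootstrap replicate by sampling indices with replacement, together with an empirical check of the ≈0.632×N distinct-sample property; the names are assumptions.

import numpy as np

def bootstrap_indices(n, rng):
    """Draw N indices with replacement -- one bootstrap replicate of an N-sample dataset."""
    return rng.integers(0, n, size=n)

# Empirical check of the ~0.632 N distinct-sample property:
rng = np.random.default_rng(0)
n = 1000
fractions = [len(np.unique(bootstrap_indices(n, rng))) / n for _ in range(200)]
print("mean fraction of distinct samples:", np.mean(fractions))   # ~ 1 - 1/e = 0.632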


Bootstrap Error Estimate

Generate B bootstrap datasets Db, b = 1, … , B. Train a classifier on each of the bootstrap datasets – denote these classifiers $\hat f_b(x)$.

Evaluate each of the bootstrap classifiers on the original complete data set, less the samples present in that particular bootstrap training set:

  $\hat{\mathcal{E}}_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \hat{\mathcal{E}}\big(D_b,\ D \setminus D_b\big)$

Since each bootstrap training set has, on average, only 0.632×N distinct samples, the error rate estimate has a bias similar to 2-fold cross-validation. The ".632 estimator" is designed to alleviate this bias:

  $\hat{\mathcal{E}}_{.632} = 0.632\,\hat{\mathcal{E}}_{\text{boot}} + 0.368\,\hat{\mathcal{E}}(D_{\text{train}}, D_{\text{train}})$

where $\hat{\mathcal{E}}(D_{\text{train}}, D_{\text{train}})$ is the resubstitution (training-set) error.
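A sketch of the bootstrap and .632 estimators under the notation above (illustrative names, toy nearest-class-mean base classifier, both assumptions): each bootstrap classifier is scored on the samples left out of its replicate, and the resubstitution error supplies the second term of the .632 estimate.

import numpy as np

def error_rate(classify, X, y):
    """Empirical error rate of a classifier on the labelled set (X, y)."""
    return float(np.mean([classify(x) != yi for x, yi in zip(X, y)]))

def bootstrap_632_error(train, X, y, B=100, seed=0):
    """Bootstrap error estimate E_boot and the .632 estimator.
    `train(X, y)` must return a classify(x) function; B bootstrap replicates are used."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_errors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap training set D_b
        out = np.setdiff1d(np.arange(n), idx)       # samples of D not present in D_b
        classify = train(X[idx], y[idx])
        boot_errors.append(error_rate(classify, X[out], y[out]))
    e_boot = float(np.mean(boot_errors))
    e_resub = error_rate(train(X, y), X, y)         # resubstitution (training-set) error
    return e_boot, 0.632 * e_boot + 0.368 * e_resub

# Illustrative usage with a toy nearest-class-mean base classifier:
def train_nearest_mean(X, y):
    means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    return lambda x: min(means, key=lambda c: np.linalg.norm(x - means[c]))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(+1, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
e_boot, e_632 = bootstrap_632_error(train_nearest_mean, X, y)
print(f"E_boot = {e_boot:.3f}   E_.632 = {e_632:.3f}")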


Bootstrap Aggregates

• Committee machines, or aggregates, use several component classifiers and combine their votes for a final decision. If the errors of the individual component classifiers are uncorrelated (and this can take some work), then they may be expected to cancel out during the voting.

• Bootstrap Aggregation – or bagging – constructs the

component classifiers by training on bootstrap replicates.
