ANLY-601
Advanced Pattern Recognition
Spring 2018
L6 – Bounds on Error Rates
Bounding the Bayes Error Rate
Recall the likelihood ratio test, written in terms of the negative log of the likelihood ratio:

$$h(x) \equiv -\log \frac{p(x|\omega_1)}{p(x|\omega_2)} \;\underset{\omega_1}{\overset{\omega_2}{\gtrless}}\; t$$

Computation of the Bayes error rate for all but the simplest cases (e.g. Gaussian features x with equal class-conditional covariance matrices) is difficult. Calculating the integrals

$$\varepsilon_1 = \int_{L_2} p(x|\omega_1)\,dx = \int_t^{\infty} p(h|\omega_1)\,dh$$

$$\varepsilon_2 = \int_{L_1} p(x|\omega_2)\,dx = \int_{-\infty}^{t} p(h|\omega_2)\,dh$$

is difficult or impossible in closed form.
Bounding the Bayes Error Rate
Instead, we seek bounds on the error that are
• hopefully tight
• easy to calculate
In practice we will use empirical error estimates, but the theoretical bounds here provide useful tools to reason with.
The bounds point out factors entering the error rate.
That is, the theoretical framework helps you
conceptualize the problem.
A Calculable Case
Let both class-conditional densities be normal, with the same covariance matrix, $\Sigma_1 = \Sigma_2 = \Sigma$. The negative log-likelihood ratio is then

$$h(x) = (M_2 - M_1)^T\, \Sigma^{-1} x + \tfrac{1}{2}\left(M_1^T \Sigma^{-1} M_1 - M_2^T \Sigma^{-1} M_2\right)$$

Since x is Gaussian and h is linear in x, h is Gaussian. Its mean and variance are

$$E[h|\omega_1] = -\tfrac{1}{2}(M_2 - M_1)^T \Sigma^{-1} (M_2 - M_1) \equiv -\tfrac{1}{2}\eta$$

$$E[h|\omega_2] = +\tfrac{1}{2}(M_2 - M_1)^T \Sigma^{-1} (M_2 - M_1) = +\tfrac{1}{2}\eta$$

$$\mathrm{Var}[h|\omega_i] = (M_2 - M_1)^T \Sigma^{-1} (M_2 - M_1) = \eta, \qquad i = 1, 2$$
Calculable Case
The error rates become
[Figure: densities p(h|ω1) and p(h|ω2) with threshold t dividing the h axis into decision regions L1 and L2.]
$$\varepsilon_1 = \int_t^{\infty} p(h|\omega_1)\,dh = \frac{1}{2} - \mathrm{erf}\!\left(\frac{t + \eta/2}{\sqrt{\eta}}\right)$$

$$\varepsilon_2 = \int_{-\infty}^{t} p(h|\omega_2)\,dh = \frac{1}{2} - \mathrm{erf}\!\left(\frac{\eta/2 - t}{\sqrt{\eta}}\right)$$

where

$$\mathrm{erf}(z) \equiv \frac{1}{\sqrt{2\pi}} \int_0^z e^{-\xi^2/2}\, d\xi$$
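A quick numerical check of these expressions is straightforward. The sketch below (assuming NumPy and SciPy; the function name and example numbers are illustrative, not from the lecture) evaluates ε1 and ε2 via the Gaussian CDF Φ, which relates to the erf convention above by Φ(u) = 1/2 + erf(u).

```python
import numpy as np
from scipy.stats import norm

def error_rates_equal_cov(M1, M2, Sigma, t=0.0):
    """Epsilon_1 and epsilon_2 for Gaussian classes with equal covariance."""
    d = M2 - M1
    eta = d @ np.linalg.solve(Sigma, d)   # eta = (M2-M1)^T Sigma^{-1} (M2-M1)
    s = np.sqrt(eta)
    eps1 = norm.sf((t + eta / 2) / s)     # P(h > t | omega_1), h ~ N(-eta/2, eta)
    eps2 = norm.cdf((t - eta / 2) / s)    # P(h < t | omega_2), h ~ N(+eta/2, eta)
    return eps1, eps2

# Illustrative example: unit covariance, means one unit apart per coordinate
print(error_rates_equal_cov(np.zeros(2), np.ones(2), np.eye(2)))
```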
Back to Bounds
Chernoff Bound
The classification error rate is

$$\varepsilon = P_1 \int_{L_2} p(x|\omega_1)\,dx + P_2 \int_{L_1} p(x|\omega_2)\,dx = \int \min\left[P_1\, p(x|\omega_1),\; P_2\, p(x|\omega_2)\right] dx$$

Use the inequality

$$\min[a, b] \le a^s\, b^{1-s}; \qquad a, b \ge 0, \quad 0 \le s \le 1$$

to obtain

$$\varepsilon \le \varepsilon_U = \min_s\, P_1^s\, P_2^{1-s} \int p^s(x|\omega_1)\, p^{1-s}(x|\omega_2)\, dx = \min_s\, P_1^s\, P_2^{1-s}\, e^{-\mu(s)}$$

where $\mu(s) \equiv -\ln \int p^s(x|\omega_1)\, p^{1-s}(x|\omega_2)\, dx$.
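For Gaussian class-conditional densities, μ(s) has a closed form (its s = 1/2 case appears on the next slide), so the minimization over s reduces to a one-dimensional numerical search. A minimal sketch, assuming NumPy/SciPy; the function names and test values are illustrative, not from the lecture:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_mu(s, M1, M2, S1, S2):
    """Chernoff distance mu(s) for two Gaussian class-conditional densities."""
    d = M2 - M1
    S = s * S1 + (1 - s) * S2
    quad = s * (1 - s) / 2 * (d @ np.linalg.solve(S, d))
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_1 = np.linalg.slogdet(S1)
    _, logdet_2 = np.linalg.slogdet(S2)
    return quad + 0.5 * (logdet_S - s * logdet_1 - (1 - s) * logdet_2)

def chernoff_bound(P1, M1, M2, S1, S2):
    """min over s in (0,1) of P1^s P2^(1-s) exp(-mu(s))."""
    P2 = 1 - P1
    obj = lambda s: P1**s * P2**(1 - s) * np.exp(-chernoff_mu(s, M1, M2, S1, S2))
    res = minimize_scalar(obj, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return res.fun, res.x   # bound value and the optimal s

M1, M2 = np.zeros(2), np.ones(2)
S1, S2 = np.eye(2), 2 * np.eye(2)
print(chernoff_bound(0.5, M1, M2, S1, S2))
```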
Simplification – Bhattacharyya Bound
Don’t insist on minimizing with respect to s; just take s = 1/2:

$$\varepsilon \le \varepsilon_U = \sqrt{P_1 P_2} \int \sqrt{p(x|\omega_1)\, p(x|\omega_2)}\; dx = \sqrt{P_1 P_2}\; e^{-\mu(1/2)}$$

For x normal, the exponent is

$$\mu(1/2) = \tfrac{1}{8}\,(M_2 - M_1)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (M_2 - M_1) + \tfrac{1}{2} \ln \frac{\left|\tfrac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}$$
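Continuing the sketch from the Chernoff slide, the Bhattacharyya bound is the same objective evaluated at s = 1/2 instead of minimized, so it can never be tighter than the Chernoff bound:

```python
def bhattacharyya_bound(P1, M1, M2, S1, S2):
    """Bhattacharyya bound: the Chernoff objective fixed at s = 1/2."""
    mu_half = chernoff_mu(0.5, M1, M2, S1, S2)   # chernoff_mu from the sketch above
    return np.sqrt(P1 * (1 - P1)) * np.exp(-mu_half)

print(bhattacharyya_bound(0.5, M1, M2, S1, S2))  # >= the minimized Chernoff bound
```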
Bhattacharyya Bound – Special Cases
• Equal means ($M_1 = M_2$):

$$\mu(1/2) = \tfrac{1}{2} \ln \frac{\left|\tfrac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}$$

• Equal covariances ($\Sigma_1 = \Sigma_2 = \Sigma$):

$$\mu(1/2) = \tfrac{1}{8}\,(M_2 - M_1)^T\, \Sigma^{-1}\, (M_2 - M_1)$$
Single Hypothesis Tests
Single Hypothesis
May have data predominantly from one class — e.g. in failure analysis one usually has many examples of “healthy” function and only a few failures, possibly spread over many different failure modes. Given an observation x, we want to know whether it corresponds to a healthy or a failed example. Also called “novelty detection.”
May use some kind of distance measure – e.g.

$$d^2 = (x - m)^T\, \Sigma^{-1}\, (x - m)$$

where $\Sigma$ and $m$ are the covariance and mean of class $\omega_1$, the class that one is able to model well.
Single Hypothesis
To use this in classification: given x, compute $d^2$ and compare with a threshold,

$$d^2 \;\underset{\omega_1}{\overset{\text{not } \omega_1}{\gtrless}}\; c$$

where the threshold c is chosen based on both in-class and out-of-class data, or …
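A minimal sketch of this test in NumPy. The data and threshold below are illustrative; in practice m and Σ are estimated from the well-sampled class and c is tuned as discussed above:

```python
import numpy as np

def mahalanobis_d2(x, m, Sigma):
    """Squared Mahalanobis distance of x from the modeled class."""
    diff = x - m
    return diff @ np.linalg.solve(Sigma, diff)

def is_novel(x, m, Sigma, c):
    """Single-hypothesis test: flag x as 'not class 1' when d^2 > c."""
    return mahalanobis_d2(x, m, Sigma) > c

# Fit m and Sigma on healthy examples only (synthetic data for illustration)
rng = np.random.default_rng(0)
healthy = rng.normal(size=(500, 3))
m, Sigma = healthy.mean(axis=0), np.cov(healthy, rowvar=False)
print(is_novel(np.array([4.0, 4.0, 4.0]), m, Sigma, c=9.0))
```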
Single Hypothesis
Statistics of $d^2$
$d^2$ is a random variable, with mean and variance

$$E[d^2] = E\left[(x - m)^T \Sigma^{-1} (x - m)\right] = \mathrm{Trace}\left(\Sigma^{-1}\, E\left[(x - m)(x - m)^T\right]\right) = \mathrm{Trace}\left(\Sigma^{-1} \Sigma\right) = N$$

$$\mathrm{var}[d^2] = E[d^4] - \left(E[d^2]\right)^2 = 2N \quad \text{(for Gaussian } x\text{)}$$

Note: $d^2$ is the sum of N identically distributed random variables. For large dimension N we can invoke the central limit theorem, with the result that $d^2$ becomes normally distributed.
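A quick Monte Carlo check of these moments (illustrative, not from the lecture): for m = 0 and Σ = I the distance reduces to a sum of squares, and the sample mean and variance of d² come out near N and 2N.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10                                  # feature dimension
x = rng.normal(size=(100_000, N))       # class-1 samples: m = 0, Sigma = I
d2 = np.sum(x**2, axis=1)               # Mahalanobis distance for identity Sigma
print(d2.mean(), d2.var())              # ~N and ~2N; near-Gaussian for large N
```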
Single Hypothesis
Other “distance” measures
– Model the probability distribution for objects in the known class – e.g. as a Gaussian mixture

$$p(x|\omega_1) = \sum_{k=1}^{Q} a_k\, (2\pi)^{-n/2}\, |\Sigma_k|^{-1/2} \exp\left[-\tfrac{1}{2}\,(x - m_k)^T\, \Sigma_k^{-1}\, (x - m_k)\right]$$

where the parameters $a_k, m_k, \Sigma_k$ are fit by maximum likelihood. Then classify a new datum $x_0$ using a threshold test

$$\ln p(x_0|\omega_1) \;\underset{\text{not } \omega_1}{\overset{\omega_1}{\gtrless}}\; c$$
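One way to realize this is with scikit-learn's GaussianMixture, whose score_samples method returns ln p(x|ω1) under the fitted mixture. A sketch under assumed settings (the component count, data, and rejection quantile are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
healthy = rng.normal(size=(1000, 3))          # known-class training data

gmm = GaussianMixture(n_components=4, random_state=0).fit(healthy)

# score_samples returns ln p(x | omega_1); threshold it to flag novelties
c = np.quantile(gmm.score_samples(healthy), 0.01)   # e.g. 1% in-class rejection
x0 = np.array([[5.0, 5.0, 5.0]])
print("novel" if gmm.score_samples(x0)[0] < c else "class 1")
```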
Single Hypothesis
Other “distance” measures, continued
– Model the known class by a PCA subspace. Then classify a new datum $x_0$ by a threshold test on its perpendicular distance to the hyperplane. The hyperplane serves as a geometric model of the known class. See Oja, Subspace Methods of Pattern Recognition, Wiley and Sons, for other examples.

[Figure: datum x0 and its perpendicular distance to the hyperplane spanned by the leading m principal components.]
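A minimal sketch of the subspace test, assuming NumPy (the function name and data are illustrative): take the leading m right singular vectors of the centered training data as the hyperplane basis, and threshold the squared residual of x0 off that subspace.

```python
import numpy as np

def pca_residual_d2(x0, X, m_components):
    """Squared perpendicular distance from x0 to the PCA hyperplane of X."""
    mean = X.mean(axis=0)
    # Leading m principal directions from the SVD of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:m_components].T                     # (n_features, m) basis
    r = (x0 - mean) - V @ (V.T @ (x0 - mean))   # residual off the subspace
    return r @ r

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ np.diag([3, 2, 1, 0.1, 0.1])
print(pca_residual_d2(np.ones(5), X, m_components=2))  # compare to a threshold c
```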
Single Hypothesis
Other distance measures (cont’d)
– Model the known class as a curved manifold (constructed e.g. with a neural network). Measure the distance between a datum $x_0$ and its projection $F(x_0)$ onto the manifold. The threshold test is

$$d^2\left(x_0;\, F(x_0)\right) \;\underset{\omega_1}{\overset{\text{not } \omega_1}{\gtrless}}\; c$$

[Figure: datum x0 and its projection F(x0) onto the curved manifold.]
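A sketch of the test, assuming the projection F comes from some fitted model (e.g. an autoencoder's encode-decode map). Here, as a stand-in, F projects onto the unit circle, a simple curved manifold; the names and threshold are illustrative:

```python
import numpy as np

def F(x):
    """Stand-in manifold projection: nearest point on the unit circle.
    In practice F would be, e.g., a trained autoencoder's reconstruction."""
    return x / np.linalg.norm(x)

def manifold_test(x0, c):
    """Flag x0 as 'not class 1' when its distance to the manifold exceeds c."""
    d2 = np.sum((x0 - F(x0))**2)
    return d2 > c

print(manifold_test(np.array([1.1, 0.1]), c=0.25))   # near the circle -> class 1
print(manifold_test(np.array([3.0, 3.0]), c=0.25))   # far away -> novel
```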
Problems with Single Hypothesis Tests
• Single hypothesis tests usually perform worse than full Bayesian tests: if you have enough data to model both classes, performance should improve.
• We turned to the single hypothesis test because one of the two classes is too sparsely sampled to model well. But that raises the question: how does one pick the decision threshold c?