ANLY-601
Advanced Pattern Recognition
Spring 2018
L3-4 – Bayes Classifiers (cont’d)
General Cost
• Suppose each of the two classification error
types has a different cost. What's the ideal
decision strategy?
e.g. In a detection problem (medical, fire alarm),
false negatives may be much more costly than
false positives.
• Define an average “loss” (or cost) function and
devise a decision rule that minimizes it.
Loss Function
$\lambda_{ij}$ is the cost incurred for choosing class $\omega_i$ when $\omega_j$ is the
actual class.
The function to minimize is the average cost, or "risk":

$$R = \lambda_{11}\,P(\text{choose }\omega_1 \mid \omega_1\text{ true})\,P_1
+ \lambda_{21}\,P(\text{choose }\omega_2 \mid \omega_1\text{ true})\,P_1
+ \lambda_{12}\,P(\text{choose }\omega_1 \mid \omega_2\text{ true})\,P_2
+ \lambda_{22}\,P(\text{choose }\omega_2 \mid \omega_2\text{ true})\,P_2$$
Loss Function
Now

$$P(\text{choose }\omega_2 \mid \omega_1\text{ true})\,P_1 = \int_{L_2} p(x \mid \omega_1)\,P_1\,d^n x$$

and similarly for the other terms. So

$$R = \int_{L_1} \left[ \lambda_{11}\,p(x \mid \omega_1)\,P_1 + \lambda_{12}\,p(x \mid \omega_2)\,P_2 \right] d^n x
+ \int_{L_2} \left[ \lambda_{21}\,p(x \mid \omega_1)\,P_1 + \lambda_{22}\,p(x \mid \omega_2)\,P_2 \right] d^n x$$
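The risk above can be evaluated in closed form for simple class conditionals. A minimal sketch (my illustration, not from the slides), assuming hypothetical unit-variance Gaussian conditionals $p(x\mid\omega_1) = N(0,1)$, $p(x\mid\omega_2) = N(2,1)$ and a decision region $L_1 = \{x < b\}$:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk(b, P1, lam):
    """Average cost R for the rule 'choose w1 iff x < b'.

    lam[i][j] = cost of choosing class i+1 when class j+1 is true.
    Assumes p(x|w1) = N(0,1), p(x|w2) = N(2,1) (illustrative choice).
    """
    P2 = 1.0 - P1
    c11 = phi(b)          # P(choose w1 | w1 true)
    c21 = 1.0 - c11       # P(choose w2 | w1 true)
    c12 = phi(b - 2.0)    # P(choose w1 | w2 true)
    c22 = 1.0 - c12       # P(choose w2 | w2 true)
    return (lam[0][0] * c11 * P1 + lam[1][0] * c21 * P1
          + lam[0][1] * c12 * P2 + lam[1][1] * c22 * P2)
```

With zero-one costs and equal priors, the risk is smallest at b = 1, the midpoint of the two means.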
Re-write Loss
Note that

$$\int_{L_2} p(x \mid \omega_i)\,d^n x = 1 - \int_{L_1} p(x \mid \omega_i)\,d^n x$$

so

$$R = \lambda_{21} P_1 + \lambda_{22} P_2
+ \int_{L_1} \left[ (\lambda_{11} - \lambda_{21})\,p(x \mid \omega_1)\,P_1
+ (\lambda_{12} - \lambda_{22})\,p(x \mid \omega_2)\,P_2 \right] d^n x$$
Minimum Loss
To minimize

$$R = \lambda_{21} P_1 + \lambda_{22} P_2
+ \int_{L_1} \left[ (\lambda_{11} - \lambda_{21})\,p(x \mid \omega_1)\,P_1
+ (\lambda_{12} - \lambda_{22})\,p(x \mid \omega_2)\,P_2 \right] d^n x$$

we want the integrand to be as negative as possible, so assign $x$ to $L_1$ whenever

$$(\lambda_{12} - \lambda_{22})\,p(x \mid \omega_2)\,P_2 < (\lambda_{21} - \lambda_{11})\,p(x \mid \omega_1)\,P_1$$

This tells us how to assign each x to either L1 or L2.
Minimum Loss
Or, dividing both sides by $(\lambda_{21} - \lambda_{11})\,P_1\,p(x \mid \omega_2)$ (positive in the usual case where an error costs more than the corresponding correct decision):

$$l(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\;
\frac{(\lambda_{12} - \lambda_{22})\,P_2}{(\lambda_{21} - \lambda_{11})\,P_1}$$

another likelihood ratio test.
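As a sketch of the general-cost rule (my illustration, not the slides'), again assuming hypothetical conditionals $p(x\mid\omega_1)=N(0,1)$ and $p(x\mid\omega_2)=N(2,1)$: build the threshold from the costs and priors, then compare the likelihood ratio against it.

```python
import math

def gauss_pdf(x, m):
    """Unit-variance Gaussian density with mean m."""
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

def classify(x, P1, lam):
    """Likelihood ratio test with general costs lam[i][j].

    Returns 1 or 2. Assumes lam[1][0] > lam[0][0] (an error costs
    more than the corresponding correct decision).
    """
    P2 = 1.0 - P1
    eta = (lam[0][1] - lam[1][1]) * P2 / ((lam[1][0] - lam[0][0]) * P1)
    l = gauss_pdf(x, 0.0) / gauss_pdf(x, 2.0)   # l(x) = p(x|w1)/p(x|w2)
    return 1 if l > eta else 2
```

Raising $\lambda_{21}$ (the cost of choosing $\omega_2$ when $\omega_1$ is true) lowers the threshold, so the $\omega_1$ region grows.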
Neyman-Pearson Test
Suppose we don't know the cost of each type of error, or the priors.
How do we proceed?
Can only work with the conditional errors

$$\varepsilon_1 = P(\text{choose }\omega_2 \mid \omega_1\text{ true}) = \int_{L_2} p(x \mid \omega_1)\,d^n x$$

$$\varepsilon_2 = P(\text{choose }\omega_1 \mid \omega_2\text{ true}) = \int_{L_1} p(x \mid \omega_2)\,d^n x$$

A way to proceed is to minimize $\varepsilon_1$ subject to some specified
acceptable $\varepsilon_2 = \varepsilon_0$.
This is a constrained minimization problem that uses the Lagrange
multiplier formulation.
Neyman-Pearson Test
We want to minimize $\varepsilon_1$ subject to the constraint
$\varepsilon_2 = \varepsilon_0$. The Lagrangian (the function to minimize) is

$$r = \varepsilon_1 + \mu\,(\varepsilon_2 - \varepsilon_0)
= \int_{L_2} p(x \mid \omega_1)\,d^n x
+ \mu \left( \int_{L_1} p(x \mid \omega_2)\,d^n x - \varepsilon_0 \right)$$
Neyman-Pearson
Re-write the Lagrangian using

$$\int_{L_2} p(x \mid \omega_1)\,d^n x = 1 - \int_{L_1} p(x \mid \omega_1)\,d^n x$$

to get

$$r = 1 - \mu\,\varepsilon_0 + \int_{L_1} \left[ \mu\,p(x \mid \omega_2) - p(x \mid \omega_1) \right] d^n x$$

r will be minimized when the integrand is kept negative, so the
decision rule is

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; \mu$$

another likelihood ratio test.
Neyman-Pearson Test
• What's the threshold $\mu$? It's set by requiring that
the constraint be satisfied:

$$\varepsilon_2 = \int_{L_1} p(x \mid \omega_2)\,d^n x = \varepsilon_0$$
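Numerically, this threshold can be found by searching for the decision boundary that meets the constraint. A sketch under a hypothetical Gaussian setup ($p(x\mid\omega_1)=N(0,1)$, $p(x\mid\omega_2)=N(2,1)$, so $L_1 = \{x < b\}$), using bisection on $b$:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def neyman_pearson_boundary(eps0, lo=-10.0, hi=10.0, tol=1e-12):
    """Find b such that eps2 = P(x < b | w2) = phi(b - 2) equals eps0.

    phi(b - 2) is increasing in b, so plain bisection works.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid - 2.0) < eps0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The resulting $\varepsilon_1 = 1 - \Phi(b)$ is then the smallest type-1 error achievable at the allowed level $\varepsilon_0$.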
Hypothesis Test
Continued
Minimax Test --
We've been using likelihood ratio tests like

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\;
\frac{(\lambda_{12} - \lambda_{22})\,P_2}{(\lambda_{21} - \lambda_{11})\,P_1}$$

but what happens if the priors change after the test is
designed?
One approach - construct the test so that its performance is
no worse than the worst possible Bayes test.
Minimax Test
Rewrite the expected loss, using $P_2 = 1 - P_1$:

$$R = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{L_1} p(x \mid \omega_2)\,d^n x
+ P_1 \left[ (\lambda_{11} - \lambda_{22})
+ (\lambda_{21} - \lambda_{11}) \int_{L_2} p(x \mid \omega_1)\,d^n x
- (\lambda_{12} - \lambda_{22}) \int_{L_1} p(x \mid \omega_2)\,d^n x \right] \qquad (***)$$

What's this look like as a function of P1?
Minimax Test
[Figure: R versus P1, for P1 from 0 to 1.0. One curve is the Bayes expected cost with L1 and L2 chosen optimally for each prior. A straight line shows the expected cost for a model with L1 and L2 chosen for a design prior P1^0, then used at varying P1 — by eqn (***), with the regions fixed this cost is linear in P1, and it touches the Bayes curve at P1^0.]
Minimax Test
[Figure: same plot. At the peak of the Bayes cost curve the fixed-design line is horizontal, so performance is no worse than the worst Bayes performance. This is the best operating point if the priors are not known.]
Want dR/dP1 for CONSTANT L1 and L2 to be zero.
Minimax Test
From eqn (***):

$$\left. \frac{dR}{dP_1} \right|_{\text{fixed } L_1, L_2}
= (\lambda_{11} - \lambda_{22})
+ (\lambda_{21} - \lambda_{11})\,\varepsilon_1
- (\lambda_{12} - \lambda_{22})\,\varepsilon_2 = 0$$

For example, if the costs for the two kinds of correct decision are the same,
$\lambda_{11} = \lambda_{22}$, and the costs for the two types of error are the same,
$\lambda_{12} = \lambda_{21}$, the above condition becomes $\varepsilon_1 = \varepsilon_2$:
choose the operating point
that gives equal rates for
both kinds of error.
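The equal-error operating point can be found numerically. A sketch (my illustration, assuming hypothetical conditionals $p(x\mid\omega_1)=N(0,1)$, $p(x\mid\omega_2)=N(2,1)$), locating the boundary $b$ where $\varepsilon_1 = \varepsilon_2$ by bisection:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimax_boundary(lo=-10.0, hi=10.0, tol=1e-12):
    """Boundary b (choose w1 iff x < b) where eps1 = eps2.

    f(b) = eps1 - eps2 = (1 - phi(b)) - phi(b - 2) is decreasing in b,
    so bisection converges to its unique zero.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (1.0 - phi(mid)) - phi(mid - 2.0) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For these symmetric conditionals the answer is the midpoint b = 1, where both error rates equal $1 - \Phi(1) \approx 0.159$.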
ROC Curves
• Likelihood ratio tests are threshold tests, with the
threshold defined by the priors, and the decision costs.
As the priors and decision costs change, the threshold
changes and the rate of each kind of error changes.
• The Receiver Operating Characteristic (or ROC) curve
shows the system behavior over the full range of
thresholds.
• The ROC is determined only by the class conditional
probability distributions for the measured features.
ROC Curves
Recall the error rates

$$\varepsilon_j = P(\text{choose }\omega_i \mid \omega_j\text{ true}) = \int_{L_i} p(x \mid \omega_j)\,d^n x, \qquad i \neq j$$

where the different regions L1 and L2 are defined by the
likelihood ratio test

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; \eta$$

To display the system performance at a glance, we'll plot
the error rates as a function of threshold. This is the ROC
curve.
ROC Curves
[Figure: ROC curve plotting $1 - \varepsilon_1$ against $\varepsilon_2$, traced out as the threshold $\eta$ of the likelihood ratio test varies; at $\eta = 0$ every x is assigned to $\omega_1$, giving the point (1, 1).]
ROC Curves
• Concave downward
• ROCs are above the diagonal
• Slope = threshold
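A sketch of tracing a ROC numerically (my illustration, assuming hypothetical conditionals $p(x\mid\omega_1)=N(0,1)$, $p(x\mid\omega_2)=N(2,1)$): sweep the decision boundary and record $(\varepsilon_2,\; 1-\varepsilon_1)$ pairs.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def roc_points(n=99):
    """Sweep the boundary b (choose w1 iff x < b) and return ROC points.

    Each point is (eps2, 1 - eps1): false-alarm rate vs. detection rate.
    """
    pts = []
    for i in range(n):
        b = -5.0 + 10.0 * i / (n - 1)
        eps1 = 1.0 - phi(b)        # w1 samples falling in L2
        eps2 = phi(b - 2.0)        # w2 samples falling in L1
        pts.append((eps2, 1.0 - eps1))
    return pts
```

Every computed point lies on or above the diagonal, consistent with the bullet points above.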
Log-Likelihood
Sometimes it's more convenient to work with the
conditional distribution of the negative log-likelihood
than the conditional distribution of the features x.
Recall the likelihood ratio test with threshold $\eta$:

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;\overset{\omega_1}{\underset{\omega_2}{\gtrless}}\; \eta$$

In terms of the negative log-likelihood h, this is

$$h(x) = -\log \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;\overset{\omega_2}{\underset{\omega_1}{\gtrless}}\; -\log \eta$$

x is a random vector, so h is a
random scalar, and has distribution
$p(h \mid \omega_i)$ when $\omega_i$ is true.
Log Likelihood
We can rewrite the error probabilities in terms of integrals
over the distribution for h:

$$\varepsilon_1 = \int_{L_2} p(x \mid \omega_1)\,d^n x = \int_{-\log\eta}^{\infty} p(h \mid \omega_1)\,dh$$

$$\varepsilon_2 = \int_{L_1} p(x \mid \omega_2)\,d^n x = \int_{-\infty}^{-\log\eta} p(h \mid \omega_2)\,dh$$
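A sketch of this change of variables (my illustration): assuming hypothetical conditionals $p(x\mid\omega_1)=N(0,1)$ and $p(x\mid\omega_2)=N(2,1)$, the statistic $h(x) = -\log l(x) = 2x - 2$ is itself Gaussian under each class ($N(-2,4)$ under $\omega_1$, $N(2,4)$ under $\omega_2$), so the h-space integrals can be checked against the x-space ones.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed setup: p(x|w1) = N(0,1), p(x|w2) = N(2,1), so
# h(x) = -log[p(x|w1)/p(x|w2)] = 2x - 2, which is N(-2,4) under w1
# and N(2,4) under w2.

def errors_via_h(eta):
    """(eps1, eps2) from the distribution of h, threshold -log(eta)."""
    t = -math.log(eta)
    eps1 = 1.0 - phi((t + 2.0) / 2.0)   # P(h > t | w1), h ~ N(-2,4)
    eps2 = phi((t - 2.0) / 2.0)         # P(h < t | w2), h ~ N(2,4)
    return eps1, eps2

def errors_via_x(eta):
    """Same error rates computed directly in x-space."""
    b = (2.0 - math.log(eta)) / 2.0     # choose w1 iff x < b
    return 1.0 - phi(b), phi(b - 2.0)
```

Both routes give the same numbers, as they must: h is just a deterministic function of x.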