VI Lecture
Discrimination and Classification
6.1. Introduction. Separation and Classification for two populations
Discriminant analysis and classification are widely used multivariate techniques. The
goal is either separating sets of objects (in discriminant analysis terminology) or allocating
new objects to given groups (in classification theory terminology).
Basically, discriminant analysis is more exploratory in nature than classification. However, the distinction is not sharp, especially because a function that separates may often serve as an allocator and, conversely, an allocation rule may suggest a discriminatory procedure. In practice, the goals of the two procedures frequently overlap.
We will consider the case of two populations (classes of objects) first.
Typical examples include: an anthropologist wants to classify a skull as male or female; a patient needs to be classified as needing surgery or not needing surgery, etc.
Denote the two classes by π1 and π2. The separation is to be performed on the basis of
measurements of p associated random variables that form a vector X ∈ Rp. The observed
values of X belong to different distributions when taken from π1 and π2 and we shall
denote the densities of these two distributions by f1(x) and f2(x), respectively.
Allocation or classification is possible due to the fact that one has a learning sample
at hand, i.e. there are some measurement vectors that are known to have been generated
from each of the two populations. These measurements have been generated in earlier
similar experiments. The goal is to partition the sample space into 2 mutually exclusive
regions, say R1 and R2, such that if a new observation falls in R1, it is allocated to π1 and
if it falls in R2, it is allocated to π2.
6.2. Classification errors.
There is always a chance of an erroneous classification (misclassification). Our goal will be to develop classification methods that minimize, in a suitably defined sense, the chances of misclassification.
It should be noted that one of the two classes may have a greater likelihood of occurrence because one of the two populations might be much larger than the other. For example, there tend to be many more financially sound companies than bankrupt companies. These prior probabilities of occurrence should also be taken into account when constructing a classification rule if we want it to perform optimally.
In a more detailed study of optimal classification rules, cost is also important. If
classifying a π1 object to the class π2 represents a much more serious error than classifying
a π2 object to the class π1 then these cost differences should also be taken into account
when designing the optimal rule.
The conditional probabilities for misclassification are defined naturally as:
P(2|1) = P(X ∈ R2 | π1) = ∫_{R2} f1(x) dx    (6.1)
P(1|2) = P(X ∈ R1 | π2) = ∫_{R1} f2(x) dx    (6.2)
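To make these definitions concrete, here is a minimal Monte Carlo sketch in Python that approximates P(2|1) and P(1|2) by simulation; the two univariate normal populations and the cut-off rule R1 = {x < 0} are purely illustrative assumptions, not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative populations (assumed): pi_1 = N(-1, 1), pi_2 = N(+1, 1).
    # Illustrative classification rule (assumed): R1 = {x < 0}, R2 = {x >= 0}.
    n = 100_000
    x_from_pi1 = rng.normal(-1.0, 1.0, n)   # draws from f1
    x_from_pi2 = rng.normal(+1.0, 1.0, n)   # draws from f2

    P_2_given_1 = np.mean(x_from_pi1 >= 0)  # approximates P(X in R2 | pi_1)
    P_1_given_2 = np.mean(x_from_pi2 < 0)   # approximates P(X in R1 | pi_2)
    print(P_2_given_1, P_1_given_2)         # both should be close to 0.1587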
6.3. Optimal classification rules
i) rules that minimize the expected cost of misclassification (ECM)
Denote by pi the prior probability of πi, i = 1, 2 (p1 + p2 = 1). Then the overall probabilities of incorrectly classifying objects will be: P(misclassified as π1) = P(1|2)p2 and P(misclassified as π2) = P(2|1)p1. Further, let c(i|j), i ≠ j, i, j = 1, 2, be the misclassification costs. Then the expected cost of misclassification is
ECM = c(2|1)P(2|1)p1 + c(1|2)P(1|2)p2    (6.3)
Lemma 6.3.1. The regions R1 and R2 that minimize ECM are given by
R1 = {x : f1(x)/f2(x) ≥ (c(1|2)/c(2|1)) · (p2/p1)}    (6.4)
and
R2 = {x : f1(x)/f2(x) < (c(1|2)/c(2|1)) · (p2/p1)}    (6.5)
Proof. Using ∫_{R2} f1(x) dx = 1 − ∫_{R1} f1(x) dx, it is easy to see that
ECM = ∫_{R1} [c(1|2)p2 f2(x) − c(2|1)p1 f1(x)] dx + c(2|1)p1.
Hence, the ECM will be minimized if R1 includes exactly those values of x for which the integrand c(1|2)p2 f2(x) − c(2|1)p1 f1(x) ≤ 0 and excludes all the complementary values.
Note the significance of the fact that in Lemma 6.3.1 only ratios are involved. In practice, one often has a much clearer idea about the cost ratio than about the actual costs themselves.
As an exercise, consider the special cases of Lemma 6.3.1 when p2 = p1, when c(1|2) = c(2|1), and when both equalities hold. Comment on the soundness of the classification regions in these cases.
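The ECM rule of Lemma 6.3.1 translates directly into code: compare the density ratio f1(x0)/f2(x0) with the threshold (c(1|2)/c(2|1))(p2/p1). The sketch below (Python; the densities, priors and costs in the example are illustrative assumptions) shows one way this could look.

    from scipy.stats import norm

    def ecm_allocate(x0, f1, f2, p1, p2, c12, c21):
        """Minimum-ECM allocation of x0 via (6.4)-(6.5).
        c12 = c(1|2): cost of classifying a pi_2 object as pi_1;
        c21 = c(2|1): cost of classifying a pi_1 object as pi_2."""
        threshold = (c12 / c21) * (p2 / p1)
        return 1 if f1(x0) / f2(x0) >= threshold else 2

    # Illustrative densities, priors and costs (assumed for the example only):
    f1 = lambda x: norm.pdf(x, loc=-1.0, scale=1.0)   # density of pi_1
    f2 = lambda x: norm.pdf(x, loc=+1.0, scale=1.0)   # density of pi_2
    print(ecm_allocate(0.3, f1, f2, p1=0.7, p2=0.3, c12=1.0, c21=5.0))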
ii) rules that minimize the total probability of misclassification (TPM)
If we ignore the costs of misclassification, we can define the total probability of misclassification as
TPM = p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx
Mathematically, this is a particular case of i) when the costs of misclassification are equal, so nothing new arises here.
iii) Bayesian approach
Here we try to allocate a new observation x0 to the population with the larger posterior probability P(πi|x0), i = 1, 2. According to Bayes' formula we have
P(π1|x0) = p1 f1(x0) / [p1 f1(x0) + p2 f2(x0)],    P(π2|x0) = p2 f2(x0) / [p1 f1(x0) + p2 f2(x0)]
Mathematically, the strategy of classifying an observation x0 as π1 if P(π1|x0) > P(π2|x0) is again a particular case of i) when the costs of misclassification are equal. (Why?) But note that calculating the posterior probabilities is in itself a useful and informative operation.
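A minimal sketch of the posterior calculation (Python; the densities and priors in the example are again assumed only for illustration): it evaluates P(π1|x0) and P(π2|x0) from Bayes' formula above.

    from scipy.stats import norm

    def posteriors(x0, f1, f2, p1, p2):
        """Posterior probabilities P(pi_1 | x0) and P(pi_2 | x0) by Bayes' formula."""
        a, b = p1 * f1(x0), p2 * f2(x0)
        return a / (a + b), b / (a + b)

    # Illustrative densities and priors (assumed for the example only):
    f1 = lambda x: norm.pdf(x, loc=-1.0, scale=1.0)
    f2 = lambda x: norm.pdf(x, loc=+1.0, scale=1.0)
    post1, post2 = posteriors(0.3, f1, f2, p1=0.5, p2=0.5)
    print(post1, post2)   # allocate x0 to the population with the larger posterior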
6.4. Classification with two multivariate normal populations
Until now we have not specified any particular form of the densities f1(x) and f2(x). Essential simplification occurs under the normality assumption, and we now turn to a more detailed discussion of this particular case. Two different situations will be considered: equal and unequal covariance matrices.
6.4.1. Case of equal covariance matrices Σ1 = Σ2 = Σ
Now we assume that the two populations π1 and π2 are Np(µ1,Σ) and Np(µ2,Σ),
correspondingly.
Then (6.4) becomes
R1 = {x : exp[−(1/2)(x − µ1)′Σ⁻¹(x − µ1) + (1/2)(x − µ2)′Σ⁻¹(x − µ2)] ≥ (c(1|2)/c(2|1)) · (p2/p1)}
Similarly, from (6.5) we get
R2 = {x : exp[−(1/2)(x − µ1)′Σ⁻¹(x − µ1) + (1/2)(x − µ2)′Σ⁻¹(x − µ2)] < (c(1|2)/c(2|1)) · (p2/p1)}
and we arrive at the following result:
Theorem 6.4.1. Under the assumptions in 6.4.1, the allocation rule that minimizes
the ECM is given by:
allocate x0 to π1 if
(µ1 − µ2)′Σ⁻¹x0 − (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2) ≥ log[(c(1|2)/c(2|1)) · (p2/p1)]
Otherwise, allocate x0 to π2.
Proof. Simple exercise (to be discussed at lectures).
Note also that in most situations it is unrealistic to assume that the parameters µ1, µ2 and Σ are known. They will need to be estimated from the data instead. Assume that n1 and n2 observations are available from the first and from the second population, respectively. If x̄1 and x̄2 are the sample mean vectors and S1, S2 the corresponding sample covariance matrices, then under the assumption Σ1 = Σ2 = Σ we can derive the pooled covariance matrix estimator
Spooled = [(n1 − 1)S1 + (n2 − 1)S2] / (n1 + n2 − 2).
(This is an unbiased estimator of Σ!)
Hence the sample classification rule becomes: allocate x0 to π1 if
(x̄1 − x̄2)′Spooled⁻¹x0 − (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2) ≥ log[(c(1|2)/c(2|1)) · (p2/p1)]    (6.6)
Otherwise, allocate x0 to π2. This empirical classification rule is called an allocation rule based on Fisher’s discriminant function. The function
(x̄1 − x̄2)′Spooled⁻¹x0 − (1/2)(x̄1 − x̄2)′Spooled⁻¹(x̄1 + x̄2)
itself (which is linear in the vector observation x0) is called Fisher’s linear discriminant function.
Of course, this sample rule is only an estimate of the optimal rule, since the unknown parameters have been replaced by estimated quantities, but we expect it to perform well when n1 and n2 are large. It should be pointed out that the allocation rule in (6.6) is linear in the new observation x0; the simplicity of its form is a consequence of the multivariate normality assumption.
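The sample rule (6.6) is straightforward to implement. Below is a minimal sketch in Python with NumPy, assuming two training samples X1 and X2 (the simulated data at the end are an illustrative assumption): it estimates x̄1, x̄2 and Spooled and allocates a new x0 by comparing Fisher's linear discriminant function with the log threshold.

    import numpy as np

    def fisher_lda_allocate(X1, X2, x0, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
        """Empirical minimum-ECM rule (6.6) based on Fisher's linear discriminant function.
        X1, X2: (n1 x p) and (n2 x p) training samples from pi_1 and pi_2."""
        n1, n2 = len(X1), len(X2)
        xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
        S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
        S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
        a = np.linalg.solve(S_pooled, xbar1 - xbar2)   # S_pooled^{-1} (xbar1 - xbar2)
        lhs = a @ x0 - 0.5 * a @ (xbar1 + xbar2)       # Fisher's linear discriminant function
        rhs = np.log((c12 / c21) * (p2 / p1))
        return 1 if lhs >= rhs else 2

    # Illustrative training data (assumed): two bivariate normals with a common covariance.
    rng = np.random.default_rng(1)
    X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=50)
    X2 = rng.multivariate_normal([2.0, 2.0], np.eye(2), size=60)
    print(fisher_lda_allocate(X1, X2, x0=np.array([1.5, 1.8])))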
6.4.2. Case of different covariance matrices (Σ1 ≠ Σ2)
Now we assume that the two populations π1 and π2 are Np(µ1,Σ1) and Np(µ2,Σ2), correspondingly. Repeating the same steps as in 6.4.1, we get
R1 = {x : −(1/2)x′(Σ1⁻¹ − Σ2⁻¹)x + (µ1′Σ1⁻¹ − µ2′Σ2⁻¹)x − k ≥ log[(c(1|2)/c(2|1)) · (p2/p1)]}
R2 = {x : −(1/2)x′(Σ1⁻¹ − Σ2⁻¹)x + (µ1′Σ1⁻¹ − µ2′Σ2⁻¹)x − k < log[(c(1|2)/c(2|1)) · (p2/p1)]}
where k = (1/2) log(|Σ1|/|Σ2|) + (1/2)(µ1′Σ1⁻¹µ1 − µ2′Σ2⁻¹µ2), and we see that the classification regions are quadratic functions of the new observation in this case. One obtains the following rule:
allocate x0 to π1 if
−(1/2)x0′(S1⁻¹ − S2⁻¹)x0 + (x̄1′S1⁻¹ − x̄2′S2⁻¹)x0 − k̂ ≥ log[(c(1|2)/c(2|1)) · (p2/p1)]
where k̂ is the empirical analog of k. Allocate x0 to π2 otherwise.
When Σ1 = Σ2 the quadratic term disappears and we can easily see that the classification regions from 6.4.1 are obtained. Of course, the case considered in 6.4.2 is more general, but we should be cautious when applying it in practice. It turns out that in more than two dimensions, classification rules based on quadratic functions do not always perform well and can lead to strange results. This is especially true when the data are not quite normal and when the differences in the covariance matrices are significant: the rule is very sensitive (non-robust) to departures from normality. Therefore, it is advisable first to transform the data to be more nearly normal by using some classical normality transformations. A detailed discussion of these effects will be provided during the lecture.
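For completeness, here is a minimal sketch of the sample quadratic rule from this subsection (Python with NumPy; the simulated training data are an illustrative assumption): it estimates x̄1, x̄2, S1, S2, computes k̂, and applies the quadratic inequality above.

    import numpy as np

    def quadratic_allocate(X1, X2, x0, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
        """Sample version of the quadratic (unequal-covariance) rule of 6.4.2."""
        xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
        S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
        S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
        k_hat = (0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
                 + 0.5 * (xbar1 @ S1inv @ xbar1 - xbar2 @ S2inv @ xbar2))
        lhs = (-0.5 * x0 @ (S1inv - S2inv) @ x0
               + (xbar1 @ S1inv - xbar2 @ S2inv) @ x0 - k_hat)
        return 1 if lhs >= np.log((c12 / c21) * (p2 / p1)) else 2

    # Illustrative training data with clearly different covariance matrices (assumed):
    rng = np.random.default_rng(2)
    X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=80)
    X2 = rng.multivariate_normal([2.0, 1.0], [[2.0, -0.5], [-0.5, 0.5]], size=70)
    print(quadratic_allocate(X1, X2, x0=np.array([1.0, 0.5])))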
6.4.3. Optimum error rate and Mahalanobis distance
We defined the TPM quantity in general terms for any classification rule (see 6.3). When the regions R1 and R2 are selected in an optimal way, one obtains the minimal value of TPM, which is called the optimum error rate (OER) and is used to characterize the difficulty of the classification problem at hand. Here we shall illustrate the calculation of the OER for the simple case of two normal populations with Σ1 = Σ2 = Σ and prior probabilities p1 = p2 = 1/2. In this case
TPM = (1/2) ∫_{R2} f1(x) dx + (1/2) ∫_{R1} f2(x) dx
and OER is obtained by choosing
R1 = {x : (µ1 − µ2)′Σ⁻¹x − (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2) ≥ 0}
R2 = {x : (µ1 − µ2)′Σ⁻¹x − (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2) < 0}
If we introduce the random variable Y = (µ1 − µ2)′Σ⁻¹x = l′x, then Y ∼ N1(µiY, ∆²), i = 1, 2, for the two populations π1 and π2, where µiY = (µ1 − µ2)′Σ⁻¹µi, i = 1, 2. The quantity
∆ = √[(µ1 − µ2)′Σ⁻¹(µ1 − µ2)]
is the Mahalanobis distance between the two normal populations, and it has an important role in many applications of Multivariate Analysis.
Now
P(2|1) = P(Y < (1/2)(µ1 − µ2)′Σ⁻¹(µ1 + µ2)) = P((Y − µ1Y)/∆ < −∆/2) = Φ(−∆/2),
with Φ(·) denoting the cumulative distribution function of the standard normal. Along the same lines we can get (do it!): P(1|2) = Φ(−∆/2), so that finally OER = minimum TPM = Φ(−∆/2).
In practice ∆ is replaced by its estimated value ∆̂ = √[(x̄1 − x̄2)′Spooled⁻¹(x̄1 − x̄2)].
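A short sketch of this plug-in calculation (Python with NumPy/SciPy; the simulated training samples are an illustrative assumption): it computes ∆̂ from the pooled covariance matrix and reports Φ(−∆̂/2) as the estimated optimum error rate.

    import numpy as np
    from scipy.stats import norm

    def estimated_oer(X1, X2):
        """Estimated Mahalanobis distance and plug-in optimum error rate Phi(-Delta_hat / 2)
        (equal covariance matrices and equal priors assumed)."""
        n1, n2 = len(X1), len(X2)
        xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
        S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                    (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
        d = xbar1 - xbar2
        delta_hat = np.sqrt(d @ np.linalg.solve(S_pooled, d))
        return delta_hat, norm.cdf(-delta_hat / 2)

    # Illustrative samples (assumed): true Delta = 2, so OER should be near Phi(-1) = 0.1587.
    rng = np.random.default_rng(3)
    X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=100)
    X2 = rng.multivariate_normal([2.0, 0.0], np.eye(2), size=100)
    print(estimated_oer(X1, X2))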
6.5. Classification with more than 2 normal populations.
Formal generalization of the theory for the case of g > 2 groups π1, π2, . . . , πg is
straightforward, but optimal error rate analysis is difficult when g > 2. It is easy to see that the ECM classification rule with equal misclassification costs now becomes (compare with (6.4) and (6.5)):
Allocate x0 to πk if pk fk(x0) > pi fi(x0) for all i ≠ k. Equivalently, one can check whether log pk fk(x0) > log pi fi(x0) for all i ≠ k.
When applying this classification rule to g normal populations, with fi the density of Np(µi, Σi), i = 1, 2, . . . , g, it becomes:
Allocate x0 to πk if
log pk fk(x0) = log pk − (p/2) log(2π) − (1/2) log|Σk| − (1/2)(x0 − µk)′Σk⁻¹(x0 − µk) = max_i log pi fi(x0)
Ignoring the constant (p/2) log(2π), we get the quadratic discriminant score for the ith population:
dQi(x) = −(1/2) log|Σi| − (1/2)(x − µi)′Σi⁻¹(x − µi) + log pi    (6.7)
and the rule is to allocate x to the population with the largest quadratic discriminant score. It is obvious how one would estimate the unknown quantities involved in (6.7) from the data in order to obtain the estimated minimum total probability of misclassification rule. (Formulate the precise statement yourself!)
If we are justified in assuming that all covariance matrices of the g populations are equal, a simplification is possible (as in the case g = 2). Looking only at the terms that vary with i = 1, 2, . . . , g in (6.7), we can define the linear discriminant score
di(x) = µi′Σ⁻¹x − (1/2)µi′Σ⁻¹µi + log pi.
Correspondingly, a sample version of the linear discriminant score is obtained by substituting the arithmetic means x̄i for µi and
Spooled = [(n1 − 1)S1 + . . . + (ng − 1)Sg] / (n1 + n2 + . . . + ng − g)
for Σ, thus arriving at
d̂i(x) = x̄i′Spooled⁻¹x − (1/2)x̄i′Spooled⁻¹x̄i + log pi
Therefore the Estimated Minimum TPM Rule for Equal Covariance Normal
Populations is the following:
Allocate x to πk if d̂k(x) is the largest of the g values d̂i(x), i = 1, 2, . . . , g.
In this form, the classification rule has been implemented in many computer packages.
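As one possible illustration of such an implementation, here is a minimal sketch in Python with NumPy (the three simulated groups and the equal priors are assumptions made only for the example): it computes the g sample linear discriminant scores d̂i(x) and allocates x to the group with the largest score.

    import numpy as np

    def allocate_multigroup(samples, priors, x):
        """Estimated minimum-TPM rule for g equal-covariance normal populations:
        allocate x to the group with the largest sample linear discriminant score."""
        g = len(samples)
        ns = [len(X) for X in samples]
        xbars = [X.mean(axis=0) for X in samples]
        S_pooled = sum((n - 1) * np.cov(X, rowvar=False)
                       for n, X in zip(ns, samples)) / (sum(ns) - g)
        S_inv = np.linalg.inv(S_pooled)
        scores = [xbar @ S_inv @ x - 0.5 * xbar @ S_inv @ xbar + np.log(p)
                  for xbar, p in zip(xbars, priors)]
        return int(np.argmax(scores)) + 1   # group label 1, ..., g

    # Illustrative three-group example with equal priors (assumed):
    rng = np.random.default_rng(4)
    samples = [rng.multivariate_normal(m, np.eye(2), size=60)
               for m in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])]
    print(allocate_multigroup(samples, priors=[1/3, 1/3, 1/3], x=np.array([2.5, 0.4])))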