Foundations of Machine Learning Classification: decisions and discriminants
Srinandan (“Sri”) Dasmahapatra
2-class classification: determine which class a given input X = x belongs to

Decision boundary at equal posterior probability for either class, where the posteriors are:

$$P(C=+\mid X=x) = \frac{P(X=x\mid C=+)\,P(C=+)}{P(X=x\mid C=+)\,P(C=+) + P(X=x\mid C=-)\,P(C=-)}$$

$$P(C=-\mid X=x) = \frac{P(X=x\mid C=-)\,P(C=-)}{P(X=x\mid C=+)\,P(C=+) + P(X=x\mid C=-)\,P(C=-)}$$

Simplify the notation: introduce g±(x)

$$P(C=+\mid X=x) = \frac{e^{g_+(x)}}{e^{g_+(x)} + e^{g_-(x)}} = \frac{\exp[g_+(x) - g_-(x)]}{1 + \exp[g_+(x) - g_-(x)]}$$

$$P(C=-\mid X=x) = \frac{e^{g_-(x)}}{e^{g_+(x)} + e^{g_-(x)}} = \frac{1}{1 + \exp[g_+(x) - g_-(x)]}$$
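A minimal NumPy sketch showing that the two expressions above are the same function of the score difference; the score values are made up for illustration:

```python
import numpy as np

def posterior_plus_two_ways(g_plus, g_minus):
    """P(C=+|X=x) written the two equivalent ways shown on the slide."""
    # direct normalisation of the exponentiated scores
    direct = np.exp(g_plus) / (np.exp(g_plus) + np.exp(g_minus))
    # logistic (sigmoid) of the score difference g_+(x) - g_-(x)
    sigmoid = 1.0 / (1.0 + np.exp(-(g_plus - g_minus)))
    return direct, sigmoid

direct, sigmoid = posterior_plus_two_ways(g_plus=1.3, g_minus=-0.4)
print(direct, sigmoid)  # identical values; the decision boundary is where both equal 0.5
```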
Decision boundary: determine which class a given input belongs to

[Figure: the two posterior probabilities plotted against x (vertical axis 0 to 1.0, horizontal axis −4 to 4); the decision boundary sits at equal posterior probability (plotted) for either class.]

At the boundary: odds = 1, log(odds) = 0

$$\text{odds} = \frac{P(C=+\mid X=x)}{P(C=-\mid X=x)} = \exp[g_+(x) - g_-(x)]$$

$$\text{odds} = \frac{P(C=+\mid X=x)}{P(C=-\mid X=x)} = \frac{P(X=x\mid C=+)\,P(C=+)}{P(X=x\mid C=-)\,P(C=-)}$$

$$\text{log-odds} = \underbrace{\ln\!\left(\frac{P(X=x\mid C=+)}{P(X=x\mid C=-)}\right)}_{\text{data dependent (likelihood)}} + \underbrace{\ln\!\left(\frac{P(C=+)}{P(C=-)}\right)}_{\text{data independent (prior)}}$$
Decision boundaries from probability distributions: zero log-odds: 0 = g+(x) − g−(x) = f(x; θ)

If we choose linear functions:
g+(x) = w+0 + w+⊤x,  g−(x) = w−0 + w−⊤x

• Then f(x; θ) = θ0 + θ⊤x is a linear decision boundary
• where θ0 = w+0 − w−0, and θ = w+ − w−
• This is 2-class softmax (logistic) regression (see the sketch below)
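A small NumPy sketch of this reduction from two linear score functions to a single linear boundary; all weight values are made up for illustration:

```python
import numpy as np

# illustrative (made-up) parameters of the two linear score functions
w_plus,  w_plus0  = np.array([1.0, -0.5]),  0.2
w_minus, w_minus0 = np.array([0.3,  0.4]), -0.1

# the decision boundary depends only on the differences
theta  = w_plus - w_minus          # θ  = w+ − w−
theta0 = w_plus0 - w_minus0        # θ0 = w+0 − w−0

def posterior_plus(x):
    """P(C=+|X=x) = sigmoid(g+(x) − g−(x)) = sigmoid(θ0 + θ·x)."""
    return 1.0 / (1.0 + np.exp(-(theta0 + theta @ x)))

x = np.array([0.5, 1.0])
print(posterior_plus(x))                                     # probability of class +
print("class +" if theta0 + theta @ x > 0 else "class -")    # decide by the sign of the log-odds
```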
Reminder: Logistic (2-class)/softmax (K-class) regression

• Choose suitable features Φ to compute scores/activations: s(Ai) := score(class Ai) = fi(Φ(data))
• Recall,

$$\mathrm{softmax}(s(A_1), \ldots, s(A_K)) = \ln\!\left[\exp(s(A_1)) + \cdots + \exp(s(A_K))\right]$$

$$P(\text{class } A_i \mid \text{data}) = \frac{\exp(s(A_i))}{\exp\!\left[\mathrm{softmax}(s(A_1), \ldots, s(A_K))\right]} = \frac{\exp(s(A_i))}{\sum_{j=1}^{K} \exp(s(A_j))}$$

• fi(data x) = wi⊤Φ(x) = wi⊤x
• Did not seek a decision boundary, just an argmax
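A minimal NumPy sketch of this construction: softmax here is the log-sum-exp of the scores, exactly as defined on the slide, and the class posterior divides each exponentiated score by exp(softmax). The scores are arbitrary illustrative numbers.

```python
import numpy as np

def softmax_lse(scores):
    """Log-sum-exp of the scores, i.e. the 'softmax' as defined on the slide.
    Subtracting the max first keeps the exponentials from overflowing."""
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))

def class_posteriors(scores):
    """P(class A_i | data) = exp(s(A_i)) / exp(softmax(s(A_1), ..., s(A_K)))."""
    return np.exp(scores - softmax_lse(scores))

scores = np.array([2.0, 0.5, -1.0])            # s(A_1), s(A_2), s(A_3)
probs = class_posteriors(scores)
print(probs, probs.sum())                      # probabilities summing to 1
print("predicted class:", np.argmax(scores))   # no decision boundary needed, just an argmax
```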
Decision boundaries from probability distributions: zero log-odds

Bi-variate gaussian
Notice the contours and line segments at the bottom of the cover graphic
Log of ratio of two multi-dimensional Gaussians

$$p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^p\,|\Sigma|}}\,\exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

Precision matrix Λ = Σ⁻¹

For 2D features:

$$p(x_1, x_2 \mid C=a) = \frac{\sqrt{|\Lambda_a|}}{2\pi}\,\exp\!\left\{-\frac{1}{2}\begin{bmatrix} x_1-\mu_{a1} & x_2-\mu_{a2}\end{bmatrix}\begin{bmatrix}\Lambda_{a11} & \Lambda_{a12}\\ \Lambda_{a21} & \Lambda_{a22}\end{bmatrix}\begin{bmatrix} x_1-\mu_{a1}\\ x_2-\mu_{a2}\end{bmatrix}\right\}$$

$$p(x_1, x_2 \mid C=b) = \frac{\sqrt{|\Lambda_b|}}{2\pi}\,\exp\!\left\{-\frac{1}{2}\begin{bmatrix} x_1-\mu_{b1} & x_2-\mu_{b2}\end{bmatrix}\begin{bmatrix}\Lambda_{b11} & \Lambda_{b12}\\ \Lambda_{b21} & \Lambda_{b22}\end{bmatrix}\begin{bmatrix} x_1-\mu_{b1}\\ x_2-\mu_{b2}\end{bmatrix}\right\}$$

Log-odds (no prior) for 2-class classification:

$$\ln\!\left(\frac{p(x\mid C_+)}{p(x\mid C_-)}\right) = \frac{1}{2}\ln\!\left(\frac{|\Lambda_+|}{|\Lambda_-|}\right) + \frac{1}{2}(x-\mu_-)^\top\Lambda_-(x-\mu_-) - \frac{1}{2}(x-\mu_+)^\top\Lambda_+(x-\mu_+)$$
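A NumPy sketch of this log-likelihood ratio, written directly from the density above; the means, covariances, and test point are made-up illustrative values:

```python
import numpy as np

def log_gauss(x, mu, Sigma):
    """Log of the multivariate Gaussian density p(x; mu, Sigma)."""
    p = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + quad)

# made-up class-conditional Gaussians for illustration
mu_p, Sigma_p = np.array([1.0, 1.0]),  np.array([[1.0, 0.3], [0.3, 1.0]])
mu_m, Sigma_m = np.array([-1.0, 0.0]), np.array([[2.0, 0.0], [0.0, 0.5]])

x = np.array([0.2, 0.7])
log_odds = log_gauss(x, mu_p, Sigma_p) - log_gauss(x, mu_m, Sigma_m)
print(log_odds)   # > 0 favours C+, < 0 favours C− (no prior term yet)
```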
Log of ratio of posteriors of two Gaussians: for equal and unequal covariances

$$g_+(x) = \ln P(C_+) + \frac{1}{2}\ln|\Lambda_+| - \frac{1}{2}(x-\mu_+)^\top\Lambda_+(x-\mu_+) = \ln P(C_+) + \frac{1}{2}\ln|\Lambda_+| - \frac{1}{2}\left(x^\top\Lambda_+ x - 2\mu_+^\top\Lambda_+ x + \mu_+^\top\Lambda_+\mu_+\right)$$

$$g_-(x) = \ln P(C_-) + \frac{1}{2}\ln|\Lambda_-| - \frac{1}{2}(x-\mu_-)^\top\Lambda_-(x-\mu_-) = \ln P(C_-) + \frac{1}{2}\ln|\Lambda_-| - \frac{1}{2}\left(x^\top\Lambda_- x - 2\mu_-^\top\Lambda_- x + \mu_-^\top\Lambda_-\mu_-\right)$$

For unequal covariances, Λ+ ≠ Λ−, the decision boundary is quadratic.
Your turn: what is the quadratic term in g+(x) − g−(x)?

Decision boundary for equal covariances Λ+ = Λ− = Λ is linear:

$$g_+(x) - g_-(x) = \ln\frac{P(C_+)}{P(C_-)} + (\mu_+ - \mu_-)^\top\Lambda\, x + \text{($x$-independent term)}$$

Your turn: derive the linear equation.
Discriminants (decision boundaries) for 2-class gaussians: linear if covariances equal, quadratic if not

[Figure: two scatter plots of the two classes on axes from −6 to 6. Left panel, Σ+ ≠ Σ−: the decision boundary is quadratic. Right panel, Σ+ = Σ−: the decision boundary is linear.]

If the covariance matrices are the same, the quadratic term is common to both classes, and drops out in the ratio (see the check below).
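One way to check this numerically, with made-up parameters: a function is linear along a line segment exactly when its value at the midpoint is the average of its values at the endpoints. This is only an illustrative check, not part of the derivation.

```python
import numpy as np

def g(x, mu, Lam, log_prior=np.log(0.5)):
    """Class score g(x) = ln P(C) + 0.5 ln|Λ| − 0.5 (x−μ)ᵀ Λ (x−μ)."""
    d = x - mu
    return log_prior + 0.5 * np.linalg.slogdet(Lam)[1] - 0.5 * d @ Lam @ d

def midpoint_linearity_check(Lam_p, Lam_m, mu_p, mu_m):
    """Δ(x) = g+(x) − g−(x) is linear iff Δ(midpoint) equals the average of the endpoint values."""
    a, b = np.array([2.0, -1.0]), np.array([-1.5, 3.0])      # arbitrary test segment
    delta = lambda x: g(x, mu_p, Lam_p) - g(x, mu_m, Lam_m)
    return np.isclose(delta((a + b) / 2), (delta(a) + delta(b)) / 2)

mu_p, mu_m = np.array([1.0, 0.0]), np.array([-1.0, 0.5])     # made-up means
Lam_shared = np.array([[1.5, 0.2], [0.2, 0.8]])              # made-up shared precision matrix
Lam_other  = np.array([[2.0, 0.0], [0.0, 0.5]])              # a different precision matrix

print(midpoint_linearity_check(Lam_shared, Lam_shared, mu_p, mu_m))  # True: quadratic terms cancel
print(midpoint_linearity_check(Lam_other,  Lam_shared, mu_p, mu_m))  # False: boundary is quadratic
```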
Discriminants for Gaussians for equal priors: data distribution equally divided

[Figure: 1-dim, 2-dim, and 3-dim examples. Equal priors: log(Pa/Pb) = 0. From Duda and Hart.]

Discriminants for Gaussians: unequal priors shift the decision boundary by log(Pa/Pb)

[Figure: 1-dim, 2-dim, and 3-dim examples. The prior overrides the data distributions; notice how far the boundary shifts. From Duda and Hart.]
Decision boundaries are not perpendicular to the line joining the means if the covariance matrix is not proportional to the identity matrix

$$g_+(x) - g_-(x) = \ln\frac{P(C_+)}{P(C_-)} + (\mu_+ - \mu_-)^\top\Lambda\, x + \text{($x$-independent term)}$$

The boundary normal is w = Σ⁻¹(μ+ − μ−); see the sketch below.

From Duda and Hart
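A short NumPy sketch of this geometric point, with made-up numbers: the boundary normal w = Σ⁻¹(μ+ − μ−) is parallel to μ+ − μ− only when Σ is a multiple of the identity.

```python
import numpy as np

def boundary_normal(Sigma, mu_plus, mu_minus):
    """w = Σ⁻¹(μ+ − μ−): normal of the linear (equal-covariance) decision boundary."""
    return np.linalg.solve(Sigma, mu_plus - mu_minus)

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

mu_plus, mu_minus = np.array([2.0, 1.0]), np.array([-1.0, 0.0])   # made-up means

Sigma_iso  = 0.7 * np.eye(2)                       # isotropic: Σ ∝ I
Sigma_full = np.array([[2.0, 0.9], [0.9, 0.8]])    # anisotropic, correlated

d = mu_plus - mu_minus
print(angle_deg(boundary_normal(Sigma_iso,  mu_plus, mu_minus), d))   # 0°: w parallel to μ+ − μ−
print(angle_deg(boundary_normal(Sigma_full, mu_plus, mu_minus), d))   # > 0°: boundary is tilted
```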
Distinctiveness relative to natural variation measured by distance between class means scaled by standard deviation:

$$d = \frac{|\mu_1 - \mu_2|}{\sigma}$$

[Figure: two pairs of 1-dimensional class-conditional densities (x from −10 to 10); the further apart the means are relative to the spread, the smaller the overlap.]

Controls for overlap between classes.

The multi-dimensional generalisation is the Mahalanobis distance:

$$d(x_1, x_2) = \sqrt{(x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2)}$$
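A minimal NumPy sketch of the Mahalanobis distance; the covariance and points are illustrative:

```python
import numpy as np

def mahalanobis(x1, x2, Sigma):
    """d(x1, x2) = sqrt((x1 − x2)ᵀ Σ⁻¹ (x1 − x2))."""
    diff = x1 - x2
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

Sigma = np.array([[4.0, 0.0], [0.0, 0.25]])     # much larger spread along the first axis
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])

print(np.linalg.norm(mu1 - mu2))      # Euclidean distance
print(mahalanobis(mu1, mu2, Sigma))   # Mahalanobis: the high-variance direction counts for less
```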
Fisher Linear Discriminant Analysis: projections along discriminative directions

Project each xn onto a direction w, giving yn = w⊤xn. Class means before and after projection:

$$\mu_c = \frac{1}{n_c}\sum_{x_n\in D_c} x_n, \qquad m_c = w^\top\mu_c = \frac{1}{n_c}\sum_{x_n\in D_c} w^\top x_n =: \frac{1}{n_c}\sum_{x_n\in D_c} y_n$$

Within-class variance of the projected data:

$$s_c^2 = \frac{1}{n_c}\sum_{x_n\in D_c} (y_n - m_c)^2$$

Fisher score:

$$J(w) = \frac{(m_+ - m_-)^2}{n_+ s_+^2 + n_- s_-^2}$$

Choose w* = argmax_w J(w)
Rewrite Fisher score to make weight vector dependence explicit: between and within class scatter

$$s_c^2 = \frac{1}{n_c}\sum_{x_n\in D_c}(y_n - m_c)^2 = \frac{1}{n_c}\sum_{x_n\in D_c}\left(w^\top x_n - w^\top\mu_c\right)^2 = w^\top\left[\frac{1}{n_c}\sum_{x_n\in D_c}(x_n-\mu_c)(x_n-\mu_c)^\top\right] w =: w^\top S_c\, w$$

using (a · b)² = (b⊤a)(a⊤b).

Denominator: within-class scatter

$$n_+ s_+^2 + n_- s_-^2 = w^\top\left(n_+ S_+ + n_- S_-\right) w =: w^\top S_W\, w$$

Numerator: between-class scatter

$$(m_+ - m_-)^2 = \left(w^\top(\mu_+ - \mu_-)\right)^2 = w^\top(\mu_+ - \mu_-)(\mu_+ - \mu_-)^\top w =: w^\top S_B\, w$$

$$J(w) = \frac{(m_+ - m_-)^2}{n_+ s_+^2 + n_- s_-^2} \;\Longrightarrow\; J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}$$

A sketch constructing S_W and S_B from data follows below.
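A NumPy sketch of these quantities on synthetic data (two made-up Gaussian clouds), following the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)

# two made-up classes in 2D
X_plus  = rng.normal(loc=[2.0, 0.0], scale=[1.0, 0.5], size=(60, 2))
X_minus = rng.normal(loc=[0.0, 1.0], scale=[1.0, 0.5], size=(40, 2))

def scatter(X):
    """n_c * S_c = sum over the class of (x_n − μ_c)(x_n − μ_c)ᵀ."""
    d = X - X.mean(axis=0)
    return d.T @ d

mu_p, mu_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
S_W = scatter(X_plus) + scatter(X_minus)            # within-class scatter
S_B = np.outer(mu_p - mu_m, mu_p - mu_m)            # between-class scatter

def fisher_score(w):
    """J(w) = (wᵀ S_B w) / (wᵀ S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

print(fisher_score(np.array([1.0, 0.0])))           # score of an arbitrary direction
```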
Optimisation: find best direction, set gradients to 0

Using the quotient rule (u/v)′ = (1/v²)(u′v − uv′), set the partial derivatives (gradient) to 0:

$$\frac{\partial J(w)}{\partial w} = 2\,\frac{(w^\top S_W w)\,S_B w - (w^\top S_B w)\,S_W w}{(w^\top S_W w)^2} = 0$$

This reduces to a generalised eigenvalue problem:

$$S_B w = \lambda\, S_W w, \qquad \lambda = \frac{w^\top S_B w}{w^\top S_W w} \;\text{(a ratio of scalars)}$$

Extend to multiple classes, not just 2. A solution sketch for the 2-class case follows below.
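An illustrative solution sketch for the 2-class case (not prescribed by the slides): one can call a generalised eigensolver directly, and, since S_B w always points along μ+ − μ−, one can also solve for the direction in closed form as w ∝ S_W⁻¹(μ+ − μ−). SciPy is assumed available for the eigensolver route; the data are made up.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X_plus  = rng.normal(loc=[2.0, 0.0], scale=[1.0, 0.5], size=(60, 2))
X_minus = rng.normal(loc=[0.0, 1.0], scale=[1.0, 0.5], size=(40, 2))

mu_p, mu_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
S_W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in (X_plus, X_minus))
S_B = np.outer(mu_p - mu_m, mu_p - mu_m)

# Route 1: generalised eigenvalue problem S_B w = λ S_W w (take the top eigenvector)
eigvals, eigvecs = eigh(S_B, S_W)
w_eig = eigvecs[:, -1]

# Route 2: 2-class closed form, w ∝ S_W⁻¹ (μ+ − μ−)
w_direct = np.linalg.solve(S_W, mu_p - mu_m)

# the two solutions agree up to sign and scale (only the direction matters for J(w))
cos = abs(w_eig @ w_direct) / (np.linalg.norm(w_eig) * np.linalg.norm(w_direct))
print(cos)   # ≈ 1
```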
Classifiers can be generative or discriminative

• Generative — class conditional probabilities can be used to simulate (generate) data
• Generative — probabilities consistent with allocation of distance from decision boundaries
• Discriminative — does not model data within classes, only what distinguishes a class from another
• Next time: maximum margin classifiers (Support Vector Machines)
References

• PRML, Bishop, Chapter 1, Section 1.5
• PRML, Bishop, Chapter 4
• FCML, Rogers and Girolami, Chapter 5