Foundations of Machine Learning Classification: decisions and discriminants
Srinandan (“Sri”) Dasmahapatra

2-class classification: determine which class a given input X = x belongs to

The decision boundary lies where the posterior probabilities of the two classes are equal; the posteriors are:

P(C = +|X = x) = P(X = x|C = +) P(C = +) / [ P(X = x|C = +) P(C = +) + P(X = x|C = −) P(C = −) ]

P(C = −|X = x) = P(X = x|C = −) P(C = −) / [ P(X = x|C = +) P(C = +) + P(X = x|C = −) P(C = −) ]

Simplify the notation: introduce g±(x)

P(C = +|X = x) = exp[g+(x)] / ( exp[g+(x)] + exp[g−(x)] ) = exp[g+(x) − g−(x)] / ( 1 + exp[g+(x) − g−(x)] )

P(C = −|X = x) = exp[g−(x)] / ( exp[g+(x)] + exp[g−(x)] ) = 1 / ( 1 + exp[g+(x) − g−(x)] )
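A minimal numerical sketch of the relation above, assuming simple hypothetical scoring functions g+ and g− (arbitrary linear functions of a scalar x, chosen only for illustration): the posterior for class + is the logistic sigmoid of the difference g+(x) − g−(x).

```python
import numpy as np

def posterior_plus(g_plus, g_minus, x):
    """P(C=+|X=x) from two class scores, in the two equivalent forms above."""
    direct = np.exp(g_plus(x)) / (np.exp(g_plus(x)) + np.exp(g_minus(x)))
    sigmoid = 1.0 / (1.0 + np.exp(-(g_plus(x) - g_minus(x))))
    assert np.allclose(direct, sigmoid)   # the two forms agree
    return sigmoid

# hypothetical linear scores, for illustration only
g_plus = lambda x: 0.8 * x + 0.1
g_minus = lambda x: -0.5 * x - 0.2

x = np.linspace(-4, 4, 9)
print(posterior_plus(g_plus, g_minus, x))  # increases with x; crosses 0.5 where g+(x) = g-(x)
```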

Decision boundary: determine which class a given input belongs to

[Figure: the two class posteriors plotted as functions of x; the curves cross at the decision boundary.]

decision boundary at equal posterior probability (plotted) for either class

odds = 1, log(odds) = 0

odds = P(C = +|X = x) / P(C = −|X = x) = exp[g+(x) − g−(x)]

odds = P(C = +|X = x) / P(C = −|X = x) = [ P(X = x|C = +) P(C = +) ] / [ P(X = x|C = −) P(C = −) ]

log-odds = ln( P(X = x|C = +) / P(X = x|C = −) ) + ln( P(C = +) / P(C = −) )

The first term is data dependent (likelihood); the second is data independent (prior).

Decision boundaries from probability distributions: zero log-odds: 0 = g+(x) − g−(x) = f(x; θ) + θ0

If we choose linear functions:

g+(x) = w+0 + w+⊤x,  g−(x) = w−0 + w−⊤x

• Then f(x; θ) = θ0 + θ⊤x is a linear decision boundary
• where θ0 = w+0 − w−0, and θ = w+ − w−

This is 2-class softmax (logistic) regression
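A small sketch of the reduction above, with hypothetical weight vectors w± chosen only for illustration: the difference of the two linear scores collapses to the single linear function f(x; θ) = θ0 + θ⊤x, and points with f(x; θ) = 0 receive posterior 0.5.

```python
import numpy as np

# hypothetical class-wise linear scores g±(x) = w±0 + w±ᵀx (2-D features)
w_plus0, w_plus = 0.5, np.array([1.0, -2.0])
w_minus0, w_minus = -0.3, np.array([-0.5, 1.0])

# collapse to a single linear discriminant
theta0 = w_plus0 - w_minus0
theta = w_plus - w_minus

def f(x):
    return theta0 + theta @ x

# a point on the decision boundary: pick x2 freely, solve f(x) = 0 for x1
x2 = 1.7
x1 = -(theta0 + theta[1] * x2) / theta[0]
x_boundary = np.array([x1, x2])

g_plus = w_plus0 + w_plus @ x_boundary
g_minus = w_minus0 + w_minus @ x_boundary
posterior_plus = 1.0 / (1.0 + np.exp(-(g_plus - g_minus)))
print(np.isclose(f(x_boundary), 0.0), np.isclose(posterior_plus, 0.5))  # True True
```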

Reminder: Logistic (2-class)/softmax (K-class) regression

Choose suitable features Φ to compute scores/activations: s(Ai) := score(class Ai) = fi(Φ(data))

Recall,

softmax(s(A1), …, s(AK)) = ln[ exp(s(A1)) + ⋯ + exp(s(AK)) ]

P(class Ai | data) = exp(s(Ai)) / exp( softmax(s(A1), …, s(AK)) ) = exp(s(Ai)) / ∑_{k=1}^{K} exp(s(Ak))

• fi(data x) = wi⊤ Φ(x) = wi⊤ x

We did not seek a decision boundary, just an argmax.
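A minimal numpy sketch of the K-class rule above, using the log-sum-exp form of "softmax" as defined on this slide; the scores and weights are hypothetical, chosen only for illustration.

```python
import numpy as np

def soft_max_score(s):
    """The slide's 'softmax': log-sum-exp of the class scores, computed stably."""
    m = np.max(s)
    return m + np.log(np.sum(np.exp(s - m)))

def class_posteriors(s):
    """P(class A_i | data) = exp(s_i) / exp(softmax(s)) = exp(s_i) / sum_k exp(s_k)."""
    return np.exp(s - soft_max_score(s))

# hypothetical scores s(A_i) = w_iᵀ x for K = 3 classes and 2-D features
W = np.array([[1.0, 0.5], [-0.2, 1.5], [0.3, -1.0]])  # one weight vector per class
x = np.array([0.7, -0.4])
s = W @ x

p = class_posteriors(s)
print(p, p.sum(), np.argmax(p))  # probabilities sum to 1; the prediction is the argmax
```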

Decision boundaries from probability distributions: zero log-odds

Bi-variate Gaussian
Notice the contours and line segments at the bottom of the cover graphic

Log of ratio of two multi-dimensional Gaussians

p(x; μ, Σ) = 1/√( (2π)^p |Σ| ) · exp( −(1/2) (x − μ)⊤ Σ⁻¹ (x − μ) )

Precision matrix Λ = Σ⁻¹

For 2D features, writing x = (x1, x2)⊤, μa = (μa1, μa2)⊤ and the 2×2 precision matrix Λa with entries Λa11, Λa12, Λa21, Λa22 (and likewise for class b):

p(x1, x2 | C = a) = ( √|Λa| / 2π ) exp{ −(1/2) (x − μa)⊤ Λa (x − μa) }

p(x1, x2 | C = b) = ( √|Λb| / 2π ) exp{ −(1/2) (x − μb)⊤ Λb (x − μb) }

Log-odds (no prior) for 2-class classification:

ln( p(x|C+) / p(x|C−) ) = (1/2) ln( |Λ+| / |Λ−| ) + (1/2)(x − μ−)⊤ Λ− (x − μ−) − (1/2)(x − μ+)⊤ Λ+ (x − μ+)
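A short numpy sketch of the log-likelihood-ratio above for two hypothetical 2-D Gaussians (means and covariances invented purely for illustration); the log density is written by hand and used to cross-check the reconstructed formula.

```python
import numpy as np

def log_gauss(x, mu, cov):
    """log N(x; mu, cov) for a p-dimensional Gaussian."""
    p = len(mu)
    diff = x - mu
    prec = np.linalg.inv(cov)                 # precision matrix Lambda = Sigma^{-1}
    _, logdet_cov = np.linalg.slogdet(cov)
    return -0.5 * (p * np.log(2 * np.pi) + logdet_cov + diff @ prec @ diff)

# hypothetical class-conditional Gaussians (illustration only)
mu_p, cov_p = np.array([1.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu_m, cov_m = np.array([-1.0, 0.5]), np.array([[1.5, -0.2], [-0.2, 1.0]])
x = np.array([0.2, -0.3])

# log-odds (no prior) via the slide's formula ...
L_p, L_m = np.linalg.inv(cov_p), np.linalg.inv(cov_m)
log_odds = (0.5 * np.log(np.linalg.det(L_p) / np.linalg.det(L_m))
            + 0.5 * (x - mu_m) @ L_m @ (x - mu_m)
            - 0.5 * (x - mu_p) @ L_p @ (x - mu_p))

# ... agrees with the difference of log densities
print(np.isclose(log_odds, log_gauss(x, mu_p, cov_p) - log_gauss(x, mu_m, cov_m)))  # True
```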

Log of ratio of posteriors of two Gaussians: for equal and unequal covariances

g−(x) = ln P(C−) + (1/2) ln|Λ−| − (1/2)(x − μ−)⊤ Λ− (x − μ−)
      = ln P(C−) + (1/2) ln|Λ−| − (1/2)( x⊤Λ−x − 2 μ−⊤Λ−x + μ−⊤Λ−μ− )

g+(x) = ln P(C+) + (1/2) ln|Λ+| − (1/2)( x⊤Λ+x − 2 μ+⊤Λ+x + μ+⊤Λ+μ+ )

For unequal covariances, Λ+ ≠ Λ−, the decision boundary is quadratic.
Your turn: what is the quadratic term in g+(x) − g−(x)?

The decision boundary for equal covariances Λ+ = Λ− = Λ is linear:

g+(x) − g−(x) = ln( P(C+) / P(C−) ) + (μ+ − μ−)⊤ Λ x + x-independent term

Your turn: derive the linear equation.
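As a numerical sanity check (not the requested derivation), a small numpy sketch with hypothetical parameters: with unequal precisions the difference g+(x) − g−(x) contains a term quadratic in x, while with equal precisions that term cancels and the difference is affine in x. Second differences along a direction vanish for affine functions.

```python
import numpy as np

def g(x, mu, L, prior):
    """Class score g(x) = ln P(C) + 0.5 ln|Lambda| - 0.5 (x - mu)^T Lambda (x - mu)."""
    d = x - mu
    return np.log(prior) + 0.5 * np.log(np.linalg.det(L)) - 0.5 * d @ L @ d

def second_difference(L_plus, L_minus, x, v):
    """Second difference of g+(x) - g-(x) along direction v; it vanishes when there is no quadratic term."""
    diff = lambda z: g(z, mu_p, L_plus, 0.5) - g(z, mu_m, L_minus, 0.5)
    return diff(x + v) - 2 * diff(x) + diff(x - v)

mu_p, mu_m = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
L_equal = np.array([[2.0, 0.5], [0.5, 1.0]])      # shared precision
L_other = np.array([[1.0, -0.3], [-0.3, 3.0]])    # a different precision

x, v = np.array([0.4, -0.2]), np.array([0.3, -0.7])
print(np.isclose(second_difference(L_equal, L_equal, x, v), 0.0))  # True: boundary is linear
print(np.isclose(second_difference(L_equal, L_other, x, v), 0.0))  # False: quadratic term survives
```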

Discriminants (decision boundaries) for 2-class gaussians: linear if covariances equal, quadratic if not

[Figure: two scatter plots of labelled data with fitted decision boundaries; one panel has Σ+ ≠ Σ−, the other Σ+ = Σ−.]

If the covariance matrices are the same, the quadratic term is common to both classes, and drops out in the ratio.

Discriminants for Gaussians for equal priors: data distribution equally divided

[Figure: 1-dim, 2-dim and 3-dim examples, from Duda and Hart]

Equal priors: log(Pa/Pb) = 0

Discriminants for Gaussians: unequal priors shift the decision boundary by log(Pa/Pb)

[Figure: 1-dim, 2-dim and 3-dim examples, from Duda and Hart]

The prior overrides the data distributions.
Notice how far the boundary shifts.
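A small 1-D illustration of the shift, with invented parameters: two unit-variance Gaussians with means ±1. With equal priors the boundary sits midway between the means; an unequal prior moves the zero of the log-odds by −(1/2) ln( P(C=+) / P(C=−) ) for these particular parameters.

```python
import numpy as np

def log_odds_1d(x, mu_p, mu_m, sigma, prior_p):
    """ln P(C=+|x) - ln P(C=-|x) for two 1-D Gaussians with a shared variance."""
    prior_term = np.log(prior_p / (1.0 - prior_p))
    likelihood_term = (-(x - mu_p) ** 2 + (x - mu_m) ** 2) / (2 * sigma ** 2)
    return prior_term + likelihood_term

mu_p, mu_m, sigma = 1.0, -1.0, 1.0
xs = np.linspace(-3, 3, 6001)

for prior_p in (0.5, 0.9):
    boundary = xs[np.argmin(np.abs(log_odds_1d(xs, mu_p, mu_m, sigma, prior_p)))]
    print(prior_p, round(boundary, 3))
# prior 0.5 -> boundary ~ 0.0 (midway between the means)
# prior 0.9 -> boundary ~ -1.099 (= -0.5*ln(0.9/0.1)), pushed almost to the other class mean
```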

Decision boundaries not perpendicular to line joining means if covariance matrix not proportional to the identity matrix

g+(x) − g−(x) = ln( P(C+) / P(C−) ) + (μ+ − μ−)⊤ Λ x + x-independent term

w = Σ⁻¹(μ+ − μ−)

From Duda and Hart
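A tiny check of this point with an invented anisotropic covariance: the boundary normal w = Σ⁻¹(μ+ − μ−) is not parallel to the mean difference unless Σ is proportional to the identity.

```python
import numpy as np

mu_diff = np.array([1.0, 1.0])   # hypothetical mu_plus - mu_minus

def boundary_normal(cov):
    return np.linalg.solve(cov, mu_diff)   # w = Sigma^{-1} (mu_plus - mu_minus)

def angle_deg(u, v):
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

iso = np.eye(2) * 0.7                          # proportional to the identity
aniso = np.array([[2.0, 0.8], [0.8, 0.5]])     # anisotropic covariance

print(angle_deg(boundary_normal(iso), mu_diff))    # ~0: w parallel to the mean difference
print(angle_deg(boundary_normal(aniso), mu_diff))  # > 0: boundary tilted away from perpendicular
```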

Distinctiveness relative to natural variation measured by distance between class means scaled by standard deviation:

[Figure: two 1-D examples of a pair of Gaussian class densities with the same means but different spreads, showing different amounts of overlap.]

d = |μ1 − μ2| / σ

Controls for overlap between classes

In general,

d(x1, x2) = √( (x1 − x2)⊤ Σ⁻¹ (x1 − x2) )   (Mahalanobis distance)
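A short numpy sketch of the Mahalanobis distance above, with an invented covariance; for Σ = I it reduces to the ordinary Euclidean distance.

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    """d(x1, x2) = sqrt((x1 - x2)^T Sigma^{-1} (x1 - x2))."""
    diff = x1 - x2
    return np.sqrt(diff @ np.linalg.solve(cov, diff))

mu1 = np.array([2.0, 0.0])
mu2 = np.array([-2.0, 1.0])
cov = np.array([[4.0, 1.0], [1.0, 1.0]])   # hypothetical shared class covariance

print(mahalanobis(mu1, mu2, cov))           # separation of the means in 'standard deviation' units
print(mahalanobis(mu1, mu2, np.eye(2)))     # with Sigma = I this is just the Euclidean distance
print(np.linalg.norm(mu1 - mu2))            # same value as the line above
```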

Fisher Linear Discriminant Analysis: projections along discriminative directions

Class means in input space and after projection yn = w⊤xn:

μc = (1/nc) ∑_{xn ∈ Dc} xn        mc = w⊤μc = (1/nc) ∑_{xn ∈ Dc} w⊤xn ≜ (1/nc) ∑_{xn ∈ Dc} yn

Projected within-class variance:

s²c = (1/nc) ∑_{xn ∈ Dc} (yn − mc)²

Fisher score:

J(w) = (m+ − m−)² / ( n+ s²+ + n− s²− )

Choose w* = argmax_w J(w)
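A small numpy sketch of these quantities on synthetic 2-D data (randomly generated here purely for illustration): project onto a candidate w, compute the projected means and variances, and evaluate the Fisher score J(w).

```python
import numpy as np

rng = np.random.default_rng(0)
X_plus = rng.multivariate_normal([2.0, 0.0], [[1.0, 0.5], [0.5, 1.5]], size=100)
X_minus = rng.multivariate_normal([-1.0, 1.0], [[1.0, 0.5], [0.5, 1.5]], size=80)

def fisher_score(w, X_plus, X_minus):
    """J(w) = (m+ - m-)^2 / (n+ s+^2 + n- s-^2) for projections y = w^T x."""
    y_plus, y_minus = X_plus @ w, X_minus @ w
    m_plus, m_minus = y_plus.mean(), y_minus.mean()
    s2_plus, s2_minus = y_plus.var(), y_minus.var()      # (1/n_c) sum (y_n - m_c)^2
    return (m_plus - m_minus) ** 2 / (len(y_plus) * s2_plus + len(y_minus) * s2_minus)

# two candidate directions; the one that separates the projected classes more scores higher
print(fisher_score(np.array([1.0, 0.0]), X_plus, X_minus))
print(fisher_score(np.array([0.0, 1.0]), X_plus, X_minus))
```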

Rewrite Fisher score to make weight vector dependence explicit: between and within class scatter

s²c = (1/nc) ∑_{xn ∈ Dc} (yn − mc)² = (1/nc) ∑_{xn ∈ Dc} (w⊤xn − w⊤μc)²
    = w⊤ [ (1/nc) ∑_{xn ∈ Dc} (xn − μc)(xn − μc)⊤ ] w ≜ w⊤ Sc w

using (a · b)² = (b⊤a)(a⊤b).

Denominator: within-class scatter

n+ s²+ + n− s²− = w⊤ ( n+ S+ + n− S− ) w ≜ w⊤ SW w

Numerator: between-class scatter

(m+ − m−)² = ( w⊤(μ+ − μ−) )² = w⊤ (μ+ − μ−)(μ+ − μ−)⊤ w ≜ w⊤ SB w

J(w) = (m+ − m−)² / ( n+ s²+ + n− s²− )  ⟹  J(w) = ( w⊤ SB w ) / ( w⊤ SW w )

Optimisation: find best direction, set gradients to 0

Quotient rule: (u/v)′ = (1/v²)(u′v − uv′)

Partial derivatives (gradient) set to 0:

∂J(w)/∂w = 2 [ (w⊤SW w) SB w − (w⊤SB w) SW w ] / (w⊤SW w)² = 0

reduces to a generalised eigenvalue problem:

SB w = λ SW w,   with λ = (w⊤SB w) / (w⊤SW w) the ratio of the two scalars

Extend to multiple classes, not just 2.
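A compact numpy sketch putting the last few slides together on synthetic data (generated here only for illustration): build SW and SB, solve for the Fisher direction, and confirm it satisfies the generalised eigenvalue condition. For two classes SB w is always parallel to μ+ − μ−, so w* ∝ SW⁻¹(μ+ − μ−) solves the problem directly.

```python
import numpy as np

rng = np.random.default_rng(1)
X_plus = rng.multivariate_normal([2.0, 0.0], [[1.0, 0.6], [0.6, 2.0]], size=120)
X_minus = rng.multivariate_normal([-1.0, 1.0], [[1.0, 0.6], [0.6, 2.0]], size=90)

mu_plus, mu_minus = X_plus.mean(axis=0), X_minus.mean(axis=0)

def scatter(X, mu):
    D = X - mu
    return D.T @ D                      # n_c * S_c = sum_n (x_n - mu_c)(x_n - mu_c)^T

S_W = scatter(X_plus, mu_plus) + scatter(X_minus, mu_minus)   # within-class scatter
diff = mu_plus - mu_minus
S_B = np.outer(diff, diff)                                    # between-class scatter

# Fisher direction (2-class closed form, equivalent to solving S_B w = lambda S_W w)
w = np.linalg.solve(S_W, diff)
w /= np.linalg.norm(w)

lam = (w @ S_B @ w) / (w @ S_W @ w)
print(np.allclose(S_B @ w, lam * S_W @ w))   # True: generalised eigenvalue condition holds
print(w)                                     # the discriminative projection direction
```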

Classifiers can be generative or discriminative

Generative: class-conditional probabilities can be used to simulate (generate) data
Generative: probabilities consistent with allocation of distance from decision boundaries
Discriminative: does not model the data within classes, only what distinguishes one class from another

Next time: maximum margin classifiers (Support Vector Machines)


References

• PRML, Bishop, Chapter 1, Section 1.5
• PRML, Bishop, Chapter 4
• FCML, Rogers and Girolami, Chapter 5