
Problem Set: Classification SOLUTIONS
(a) Discriminative and Generative classification are forms of probabilistic classification in which the learning problem becomes one of inferring the posterior probability pY(y|x).
Discriminative Classification seeks to learn pY(y|x) directly.
Generative Classification uses Bayes’ Theorem to express the posterior as: pY(y|x) ∝ pX(x|y) pY(y)

and then seeks to learn pY(y|x) indirectly, by first inferring pX(x|y) and pY(y).
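To make the distinction concrete, here is a minimal Python sketch of both routes to the posterior, assuming one-dimensional Gaussian class-conditionals and a Bernoulli prior with made-up parameter values (θ, μ0, σ0, μ1, σ1), and an arbitrary weight/bias pair standing in for a learned discriminative model:

```python
import math

# Generative route: model pX(x|y) and pY(y), then apply Bayes' Theorem.
# Assumed, purely illustrative parameters: 1-D Gaussian class-conditionals, Bernoulli prior.
theta = 0.3                       # pY(y = 1)
mu0, sigma0 = 0.0, 1.0            # parameters of pX(x | y = 0)
mu1, sigma1 = 2.0, 1.0            # parameters of pX(x | y = 1)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_generative(x):
    """pY(y=1|x) obtained indirectly: proportional to pX(x|y=1) pY(y=1), then normalised."""
    joint1 = gaussian_pdf(x, mu1, sigma1) * theta
    joint0 = gaussian_pdf(x, mu0, sigma0) * (1 - theta)
    return joint1 / (joint1 + joint0)

# Discriminative route: model pY(y=1|x) directly, e.g. with a logistic sigmoid.
w, b = 2.0, -2.85                 # assumed weights, standing in for values learned from data

def posterior_discriminative(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

for x in (0.0, 1.0, 2.0):
    print(x, posterior_generative(x), posterior_discriminative(x))
```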

(b) Misclassification loss: E(f(x), y) = I[y ≠ f(x)]
Generalisation loss: L(E, D, f) = E_D[I[Y ≠ f(X)]]
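As a concrete illustration, the following short Python sketch computes the misclassification loss and estimates the generalisation loss by Monte Carlo, under an assumed data-generating distribution D (x uniform on [0, 1] with pY(y = 1|x) = x) and a fixed threshold classifier:

```python
import random

random.seed(0)

def misclassification_loss(prediction, y):
    """E(f(x), y) = I[y != f(x)]"""
    return 1.0 if y != prediction else 0.0

# Assumed data-generating distribution D (illustrative only):
# x ~ Uniform(0, 1) and pY(y = 1 | x) = x.
def sample_from_D():
    x = random.random()
    y = 1 if random.random() < x else 0
    return x, y

def f(x):
    # a fixed classifier whose generalisation loss we want to estimate
    return 1 if x >= 0.5 else 0

# Monte Carlo estimate of L(E, D, f) = E_D[ I[Y != f(X)] ]
n = 100_000
samples = (sample_from_D() for _ in range(n))
estimate = sum(misclassification_loss(f(x), y) for x, y in samples) / n
print(estimate)   # close to 0.25 for this D and f
```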
Bayes’ Optimal Classifier, f*, is defined as:

f* = argmin_{f∈F} E_D[I[Y ≠ f(X)]]
   = argmin_{f∈F} ∫ Σ_y pY(y|x) pX(x) I[y ≠ f(x)] dx
   = argmin_{f∈F} ∫ Σ_y pY(y|x) pX(x) [f(x)(1 − y) + (1 − f(x))y] dx
   = argmin_{f∈F} ∫ [pY(y = 1|x)(1 − f(x)) + pY(y = 0|x)f(x)] pX(x) dx
   = argmin_{f∈F} ∫ [pY(y = 1|x)(1 − f(x)) + (1 − pY(y = 1|x))f(x)] pX(x) dx
   = argmin_{f∈F} ∫ [pY(y = 1|x) + f(x)(1 − 2pY(y = 1|x))] pX(x) dx

We need to find a function, f*, which is optimal for all x ∈ dom(X), so we can minimise the integrand pointwise:

f*(x) = argmin_{f(x)∈{0,1}} [pY(y = 1|x) + f(x)(1 − 2pY(y = 1|x))] = argmin_{f(x)} M(x)

This is a discrete optimisation problem:

if f*(x) = 1 then M(x) = 1 − pY(y = 1|x)
if f*(x) = 0 then M(x) = pY(y = 1|x)

This means that if f*(x) = 1, optimality implies:

1 − pY(y = 1|x) ≤ pY(y = 1|x)
⟹ pY(y = 1|x) ≥ 0.5

So the generalisation minimiser for the Misclassification Loss can be specified entirely in terms of the posterior distribution:

f*(x) = 1 if pY(y = 1|x) ≥ 0.5
f*(x) = 0 if pY(y = 1|x) < 0.5

This is the Bayes’ Optimal Classifier.
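In code, the resulting rule is simply a threshold on the posterior; a minimal sketch, assuming the true posterior pY(y = 1|x) is available as a number:

```python
def bayes_optimal_classifier(posterior_y1):
    """f*(x): predict 1 exactly when pY(y = 1 | x) >= 0.5."""
    return 1 if posterior_y1 >= 0.5 else 0

def pointwise_bayes_risk(posterior_y1):
    """The smallest achievable expected misclassification loss at x: min(p, 1 - p)."""
    return min(posterior_y1, 1.0 - posterior_y1)

for p in (0.1, 0.49, 0.5, 0.9):
    print(p, bayes_optimal_classifier(p), pointwise_bayes_risk(p))
```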
(c) The discriminant boundary satisfies:

pY(y = 1|x) = pY(y = 0|x)

So, using Bayes’ Rule:

pY(y = 1) pX(x|y = 1) = pY(y = 0) pX(x|y = 0)

Because of the Bernoulli assumption:

pY(y = 1) = θ,  pY(y = 0) = 1 − θ

Because of the conditional Gaussian assumption:

pX(x|y = 1) = (1 / ((2π)^(m/2) |Σ1|^(1/2))) exp(−(1/2)(x − μ1)ᵀ Σ1⁻¹ (x − μ1))
pX(x|y = 0) = (1 / ((2π)^(m/2) |Σ0|^(1/2))) exp(−(1/2)(x − μ0)ᵀ Σ0⁻¹ (x − μ0))

Thus:

θ (1 / ((2π)^(m/2) |Σ1|^(1/2))) exp(−(1/2)(x − μ1)ᵀ Σ1⁻¹ (x − μ1)) = (1 − θ)(1 / ((2π)^(m/2) |Σ0|^(1/2))) exp(−(1/2)(x − μ0)ᵀ Σ0⁻¹ (x − μ0))

⟹ (θ / (1 − θ)) (|Σ0|^(1/2) / |Σ1|^(1/2)) = exp(−(1/2)(x − μ0)ᵀ Σ0⁻¹ (x − μ0) + (1/2)(x − μ1)ᵀ Σ1⁻¹ (x − μ1))

⟹ ln(θ / (1 − θ)) + ln(|Σ0|^(1/2) / |Σ1|^(1/2)) = −(1/2)(x − μ0)ᵀ Σ0⁻¹ (x − μ0) + (1/2)(x − μ1)ᵀ Σ1⁻¹ (x − μ1)
   = −(1/2)[xᵀΣ0⁻¹x − 2μ0ᵀΣ0⁻¹x + μ0ᵀΣ0⁻¹μ0 − xᵀΣ1⁻¹x + 2μ1ᵀΣ1⁻¹x − μ1ᵀΣ1⁻¹μ1]

⟹ (1/2)xᵀ(Σ1⁻¹ − Σ0⁻¹)x + (μ0ᵀΣ0⁻¹ − μ1ᵀΣ1⁻¹)x + (1/2)(μ1ᵀΣ1⁻¹μ1 − μ0ᵀΣ0⁻¹μ0) + ln((1 − θ)/θ) + ln(|Σ1|^(1/2) / |Σ0|^(1/2)) = 0

Since Σ0 = Σ1:

⟹ (μ0 − μ1)ᵀΣ0⁻¹x + (1/2)(μ1ᵀΣ0⁻¹μ1 − μ0ᵀΣ0⁻¹μ0) + ln((1 − θ)/θ) = 0
⟹ w·x + b = 0

Where:

w = Σ0⁻¹(μ0 − μ1)
b = ln((1 − θ)/θ) + (1/2)(μ1ᵀΣ0⁻¹μ1 − μ0ᵀΣ0⁻¹μ0)

(d) This describes a linear hyperplane, or linear discriminant.

(e) We proceed in a similar way to part (c), but now we no longer assume that Σ0 = Σ1, and we end up with:

xᵀAx + w·x + b = 0

Where:

A = (1/2)(Σ1⁻¹ − Σ0⁻¹)
w = Σ0⁻¹μ0 − Σ1⁻¹μ1
b = ln((1 − θ)/θ) + (1/2)(μ1ᵀΣ1⁻¹μ1 − μ0ᵀΣ0⁻¹μ0) + ln(|Σ1|^(1/2) / |Σ0|^(1/2))

This describes a quadratic hypersurface, or quadratic discriminant.

(f) In Naive Bayes we assume conditional independence amongst the attributes. For Gaussian class-conditional densities this is equivalent to assuming that Σ0 and Σ1 are diagonal matrices. So the approach suggested in (c) represents a richer model, with more parameters to learn, but with fewer assumptions.
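To make (c)–(e) concrete, the sketch below builds the quadratic discriminant parameters A, w and b from assumed, purely illustrative values of θ, μ0, μ1, Σ0 and Σ1, and checks that xᵀAx + w·x + b agrees with a direct comparison of the prior-weighted class densities; choosing Σ0 = Σ1 makes A vanish and recovers the linear discriminant of (c):

```python
import numpy as np

# Assumed, illustrative generative parameters for a 2-D attribute vector.
theta = 0.4                                        # pY(y = 1)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma0 = np.array([[1.0, 0.2], [0.2, 1.5]])
Sigma1 = np.array([[0.8, 0.0], [0.0, 0.6]])

def log_gaussian(x, mu, Sigma):
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def log_posterior_ratio(x):
    """ln[pY(y=1) pX(x|y=1)] - ln[pY(y=0) pX(x|y=0)]; zero exactly on the discriminant boundary."""
    return (np.log(theta) + log_gaussian(x, mu1, Sigma1)
            - np.log(1 - theta) - log_gaussian(x, mu0, Sigma0))

# Quadratic discriminant parameters from part (e).
inv0, inv1 = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
A = 0.5 * (inv1 - inv0)
w = inv0 @ mu0 - inv1 @ mu1
b = (np.log((1 - theta) / theta)
     + 0.5 * (mu1 @ inv1 @ mu1 - mu0 @ inv0 @ mu0)
     + 0.5 * (np.linalg.slogdet(Sigma1)[1] - np.linalg.slogdet(Sigma0)[1]))

x = np.array([1.0, 0.5])
# x'Ax + w.x + b equals the negative log posterior ratio, so the two numbers below agree;
# with Sigma0 == Sigma1 the matrix A is zero and only the linear part w.x + b of (c) remains.
print(x @ A @ x + w @ x + b, -log_posterior_ratio(x))
```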
(a) Discriminative and Generative classification are forms of probabilistic classification in which the learning problem becomes one of inferring the posterior probability pY(y|x). Discriminative Classification seeks to learn pY(y|x) directly. Generative Classification uses Bayes’ Theorem to express the posterior as pY(y|x) ∝ pX(x|y) pY(y), and then seeks to learn pY(y|x) indirectly by first inferring pX(x|y) and pY(y).

(b) Assume x = [1, x1, ..., xm]ᵀ, i.e. the bias term is included in the attribute vector, and assume:

y* = w·x + ε

where Y* is a latent variable with outcome y*, such that y* ∼ Logistic(w·x, 1), i.e. ε ∼ Logistic(0, 1). We assume the following classification model:

y(i) = 1 if y*(i) ≥ 0
y(i) = 0 if y*(i) < 0

Then:

pY(y(i) = 1|x(i)) = P(y*(i) ≥ 0|x(i))
= P(w·x(i) + ε(i) ≥ 0)
= P(ε(i) ≥ −w·x(i))
= 1 − P(ε(i) < −w·x(i))
= 1 − 1/(1 + exp(w·x(i)))      [by the cdf property of the Logistic distribution]
= exp(w·x(i)) / (1 + exp(w·x(i)))
= 1/(1 + exp(−w·x(i)))

which is exactly the Logistic Regression model.

(c) We perform a similar derivation to the one in (b), with the following updates:

y* ∼ Logistic(w·x + a, b)

pY(y(i) = 1|x(i)) = 1 − 1/(1 + exp((w·x(i) + a)/b))
= exp((w·x(i) + a)/b) / (1 + exp((w·x(i) + a)/b))
= 1/(1 + exp(−(w·x(i) + a)/b))
= 1/(1 + exp(−w̃·x(i)))

Here:

w̃0 = (w0 + a)/b
w̃i = wi/b   for all i ≠ 0

Thus instead of learning w, we must now learn w̃. But this amounts to just re-naming (re-scaling and shifting) the parameters, so there is no effect from changing the values of a and b.

(d) Assume:

y* = w·x + ε

where Y* is a latent variable with outcome y*, such that y* ∼ N(w·x, 1), i.e. ε ∼ N(0, 1). We assume the following classification model:

y(i) = 1 if y*(i) ≥ 0
y(i) = 0 if y*(i) < 0

Then:

pY(y(i) = 1|x(i)) = P(y*(i) ≥ 0|x(i))
= P(w·x(i) + ε(i) ≥ 0)
= P(ε(i) ≥ −w·x(i))
= ∫_{−w·x(i)}^{∞} (1/√(2π)) exp(−t²/2) dt      [by the cdf property of the Gaussian distribution]
= Φ(w·x(i))

where Φ is the standard Normal cdf. This is the Probit Regression model.

(e) Probit regression is more sensitive to outliers than Logistic regression. Why? In logistic regression, as x → ∞, the tail of the logistic sigmoid decays like exp(−x); in probit regression, as x → ∞, the tail of the activation function (the Gaussian cdf) decays like exp(−x²/2). Consequently a point lying far on the wrong side of the boundary incurs a much larger negative log-likelihood penalty under the probit model, so its fitted parameters are pulled around more strongly by such outliers.

(f) If K = 2:

pY(y = 1|x) = exp(w1·x) / (exp(w1·x) + exp(w2·x))
= 1/(1 + exp((w2 − w1)·x))
= 1/(1 + exp(−w·x))

Where: w = w1 − w2. And this is just Logistic Regression.

(g) We have a class boundary between two classes, j and k, when:

pY(y = j|x) = pY(y = k|x)
exp(wj·x) / Σ_{i=1}^{K} exp(wi·x) = exp(wk·x) / Σ_{i=1}^{K} exp(wi·x)
exp(wj·x) = exp(wk·x)
wj·x = wk·x
(wj − wk)·x = 0

This is the equation of a linear hyperplane, thus the discriminant boundaries are linear.
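As a quick numerical check of (f) and (g), a short sketch with randomly drawn, purely illustrative softmax weights: for K = 2 the softmax posterior coincides with the logistic sigmoid of (w1 − w2)·x, and the boundary between any two classes j and k is the linear condition (wj − wk)·x = 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed softmax parameters: K classes, m attributes (bias already included in x).
K, m = 3, 4
W = rng.normal(size=(K, m))            # row k holds the weight vector w_k

def softmax_posterior(x, weights):
    scores = weights @ x
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

x = rng.normal(size=m)

# Part (f): restricted to the first two classes, the softmax posterior for class 1
# equals the logistic sigmoid of (w1 - w2).x
p_two_class = softmax_posterior(x, W[:2])
p_logistic = 1.0 / (1.0 + np.exp(-(W[0] - W[1]) @ x))
print(np.isclose(p_two_class[0], p_logistic))          # True

# Part (g): the boundary between classes j and k is (wj - wk).x = 0, linear in x;
# the sign of that score decides which of the two classes has the larger posterior.
j, k = 0, 2
p_full = softmax_posterior(x, W)
print((W[j] - W[k]) @ x > 0, p_full[j] > p_full[k])    # the two booleans agree
```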