ANLY-601 Spring 2018
Assignment 4, Due Tuesday, March 27, 2018 — in class
You may use your class notes, the text, or any calculus books — please use no other references (including internet or other statistics texts). If you use Mathematica to derive results, you must include the notebook with your solution so I can see what you did.
1. Maximum-likelihood Cost Function for Multiclass Problems
This problem extends the cross-entropy error function to multiple classes. Suppose we have L classes ω_1, …, ω_L and each example feature vector x is from an object that belongs to one and only one class. Suppose further that the class labeling scheme assigns a binary vector y with L components to each example with
y_i(x) = 1 if x ∈ ω_i   and   y_j(x) = 0 for all j ≠ i .
That is, each vector y has exactly one element set equal to 1, and the other elements set equal to 0. We can then write the probability of the class label vector y for a sample with features x as a multinomial distribution
p(y|x) = ∏_{i=1}^{L} α_i(x)^{y_i}    (1)
with 0 ≤ α_i(x) ≤ 1. For example,
p((0,1,0,0,0,…,0)|x) = α_2(x) .
- (a) We want to be sure that p(y|x) is properly normalized, that is
∑_{y} p(y|x) = 1    (2)
where the sum is over the set of all allowed vectors y. Show that this normalization condition requires that
∑_{i=1}^{L} α_i(x) = 1   ∀x .    (3)
(To be clear, you should probably explicate the sum over the allowed label vectors by giving the first several terms in the sum in (2).)
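For concreteness, here is a small numerical sketch in Python of the sum in (2); the posteriors α_i(x) are made-up values for a single fixed x. Enumerating the L allowed one-hot label vectors and evaluating (1) for each shows that the total equals ∑_i α_i(x).

    import numpy as np

    # Hypothetical class posteriors alpha_i(x) for one fixed x (illustrative values that sum to 1).
    alpha = np.array([0.1, 0.6, 0.2, 0.1])   # L = 4 classes
    L = alpha.size

    total = 0.0
    for i in range(L):
        y = np.zeros(L)
        y[i] = 1.0                            # the allowed label vectors are the one-hot vectors
        p_y = np.prod(alpha ** y)             # eq. (1): equals alpha[i], since y_j = 0 elsewhere
        total += p_y
        print(f"y = {y.astype(int)}  p(y|x) = {p_y:.2f}")

    print("sum over allowed y:", total)       # equals alpha.sum(), i.e. 1 exactly when (3) holds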
- (b) Suppose we have a collection of N statistically independent samples with feature vectors x^a and label vectors y^a, a = 1, 2, …, N (the superscript denotes the sample number). Write the likelihood of the data set
p({y^(1), y^(2), …, y^(N)} | {x^(1), x^(2), …, x^(N)}, α_1(x), …, α_L(x))    (4)
that follows from the likelihood for each data sample from equation (1).
- (c) Show that maximizing the log-likelihood of the entire data set is equivalent to minimizing the cost function
E = − ∑_{a=1}^{N} ∑_{i=1}^{L} y_i^a log α_i(x^a) .    (5)
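As a sanity check on the cost (5), the following short Python sketch (toy numbers only) evaluates E for one-hot labels y_i^a and hypothetical model outputs α_i(x^a).

    import numpy as np

    # Toy data: N = 3 samples, L = 3 classes. Rows of y are one-hot labels;
    # rows of alpha are hypothetical posteriors alpha_i(x^a), each row summing to 1.
    y = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]], dtype=float)
    alpha = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])

    # Eq. (5): E = - sum_a sum_i y_i^a log alpha_i(x^a).
    # Because y is one-hot, only the correct-class term survives in each inner sum.
    E = -np.sum(y * np.log(alpha))
    print("E =", E)    # = -(log 0.7 + log 0.8 + log 0.4)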
2. Extension of Logistic Regression to Multi-Class Problems
In logistic regression, we have two classes and we model the posterior for the single class label y ∈ {0, 1} as
α(x) ≡ p(y=1|x) = 1 / (1 + exp(V^T x + ν)) ,    (6)
where V and ν are the parameters in the model (a vector and a scalar, respectively). We fit V and ν to data by minimizing the cross-entropy error function
E = − log p({y}|{x}) = − ∑_{a=1}^{N} [ y^a log α(x^a) + (1 − y^a) log(1 − α(x^a)) ] .    (7)
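A brief numerical sketch of (6) and (7) in Python, using made-up parameters V, ν and a toy data set (the values are purely illustrative, and the sign convention follows Eqn. (6) exactly as written):

    import numpy as np

    # Hypothetical parameters and a few feature vectors (illustrative values only).
    V = np.array([0.5, -1.0])
    nu = 0.2
    X = np.array([[ 1.0,  0.3],
                  [-0.4,  2.0],
                  [ 0.7, -1.2]])
    y = np.array([1, 0, 1], dtype=float)

    # Eq. (6): alpha(x) = 1 / (1 + exp(V^T x + nu)), evaluated for every sample.
    alpha = 1.0 / (1.0 + np.exp(X @ V + nu))

    # Eq. (7): cross-entropy error summed over the samples.
    E = -np.sum(y * np.log(alpha) + (1 - y) * np.log(1 - alpha))
    print("alpha =", alpha)
    print("E =", E)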
We can fit the two-class problem into the framework in Problem 1. We use two class labels y_i, i = 1, 2, with y_1 = 1, y_2 = 0 if the example is in class ω_1, and y_1 = 0, y_2 = 1 if the example is in class ω_2.
(a) A natural model for the class posteriors is the soft-max function
α_i(x) = exp g_i(x) / ∑_{j=1}^{2} exp g_j(x)    (8)
where −∞ < g_i(x) < ∞ . Show that the soft-max function guarantees that
0 ≤ α_i(x) ≤ 1, ∀x
and
∑_{i=1}^{2} α_i(x) = 1 .
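The short Python sketch below (with hypothetical scores g_i) evaluates the soft-max of Eqn. (8) and checks both properties numerically; subtracting max(g) is only a numerical safeguard and does not change the value of (8).

    import numpy as np

    def softmax(g):
        """Soft-max of eq. (8): alpha_i = exp(g_i) / sum_j exp(g_j)."""
        e = np.exp(g - np.max(g))   # shifting by max(g) cancels in the ratio, avoids overflow
        return e / e.sum()

    # Hypothetical scores g_i(x) for one x; any real values are allowed.
    g = np.array([2.3, -0.7])
    alpha = softmax(g)
    print("alpha =", alpha)         # each component lies in [0, 1]
    print("sum   =", alpha.sum())   # 1.0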
(b) Show that for our two-class case the soft-max forms of the α_i(x) reduce to
α_1(x) = 1 / (1 + exp(g_2 − g_1))
and
α_2(x) = 1 / (1 + exp(−(g_2 − g_1))) .
Thus we really need only one g(x), and a familiar candidate is the logistic regression choice g_2 − g_1 = V^T x + ν.
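A quick numerical check (again with made-up values) that the two soft-max posteriors agree with the logistic form when g_2 − g_1 = V^T x + ν; the common offset c cancels in the soft-max, which is why only the difference matters.

    import numpy as np

    # Hypothetical parameters and a single feature vector (illustrative values only).
    V = np.array([0.5, -1.0])
    nu = 0.2
    x = np.array([1.5, 0.3])

    # Any pair g1, g2 with g2 - g1 = V^T x + nu will do; choose an arbitrary offset c.
    c = 0.7
    g = np.array([c, c + V @ x + nu])

    alpha = np.exp(g) / np.exp(g).sum()            # soft-max, eq. (8)
    logistic = 1.0 / (1.0 + np.exp(V @ x + nu))    # eq. (6)

    print("alpha_1 =", alpha[0], " logistic form =", logistic)      # these agree
    print("alpha_2 =", alpha[1], " 1 - logistic  =", 1.0 - logistic)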
(c) We fit the parameters V, ν by minimizing the cost function derived in Problem 1 (Eqn. 5),
E = − ∑_{a=1}^{N} ∑_{i=1}^{2} y_i^a log α_i(x^a) .    (9)
Show that for the two-class case (with our choice of class labels) this error function
reduces to the cross-entropy function Eqn. (7).
This extends to the general L-class case. The gi(x) functions can be linear functions of x as in logistic regression. They can also be realized by more complicated functions — for example, the outputs of a neural network.