
ANLY-601 Spring 2018
Assignment 4, Due Tuesday, March 27, 2018 — in class

You may use your class notes, the text, or any calculus books — please use no other references (including internet or other statistics texts). If you use Mathematica to derive results, you must include the notebook with your solution so I can see what you did.

1. Maximum-likelihood Cost Function for Multiclass Problems

This problem extends the cross-entropy error function to multiple classes. Suppose we have $L$ classes $\omega_1, \ldots, \omega_L$ and each example feature vector $x$ is from an object that belongs to one and only one class. Suppose further that the class labeling scheme assigns a binary vector $y$ with $L$ components to each example with

$$y_i(x) = 1 \ \text{if} \ x \in \omega_i \quad \text{and} \quad y_j(x) = 0 \ \text{for all} \ j \neq i\,.$$

That is, each vector $y$ has exactly one element set equal to 1, and the other elements set equal to 0. We can then write the probability of the class label vector $y$ for a sample with features $x$ as a multinomial distribution

$$p(y \mid x) = \prod_{i=1}^{L} \alpha_i(x)^{y_i} \tag{1}$$

with $0 \le \alpha_i(x) \le 1$. For example,

$$p\big((0,1,0,0,0,\ldots,0) \mid x\big) = \alpha_2(x)\,.$$
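As a concrete illustration of equation (1), the one-hot exponents simply select the $\alpha$ of the labeled class. The sketch below is not part of the problem statement, and the posterior values are arbitrary:

```python
import numpy as np

# Hypothetical class posteriors alpha_i(x) for a single x with L = 5 classes;
# they are non-negative and sum to 1 (illustrative values only).
alpha = np.array([0.1, 0.6, 0.05, 0.15, 0.1])

# One-hot label vector y picking class omega_2 (index 1 with 0-based indexing).
y = np.array([0, 1, 0, 0, 0])

# Multinomial label probability from equation (1): prod_i alpha_i ** y_i.
p_y_given_x = np.prod(alpha ** y)

print(p_y_given_x)                        # 0.6
print(np.isclose(p_y_given_x, alpha[1]))  # True: only the "on" factor survives
```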

(a) We want to be sure that $p(y \mid x)$ is properly normalized, that is,

$$\sum_{\{y\}} p(y \mid x) = 1 \tag{2}$$

where the sum is over the set of all allowed vectors $y$. Show that this normalization condition requires that

$$\sum_{i=1}^{L} \alpha_i(x) = 1 \quad \forall x\,. \tag{3}$$

(To be clear, you should probably explicate the sum over the allowed label vectors by giving the first several terms in the sum in (2).)
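A quick numerical sanity check of the connection between (2) and (3); the posterior values are hypothetical and chosen to sum to 1:

```python
import numpy as np

# Hypothetical posteriors for one x (L = 3); chosen so that they sum to 1.
alpha = np.array([0.25, 0.5, 0.25])
L = len(alpha)

# The only allowed label vectors are the L one-hot vectors.
allowed_y = np.eye(L, dtype=int)

# Sum of p(y|x) over all allowed y, computing each term with equation (1).
total = sum(np.prod(alpha ** y) for y in allowed_y)

# Each term reduces to a single alpha_i, so the sum equals sum_i alpha_i(x).
print(total)                           # 1.0
print(np.isclose(total, alpha.sum()))  # True
```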

(b) Suppose we have a collection of $N$ statistically independent samples with feature vectors $x^a$ and label vectors $y^a$, $a = 1, 2, \ldots, N$ (the superscript denotes the sample number). Write the likelihood of the data set

$$p\big(\{y^{(1)}, y^{(2)}, \ldots, y^{(N)}\} \,\big|\, \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}, \alpha_1(x), \ldots, \alpha_L(x)\big) \tag{4}$$

that follows from the likelihood for each data sample from equation (1).
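For intuition only, the independence assumption makes the data-set likelihood a product of per-sample multinomials. A minimal NumPy sketch with made-up posteriors and labels:

```python
import numpy as np

# Hypothetical posteriors alpha_i(x^a) for N = 3 samples and L = 3 classes
# (rows are samples, columns are classes; each row sums to 1).
alpha = np.array([[0.7, 0.2, 0.1],
                  [0.25, 0.5, 0.25],
                  [0.1, 0.1, 0.8]])

# One-hot label vectors y^a for the same three samples.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

# Independence makes the data-set likelihood (4) a product over samples of
# the per-sample multinomial (1); each factor is alpha for the "on" class.
likelihood = np.prod(np.prod(alpha ** Y, axis=1))
print(likelihood)  # 0.7 * 0.5 * 0.8 = 0.28 (up to rounding)
```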

(c) Show that maximizing the log-likelihood of the entire data set is equivalent to minimizing the cost function

$$E = -\sum_{a=1}^{N} \sum_{i=1}^{L} y_i^a \log \alpha_i(x^a)\,. \tag{5}$$
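The cost (5) is minus the log of the likelihood in (4); a short numerical check, repeating the hypothetical values from the previous sketch so this block is self-contained:

```python
import numpy as np

# Same hypothetical alpha and one-hot Y as in the previous sketch.
alpha = np.array([[0.7, 0.2, 0.1],
                  [0.25, 0.5, 0.25],
                  [0.1, 0.1, 0.8]])
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

# Cost function (5): E = - sum_a sum_i y_i^a log alpha_i(x^a).
E = -np.sum(Y * np.log(alpha))

# It equals minus the log of the data-set likelihood (4).
log_likelihood = np.log(np.prod(np.prod(alpha ** Y, axis=1)))
print(np.isclose(E, -log_likelihood))  # True
```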

2. Extension of Logistic Regression to Multi-Class Problems

In logistic regression, we have two classes and we model the posterior for the single class label $y \in \{0, 1\}$ as

$$\alpha(x) \equiv p(y = 1 \mid x) = \frac{1}{1 + \exp(V^T x + \nu)}\,, \tag{6}$$

where $V$ and $\nu$ are the (vector and scalar respectively) parameters in the model. We fit $V$ and $\nu$ to data by minimizing the cross-entropy error function

$$E = -\log p(\{y\} \mid \{x\}) = -\sum_{a=1}^{N} \Big[\, y^a \log \alpha(x^a) + (1 - y^a) \log\big(1 - \alpha(x^a)\big) \Big]\,. \tag{7}$$
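A minimal sketch of the model (6) and the error (7), keeping the sign convention exactly as written in (6). The function names `alpha_logistic` and `cross_entropy` are chosen here, and the parameter values and data are placeholders, not from the assignment:

```python
import numpy as np

def alpha_logistic(x, V, nu):
    """Posterior p(y=1|x) from equation (6), with the sign convention exactly
    as written there: alpha = 1 / (1 + exp(V^T x + nu))."""
    return 1.0 / (1.0 + np.exp(V @ x + nu))

def cross_entropy(y, a):
    """Cross-entropy error (7): E = -sum_a [y^a log a^a + (1-y^a) log(1-a^a)]."""
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

# Placeholder parameters and a tiny made-up data set.
V, nu = np.array([1.0, -2.0]), 0.5
X = np.array([[0.3, 1.2],
              [-1.0, 0.4],
              [2.0, -0.5]])
y = np.array([1, 0, 1])

a = np.array([alpha_logistic(x, V, nu) for x in X])
print(a)                    # each value lies in (0, 1)
print(cross_entropy(y, a))  # scalar error to be minimized over V and nu
```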

We can fit the two-class problem into the framework in Problem 1. We use two class labels $y_i$, $i = 1, 2$, with $y_1 = 1$, $y_2 = 0$ if the example is in class $\omega_1$, and $y_1 = 0$, $y_2 = 1$ if the example is in class $\omega_2$.

(a) A natural model for the class posteriors is the soft-max function

$$\alpha_i(x) = \frac{\exp g_i(x)}{\sum_{j=1}^{2} \exp g_j(x)} \tag{8}$$

where $-\infty < g_i(x) < \infty$. Show that the soft-max function guarantees that

$$0 \le \alpha_i(x) \le 1, \quad \forall x$$

and

$$\sum_{i=1}^{2} \alpha_i(x) = 1\,.$$
(b) Show that for our two-class case the soft-max forms of the $\alpha_i(x)$ reduce to

$$\alpha_1(x) = \frac{1}{1 + \exp(g_2 - g_1)}$$

and

$$\alpha_2(x) = \frac{1}{1 + \exp\big(-(g_2 - g_1)\big)}\,.$$

Thus we really need only one $g(x)$, and a familiar candidate is the logistic regression choice $g_2 - g_1 = V^T x + \nu$.
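A quick check of this reduction with arbitrary scores $g_1$, $g_2$ (illustrative values only):

```python
import numpy as np

# Arbitrary scores for the two classes (illustrative values only).
g1, g2 = 1.4, -0.3

# Soft-max posteriors from equation (8).
alpha = np.exp([g1, g2]) / np.sum(np.exp([g1, g2]))

# The same posteriors written in terms of the single difference g2 - g1.
alpha1 = 1.0 / (1.0 + np.exp(g2 - g1))
alpha2 = 1.0 / (1.0 + np.exp(-(g2 - g1)))

print(np.allclose(alpha, [alpha1, alpha2]))  # True
```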
(c) We fit the parameters $V$, $\nu$ by minimizing the cost function derived in Problem 1 (Eqn. 5)

$$E = -\sum_{a=1}^{N} \sum_{i=1}^{2} y_i^a \log \alpha_i(x^a)\,. \tag{9}$$

Show that for the two-class case (with our choice of class labels) this error function reduces to the cross-entropy function Eqn. (7).
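A numerical check of this equivalence, with hypothetical posteriors and labels:

```python
import numpy as np

# Hypothetical alpha_1(x^a) for N = 4 samples; alpha_2 = 1 - alpha_1.
a1 = np.array([0.9, 0.2, 0.6, 0.7])
A = np.stack([a1, 1.0 - a1], axis=1)   # columns: alpha_1(x^a), alpha_2(x^a)

# Binary labels y^a in {0, 1}; the corresponding one-hot labels are (y, 1-y).
y = np.array([1, 0, 0, 1])
Y = np.stack([y, 1 - y], axis=1)

E_two_class = -np.sum(Y * np.log(A))                           # equation (9)
E_binary = -np.sum(y * np.log(a1) + (1 - y) * np.log(1 - a1))  # equation (7)

print(np.isclose(E_two_class, E_binary))  # True
```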

This extends to the general L-class case. The gi(x) functions can be linear functions of x as in logistic regression. They can also be realized by more complicated functions — for example, the outputs of a neural network.
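For concreteness, a sketch of the general $L$-class model with linear score functions, writing one weight vector $V_i$ and offset $\nu_i$ per class; this per-class parameterization is an assumption for illustration, and all numerical values below are random placeholders:

```python
import numpy as np

def softmax_rows(G):
    """Row-wise soft-max: alpha_i(x^a) = exp(g_i(x^a)) / sum_j exp(g_j(x^a))."""
    Z = np.exp(G - G.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Linear score functions g_i(x) = V_i^T x + nu_i, one (V_i, nu_i) per class.
# All values below are random placeholders, purely for illustration.
rng = np.random.default_rng(0)
L, d, N = 4, 3, 5
V = rng.normal(size=(L, d))
nu = rng.normal(size=L)
X = rng.normal(size=(N, d))
labels = rng.integers(0, L, size=N)
Y = np.eye(L)[labels]            # one-hot label vectors y^a

G = X @ V.T + nu                 # N x L matrix of scores g_i(x^a)
A = softmax_rows(G)              # N x L matrix of posteriors alpha_i(x^a)
E = -np.sum(Y * np.log(A))       # L-class cost of equation (5)

print(A.sum(axis=1))             # each row sums to 1
print(E)                         # the quantity to minimize over V and nu
```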
