ANLY-601 Spring 2018

Assignment 4, Due Tuesday, March 27, 2018 — in class

You may use your class notes, the text, or any calculus books — please use no other references
(including internet or other statistics texts). If you use Mathematica to derive results, you must
include the notebook with your solution so I can see what you did.

1. Maximum-likelihood Cost Function for Multiclass Problems

This problem extends the cross-entropy error function to multiple classes. Suppose we have
L classes ω1, . . . , ωL and each example feature vector x is from an object that belongs to one
and only one class. Suppose further that the class labeling scheme assigns a binary vector y
with L components to each example with

yi(x) = 1 if x ∈ ωi and yj(x) = 0 for all j ≠ i .

That is, each vector y has exactly one element set equal to 1, and the other elements set equal
to 0. We can then write the probability of the class label vector y for a sample with features
x as a multinomial distribution

p(y|x) = ∏_{i=1}^L αi(x)^{yi}    (1)

with 0 ≤ αi(x) ≤ 1. For example,

p((0, 1, 0, 0, 0, . . . , 0) |x) = α2(x) .
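
A quick numerical sketch of Eq. (1) in Python, assuming made-up posterior values αi(x) and the
one-hot label from the example above (the alpha numbers are purely illustrative):

    # Evaluate Eq. (1): for a one-hot label y, the product over classes
    # picks out the single alpha belonging to the active class.
    import numpy as np

    alpha = np.array([0.1, 0.6, 0.1, 0.1, 0.1])  # hypothetical alpha_i(x), sums to 1
    y = np.array([0, 1, 0, 0, 0])                # the example belongs to class omega_2

    p_y_given_x = np.prod(alpha ** y)            # equals alpha[1] = 0.6
    print(p_y_given_x)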

(a) We want to be sure that p(y|x) is properly normalized, that is

∑_{y} p(y|x) = 1    (2)

where the sum is over the set of all allowed vectors y. Show that this normalization
condition requires that

∑_{i=1}^L αi(x) = 1  ∀ x .    (3)

(To be clear, you should probably explicate the sum over the allowed label vectors by
giving the first several terms in the sum in (2).)

(b) Suppose we have a collection of N statistically independent samples with feature vectors
x^a and label vectors y^a, a = 1, 2, . . . , N (the superscript denotes the sample number).
Write the likelihood of the data set

p({y^(1), y^(2), . . . , y^(N)} | {x^(1), x^(2), . . . , x^(N)}, α1(x), . . . , αL(x))    (4)

that follows from the likelihood for each data sample from equation (1).
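
A numerical sketch of this likelihood for made-up values (N = 3 independent samples, L = 2
classes, hypothetical αi(x^a)); for independent samples it is just the product of the
per-sample probabilities from Eq. (1):

    import numpy as np

    alphas = np.array([[0.7, 0.3],                # hypothetical alpha_i(x^1)
                       [0.2, 0.8],                # alpha_i(x^2)
                       [0.9, 0.1]])               # alpha_i(x^3)
    labels = np.array([[1, 0],                    # y^1
                       [0, 1],                    # y^2
                       [1, 0]])                   # y^3

    likelihood = np.prod(alphas ** labels)        # product over samples and classes
    print(likelihood)                             # 0.7 * 0.8 * 0.9 = 0.504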

(c) Show that maximizing the log-likelihood of the entire data set is equivalent to minimizing
the cost function

E = − ∑_{a=1}^N ∑_{i=1}^L yi^a log αi(x^a) .    (5)
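
A small numerical check that Eq. (5) is the negative log of the data-set likelihood, reusing
the hypothetical alphas and labels from the sketch in part (b):

    import numpy as np

    alphas = np.array([[0.7, 0.3], [0.2, 0.8], [0.9, 0.1]])
    labels = np.array([[1, 0], [0, 1], [1, 0]])

    E = -np.sum(labels * np.log(alphas))                       # Eq. (5)
    print(np.isclose(E, -np.log(np.prod(alphas ** labels))))   # True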

2. Extension of Logistic Regression to Multi-Class Problems

In logistic regression, we have two classes and we model the posterior for the single class
label y ∈ {0, 1} as

α(x) ≡ p(y = 1|x) = 1 / (1 + exp(V^T x + ν)) ,    (6)

where V and ν are the (vector and scalar respectively) parameters in the model. We fit V
and ν to data by minimizing the cross-entropy error function

E = − log p({y}|{x}) = − ∑_{a=1}^N [ y^a log α(x^a) + (1 − y^a) log(1 − α(x^a)) ] .    (7)
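
A short sketch of the model (6) and the error (7); the values of V, ν, and the data below are
arbitrary illustration values, not a fitted model:

    import numpy as np

    V, nu = np.array([1.0, -2.0]), 0.5                    # hypothetical parameters
    X = np.array([[0.3, 1.2], [2.0, 0.1], [-1.0, 0.4]])   # N = 3 feature vectors
    y = np.array([1, 0, 1])                               # binary labels

    alpha = 1.0 / (1.0 + np.exp(X @ V + nu))              # Eq. (6) for each sample
    E = -np.sum(y * np.log(alpha) + (1 - y) * np.log(1 - alpha))   # Eq. (7)
    print(E)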

We can fit the two-class problem into the framework in Problem 1. We use two class labels
yi, i = 1, 2 with y1 = 1, y2 = 0 if the example is in class ω1, and y1 = 0, y2 = 1 if the example
is in class ω2.

(a) A natural model for the class posteriors is the soft-max function

αi(x) = exp(gi(x)) / ∑_{j=1}^2 exp(gj(x))    (8)

where −∞ < gi(x) < ∞. Show that the soft-max function guarantees that

0 ≤ αi(x) ≤ 1  ∀ x   and   ∑_{i=1}^2 αi(x) = 1 .

(b) Show that for our two-class case the soft-max forms of the αi(x) reduce to

α1(x) = 1 / (1 + exp(g2 − g1))   and   α2(x) = 1 / (1 + exp(−(g2 − g1))) .

Thus we really need only one g(x), and a familiar candidate is the logistic regression choice
g2 − g1 = V^T x + ν.

(c) We fit the parameters V, ν by minimizing the cost function derived in Problem 1 (Eqn. 5)

E = − ∑_{a=1}^N ∑_{i=1}^2 yi^a log αi(x^a) .    (9)

Show that for the two-class case (with our choice of class labels) this error function reduces
to the cross-entropy function in Eqn. (7).

This extends to the general L-class case. The gi(x) functions can be linear functions of x as
in logistic regression. They can also be realized by more complicated functions — for example,
the outputs of a neural network.
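
A numerical sketch of parts (a)-(c) for arbitrary illustrative values of gi(x^a): it checks
that the two-class soft-max is a logistic function of g2 − g1 and that Eq. (9) agrees with the
cross-entropy of Eq. (7). It is only a sanity check, not the requested derivation.

    import numpy as np

    g = np.array([[0.4, -1.1], [2.0, 0.3], [-0.5, 1.7]])      # hypothetical g_i(x^a), N = 3
    alpha = np.exp(g) / np.exp(g).sum(axis=1, keepdims=True)  # two-class soft-max, Eq. (8)

    # part (b): alpha_1 equals a logistic function of g2 - g1
    print(np.allclose(alpha[:, 0], 1.0 / (1.0 + np.exp(g[:, 1] - g[:, 0]))))   # True

    # part (c): with one-hot labels, Eq. (9) equals the two-class cross-entropy
    y1 = np.array([1, 0, 1])                                  # y_1^a; y_2^a = 1 - y_1^a
    Y = np.stack([y1, 1 - y1], axis=1)
    E9 = -np.sum(Y * np.log(alpha))
    E7 = -np.sum(y1 * np.log(alpha[:, 0]) + (1 - y1) * np.log(alpha[:, 1]))
    print(np.isclose(E9, E7))                                 # True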