
CS5487 Problem Set 8
Linear Classifiers
Antoni Chan, Department of Computer Science, City University of Hong Kong
Logistic Regression
Problem 8.1 Logistic sigmoid

Let σ(a) be the logistic sigmoid function,

    σ(a) = 1 / (1 + e^{-a}).    (8.1)

Let's derive some useful properties:

(a) Show that the derivative of the sigmoid is

    ∂σ(a)/∂a = σ(a)(1 - σ(a)).    (8.2)

(b) Show that

    1 - σ(f) = σ(-f).    (8.3)

(c) Show that the inverse of σ(a) is

    σ^{-1}(a) = log( a / (1 - a) ).    (8.4)

This is called the logit function, or log odds.
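Although these identities are what the problem asks you to prove, they are easy to sanity-check numerically. Below is a minimal NumPy sketch (not part of the original problem set; the helper names sigmoid and logit are mine) that compares the analytic derivative (8.2) with a central finite difference and checks (8.3) and (8.4) directly.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    return np.log(p / (1.0 - p))

a = np.linspace(-5, 5, 11)
eps = 1e-6

# (8.2): analytic derivative vs. a central finite difference
analytic = sigmoid(a) * (1.0 - sigmoid(a))
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))   # True

# (8.3): 1 - sigma(f) = sigma(-f)
print(np.allclose(1.0 - sigmoid(a), sigmoid(-a)))  # True

# (8.4): the logit (log odds) is the inverse of the sigmoid
print(np.allclose(logit(sigmoid(a)), a))           # True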
.........

Problem 8.2 Logistic Regression: MLE and IRLS

Consider the two-class logistic regression problem. Given the training set D = {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d is the input and y_i ∈ {0, 1} is the corresponding class, define the conditional probability of the output class given the input as

    p(y_i|x_i) = π_i^{y_i} (1 - π_i)^{1-y_i},    (8.5)

where π_i = σ(w^T x_i) is the conditional probability that x_i belongs to class 1, and σ(a) is the logistic sigmoid function,

    σ(a) = 1 / (1 + e^{-a}).    (8.6)

In this problem, we will derive the maximum likelihood estimate of w from the training set D,

    w* = argmax_w ℓ(w),   ℓ(w) = Σ_{i=1}^n log p(y_i|x_i),    (8.7)

or equivalently

    w* = argmin_w E(w),    (8.8)

    E(w) = -ℓ(w) = - Σ_{i=1}^n { y_i log π_i + (1 - y_i) log(1 - π_i) }.    (8.9)

Define X = [x_1, ..., x_n], y = [y_1, ..., y_n]^T, and π = [π_1, ..., π_n]^T.

(a) Show that the gradient of E(w) is

    ∇E(w) = Σ_{i=1}^n (π_i - y_i) x_i = X(π - y).    (8.10)

Hint: use (8.2).

(b) Show that the Hessian of E(w) is

    ∇²E(w) = Σ_{i=1}^n π_i(1 - π_i) x_i x_i^T = X R X^T,    (8.11)

where R = diag(π_1(1 - π_1), ..., π_n(1 - π_n)).

(c) Show that ∇²E(w) is strictly positive definite, and hence E(w) is convex and has a unique minimum.

(d) E(w) can be minimized using the Newton-Raphson method (see Problem 8.6), which iteratively updates w,

    w^(new) = w^(old) - [∇²E(w)]^{-1} ∇E(w).    (8.12)

Show that the update step for logistic regression is

    w^(new) = (X R X^T)^{-1} X R z,    (8.13)

where R and z are calculated from the previous w^(old),

    R = diag(π_1(1 - π_1), ..., π_n(1 - π_n)),    (8.14)

    z = X^T w^(old) - R^{-1}(π - y).    (8.15)

The formula in (8.13) is the same as that of weighted least-squares, with weighting R and target z. The weighting and target change in each iteration, since they depend on the current estimate of w. Hence this algorithm is called iterative reweighted least-squares (IRLS or IRWLS).
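As a companion to (8.13)-(8.15), here is a minimal NumPy sketch of the IRLS iteration (not part of the original problem set). It uses the problem's convention that X = [x_1, ..., x_n] is d×n with one column per sample and y is the length-n vector of 0/1 labels; the function names are mine, and numerical safeguards (e.g., keeping π_i away from 0 and 1) are omitted.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(X, y, n_iters=20):
    """Maximum likelihood logistic regression via IRLS.
    X: d x n matrix (one column per sample), y: length-n vector of 0/1 labels."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        pi = sigmoid(X.T @ w)                           # pi_i = sigma(w^T x_i)
        R = np.diag(pi * (1.0 - pi))                    # weighting matrix, eq. (8.14)
        z = X.T @ w - (pi - y) / (pi * (1.0 - pi))      # target, eq. (8.15)
        w = np.linalg.solve(X @ R @ X.T, X @ R @ z)     # weighted LS update, eq. (8.13)
    return w

# toy usage: two overlapping Gaussian classes in 2-d (no bias term)
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-1.0, 2.0, (2, 50)), rng.normal(+1.0, 2.0, (2, 50))])
y = np.concatenate([np.zeros(50), np.ones(50)])
print(irls(X, y))

Forming the n×n matrix R explicitly mirrors the equations but is wasteful for large n; in practice one would instead scale the columns of X by the diagonal entries of R (the regularized sketch after Problem 8.4 does this).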
………

Problem 8.3 Logistic Regression: Overfitting and singularities
In this problem, we will consider how logistic regression (Problem 8.2) can overfit to a training set when the points are linearly separable.
(a) Show that for a linearly separable training set, the ML solution for logistic regression is obtained by finding a w whose decision boundary w^T x = 0 separates the classes, and then taking the magnitude of w to infinity.
(b) In this case, what happens to the shape of the sigmoid function, and the resulting estimates of the posterior probabilities p(y|x)? How is this a case of overfitting?
This singularity problem can be fixed by including a prior and finding the MAP solution, as considered in Problem 8.4. The prior term effectively adds a penalty ||w||^2 on large-magnitude w, thus preventing it from tending to infinity.
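The effect described in (a)-(b) is easy to observe empirically. Below is a minimal sketch (not part of the problem; the toy data, step size, and iteration counts are mine) that runs plain gradient descent on E(w) for a linearly separable 1-d set: ||w|| keeps growing while the log-likelihood creeps toward 0.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# a linearly separable 1-d toy set: class 0 for x < 0, class 1 for x > 0
X = np.array([[-2.0, -1.0, 1.0, 2.0]])     # d x n, here d = 1
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(1)
for t in range(1, 20001):
    pi = sigmoid(X.T @ w)
    if t % 5000 == 0:
        loglik = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
        print(t, np.linalg.norm(w), loglik)
    w -= 0.1 * (X @ (pi - y))              # gradient step on E(w), using eq. (8.10)

# ||w|| keeps increasing and the log-likelihood approaches (but never reaches) 0:
# the ML "solution" pushes ||w|| to infinity and the posteriors toward hard 0/1 values.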
………
Problem 8.4 Regularized Logistic Regression: MAP framework
In this problem we will consider MAP estimation for logistic regression. This is also known as regularized logistic regression.

Assume a zero-mean Gaussian prior on w with known precision matrix Γ (i.e., the inverse of the covariance matrix),

    p(w) = N(w | 0, Γ^{-1}).    (8.16)

Given the training set D, the MAP estimate is

    w* = argmax_w  log p(y|X, w) + log p(w).    (8.17)

(a) Show that (8.17) is equivalent to the minimization problem

    w* = argmin_w Ê(w),    (8.18)

    Ê(w) = E(w) + (1/2) w^T Γ w,    (8.19)

with E(w) defined in (8.9).

(b) Show that the gradient of Ê(w) is given by

    ∇Ê(w) = X(π - y) + Γw.    (8.20)

(c) Show that the Hessian of Ê(w) is given by

    ∇²Ê(w) = X R X^T + Γ,    (8.21)

with R = diag(π_1(1 - π_1), ..., π_n(1 - π_n)) as before.

(d) Show that the Newton-Raphson iteration (see Problem 8.2d and Problem 8.6) for minimizing Ê(w) is

    w^(new) = (X R X^T + Γ)^{-1} X R z,    (8.22)

where R and z are calculated from the previous w^(old),

    R = diag(π_1(1 - π_1), ..., π_n(1 - π_n)),    (8.23)

    z = X^T w^(old) - R^{-1}(π - y).    (8.24)

(e) How does the precision Γ help to smooth (regularize) the estimate of w?

(f) Consider the case where we include the bias term in w and x, i.e.,

    x = [x̃^T, 1]^T,   w = [w̃^T, b̃]^T,    (8.25)

where x̃ is the original input feature vector. The linear discriminant function now contains a bias term,

    f(x) = w^T x = w̃^T x̃ + b̃.    (8.26)

For the regularization, consider the two precision matrices

    Γ_1 = λI,    (8.27)

    Γ_2 = diag(λ, ..., λ, 0).    (8.28)

Show that Γ_1 applies a penalty based on the magnitudes of both w̃ and b̃, while Γ_2 applies a penalty only on the magnitude of w̃. In general, should we prefer Γ_1 or Γ_2? Why? (Hint: consider translations of the training set in the input space.)
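Not part of the problem set, but here is a minimal sketch of the regularized IRLS update (8.22)-(8.24) under the same conventions as the Problem 8.2 sketch (X is d×n, now with a final row of ones as the bias feature; the names are mine). It uses Γ_2 = diag(λ, ..., λ, 0), so the bias b̃ is left unpenalized, which is exactly what distinguishes Γ_2 from Γ_1 = λI in part (f).

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_map(X, y, lam=1.0, n_iters=20):
    """MAP (regularized) logistic regression via IRLS.
    X: d x n matrix whose last row is all ones (bias feature), y: 0/1 labels."""
    d, n = X.shape
    Gamma = lam * np.eye(d)
    Gamma[-1, -1] = 0.0                     # Gamma_2: no penalty on the bias term
    w = np.zeros(d)
    for _ in range(n_iters):
        pi = sigmoid(X.T @ w)
        r = pi * (1.0 - pi)                 # diagonal of R, eq. (8.23)
        z = X.T @ w - (pi - y) / r          # eq. (8.24)
        XR = X * r                          # scales column i of X by r_i, i.e. X R
        w = np.linalg.solve(XR @ X.T + Gamma, XR @ z)   # eq. (8.22)
    return w

Scaling the columns of X by the diagonal of R avoids forming the n×n matrix explicitly, and the penalty on w̃ keeps the optimum finite even on separable data, which is how the prior fixes the singularity discussed in Problem 8.3.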
………
Empirical Risk Minimization
Problem 8.5 Loss functions
All the linear classification algorithms that we saw in lecture are optimization problems over some error function on the training set. Where they differ is in how they define this error function.
Consider the two-class problem, with y ∈ {+1, -1}, input x ∈ R^d, and training set {(x_i, y_i)}_{i=1}^n. Define the empirical risk over the training set as

    R_emp(w) = Σ_{i=1}^n L(f(x_i), y_i),    (8.29)

where f(x_i) = w^T x_i and L(·) is a loss function. The optimal separating hyperplane is obtained by minimizing the empirical risk,

    w* = argmin_w R_emp(w).    (8.30)
This is called empirical risk minimization.
In this problem, we will consider the classifiers from lecture within this framework. Define the quantity

    z_i = y_i w^T x_i.    (8.31)

Recall that z_i > 0 when the training point x_i is correctly classified, and z_i < 0 when the point is misclassified.

(a) Show that the loss function for minimizing the number of misclassified training points is

    L_{01}(z_i) = { 0,  z_i ≥ 0
                    1,  z_i < 0 },    (8.32)

i.e., the 0-1 loss function.

(b) Show that the loss function for the perceptron is

    L_p(z_i) = max(0, -z_i).    (8.33)

(c) Show that the loss function for least-squares classification is

    L_{LSC}(z_i) = (z_i - 1)^2.    (8.34)

Hint: use the fact that y_i^2 = 1.

(d) Show that the loss function for logistic regression is

    L_{LR}(z_i) ∝ (1 / log 2) · log(1 + e^{-z_i}).    (8.35)

Hint: use (8.3).

(e) Plot the above loss functions as a function of z_i (a plotting sketch is given after Problem 8.6). Intuitively, which loss functions should be better for learning a good linear classifier? Why?

.........

Optimization

Problem 8.6 Newton-Raphson Method

The Newton-Raphson method (also called Newton's method) is an iterative scheme for finding a root (or zero) of a function f(x), i.e., an x* such that f(x*) = 0.

(a) Given an initial point x^(0), the Newton-Raphson method constructs a tangent to f(x) at the current point x^(i) and then sets the next point x^(i+1) to the zero of the tangent function.

[Figure: one Newton-Raphson step, showing f(x), its tangent at x^(i), and the next iterate x^(i+1).]

Show that this scheme yields the iteration

    x^(i+1) = x^(i) - f(x^(i)) / f'(x^(i)),    (8.36)

where f'(x) is the first derivative of f(x).

(b) For well-behaved functions, the Newton-Raphson method converges quadratically to the zero if it is started "sufficiently" close to the true x*. However, the iterations are not guaranteed to converge, and could oscillate in an infinite cycle. Consider the function f(x) = x^3 - 2x + 2. Plot the function, and show that when starting at x = 0, the Newton-Raphson iterations never converge.

(c) The Newton-Raphson method can also be used to find an optimal point of a function. Given a function g(x), show that an optimal point (or stationary point) can be found by the iterations

    x^(i+1) = x^(i) - g'(x^(i)) / g''(x^(i)),    (8.37)

where g'(x) and g''(x) are the first and second derivatives of g(x). How is the iteration in (8.37) similar to gradient ascent? Does it need to be modified to do gradient descent?

(d) Show that the iteration in (8.37) is equivalent to successively optimizing a quadratic approximation of g(x).

.........
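For Problem 8.5(e), a minimal matplotlib sketch (not part of the original problem set; the variable names are mine) that plots the four losses (8.32)-(8.35) against z_i. Note that (8.35) is scaled by 1/log 2 so that it equals 1 at z_i = 0, like the 0-1 loss.

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 601)

losses = {
    "0-1 loss (8.32)":        (z < 0).astype(float),
    "perceptron (8.33)":      np.maximum(0.0, -z),
    "least-squares (8.34)":   (z - 1.0) ** 2,
    "scaled logistic (8.35)": np.log1p(np.exp(-z)) / np.log(2.0),
}

for label, L in losses.items():
    plt.plot(z, L, label=label)
plt.xlabel("z_i = y_i w^T x_i")
plt.ylabel("loss")
plt.ylim(0, 4)
plt.legend()
plt.show()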
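And for Problem 8.6, a minimal sketch of the root-finding iteration (8.36); the helper name and the converging example f(x) = x^2 - 2 are mine, while the cycling example is the one from part (b).

def newton_raphson(f, fprime, x0, n_iters=8):
    """Return the sequence of Newton-Raphson iterates x <- x - f(x)/f'(x), eq. (8.36)."""
    xs = [x0]
    for _ in range(n_iters):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs

# converges quickly to sqrt(2) when started reasonably close
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.5))

# Problem 8.6(b): f(x) = x^3 - 2x + 2 started at x = 0 cycles 0, 1, 0, 1, ... forever
print(newton_raphson(lambda x: x**3 - 2 * x + 2, lambda x: 3 * x**2 - 2, x0=0.0))

Replacing f with g' and f' with g'' gives the optimization form in (8.37).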