CS5487 Problem Set 8
Linear Classifiers
Department of Computer Science
City University of Hong Kong
Logistic Regression
Problem 8.1 Logistic sigmoid
Let $\sigma(a)$ be the logistic sigmoid function,
$\sigma(a) = \frac{1}{1 + e^{-a}}$.    (8.1)
Let's derive some useful properties:
(a) Show that the derivative of the sigmoid is
$\frac{d\sigma(a)}{da} = \sigma(a)(1 - \sigma(a))$.    (8.2)
(b) Show that
$1 - \sigma(f) = \sigma(-f)$.    (8.3)
(c) Show that the inverse of $\sigma(a)$ is
$\sigma^{-1}(a) = \log\frac{a}{1-a}$.    (8.4)
This is called the logit function, or log odds.
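These identities are easy to sanity-check numerically. A minimal sketch, assuming NumPy (the finite-difference comparison is only used to check (8.2)):

```python
import numpy as np

def sigmoid(a):
    # logistic sigmoid, eq. (8.1)
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
eps = 1e-6

# (8.2): central finite-difference derivative vs. sigma(a) * (1 - sigma(a))
deriv_fd = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(deriv_fd, sigmoid(a) * (1 - sigmoid(a)), atol=1e-6)

# (8.3): 1 - sigma(a) = sigma(-a)
assert np.allclose(1 - sigmoid(a), sigmoid(-a))

# (8.4): the logit (log odds) inverts the sigmoid
p = sigmoid(a)
assert np.allclose(np.log(p / (1 - p)), a)
```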
. . . . . . . . .
Problem 8.2 Logistic Regression: MLE and IRLS
Consider the two-class logistic regression problem. Given the training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the input and $y_i \in \{0, 1\}$ is the corresponding class, define the conditional probability of the output class given the input,
$p(y_i|\mathbf{x}_i, \mathbf{w}) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$,    (8.5)
where $\pi_i = \sigma(\mathbf{w}^T\mathbf{x}_i)$ is the conditional probability that $\mathbf{x}_i$ belongs to class 1, and $\sigma(a)$ is the logistic sigmoid function,
$\sigma(a) = \frac{1}{1 + e^{-a}}$.    (8.6)
In this problem, we will derive the maximum likelihood estimate of $\mathbf{w}$ from the training set $\mathcal{D}$,
$\mathbf{w}^* = \operatorname*{argmax}_{\mathbf{w}} \ell(\mathbf{w}), \qquad \ell(\mathbf{w}) = \sum_{i=1}^n \log p(y_i|\mathbf{x}_i, \mathbf{w})$,    (8.7)
or equivalently
$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} E(\mathbf{w})$,    (8.8)
$E(\mathbf{w}) = -\sum_{i=1}^n \left\{ y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i) \right\}$.    (8.9)
Define $X = [\mathbf{x}_1, \cdots, \mathbf{x}_n]$, $\mathbf{y} = [y_1, \cdots, y_n]^T$, and $\boldsymbol{\pi} = [\pi_1, \cdots, \pi_n]^T$.
(a) Show that the gradient of $E(\mathbf{w})$ is
$\nabla E(\mathbf{w}) = X(\boldsymbol{\pi} - \mathbf{y})$.    (8.10)
Hint: use (8.2).
(b) Show that the Hessian of $E(\mathbf{w})$ is
$\nabla^2 E(\mathbf{w}) = XRX^T$,    (8.11)
where $R = \operatorname{diag}(\pi_1(1 - \pi_1), \cdots, \pi_n(1 - \pi_n))$.
(c) Show that $\nabla^2 E(\mathbf{w})$ is strictly positive definite, and hence $E(\mathbf{w})$ is convex and has a unique minimum.
(d) $E(\mathbf{w})$ can be minimized using the Newton-Raphson method (see Problem 8.6), which iteratively updates $\mathbf{w}$,
$\mathbf{w}^{(\mathrm{new})} = \mathbf{w}^{(\mathrm{old})} - [\nabla^2 E(\mathbf{w})]^{-1}\nabla E(\mathbf{w})$.    (8.12)
Show that the update step for logistic regression is
$\mathbf{w}^{(\mathrm{new})} = (XRX^T)^{-1}XR\mathbf{z}$,    (8.13)
where $R$ and $\mathbf{z}$ are calculated from the previous $\mathbf{w}^{(\mathrm{old})}$,
$R = \operatorname{diag}(\pi_1(1 - \pi_1), \cdots, \pi_n(1 - \pi_n))$,    (8.14)
$\mathbf{z} = X^T\mathbf{w}^{(\mathrm{old})} - R^{-1}(\boldsymbol{\pi} - \mathbf{y})$.    (8.15)
The formula in (8.13) is the same as that of weighted least-squares, with weighting R and target
z. The weighting and target change in each iteration, since they depend on the current estimate of
w. Hence this algorithm is called iterative reweighted least-squares (IRLS or IRWLS).
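A minimal IRLS sketch following (8.13)-(8.15), assuming NumPy, with `X` stored as a $d \times n$ matrix whose columns are the data points (a linear solve replaces the explicit inverse; there are no safeguards for saturated $\pi_i$, so this is only a sketch):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(X, y, n_iters=20):
    """X: d x n matrix (columns are inputs), y: length-n vector of {0,1} labels."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        pi = sigmoid(X.T @ w)              # pi_i = sigma(w^T x_i)
        r = pi * (1 - pi)                  # diagonal of R, eq. (8.14)
        z = X.T @ w - (pi - y) / r         # working targets, eq. (8.15)
        # weighted least-squares step, eq. (8.13): w = (X R X^T)^{-1} X R z
        w = np.linalg.solve((X * r) @ X.T, X @ (r * z))
    return w
```

Each iteration is exactly a weighted least-squares solve with weights `r` and targets `z`, both recomputed from the current estimate of `w`.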
. . . . . . . . .
Problem 8.3 Logistic Regression: Overfitting and singularities
In this problem, we will consider how logistic regression (Problem 8.2) can overfit to a training set
when the points are linearly separable.
(a) Show that for a linearly separable training set, the ML solution for logistic regression is obtained
by finding a $\mathbf{w}$ whose decision boundary $\mathbf{w}^T\mathbf{x} = 0$ separates the classes, and then taking the
magnitude of w to infinity.
(b) In this case, what happens to the shape of the sigmoid function, and the resulting estimates of
the posterior probabilities p(y|x)? How is this a case of overfitting?
This singularity problem can be fixed by including a prior and finding the MAP solution, as considered in Problem 8.4. The prior term effectively adds a penalty $\|\mathbf{w}\|^2$ on the magnitude of $\mathbf{w}$, thus preventing it from growing to infinity.
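The effect described in (a) and (b) is easy to see numerically: for a separable training set, scaling any separating $\mathbf{w}$ by a factor $c$ and increasing $c$ keeps decreasing $E(\mathbf{w})$ toward 0, while the posterior estimates $\pi_i$ saturate at 0 or 1. A minimal sketch, assuming NumPy and a hypothetical toy dataset:

```python
import numpy as np

# hypothetical 1-d separable data; a constant feature is appended to act as the bias
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.vstack([x, np.ones_like(x)])          # d x n, columns are points
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 0.0])                     # any w that separates the classes

for c in [1.0, 10.0, 100.0]:
    a = X.T @ (c * w)                        # f(x_i) = c * w^T x_i
    pi = np.exp(-np.logaddexp(0, -a))        # sigma(a), computed stably
    E = np.sum(np.logaddexp(0, a) - y * a)   # numerically stable form of E(w), eq. (8.9)
    print(c, E, pi.round(3))
# E(c*w) -> 0 as c grows, so the ML solution pushes ||w|| to infinity,
# and the posteriors pi_i become hard 0/1 values.
```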
. . . . . . . . .
Problem 8.4 Regularized Logistic Regression: MAP framework
In this problem, we will consider MAP estimation for logistic regression. This is also known as
regularized logistic regression.
Assume a prior distribution on $\mathbf{w}$ that is zero-mean Gaussian with known precision matrix $\Gamma$ (i.e., the inverse of the covariance matrix),
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \Gamma^{-1})$.    (8.16)
Given the training set $\mathcal{D}$, the MAP estimate is
$\mathbf{w}^* = \operatorname*{argmax}_{\mathbf{w}} \; \log p(\mathbf{y}|X, \mathbf{w}) + \log p(\mathbf{w})$.    (8.17)
(a) Show that (8.17) is equivalent to the minimization problem,
$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \hat{E}(\mathbf{w})$,    (8.18)
$\hat{E}(\mathbf{w}) = E(\mathbf{w}) + \tfrac{1}{2}\mathbf{w}^T\Gamma\mathbf{w}$,    (8.19)
with $E(\mathbf{w})$ defined in (8.9).
(b) Show that the gradient of $\hat{E}(\mathbf{w})$ is given by
$\nabla\hat{E}(\mathbf{w}) = X(\boldsymbol{\pi} - \mathbf{y}) + \Gamma\mathbf{w}$.    (8.20)
(c) Show that the Hessian of $\hat{E}(\mathbf{w})$ is given by
$\nabla^2\hat{E}(\mathbf{w}) = XRX^T + \Gamma$,    (8.21)
with $R = \operatorname{diag}(\pi_1(1 - \pi_1), \cdots, \pi_n(1 - \pi_n))$ as before.
(d) Show that the Newton-Raphson iteration (see Problem 8.2d and Problem 8.6) for minimizing $\hat{E}(\mathbf{w})$ is
$\mathbf{w}^{(\mathrm{new})} = (XRX^T + \Gamma)^{-1}XR\mathbf{z}$,    (8.22)
where $R$ and $\mathbf{z}$ are calculated from the previous $\mathbf{w}^{(\mathrm{old})}$,
$R = \operatorname{diag}(\pi_1(1 - \pi_1), \cdots, \pi_n(1 - \pi_n))$,    (8.23)
$\mathbf{z} = X^T\mathbf{w}^{(\mathrm{old})} - R^{-1}(\boldsymbol{\pi} - \mathbf{y})$.    (8.24)
(A minimal implementation sketch of this update is given after part (f).)
(e) How does the precision $\Gamma$ help to smooth (regularize) the estimate of $\mathbf{w}$?
(f) Consider the case where we include the bias term in $\mathbf{w}$ and $\mathbf{x}$, i.e.,
$\mathbf{w} = [\tilde{\mathbf{w}}^T, \tilde{b}]^T, \quad \mathbf{x} = [\tilde{\mathbf{x}}^T, 1]^T$,    (8.25)
where $\tilde{\mathbf{x}}$ is the original input feature vector. The linear discriminant function now contains a bias term $\tilde{b}$,
$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}} + \tilde{b}$.    (8.26)
For the regularization, consider the two precision matrices
$\Gamma_1 = \lambda I$,    (8.27)
$\Gamma_2 = \operatorname{diag}(\lambda, \cdots, \lambda, 0)$.    (8.28)
Show that $\Gamma_1$ applies a penalty based on the magnitudes of both $\tilde{\mathbf{w}}$ and $\tilde{b}$, while $\Gamma_2$ applies a penalty only on the magnitude of $\tilde{\mathbf{w}}$. In general, should we prefer $\Gamma_1$ or $\Gamma_2$? Why? (Hint: consider translations of the training set in the input space.)
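As referenced in part (d), here is a minimal sketch of the regularized Newton/IRLS update (8.22)-(8.24), assuming NumPy, with `X` a $d \times n$ matrix of column data points and `Gamma` the $d \times d$ prior precision (the variable names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic_map(X, y, Gamma, n_iters=20):
    """MAP (regularized) logistic regression via the update in eq. (8.22)."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        pi = sigmoid(X.T @ w)
        r = pi * (1 - pi)                       # diagonal of R, eq. (8.23)
        z = X.T @ w - (pi - y) / r              # eq. (8.24)
        # eq. (8.22): w = (X R X^T + Gamma)^{-1} X R z
        w = np.linalg.solve((X * r) @ X.T + Gamma, X @ (r * z))
    return w

# The two precisions from part (f), with lam a hypothetical regularization weight
# and the bias stored as the last entry of w:
#   Gamma1 = lam * np.eye(d)                        # penalizes both w-tilde and b-tilde
#   Gamma2 = np.diag([lam] * (d - 1) + [0.0])       # leaves the bias unpenalized
```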
. . . . . . . . .
Empirical Risk Minimization
Problem 8.5 Loss functions
All the linear classification algorithms that we saw in lecture are optimization problems over some
error function on the training set. Where they differ is in how they define this error function.
Consider the 2-class problem, with $y \in \{+1, -1\}$, input $\mathbf{x} \in \mathbb{R}^d$, and training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$. Define the empirical risk over the training set as
$R_{\mathrm{emp}}(\mathbf{w}) = \sum_{i=1}^n L(y_i, f(\mathbf{x}_i))$,    (8.29)
where $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ is the linear discriminant function and $L(\cdot)$ is a loss function. The optimal separating hyperplane is obtained by minimizing the empirical risk,
$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} R_{\mathrm{emp}}(\mathbf{w})$.    (8.30)
This is called empirical risk minimization.
In this problem, we will consider the classifiers from lecture within this framework. Define the quantity
$z_i = y_i f(\mathbf{x}_i) = y_i \mathbf{w}^T\mathbf{x}_i$.    (8.31)
Recall that $z_i > 0$ when the training point $\mathbf{x}_i$ is correctly classified, and $z_i$
< 0 when the point is misclassified. (a) Show that the loss function for minimizing the number of misclassified training points is i.e., the 0-1 loss function. (b) Show that the loss-function for the perceptron is ) = max(0,�z (c) Show that the loss-function for least-squares classification is � 1)2. (8.34) Hint: use the fact that y2 (d) Show that the loss-function for logistic regression is log(1 + e�zi). (8.35) Hint: use (8.3). (e) Plot the above loss functions as a function of z . Intuitively, which loss functions should be better for learning a good linear classifier? Why? . . . . . . . . . Optimization Problem 8.6 Newton-Raphson Method The Newton-Raphson method (also called Newton’s method) is an iterative scheme for finding a root (or zero) of a function f(x), i.e., an x⇤ such that f(x⇤) = 0. (a) Given an initial point x(0), the Newton-Raphson method constructs a tangent to f(x) at the current point x(i) and then sets the next x(i+1) to the zero of the tangent function. 0 0.5 1 1.5 2 Show that this scheme yields the iteration, x(i+1) = x(i) � where f 0(x) is the first derivative of f(x). (b) For well-behaved function, the Newton-Raphson method converges quadratically to the zero, if it is started “su�ciently” close to the true x⇤. However, the iterations are not guaranteed to converge and could oscillate in an infinite cycle. Consider the function f(x) = x3 � 2x+ 2. Plot the function, and show that when starting at x = 0, the Newton-Raphson iterations never (c) The Newton-Raphson method can also be used to find an optimal point in a function. Given a function g(x), show that an optimal point (or stationary point) can be found by the iterations, x(i+1) = x(i) � where g0(x) and g00(x) are the first and second derivatives of g(x). How is the iteration in (8.37) similar to gradient ascent? Does it need to be modified to do gradient descent? (d) Show that the iteration in (8.37) is equivalent to successively optimizing a quadratic approxi- mation of g(x). . . . . . . . . . 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com
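A quick numerical check of this sign convention, assuming NumPy, with a hypothetical toy dataset and weight vector:

```python
import numpy as np

# hypothetical points (columns), labels in {+1, -1}, and some weight vector
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [1.0, 1.0,  1.0,  1.0]])
y = np.array([+1, +1, -1, +1])      # the last point is deliberately misclassified by w
w = np.array([1.0, 0.0])

z = y * (X.T @ w)                   # z_i = y_i w^T x_i, eq. (8.31)
print(z)                            # [ 1.  2.  1. -2.]: positive = correct, negative = misclassified
```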