Logistic regression
Daniel Hsu (COMS 4771)
The logistic regression model
Logistic regression is a model for binary classification data with feature vectors in R^d and labels in {−1, +1}. Data (X_1, Y_1), …, (X_n, Y_n) are treated as iid random variables taking values in R^d × {−1, +1}, and for each x ∈ R^d,

Y_i | X_i = x ∼ Bern(σ(x^T w)), i.e., P(Y_i = +1 | X_i = x) = σ(x^T w),

where σ(t) = 1/(1 + exp(−t)) is the sigmoid function. Here, w ∈ R^d is the parameter of the model, and it is not involved in the marginal distribution of X_i (which we leave unspecified).
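For concreteness, here is a minimal sketch (not part of the original notes) of sampling labels from this model; numpy, the particular parameter vector w, and the Gaussian feature distribution are all made-up choices for illustration, since the marginal distribution of X_i is left unspecified.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)

d, n = 3, 5
w = np.array([1.0, -2.0, 0.5])   # hypothetical model parameter
X = rng.normal(size=(n, d))      # hypothetical feature vectors (marginal left unspecified in the notes)

# P(Y_i = +1 | X_i = x_i) = sigma(x_i^T w); labels take values in {-1, +1}.
p_plus = sigmoid(X @ w)
Y = np.where(rng.uniform(size=n) < p_plus, +1, -1)
```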
Maximum likelihood

The log-likelihood of w given (x_1, y_1), …, (x_n, y_n) ∈ R^d × {−1, +1} is

∑_{i=1}^n −ln(1 + exp(−y_i x_i^T w)) + (terms that do not involve w).
There is no closed-form expression for the maximizer of the log-likelihood. Nevertheless, we can approximately minimize the negative log-likelihood with gradient descent.
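As a small illustration, the negative log-likelihood (dropping the terms that do not involve w) can be computed as follows. This is a sketch assuming numpy, with X an n×d array of feature vectors and Y the corresponding labels in {−1, +1}.

```python
import numpy as np

def negative_log_likelihood(w, X, Y):
    """Sum over i of ln(1 + exp(-y_i * x_i^T w)), for labels Y in {-1, +1}."""
    margins = Y * (X @ w)
    # np.logaddexp(0, -m) computes ln(1 + exp(-m)) in a numerically stable way.
    return np.sum(np.logaddexp(0.0, -margins))
```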
Empirical risk minimization
Maximum likelihood is very different from finding the linear classifier of smallest empirical zero-one loss risk. Finding the empirical zero-one loss risk minimizer is computationally intractable in general.
Finding a linear separator
There are special cases when finding the empirical zero-one loss risk minimizer is computationally tractable. One is when the training data is linearly separable: i.e., when there exists w⋆ ∈ R^d such that

y_i x_i^T w⋆ > 0 for all i = 1, …, n.
Claim. Define L(w) := ∑_{i=1}^n ln(1 + exp(−y_i x_i^T w)). Suppose (x_1, y_1), …, (x_n, y_n) ∈ R^d × {−1, +1} is linearly separable. Then any ŵ ∈ R^d with

L(ŵ) < inf_{w∈R^d} L(w) + ln(2)

is a linear separator.
Proof. We first observe that the infimum (i.e., greatest lower bound; see https://en.wikipedia.org/wiki/Infimum_and_supremum) of L is zero. Let w⋆ ∈ R^d be a linear separator, so s_i := y_i x_i^T w⋆ > 0 for all i = 1, …, n. For any r > 0,

L(r w⋆) = ∑_{i=1}^n ln(1 + exp(−r s_i)),

and therefore

lim_{r→∞} L(r w⋆) = lim_{r→∞} ∑_{i=1}^n ln(1 + exp(−r s_i)) = 0.

Every term ln(1 + exp(−y_i x_i^T w)) in L(w) is positive, so L(w) > 0 for all w. Therefore, we conclude that inf_{w∈R^d} L(w) = 0.
So now we just have to show that any ŵ ∈ R^d with

L(ŵ) < ln(2)

is a linear separator. Let ŵ satisfy L(ŵ) < ln(2). Since every term in L(ŵ) is positive, each term is at most the whole sum, and hence

ln(1 + exp(−y_i x_i^T ŵ)) < ln(2)

for every i = 1, …, n. Exponentiating both sides gives

1 + exp(−y_i x_i^T ŵ) < 2.

Now subtracting 1 from both sides and taking logarithms gives

−y_i x_i^T ŵ < 0.

This means that ŵ correctly classifies (x_i, y_i). Since this holds for all i = 1, …, n, it follows that ŵ is a linear separator.
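As a quick numerical illustration of the claim and of the scaling argument in the proof, consider the following sketch with a made-up linearly separable dataset (numpy assumed): scaling a separator w⋆ by a growing factor r drives L(r w⋆) toward zero, and any vector with L below ln(2) indeed separates the data.

```python
import numpy as np

# Made-up linearly separable data in R^2: the label is the sign of the first coordinate.
X = np.array([[ 2.0,  1.0],
              [ 1.0, -1.0],
              [-1.5,  0.5],
              [-0.5, -2.0]])
Y = np.array([+1, +1, -1, -1])

w_star = np.array([1.0, 0.0])   # a linear separator: y_i * x_i^T w_star > 0 for all i

def L(w):
    """L(w) = sum_i ln(1 + exp(-y_i x_i^T w))."""
    return np.sum(np.logaddexp(0.0, -Y * (X @ w)))

for r in [1.0, 5.0, 20.0]:
    w = r * w_star
    separates = np.all(Y * (X @ w) > 0)
    print(f"r = {r:5.1f}: L(r*w_star) = {L(w):.6f}, "
          f"below ln(2)? {L(w) < np.log(2)}, separator? {separates}")
```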
Surrogate loss
Even if (x_1, y_1), …, (x_n, y_n) ∈ R^d × {−1, +1} is not linearly separable, approximately maximizing the log-likelihood can yield a good linear classifier. This is because maximizing the log-likelihood (equivalently, minimizing L) is the same as minimizing the empirical logistic loss risk:
R(w) := (1/n) ∑_{i=1}^n ℓ_log(y_i x_i^T w),

where

ℓ_log(z) := −ln σ(z)

is the logistic loss. The logistic loss (up to scaling) turns out to be an upper bound on the zero-one loss:

ℓ_zo(z) ≤ (1/ln 2) · ℓ_log(z),

where ℓ_zo(z) = 1{z ≤ 0}. So if the empirical logistic loss risk is small, then the empirical zero-one loss risk is also small.
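A quick numerical check of this bound (a sketch assuming numpy): evaluate both losses on a grid of margins z and confirm that ℓ_zo(z) ≤ ℓ_log(z)/ln 2 throughout.

```python
import numpy as np

def logistic_loss(z):
    """l_log(z) = -ln(sigma(z)) = ln(1 + exp(-z))."""
    return np.logaddexp(0.0, -z)

def zero_one_loss(z):
    """l_zo(z) = 1{z <= 0}."""
    return (z <= 0).astype(float)

z = np.linspace(-10, 10, 2001)
# The bound holds with equality at z = 0 and strictly elsewhere.
assert np.all(zero_one_loss(z) <= logistic_loss(z) / np.log(2) + 1e-12)
```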
Gradient descent for logistic regression

The derivative of ℓ_log is given by

dℓ_log(z)/dz = −(1/σ(z)) · dσ(z)/dz
             = −(1/σ(z)) · σ(z) · σ(−z)
             = −σ(−z).
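This identity can be sanity-checked numerically, for example with a central finite difference; the following is a sketch assuming numpy.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(z):
    return np.logaddexp(0.0, -z)

z = np.linspace(-5, 5, 101)
h = 1e-6
# Central finite-difference approximation of d l_log(z) / dz.
finite_diff = (logistic_loss(z + h) - logistic_loss(z - h)) / (2 * h)
assert np.allclose(finite_diff, -sigmoid(-z), atol=1e-5)
```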
Therefore, by linearity and the chain rule, the negative gradient of R with respect to w is

−∇R(w) = −(1/n) ∑_{i=1}^n ∇ℓ_log(y_i x_i^T w)
       = −(1/n) ∑_{i=1}^n dℓ_log(z)/dz |_{z = y_i x_i^T w} · ∇(y_i x_i^T w)
       = (1/n) ∑_{i=1}^n σ(−y_i x_i^T w) · y_i x_i.

Now suppose A = [x_1 | ⋯ | x_n]^T ∈ R^{n×d} and b = [y_1 | ⋯ | y_n]^T ∈ R^n. (Notice that we have omitted the 1/√n scaling that we had for least squares linear regression.) Then the negative gradient of R can be written as

−∇R(w) = (1/n) A^T (b ⊙ σ(−b ⊙ (Aw))),
where u ⊙ v ∈ Rn is the coordinate-wise product of vectors u ∈ Rn and v ∈ Rn, and σ(v) ∈ Rn is the coordinate-wise application of the sigmoid function to v ∈ Rn.
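The following sketch (assuming numpy, with made-up A, b, and w) computes the negative gradient both as the per-example sum and in the vectorized form above; the two agree up to floating-point error.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_gradient_sum(w, A, b):
    """-grad R(w) = (1/n) * sum_i sigma(-y_i x_i^T w) * y_i * x_i."""
    n = A.shape[0]
    return sum(sigmoid(-b[i] * (A[i] @ w)) * b[i] * A[i] for i in range(n)) / n

def neg_gradient_vec(w, A, b):
    """-grad R(w) = (1/n) * A^T (b ⊙ sigma(-b ⊙ (A w)))."""
    n = A.shape[0]
    return A.T @ (b * sigmoid(-b * (A @ w))) / n

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))           # hypothetical feature matrix, one row per example
b = rng.choice([-1.0, 1.0], size=6)   # hypothetical labels in {-1, +1}
w = rng.normal(size=3)
assert np.allclose(neg_gradient_sum(w, A, b), neg_gradient_vec(w, A, b))
```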
Gradient descent for logistic regression begins with an initial weight vector w^(0) ∈ R^d, and then iteratively updates it by subtracting a positive multiple η > 0 of the gradient at the current iterate:

w^(t) := w^(t−1) − η ∇R(w^(t−1)).
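A minimal sketch of this loop (assuming numpy; the zero initialization, step size η = 0.1, and iteration count are made-up choices, not prescribed by the notes):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_descent_logistic(A, b, eta=0.1, num_iters=1000):
    """Gradient descent on the empirical logistic loss risk R(w).

    A is the n-by-d matrix of feature vectors (one per row), and b is the
    vector of labels in {-1, +1}.
    """
    n, d = A.shape
    w = np.zeros(d)  # initial weight vector w^(0)
    for _ in range(num_iters):
        # Negative gradient: (1/n) * A^T (b ⊙ sigma(-b ⊙ (A w))).
        neg_grad = A.T @ (b * sigmoid(-b * (A @ w))) / n
        # Update: w^(t) := w^(t-1) - eta * grad R(w^(t-1)).
        w = w + eta * neg_grad
    return w
```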