Logistic regression

Daniel Hsu (COMS 4771)

The logistic regression model
Logistic regression is a model for binary classification data with feature vectors in Rd and labels in {−1,+1}.
Data (X1, Y1), . . . , (Xn, Yn) are treated as iid random variables taking values in Rd×{−1,+1}, and for each
x ∈ Rd,

Yi | Xi = x ∼ Bern(σ(xᵀw)), i.e., Pr(Yi = +1 | Xi = x) = σ(xᵀw),
where σ(t) = 1/(1 + exp(−t)) is the sigmoid function. Here, w ∈ Rd is the parameter of the model, and it is
not involved in the marginal distribution of Xi (which we leave unspecified).
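
For intuition, here is a minimal sketch (not part of the original notes) of sampling labels from this model with numpy; the dimension, sample size, parameter vector, and Gaussian marginal for Xi are arbitrary choices made only for the example.

    import numpy as np

    def sigmoid(t):
        # sigma(t) = 1 / (1 + exp(-t))
        return 1.0 / (1.0 + np.exp(-t))

    rng = np.random.default_rng(0)

    d, n = 3, 5                      # arbitrary dimension and sample size
    w = np.array([1.0, -2.0, 0.5])   # hypothetical parameter vector
    X = rng.normal(size=(n, d))      # marginal of X is unspecified in the model; Gaussian here

    # Pr(Y_i = +1 | X_i = x) = sigma(x^T w); labels take values in {-1, +1}
    p_plus = sigmoid(X @ w)
    Y = np.where(rng.random(n) < p_plus, +1, -1)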

Maximum likelihood
The log-likelihood of w given (x1, y1), . . . , (xn, yn) ∈ Rd × {−1,+1} is


−∑_{i=1}^n ln(1 + exp(−yixiᵀw)) + (terms that do not involve w).

There is no closed-form expression for the maximizer of the log-likelihood. Nevertheless, we can approximately
minimize the negative log-likelihood with gradient descent.
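
To make the objective concrete, the following sketch (our own, not from the notes) computes the negative log-likelihood up to the terms that do not involve w; np.logaddexp(0, −z) evaluates ln(1 + exp(−z)) in a numerically stable way, and the function name and array layout are just choices for this example.

    import numpy as np

    def negative_log_likelihood(w, X, y):
        # X: (n, d) array whose rows are the feature vectors x_i^T
        # y: (n,) array of labels in {-1, +1}
        # Returns sum_i ln(1 + exp(-y_i x_i^T w)), i.e., the negative
        # log-likelihood up to terms that do not involve w.
        margins = y * (X @ w)
        return np.sum(np.logaddexp(0.0, -margins))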

Empirical risk minimization
Maximum likelihood is very different from finding the linear classifier of smallest empirical zero-one loss risk.
Finding the empirical zero-one loss risk minimizer is computationally intractable in general.

Finding a linear separator
There are special cases when finding the empirical zero-one loss risk minimizer is computationally tractable.
One is when the training data is linearly separable, i.e., when there exists w⋆ ∈ Rd such that

yixiᵀw⋆ > 0 for all i = 1, . . . , n.
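
In code, this condition is easy to state; the sketch below (the function name is ours) returns True exactly when w is a linear separator for the data, assuming X has rows xiᵀ and y holds the ±1 labels.

    import numpy as np

    def is_linear_separator(w, X, y):
        # True iff y_i x_i^T w > 0 for all i = 1, ..., n.
        return bool(np.all(y * (X @ w) > 0))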

Claim. Define L(w) := ∑_{i=1}^n ln(1 + exp(−yixiᵀw)). Suppose (x1, y1), . . . , (xn, yn) ∈ Rd × {−1,+1} is linearly separable. Then any ŵ ∈ Rd with

L(ŵ) < inf_{w∈Rd} L(w) + ln(2)

is a linear separator.

Proof. We first observe that the infimum (i.e., greatest lower bound; see https://en.wikipedia.org/wiki/Infimum_and_supremum) of L is zero. Let w⋆ ∈ Rd be a linear separator, so si := yixiᵀw⋆ > 0 for all i = 1, . . . , n. For any r > 0,

L(rw⋆) = ∑_{i=1}^n ln(1 + exp(−rsi)),


and therefore

lim_{r→∞} ∑_{i=1}^n ln(1 + exp(−rsi)) = 0.

Every term ln(1 + exp(−yixiᵀw)) in L(w) is positive, so L(w) > 0 for every w ∈ Rd. Combined with the limit above, we conclude that

inf_{w∈Rd} L(w) = 0.

So now we just have to show that any ŵ ∈ Rd with

L(ŵ) < ln(2)

is a linear separator. So let ŵ satisfy L(ŵ) < ln(2). Since every term in the sum defining L(ŵ) is positive, this implies ln(1 + exp(−yixiᵀŵ)) < ln(2) for every i = 1, . . . , n. Exponentiating both sides gives 1 + exp(−yixiᵀŵ) < 2. Now subtracting 1 from both sides and taking logarithms gives −yixiᵀŵ < 0, i.e., yixiᵀŵ > 0. This means that ŵ correctly classifies (xi, yi). Since this holds for all i = 1, . . . , n, it follows that ŵ is a linear separator.

Surrogate loss

Even if (x1, y1), . . . , (xn, yn) ∈ Rd × {−1,+1} is not linearly separable, approximately maximizing the log-likelihood can yield a good linear classifier. This is because maximizing the log-likelihood is the same as minimizing the empirical logistic loss risk

R̂(w) := (1/n) ∑_{i=1}^n ℓlog(yixiᵀw),

where ℓlog(z) := −ln σ(z) is the logistic loss. The logistic loss (up to scaling) turns out to be an upper bound on the zero-one loss:

ℓzo(z) ≤ (1/ln 2) ℓlog(z),

where ℓzo(z) = 1{z ≤ 0}. So if the empirical logistic loss risk is small, then the empirical zero-one loss risk is also small.
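
As a quick sanity check of this bound (our own illustration, not part of the notes), the sketch below evaluates both losses on a grid of margins z and confirms ℓzo(z) ≤ (1/ln 2) ℓlog(z) pointwise.

    import numpy as np

    def logistic_loss(z):
        # l_log(z) = -ln(sigma(z)) = ln(1 + exp(-z)), computed stably
        return np.logaddexp(0.0, -z)

    def zero_one_loss(z):
        # l_zo(z) = 1{z <= 0}
        return (z <= 0).astype(float)

    z = np.linspace(-5.0, 5.0, 1001)
    bound = logistic_loss(z) / np.log(2)
    # small tolerance only to guard against floating-point rounding at z = 0
    assert np.all(zero_one_loss(z) <= bound + 1e-12)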

Gradient descent for logistic regression

The derivative of ℓlog is given by

dℓlog(z)/dz = −(1/σ(z)) · dσ(z)/dz = −(1/σ(z)) · σ(z) · σ(−z) = −σ(−z).

Therefore, by linearity and the chain rule, the negative gradient of R̂ with respect to w is

−∇R̂(w) = −(1/n) ∑_{i=1}^n ∇ℓlog(yixiᵀw)
        = −(1/n) ∑_{i=1}^n (dℓlog(z)/dz)|_{z = yixiᵀw} · ∇(yixiᵀw)
        = (1/n) ∑_{i=1}^n σ(−yixiᵀw) · yixi.

Now suppose A = [x1 | · · · | xn]ᵀ ∈ Rn×d and b = [y1 | · · · | yn]ᵀ ∈ Rn. (Notice that we have omitted the 1/√n scaling that we had for least squares linear regression.) Then the negative gradient of R̂ can be written as

−∇R̂(w) = (1/n) Aᵀ(b ⊙ σ(−b ⊙ (Aw))),

where u ⊙ v ∈ Rn is the coordinate-wise product of vectors u ∈ Rn and v ∈ Rn, and σ(v) ∈ Rn is the coordinate-wise application of the sigmoid function to v ∈ Rn.

Gradient descent for logistic regression begins with an initial weight vector w(0) ∈ Rd, and then iteratively updates it by subtracting a positive multiple η > 0 of the gradient at the current iterate:

w(t) := w(t−1) − η∇R̂(w(t−1)).
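
Putting the pieces together, here is a minimal sketch of this procedure using the matrix form of the negative gradient; the zero initialization, step size η, and iteration count are arbitrary choices for the example and are not prescribed by the notes.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def negative_gradient(w, A, b):
        # -grad R_hat(w) = (1/n) A^T (b * sigma(-b * (A w))), with * coordinate-wise
        n = A.shape[0]
        return A.T @ (b * sigmoid(-b * (A @ w))) / n

    def gradient_descent(A, b, eta=0.1, num_iters=1000):
        # A: (n, d) matrix with rows x_i^T; b: (n,) vector of labels y_i in {-1, +1}
        w = np.zeros(A.shape[1])    # w^(0) = 0, an arbitrary initialization
        for _ in range(num_iters):
            # w^(t) = w^(t-1) - eta * grad R_hat(w^(t-1))
            w = w + eta * negative_gradient(w, A, b)
        return w

In practice the step size η would be tuned, e.g., by trying a few values and keeping the one that most decreases the empirical logistic loss risk R̂.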
