
Logistic regression
Daniel Hsu (COMS 4771)
The logistic regression model
Logistic regression is a model for binary classification data with feature vectors in $\mathbb{R}^d$ and labels in $\{-1,+1\}$. Data $(X_1, Y_1), \dots, (X_n, Y_n)$ are treated as iid random variables taking values in $\mathbb{R}^d \times \{-1,+1\}$, and for each $x \in \mathbb{R}^d$,

$$Y_i \mid X_i = x \sim \mathrm{Bern}(\sigma(x^\top w)),$$

where $\sigma(t) = 1/(1 + \exp(-t))$ is the sigmoid function; that is, $\Pr(Y_i = +1 \mid X_i = x) = \sigma(x^\top w)$ and $\Pr(Y_i = -1 \mid X_i = x) = 1 - \sigma(x^\top w) = \sigma(-x^\top w)$. Here, $w \in \mathbb{R}^d$ is the parameter of the model, and it is not involved in the marginal distribution of $X_i$ (which we leave unspecified).
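For concreteness, here is a minimal sketch of sampling data from this model, assuming NumPy. The particular parameter $w$, the Gaussian features, and the sample size are arbitrary illustrative choices (the marginal distribution of $X_i$ is left unspecified by the model).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    """Sigmoid function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative choices (not part of the model specification).
d, n = 2, 5
w = np.array([1.0, -2.0])            # model parameter
X = rng.normal(size=(n, d))          # marginal distribution of X_i is unspecified; Gaussian here

# P(Y_i = +1 | X_i = x_i) = sigma(x_i^T w) for each row x_i of X.
p_plus = sigmoid(X @ w)
Y = np.where(rng.random(n) < p_plus, +1, -1)   # labels in {-1, +1}
```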
Maximum likelihood
The log-likelihood of $w$ given $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1,+1\}$ is

$$-\sum_{i=1}^n \ln\bigl(1 + \exp(-y_i x_i^\top w)\bigr) + (\text{terms that do not involve } w).$$
There is no closed-form expression for the maximizer of the log-likelihood. Nevertheless, we can approximately minimize the negative log-likelihood with gradient descent.
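As a sketch of how this objective can be evaluated (assuming NumPy, with `X` an $n \times d$ array whose rows are the feature vectors and `y` a length-$n$ array of $\pm 1$ labels), the negative log-likelihood, up to the terms that do not involve $w$, is:

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """Sum over i of ln(1 + exp(-y_i x_i^T w)), dropping terms that do not involve w."""
    margins = y * (X @ w)                         # y_i x_i^T w for each example
    return np.sum(np.logaddexp(0.0, -margins))    # logaddexp(0, t) = ln(1 + exp(t))
```

Using `np.logaddexp` rather than calling `exp` and `log` directly avoids numerical overflow when the margins are large and negative.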
Empirical risk minimization
Maximum likelihood is very different from finding the linear classifier with the smallest empirical zero-one loss risk. Finding the empirical zero-one loss risk minimizer is computationally intractable in general.
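Evaluating the empirical zero-one loss risk of a given linear classifier is easy; it is the minimization over $w$ that is hard. A sketch, assuming NumPy and the same `X` and `y` conventions as above (a prediction with $y_i x_i^\top w \le 0$ is counted as an error, matching the zero-one loss used later):

```python
import numpy as np

def empirical_zero_one_risk(w, X, y):
    """Fraction of examples i with y_i x_i^T w <= 0."""
    return np.mean(y * (X @ w) <= 0)
```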
Finding a linear separator
There are special cases when finding the empirical zero-one loss risk minimizer is computationally tractable. One is when the training data is linearly separable: i.e., when there exists $w^\star \in \mathbb{R}^d$ such that

$$y_i x_i^\top w^\star > 0 \quad \text{for all } i = 1, \dots, n.$$
Claim. Define $L(w) := \sum_{i=1}^n \ln(1 + \exp(-y_i x_i^\top w))$. Suppose $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1,+1\}$ is linearly separable. Then any $\hat{w} \in \mathbb{R}^d$ with

$$L(\hat{w}) < \inf_{w \in \mathbb{R}^d} L(w) + \ln(2)$$

is a linear separator.

Proof. We first observe that the infimum (i.e., greatest lower bound; see https://en.wikipedia.org/wiki/Infimum_and_supremum) of $L$ is zero. Let $w^\star \in \mathbb{R}^d$ be a linear separator, so $s_i := y_i x_i^\top w^\star > 0$ for all $i = 1, \dots, n$. For any $r > 0$,

$$L(r w^\star) = \sum_{i=1}^n \ln(1 + \exp(-r s_i)),$$

and therefore
$$\lim_{r \to \infty} \sum_{i=1}^n \ln(1 + \exp(-r s_i)) = 0.$$
Every term $\ln(1 + \exp(-y_i x_i^\top w))$ in $L(w)$ is positive, so $L(w) > 0$. Therefore, we conclude that $\inf_{w \in \mathbb{R}^d} L(w) = 0$.
So now we just have to show that any $\hat{w} \in \mathbb{R}^d$ with

$$L(\hat{w}) < \ln(2)$$

is a linear separator. So let $\hat{w}$ satisfy $L(\hat{w}) < \ln(2)$, which implies

$$\ln(1 + \exp(-y_i x_i^\top \hat{w})) < \ln(2) \quad \text{for every } i = 1, \dots, n.$$

Exponentiating both sides gives

$$1 + \exp(-y_i x_i^\top \hat{w}) < 2.$$

Now subtracting 1 from both sides and taking logarithms gives

$$-y_i x_i^\top \hat{w} < 0.$$

This means that $\hat{w}$ correctly classifies $(x_i, y_i)$. Since this holds for all $i = 1, \dots, n$, it follows that $\hat{w}$ is a linear separator.

Surrogate loss
Even if $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1,+1\}$ is not linearly separable, approximately maximizing the log-likelihood can yield a good linear classifier. This is because maximizing the log-likelihood is the same as minimizing the empirical logistic loss risk

$$\widehat{R}(w) := \frac{1}{n} \sum_{i=1}^n \ell_{\log}(y_i x_i^\top w),$$

where

$$\ell_{\log}(z) := -\ln \sigma(z) = \ln(1 + \exp(-z))$$

is the logistic loss. The logistic loss (up to scaling) turns out to be an upper bound on the zero-one loss:

$$\ell_{\mathrm{zo}}(z) \le \frac{1}{\ln 2} \ell_{\log}(z),$$

where $\ell_{\mathrm{zo}}(z) = \mathbb{1}\{z \le 0\}$. (Indeed, if $z \le 0$ then $\ell_{\log}(z) = \ln(1 + \exp(-z)) \ge \ln 2$, and $\ell_{\log}(z) \ge 0$ for all $z$.) So if the empirical logistic loss risk is small, then the empirical zero-one loss risk is also small.

Gradient descent for logistic regression
The derivative of $\ell_{\log}$ is given by

$$\frac{d\ell_{\log}(z)}{dz} = -\frac{1}{\sigma(z)} \cdot \frac{d\sigma(z)}{dz} = -\frac{1}{\sigma(z)} \cdot \sigma(z) \cdot \sigma(-z) = -\sigma(-z).$$

Therefore, by linearity and the chain rule, the negative gradient of $\widehat{R}$ with respect to $w$ is

$$-\nabla \widehat{R}(w) = -\frac{1}{n} \sum_{i=1}^n \nabla \ell_{\log}(y_i x_i^\top w) = -\frac{1}{n} \sum_{i=1}^n \left.\frac{d\ell_{\log}(z)}{dz}\right|_{z = y_i x_i^\top w} \cdot \nabla\bigl(y_i x_i^\top w\bigr) = \frac{1}{n} \sum_{i=1}^n \sigma(-y_i x_i^\top w) \, y_i x_i.$$

Now suppose $A = [x_1 | \cdots | x_n]^\top \in \mathbb{R}^{n \times d}$ and $b = [y_1 | \cdots | y_n]^\top \in \mathbb{R}^n$. (Notice that we have omitted the $1/\sqrt{n}$ scaling that we had for least squares linear regression.) Then the negative gradient of $\widehat{R}$ can be written as

$$-\nabla \widehat{R}(w) = \frac{1}{n} A^\top \bigl(b \odot \sigma(-b \odot (Aw))\bigr),$$

where $u \odot v \in \mathbb{R}^n$ is the coordinate-wise product of vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^n$, and $\sigma(v) \in \mathbb{R}^n$ is the coordinate-wise application of the sigmoid function to $v \in \mathbb{R}^n$. Gradient descent for logistic regression begins with an initial weight vector $w^{(0)} \in \mathbb{R}^d$, and then iteratively updates it by subtracting a positive multiple $\eta > 0$ of the gradient at the current iterate:
$$w^{(t)} := w^{(t-1)} - \eta \nabla \widehat{R}(w^{(t-1)}).$$
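Here is a minimal sketch of this procedure, assuming NumPy; the matrix `A` and vector `b` are as defined above, while the step size `eta`, the number of iterations, and the zero initialization are illustrative choices that are not prescribed by these notes.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def negative_gradient(w, A, b):
    """-grad R_hat(w) = (1/n) A^T (b ⊙ sigma(-b ⊙ (A w)))."""
    n = A.shape[0]
    return A.T @ (b * sigmoid(-b * (A @ w))) / n

def gradient_descent(A, b, eta=1.0, num_iters=1000):
    """Gradient descent on the empirical logistic loss risk, starting from w^(0) = 0."""
    w = np.zeros(A.shape[1])
    for _ in range(num_iters):
        w = w + eta * negative_gradient(w, A, b)   # w^(t) = w^(t-1) - eta * grad R_hat(w^(t-1))
    return w
```

Each iteration costs $O(nd)$ time, dominated by the matrix-vector products with $A$ and $A^\top$.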