Logistic Regression
Logistic regression defines the conditional probability directly and is a discriminative model rather than a generative model.
- Start with the scoring function θ · f(x, y), which measures the compatibility between the input x and the label y.
- To make sure it is not negative, we exponentiate it and get exp(θ · f(x, y)).
- We then normalize by dividing by the sum over all possible labels y ∈ Y, which gives a probability.
p(y | x; θ) = exp(θ · f(x, y)) / Σ_{y'∈Y} exp(θ · f(x, y'))
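A minimal sketch of this exponentiate-and-normalize computation, assuming a feature function that returns a sparse dict of feature values and a weight dict theta (these names are illustrative, not from the slides):

```python
import math

def conditional_prob(theta, feature_fn, x, labels):
    """Return p(y | x; theta) for every label y, via exponentiate-and-normalize."""
    scores = {y: sum(theta.get(k, 0.0) * v for k, v in feature_fn(x, y).items())
              for y in labels}
    m = max(scores.values())  # subtract the max score for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}
```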
Logistic Regression
The weights are estimated by maximum conditional likelihood. Given a data set D = {(x^(i), y^(i))}_{i=1}^N, the conditional log-likelihood is:
log p(y^(1:N) | x^(1:N); θ) = Σ_{i=1}^N log p(y^(i) | x^(i); θ)
                            = Σ_{i=1}^N [ θ · f(x^(i), y^(i)) - log Σ_{y'∈Y} exp(θ · f(x^(i), y')) ]
Or they can be estimated by minimizing the logistic regression loss
(or cross-entropy loss):
L_LOGREG = Σ_{i=1}^N [ -θ · f(x^(i), y^(i)) + log Σ_{y'∈Y} exp(θ · f(x^(i), y')) ]
Logistic Regression Objective
Loss of a single sample:
ℓ_LOGREG = -θ · f(x^(i), y^(i)) + log Σ_{y'∈Y} exp(θ · f(x^(i), y'))
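A sketch of this per-example loss using the log-sum-exp trick for numerical stability; theta and feature_fn follow the same illustrative conventions as the earlier sketch:

```python
import math

def example_loss(theta, feature_fn, x, y_true, labels):
    """Per-example loss: -theta.f(x, y_true) + log sum_y' exp(theta.f(x, y'))."""
    def score(y):
        return sum(theta.get(k, 0.0) * v for k, v in feature_fn(x, y).items())
    scores = [score(y) for y in labels]
    m = max(scores)  # log-sum-exp trick: factor out the largest score
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -score(y_true) + log_z
```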
Gradient of Logistic Regression
The gradient of the loss of a single example with respect to θ:
∂ℓ_LOGREG/∂θ = -f(x^(i), y^(i)) + [1 / Σ_{y''∈Y} exp(θ · f(x^(i), y''))] × Σ_{y'∈Y} exp(θ · f(x^(i), y')) f(x^(i), y')
             = -f(x^(i), y^(i)) + Σ_{y'∈Y} [exp(θ · f(x^(i), y')) / Σ_{y''∈Y} exp(θ · f(x^(i), y''))] × f(x^(i), y')
             = -f(x^(i), y^(i)) + Σ_{y'∈Y} p(y' | x^(i); θ) × f(x^(i), y')
             = -f(x^(i), y^(i)) + E_{Y|X}[f(x^(i), y)]

(An application of the chain rule; the last step rewrites the sum over labels as an expectation.)
Gradient of Logistic Regression
∂ℓ_LOGREG/∂θ = -f(x^(i), y^(i)) + Σ_{y'∈Y} p(y' | x^(i); θ) × f(x^(i), y')
             = -f(x^(i), y^(i)) + E_{Y|X}[f(x^(i), y)]

- This is a very nice, intuitive result.
- The gradient equals the difference between the expected feature counts under the current model, E_{Y|X}[f(x^(i), y)], and the observed feature counts f(x^(i), y^(i)) (see the sketch after this list).
- The loss is minimized if the feature counts under the current model and the observed feature counts are the same.
- The power of the logistic regression model is that you can use an arbitrary number of features without making any independence assumptions. This allows creative feature engineering to improve the performance of the model.
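A sketch of the single-example gradient as "expected minus observed" feature counts (same illustrative theta/feature_fn conventions as above):

```python
import math
from collections import defaultdict

def example_gradient(theta, feature_fn, x, y_true, labels):
    """Gradient of the per-example loss: -f(x, y_true) + E_{Y|X}[f(x, y)], as a sparse dict."""
    def score(y):
        return sum(theta.get(k, 0.0) * v for k, v in feature_fn(x, y).items())
    # p(y | x; theta): numerically stable softmax over the label scores
    m = max(score(y) for y in labels)
    exp_scores = {y: math.exp(score(y) - m) for y in labels}
    z = sum(exp_scores.values())
    probs = {y: e / z for y, e in exp_scores.items()}

    grad = defaultdict(float)
    for k, v in feature_fn(x, y_true).items():   # observed feature counts
        grad[k] -= v
    for y in labels:                             # expected feature counts under the model
        for k, v in feature_fn(x, y).items():
            grad[k] += probs[y] * v
    return dict(grad)
```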
Digression: Expectation
Expectation is the mean or average of a random variable. For the discrete case, let X be a random variable:
E[X] = Σ_{x∈X} x P(x) = μ

E[g(X)] = Σ_x g(x) P(x)

E[g(X, Y)] = Σ_{x,y} g(x, y) P(x, y)
Let X be the random variable whose value is the outcome of rolling a single die:
E[X] = Σ_{x=1}^6 x P(x) = (1/6) Σ_{x=1}^6 x = 21/6 = 3.5
Digression: Linearity of Expectations
E[X + Y] = Σ_{x,y} P(x, y)(x + y)
         = Σ_{x,y} P(x, y) x + Σ_{x,y} P(x, y) y
         = Σ_x (Σ_y P(x, y)) x + Σ_y (Σ_x P(x, y)) y
         = Σ_x P(x) x + Σ_y P(y) y
         = E[X] + E[Y]
Let Y be the random variable for the sum of two dice rolled. Expected value of Y:

E[Y] = E[X] + E[X] = 7

When two random variables are independent:

E[XY] = E[X] × E[Y]
Digression: Variance
- The variance of a random variable measures whether its values tend to be consistent over trials/experiments or whether they vary a lot.
Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

E[(X - E[X])²] = Σ_x (x - E[X])² P(x)
               = Σ_x (x² - 2E[X]x + (E[X])²) P(x)
               = Σ_x x² P(x) - 2E[X] Σ_x x P(x) + (E[X])² Σ_x P(x)
               = E[X²] - 2(E[X])² + (E[X])²
               = E[X²] - (E[X])²

- σ² denotes the variance, and σ is the standard deviation.
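A quick numeric check of the die example and of Var(X) = E[X²] - (E[X])²:

```python
# Fair six-sided die: P(x) = 1/6 for x in 1..6
values = range(1, 7)
p = 1 / 6

e_x = sum(x * p for x in values)        # E[X] = 3.5
e_x2 = sum(x * x * p for x in values)   # E[X^2] = 91/6
var_x = e_x2 - e_x ** 2                 # Var(X) = 35/12

print(e_x, var_x)  # 3.5 2.9166...
# Linearity of expectations: the sum of two dice has mean 3.5 + 3.5 = 7
```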
Regularization
- L2 regularization: add a multiple of the squared norm ||θ||₂² to the minimization objective.
- Regularization forces the estimator to trade off performance on the training data against the norm of the weights, and thus helps relieve overfitting.
The overall loss of a training set with a regularization term:
L_LOGREG = (λ/2)||θ||₂² + Σ_{i=1}^N [ -θ · f(x^(i), y^(i)) + log Σ_{y'∈Y} exp(θ · f(x^(i), y')) ]
Regularized gradient
Derivative of the L2 term:
(λ/2)||θ||₂² = (λ/2) ((Σ_j θ_j²)^(1/2))² = (λ/2) Σ_j θ_j²

∂/∂θ_k (λ/2)||θ||₂² = (λ/2) Σ_j ∂θ_j²/∂θ_k = λθ_k

∂θ_j²/∂θ_k = 2θ_k if j = k, and 0 otherwise
Gradient of regularized loss:
∇_θ L_LOGREG = λθ - Σ_{i=1}^N ( f(x^(i), y^(i)) - E_{Y|X}[f(x^(i), y)] )
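A sketch of how the L2 term modifies the loss and the gradient, assuming theta and grad are dicts keyed by feature (illustrative names, not a fixed API):

```python
def add_l2_term(theta, loss, grad, lam):
    """Return (loss + (lam/2)*||theta||^2, gradient with lam*theta_k added per component)."""
    reg_loss = loss + 0.5 * lam * sum(w * w for w in theta.values())
    reg_grad = dict(grad)
    for k, w in theta.items():
        reg_grad[k] = reg_grad.get(k, 0.0) + lam * w
    return reg_loss, reg_grad
```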
Batch optimization vs. online optimization
Gradient descent vs. stochastic gradient descent: in batch optimization, each update to the weights is based on the entire dataset. One such algorithm is gradient descent, which iteratively updates the weights:

θ^(t+1) ← θ^(t) - η^(t) ∇_θ L

where ∇_θ L is the gradient computed over the entire training set and η^(t) is the learning rate at iteration t.
Variations of Gradient Descent
- Online learning algorithms make updates while going through the data. In stochastic gradient descent, the approximate gradient is computed by sampling a single instance, and an update is made immediately.
- In mini-batch stochastic gradient descent, the gradient is computed over a small subset of instances.
Generalized gradient descent algorithm
The function BATCHER partitions the training set into B batches such that each instance appears in exactly one batch. In stochastic gradient descent, B = N; in gradient descent, B = 1; in mini-batch stochastic gradient descent, 1 < B < N.
procedure GRDNT-DSCNT(x^(1:N), y^(1:N), L, η^(1···∞), BATCHER, T_MAX)
    t ← 0
    θ^(0) ← 0
    repeat
        (b^(1), b^(2), ···, b^(B)) ← BATCHER(N)
        for n ∈ {1, 2, ···, B} do
            t ← t + 1
            θ^(t) ← θ^(t-1) - η^(t) ∇L(θ^(t-1); x^(b^(n)_1, b^(n)_2, ···), y^(b^(n)_1, b^(n)_2, ···))
            if CONVERGED(θ^(1,2,···,t)) then
                return θ^(t)
            end if
        end for
    until t = T_MAX
    return θ^(t)
end procedure
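A Python sketch of the generalized procedure; gradient_fn, the learning-rate schedule, and the batcher are placeholders you would supply, and the convergence test here is deliberately crude:

```python
import random

def gradient_descent(X, Y, gradient_fn, eta, batcher, t_max, dim, tol=1e-6):
    """Generalized (mini-batch) gradient descent.

    gradient_fn(theta, xs, ys) -> gradient of the loss on the batch (list of floats);
    eta(t) -> learning rate at step t;
    batcher(n) -> a partition of the indices 0..n-1 into batches.
    """
    theta = [0.0] * dim
    t = 0
    while t < t_max:
        for batch in batcher(len(X)):
            t += 1
            grad = gradient_fn(theta, [X[i] for i in batch], [Y[i] for i in batch])
            step = [eta(t) * g for g in grad]
            theta = [w - s for w, s in zip(theta, step)]
            if max(abs(s) for s in step) < tol:  # crude convergence test
                return theta
    return theta

def minibatch_batcher(batch_size):
    """Shuffle the indices and split them into batches of the given size."""
    def batcher(n):
        idx = list(range(n))
        random.shuffle(idx)
        return [idx[i:i + batch_size] for i in range(0, n, batch_size)]
    return batcher

# batch_size=1 gives stochastic gradient descent; batch_size=len(X) gives batch gradient descent.
```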
Binary logistic regression is a special case of multinomial logistic regression
P(y = 1 | x) = exp(Σ_k θ_k f_k(y=1, x)) / [ exp(Σ_k θ_k f_k(y=1, x)) + exp(Σ_k θ_k f_k(y=0, x)) ]

             = 1 / [ 1 + exp(Σ_k θ_k f_k(y=0, x)) × exp(-Σ_k θ_k f_k(y=1, x)) ]     (multiplying numerator and denominator by exp(-Σ_k θ_k f_k(y=1, x)))

             = 1 / [ 1 + exp(-Σ_k θ_k (f_k(y=1, x) - f_k(y=0, x))) ]

             = 1 / [ 1 + exp(-Σ_k θ_k f'_k(x)) ] = σ(θ · f'(x))

where f'_k(x) = f_k(y=1, x) - f_k(y=0, x) and σ is the sigmoid (logistic) function.
Note: For binary classification, you only need to pay attention to the positive class.
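A small numeric check that the two-class softmax collapses to a sigmoid of the score difference (the scores below are arbitrary toy values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

score_pos = 1.3   # stands in for theta . f(x, y=1)
score_neg = -0.4  # stands in for theta . f(x, y=0)

softmax_pos = math.exp(score_pos) / (math.exp(score_pos) + math.exp(score_neg))
assert abs(softmax_pos - sigmoid(score_pos - score_neg)) < 1e-12
```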
Logistic Regression: features and weights
f(x, y):

        not    funny  painful  ok    overall  story  good   jokes  bias
  POS   f1     f4     f7       f10   f13      f16    f19    f22    f25
  NEG   f2     f5     f8       f11   f14      f17    f20    f23    f26
  NEU   f3     f6     f9       f12   f15      f18    f21    f24    f27

θ:

        not    funny  painful  ok    overall  story  good   jokes  bias
  POS   -1      2     -2.5      0.5   0.2      0.08   1.5    0.8    1.2
  NEG    2     -2      1.8     -0.5   0.1     -0.6   -2     -1.2    0.8
  NEU   -0.4   -0.9   -1.5      2     1       -0.2   -1.2   -0.3    0.4
e.g., f1(x, y) = 1 if x = "not" ∧ y = "POS", with θ1 = -1
Note: The features f(x, y) and the weights θ are presented as a table rather than a vector due to space limitations. Mathematically, they should still be viewed as vectors.
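One way to encode the table in code — the weight values are copied from the table above, while the dict layout and function names are just an illustration:

```python
WORDS = ["not", "funny", "painful", "ok", "overall", "story", "good", "jokes", "bias"]
LABELS = ["POS", "NEG", "NEU"]

# theta[label][word], copied from the table above
THETA = {
    "POS": dict(zip(WORDS, [-1.0, 2.0, -2.5, 0.5, 0.2, 0.08, 1.5, 0.8, 1.2])),
    "NEG": dict(zip(WORDS, [2.0, -2.0, 1.8, -0.5, 0.1, -0.6, -2.0, -1.2, 0.8])),
    "NEU": dict(zip(WORDS, [-0.4, -0.9, -1.5, 2.0, 1.0, -0.2, -1.2, -0.3, 0.4])),
}

def features(x, y):
    """f(x, y): one indicator per known word that appears in x, plus the always-on bias."""
    tokens = set(x.lower().replace(",", " ").split()) | {"bias"}
    return {(w, y): 1.0 for w in WORDS if w in tokens}

def score(x, y):
    """theta . f(x, y) for this table."""
    return sum(THETA[y][w] * v for (w, _), v in features(x, y).items())
```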
Logistic Regression: Inference
(feature table f(x, y) and weights θ as above)
Test instance: “funny, smart, and visually stunning”
p(y | x) = exp(θ · f(x, y)) / Σ_{y'} exp(θ · f(x, y')) = exp(Σ_k θ_k f_k(x, y)) / Σ_{y'} exp(Σ_k θ_k f_k(x, y'))
Logistic Regression: Inference
(feature table f(x, y) and weights θ as above)
Test instance: "funny, smart, and visually stunning"

p(y = POS | x) = exp(θ4 f4 + θ25 f25) / [ exp(θ4 f4 + θ25 f25) + exp(θ5 f5 + θ26 f26) + exp(θ6 f6 + θ27 f27) ]
               = exp(2 + 1.2) / [ exp(2 + 1.2) + exp(-2 + 0.8) + exp(-0.9 + 0.4) ] = 0.9643
Logistic Regression: Inference
(feature table f(x, y) and weights θ as above)
Test instance: "funny, smart, and visually stunning"

p(y = NEG | x) = exp(θ5 f5 + θ26 f26) / [ exp(θ4 f4 + θ25 f25) + exp(θ5 f5 + θ26 f26) + exp(θ6 f6 + θ27 f27) ]
               = exp(-2 + 0.8) / [ exp(2 + 1.2) + exp(-2 + 0.8) + exp(-0.9 + 0.4) ] = 0.0118
Logistic Regression: Inference
(feature table f(x, y) and weights θ as above)
Test instance: "funny, smart, and visually stunning"

p(y = NEU | x) = exp(θ6 f6 + θ27 f27) / [ exp(θ4 f4 + θ25 f25) + exp(θ5 f5 + θ26 f26) + exp(θ6 f6 + θ27 f27) ]
               = exp(-0.9 + 0.4) / [ exp(2 + 1.2) + exp(-2 + 0.8) + exp(-0.9 + 0.4) ] = 0.0238
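A quick check of the three probabilities above, plugging in the weights of the features that fire ("funny" and the bias):

```python
import math

# theta . f(x, y) for the test instance: only the "funny" and bias features fire
scores = {
    "POS": 2.0 + 1.2,    # theta_4 + theta_25
    "NEG": -2.0 + 0.8,   # theta_5 + theta_26
    "NEU": -0.9 + 0.4,   # theta_6 + theta_27
}
z = sum(math.exp(s) for s in scores.values())
probs = {y: math.exp(s) / z for y, s in scores.items()}
print(probs)  # approximately {'POS': 0.9643, 'NEG': 0.0118, 'NEU': 0.0238}
```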
Logistic Regression: Parameter estimation
(feature table f(x, y) and weights θ as above)
Training instance: y = NEG, x = "not funny at all"

∇_θ L_LOGREG = Σ_{i=1}^N [ -f(x^(i), y^(i)) + Σ_{y'∈Y} p(y' | x^(i)) f(x^(i), y') ]
Logistic Regression: parameter estimation
(feature table f(x, y) and weights θ as above)
Training instance: y = NEG, x = "not funny at all"

p(y = NEG | x) = exp(θ2 f2 + θ5 f5 + θ26 f26) / [ exp(θ1 f1 + θ4 f4 + θ25 f25) + exp(θ2 f2 + θ5 f5 + θ26 f26) + exp(θ3 f3 + θ6 f6 + θ27 f27) ]
               = exp(2 - 2 + 0.8) / [ exp(-1 + 2 + 1.2) + exp(2 - 2 + 0.8) + exp(-0.4 - 0.9 + 0.4) ] = 0.1909
Logistic Regression: parameter estimation
(feature table f(x, y) and weights θ as above)
Training instance: y = NEG, x = "not funny at all"

p(y = POS | x) = exp(θ1 f1 + θ4 f4 + θ25 f25) / [ exp(θ1 f1 + θ4 f4 + θ25 f25) + exp(θ2 f2 + θ5 f5 + θ26 f26) + exp(θ3 f3 + θ6 f6 + θ27 f27) ]
               = exp(-1 + 2 + 1.2) / [ exp(-1 + 2 + 1.2) + exp(2 - 2 + 0.8) + exp(-0.4 - 0.9 + 0.4) ] = 0.7742
Logistic Regression: Parameter estimation
(feature table f(x, y) and weights θ as above)
Training instance: y = NEG, x = "not funny at all"

p(y = NEU | x) = exp(θ3 f3 + θ6 f6 + θ27 f27) / [ exp(θ1 f1 + θ4 f4 + θ25 f25) + exp(θ2 f2 + θ5 f5 + θ26 f26) + exp(θ3 f3 + θ6 f6 + θ27 f27) ]
               = exp(-0.4 - 0.9 + 0.4) / [ exp(-1 + 2 + 1.2) + exp(2 - 2 + 0.8) + exp(-0.4 - 0.9 + 0.4) ] = 0.0349
Logistic Regression: Parameter estimation
(feature table f(x, y) and weights θ as above)
Training instance: y = NEG, x = "not funny at all"

gradient[θ1] = -0 + 0.7742,   gradient[θ4] = -0 + 0.7742
gradient[θ2] = -1 + 0.1909,   gradient[θ5] = -1 + 0.1909
gradient[θ3] = -0 + 0.0349,   gradient[θ6] = -0 + 0.0349
Logistic Regression: Parameter estimation
(feature table f(x, y) and weights θ as above)
Training instance: y = NEG, x = "not funny at all"

gradient[θ25] = -0 + 0.7742
gradient[θ26] = -1 + 0.1909
gradient[θ27] = -0 + 0.0349
Note: The “bias” is the feature that always fires.
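A quick check of these gradient components for the training instance; only the "not", "funny", and bias features fire, and the true label is NEG:

```python
import math

# theta . f(x, y) for x = "not funny at all"
scores = {"POS": -1 + 2 + 1.2, "NEG": 2 - 2 + 0.8, "NEU": -0.4 - 0.9 + 0.4}
z = sum(math.exp(s) for s in scores.values())
probs = {y: math.exp(s) / z for y, s in scores.items()}  # ~0.7742, 0.1909, 0.0349

y_true = "NEG"
for word in ["not", "funny", "bias"]:
    for y in ["POS", "NEG", "NEU"]:
        grad = -(1.0 if y == y_true else 0.0) + probs[y]
        print(f"gradient[theta_({word},{y})] = {grad:+.4f}")
# e.g. gradient[theta_(not,NEG)] = -1 + 0.1909, gradient[theta_(not,POS)] = -0 + 0.7742
```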
Some observations about the gradient for Logistic Regression
- The gradients of all features that fire for this instance and predict the same label are updated by the same amount.
- A feature is penalized or rewarded by an amount that is proportional to how badly off the prediction is.
Discussion question:
- Assuming that the weights θ are initialized to 0, how would the first iteration of logistic regression training proceed?
Adding a regularization term
(weights θ as in the table above)
Training instance: y = NEG, x = “not funny at all”
∇_θ L_LOGREG = λθ + Σ_{i=1}^N [ -f(x^(i), y^(i)) + Σ_{y'∈Y} p(y' | x^(i)) f(x^(i), y') ]
Let λ = 0.5:

gradient[θ1] = -0.5 - 0 + 0.7742,   gradient[θ4] = 1 - 0 + 0.7742
gradient[θ2] = 1 - 1 + 0.1909,      gradient[θ5] = -1 - 1 + 0.1909
gradient[θ3] = -0.2 - 0 + 0.0349,   gradient[θ6] = -0.45 - 0 + 0.0349
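The same check with the λθ term added (λ = 0.5; weights read off the table; note that, strictly, the regularization term also contributes λθ_k for features that do not fire in this example):

```python
lam = 0.5
theta = {  # weights of the active features, (word, label) -> theta value
    ("not", "POS"): -1.0,   ("not", "NEG"): 2.0,    ("not", "NEU"): -0.4,
    ("funny", "POS"): 2.0,  ("funny", "NEG"): -2.0, ("funny", "NEU"): -0.9,
    ("bias", "POS"): 1.2,   ("bias", "NEG"): 0.8,   ("bias", "NEU"): 0.4,
}
probs = {"POS": 0.7742, "NEG": 0.1909, "NEU": 0.0349}  # from the previous check
y_true = "NEG"
for (word, y), w in theta.items():
    grad = lam * w - (1.0 if y == y_true else 0.0) + probs[y]
    print(f"gradient[theta_({word},{y})] = {grad:+.4f}")
# e.g. gradient[theta_(not,POS)] = -0.5 - 0 + 0.7742, gradient[theta_(funny,NEG)] = -1 - 1 + 0.1909
```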