Logistic Regression and MaxEnt
Wei Wang @ CSE, UNSW
April 9, 2020
Generative vs. Discriminative Learning
Generative models:

Pr[y | x] = Pr[x | y] Pr[y] / Pr[x] ∝ Pr[x | y] Pr[y] = Pr[x, y]

The key is to model the generative probability Pr[x | y]. Example: Naive Bayes.
Discriminative models: model Pr[y | x] directly as g(x; θ). Examples: Decision Tree, Logistic Regression.
Instance-based learning. Example: kNN classifier.
Linear Regression
Figure: Linear Regression
Task
Input: (x(i), y(i)) pairs (1 ≤ i ≤ n)
Preprocess: let x(i) = (1, x(i))⊤ (prepend a constant 1 feature)
Output: the best w = (w0, w1)⊤ such that ŷ = w⊤x best explains the observations
Least Square Fit
The criterion for “best”:
Individual error: ε_i = ŷ(i) − y(i)
Sum of squared errors: l = Σ_{i=1}^n ε_i²
Find w such that l is minimized.
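As a concrete sketch (assuming NumPy and made-up 1-D data points), the individual errors and the criterion l can be computed as:

```python
import numpy as np

# Hypothetical 1-D data points (x(i), y(i)).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

def sse(w0, w1):
    """Sum of squared errors l for the candidate line y_hat = w0 + w1 * x."""
    eps = (w0 + w1 * x) - y   # individual errors eps_i = y_hat(i) - y(i)
    return float(np.sum(eps ** 2))

# The "best" w minimizes l; a poor fit has a much larger l.
l_fit = sse(3.5, 1.4)
l_flat = sse(0.0, 0.0)
```

Trying several (w0, w1) pairs by hand shows how l ranks candidate lines.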
Minimizing a Function
Taylor series of f(x) at point a:

f(x) = Σ_{n=0}^{+∞} f^(n)(a) (x − a)^n / n!                          (1)
     = f(a) + f′(a)·(x − a) + (f′′(a)/2)·(x − a)² + o((x − a)²)      (2)

Intuitively, f(x) is almost f(a) + f′(a)·(x − a) if a is close to x.
If f(x) has a local minimum x∗, then f′(x∗) = 0 and f′′(x∗) > 0.
The minimum of the local minima is the global minimum if it is smaller than the function values at all the boundary points.
Intuitively, f(x) is almost f(a) + (f′′(a)/2)·(x − a)² if a is close to x∗.
Find the Least Square Fit for Linear Regression
∂l/∂w_j = Σ_{i=1}^n 2ε_i · ∂ε_i/∂w_j = Σ_{i=1}^n 2ε_i · ∂(w⊤x(i))/∂w_j
        = Σ_{i=1}^n 2ε_i x_j(i) = 2 Σ_{i=1}^n (ŷ(i) − y(i)) x_j(i)

Setting the above to 0 essentially requires, for all j,

Σ_{i=1}^n ŷ(i) x_j(i) = Σ_{i=1}^n y(i) x_j(i)
(what the model predicts)   (what the data says)
Find the Least Square Fit for Linear Regression
In the simple 1-D case, we have only two parameters in w = (w0, w1)⊤:

Σ_{i=1}^n (w0 + w1 x_1(i)) x_0(i) = Σ_{i=1}^n y(i) x_0(i)
Σ_{i=1}^n (w0 + w1 x_1(i)) x_1(i) = Σ_{i=1}^n y(i) x_1(i)

Since x_0(i) = 1, they are essentially

Σ_{i=1}^n (w0 + w1 x_1(i)) · 1      = Σ_{i=1}^n y(i) · 1
Σ_{i=1}^n (w0 + w1 x_1(i)) · x_1(i) = Σ_{i=1}^n y(i) · x_1(i)
Example
Using the same example as in https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics):

    [ (x(1))⊤ ]   [ 1 1 ]       [ w0 ]       [  6 ]
X = [ (x(2))⊤ ] = [ 1 2 ]   w = [ w1 ]   y = [  5 ]
    [ (x(3))⊤ ]   [ 1 3 ]                    [  7 ]
    [ (x(4))⊤ ]   [ 1 4 ]                    [ 10 ]

The model's predictions are ŷ = Xw:

[ 1 1 ]       [ ŷ(1) ]
[ 1 2 ]  w  = [ ŷ(2) ]
[ 1 3 ]       [ ŷ(3) ]
[ 1 4 ]       [ ŷ(4) ]
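A quick way to check this example numerically (a NumPy sketch; the normal equations X⊤Xw = X⊤y come from the previous slide):

```python
import numpy as np

# Design matrix and targets from the example above.
X = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [1, 4]], dtype=float)
y = np.array([6, 5, 7, 10], dtype=float)

# Solve the normal equations X^T X w = X^T y for w = (w0, w1).
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w   # the model's predictions
```

Running this gives w ≈ (3.5, 1.4), i.e., the fitted line ŷ = 3.5 + 1.4x, matching the Wikipedia page's solution.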
Generalization to m-dim
Easily generalizes to more than 2-dim:
    [ 1  x_1(1) ... x_m(1) ]       [ w0  ]       [ y(1) ]
X = [ ...        ...       ]   w = [ ... ]   y = [ ...  ]
    [ 1  x_1(i) ... x_m(i) ]       [ w_m ]       [ y(i) ]
    [ ...        ...       ]                     [ ...  ]
    [ 1  x_1(n) ... x_m(n) ]                     [ y(n) ]

How to perform polynomial regression for one-dimensional x?

ŷ = w0 + w1 x + w2 x² + ... + w_m x^m

Let x_j(i) = (x_1(i))^j ⇒ polynomial least square fitting
(http://mathworld.wolfram.com/LeastSquaresFittingPolynomial.html)
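Polynomial fitting is then just linear least squares on powers of x. A sketch (with NumPy and synthetic, noise-free data generated from a known quadratic):

```python
import numpy as np

# Synthetic data from y = 2 - x + 0.5 x^2 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 - 1.0 * x + 0.5 * x ** 2

# Design matrix with columns x^0, x^1, x^2, i.e. x_j(i) = (x(i))^j.
X = np.vander(x, 3, increasing=True)

# Ordinary least squares on the expanded features recovers the coefficients.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```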
Probabilistic Interpretation
High-level idea:
Observations (i.e., training data) are noisy.
P(y(i) | ŷ(i)) = f_i(w): the probability of observing y(i) given the model's prediction ŷ(i), viewed as a function of w. Any w is possible, but some w is most likely.
Assuming independence of training examples, the likelihood of the training dataset is Π_i f_i(w).
We shall choose the w∗ that maximizes the likelihood: maximum likelihood estimation (MLE).
If we also incorporate some prior on w, this becomes maximum a posteriori (MAP) estimation: assuming a Gaussian prior on w adds an l2 regularization term to the objective function.
Many models and their variants can be deemed as different ways of estimating P(y(i) | ŷ(i)).
Geometric Interpretation and the Closed Form Solution
Find w such that ∥y − Xw∥² is minimized. What is the set of all Xw when X is fixed?
It is the subspace spanned by the d column vectors of X.
y in general is a vector outside this subspace. So the minimum distance is achieved when Xw∗ is exactly the projection of y onto the subspace. This means (denoting the i-th column of X as X_i):

X_1⊤(y − Xw) = 0
X_2⊤(y − Xw) = 0    ⇒  X⊤(y − Xw) = 0
...
X_d⊤(y − Xw) = 0

w = (X⊤X)⁻¹X⊤y = X⁺y    (X⁺: the pseudo-inverse of X)
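The orthogonality condition X⊤(y − Xw) = 0 can be checked numerically (a sketch reusing the earlier 4-point example data):

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([6, 5, 7, 10], dtype=float)

# Closed-form solution via the pseudo-inverse: w = X^+ y.
w = np.linalg.pinv(X) @ y

# The residual y - Xw is orthogonal to every column of X.
residual = y - X @ w
orth = X.T @ residual   # should be the zero vector
```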
Illustration
[Figure: two 3-D plots over axes x, y, z]
Logistic Regression
Special case: y(i) ∈ {0, 1}.
It is not appropriate to directly regress y(i).
Rather, model y(i) as the observed outcome of a Bernoulli trial with an unknown parameter p_i.
How to model p_i?
We assume that p_i depends only on x(i) ⇒ rename p_i to p_x. Still hard to estimate p_x reliably.
MLE: p_x = E[y = 1 | x]. What can we say about p_{x+ε} when p_x is given?
Answer: we impose a relationship between p_x and x. What about a simple linear model p_x = w⊤x for some w? (Note: all points share the same parameter w.)
Problem: mismatch of the domains: p_x ∈ [0, 1] vs. w⊤x ∈ R.
Solution: a mean function / inverse of the link function, g⁻¹ : R → params.
Solution
Solution: link function g : params → R

g(p) = logit(p) = log(p / (1 − p)) = w⊤x    (3)

Equivalently, solve for p:

p = e^{w⊤x} / (1 + e^{w⊤x}) = 1 / (1 + e^{−w⊤x}) = σ(w⊤x)    (4)

where σ(z) = 1 / (1 + exp(−z)).
Recall that p_x = E[y = 1 | x].
Decision boundary is p ≥ 0.5, which is equivalent to w⊤x ≥ 0. Hence, LR is a linear classifier.
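A minimal sketch of equation (4) and the linear decision boundary (the weight vector here is made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights w = (w0, w1); x is augmented with a constant 1.
w = np.array([-3.0, 1.0])

def p(x1):
    """p_x = sigma(w^T x) = Pr[y = 1 | x] for a 1-D input x1."""
    return sigmoid(w @ np.array([1.0, x1]))

# p >= 0.5 exactly when w^T x >= 0; here the boundary is at x1 = 3.
p_neg = p(2.0)   # w^T x = -1, below the boundary
p_pos = p(4.0)   # w^T x = +1, above the boundary
```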
Learning the Parameter w
Consider a training data point x(i).
Recall that the conditional probability Pr[y(i) = 1 | x(i)] computed by the model is denoted by the shorthand notation p (which is a function of w and x(i)).
The likelihood of x(i) is

    p        if y(i) = 1
    1 − p    otherwise

or, equivalently, p^{y(i)} (1 − p)^{1 − y(i)}.
Hence, the likelihood of the whole training dataset is

L(w) = Π_{i=1}^n p(x(i))^{y(i)} (1 − p(x(i)))^{1 − y(i)}

The log-likelihood is (assume log = ln)

l(w) = Σ_{i=1}^n [ y(i) log p(x(i)) + (1 − y(i)) log(1 − p(x(i))) ]    (5)
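Equation (5) in code (a sketch; the toy data and weight vectors are made up):

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy data: a bias column plus one feature; labels in {0, 1}.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

l_zero = log_likelihood(np.array([0.0, 0.0]), X, y)  # p = 0.5 everywhere
l_sep = log_likelihood(np.array([0.0, 5.0]), X, y)   # a separating w
```

With w = 0 every point gets p = 0.5, so l = 4 log(1/2); a w that separates the classes scores strictly higher.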
Learning the Parameter w
To maximize l, notice that it is concave. So take its partial derivatives:

∂l(w)/∂w_j = Σ_{i=1}^n [ y(i) · (1/p(x(i))) · ∂p(x(i))/∂w_j + (1 − y(i)) · (1/(1 − p(x(i)))) · ∂(1 − p(x(i)))/∂w_j ]
           = Σ_{i=1}^n [ y(i) x_j(i) − p(x(i)) x_j(i) ]

Setting them to 0 essentially means, for all j,

Σ_{i=1}^n ŷ(i) x_j(i) = Σ_{i=1}^n p(x(i)) x_j(i) = Σ_{i=1}^n y(i) x_j(i)
(what the model predicts)                           (what the data says)
Understand the Equilibrium
Consider one-dimensional x. The above condition simplifies to

Σ_{i=1}^n p_i x(i) = Σ_{i=1}^n y(i) x(i)
The RHS is essentially the sum of x values only for the
training data in class Y = 1.
The LHS says: if we use our learned model to assign a probability (of belonging to the class Y = 1) for every training data, the LHS is the expected sum of x values.
If this is still abstract, think of an example.
Numeric Solution
There is no closed-form solution to maximize l. Use the Gradient Ascent algorithm to maximize l. There are faster algorithms.
(Stochastic) Gradient Ascent
w is initialized to some random value (e.g., 0).
Since the gradient gives the steepest direction in which to increase a function's value, we move a small step in that direction, i.e.,

w_j ← w_j + α ∂l(w)/∂w_j ,   or equivalently

w_j ← w_j + α Σ_{i=1}^n (y(i) − p(x(i))) x_j(i)

where α (the learning rate) is usually a small constant, or decreasing over the epochs.
Stochastic version: use the gradient on a randomly selected training instance i, i.e.,

w_j ← w_j + α (y(i) − p(x(i))) x_j(i)
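The batch update above as a runnable sketch (the toy dataset, α, and epoch count are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, epochs=2000):
    """Batch gradient ascent: w_j <- w_j + alpha * sum_i (y_i - p_i) x_ij."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w += alpha * (X.T @ (y - p))   # the gradient of l(w)
    return w

# Toy data: bias column plus one feature; class 1 iff the feature > 0.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = fit_logistic(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(float)
```

The stochastic version replaces the sum over all i with a single randomly chosen training instance per update.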
Newton’s Method
Gradient Ascent moves in the "right" direction a tiny step at a time. Can we find a good step size?
Consider the 1-D case: minimize f(x), where the current point is a.

f(x) ≈ f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)²   for x near a.

To minimize f(x), set ∂f(x)/∂x = 0, i.e.,

∂f(x)/∂x = 0
⇔ f′(a) + f′′(a)(x − a) = 0
⇔ x = a − f′(a)/f′′(a)

Can be applied to multi-dimensional cases too ⇒ need to use ∇ (gradient) and H (Hessian).
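In 1-D, the update x ← a − f′(a)/f′′(a) looks like this (a sketch on a made-up quadratic, for which the Taylor expansion is exact, so a single Newton step already lands on the minimum):

```python
# Minimize f(x) = (x - 2)^2 + 1 with Newton's method.
def f_prime(x):
    return 2.0 * (x - 2.0)   # f'(x)

def f_double_prime(x):
    return 2.0               # f''(x) > 0: the stationary point is a minimum

x = 10.0
for _ in range(5):
    x = x - f_prime(x) / f_double_prime(x)   # Newton step
```

Gradient descent from x = 10 would need many small steps; Newton's method chooses the step size from the curvature.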
Regularization
Regularization is another method to deal with overfitting.
It is designed to penalize large values of the model parameters.
Hence it encourages simpler models, which are less likely to overfit.
Instead of maximizing l(w), we maximize l(w) − λR(w). λ is a hyper-parameter that controls the strength of regularization.
It is usually determined by cross validating with a list of possible values (e.g., 0.001, 0.01, 0.1, 1, 10, . . . )
Grid search: http://scikit-learn.org/stable/modules/grid_search.html. There are alternative methods.
R(w) quantifies the “size” of the model parameters. Popular choices are:
L2 regularization (Ridge LR): R(w) = ∥w∥₂²
L1 regularization (Lasso LR): R(w) = ∥w∥₁
L1 regularization is more likely to result in sparse models.
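A sketch of L2-regularized LR by gradient ascent on l(w) − λ∥w∥₂² (the data and hyper-parameters are made up; note the extra −2λw term in the gradient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ridge_logistic(X, y, lam, alpha=0.1, epochs=2000):
    """Gradient ascent on the penalized objective l(w) - lam * ||w||_2^2."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w += alpha * (X.T @ (y - p) - 2.0 * lam * w)
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w_reg = fit_ridge_logistic(X, y, lam=1.0)
w_unreg = fit_ridge_logistic(X, y, lam=0.0)
```

The penalty shrinks the weights: this toy data is linearly separable, so without regularization ∥w∥ keeps growing with more epochs.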
Generalizing LR to Multiple Classes
LR can be generalized to multiple classes ⇒ MaxEnt.

Pr[c | x] ∝ exp(w_c⊤x)  ⇒  Pr[c | x] = exp(w_c⊤x) / Z

Z = ? It is the normalization constant: Z = Σ_{c′∈C} exp(w_{c′}⊤x).
Let c∗ be the last class in C; then we can fix w_{c∗} = 0. How?
Derive LR from MaxEnt.
Both belong to exponential, or log-linear, classifiers.
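A softmax sketch (the 3-class weight matrix and input are hypothetical; the last class's weight vector is pinned at 0, as suggested above):

```python
import numpy as np

def softmax(scores):
    """Pr[c | x] = exp(w_c^T x) / Z, shifted by max(scores) for stability."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Three classes, 2-D augmented input; row c of W is w_c.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])    # w_{c*} = 0 for the last class
x = np.array([1.0, 2.0])

probs = softmax(W @ x)

# With two classes, MaxEnt reduces to LR: Pr[c1 | x] = sigma(w^T x).
two_class = softmax(np.array([1.5, 0.0]))
```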
Further Reading
Andrew Ng’s note:
http://cs229.stanford.edu/notes/cs229-notes1.pdf
Cosma Shalizi's note: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
Tom Mitchell's book chapter: https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf