Logistic Regression and MaxEnt
Wei Wang @ CSE, UNSW
April 8, 2018
Wei Wang @ CSE, UNSW Logistic Regression and MaxEnt
Generative vs. Discriminative Learning
Generative models:
Pr[y | x] = Pr[x | y] Pr[y] / Pr[x] ∝ Pr[x | y] Pr[y] = Pr[x, y]
The key is to model the generative probability: Pr[x | y ].
Example: Naive Bayes.
Discriminative models:
model Pr[y | x] directly as g(x; θ).
Example: Decision tree, Logistic Regression.
Instance-based Learning.
Example: kNN classifier.
Linear Regression
Figure: Linear Regression
Task
Input: (x (i), y (i)) pairs (1 ≤ i ≤ n)
Preprocess: let x^(i) = [1  x^(i)]^⊤
Output: the best w = [w_0  w_1]^⊤ such that ŷ = w^⊤x best
explains the observations
Least Square Fit
The criterion for “best”:
Individual error: ε_i = ŷ^(i) − y^(i)
Sum squared error: ℓ = ∑_{i=1}^n ε_i²
Find w such that ℓ is minimized.
Minimizing a Function
Taylor series of f(x) at point a:

f(x) = ∑_{n=0}^{∞} f^(n)(a)/n! · (x − a)^n    (1)
     = f(a) + f′(a)·(x − a) + f″(a)/2 (x − a)² + o((x − a)²)    (2)
Intuitively, f(x) is almost f(a) + f′(a)·(x − a) when x is close to a.
If f (x) has local minimum x∗, then
f ′(x∗) = 0, and
f ′′(x∗) > 0.
Minimum of the local minima is the global minimum if it is
smaller than the function values at all the boundary points.
Intuitively, f(x) is almost f(a) + (f″(a)/2)(x − a)² if a is close to
x*, since f′(a) ≈ 0 near x*.
Find the Least Square Fit for Linear Regression
∂ℓ/∂w_j = ∑_{i=1}^n 2ε_i · ∂ε_i/∂w_j
        = ∑_{i=1}^n 2ε_i · ∂(w^⊤x^(i))/∂w_j
        = ∑_{i=1}^n 2ε_i x_j^(i)
        = 2 ∑_{i=1}^n (ŷ^(i) − y^(i)) x_j^(i)
By setting the above to 0, this essentially requires, for all j,

∑_{i=1}^n ŷ^(i) x_j^(i) = ∑_{i=1}^n y^(i) x_j^(i)

(LHS: what the model predicts; RHS: what the data says)
Find the Least Square Fit for Linear Regression
In the simple 1D case, we have only two parameters in w = [w_0  w_1]^⊤:

∑_{i=1}^n (w_0 + w_1 x_1^(i)) x_0^(i) = ∑_{i=1}^n y^(i) x_0^(i)
∑_{i=1}^n (w_0 + w_1 x_1^(i)) x_1^(i) = ∑_{i=1}^n y^(i) x_1^(i)

Since x_0^(i) = 1, these are essentially

∑_{i=1}^n (w_0 + w_1 x_1^(i)) · 1 = ∑_{i=1}^n y^(i) · 1
∑_{i=1}^n (w_0 + w_1 x_1^(i)) · x_1^(i) = ∑_{i=1}^n y^(i) · x_1^(i)
Example
Using the same example as in https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics):
X = [ (x^(1))^⊤ ; (x^(2))^⊤ ; (x^(3))^⊤ ; (x^(4))^⊤ ] = [ 1 1 ; 1 2 ; 1 3 ; 1 4 ],   w = [ w_0 ; w_1 ],   y = [ 6 ; 5 ; 7 ; 10 ]

We seek w such that the fitted values ŷ = Xw, i.e., [ ŷ_1 ; ŷ_2 ; ŷ_3 ; ŷ_4 ] = [ 1 1 ; 1 2 ; 1 3 ; 1 4 ] w, are as close as possible to y.
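As a sanity check, the two 1D normal equations from the previous slide can be solved directly for this data in plain Python (a quick sketch, not part of the original slides):

```python
# Example data from the slide: x values 1..4, targets y.
xs = [1, 2, 3, 4]
ys = [6, 5, 7, 10]
n = len(xs)

# Normal equations for w = [w0, w1]:
#   n * w0      + sum(x) * w1   = sum(y)
#   sum(x) * w0 + sum(x^2) * w1 = sum(x * y)
sx = sum(xs)
sxx = sum(x * x for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 linear system via Cramer's rule.
det = n * sxx - sx * sx
w0 = (sy * sxx - sx * sxy) / det
w1 = (n * sxy - sx * sy) / det
print(w0, w1)  # -> 3.5 1.4
```

This matches the solution of the Wikipedia example, ŷ = 3.5 + 1.4x.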
Generalization to m-dim
Easily generalizes to more than 2-dim:

X = [ 1 x_1^(1) … x_m^(1) ; … ; 1 x_1^(i) … x_m^(i) ; … ; 1 x_1^(n) … x_m^(n) ],   w = [ w_0 ; w_1 ; … ; w_m ],   y = [ y^(1) ; … ; y^(i) ; … ; y^(n) ]
How to perform polynomial regression for one-dimensional x?
ŷ = w_0 + w_1 x + w_2 x² + … + w_m x^m.
Let x_j^(i) = (x_1^(i))^j ⟹ polynomial least squares fitting
(http://mathworld.wolfram.com/LeastSquaresFittingPolynomial.html)
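The feature map above can be sketched in a few lines of plain Python (the helper name is illustrative, not from the slides):

```python
def poly_design_matrix(xs, m):
    """Build the n-by-(m+1) design matrix whose i-th row is
    [1, x_i, x_i**2, ..., x_i**m], turning polynomial regression
    into ordinary linear least squares."""
    return [[x ** j for j in range(m + 1)] for x in xs]

X = poly_design_matrix([1.0, 2.0, 3.0], 2)
print(X)  # -> [[1.0, 1.0, 1.0], [1.0, 2.0, 4.0], [1.0, 3.0, 9.0]]
```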
Probabilistic Interpretation
High-level idea:
Any w is possible, but some w are more likely than others.
P(y^(i) | ŷ^(i)) = f_i(w), e.g., a Gaussian density centered at ŷ^(i) (this assumption recovers least squares).
Assuming independence of the training examples, the likelihood of
the training dataset is ∏_i f_i(w).
We shall choose the w* that maximizes the likelihood.
Maximum likelihood estimation (MLE)
If we also incorporate some prior on w, this becomes
maximum a posteriori estimation (MAP).
If we assume a Gaussian prior on w, this adds an ℓ₂
regularization term to the objective function.
Many models and their variants can be deemed different
ways of estimating P(y^(i) | ŷ^(i)).
Geometric Interpretation and the Closed Form Solution
Find w such that ‖y − Xw‖₂ is minimized.
What is Xw when X is fixed?
It is a point in the subspace (hyperplane) spanned by the d column vectors of X.
y in general is a vector outside this subspace, so the
minimum distance is achieved when Xw* is exactly the
projection of y onto it. Denoting the i-th column of X as X_i, this means

X_1^⊤(y − Xw) = 0
X_2^⊤(y − Xw) = 0
…
X_d^⊤(y − Xw) = 0
⟹ X^⊤(y − Xw) = 0
⟹ w = (X^⊤X)^(−1) X^⊤y = X⁺y   (X⁺: pseudo-inverse of X)
Illustration
[Figure: two 3-D plots (axes x, y, z).]
Logistic Regression
Special case: y^(i) ∈ { 0, 1 }.
Not appropriate to directly regress y^(i).
Rather, model y^(i) as the observed outcome of a Bernoulli
trial with an unknown parameter p_i.
How to model p_i?
We assume that p_i depends only on x^(i) (the i-th row X_i•) ⟹ rename p_i to p_x.
Still hard to estimate p_x reliably.
MLE: p_x = Pr[y = 1 | x] = E[y | x]
What can we say about p_{x+ε} when p_x is given?
Answer: we impose a linear relationship between p_x and x.
What about a simple linear model p_x = w^⊤x for some w?
(Note: all points share the same parameter w.)
Problem: mismatch of the domains: p_x ∈ [0, 1] vs. w^⊤x ∈ ℝ.
Solution: mean function / inverse of the link function:
g^(−1): ℝ → parameter space
Solution
Solution: link function g: parameter space → ℝ

g(p) = logit(p) ≜ log (p / (1 − p)) = w^⊤x    (3)

Equivalently, solving for p:

p = e^(w^⊤x) / (1 + e^(w^⊤x)) = 1 / (1 + e^(−w^⊤x)) = σ(w^⊤x)    (4)

where σ(z) = 1 / (1 + exp(−z)).
Recall that p_x = Pr[y = 1 | x].
The decision boundary is p ≥ 0.5.
Equivalent to whether w^⊤x ≥ 0. Hence, LR is a linear
classifier.
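A minimal sketch of the mean function and the resulting linear decision rule in plain Python (function names are illustrative):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) mean function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Classify as 1 iff p >= 0.5, i.e. iff w . x >= 0."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if sigmoid(z) >= 0.5 else 0

print(sigmoid(0.0))                      # -> 0.5 (the decision boundary)
print(predict([1.0, -2.0], [1.0, 0.2]))  # w.x = 1 - 0.4 = 0.6 >= 0 -> 1
```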
Learning the Parameter w
Consider a training data point x^(i).
Recall that the conditional probability Pr[y^(i) = 1 | x^(i)]
computed by the model is denoted by the shorthand notation
p (which is a function of w and x^(i)).
The likelihood of x^(i) is p if y^(i) = 1 and 1 − p otherwise; or equivalently, p^(y^(i)) (1 − p)^(1 − y^(i)).
Hence, the likelihood of the whole training dataset is

L(w) = ∏_{i=1}^n p(x^(i))^(y^(i)) (1 − p(x^(i)))^(1 − y^(i)).

The log-likelihood is (assume log ≜ ln)

ℓ(w) = ∑_{i=1}^n y^(i) log p(x^(i)) + (1 − y^(i)) log (1 − p(x^(i)))    (5)
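Equation (5) translates directly into code; a small sketch in plain Python (the data and weights below are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood of the data under logistic
    regression, i.e. equation (5) with natural log."""
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

# Two points, each with a bias feature x0 = 1.
X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
print(log_likelihood([0.0, 0.0], X, y))  # w = 0 gives p = 0.5, so 2*log(0.5)
```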
Learning the Parameter w
To maximize ℓ, notice that it is concave, so take its partial
derivatives:

∂ℓ(w)/∂w_j = ∑_{i=1}^n ( y^(i) · 1/p(x^(i)) · ∂p(x^(i))/∂w_j + (1 − y^(i)) · 1/(1 − p(x^(i))) · ∂(1 − p(x^(i)))/∂w_j )
           = ∑_{i=1}^n ( y^(i) x_j^(i) − p(x^(i)) x_j^(i) )

Setting them to 0 essentially means, for all j,

∑_{i=1}^n ŷ^(i) x_j^(i) = ∑_{i=1}^n p(x^(i)) x_j^(i) = ∑_{i=1}^n y^(i) x_j^(i)

(LHS: what the model predicts; RHS: what the data says)
Understand the Equilibrium
Consider one-dimensional x. The above condition simplifies to

∑_{i=1}^n p^(i) x^(i) = ∑_{i=1}^n y^(i) x^(i)

The RHS is essentially the sum of x values only for the
training data in class Y = 1.
The LHS says: if we use our learned model to assign each
training point a probability of belonging to class Y = 1,
the LHS is the expected sum of x values.
If this is still abstract, think of an example.
Numeric Solution
There is no closed-form solution to maximize ℓ.
Use the Gradient Ascent algorithm to maximize ℓ.
There are faster algorithms.
(Stochastic) Gradient Ascent
w is initialized to some random value (e.g., 0).
Since the gradient gives the steepest direction to increase a
function’s value, we move a small step in that direction,
i.e.,

w_j ← w_j + α ∂ℓ(w)/∂w_j,  or
w_j ← w_j + α ∑_{i=1}^n (y^(i) − p(x^(i))) x_j^(i)

where α (the learning rate) is usually a small constant, or
decreases over the epochs.
Stochastic version: use the gradient on a single randomly selected
training instance, i.e.,

w_j ← w_j + α (y^(i) − p(x^(i))) x_j^(i)
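A compact sketch of the batch update above in plain Python (the toy data and hyper-parameters are invented for illustration, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, alpha=0.1, epochs=500):
    """Batch gradient ascent on the log-likelihood:
    w_j <- w_j + alpha * sum_i (y_i - p_i) * x_ij."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grads = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xij in enumerate(xi):
                grads[j] += (yi - p) * xij
        w = [wj + alpha * g for wj, g in zip(w, grads)]
    return w

# Toy 1-D data with a bias feature x0 = 1: class 1 for larger x.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = fit_logistic(X, y)
print(sigmoid(w[0] + w[1] * 4.0))  # predicted Pr[y=1] at x = 4 (close to 1)
```

Note the data is linearly separable, so the likelihood has no finite maximizer and the weights keep growing slowly; stopping after a fixed number of epochs (or adding regularization, next slides) keeps them bounded.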
Newton’s Method
Gradient Ascent moves in the “right” direction a tiny step at a
time. Can we find a good step size?
Consider the 1D case: minimize f(x), with current point a.
f(x) ≈ f(a) + f′(a)(x − a) + f″(a)/2 (x − a)² for x near a.
To minimize this approximation, set its derivative with respect to x to 0, i.e.,

f′(a) · 1 + f″(a)/2 · 2(x − a) · 1 = f′(a) + f″(a)(x − a) = 0
⇔ x = a − f′(a)/f″(a)

Can be applied to the multi-dimensional case too ⇒ use
∇ (the gradient) and the Hessian.
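The 1D update x ← a − f′(a)/f″(a) can be sketched as follows (the example function is arbitrary, not from the slides):

```python
import math

def newton_minimize(f_prime, f_double_prime, x0, iters=20):
    """Repeat x <- x - f'(x)/f''(x); converges quadratically
    near a local minimum where f''(x) > 0."""
    x = x0
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Example: f(x) = x^2 + e^x, so f'(x) = 2x + e^x and f''(x) = 2 + e^x > 0.
x_star = newton_minimize(lambda x: 2 * x + math.exp(x),
                         lambda x: 2 + math.exp(x),
                         x0=0.0)
print(x_star)  # stationary point: f'(x_star) is essentially 0
```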
Regularization
Regularization is another method to deal with overfitting.
It is designed to penalize large values of the model parameters.
Hence it encourages simpler models, which are less likely to
overfit.
Instead of just maximizing ℓ(w), we maximize ℓ(w) − λR(w), penalizing large parameter values.
λ is a hyper-parameter that controls the strength of
regularization.
It is usually determined by cross-validation over a list of
possible values (e.g., 0.001, 0.01, 0.1, 1, 10, …)
Grid search: http://scikit-learn.org/stable/modules/grid_search.html
There are alternative methods.
R(w) quantifies the “size” of the model parameters. Popular
choices are:
L2 regularization (Ridge LR): R(w) = ‖w‖₂²
L1 regularization (Lasso LR): R(w) = ‖w‖₁
L1 regularization is more likely to result in sparse models.
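The two penalties differ only in how they measure the size of w; a tiny illustration in plain Python (the weight vector is made up):

```python
def l2_penalty(w):
    """Ridge penalty: squared Euclidean norm of the weights."""
    return sum(wj * wj for wj in w)

def l1_penalty(w):
    """Lasso penalty: sum of absolute weights; its optimization
    tends to drive individual weights exactly to zero."""
    return sum(abs(wj) for wj in w)

w = [3.0, -4.0, 0.0]
print(l2_penalty(w))  # -> 25.0
print(l1_penalty(w))  # -> 7.0
```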
Generalizing LR to Multiple Classes
LR can be generalized to multiple classes =⇒ MaxEnt.
Pr[c | x] ∝ exp(w_c^⊤x) ⟹ Pr[c | x] = exp(w_c^⊤x) / Z

Z is the normalization constant: Z = ∑_{c′ ∈ C} exp(w_{c′}^⊤x).
Let c* be the last class in C; then we can fix w_{c*} = 0.
Exercise: derive LR from MaxEnt. How?
Both belong to exponential or log-linear classifiers.
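The normalized probabilities above are just a softmax over the per-class scores w_c^⊤x; a brief sketch in plain Python (the class weights are invented):

```python
import math

def maxent_probs(ws, x):
    """Pr[c | x] = exp(w_c . x) / Z, with Z summing over all classes.
    Scores are shifted by their max for numerical stability (this
    cancels in the ratio)."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in ws]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

# Three classes; the last class's weights fixed to 0 as on the slide.
ws = [[1.0, 0.5], [-0.5, 1.0], [0.0, 0.0]]
probs = maxent_probs(ws, [1.0, 2.0])
print(probs)       # three class probabilities
print(sum(probs))  # sums to 1
```

With two classes and w_{c*} = 0, the ratio reduces to the sigmoid of w^⊤x, recovering binary LR.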
Further Reading
Andrew Ng’s note: http://cs229.stanford.edu/notes/cs229-notes1.pdf
Cosma Shalizi’s note: http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
Tom Mitchell’s book chapter: https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf