
Logistic Regression and MaxEnt

Wei Wang @ CSE, UNSW

April 8, 2018


Generative vs. Discriminative Learning

Generative models:

    Pr[y | x] = Pr[x | y] Pr[y] / Pr[x] ∝ Pr[x | y] Pr[y] = Pr[x, y]

The key is to model the generative probability Pr[x | y].
Example: Naive Bayes.

Discriminative models:

Model Pr[y | x] directly as g(x; θ).
Example: Decision tree, Logistic Regression.

Instance-based learning:

Example: kNN classifier.


Linear Regression

Figure: Linear Regression

Task

Input: (x^(i), y^(i)) pairs (1 ≤ i ≤ n)
Preprocess: augment each input into the vector x^(i) = [1, x^(i)]^T
Output: the best w = [w0, w1]^T such that ŷ = w^T x best explains the observations


Least Square Fit

The criterion for “best”:

Individual error: ε_i = ŷ^(i) − y^(i)
Sum of squared errors: ℓ = ∑_{i=1}^n ε_i^2

Find w such that ℓ is minimized.
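Written out for a data matrix X whose rows are the (x^(i))^T, this loss is a one-liner; a minimal NumPy sketch (the function name sse is mine, not from the slides):

import numpy as np

def sse(w, X, y):
    # ℓ(w) = Σ_i (ŷ^(i) − y^(i))^2, where ŷ = Xw
    residuals = X @ w - y
    return float(residuals @ residuals)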


Minimizing a Function

Taylor Series of f(x) at point a:

    f(x) = ∑_{n=0}^{+∞} f^(n)(a) / n! · (x − a)^n                                (1)
         = f(a) + f′(a)·(x − a) + f′′(a)/2 · (x − a)^2 + o((x − a)^2)             (2)

Intuitively, f(x) is almost f(a) + f′(a)·(x − a) when x is close to a.

If f(x) has a local minimum at x∗, then

    f′(x∗) = 0, and
    f′′(x∗) > 0.

The smallest of the local minima is the global minimum if it is smaller than the function values at all the boundary points.

Intuitively, f(x) is almost f(a) + f′′(a)/2 · (x − a)^2 if a is close to x∗.
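As a quick sanity check of these conditions, here is a small SymPy sketch; the function f(x) = x^3 − 3x is my own example, not from the slides:

import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                              # example function, not from the slides

critical = sp.solve(sp.diff(f, x), x)       # f'(x) = 3x^2 - 3 = 0  ->  x = -1, 1
second = sp.diff(f, x, 2)                   # f''(x) = 6x

for c in critical:
    kind = 'local min' if second.subs(x, c) > 0 else 'local max'
    print(c, kind)                          # -1: local max, 1: local min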


Find the Least Square Fit for Linear Regression

    ∂ℓ/∂w_j = ∑_{i=1}^n 2 ε_i · ∂ε_i/∂w_j
            = ∑_{i=1}^n 2 ε_i · ∂(w^T x^(i))/∂w_j
            = ∑_{i=1}^n 2 ε_i x_j^(i)
            = 2 ∑_{i=1}^n (ŷ^(i) − y^(i)) x_j^(i)

Setting the above to 0 essentially requires, for all j,

    ∑_{i=1}^n ŷ^(i) x_j^(i)   =   ∑_{i=1}^n y^(i) x_j^(i)
    (what the model predicts)     (what the data says)


Find the Least Square Fit for Linear Regression

In the simple 1D case, we have only two parameters, w = [w0 ; w1]:

    ∑_{i=1}^n (w0 + w1 x_1^(i)) · x_0^(i)  =  ∑_{i=1}^n y^(i) x_0^(i)
    ∑_{i=1}^n (w0 + w1 x_1^(i)) · x_1^(i)  =  ∑_{i=1}^n y^(i) x_1^(i)

Since x_0^(i) = 1, these are essentially

    ∑_{i=1}^n (w0 + w1 x_1^(i)) · 1        =  ∑_{i=1}^n y^(i) · 1
    ∑_{i=1}^n (w0 + w1 x_1^(i)) · x_1^(i)  =  ∑_{i=1}^n y^(i) · x_1^(i)


Example

Using the same example as in
https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics):

    X = [ (x^(1))^T ; (x^(2))^T ; (x^(3))^T ; (x^(4))^T ]
      = [ 1 1 ; 1 2 ; 1 3 ; 1 4 ]        w = [ w0 ; w1 ]        y = [ 6 ; 5 ; 7 ; 10 ]

(rows separated by “;”). The model’s predictions stack into

    [ 1 1 ; 1 2 ; 1 3 ; 1 4 ] w = [ ŷ^(1) ; ŷ^(2) ; ŷ^(3) ; ŷ^(4) ],

which should be as close to y = [ 6 ; 5 ; 7 ; 10 ] as possible.
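A minimal NumPy sketch that solves this example via the normal equations from the previous slides (the variable names are mine):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([6, 5, 7, 10], dtype=float)

# Solve the normal equations  X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)            # approximately [3.5, 1.4], i.e. ŷ = 3.5 + 1.4 x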





Generalization to m-dim

Easily generalizes to more than 2-dim:

    X = [ 1  x_1^(1)  ...  x_m^(1)
          1  ...      ...  ...
          1  x_1^(i)  ...  x_m^(i)
          1  ...      ...  ...
          1  x_1^(n)  ...  x_m^(n) ]

    w = [ w0 ; w1 ; ... ; wm ]        y = [ y^(1) ; ... ; y^(i) ; ... ; y^(n) ]

How to perform polynomial regression for a one-dimensional x?

    ŷ = w0 + w1 x + w2 x^2 + ... + wm x^m

Let x_j^(i) = (x_1^(i))^j  =⇒  polynomial least-squares fitting
(http://mathworld.wolfram.com/LeastSquaresFittingPolynomial.html)
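A minimal sketch of this feature-expansion trick; the toy data and the degree are mine, not from the slides:

import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.8, 3.1, 5.2, 8.9])     # toy data, not from the slides
m = 2                                        # polynomial degree

# Column j holds x^j, so the first column is all 1s (j = 0).
X = np.vander(x, m + 1, increasing=True)

# Ordinary least squares on the expanded features.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                     # coefficients w0, w1, w2 of ŷ = w0 + w1 x + w2 x^2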


Probabilistic Interpretation

High-level idea:

Any w is possible, but some w is most likely.

P(y^(i) | ŷ^(i)) = ________ = f_i(w)
Assuming independence of the training examples, the likelihood of the training dataset is ∏_i f_i(w).

We shall choose the w∗ that maximizes the likelihood.
Maximum likelihood estimation (MLE)
If we also incorporate some prior on w, this becomes maximum a posteriori estimation (MAP).

If we assume a Gaussian prior on w, this adds an ℓ2 regularization term to the objective function.

Many models and their variants can be viewed as different ways of estimating P(y^(i) | ŷ^(i)).


Geometric Interpretation and the Closed Form Solution

Find w such that ‖y − Xw‖^2 is minimized.

What is Xw when X is fixed?
It is the hyperplane spanned by the d column vectors of X.

y in general is a vector outside the hyperplane. So the minimum distance is achieved when Xw∗ is exactly the projection of y on the hyperplane. This means (denote the i-th column of X as X_i):

    X_1^T (y − Xw) = 0
    X_2^T (y − Xw) = 0
    ...
    X_d^T (y − Xw) = 0        =⇒  X^T (y − Xw) = 0

    w = (X^T X)^{-1} X^T y = X^+ y        (X^+: the pseudo-inverse of X)
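Continuing the 4-point example, a minimal NumPy sketch of the closed form (np.linalg.pinv computes the pseudo-inverse X^+):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([6, 5, 7, 10], dtype=float)

w = np.linalg.pinv(X) @ y       # w = X^+ y
print(w)                        # again approximately [3.5, 1.4]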


Illustration

[Figure: two 3D plots with axes x, y, and z.]


Logistic Regression

Special case: y^(i) ∈ { 0, 1 }.
Not appropriate to directly regress y^(i).

Rather, model y^(i) as the observed outcome of a Bernoulli trial with an unknown parameter p_i.
How to model p_i?

We assume that p_i depends on x (the i-th row X_i• of X) =⇒ rename p_i to p_x.
Still hard to estimate p_x reliably.

MLE: p_x = E[y = 1 | x]
What can we say about p_{x+ε} when p_x is given?

Answer: we impose a linear relationship between p_x and x.
What about a simple linear model p_x = w^T x for some w?
(Note: all points share the same parameter w)
Problem: mismatch of the domains: ___ vs ___
Solution: mean function / inverse of link function: g^{-1}: ℝ → params

Solution

Solution: link function g(params) → ℝ

    g(p) = logit(p) ≜ log [ p / (1 − p) ] = w^T x                         (3)

Equivalently, solve for p:

    p = e^{w^T x} / (1 + e^{w^T x}) = 1 / (1 + e^{−w^T x}) = σ(w^T x)     (4)

where σ(z) = 1 / (1 + exp(−z)).

Recall that p_x = E[y = 1 | x].
Decision boundary is p ≥ 0.5.

Equivalent to whether w^T x ≥ 0. Hence, LR is a linear classifier.
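A minimal sketch of the resulting classifier (the function names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # p(x) = σ(w^T x) for each row of X (the first column of X is assumed to be 1)
    return sigmoid(X @ w)

def predict(w, X):
    # Decision boundary p >= 0.5, i.e. w^T x >= 0
    return (X @ w >= 0).astype(int)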


Learning the Parameter w

Consider a training data point x^(i).
Recall that the conditional probability Pr[y^(i) = 1 | x^(i)] computed by the model is denoted by the shorthand notation p (which is a function of w and x^(i)).

The likelihood of x^(i) is

    p        if y^(i) = 1
    1 − p    otherwise,

or equivalently, p^{y^(i)} (1 − p)^{1 − y^(i)}.

Hence, the likelihood of the whole training dataset is

    L(w) = ∏_{i=1}^n p(x^(i))^{y^(i)} (1 − p(x^(i)))^{1 − y^(i)}.

The log-likelihood is (assume log ≜ ln)

    ℓ(w) = ∑_{i=1}^n [ y^(i) log p(x^(i)) + (1 − y^(i)) log (1 − p(x^(i))) ]     (5)
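A direct NumPy transcription of equation (5), as a minimal sketch (the function name is mine):

import numpy as np

def log_likelihood(w, X, y):
    # ℓ(w) = Σ_i y^(i) log p(x^(i)) + (1 - y^(i)) log(1 - p(x^(i)))
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))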


Learning the Parameter w

To maximize ℓ, notice that it is concave. So take its partial derivatives

    ∂ℓ(w)/∂w_j = ∑_{i=1}^n [ y^(i) · (1 / p(x^(i))) · ∂p(x^(i))/∂w_j
                              + (1 − y^(i)) · (1 / (1 − p(x^(i)))) · ∂(1 − p(x^(i)))/∂w_j ]
               = ∑_{i=1}^n ( x_j^(i) y^(i) − x_j^(i) p(x^(i)) )

and setting them to 0 essentially means, for all j,

    ∑_{i=1}^n ŷ^(i) x_j^(i)  =  ∑_{i=1}^n p(x^(i)) x_j^(i)  =  ∑_{i=1}^n y^(i) x_j^(i)
    (what the model predicts)                                   (what the data says)


Understand the Equilibrium

Consider one-dimensional x. The above condition simplifies to

    ∑_{i=1}^n p^(i) x^(i) = ∑_{i=1}^n y^(i) x^(i)

The RHS is essentially the sum of the x values of the training data in class Y = 1.

The LHS says: if we use our learned model to assign each training point a probability of belonging to class Y = 1, the LHS is the expected sum of x values.

If this is still abstract, think of an example; the gradient-ascent sketch below checks this equation numerically on a toy dataset.


Numeric Solution

There is no closed-form solution to maximize ℓ.

Use the Gradient Ascent algorithm to maximize ℓ.

There are faster algorithms.


(Stochastic) Gradient Ascent

w is initialized to some random value (e.g., 0).

Since the gradient gives the steepest direction to increase a function’s value, we move a small step in that direction, i.e.,

    w_j ← w_j + α · ∂ℓ(w)/∂w_j ,   or
    w_j ← w_j + α · ∑_{i=1}^n (y^(i) − p(x^(i))) x_j^(i)

where α (the learning rate) is usually a small constant, or decreasing over the epochs.

Stochastic version: use the gradient on a randomly selected training instance, i.e.,

    w_j ← w_j + α · (y^(i) − p(x^(i))) x_j^(i)
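A minimal batch gradient-ascent sketch on a toy 1D dataset (the data and names are mine); the final print also shows the two sides of the equilibrium condition from the earlier slide:

import numpy as np

# Toy 1D data with a prepended constant-1 feature.
X = np.array([[1, 0.5], [1, 1.0], [1, 2.0], [1, 2.5], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 0, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
alpha = 0.1                                   # learning rate
for epoch in range(10000):                    # batch gradient ascent on ℓ(w)
    p = sigmoid(X @ w)
    w += alpha * X.T @ (y - p)                # w_j += α Σ_i (y^(i) - p(x^(i))) x_j^(i)

p = sigmoid(X @ w)
print(w)
print(X.T @ p, X.T @ y)                       # the two sides nearly coincide at convergence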


Newton’s Method

Gradient Ascent moves in the “right” direction a tiny step at a time. Can we find a good step size?

Consider the 1D case: minimize f(x), with the current point at a.

    f(x) ≈ f(a) + f′(a)·(x − a) + f′′(a)/2 · (x − a)^2   for x near a.

To minimize f(x), take ∂f(x)/∂x = 0, i.e.,

    ∂f(x)/∂x = 0
    ⇔ f′(a) · 1 + f′′(a)/2 · 2(x − a) · 1 = f′(a) + f′′(a)(x − a) = 0
    ⇔ x = a − f′(a) / f′′(a)

Can be applied to multi-dimensional cases too ⇒ need to use ∇ (the gradient) and the Hessian.
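A minimal 1D sketch of this update; the example function f(x) = x^4 − 3x^3 + 2 is mine, not from the slides:

def newton_1d(f_prime, f_double_prime, a, iters=20):
    # Repeatedly apply  a <- a - f'(a) / f''(a)
    for _ in range(iters):
        a = a - f_prime(a) / f_double_prime(a)
    return a

# Example: f(x) = x^4 - 3x^3 + 2, so f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
x_star = newton_1d(lambda x: 4*x**3 - 9*x**2,
                   lambda x: 12*x**2 - 18*x,
                   a=3.0)
print(x_star)        # converges to 9/4 = 2.25, where f'(x) = 0 and f''(x) > 0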


Regularization

Regularization is another method to deal with overfitting.

It is designed to penalize large values of the model parameters. Hence it encourages simpler models, which are less likely to overfit.

Instead of optimizing ℓ(w) alone, we optimize ℓ(w) − λR(w) (equivalently, minimize −ℓ(w) + λR(w)).
λ is a hyper-parameter that controls the strength of regularization.

It is usually determined by cross-validating over a list of possible values (e.g., 0.001, 0.01, 0.1, 1, 10, ...)
Grid search: http://scikit-learn.org/stable/modules/grid_search.html

There are alternative methods.

R(w) quantifies the “size” of the model parameters. Popular choices are:

L2 regularization (Ridge LR): R(w) = ‖w‖₂²
L1 regularization (Lasso LR): R(w) = ‖w‖₁
L1 regularization is more likely to result in sparse models.
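A hedged scikit-learn sketch of picking the regularization strength by grid search. Note that sklearn’s LogisticRegression is parameterized by C, the inverse of the regularization strength, rather than by λ itself; the data here is a toy placeholder:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data: replace with a real dataset.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200) > 0).astype(int)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}   # C ~ 1/λ
search = GridSearchCV(LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)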


Generalizing LR to Multiple Classes

LR can be generalized to multiple classes =⇒ MaxEnt.

    Pr[c | x] ∝ exp(w_c^T x)   =⇒   Pr[c | x] = exp(w_c^T x) / Z

Z is the normalization constant. Z = ?

Let c∗ be the last class in C; then w_{c∗} = 0.

Derive LR from MaxEnt. How?

Both belong to the family of exponential or log-linear classifiers.
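A minimal sketch of the MaxEnt (softmax) probabilities, with the last class’s weight vector pinned to 0 as above (the names are mine):

import numpy as np

def maxent_proba(W, x):
    # Pr[c | x] = exp(w_c^T x) / Z, where Z sums over all classes.
    # W has one row per class; per the slide, the last row can be fixed to 0.
    scores = W @ x                       # w_c^T x for each class c
    scores -= scores.max()               # for numerical stability (does not change the ratios)
    exps = np.exp(scores)
    return exps / exps.sum()             # divide by the normalization constant Z

# Example with 3 classes and a 2-dimensional (already augmented) x.
W = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [0.0,  0.0]])              # last class pinned to w = 0
print(maxent_proba(W, np.array([1.0, 2.0])))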


Further Reading

Andrew Ng’s note:
http://cs229.stanford.edu/notes/cs229-notes1.pdf

Cosma Shalizi’s note:
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf

Tom Mitchell’s book chapter:
https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
