Predictive Analytics
Week 5: Model Selection and Estimation II
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
Week 5: Model Selection and Estimation II
1. Maximum likelihood for regression
2. Maximum likelihood estimation with gradient ascent
Reading: Chapter 5.1 of ISL to review cross-validation.
Exercise questions: Chapter 5.4 of ISL, Q3. Try to implement
gradient ascent in Python.
Notation
• Let p(y;θ) denote a probability mass function or density
function with associated parameter vector θ.
• Y1, Y2, . . . , YN are independent random variables from this
distribution.
• D = {y1, . . . , yN} are the actual observed sample values.
• θ̂(D) is an estimator of θ.
• θ̂ denotes either an estimator (as above) or an estimate of θ,
depending on the context.
Maximum likelihood for regression
The Gaussian linear regression model
• In Lecture 2, most of our regression analysis did not make any
distributional assumptions about the regression errors. Even
though we assumed conditions such as E(ε|X) = 0 and
Var(ε|X) = σ², we left the probability distribution of ε
unspecified.
• We managed to learn quite a lot from these minimal
assumptions. For example, the assumptions on the
conditional mean and variance of the errors naturally lead us
to the mean and variance of the OLS estimator.
• But we may want to learn more. For example, what is the full
sampling distribution of the OLS estimator? Knowing this
distribution is necessary for making probability statements
about the uncertainty in this estimator.
The Gaussian linear regression model
We now add the assumption that ε ∼ N(0, σ²), leading to the
model equation

Y = β0 + ∑_{j=1}^p βj Xj + ε,   ε ∼ N(0, σ²).

A key feature of the Gaussian linear regression model is that it
gives us the full form of the conditional distribution of Y:

Y | X = x ∼ N( β0 + ∑_{j=1}^p βj xj , σ² )
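This conditional distribution is easy to simulate from, which is useful for checking estimators later. Below is a minimal sketch in NumPy; the sample size, parameter values, and variable names are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 200, 3                          # sample size and number of features (illustrative)
beta0 = 1.0                            # illustrative intercept
beta = np.array([2.0, -1.0, 0.5])      # illustrative slope coefficients
sigma = 0.8                            # illustrative error standard deviation

X = rng.normal(size=(N, p))            # feature matrix
eps = rng.normal(0.0, sigma, size=N)   # errors: eps ~ N(0, sigma^2)
y = beta0 + X @ beta + eps             # so Y | X = x ~ N(beta0 + sum_j beta_j x_j, sigma^2)
```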
Maximum Likelihood Estimation (MLE)
• Maximum likelihood (ML) estimation is available when we
specify a full probabilistic model for the population. This is a
new concept which we now introduce for the specific case of
the Gaussian linear regression model.
• Intuitively, ML estimation chooses the values of the
parameters that maximise the likelihood of the observed data
under the model (for discrete data the likelihood is the
probability of the data, but our response Y is continuous).
• ML is one of the most widely used estimation techniques in
statistics.
Normal probability density function
Recall the formula for the normal probability density function
(PDF) from basic statistics:
p(y) = (1/√(2πσ²)) exp( −(y − µ)² / (2σ²) )
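As a quick sanity check, the formula can be coded directly and compared with scipy.stats.norm.pdf. The values of y, µ, and σ below are arbitrary; this is only an illustrative sketch:

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(y, mu, sigma2):
    """Normal density evaluated with the formula above."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

y, mu, sigma = 1.3, 0.5, 2.0                 # arbitrary values for illustration
print(normal_pdf(y, mu, sigma ** 2))         # direct evaluation of the formula
print(norm.pdf(y, loc=mu, scale=sigma))      # reference value from scipy
```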
Normal distribution
[Figure: normal distributions with different µ and σ values.]
The likelihood function
Since

Yi | Xi = xi ∼ N( β0 + ∑_{j=1}^p βj xij , σ² ),

the density for an observed value yi is

p(yi | xi; β, σ²) = (1/√(2πσ²)) exp( −(yi − β0 − ∑_{j=1}^p βj xij)² / (2σ²) )
The likelihood function
The likelihood function is the joint PDF of the data evaluated at
the sample values. In our Gaussian linear regression model,
Assumption 4 (independence) of Lecture 2 slide 29 implies that we
can multiply the PDFs for each observation:
p(y; β, σ²) = ∏_{i=1}^N (1/√(2πσ²)) exp( −(yi − β0 − ∑_{j=1}^p βj xij)² / (2σ²) )
The log-likelihood function
The complete log-likelihood is the log-density of the observed
samples,
L(β, σ²) = log ∏_{i=1}^N p(yi; β, σ²) = ∑_{i=1}^N log p(yi; β, σ²)

         = −(N/2) log(2π) − (N/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij )²
• What are the advantages of taking the log here?
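One advantage can be seen numerically: multiplying many densities quickly underflows to zero in floating point, while summing log-densities stays well behaved (the log also turns the product into a sum, which is easier to differentiate). The sketch below is illustrative, using simulated data and arbitrary parameter values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, p = 2000, 3
X = rng.normal(size=(N, p))
beta0, beta, sigma = 1.0, np.array([2.0, -1.0, 0.5]), 0.8
y = beta0 + X @ beta + rng.normal(0.0, sigma, size=N)

mean = beta0 + X @ beta                                          # conditional means
likelihood = np.prod(norm.pdf(y, loc=mean, scale=sigma))         # product of N densities
log_likelihood = np.sum(norm.logpdf(y, loc=mean, scale=sigma))   # sum of N log-densities

print(likelihood)       # 0.0: the product underflows for large N
print(log_likelihood)   # a finite, usable number
```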
Maximum likelihood estimation
We maximise the log-likelihood as a function of the parameters:

max_{β, σ²} L(β, σ²),

where

L(β, σ²) = −(N/2) log(2π) − (N/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij )²
Note that the last term corresponds to RSS(β) multiplied by a
negative factor.
Maximum likelihood estimation
We can simplify the log-likelihood function as below, since the
removed terms (and the change of scale) will not affect our estimates of β. Why?

max_β L(β),

where

L(β) = −(1/(2N)) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij )²

Therefore, the ML estimator (which maximises L(β)) is equivalent to
the OLS estimator (which minimises the RSS) for this model.
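The equivalence can also be verified numerically. The sketch below is illustrative: it simulates data, maximises the simplified log-likelihood by minimising its negative with scipy.optimize.minimize, and compares the result with the OLS solution from np.linalg.lstsq.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, p = 300, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(0.0, 0.5, size=N)

def neg_log_lik(beta):
    # negative of the simplified log-likelihood: (1/2N) * RSS(beta)
    resid = y - X @ beta
    return np.sum(resid ** 2) / (2 * N)

beta_ml = minimize(neg_log_lik, x0=np.zeros(p + 1)).x       # ML estimate (numerical maximisation)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]             # OLS estimate

print(np.allclose(beta_ml, beta_ols, atol=1e-3))            # True: the two estimates agree
```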
Discussion
So far, what are the take-aways?
• We added a new estimation principle to our toolbox and
started by understanding it in this simple case.
• ML estimation is broadly applicable and we will use it
extensively for supervised learning.
• We need the concept of a log-likelihood for certain model
selection methods.
Discussion
When discussing statistical decision theory, we defined the optimal
prediction rule

δ(x) = argmin_{f(·)} E[ L(Y, f(x)) | X = x ]

A probabilistic model estimated by ML will allow us to directly
approximate the optimal prediction for any loss, since it estimates
the full conditional distribution P(Y | X = x).
Maximum likelihood estimation with gradient ascent
Maximum likelihood estimation
• How do we implement maximum likelihood estimation (MLE)
for regression?
Gradient ascent
• Gradient descent is a first-order iterative optimization
algorithm for finding the minimum of a function.
• To find a local minimum of a function using gradient descent,
we take steps proportional to the negative of the gradient
(or approximate gradient) of the function at the current point.
• If instead we take steps proportional to the positive of the
gradient, we approach a local maximum of that function;
the procedure is then known as gradient ascent.
Source: Wikipedia.
Motivating example
Suppose below is the plot of the log-likelihood function of a simple
linear regression without an intercept term (β0 is not included for
simplicity):
[Figure: the log-likelihood plotted against the single slope coefficient β1.]
Maximum Likelihood Estimation Intuition
Figure 1: An example surface plot of the log-likelihood function
L(β). The gradient vector points in the direction in which L(β)
increases.
Maximum Likelihood Estimation
Based on the plot in the previous slide:
• At the current value of β, we move up the hill in the direction of
∂L(β)/∂β.
• The step size α controls the jump.
• If α is too large, we might jump over the optimal point.
• If α is too small, we might move too slowly.
• What is the dimension of β in the above plot? What would the
plot look like if we reduced the dimension of β by 1?
Gradient ascent algorithm
Algorithm: Gradient ascent algorithm for maximum likelihood estimation
1: Initialization: start from some initial guess.
2: Iterate the following update until convergence, e.g. until the
likelihood update is less than a threshold:

β := β + α ∂L(β)/∂β

• ∂L(β)/∂β is the gradient vector of the function L(β). The
gradient vector points in the direction in which the likelihood
function increases. := is the assignment operation.
• α is called the learning rate: a small positive number that
controls the jump in that direction. How do we choose α?
• α can also be changed with the iteration step, e.g.
αt = 1/(1 + t), where t is the iteration step.
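A minimal Python sketch of this update rule, applied to a simple concave function so that the behaviour is easy to see. The function, the starting value, the learning rate, and the stopping threshold are all illustrative choices:

```python
def L(beta):
    """A simple concave function with its maximum at beta = 3 (illustrative)."""
    return -(beta - 3.0) ** 2

def grad_L(beta):
    """Derivative of L at beta."""
    return -2.0 * (beta - 3.0)

beta = 0.0                 # initialization: some initial guess
alpha = 0.1                # learning rate
old_value = L(beta)

while True:
    beta = beta + alpha * grad_L(beta)        # beta := beta + alpha * dL/dbeta
    new_value = L(beta)
    if abs(new_value - old_value) < 1e-10:    # stop when the likelihood update is tiny
        break
    old_value = new_value

print(beta)                # close to the maximiser, 3.0
```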
Gradient ascent illustration
If the starting point of β1 is to the left of the local maximum,
β1 is updated to be larger and larger (positive gradient).
Gradient ascent illustration
If the starting point of β1 is to the right of the local maximum,
β1 is updated to be smaller and smaller (negative gradient).
Calculating the gradient
Given the log-likelihood function

L(β) = −(1/(2N)) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij )²,

we need to calculate the gradient (the vector of partial derivatives)
with respect to all the parameters:

β0, β1, β2, . . . , βp
Calculating the gradient
The gradient vector provides the direction of update for each
parameter:

∂L(β)/∂β0 = (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij )

∂L(β)/∂β1 = (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij ) xi1

∂L(β)/∂β2 = (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij ) xi2

…

∂L(β)/∂βp = (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij ) xip
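One way to check such formulas is to compare the analytic gradient with a finite-difference approximation of L(β). The sketch below is illustrative: it uses randomly generated y, X (with no intercept column), and a current parameter vector beta = (β0, β1, . . . , βp).

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = rng.normal(size=(N, p))                  # features (no intercept column here)
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)                # current values of (beta0, beta1, ..., betap)

def L(b):
    resid = y - b[0] - X @ b[1:]
    return -np.sum(resid ** 2) / (2 * N)     # L(beta) = -(1/2N) * RSS(beta)

def analytic_grad(b):
    resid = y - b[0] - X @ b[1:]
    return np.concatenate(([np.mean(resid)], X.T @ resid / N))   # the formulas on this slide

def numeric_grad(b, h=1e-6):
    g = np.zeros_like(b)
    for j in range(len(b)):
        e = np.zeros_like(b)
        e[j] = h
        g[j] = (L(b + e) - L(b - e)) / (2 * h)   # central finite difference
    return g

print(np.allclose(analytic_grad(beta), numeric_grad(beta), atol=1e-6))   # True if the formulas match
```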
Gradient Ascent
So all the parameters are updated simultaneously:

β0 := β0 + α ∂L(β)/∂β0 = β0 + α (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij )

β1 := β1 + α ∂L(β)/∂β1 = β1 + α (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij ) xi1

β2 := β2 + α ∂L(β)/∂β2 = β2 + α (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij ) xi2

…

βp := βp + α ∂L(β)/∂βp = βp + α (1/N) ∑_{i=1}^N ( yi − β0 − ∑_{j=1}^p βj xij ) xip
Gradient Ascent in matrix form
We can write the gradient ascent updates from the previous slides for
linear regression with multiple features in matrix form. The
matrix form is much simpler. First define:

y = [ y1, y2, . . . , yN ]ᵀ   (1)

f(X) = [ f(x1), f(x2), . . . , f(xN) ]ᵀ,   where f(xi) = β0 + ∑_{j=1}^p βj xij   (2)
Gradient Ascent in matrix form
Further define:

∂L(β)/∂β = [ ∂L(β)/∂β0, ∂L(β)/∂β1, . . . , ∂L(β)/∂βp ]ᵀ   (3)

Then it can be shown (essentially by re-writing the element-wise
gradients given earlier) that

∂L(β)/∂β = (1/N) Xᵀ ( y − f(X) )
Gradient Ascent in matrix form
Hence gradient ascent in matrix form is:

β := β + (α/N) Xᵀ ( y − f(X) )

• Note the size of each matrix and vector in the above formula.
• Try to implement this in Python (see the sketch below).
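A minimal sketch of the matrix-form update in Python. It assumes X is the N × (p + 1) design matrix whose first column is all ones (so β0 is updated along with the other coefficients); the simulated data, learning rate, and number of iterations are illustrative choices rather than part of the lecture.

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.1, n_iter=5000):
    """Maximise L(beta) = -(1/2N) * sum_i (y_i - x_i' beta)^2 by gradient ascent."""
    N, d = X.shape
    beta = np.zeros(d)                      # initialization: start from zero
    for _ in range(n_iter):
        grad = X.T @ (y - X @ beta) / N     # dL/dbeta = (1/N) X^T (y - f(X))
        beta = beta + alpha * grad          # beta := beta + (alpha/N) X^T (y - f(X))
    return beta

# Illustrative usage on simulated data
rng = np.random.default_rng(3)
N, p = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(0.0, 0.5, size=N)

print(gradient_ascent(X, y))                 # gradient ascent estimate
print(np.linalg.lstsq(X, y, rcond=None)[0])  # OLS estimate for comparison
```

If α is too large the iterates diverge; if it is too small, many more iterations are needed before the two printed vectors agree.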
Why local maximum?
[Figure: a function with several stationary points labelled A–E and two starting points marked by a red dot and a blue dot.]
• If the starting point is the red dot, then gradient descent can only
converge to local minimum B.
• If the starting point is the blue dot, then gradient descent can only
converge to local minimum A.
• The gradients at D and E are 0. The gradients at A, B and C are also 0.
Review questions
• What is maximum likelihood?
• Try to derive the log-likelihood function of the Gaussian linear
regression model.
• What is gradient ascent and how does it work?