Lecture 7 (part 1): Iterative Optimization with Gradient Descent
COMP90049
Introduction to Machine Learning
Semester 1, 2020
Lea Frermann, CIS
Roadmap
So far…
• Naive Bayes Classifier – theory and practice
• MLE estimation of parameters
• Exact optimization
Now: Quick aside on iterative optimization
• Gradient Descent
• Global and local optima
Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predicting housing prices from a weighted combination of house age
and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest
data log likelihood
Objective functions
Find parameter values θ that maximize (or minimize) the value of a function f(θ)
• we want to find the extreme points of the objective function.
Depending on our target, this could be
• …the maximum
E.g., the maximum profit of our shoe shop
E.g., the largest possible (log) likelihood of the data
θ̂ = argmax_θ f(θ)
• …or the minimum (in which case we often call f a loss function)
E.g., the smallest possible classification error
θ̂ = argmin_θ f(θ)
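• Note: the two views are interchangeable, since maximizing f(θ) is the same as minimizing −f(θ); e.g., maximizing the data log likelihood is equivalent to minimizing the negative log likelihood: argmax_θ f(θ) = argmin_θ [−f(θ)]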
Recipe for finding Minima / Maxima
1. Define your function of interest f (x) (e.g., data log likelihood)
2. Compute its first derivative wrt its input x
3. Set the derivative to zero
4. Solve for x
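For example, for f(x) = (x − 3)² the first derivative is ∂f/∂x = 2(x − 3); setting 2(x − 3) = 0 and solving gives x = 3, the minimum of f.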
Closed-form vs Iterative Optimization
Closed-form solutions
• Previously, we computed the closed-form solution for the MLE of the binomial distribution
• We follow our recipe and arrive at a single solution
Unfortunately, life is not always as easy
• Often, no closed-form solution exists
• Instead, we have to iteratively improve our estimate of θ̂ until we arrive
at a satisfactory solution
• Gradient descent is one popular iterative optimization method
‘Descending’ the function to find the Optimum
• 1-dimensional case: find parameter θ that minimizes the function
• follow the curvature of the line step by step
‘Descending’ the function to find the Optimum
• 2-dimensional case: find parameters θ = [θ0, θ1] that minimize the
function J
• follow the curvature step by step along the steepest path
Source: https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536
Gradient Descent: Intuition
Intuition
• Descending a mountain (i.e., our function) as fast as possible: at every position, take the next step that moves you most directly into the valley
• We compute a series of solutions θ^(0), θ^(1), θ^(2), … by ‘walking’ along the function and taking steps in the direction of the steepest local slope (or gradient)
• each solution depends on the current location
Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse the loss function step by step (‘descending into a valley’)
• we would like an algorithm that tells us how to update our parameters: θ ← θ + Δθ
• Δθ is based on the derivative: a measure of how much the function changes given a change in θ
• for a function f(θ), ∂f/∂θ tells us how much f changes in response to a change in θ
• the derivative measures the slope or gradient of the function f at point θ
• if ∂f/∂θ > 0: f(θ) increases as θ increases
• if ∂f/∂θ < 0: f(θ) increases as θ decreases
• if ∂f/∂θ = 0: we are at a minimum (or maximum)
• so, to approach the minimum, update θ ← θ − η ∂f/∂θ
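As a quick illustration: for f(θ) = θ² we have ∂f/∂θ = 2θ. At θ = 2 the derivative is 4 > 0, so with learning rate η = 0.1 the update gives θ ← 2 − 0.1 · 4 = 1.6, a step towards the minimum at θ = 0.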
Gradient Descent for multiple parameters
• Usually, our models have several parameters which need to be optimized to minimize the error
• We compute partial derivatives of f(θ) w.r.t. the individual θ_i
• Partial derivatives measure the change in a function of multiple parameters given a change in a single parameter, with all others held constant
• For example, for f(θ_1, θ_2) we can compute ∂f/∂θ_1 and ∂f/∂θ_2
• We then update each parameter individually
θ_1 ← θ_1 + Δθ_1 with Δθ_1 = −η ∂f/∂θ_1
θ_2 ← θ_2 + Δθ_2 with Δθ_2 = −η ∂f/∂θ_2
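For instance, for a made-up function f(θ_1, θ_2) = θ_1² + 3 θ_1 θ_2, the partial derivatives are ∂f/∂θ_1 = 2 θ_1 + 3 θ_2 (treating θ_2 as a constant) and ∂f/∂θ_2 = 3 θ_1 (treating θ_1 as a constant).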
Gradient Descent: Recipe
Recipe for Gradient Descent (single parameter)
1: Define objective function f(θ)
2: Initialize parameter θ^(0)
3: for iteration t ∈ {0, 1, 2, …, T} do
4:   Compute the first derivative of f at the current point θ^(t): ∂f/∂θ^(t)
5:   Update your parameter by subtracting the (scaled) derivative: θ^(t+1) ← θ^(t) − η ∂f/∂θ^(t)
• η is the step size or learning rate, a parameter we have to choose
• When to stop? Fix the number of iterations, or define other stopping criteria
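A minimal sketch of this recipe in Python, with an arbitrary example objective f(θ) = (θ − 3)² and arbitrary choices for the learning rate and number of iterations:

# objective f(theta) = (theta - 3)^2 and its derivative
def f(theta):
    return (theta - 3) ** 2

def df(theta):
    return 2 * (theta - 3)

eta = 0.1      # learning rate / step size
theta = 0.0    # initial parameter value theta^(0)
for t in range(100):                  # fixed number of iterations T
    theta = theta - eta * df(theta)   # theta^(t+1) <- theta^(t) - eta * df/dtheta
print(theta, f(theta))                # theta approaches 3, the minimum found in closed form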
Gradient Descent: Recipe
Recipe for Gradient Descent (multiple parameters)
1: Define objective function f(θ)
2: Initialize parameters {θ_1^(0), θ_2^(0), θ_3^(0), …}
3: for iteration t ∈ {0, 1, 2, …, T} do
4:   Initialize vector of gradients ← []
5:   for parameter i ∈ {1, 2, 3, …, F} do
6:     Compute the first derivative of f at the current point θ_i^(t): ∂f/∂θ_i^(t)
7:     Append ∂f/∂θ_i^(t) to gradients
8:   Update all parameters by subtracting the (scaled) gradients: θ_i^(t+1) ← θ_i^(t) − η ∂f/∂θ_i^(t) for each i
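The corresponding sketch in Python for two parameters, again with an arbitrary example objective, f(θ_1, θ_2) = (θ_1 − 1)² + 2(θ_2 + 2)², and arbitrary learning rate and iteration count:

# partial derivatives of f(theta1, theta2) = (theta1 - 1)^2 + 2 * (theta2 + 2)^2
def gradient(theta):
    theta1, theta2 = theta
    return [2 * (theta1 - 1), 4 * (theta2 + 2)]   # [df/dtheta1, df/dtheta2]

eta = 0.1                 # learning rate
theta = [0.0, 0.0]        # initial parameters theta^(0)
for t in range(200):      # fixed number of iterations T
    grads = gradient(theta)                               # vector of partial derivatives
    theta = [p - eta * g for p, g in zip(theta, grads)]   # update each parameter
print(theta)              # approaches [1.0, -2.0], the minimiser of f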
Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• A function is convex if the line segment between any two points on the function lies on or above the function
• A global maximum is the single highest value of the function
• A global minimum is the single lowest value of the function
Source: https://commons.wikimedia.org/wiki/File:Maxima_and_Minima.svg
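Formally, f is convex if for all points θ_a, θ_b and all λ ∈ [0, 1]: f(λ θ_a + (1 − λ) θ_b) ≤ λ f(θ_a) + (1 − λ) f(θ_b). For example, f(θ) = θ² is convex, while f(θ) = θ⁴ − 3θ² is not.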
Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• Convex?
• How many global minima? global maxima?
• How many local minima? local maxima?
Gradient Descent Guarantees
• with an appropriate learning rate, GD will find the global minimum for
differentiable convex functions
• with an appropriate learning rate, GD will find a local minimum for
differentiable non-convex functions
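A small sketch of the non-convex case, using an arbitrary example function f(θ) = θ⁴ − 3θ² + θ, which has one global minimum (near θ ≈ −1.30) and one merely local minimum (near θ ≈ 1.13); gradient descent ends up in different minima depending on where it starts:

# derivative of f(theta) = theta^4 - 3*theta^2 + theta
def df(theta):
    return 4 * theta ** 3 - 6 * theta + 1

def gradient_descent(theta, eta=0.01, iterations=500):
    for t in range(iterations):
        theta = theta - eta * df(theta)
    return theta

print(gradient_descent(-2.0))   # reaches the global minimum, theta approx. -1.30
print(gradient_descent(2.0))    # stops in a local minimum, theta approx. 1.13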
Summary
This aside:
• Iterative optimization
• Gradient descent
Next lecture(s)
• Classifiers which do not have a closed-form solution for their parameters
• Logistic Regression
• (later) the perceptron, and neural networks