Lecture 7 (part 1): Iterative Optimization with Gradient Descent
COMP90049
Introduction to Machine Learning, Semester 1, 2020
Lea Frermann, CIS
Roadmap
So far…
• Naive Bayes Classifier – theory and practice
• MLE estimation of parameters
• Exact optimization
Now: Quick aside on iterative optimization
• Gradient Descent
• Global and local optima
Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predict housing prices from a weighted combination of house age and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood
Objective functions
Find parameter values θ that maximize (or minimize) the value of a function f(θ)
• we want to find the extreme points of the objective function. Depending on our target, this could be
• …the maximum
E.g., the maximum profit of our shoe shop
E.g., the largest possible (log) likelihood of the data
θ̂ = argmax_θ f(θ)
• …or the minimum (in which case we often call f a loss function)
E.g., the smallest possible classification error
θ̂ = argmin_θ f(θ)
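As a quick illustration of the difference (example function chosen here, not from the slides): for f(θ) = −(θ − 2)², the maximum value of f is 0, but θ̂ = argmax_θ f(θ) = 2. The arg operators return the parameter value at which the optimum is attained, not the optimal function value itself.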
Recipe for finding Minima / Maxima
1. Define your function of interest f(x) (e.g., data log likelihood)
2. Compute its first derivative wrt its input x
3. Set the derivative to zero
4. Solve for x
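As a minimal worked instance of this recipe (function chosen here for illustration): for f(x) = (x − 3)², the first derivative is ∂f/∂x = 2(x − 3); setting 2(x − 3) = 0 and solving gives x = 3, the minimum of f.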
Closed-form vs Iterative Optimization
Closed-form solutions
• Previously, we computed the closed-form solution for the MLE of the binomial distribution
• We followed our recipe and arrived at a single solution
Unfortunately, life is not always as easy
• Often, no closed-form solution exists
• Instead, we have to iteratively improve our estimate of θ̂ until we arrive at a satisfactory solution
• Gradient descent is one popular iterative optimization method
‘Descending’ the function to find the Optimum
• 1-dimensional case: find parameter θ that minimizes the function
• follow the curvature of the line step by step
‘Descending’ the function to find the Optimum
• 2-dimensional case: find parameters θ = [θ0,θ1] that minimize the function J
• follow the curvature step by step along the steepest way
Source: https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536
Gradient Descent: Intuition
Intuition
• Descending a mountain (a.k.a. our function) as fast as possible: at every position, take the next step that takes you most directly into the valley
• We compute a series of solutions θ^(0), θ^(1), θ^(2), … by ‘walking’ along the function and taking steps in the direction with the steepest local slope (or gradient).
• each solution depends on the current location
Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse the loss function step by step (‘descending into a valley’)
• we would like an algorithm that tells us how to update our parameters:
θ ← θ + Δθ
• Δθ is based on the derivative, a measure of how much the function changes given a change in θ
• for a function f(θ), ∂f/∂θ tells us how much f changes in response to a change in θ
• the derivative measures the slope or gradient of the function f at point θ
• if ∂f/∂θ > 0: f(θ) increases as θ increases
• if ∂f/∂θ < 0: f(θ) increases as θ decreases
• if ∂f/∂θ = 0: we are at a minimum (or maximum)
• so, to approach the minimum:
θ ← θ − η ∂f/∂θ
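To make one update concrete (all numbers chosen here for illustration): take f(θ) = θ², learning rate η = 0.1, and current estimate θ = 2. Then ∂f/∂θ = 2θ = 4, so the update gives θ ← 2 − 0.1 · 4 = 1.6, a step towards the minimum at θ = 0.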
Gradient Descent for multiple parameters
• Usually, our models have several parameters which need to be optimized to minimize the error
• We compute partial derivatives of f(θ) wrt. individual parameters θi
• Partial derivatives measure the change in a function of multiple parameters given a change in a single parameter, with all others held constant
• For example, for f(θ1, θ2) we can compute ∂f/∂θ1 and ∂f/∂θ2 (a worked instance follows below)
• We then update each parameter individually
θ1 ← θ1 + Δθ1 with Δθ1 = −η ∂f/∂θ1
θ2 ← θ2 + Δθ2 with Δθ2 = −η ∂f/∂θ2
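Worked instance (function chosen here for illustration): for f(θ1, θ2) = θ1² + θ1θ2 + θ2², the partial derivatives are ∂f/∂θ1 = 2θ1 + θ2 and ∂f/∂θ2 = θ1 + 2θ2, and each parameter is updated using only its own partial derivative.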
Gradient Descent: Recipe
Recipe for Gradient Descent (single parameter)
1: Define objective function f(θ)
2: Initialize parameter θ^(0)
3: for iteration t ∈ {0, 1, 2, ... T} do
4:     Compute the first derivative of f at that point θ^(t): ∂f/∂θ^(t)
5:     Update your parameter by subtracting the (scaled) derivative: θ^(t+1) ← θ^(t) − η ∂f/∂θ^(t)
• η is the step size or learning rate, a parameter of the algorithm that we choose
• When to stop? Fix a number of iterations, or define other stopping criteria
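A minimal Python sketch of this single-parameter recipe (the objective f(θ) = (θ − 3)², the learning rate, and the number of iterations are illustrative choices, not from the slides):

```python
# Gradient descent for a single parameter.
# Illustrative objective: f(theta) = (theta - 3)**2, minimized at theta = 3.

def df(theta):
    """First derivative of f with respect to theta."""
    return 2 * (theta - 3)

eta = 0.1      # step size / learning rate
theta = 0.0    # initial parameter theta^(0)

for t in range(50):                  # fixed number of iterations T
    theta = theta - eta * df(theta)  # theta^(t+1) <- theta^(t) - eta * df/dtheta

print(theta)   # approximately 3, the minimizer of f
```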
Gradient Descent: Recipe
Recipe for Gradient Descent (multiple parameters)
1: Define objective function f(θ)
2: Initialize parameters {θ_1^(0), θ_2^(0), θ_3^(0), ...}
3: for iteration t ∈ {0, 1, 2, ... T} do
4:     Initialize vector of gradients ← []
5:     for parameter f ∈ {1, 2, 3, ... F} do
6:         Compute the first derivative of f at that point θ_f^(t): ∂f/∂θ_f^(t)
7:         append ∂f/∂θ_f^(t) to gradients
8:     Update all parameters by subtracting the (scaled) gradient: θ_f^(t+1) ← θ_f^(t) − η ∂f/∂θ_f^(t)
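A corresponding multi-parameter sketch in Python with NumPy, reusing the illustrative function f(θ1, θ2) = θ1² + θ1θ2 + θ2² from the worked instance above (all concrete values are assumptions for illustration, not from the slides):

```python
import numpy as np

def grad(theta):
    """Vector of partial derivatives of f(t1, t2) = t1**2 + t1*t2 + t2**2."""
    t1, t2 = theta
    return np.array([2 * t1 + t2, t1 + 2 * t2])

eta = 0.1                      # step size / learning rate
theta = np.array([2.0, -1.0])  # initial parameters theta^(0)

for t in range(100):                   # fixed number of iterations T
    theta = theta - eta * grad(theta)  # subtract the scaled gradient vector

print(theta)  # approximately [0, 0], the unique minimum of this convex function
```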
Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• A function is convex if the line segment between any two points on the function lies on or above the function
• A global maximum is the single highest value of the function
• A global minimum is the single lowest value of the function
https://commons.wikimedia.org/wiki/File:Maxima and Minima.svg
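For instance (illustrative functions, not from the slides): f(θ) = θ² is convex, with a single global minimum at θ = 0 and no other stationary points; f(θ) = θ⁴ − 2θ² is not convex, and has two global minima (at θ = ±1) and a local maximum at θ = 0.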
Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• Convex?
• How many global minima? global maxima?
• How many local minima? local maxima?
Gradient Descent Guarantees
• with an appropriate learning rate, GD will find the global minimum for differentiable convex functions
• with an appropriate learning rate, GD will find a local minimum for differentiable non-convex functions
Summary
This aside:
• Iterative optimization
• Gradient descent
Next lecture(s)
• Classifiers which do not have a closed-form solution for their parameters
• Logistic Regression
• (later) the perceptron, and neural networks