CS计算机代考程序代写 algorithm Lecture 7 (part 1): Iterative Optimization with Gradient

Lecture 7 (part 1): Iterative Optimization with Gradient
Descent

COMP90049
Introduction to Machine Learning
Semester 1, 2020

Lea Frermann, CIS

1

Roadmap

So far…

• Naive Bayes Classifier – theory and practice

• MLE estimation of parameters

• Exact optimization

Now: Quick aside on iterative optimization

• Gradient Descent

• Global and local optima

2

Finding Optimal Points I

Finding the parameters that optimize a target

Ex1: Estimate the study time which leads to the best grade in COMP90049.

Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Ex3: Predicting housing prices from a weighted combination of house age
and house location

Ex4: Find the parameters θ of a spam classifier which lead to the lowest error

Ex5: Find the parameters θ of a spam classifier which lead to the highest
data log likelihood

3

Objective functions

Find parameter values θ that maximize (or minimize) the value of a
function f (θ)

• we want to find the extreme points of the objective function.
Depending on our target, this could be

• …the maximum
E.g., the maximum profit of our shoe shop
E.g., the largest possible (log) likelihood of the data

θ̂ = argmax
θ

f (θ)

• …or the minimum (in which case we often call f a loss function)
E.g., the smallest possible classification error

θ̂ = argmin
θ

f (θ)

4

Recipe for finding Minima / Maxima

1. Define your function of interest f (x) (e.g., data log likelihood)

2. Compute its first derivative wrt its input x

3. Set the derivative to zero

4. Solve for x

5

Closed-form vs Iterative Optimization

Closed-form solutions

• Previously, we computed the closed form solution for the MLE of the
binomial distribution

• We follow our recipe, and arrive at a single solution

Unfortunately, life is not always as easy

• Often, no closed-form solution exists

• Instead, we have to iteratively improve our estimate of θ̂ until we arrive
at a satisfactory solution

• Gradient descent is one popular iterative optimization method

6

‘Descending’ the function to find the Optimum

• 1-dimensional case: find parameter θ that minimizes the function
• follow the curvature of the line step by step

7

‘Descending’ the function to find the Optimum

• 2-dimensional case: find parameters θ = [θ0, θ1] that minimize the
function J
• follow the curvature step by step along the steepest way

Source: https://medium.com/binaryandmore/

beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536 7

https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536
https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536

Gradient Descent: Intuition

Intuition

• Descending a mountain (aka. our function) as fast as possible: at every
position take the next step that takes you most directly into the valley

• We compute a series of solutions θ(0), θ(2), θ(3), … by ‘walking’ along the
function and taking steps in the direction with the steepest local slope (or
gradient).

• each solution depends on the current location

8

Gradient Descent: Details

Learn the model parameters θ

• such that we minimize the error

• traverse over the loss function step by step (‘descending into a valley’)

• we would like an algorithm that tells how to update our parameters

θ ← θ +4θ

9

Gradient Descent: Details

Learn the model parameters θ

• such that we minimize the error

• traverse over the loss function step by step (‘descending into a valley’)

• we would like an algorithm that tells how to update our parameters

θ ← θ +4θ

• 4θ is the derivative, a measure of
change in the input function given an
change in θ

• for a function f (θ), ∂f
∂θ

tells us how
much f changes in response to a
change in θ.

• the derivative measures the slope or
gradient of a function f at point θ

9

Gradient Descent: Details

Learn the model parameters θ

• such that we minimize the error

• traverse over the loss function step by step (‘descending into a valley’)

• we would like an algorithm that tells how to update our parameters

θ ← θ +4θ

• if ∂f
∂x > 0: f (θ) increases as θ increases

• if ∂f
∂x < 0: f (θ) increases as θ decreases • if ∂f ∂x = 0: we are at a minimum (or maximum) • so, to approach the minimum: θ ← θ − η ∂f ∂θ 9 Gradient Descent for multiple parameters • Usually, our models have several parameters which need to be optimized to minimize the error • We compute partial derivatives of f (θ) wrt. individual θi • Partial derivatives measure change in a function of multiple parameters given a change in a single parameter, with all others held constant • For example for f (θ1, θ2) we can compute ∂f∂θ1 and ∂f ∂θ2 • We then update each parameter individually θ1 ← θ1 +4θ1 with 4 θ1 = −η ∂f ∂θ1 θ2 ← θ1 +4θ2 with 4 θ2 = −η ∂f ∂θ2 10 Gradient Descent for multiple parameters • Usually, our models have several parameters which need to be optimized to minimize the error • We compute partial derivatives of f (θ) wrt. individual θi • Partial derivatives measure change in a function of multiple parameters given a change in a single parameter, with all others held constant • For example for f (θ1, θ2) we can compute ∂f∂θ1 and ∂f ∂θ2 • We then update each parameter individually θ1 ← θ1 +4θ1 with 4 θ1 = −η ∂f ∂θ1 θ2 ← θ1 +4θ2 with 4 θ2 = −η ∂f ∂θ2 10 Gradient Descent: Recipe Recipe for Gradient Descent (single parameter) 1: Define objective function f (θ) 2: Initialize parameter θ(0) 3: for iteration t ∈ {0, 1, 2, ...T} do 4: Compute the first derivative of f at that point θ(t) : ∂f ∂θ(t) 5: Update your parameter by subtracting the (scaled) derivative θ (t+1) ← θ(t) − η ∂f ∂θ(t) • η is the step size or learning rate, a parameter • When to stop? Fix number of iterations, or define other criteria 11 Gradient Descent: Recipe Recipe for Gradient Descent (multiple parameters) 1: Define objective function f (θ) 2: Initialize parameters {θ(0)1 , θ (0) 2 , θ (0) 3 , . . . } 3: for iteration t ∈ {0, 1, 2, ...T} do 4: Initialize vector of gradients← [] 5: for parameter f ∈ {1, 2, 3, ...F} do 6: Compute the first derivative of f at that point θ(t)f : ∂f ∂θ (t) f 7: append ∂f ∂θ (t) f to gradients 8: Update all parameters by subtracting the (scaled) gradient θ (t+1) ← θ(t) − η ∂f ∂θ(t) 11 Aside: Global and Local Minima and Maxima Possible issue: local maxima and minima! • A function is convex if a line between any two points of the function lies above the function • A global maximum is the single highest value of the function • A global minimum is the single lowest value of the function https://commons.wikimedia.org/wiki/File:Maxima and Minima.svg 12 Aside: Global and Local Minima and Maxima Possible issue: local maxima and minima! • Convex? • How many global minima? global maxima? • How many local minima? local maxima? 12 Aside: Global and Local Minima and Maxima Possible issue: local maxima and minima! • Convex? • How many global minima? global maxima? • How many local minima? local maxima? 12 Gradient Descent Guarantees • with an appropriate learning rate, GD will find the global minimum for differentiable convex functions • with an appropriate learning rate, GD will find a local minimum for differentiable non-convex functions 13 Summary This aside: • Iterative optimization • Gradient descent Next lecture(s) • Classifiers which do not have a closed-form solution for their paramters • Logistic Regresion • (later) the perceptron, and neural networks 14