Lecture 7 (part 1): Iterative Optimization with Gradient Descent
COMP90049
Introduction to Machine Learning, Semester 1, 2020
Lea Frermann, CIS

Roadmap
So far…
• Naive Bayes Classifier – theory and practice
• MLE estimation of parameters
• Exact optimization
Now: Quick aside on iterative optimization
• Gradient Descent
• Global and local optima

Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predict housing prices from a weighted combination of house age and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood

Objective functions
Find parameter values θ that maximize (or minimize) the value of a function f(θ)
• we want to find the extreme points of the objective function. Depending on our target, this could be
• …the maximum
E.g., the maximum profit of our shoe shop
E.g., the largest possible (log) likelihood of the data
θ̂ = argmax_θ f(θ)
• …or the minimum (in which case we often call f a loss function)
E.g., the smallest possible classification error
θ̂ = argmin_θ f(θ)
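As a toy illustration (my own example, not on the slide): maximizing a function is equivalent to minimizing its negation.

f(θ) = −(θ − 2)²,  θ̂ = argmax_θ f(θ) = 2
−f(θ) = (θ − 2)²,  θ̂ = argmin_θ [−f(θ)] = 2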

Recipe for finding Minima / Maxima
1. Define your function of interest f(x) (e.g., data log likelihood)
2. Compute its first derivative wrt its input x
3. Set the derivative to zero
4. Solve for x
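As a quick worked example (my own, not from the slides), applying the recipe to the squared error f(θ) = Σi (xi − θ)² over data points x1, ..., xN:

∂f/∂θ = −2 Σi (xi − θ)
Setting −2 Σi (xi − θ) = 0 gives Σi xi = N θ, so θ̂ = (1/N) Σi xi, the sample mean.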

Closed-form vs Iterative Optimization
Closed-form solutions
• Previously, we computed the closed-form solution for the MLE of the binomial distribution
• We follow our recipe and arrive at a single solution
Unfortunately, life is not always as easy
• Often, no closed-form solution exists
• Instead, we have to iteratively improve our estimate of θ̂ until we arrive at a satisfactory solution
• Gradient descent is one popular iterative optimization method
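Recalling the closed-form case mentioned in the first bullet (my restating of the earlier lecture, not text from this slide): for k successes observed in N Bernoulli/binomial trials, the recipe yields

θ̂_MLE = k / N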

‘Descending’ the function to find the Optimum
• 1-dimensional case: find parameter θ that minimizes the function
• follow the curvature of the line step by step

‘Descending’ the function to find the Optimum
• 2-dimensional case: find parameters θ = [θ0,θ1] that minimize the function J
• follow the curvature step by step along the steepest way
Source: https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536

Gradient Descent: Intuition
Intuition
• Descending a mountain (aka. our function) as fast as possible: at every position take the next step that takes you most directly into the valley
• We compute a series of solutions θ(0), θ(1), θ(2), … by ‘walking’ along the function and taking steps in the direction of the steepest local slope (or gradient).
• each solution depends on the current location

Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’)
• we would like an algorithm that tells how to update our parameters
θ ← θ + △θ
• △θ is based on the derivative, a measure of the change in the function given a change in θ
• for a function f(θ), ∂f/∂θ tells us how much f changes in response to a change in θ
• the derivative measures the slope or gradient of a function f at point θ
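To make this concrete, a minimal sketch (my own illustration, not code from the lecture) of the update θ ← θ − η ∂f/∂θ, assuming the toy objective f(θ) = (θ − 3)² with derivative 2(θ − 3) and a fixed number of iterations:

def f(theta):
    return (theta - 3) ** 2          # toy objective (an assumption for illustration)

def df(theta):
    return 2 * (theta - 3)           # its first derivative df/dtheta

eta = 0.1                            # step size / learning rate
theta = 0.0                          # initial guess theta^(0)
for t in range(100):                 # fixed number of iterations as the stopping criterion
    theta = theta - eta * df(theta)  # theta^(t+1) <- theta^(t) - eta * df/dtheta

print(theta)                         # approaches the minimiser theta_hat = 3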

Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’)
• we would like an algorithm that tells us how to update our parameters
θ ← θ + △θ
• if ∂f/∂θ > 0: f(θ) increases as θ increases
• if ∂f/∂θ < 0: f(θ) increases as θ decreases
• if ∂f/∂θ = 0: we are at a minimum (or a maximum)
• so, to approach the minimum: θ ← θ − η ∂f/∂θ

Gradient Descent for multiple parameters
• Usually, our models have several parameters which need to be optimized to minimize the error
• We compute partial derivatives of f(θ) wrt. the individual θi
• Partial derivatives measure the change in a function of multiple parameters given a change in a single parameter, with all others held constant
• For example, for f(θ1, θ2) we can compute ∂f/∂θ1 and ∂f/∂θ2
• We then update each parameter individually
θ1 ← θ1 + △θ1 with △θ1 = −η ∂f/∂θ1
θ2 ← θ2 + △θ2 with △θ2 = −η ∂f/∂θ2

Gradient Descent: Recipe
Recipe for Gradient Descent (single parameter)
1: Define objective function f(θ)
2: Initialize parameter θ(0)
3: for iteration t ∈ {0, 1, 2, ... T} do
4:     Compute the first derivative of f at the current point θ(t): ∂f/∂θ(t)
5:     Update the parameter by subtracting the (scaled) derivative: θ(t+1) ← θ(t) − η ∂f/∂θ(t)
• η is the step size or learning rate, a parameter
• When to stop? Fix the number of iterations, or define other criteria

Gradient Descent: Recipe
Recipe for Gradient Descent (multiple parameters)
1: Define objective function f(θ)
2: Initialize parameters {θ1(0), θ2(0), θ3(0), ...}
3: for iteration t ∈ {0, 1, 2, ... T} do
4:     Initialize vector of gradients ← []
5:     for parameter j ∈ {1, 2, 3, ... F} do
6:         Compute the partial derivative of f at the current point θj(t): ∂f/∂θj(t)
7:         append ∂f/∂θj(t) to gradients
8:     Update all parameters by subtracting the (scaled) gradients: θj(t+1) ← θj(t) − η ∂f/∂θj(t)

Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• A function is convex if a line between any two points of the function lies above the function
• A global maximum is the single highest value of the function
• A global minimum is the single lowest value of the function
Source: https://commons.wikimedia.org/wiki/File:Maxima and Minima.svg

Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• Convex?
• How many global minima? global maxima?
• How many local minima? local maxima?

Gradient Descent Guarantees
• with an appropriate learning rate, GD will find the global minimum for differentiable convex functions
• with an appropriate learning rate, GD will find a local minimum for differentiable non-convex functions

Summary
This aside:
• Iterative optimization
• Gradient descent
Next lecture(s)
• Classifiers which do not have a closed-form solution for their parameters
• Logistic Regression
• (later) the perceptron, and neural networks
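Closing illustration (my own sketch, not code from the lecture): the multi-parameter gradient descent recipe above, assuming a toy objective f(θ1, θ2) = (θ1 − 1)² + (θ2 + 2)², a learning rate of η = 0.1, and a fixed number of iterations as the stopping criterion.

# Gradient descent for several parameters (illustrative sketch only).
# Assumed toy objective: f(theta1, theta2) = (theta1 - 1)**2 + (theta2 + 2)**2,
# whose partial derivatives are 2*(theta1 - 1) and 2*(theta2 + 2).

def partial_derivatives(theta):
    # vector of partial derivatives df/dtheta_j evaluated at theta
    return [2 * (theta[0] - 1), 2 * (theta[1] + 2)]

eta = 0.1                    # step size / learning rate
theta = [0.0, 0.0]           # initial parameters theta^(0)
for t in range(100):         # fixed number of iterations T
    gradients = partial_derivatives(theta)
    # update each parameter individually: theta_j <- theta_j - eta * df/dtheta_j
    theta = [theta_j - eta * g for theta_j, g in zip(theta, gradients)]

print(theta)                 # approaches the minimiser [1.0, -2.0]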