Lecture 11 (part 1): Iterative Optimization with Gradient Descent
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
Copyright By PowCoder代写 加微信 powcoder
Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predicting housing prices from a weighted combination of house age and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood
Recipe for finding Minima / Maxima
1. Define your function of interest f(x) (e.g., data log likelihood) 2. Compute its first derivative wrt its input x
3. Set the derivative to zero
4. Solve for x
Closed-form vs Iterative Optimization
Closed-form solutions
• Previously, we computed the closed form solution for the MLE of the
binomial distribution
• We follow our recipe, and arrive at a single solution
Unfortunately, life is not always as easy
• Often, no closed-form solution exists
• Instead, we have to iteratively improve our estimate of θˆ until we arrive at a satisfactory solution
• Gradient descent is one popular iterative optimization method
‘Descending’ the function to find the Optimum
• 1-dimensional case: find parameter θ that minimizes the function • follow the curvature of the line step by step
‘Descending’ the function to find the Optimum
• 2-dimensional case: find parameters θ = [θ0,θ1] that minimize the function J
• follow the curvature step by step along the steepest way
Source: https://medium.com/binaryandmore/
beginners- guide- to- deriving- and- implementing- backpropagation- e3c1a5a1e536 5
Gradient Descent: Intuition
• Descending a mountain (aka. our function) as fast as possible: at every position take the next step that takes you most directly into the valley
• We compute a series of solutions θ(0), θ(2), θ(3), … by ‘walking’ along the function and taking steps in the direction with the steepest local slope (or gradient).
• each solution depends on the current location
Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’) • we would like an algorithm that tells how to update our parameters
θ ← θ + △θ
Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’)
• we would like an algorithm that tells how to update our parameters
θ ← θ + △θ
• △θ is the derivative ∂f ∂θ
• tells us how much f changes in response to a change in θ.
• a measure of the slope or gradient of a function f at point θ
• the gradient points to the greatest increase of a function
Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’) • we would like an algorithm that tells how to update our parameters
•if∂f >0:f(θ):↗ asθ:↗ ∂θ
•if∂f <0:f(θ):↗ asθ:↘ ∂θ
θ ← θ + △θ
• if ∂f =0: weareataminimum ∂θ
• so, to approach the minimum:
θ ← θ − η ∂f ∂θ
Gradient Descent for multiple parameters
• Usually, our models have several parameters which need to be optimized to minimize the error
• We compute partial derivatives of f (θ) wrt. individual θi
• Partial derivatives measure change in a function of multiple parameters
given a change in a single parameter, with all others held constant • For example for f(θ1,θ2) we can compute ∂f and ∂f
• We then update each parameter individually
θ1 ←θ1 +△θ1 with △θ1 =−η ∂f ∂θ1
θ2 ←θ2 +△θ2 with △θ2 =−η ∂f ∂θ2
Gradient Descent: Recipe
Recipe for Gradient Descent (single parameter)
1: 2: 3: 4: 5:
Defineobjectivefunctionf(θ) Initializeparameterθ(0) foriterationt∈{0,1,2,...T}do
Compute the first derivative of f at that point θ(t) : ∂f ∂θ(t)
Update your parameter by subtracting the (scaled) derivative θ(t+1) ←θ(t) −η ∂f
• η is the step size or learning rate, a parameter
• When to stop? Fix number of iterations, or define other criteria
Gradient Descent: Recipe
Recipe for Gradient Descent (multiple parameters)
1: 2: 3: 4: 5: 6:
Defineobjectivefunctionf(θ) Initializeparameters{θ(0),θ(0),θ(0),...}
123 foriterationt∈{0,1,2,...T}do
Initialize vector of gradients ← [] for parameter f ∈ {1, 2, 3, ...F } do
Compute the first derivative of f at that point θ(t) : ∂f
to gradients
Update all parameters by subtracting the (scaled) gradient
θ(t+1) ←θ(t) −η ∂f ∂θ(t)
Aside: Global and Local Minima and Maxima
Possible issue: local maxima and minima!
• A function is convex if a line between any two points of the function lies above the function
• A global maximum is the single highest value of the function • A global minimum is the single lowest value of the function
https://commons.wikimedia.org/wiki/File:Maxima and Minima.svg
Gradient Descent Guarantees
1. with an appropriate learning rate, GD will find the global minimum for differentiable convex functions
2. with an appropriate learning rate, GD will find a local minimum for differentiable non-convex functions
Equivalently, Gradient Ascent (“GA”) would find the global maximum (case 1.) and local maximum (case 2.)
Now you know:
• What optimization is, and why it’s important
• How to do closed-form optimization (aka. “set the derivative of f (θ) to
zero and solve for θ)
• That closed-form solutions are not always computable
• In that case, iterative optimization can help us
• Gradient descent is one instance of an iterative optimization method
• How gradient descent works!
Next lecture(s)
• Logistic Regresion • The perceptron
• Neural networks
Lecture 11 (part 2): Logistic Regression
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.
• Naive Bayes + KNN
• Optimization (closed-form and iterative)
• Evaluation + Feature Selection
• Detour into un- and semisupervised learning
Today : back to classification! • Logistic Regression
Logistic Regression
Quick Refresher
Recall Naive Bayes
P(x, y) = P(y)P(x|y) = P(yi ) P(xmi |yi )
• a probabilistic generative model of the joint probability P(x,y) • optimized to maximize the likelihood of the observed data
• naive due to unrealistic feature indepencence assumptions
NM i=1 m=1
Quick Refresher
Recall Naive Bayes
P(x, y) = P(y)P(x|y) = P(yi ) P(xmi |yi )
• a probabilistic generative model of the joint probability P(x,y) • optimized to maximize the likelihood of the observed data
• naive due to unrealistic feature indepencence assumptions
For prediction, we apply Bayes Rule to obtain the conditional distribution P(x, y) = P(y)P(x|y) = P(y|x)P(x)
P(y|x) = P(y)P(x|y) P(x)
yˆ = argmax P(y|x) ≈ P(y)P(x|y) y
How about we model P(y|x) directly? → Logistic Regression
NM i=1 m=1
Introduction to Logistic Regression
Logistic Regression on a high level
• Is a binary classification model
• Is a probabilistic discriminative model because it optimizes P(y|x) directly
• Learns to optimally discriminate between inputs which belong to different classes
• No model of P(x|y) → no conditional feature independence assumption
Aside: Linear Regression
• Regression: predict a real-valued quantity y given features x, e.g.,
housing price success of movie ($) air quality
given given given
{location, size, age, ...}
{cast, genre, budget, ...} {temperature, timeOfDay, CO2, ...}
Aside: Linear Regression
• Regression: predict a real-valued quantity y given features x, e.g.,
housing price success of movie ($) air quality
given given given
{location, size, age, ...}
{cast, genre, budget, ...} {temperature, timeOfDay, CO2, ...}
• linear regression is the simples regression model
• a real-valued yˆ is predicted as a linear combination of weighted feature
yˆ = θ 0 + θ 1 x 1 + θ 2 x 2 + . . . = θ0 + θi xi
• The weights θ0 , θ1 , . . . are model parameters, and need to be optimized during training
• Loss(error)isthesumofsquarederrors(SSE): L=Ni=1(yˆi −yi)2
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0).
• We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ]
• We want to use a regression approach
• How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not.
p(x)=θ0 +θ1x1 +...θFxF
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0).
• We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ]
• We want to use a regression approach
• How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not.
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0).
• We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ]
• We want to use a regression approach
• How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not.
• How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not.
logp(x)=θ0 +θ1x1 +...θFxF
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0).
• We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ]
• We want to use a regression approach
• How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not.
• How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not.
Logistic Regression: Derivation I
• Let’s assume a binary classification task, y is true (1) or false (0).
• We model probabilites P(y = 1|x; θ) = p(x) as a function of
observations x under parameters θ. [ What about P(y = 0|x; θ)? ]
• We want to use a regression approach
• How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not.
• How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not.
• How about: minimally modifying log p(x) such that is is unbounded, by applying the logistic transformation
log p(x) =θ0 +θ1x1 +...θFxF 1−p(x)
Logistic Regression: Derivation II
log p(x) =θ0 +θ1x1 +...θFxF 1−p(x)
• also called the log odds
• the odds are defined as the fraction of success over the fraction of
odds = P(success) = P(success) P(failures) 1 − P(success)
• e.g., the odds of rolling a 6 with a fair dice are:
1/6 = 0.17 = 0.2 1 − (1/6) 0.83
Logistic Regression: Derivation III
log P(x) =θ0 +θ1x1 +...θFxF 1−P(x)
If we rearrange and solve for P(x), we get
exp(θ0 + θ1x1 + ...θF xF ) exp(θ0 + Ff=1 θf xf )
P(x)= 1+exp(θ0 +θ1x1 +...θFxF) = 1+exp(θ0 +Ff=1θfxf) 11
= 1+exp(−(θ0 +θ1x1 +...θFxF)) = 1+exp(−(θ0 +Ff=1 θfxf)) • where the RHS is the inverse logit (or
logistic function)
• we pass a regression model through the logistic function to obtain a valid probability predicton
Logistic Regression: Interpretation
P(y|x;θ)= 1+exp(−(θ0 +Ff=1θfxf))
A closer look at the logistic function
Most inputs lead to P(y|x)=0 or P(y|x)=1. That is intended, because all true labels are either 0 or 1.
• (θ0Ff=1θfxf)>0meansy=1 • (θ0Ff=1θfxf)≈0meansmost
uncertainty
• (θ0Ff=1θfxf)<0meansy=0
Logistic Regression: Prediction
• The logistic function returns the probability of P(y = 1) given an in put x 1
P(y=1|x1,x2,...,xF;θ) = 1+exp(−(θ0+Ff=1θfxf)) = σ(x;θ)
• We define a decision boundary, e.g., predict y = 1 if P(y = 1|x1,x2,...,xF;θ) > 0.5 and y = 0 otherwise
11T P(y = 1|x1, x2, …, xF ; θ) = 1+exp(−(θ0+Ff=1 θf xf )) = 1+exp(−(θT x)) = σ(θ x)
Model parameters
θ = [0.1, −3.5, 0.7, 2.1]
Feature Function
(Small) Test Data set
rainy sunny
Temp Humidity
cool normal 0
Class hot high 1
1 (bias term)
1 if outlook=sunny
3 if outlook=rainy
1 if temp=hot
2 if temp=mild 3 if temp=cool
1 if humidity=normal
2 if outlook=overcast
2 if humidity=high
11T P(y = 1|x1, x2, …, xF ; θ) = 1+exp(−(θ0+Ff=1 θf xf )) = 1+exp(−(θT x)) = σ(θ x)
Model parameters
θ = [0.1, −3.5, 0.7, 2.1]
Feature Function
(Small) Test Data set
rainy sunny
Temp Humidity
cool normal 0
Class hot high 1
1 (bias term)
1 if outlook=sunny
3 if outlook=rainy
1 if temp=hot
2 if temp=mild 3 if temp=cool
1 if humidity=normal
2 if outlook=overcast
2 if humidity=high
Parameter Estimation
What are the four steps we would follow in finding the optimal param- eters?
Objective Function
Mimimize the Negative conditional log likelihood
L(θ) = −P(Y|X;θ) = −P(yi|xi;θ)
P(y =1|x;θ)=σ(θTx) P(y =0|x;θ)=1−σ(θTx)
Objective Function
Mimimize the Negative conditional log likelihood
L(θ) = −P(Y|X;θ) = −P(yi|xi;θ)
i=1 P(y =1|x;θ)=σ(θTx)
P(y =0|x;θ)=1−σ(θTx)
L(θ) = −P(Y|X;θ) = −P(yi|xi;θ)
=−(σ(θTxi))yi ∗(1−σ(θTxi))1−yi i=1
Objective Function
Mimimize the Negative conditional log likelihood
L(θ) = −P(Y|X;θ) = −P(yi|xi;θ)
i=1 P(y =1|x;θ)=σ(θTx)
P(y =0|x;θ)=1−σ(θTx)
L(θ) = −P(Y|X;θ) = −P(yi|xi;θ)
take the log of this function
=−(σ(θTxi))yi ∗(1−σ(θTxi))1−yi i=1
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
• Derivative of sum = sum of derivatives → focus on 1 training input
• Compute ∂L for each θj individually, so focus on 1 θj ∂θj
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
∂logL(θ) =−(y−1−y) ∂p p1−p
( because L(θ) = −[ylogp + (1 − y)log(1 − p)]
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
∂p = ∂σ(z) = σ(z)[1 − σ(z)] ∂z ∂z
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
∂z = ∂θTx=xj ∂θj ∂θj
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
=−y−1−y × σ(z)[1−σ(z)] × xj p 1−p
Take 1st Derivative of the Objective Function
logL(θ) = −yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
=−y−1−y × σ(z)[1−σ(z)] × xj p 1−p
=σ(θTx)−y × xj
Logistic Regression: Parameter Estimation III
The derivative of the log likelihood wrt. a single parameter θj for all training examples
logL(θ) =N σ(θTxi)−yixi ∂θ j
• Now, we would set derivatives to zero (Step 3) and solve for θ (Step 4) • Unfortunately, that’s not straightforward here (as for Naive Bayes)
• Instead, we will use an iterative method: Gradient Descent
Logistic Regression: Parameter Estimation III
The derivative of the log likelihood wrt. a single parameter θj for all training examples
logL(θ) =N σ(θTxi)−yixi ∂θ j
• Now, we would set derivatives to zero (Step 3) and solve for θ (Step 4) • Unfortunately, that’s not straightforward here (as for Naive Bayes)
• Instead, we will use an iterative method: Gradient Descent
θ(new) ← θ(old) − η ∂ log L(θ) j j ∂θj
θ(new) ← θ(old) −ησ(θTxi)−yixi jjj
Multinomial Logistic Regression
• So far we looked at problems where either y = 0 or y = 1 (e.g., spam classification: y ∈ {play, not play})
P(y=1|x;θ)=σ(θTx) = exp(θTx) 1+exp(θTx)
P(y=0|x;θ)=1−σ(θTx) =1− exp(θTx) 1+exp(θTx)
Multinomial Logistic Regression
• So far we looked at problems where either y = 0 or y = 1 (e.g., spam classification: y ∈ {play, not play})
P(y=1|x;θ)=σ(θTx) = exp(θTx) 1+exp(θTx)
P(y=0|x;θ)=1−σ(θTx) =1− exp(θTx) 1+exp(θTx)
• But what if we have more than 2 classes, e.g., y ∈ {positive, negative, neutral}
• we predict the probability of each class c by passing the input representation through the softmax function, a generalization of the sigmoid
exp(θcx) p(y = c|x;θ) = k exp(θkx)
• we learn a parameter vector θc for each class c
Example! Multi-class with 1-hot features
p(y=c|x;θ)= exp(θcx) k exp(θkx)
(Small) Test Data set
Outlook Temp
rainy cool sunny cool sunny hot
Feature Function
normal normal high
0 (don’t play) 1 (maybe play) 2 (play)
x0 = 1 (bias term)
1 if outlook=sunny [100] if outlook=sunny
2 if outlook=overcast x1 = [010] if outlook=overcast
3 if outlook=rainy [001] if outlook=rainy
1 if temp=hot [100] if temp=hot
1 (bias term)
2 if temp=mild 3 if temp=
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com