编程代考 COMP90049.

Lecture 11 (part 1): Iterative Optimization with Gradient Descent
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Copyright By PowCoder代写 加微信 powcoder

Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predicting housing prices from a weighted combination of house age and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood

Recipe for finding Minima / Maxima
1. Define your function of interest f(x) (e.g., data log likelihood) 2. Compute its first derivative wrt its input x
3. Set the derivative to zero
4. Solve for x

Closed-form vs Iterative Optimization
Closed-form solutions
• Previously, we computed the closed form solution for the MLE of the
binomial distribution
• We follow our recipe, and arrive at a single solution
Unfortunately, life is not always as easy
• Often, no closed-form solution exists
• Instead, we have to iteratively improve our estimate of θˆ until we arrive at a satisfactory solution
• Gradient descent is one popular iterative optimization method

‘Descending’ the function to find the Optimum
• 1-dimensional case: find parameter θ that minimizes the function • follow the curvature of the line step by step

‘Descending’ the function to find the Optimum
• 2-dimensional case: find parameters θ = [θ0,θ1] that minimize the function J
• follow the curvature step by step along the steepest way
Source: https://medium.com/binaryandmore/
beginners- guide- to- deriving- and- implementing- backpropagation- e3c1a5a1e536 5

Gradient Descent: Intuition
• Descending a mountain (aka. our function) as fast as possible: at every position take the next step that takes you most directly into the valley
• We compute a series of solutions θ(0), θ(2), θ(3), … by ‘walking’ along the function and taking steps in the direction with the steepest local slope (or gradient).
• each solution depends on the current location

Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’) • we would like an algorithm that tells how to update our parameters
θ ← θ + △θ

Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’)
• we would like an algorithm that tells how to update our parameters
θ ← θ + △θ
• △θ is the derivative ∂f ∂θ
• tells us how much f changes in response to a change in θ.
• a measure of the slope or gradient of a function f at point θ
• the gradient points to the greatest increase of a function

Gradient Descent: Details
Learn the model parameters θ
• such that we minimize the error
• traverse over the loss function step by step (‘descending into a valley’) • we would like an algorithm that tells how to update our parameters
•if∂f >0:f(θ):↗ asθ:↗ ∂θ
•if∂f <0:f(θ):↗ asθ:↘ ∂θ θ ← θ + △θ • if ∂f =0: weareataminimum ∂θ • so, to approach the minimum: θ ← θ − η ∂f ∂θ Gradient Descent for multiple parameters • Usually, our models have several parameters which need to be optimized to minimize the error • We compute partial derivatives of f (θ) wrt. individual θi • Partial derivatives measure change in a function of multiple parameters given a change in a single parameter, with all others held constant • For example for f(θ1,θ2) we can compute ∂f and ∂f • We then update each parameter individually θ1 ←θ1 +△θ1 with △θ1 =−η ∂f ∂θ1 θ2 ←θ2 +△θ2 with △θ2 =−η ∂f ∂θ2 Gradient Descent: Recipe Recipe for Gradient Descent (single parameter) 1: 2: 3: 4: 5: Defineobjectivefunctionf(θ) Initializeparameterθ(0) foriterationt∈{0,1,2,...T}do Compute the first derivative of f at that point θ(t) : ∂f ∂θ(t) Update your parameter by subtracting the (scaled) derivative θ(t+1) ←θ(t) −η ∂f • η is the step size or learning rate, a parameter • When to stop? Fix number of iterations, or define other criteria Gradient Descent: Recipe Recipe for Gradient Descent (multiple parameters) 1: 2: 3: 4: 5: 6: Defineobjectivefunctionf(θ) Initializeparameters{θ(0),θ(0),θ(0),...} 123 foriterationt∈{0,1,2,...T}do Initialize vector of gradients ← [] for parameter f ∈ {1, 2, 3, ...F } do Compute the first derivative of f at that point θ(t) : ∂f to gradients Update all parameters by subtracting the (scaled) gradient θ(t+1) ←θ(t) −η ∂f ∂θ(t) Aside: Global and Local Minima and Maxima Possible issue: local maxima and minima! • A function is convex if a line between any two points of the function lies above the function • A global maximum is the single highest value of the function • A global minimum is the single lowest value of the function https://commons.wikimedia.org/wiki/File:Maxima and Minima.svg Gradient Descent Guarantees 1. with an appropriate learning rate, GD will find the global minimum for differentiable convex functions 2. with an appropriate learning rate, GD will find a local minimum for differentiable non-convex functions Equivalently, Gradient Ascent (“GA”) would find the global maximum (case 1.) and local maximum (case 2.) Now you know: • What optimization is, and why it’s important • How to do closed-form optimization (aka. “set the derivative of f (θ) to zero and solve for θ) • That closed-form solutions are not always computable • In that case, iterative optimization can help us • Gradient descent is one instance of an iterative optimization method • How gradient descent works! Next lecture(s) • Logistic Regresion • The perceptron • Neural networks Lecture 11 (part 2): Logistic Regression Introduction to Machine Learning Semester 1, 2022 Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author. • Naive Bayes + KNN • Optimization (closed-form and iterative) • Evaluation + Feature Selection • Detour into un- and semisupervised learning Today : back to classification! • Logistic Regression Logistic Regression Quick Refresher Recall Naive Bayes P(x, y) = P(y)P(x|y) = 􏰊 P(yi ) 􏰊 P(xmi |yi ) • a probabilistic generative model of the joint probability P(x,y) • optimized to maximize the likelihood of the observed data • naive due to unrealistic feature indepencence assumptions NM i=1 m=1 Quick Refresher Recall Naive Bayes P(x, y) = P(y)P(x|y) = 􏰊 P(yi ) 􏰊 P(xmi |yi ) • a probabilistic generative model of the joint probability P(x,y) • optimized to maximize the likelihood of the observed data • naive due to unrealistic feature indepencence assumptions For prediction, we apply Bayes Rule to obtain the conditional distribution P(x, y) = P(y)P(x|y) = P(y|x)P(x) P(y|x) = P(y)P(x|y) P(x) yˆ = argmax P(y|x) ≈ P(y)P(x|y) y How about we model P(y|x) directly? → Logistic Regression NM i=1 m=1 Introduction to Logistic Regression Logistic Regression on a high level • Is a binary classification model • Is a probabilistic discriminative model because it optimizes P(y|x) directly • Learns to optimally discriminate between inputs which belong to different classes • No model of P(x|y) → no conditional feature independence assumption Aside: Linear Regression • Regression: predict a real-valued quantity y given features x, e.g., housing price success of movie ($) air quality given given given {location, size, age, ...} {cast, genre, budget, ...} {temperature, timeOfDay, CO2, ...} Aside: Linear Regression • Regression: predict a real-valued quantity y given features x, e.g., housing price success of movie ($) air quality given given given {location, size, age, ...} {cast, genre, budget, ...} {temperature, timeOfDay, CO2, ...} • linear regression is the simples regression model • a real-valued yˆ is predicted as a linear combination of weighted feature yˆ = θ 0 + θ 1 x 1 + θ 2 x 2 + . . . = θ0 + 􏰂 θi xi • The weights θ0 , θ1 , . . . are model parameters, and need to be optimized during training • Loss(error)isthesumofsquarederrors(SSE): L=􏰅Ni=1(yˆi −yi)2 Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach • How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not. p(x)=θ0 +θ1x1 +...θFxF Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach • How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not. Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach • How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not. • How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not. logp(x)=θ0 +θ1x1 +...θFxF Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach • How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not. • How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not. Logistic Regression: Derivation I • Let’s assume a binary classification task, y is true (1) or false (0). • We model probabilites P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ] • We want to use a regression approach • How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not. • How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not. • How about: minimally modifying log p(x) such that is is unbounded, by applying the logistic transformation log p(x) =θ0 +θ1x1 +...θFxF 1−p(x) Logistic Regression: Derivation II log p(x) =θ0 +θ1x1 +...θFxF 1−p(x) • also called the log odds • the odds are defined as the fraction of success over the fraction of odds = P(success) = P(success) P(failures) 1 − P(success) • e.g., the odds of rolling a 6 with a fair dice are: 1/6 = 0.17 = 0.2 1 − (1/6) 0.83 Logistic Regression: Derivation III log P(x) =θ0 +θ1x1 +...θFxF 1−P(x) If we rearrange and solve for P(x), we get exp(θ0 + θ1x1 + ...θF xF ) exp(θ0 + 􏰅Ff=1 θf xf ) P(x)= 1+exp(θ0 +θ1x1 +...θFxF) = 1+exp(θ0 +􏰅Ff=1θfxf) 11 = 1+exp(−(θ0 +θ1x1 +...θFxF)) = 1+exp(−(θ0 +􏰅Ff=1 θfxf)) • where the RHS is the inverse logit (or logistic function) • we pass a regression model through the logistic function to obtain a valid probability predicton Logistic Regression: Interpretation P(y|x;θ)= 1+exp(−(θ0 +􏰅Ff=1θfxf)) A closer look at the logistic function Most inputs lead to P(y|x)=0 or P(y|x)=1. That is intended, because all true labels are either 0 or 1. • (θ0􏰅Ff=1θfxf)>0meansy=1 • (θ0􏰅Ff=1θfxf)≈0meansmost
uncertainty
• (θ0􏰅Ff=1θfxf)<0meansy=0 Logistic Regression: Prediction • The logistic function returns the probability of P(y = 1) given an in put x 1 P(y=1|x1,x2,...,xF;θ) = 1+exp(−(θ0+􏰅Ff=1θfxf)) = σ(x;θ) • We define a decision boundary, e.g., predict y = 1 if P(y = 1|x1,x2,...,xF;θ) > 0.5 and y = 0 otherwise

11T P(y = 1|x1, x2, …, xF ; θ) = 1+exp(−(θ0+􏰅Ff=1 θf xf )) = 1+exp(−(θT x)) = σ(θ x)
Model parameters
θ = [0.1, −3.5, 0.7, 2.1]
Feature Function
(Small) Test Data set
rainy sunny
Temp Humidity
cool normal 0
Class hot high 1
1 (bias term)
1 if outlook=sunny
3 if outlook=rainy 
1 if temp=hot
2 if temp=mild 3 if temp=cool
􏰍1 if humidity=normal
2 if outlook=overcast
2 if humidity=high

11T P(y = 1|x1, x2, …, xF ; θ) = 1+exp(−(θ0+􏰅Ff=1 θf xf )) = 1+exp(−(θT x)) = σ(θ x)
Model parameters
θ = [0.1, −3.5, 0.7, 2.1]
Feature Function
(Small) Test Data set
rainy sunny
Temp Humidity
cool normal 0
Class hot high 1
1 (bias term)
1 if outlook=sunny
3 if outlook=rainy 
1 if temp=hot
2 if temp=mild 3 if temp=cool
􏰍1 if humidity=normal
2 if outlook=overcast
2 if humidity=high

Parameter Estimation
What are the four steps we would follow in finding the optimal param- eters?

Objective Function
Mimimize the Negative conditional log likelihood
L(θ) = −P(Y|X;θ) = −􏰊P(yi|xi;θ)
P(y =1|x;θ)=σ(θTx) P(y =0|x;θ)=1−σ(θTx)

Objective Function
Mimimize the Negative conditional log likelihood
L(θ) = −P(Y|X;θ) = −􏰊P(yi|xi;θ)
i=1 P(y =1|x;θ)=σ(θTx)
P(y =0|x;θ)=1−σ(θTx)
L(θ) = −P(Y|X;θ) = −􏰊P(yi|xi;θ)
=−􏰊(σ(θTxi))yi ∗(1−σ(θTxi))1−yi i=1

Objective Function
Mimimize the Negative conditional log likelihood
L(θ) = −P(Y|X;θ) = −􏰊P(yi|xi;θ)
i=1 P(y =1|x;θ)=σ(θTx)
P(y =0|x;θ)=1−σ(θTx)
L(θ) = −P(Y|X;θ) = −􏰊P(yi|xi;θ)
take the log of this function
=−􏰊(σ(θTxi))yi ∗(1−σ(θTxi))1−yi i=1
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
• Derivative of sum = sum of derivatives → focus on 1 training input
• Compute ∂L for each θj individually, so focus on 1 θj ∂θj

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
∂logL(θ) =−(y−1−y) ∂p p1−p
( because L(θ) = −[ylogp + (1 − y)log(1 − p)]

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
∂p = ∂σ(z) = σ(z)[1 − σ(z)] ∂z ∂z

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
∂z = ∂θTx=xj ∂θj ∂θj

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
=−y−1−y × σ(z)[1−σ(z)] × xj p 1−p

Take 1st Derivative of the Objective Function
logL(θ) = −􏰂yi logσ(θTxi)+(1−yi)log(1−σ(θTxi))
Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z) = σ(z)[1 − σ(z)] ∂z
• The chain rule tells us that ∂A = ∂A × ∂B × ∂C ∂D ∂B ∂C ∂D
∂logL(θ)=∂logL(θ)×∂p×∂z wherep=σ(θTx) and z=θTx ∂θj ∂p ∂z ∂θj
=−y−1−y × σ(z)[1−σ(z)] × xj p 1−p
=􏰋σ(θTx)−y􏰌 × xj

Logistic Regression: Parameter Estimation III
The derivative of the log likelihood wrt. a single parameter θj for all training examples
logL(θ) =􏰂N 􏰃σ(θTxi)−yi􏰄xi ∂θ j
• Now, we would set derivatives to zero (Step 3) and solve for θ (Step 4) • Unfortunately, that’s not straightforward here (as for Naive Bayes)
• Instead, we will use an iterative method: Gradient Descent

Logistic Regression: Parameter Estimation III
The derivative of the log likelihood wrt. a single parameter θj for all training examples
logL(θ) =􏰂N 􏰃σ(θTxi)−yi􏰄xi ∂θ j
• Now, we would set derivatives to zero (Step 3) and solve for θ (Step 4) • Unfortunately, that’s not straightforward here (as for Naive Bayes)
• Instead, we will use an iterative method: Gradient Descent
θ(new) ← θ(old) − η ∂ log L(θ) j j ∂θj
θ(new) ← θ(old) −η􏰂􏰃σ(θTxi)−yi􏰄xi jjj

Multinomial Logistic Regression
• So far we looked at problems where either y = 0 or y = 1 (e.g., spam classification: y ∈ {play, not play})
P(y=1|x;θ)=σ(θTx) = exp(θTx) 1+exp(θTx)
P(y=0|x;θ)=1−σ(θTx) =1− exp(θTx) 1+exp(θTx)

Multinomial Logistic Regression
• So far we looked at problems where either y = 0 or y = 1 (e.g., spam classification: y ∈ {play, not play})
P(y=1|x;θ)=σ(θTx) = exp(θTx) 1+exp(θTx)
P(y=0|x;θ)=1−σ(θTx) =1− exp(θTx) 1+exp(θTx)
• But what if we have more than 2 classes, e.g., y ∈ {positive, negative, neutral}
• we predict the probability of each class c by passing the input representation through the softmax function, a generalization of the sigmoid
exp(θcx) p(y = c|x;θ) = 􏰅k exp(θkx)
• we learn a parameter vector θc for each class c

Example! Multi-class with 1-hot features
p(y=c|x;θ)= exp(θcx) 􏰅k exp(θkx)
(Small) Test Data set
Outlook Temp
rainy cool sunny cool sunny hot
Feature Function
normal normal high
0 (don’t play) 1 (maybe play) 2 (play)
x0 = 1 (bias term) 
1 if outlook=sunny [100] if outlook=sunny 
2 if outlook=overcast x1 = [010] if outlook=overcast
3 if outlook=rainy [001] if outlook=rainy 
1 if temp=hot [100] if temp=hot 
1 (bias term)
2 if temp=mild 3 if temp=

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com