
Lecturer: Ben Rubinstein

Lecture 4. Iterative Optimisation &
Logistic Regression

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

This lecture

• Iterative optimisation for extremum estimators
∗ First-order method: Gradient descent
∗ Second-order: Newton-Raphson method

Later: Lagrangian duality

• Logistic regression: workhorse linear classifier
∗ Possibly familiar derivation: frequentist
∗ Decision-theoretic derivation
∗ Training with Newton-Raphson looks like repeated, weighted linear regression


Gradient Descent

Brief review of the most basic optimisation approach in ML


Optimisation formulations in ML

• Training = Fitting = Parameter estimation

• Typical formulation

    θ̂ ∈ argmin_{θ∈Θ} L(data, θ)

∗ argmin because we want a minimiser not the minimum
• Note: argmin can return a set (minimiser not always unique!)

∗ Θ denotes a model family (including constraints)
∗ L denotes some objective function to be optimised

• E.g. MLE: (conditional) likelihood
• E.g. Decision theory: (regularised) empirical risk


One we’ve seen: Log trick

• Instead of optimising L(θ), try the convenient log L(θ)

• Why are we allowed to do this?

• Strictly monotonic function: a > b ⟹ f(a) > f(b)
∗ Example: the log function!

• Lemma: Consider any objective function L(θ) and any strictly monotonic f. θ* is an optimiser of L(θ) if and only if it is an optimiser of f(L(θ)).
∗ Proof: Try it at home for fun!
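As a quick numerical sanity check of the lemma (a sketch assuming NumPy; the objective L below is made up), the minimiser of L and of log L coincide on a grid:

```python
import numpy as np

def L(theta):
    # A made-up strictly positive objective, so log L is well defined
    return (theta - 3.0) ** 2 + 1.0

thetas = np.linspace(-10, 10, 10001)                  # grid of candidate θ
argmin_L = thetas[np.argmin(L(thetas))]               # minimiser of L
argmin_logL = thetas[np.argmin(np.log(L(thetas)))]    # minimiser of log L
print(argmin_L, argmin_logL)                          # both 3.0, as the lemma predicts
```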


Two solution approaches
• Analytic (aka closed form) solution

∗ Known only in limited number of cases
∗ Use the 1st-order necessary condition for optimality*:

    ∂L/∂θ₁ = ⋯ = ∂L/∂θₚ = 0

• Approximate iterative solution
1. Initialisation: choose starting guess θ^(1), set i = 1
2. Update: θ^(i+1) ← SomeRule(θ^(i)), set i ← i + 1
3. Termination: decide whether to Stop
4. Go to Step 2
5. Stop: return θ̂ ≈ θ^(i)

(Assuming an unconstrained, differentiable L.)

* Note: to check for a local minimum, need a positive 2nd derivative (or positive definite Hessian); this assumes the problem is unconstrained – in general, need to also check boundaries. See also Lagrangian techniques later in the subject.
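To make the two approaches concrete, here is a minimal sketch (assuming NumPy and a toy least-squares objective; the data, step size and iteration count are illustrative choices, not from the slides) that computes the closed-form solution from the first-order condition and also runs the generic iterative template with a gradient step as the update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # toy design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)       # toy labels

# Analytic (closed-form) solution: solve the first-order condition X'X w = X'y
w_analytic = np.linalg.solve(X.T @ X, X.T @ y)

# Approximate iterative solution, following the template above
w = np.zeros(3)                                   # 1. initialisation
eta = 0.1                                         # step size (illustrative)
for i in range(1, 1001):                          # 2.-4. update until termination
    grad = X.T @ (X @ w - y) / len(y)             # gradient of mean squared error
    w = w - eta * grad                            # one possible "SomeRule"
print(w_analytic, w)                              # 5. both ≈ w_true
```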

Reminder: The gradient

• Gradient at θ defined as (∂L/∂θ₁, …, ∂L/∂θₚ)′ evaluated at θ

• The gradient points in the direction of maximal change of L(θ) when departing from point θ

• Shorthand notation
 ∗ ∇L ≝ (∂L/∂θ₁, …, ∂L/∂θₚ)′ computed at point θ
 ∗ Here ∇ is the “nabla” symbol

• Hessian matrix at θ: (∇²L)ᵢⱼ = ∂²L/∂θᵢ∂θⱼ
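As a quick concrete check (a sketch assuming NumPy; the quadratic objective here is made up), central finite differences recover the gradient of a function whose gradient we know in closed form:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def L(theta):
    # Toy objective L(θ) = ½ θ'Aθ + b'θ, so ∇L(θ) = Aθ + b and the Hessian is A
    return 0.5 * theta @ A @ theta + b @ theta

def numerical_gradient(f, theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)   # central differences
    return grad

theta = np.array([0.5, -0.5])
print(numerical_gradient(L, theta), A @ theta + b)   # both ≈ [2.0, -1.5]
```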

Gradient descent and SGD
1. Choose θ^(1) and some T

2. For i from 1 to T*
 1. θ^(i+1) = θ^(i) − η ∇L(θ^(i))

3. Return θ̂ ≈ θ^(i)

• Note: the learning rate η may be dynamically updated at each step

• Variants: Momentum, AdaGrad, …

• Stochastic gradient descent: two loops
 ∗ Outer for loop: each iteration (called an epoch) sweeps through all training data
 ∗ Within each epoch, randomly shuffle the training data; then an inner for loop does gradient steps on batches of data only. Batch size might be 1 or a few

*Other stopping criteria can be used
[Figure: gradient descent steps on a 2D objective surface, starting from θ^(0). Wikimedia Commons; authors: Olegalexandrov, Zerodamage.]

(Assuming L is differentiable.)
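A minimal sketch of both loops (assuming NumPy; the gradient functions, learning rate, batch size and epoch count are illustrative placeholders, not from the slides):

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, T=100):
    """Batch gradient descent: theta <- theta - eta * grad_L(theta), repeated T times."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta - eta * grad_L(theta)
    return theta

def sgd(grad_batch, theta0, n, eta=0.1, epochs=10, batch_size=1, seed=0):
    """SGD: each epoch shuffles the data, then steps on small batches.
    grad_batch(theta, idx) returns the gradient over the examples indexed by idx."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):                       # outer loop: one epoch per sweep
        order = rng.permutation(n)                # random shuffle of training data
        for start in range(0, n, batch_size):     # inner loop: mini-batches
            idx = order[start:start + batch_size]
            theta = theta - eta * grad_batch(theta, idx)
    return theta

# Example: minimise L(θ) = (θ − 3)², whose gradient is 2(θ − 3)
print(gradient_descent(lambda th: 2 * (th - 3.0), np.array([0.0])))   # ≈ [3.]
```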

Convex objective functions

• ‘Bowl shaped’ functions

• Informally: if line segment between
any two points on graph of function
lies above or on graph

• Formally*, f: D → ℝ is convex if ∀ a, b ∈ D, t ∈ [0, 1]:

    f(ta + (1 − t)b) ≤ t f(a) + (1 − t) f(b)
 Strictly convex if the inequality is strict (<)

• Gradient descent on a (strictly) convex function is guaranteed to find a (unique) global minimum!

* Aside: Equivalently we can look to the second derivative. For f defined on scalars, it should be non-negative; for multivariate f, the Hessian matrix should be positive semi-definite (see linear algebra supplemental deck).

Newton-Raphson

A second-order method;
successive root finding in the objective’s derivative.

Newton-Raphson: Derivation (1D)

• Critical points of L(θ) = zero-crossings of L′(θ)

• Consider the case of scalar θ. Starting at a given/random θ₀, iteratively:
 1. Fit a tangent line to L′(θ) at θₜ
 2. Need to find θₜ₊₁ = θₜ + Δ using the linear approximation’s zero crossing
 3. Tangent line given by the derivative: rise/run = −L″(θₜ) = L′(θₜ)/Δ
 4. Therefore the iterate is θₜ₊₁ = θₜ − L′(θₜ)/L″(θₜ)

[Figure: successive tangent lines to L′(θ) at θₜ, θₜ₊₁, θₜ₊₂, each crossing zero at the next iterate.]

Newton-Raphson: General case

• Newton-Raphson summary
 ∗ Finds L′(θ) zero-crossings
 ∗ By successive linear approximations to L′(θ)
 ∗ Linear approximations involve the derivative of L′(θ), i.e. L″(θ)

• Vector-valued θ: how to fix the scalar update θₜ₊₁ = θₜ − L′(θₜ)/L″(θₜ)?
 ∗ L′(θ) is ∇L(θ)
 ∗ L″(θ) is ∇²L(θ)
 ∗ Matrix division is matrix inversion

• General case: θₜ₊₁ = θₜ − (∇²L(θₜ))⁻¹ ∇L(θₜ)
 ∗ Pro: may converge faster; fits a quadratic using curvature information
 ∗ Con: sometimes computationally expensive, unless approximating the Hessian

…And much much more

• What if you have constraints?
 ∗ See Lagrange multipliers (lets you bring constraints into the objective)
 ∗ Or projected gradient descent (iterate between a GD step on the objective and projection back onto the constraint set)

• What about speed of convergence?

• Do you really need differentiable objectives? (No: subgradients)

• Are there more tricks? (Hell yeah! But outside scope here)

Reference: Boyd & Vandenberghe, Convex Optimization. Free at http://web.stanford.edu/~boyd/cvxbook/

Mini Summary

• Iterative optimisation for ML
 ∗ First-order: gradient descent and stochastic GD
 ∗ Convex objectives: convergence to global optima
 ∗ Second-order: Newton-Raphson can be faster, but can be expensive to build/invert the full Hessian

Next: Logistic regression for binary classification

Logistic Regression Model

A workhorse linear, binary classifier;
(a review for some of you; new to some).

Binary classification: Example

• Example: given body mass index (BMI), does a patient have type 2 diabetes (T2D)?

• This is (supervised) binary classification

• One could use linear regression
 ∗ Fit a line/hyperplane to the data (find weights w)
 ∗ Denote s ≡ x′w
 ∗ Predict “Yes” if s ≥ 0.5
 ∗ Predict “No” if s < 0.5

[Figure: T2D (0/1) vs BMI with a fitted line; the 0.5 threshold separates “predict no” from “predict yes”.]

Why not linear regression

• Due to the square loss, points far from the boundary have their loss squared – even if they’re confidently correct!

• Such “outliers” will “pull at” the linear regression

• Overall, the least-squares criterion looks unnatural in this setting

[Figure: same T2D vs BMI data; extreme-BMI points drag the fitted line.]

Logistic regression model

[Figure: T2D (0/1) vs BMI with a logistic curve; the 0.5 threshold separates “predict no” from “predict yes”.]

• Probabilistic approach to classification
 ∗ P(Y = 1|x) = f(x) = ?
 ∗ Use a linear function?
 E.g., s(x) = x′w

• Problem: the probability needs to be between 0 and 1

• Logistic function: f(s) = 1 / (1 + exp(−s))

• Logistic regression model:

    P(Y = 1|x) = 1 / (1 + exp(−x′w))

How is logistic regression linear?

• Logistic regression model:

    P(Y = 1|x) = 1 / (1 + exp(−x′w))

• Classification rule: if P(Y = 1|x) > 1/2 then class “1”, else class “0”

• Decision boundary is the set of x’s such that

    1 / (1 + exp(−x′w)) = 1/2

 i.e. exactly where x′w = 0 – a linear (hyperplane) boundary

[Figure: the logistic function f(s) = 1 / (1 + exp(−s)), mapping the reals s to probabilities f(s) in (0, 1).]
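A minimal sketch of the model and its decision rule (assuming NumPy; the weights and inputs below are illustrative, not fitted):

```python
import numpy as np

def logistic(s):
    """Logistic function: maps reals to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def predict_proba(X, w):
    """P(Y = 1 | x) = logistic(x'w) for each row x of X."""
    return logistic(X @ w)

def predict(X, w):
    """Class 1 iff P(Y = 1 | x) > 1/2, i.e. iff x'w > 0."""
    return (X @ w > 0).astype(int)

# Illustrative weights [w0, w1, w2], with a bias feature of 1 prepended to each x
w = np.array([0.0, 2.0, -1.0])
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 2.0]])
print(predict_proba(X, w), predict(X, w))   # ≈ [0.73, 0.02] and [1, 0]
```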

Effect of parameter vector (2D problem)

• Decision boundary is the line where P(Y = 1|x) = 0.5
 ∗ In higher dimensional problems, the decision boundary is a plane or hyperplane

• Vector w is perpendicular to the decision boundary (see linear algebra review topic)
 ∗ That is, w is a normal to the decision boundary
 ∗ Note: in this illustration we assume w₀ = 0 for simplicity

[Figure: weight vectors w shown as normals to their decision boundaries. Murphy, Fig. 8.1, p. 246.]

Linear vs. logistic probabilistic models

• Linear regression assumes a Normal distribution with a fixed variance and mean given by the linear model

    p(y|x) = Normal(x′w, σ²)

• Logistic regression assumes a Bernoulli distribution with parameter given by the logistic transform of the linear model

    p(y|x) = Bernoulli(logistic(x′w))

• Recall that the Bernoulli distribution is defined as

    p(1) = θ and p(0) = 1 − θ for θ ∈ [0, 1]

• Equivalently, p(y) = θ^y (1 − θ)^(1−y) for y ∈ {0, 1}


Training as Max-Likelihood Estimation

• Assuming independence, the probability of the data is

    p(y₁, …, yₙ | x₁, …, xₙ) = ∏ᵢ₌₁ⁿ p(yᵢ|xᵢ)

• Assuming a Bernoulli distribution we have

    p(yᵢ|xᵢ) = θ(xᵢ)^(yᵢ) (1 − θ(xᵢ))^(1−yᵢ)

 where θ(xᵢ) = 1 / (1 + exp(−xᵢ′w))

• Training: maximise this expression with respect to the weights w


Apply log trick, simplify

• Instead of maximising the likelihood, maximise its logarithm

    log ∏ᵢ₌₁ⁿ p(yᵢ|xᵢ) = ∑ᵢ₌₁ⁿ log p(yᵢ|xᵢ)

        = ∑ᵢ₌₁ⁿ log [ θ(xᵢ)^(yᵢ) (1 − θ(xᵢ))^(1−yᵢ) ]

        = ∑ᵢ₌₁ⁿ [ yᵢ log θ(xᵢ) + (1 − yᵢ) log(1 − θ(xᵢ)) ]

        = ∑ᵢ₌₁ⁿ [ (yᵢ − 1) xᵢ′w − log(1 + exp(−xᵢ′w)) ]

(This expression cannot be maximised analytically – there is no closed-form solution.)
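As a concrete check of the last equality (a sketch assuming NumPy; the random data are made up), both forms of the log-likelihood agree numerically:

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_likelihood(w, X, y):
    """Sum over i of y_i log θ(x_i) + (1 − y_i) log(1 − θ(x_i))."""
    theta = logistic(X @ w)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

def log_likelihood_simplified(w, X, y):
    """Equivalent simplified form: sum of (y_i − 1) x_i'w − log(1 + exp(−x_i'w))."""
    z = X @ w
    return np.sum((y - 1) * z - np.log(1 + np.exp(-z)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5)
w = rng.normal(size=3)
print(log_likelihood(w, X, y), log_likelihood_simplified(w, X, y))   # equal
```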

Logistic Regression:
Decision-Theoretic View

Via cross-entropy loss


Background: Cross entropy
• Cross entropy is an information-theoretic method for comparing two distributions

• Cross entropy is a measure of divergence between a reference distribution g_ref(a) and an estimated distribution g_est(a). For discrete distributions:

    H(g_ref, g_est) = −∑_{a∈A} g_ref(a) log g_est(a)

 A is the support of the distributions, e.g., A = {0, 1}
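A minimal sketch of discrete cross entropy (assuming NumPy; the example distributions are made up):

```python
import numpy as np

def cross_entropy(g_ref, g_est):
    """H(g_ref, g_est) = -sum over a of g_ref(a) * log g_est(a), over a finite support."""
    g_ref = np.asarray(g_ref, dtype=float)
    g_est = np.asarray(g_est, dtype=float)
    return -np.sum(g_ref * np.log(g_est))

# Example over A = {0, 1}: reference puts all mass on class 1, estimate assigns it 0.8
print(cross_entropy([0.0, 1.0], [0.2, 0.8]))   # = -log(0.8) ≈ 0.223
```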


Training as cross-entropy minimisation

• Consider the log-likelihood for a single data point:

    log p(yᵢ|xᵢ) = yᵢ log θ(xᵢ) + (1 − yᵢ) log(1 − θ(xᵢ))

• Cross entropy: H(g_ref, g_est) = −∑_a g_ref(a) log g_est(a)
 ∗ If the reference (true) distribution is g_ref(1) = yᵢ and g_ref(0) = 1 − yᵢ
 ∗ With logistic regression estimating this distribution as g_est(1) = θ(xᵢ) and g_est(0) = 1 − θ(xᵢ)

 then training finds the w that minimises the sum of cross entropies over training points


Mini Summary

• Logistic regression formulation
∗ A workhorse linear binary classifier
∗ Frequentist: Bernoulli label with coin bias logistic-linear in x
∗ Decision theory: Minimising cross entropy with labels

Next: Training quickly with Newton-Raphson, and how that is
repeated (weighted) linear regression under the hood!


Training Logistic Regression:
the IRLS Algorithm

Analytical? Newton-Raphson!


Iterative optimisation
• Training logistic regression: find w maximising the log-likelihood L(w) (equivalently, minimising the cross-entropy loss)

• Bad news: no closed-form solution
• Good news: the problem is strictly convex, if there are no irrelevant features → convergence!

    Look ahead: regularisation for irrelevant features

• How does gradient descent work?

• If μ(z) = 1 / (1 + exp(−z)) then dμ/dz = μ(z)(1 − μ(z))

• Then ∇L(w) = ∑ₙ₌₁ⁿ (yₙ − μ(xₙ)) xₙ = X′(y − μ),

 stacking the instances in X, the labels in y, and the μ(xₙ) in μ

[Figure: Murphy, Fig. 8.3, p. 247 – log-likelihood surface over weights (w₁, w₂).]

Note (abusing notation): μ(xₙ) here means μ(z) with z = w′xₙ – the meaning follows from the input type.
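Putting the pieces together, a sketch of gradient-based training using the gradient above (assuming NumPy; the averaged gradient, step size, iteration count and toy data are illustrative choices, not from the slides):

```python
import numpy as np

def mu(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eta=1.0, iters=2000):
    """Gradient ascent on the (averaged) log-likelihood: w <- w + eta * X'(y - mu)/n."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - mu(X @ w)) / len(y)   # ∇L(w)/n = X'(y − μ)/n
        w = w + eta * grad                      # ascent, since we maximise L(w)
    return w

# Toy data: P(Y = 1|x) = logistic(2·x₁), with a bias column of ones prepended
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.uniform(size=200) < mu(2.0 * X[:, 1])).astype(float)
print(fit_logistic_gd(X, y))   # roughly [0, 2], up to sampling noise
```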

Iteratively-Reweighted Least Squares

• Instead of GD, let’s apply Newton-Raphson → the IRLS algorithm

• Recall: ∇L(w) = X′(y − μ). Differentiate again for the Hessian:

    ∇²L(w) = −∑ᵢ (dμ/dzᵢ) xᵢxᵢ′ = −∑ᵢ μ(xᵢ)(1 − μ(xᵢ)) xᵢxᵢ′
        = −X′MX, where M is diagonal with Mᵢᵢ = μᵢ(1 − μᵢ)

• Newton-Raphson then says (now with Mₜ, μₜ depending on wₜ)

    wₜ₊₁ = wₜ − (∇²L)⁻¹∇L = wₜ + (X′MₜX)⁻¹X′(y − μₜ)
        = (X′MₜX)⁻¹ (X′MₜXwₜ + X′(y − μₜ))
        = (X′MₜX)⁻¹X′Mₜbₜ, where bₜ = Xwₜ + Mₜ⁻¹(y − μₜ)

• Each IRLS iteration solves a least-squares problem weighted by Mₜ, and these weights are recomputed at every iteration!

(Compare to the normal equations for linear regression.)
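A sketch of the IRLS update (assuming NumPy; the probability clipping and the stopping tolerance are my additions for numerical stability, not from the slides):

```python
import numpy as np

def mu(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, iters=25, tol=1e-8):
    """Newton-Raphson for logistic regression: each step is a weighted least-squares solve."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = np.clip(mu(X @ w), 1e-9, 1 - 1e-9)      # clip to avoid zero weights
        M = m * (1 - m)                              # diagonal of M_t (Bernoulli variances)
        b = X @ w + (y - m) / M                      # working response b_t
        A = X.T @ (M[:, None] * X)                   # X' M_t X
        w_new = np.linalg.solve(A, X.T @ (M * b))    # (X'M_tX)^{-1} X'M_t b_t
        if np.max(np.abs(w_new - w)) < tol:          # simple stopping rule
            return w_new
        w = w_new
    return w
```

On the toy data from the previous sketch, this typically matches the gradient-ascent solution in far fewer iterations.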

IRLS intuition: Putting labels on linear scale

IRLS: wₜ₊₁ = (X′MₜX)⁻¹ X′Mₜbₜ
 where bₜ = Xwₜ + Mₜ⁻¹(y − μₜ)
 and Mᵢᵢ = μₜ(xᵢ)[1 − μₜ(xᵢ)], with off-diagonal entries 0
 and μₜ(x) = (1 + exp(−wₜ′x))⁻¹

• The y are not on a linear scale. Invert the logistic function?

• The bₜ are a “linearised” approximation to these: the bₜ equation matches a linear approximation to μₜ⁻¹(y).

• Linear regression on the new labels!


IRLS intuition: Equalising label variance

IRLS: wₜ₊₁ = (X′MₜX)⁻¹ X′Mₜbₜ
 where bₜ = Xwₜ + Mₜ⁻¹(y − μₜ)
 and Mᵢᵢ = μₜ(xᵢ)[1 − μₜ(xᵢ)], with off-diagonal entries 0
 and μₜ(x) = (1 + exp(−wₜ′x))⁻¹

• In linear regression, each yᵢ has equal variance σ²

• Our yᵢ are Bernoulli, with variance μₜ(xᵢ)[1 − μₜ(xᵢ)]

• Our reweighting standardises, dividing by the variances!

Fun exercise: show that Newton-Raphson applied to linear regression gives you the normal equations!
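For reference, a sketch of that exercise under the squared-error objective (my working, so treat it as a spoiler):

```latex
% Squared-error objective: L(w) = (1/2) ||y - Xw||^2
\nabla L(\mathbf{w})   = -\mathbf{X}'(\mathbf{y} - \mathbf{X}\mathbf{w}), \qquad
\nabla^2 L(\mathbf{w}) = \mathbf{X}'\mathbf{X}
% One Newton-Raphson step from any w_t:
\mathbf{w}_{t+1}
  = \mathbf{w}_t - (\mathbf{X}'\mathbf{X})^{-1}\left(-\mathbf{X}'(\mathbf{y} - \mathbf{X}\mathbf{w}_t)\right)
  = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
% i.e. the normal-equations solution, reached in a single step (constant Hessian, no reweighting needed).
```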

Mini Summary

• Training logistic regression
∗ No analytical solution
∗ Gradient descent possible, but convergence rate not ideal
∗ Newton-Raphson: iteratively reweighted least squares

Next time: Regularised linear regression for avoiding
overfitting and ill-posed optimisation
