Lecturer: Ben Rubinstein
Lecture 4. Iterative Optimisation &
Logistic Regression
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
• Iterative optimisation for extremum estimators
∗ First-order method: Gradient descent
∗ Second-order: Newton-Raphson method
Later: Lagrangian duality
• Logistic regression: workhorse linear classifier
∗ Possibly familiar derivation: frequentist
∗ Decision-theoretic derivation
∗ Training with Newton-Raphson looks like
repeated, weighted linear regression
Gradient Descent
A brief review of the most basic optimisation approach in ML
Optimisation formulations in ML
• Training = Fitting = Parameter estimation
• Typical formulation: $\hat{\boldsymbol{\theta}} \in \operatorname{argmin}_{\boldsymbol{\theta} \in \Theta} L(\text{data}, \boldsymbol{\theta})$
∗ argmin because we want a minimiser not the minimum
• Note: argmin can return a set (minimiser not always unique!)
∗ Θ denotes a model family (including constraints)
∗ L denotes some objective function to be optimised
• E.g. MLE: (conditional) likelihood
• E.g. Decision theory: (regularised) empirical risk
One we’ve seen: Log trick
• Instead of optimising $L(\theta)$, try the more convenient $\log L(\theta)$
• Why are we allowed to do this?
• Strictly monotonic function: $a > b \implies f(a) > f(b)$
∗ Example: the log function!
• Lemma: Consider any objective function $L(\theta)$ and any strictly monotonic $f$. Then $\theta^*$ is an optimiser of $L(\theta)$ if and only if it is an optimiser of $f(L(\theta))$.
∗ Proof: Try it at home for fun!
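To see the lemma in action, here is a small numerical sketch; the Bernoulli likelihood below is an assumed toy example (not from the slides), showing that a likelihood and its logarithm share the same maximiser.

```python
import numpy as np

# Sketch: the maximiser of a toy likelihood L(theta) coincides with the
# maximiser of log L(theta), because log is strictly monotonic (increasing).
# Assumed toy example: Bernoulli likelihood of 7 heads in 10 tosses,
# as a function of the coin bias theta.
thetas = np.linspace(0.01, 0.99, 981)
likelihood = thetas**7 * (1 - thetas)**3
log_likelihood = 7 * np.log(thetas) + 3 * np.log(1 - thetas)

print(thetas[np.argmax(likelihood)])      # ~0.7
print(thetas[np.argmax(log_likelihood)])  # same maximiser
```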
Two solution approaches
• Analytic (aka closed form) solution
∗ Known only in limited number of cases
∗ Use the 1st-order necessary condition for optimality*: $\frac{\partial L}{\partial \theta_1} = \cdots = \frac{\partial L}{\partial \theta_p} = 0$
• Approximate iterative solution
1. Initialisation: choose starting guess $\boldsymbol{\theta}^{(1)}$, set $i = 1$
2. Update: $\boldsymbol{\theta}^{(i+1)} \leftarrow \mathrm{SomeRule}(\boldsymbol{\theta}^{(i)})$, set $i \leftarrow i + 1$
3. Termination: decide whether to Stop
4. Go to Step 2
5. Stop: return $\hat{\boldsymbol{\theta}} \approx \boldsymbol{\theta}^{(i)}$
(Assuming unconstrained, differentiable $L$)
* Note: to check for local minimum, need positive 2nd derivative (or Hessian positive definite); this assumes
unconstrained – in general need to also check boundaries. See also Lagrangian techniques later in subject.
Reminder: The gradient
• Gradient at $\boldsymbol{\theta}$ defined as $\left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_p}\right)'$, evaluated at $\boldsymbol{\theta}$
• The gradient points in the direction of maximal change of $L(\boldsymbol{\theta})$ when departing from point $\boldsymbol{\theta}$
• Shorthand notation
∗ $\nabla L \triangleq \left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_p}\right)'$ computed at point $\boldsymbol{\theta}$
∗ Here $\nabla$ is the "nabla" symbol
• Hessian matrix at $\boldsymbol{\theta}$: $\left(\nabla^2 L\right)_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}$
Gradient descent and SGD
1. Choose $\boldsymbol{\theta}^{(1)}$ and some $T$
2. For $i$ from 1 to $T$*
   1. $\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \eta\,\nabla L(\boldsymbol{\theta}^{(i)})$
3. Return $\hat{\boldsymbol{\theta}} \approx \boldsymbol{\theta}^{(i)}$
• Note: $\eta$ can be dynamically updated per step
• Variants: Momentum, AdaGrad, …
• Stochastic gradient descent: two loops
∗ Outer for loop: each loop (called an epoch) sweeps through all training data
∗ Within each epoch, randomly shuffle the training data; then inner for loop: do gradient steps only on batches of data. Batch size might be 1 or a few (see the sketch below)
*Other stopping criteria can be used
[Figure: gradient descent iterates from $\boldsymbol{\theta}^{(0)}$. Wikimedia Commons; authors: Olegalexandrov, Zerodamage]
(Assuming $L$ is differentiable)
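A minimal Python sketch of batch GD and SGD, assuming a toy squared-error objective on synthetic linear-regression data (the data, step size and batch count are assumptions for illustration; the slides keep $L$ generic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

eta, T = 0.1, 200
w = np.zeros(3)                        # theta^(1)
for _ in range(T):                     # batch GD: full data each step
    w = w - eta * grad(w, X, y)

w_sgd = np.zeros(3)
for epoch in range(20):                # SGD: each epoch sweeps the data once
    perm = rng.permutation(len(y))     # shuffle, then take small batches
    for batch in np.array_split(perm, 20):
        w_sgd = w_sgd - eta * grad(w_sgd, X[batch], y[batch])

print(w, w_sgd)                        # both approach [1, -2, 0.5]
```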
Convex objective functions
• ‘Bowl shaped’ functions
• Informally: if line segment between
any two points on graph of function
lies above or on graph
• Formally*, $f: D \to \mathbf{R}$ is convex if $\forall \boldsymbol{a}, \boldsymbol{b} \in D$, $t \in [0,1]$: $f(t\boldsymbol{a} + (1-t)\boldsymbol{b}) \le t f(\boldsymbol{a}) + (1-t) f(\boldsymbol{b})$ (spot-checked numerically below)
∗ Strictly convex if the inequality is strict (<)
• Gradient descent on (strictly) convex function
guaranteed to find a (unique) global minimum!
* Aside: Equivalently we can look to the second derivative. For $f$ defined on scalars, it should be non-negative; for multivariate $f$, the Hessian matrix should be positive semi-definite (see linear algebra supplemental deck).
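A quick numerical spot-check of the convexity inequality, using an assumed toy function $f(\boldsymbol{w}) = \|\boldsymbol{w}\|^2$ (illustration only; a finite check is evidence, not a proof):

```python
import numpy as np

# Spot-check f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) on random points,
# for the 'bowl-shaped' (strictly convex) function f(w) = sum(w**2).
f = lambda w: np.sum(w**2)

rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = rng.normal(size=2), rng.normal(size=2)
    t = rng.uniform()
    assert f(t * a + (1 - t) * b) <= t * f(a) + (1 - t) * f(b) + 1e-12
print("no violations found")
```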
Newton-Raphson
A second-order method: successive root-finding on the objective's derivative.
Newton-Raphson: Derivation (1D)
• Critical points of $L(\theta)$ = zero-crossings of $L'(\theta)$
• Consider the case of scalar $\theta$. Starting at a given/random $\theta_0$, iteratively:
1. Fit a tangent line to $L'(\theta)$ at $\theta_t$
2. Find $\theta_{t+1} = \theta_t + \Delta$ using the linear approximation's zero-crossing
3. The tangent has slope $L''(\theta_t)$, so its zero-crossing satisfies $L'(\theta_t) + L''(\theta_t)\,\Delta = 0$, giving $\Delta = -L'(\theta_t)/L''(\theta_t)$
4. Therefore the iterate is $\theta_{t+1} = \theta_t - L'(\theta_t)/L''(\theta_t)$ (see the 1D sketch below)
[Figure: $L'(\theta)$ with its tangent at $\theta_t$; the tangent's zero-crossing gives $\theta_{t+1} = \theta_t + \Delta$, then $\theta_{t+2}$, and so on]
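A minimal 1D Newton-Raphson sketch on an assumed toy objective $L(\theta) = \theta^4 - 3\theta^2 + \theta$ (not from the slides), iterating $\theta_{t+1} = \theta_t - L'(\theta_t)/L''(\theta_t)$:

```python
# 1D Newton-Raphson on an assumed toy objective L(theta) = theta^4 - 3*theta^2 + theta.

def L_prime(t):
    return 4 * t**3 - 6 * t + 1       # L'(theta)

def L_double_prime(t):
    return 12 * t**2 - 6              # L''(theta)

theta = 2.0                           # starting guess theta_0
for _ in range(10):
    theta = theta - L_prime(theta) / L_double_prime(theta)
print(theta, L_prime(theta))          # a critical point: L'(theta) ~ 0
```

Note the iteration only seeks a zero of $L'$; whether that critical point is a minimum depends on the sign of $L''$ there, as flagged in the earlier footnote.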
Newton-Raphson: General case
• Newton-Raphson summary
∗ Finds zero-crossings of $L'(\theta)$
∗ By successive linear approximations to $L'(\theta)$
∗ Linear approximations involve the derivative of $L'(\theta)$, i.e. $L''(\theta)$
• Vector-valued $\boldsymbol{\theta}$: how to fix the scalar update $\theta_{t+1} = \theta_t - L'(\theta_t)/L''(\theta_t)$?
∗ $L'(\theta)$ becomes $\nabla L(\boldsymbol{\theta})$
∗ $L''(\theta)$ becomes $\nabla^2 L(\boldsymbol{\theta})$ (the Hessian)
∗ "Division" becomes matrix inversion
• General case: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \left(\nabla^2 L(\boldsymbol{\theta}_t)\right)^{-1}\nabla L(\boldsymbol{\theta}_t)$ (see the sketch below)
∗ Pro: may converge faster; fits a quadratic using curvature information
∗ Con: sometimes computationally expensive, unless approximating the Hessian
[Figure: public domain, Wikipedia]
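A sketch of one multivariate Newton-Raphson step on an assumed quadratic objective $L(\boldsymbol{\theta}) = \tfrac{1}{2}\boldsymbol{\theta}'A\boldsymbol{\theta} - \boldsymbol{b}'\boldsymbol{\theta}$ (illustration only); for a quadratic, a single step lands exactly on the minimiser:

```python
import numpy as np

# One Newton step theta <- theta - (hessian)^{-1} gradient on an assumed
# quadratic L(theta) = 0.5 * theta' A theta - b' theta.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
b = np.array([1.0, -1.0])

theta = np.zeros(2)
gradient = A @ theta - b                 # grad L(theta)
hessian = A                              # Hessian of L (constant here)
theta = theta - np.linalg.solve(hessian, gradient)
print(theta, A @ theta - b)              # gradient is now ~ 0
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual way to apply $\left(\nabla^2 L\right)^{-1}\nabla L$ in practice.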
…And much much more
• What if you have constraints?
∗ See Lagrangian multipliers (lets you bring constraints into the objective)
∗ Or, projected gradient descent (alternate GD steps on the objective with projections back onto the constraint set)
• What about speed of convergence?
• Do you really need differentiable objectives? (No: subgradients)
• Are there more tricks? (Hell yeah! But outside scope here)
Boyd & Vandenberghe, Convex Optimization: free at http://web.stanford.edu/~boyd/cvxbook/
Mini Summary
• Iterative optimisation for ML
∗ First-order: Gradient Descent and Stochastic GD
∗ Convex objectives: Convergence to global optima
∗ Second-order: Newton-Raphson can be faster, but can be expensive to build/invert the full Hessian
Next: Logistic regression for binary classification
Logistic Regression Model
A workhorse linear, binary classifier;
(A review for some of you; new to some.)
Binary classification: Example
• Example: given body mass
index (BMI) does a patient
have type 2 diabetes (T2D)?
• This is (supervised) binary
classification
• One could use linear regression
∗ Fit a line/hyperplane to the data (find weights $\boldsymbol{w}$)
∗ Denote $s \equiv \boldsymbol{x}'\boldsymbol{w}$
∗ Predict "Yes" if $s \ge 0.5$
∗ Predict "No" if $s < 0.5$
[Figure: T2D (0/1) vs BMI, with a fitted line and the 0.5 threshold separating "predict yes" from "predict no"]
Why not linear regression
• Due to the square loss, points far from the decision boundary incur large squared loss, even if they are confidently correct!
• Such "outliers" will "pull at" the linear regression fit
• Overall, the least-squares criterion looks unnatural in this setting
[Figure: T2D (0/1) vs BMI]
Logistic regression model
[Figure: T2D (0/1) vs BMI with the logistic curve and 0.5 threshold]
• Probabilistic approach to classification
∗ $P(Y=1|\boldsymbol{x}) = f(\boldsymbol{x}) = ?$
∗ Use a linear function? E.g., $s(\boldsymbol{x}) = \boldsymbol{x}'\boldsymbol{w}$
• Problem: the probability needs to be between 0 and 1
• Logistic function: $f(s) = \frac{1}{1+\exp(-s)}$
• Logistic regression model:
$$P(Y=1|\boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{x}'\boldsymbol{w})}$$
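A tiny sketch of the model $P(Y=1|\boldsymbol{x}) = 1/(1+\exp(-\boldsymbol{x}'\boldsymbol{w}))$; the weights and input below are assumed toy values (not from the slides):

```python
import numpy as np

def sigmoid(s):
    """Logistic function f(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

w = np.array([0.5, -1.0, 2.0])        # assumed weights (bias, x1, x2)
x = np.array([1.0, 3.0, 1.5])         # input with a leading 1 for the bias
p = sigmoid(x @ w)                    # P(Y = 1 | x)
print(p, "predict 1" if p > 0.5 else "predict 0")
```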
How is logistic regression linear?
• Logistic regression model: $P(Y=1|\boldsymbol{x}) = \frac{1}{1 + \exp(-\boldsymbol{x}'\boldsymbol{w})}$
• Classification rule: if $P(Y=1|\boldsymbol{x}) > \frac{1}{2}$ then class "1", else class "0"
• Decision boundary is the set of $\boldsymbol{x}$'s such that $\frac{1}{1+\exp(-\boldsymbol{x}'\boldsymbol{w})} = \frac{1}{2}$, i.e. where $\boldsymbol{x}'\boldsymbol{w} = 0$: a hyperplane, which is why the classifier is called linear
[Figure: the logistic function $f(s)$, mapping reals $s$ to probabilities between 0 and 1]
Effect of parameter vector (2D problem)
• Decision boundary is the line where $P(Y=1|\boldsymbol{x}) = 0.5$
∗ In higher-dimensional problems, the decision boundary is a plane or hyperplane
• Vector $\boldsymbol{w}$ is perpendicular to the decision boundary (see linear algebra review topic)
∗ That is, $\boldsymbol{w}$ is a normal to the decision boundary
∗ Note: in this illustration we assume $w_0 = 0$ for simplicity
[Figure: Murphy, Fig 8.1, p246]
Linear vs. logistic probabilistic models
• Linear regression assumes a Normal distribution with fixed variance and mean given by the linear model:
$$p(y|\boldsymbol{x}) = \text{Normal}(\boldsymbol{x}'\boldsymbol{w}, \sigma^2)$$
• Logistic regression assumes a Bernoulli distribution with parameter given by the logistic transform of the linear model:
$$p(y|\boldsymbol{x}) = \text{Bernoulli}\left(\text{logistic}(\boldsymbol{x}'\boldsymbol{w})\right)$$
• Recall that the Bernoulli distribution is defined as $p(1) = \theta$ and $p(0) = 1 - \theta$ for $\theta \in [0,1]$
• Equivalently, $p(y) = \theta^y (1-\theta)^{1-y}$ for $y \in \{0,1\}$
Training as Max-Likelihood Estimation
• Assuming independence, the probability of the data is
$$p(y_1, \ldots, y_n | \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = \prod_{i=1}^n p(y_i|\boldsymbol{x}_i)$$
• Assuming a Bernoulli distribution, we have
$$p(y_i|\boldsymbol{x}_i) = \theta(\boldsymbol{x}_i)^{y_i}\left(1 - \theta(\boldsymbol{x}_i)\right)^{1-y_i}, \quad \text{where } \theta(\boldsymbol{x}_i) = \frac{1}{1+\exp(-\boldsymbol{x}_i'\boldsymbol{w})}$$
• Training: maximise this expression with respect to the weights $\boldsymbol{w}$
Apply log trick, simplify
• Instead of maximising the likelihood, maximise its logarithm:
$$\log \prod_{i=1}^n p(y_i|\boldsymbol{x}_i) = \sum_{i=1}^n \log p(y_i|\boldsymbol{x}_i)$$
$$= \sum_{i=1}^n \log\left(\theta(\boldsymbol{x}_i)^{y_i}\left(1 - \theta(\boldsymbol{x}_i)\right)^{1-y_i}\right)$$
$$= \sum_{i=1}^n \left(y_i \log \theta(\boldsymbol{x}_i) + (1 - y_i)\log\left(1 - \theta(\boldsymbol{x}_i)\right)\right)$$
$$= \sum_{i=1}^n \left((y_i - 1)\,\boldsymbol{x}_i'\boldsymbol{w} - \log\left(1 + \exp(-\boldsymbol{x}_i'\boldsymbol{w})\right)\right)$$
(Can't maximise this analytically; a code sketch of the final expression follows below.)
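A sketch of the simplified log-likelihood $\sum_i (y_i - 1)\boldsymbol{x}_i'\boldsymbol{w} - \log(1+\exp(-\boldsymbol{x}_i'\boldsymbol{w}))$ in code, on assumed synthetic data; `np.logaddexp(0, -z)` is a numerically stable way to compute $\log(1+e^{-z})$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # rows are the x_i'
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=50) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def log_likelihood(w, X, y):
    """Log-likelihood in the simplified form from the derivation above."""
    z = X @ w
    return np.sum((y - 1) * z - np.logaddexp(0, -z))

print(log_likelihood(w_true, X, y))            # higher is better; maximise over w
```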
Logistic Regression:
Decision-Theoretic View
Via cross-entropy loss
Background: Cross entropy
• Cross entropy is an information-theoretic method for comparing two distributions
• Cross entropy is a measure of divergence between a reference distribution $g_{\text{ref}}(a)$ and an estimated distribution $g_{\text{est}}(a)$. For discrete distributions:
$$H(g_{\text{ref}}, g_{\text{est}}) = -\sum_{a \in A} g_{\text{ref}}(a) \log g_{\text{est}}(a)$$
∗ $A$ is the support of the distributions, e.g., $A = \{0, 1\}$
Training as cross-entropy minimisation
• Consider the log-likelihood for a single data point:
$$\log p(y_i|\boldsymbol{x}_i) = y_i \log \theta(\boldsymbol{x}_i) + (1 - y_i)\log\left(1 - \theta(\boldsymbol{x}_i)\right)$$
• Compare with cross entropy $H(g_{\text{ref}}, g_{\text{est}}) = -\sum_a g_{\text{ref}}(a) \log g_{\text{est}}(a)$
∗ If the reference (true) distribution is $g_{\text{ref}}(1) = y_i$ and $g_{\text{ref}}(0) = 1 - y_i$
∗ with logistic regression estimating this distribution as $g_{\text{est}}(1) = \theta(\boldsymbol{x}_i)$ and $g_{\text{est}}(0) = 1 - \theta(\boldsymbol{x}_i)$
then maximum likelihood finds the $\boldsymbol{w}$ that minimises the sum of cross entropies over training points (checked numerically in the sketch below)
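A small numerical check that the per-point cross entropy equals the negative per-point Bernoulli log-likelihood; the label and predicted probability below are assumed toy values:

```python
import numpy as np

# H(g_ref, g_est) = -(y*log(theta) + (1-y)*log(1-theta))
# versus the negative Bernoulli log-likelihood -log(theta^y (1-theta)^(1-y)).
y, theta = 1.0, 0.8
cross_entropy = -(y * np.log(theta) + (1 - y) * np.log(1 - theta))
neg_log_likelihood = -np.log(theta**y * (1 - theta)**(1 - y))
print(cross_entropy, neg_log_likelihood)   # identical values
```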
Mini Summary
• Logistic regression formulation
∗ A workhorse linear binary classifier
∗ Frequentist: Bernoulli label with coin bias that is logistic-linear in $\boldsymbol{x}$
∗ Decision theory: Minimising cross entropy with labels
Next: Training quickly with Newton-Raphson, and how that is
repeated (weighted) linear regression under the hood!
Training Logistic Regression:
the IRLS Algorithm
Analytical? No: Newton-Raphson!
Iterative optimisation
• Training logistic regression: find $\mathbf{w}$ maximising the log-likelihood $L(\mathbf{w})$, equivalently minimising the cross-entropy loss
• Bad news: no closed-form solution
• Good news: the problem is strictly convex (if no irrelevant features), so the iteration converges!
∗ Look ahead: regularisation for irrelevant features
• How does gradient descent work?
∗ If $\mu(z) = \frac{1}{1+\exp(-z)}$ then $\frac{d\mu}{dz} = \mu(z)\left(1 - \mu(z)\right)$
∗ Then $\nabla L(\mathbf{w}) = \sum_{i=1}^{n}\left(y_i - \mu(\mathbf{x}_i)\right)\mathbf{x}_i = \mathbf{X}'(\mathbf{y} - \boldsymbol{\mu})$, stacking the instances as rows of $\mathbf{X}$, the labels in $\mathbf{y}$, and the $\mu(\mathbf{x}_i)$ in $\boldsymbol{\mu}$ (a code sketch follows below)
[Figure: Murphy, Fig 8.3, p247; objective over weights $w_1, w_2$]
Note: abusing notation, $\mu(\mathbf{x}_i) = \mu(z)$ where $z = \mathbf{w}'\mathbf{x}_i$; the meaning is determined by the input type
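A sketch of $\nabla L(\mathbf{w}) = \mathbf{X}'(\mathbf{y} - \boldsymbol{\mu})$ together with a few plain gradient-ascent steps on the log-likelihood; the data, step size and iteration count are assumed for illustration (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def mu(w, X):
    """Vector of mu(x_i) = 1 / (1 + exp(-x_i'w))."""
    return 1 / (1 + np.exp(-X @ w))

w, eta = np.zeros(3), 0.1
for _ in range(500):
    w = w + eta * X.T @ (y - mu(w, X)) / len(y)   # ascent: maximise L(w)
print(w)                                          # roughly recovers w_true
```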
Iteratively-Reweighted Least Squares
• Instead of GD, let's apply Newton-Raphson: the IRLS algorithm
• Recall $\nabla L(\mathbf{w}) = \mathbf{X}'(\mathbf{y} - \boldsymbol{\mu})$. Differentiate again for the Hessian:
$$\nabla^2 L(\mathbf{w}) = -\sum_i \frac{d\mu}{dz_i}\,\mathbf{x}_i\mathbf{x}_i' = -\sum_i \mu(\mathbf{x}_i)\left(1 - \mu(\mathbf{x}_i)\right)\mathbf{x}_i\mathbf{x}_i' = -\mathbf{X}'\mathbf{M}\mathbf{X}$$
where $\mathbf{M}$ is diagonal with $M_{ii} = \mu_i(1 - \mu_i)$ and zeros elsewhere
• Newton-Raphson then says (now with $\mathbf{M}_t, \boldsymbol{\mu}_t$ depending on $\mathbf{w}_t$):
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \left(\nabla^2 L\right)^{-1}\nabla L = \mathbf{w}_t + \left(\mathbf{X}'\mathbf{M}_t\mathbf{X}\right)^{-1}\mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}_t)$$
$$= \left(\mathbf{X}'\mathbf{M}_t\mathbf{X}\right)^{-1}\left(\mathbf{X}'\mathbf{M}_t\mathbf{X}\mathbf{w}_t + \mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}_t)\right)$$
$$= \left(\mathbf{X}'\mathbf{M}_t\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{M}_t\mathbf{b}_t, \quad \text{where } \mathbf{b}_t = \mathbf{X}\mathbf{w}_t + \mathbf{M}_t^{-1}(\mathbf{y} - \boldsymbol{\mu}_t)$$
• Each IRLS iteration solves a least-squares problem weighted by $\mathbf{M}_t$; the weights are recomputed at every iteration, hence iteratively reweighted (see the sketch below)
(Compare to the normal equations.)
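A minimal IRLS sketch under the update $\mathbf{w}_{t+1} = (\mathbf{X}'\mathbf{M}_t\mathbf{X})^{-1}\mathbf{X}'\mathbf{M}_t\mathbf{b}_t$; the synthetic data are assumed for illustration, and no safeguards are included for an ill-conditioned $\mathbf{X}'\mathbf{M}\mathbf{X}$ (e.g. separable data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(3)
for _ in range(10):                       # a handful of Newton steps suffices
    mu = 1 / (1 + np.exp(-X @ w))         # current mu_t
    m = mu * (1 - mu)                     # diagonal of M_t
    b = X @ w + (y - mu) / m              # working ("linearised") labels b_t
    # weighted least squares: solve (X'MX) w = X'M b
    w = np.linalg.solve(X.T @ (m[:, None] * X), X.T @ (m * b))
print(w)                                  # close to w_true for this seed
```

In practice one often applies the equivalent form $\mathbf{w}_{t+1} = \mathbf{w}_t + (\mathbf{X}'\mathbf{M}_t\mathbf{X})^{-1}\mathbf{X}'(\mathbf{y}-\boldsymbol{\mu}_t)$, which avoids dividing by the entries of $\mathbf{M}_t$.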
IRLS intuition: Putting labels on linear scale
IRLS: $\mathbf{w}_{t+1} = \left(\mathbf{X}'\mathbf{M}_t\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{M}_t\mathbf{b}_t$
where $\mathbf{b}_t = \mathbf{X}\mathbf{w}_t + \mathbf{M}_t^{-1}(\mathbf{y} - \boldsymbol{\mu}_t)$,
$M_{ii} = \mu_t(\mathbf{x}_i)\left[1 - \mu_t(\mathbf{x}_i)\right]$ with zeros elsewhere,
and $\mu_t(\mathbf{x}) = \left(1 + \exp(-\mathbf{w}_t'\mathbf{x})\right)^{-1}$
• The $\mathbf{y}$ are not on a linear scale. Invert the logistic function?
• The $\mathbf{b}_t$ are a "linearised" approximation to these: the $\mathbf{b}_t$ equation matches a linear approximation to $\mu_t^{-1}(\mathbf{y})$, as shown below
• Linear regression on new labels!
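To make the "linearised labels" explicit, here is a sketch of the first-order Taylor expansion of the inverse logistic (logit) function around $\mu_t(\mathbf{x}_i)$, using $\frac{d}{dp}\log\frac{p}{1-p} = \frac{1}{p(1-p)}$; it recovers exactly the $i$-th entry of $\mathbf{b}_t$:

$$\mu_t^{-1}(y_i) \;\approx\; \mu_t^{-1}\!\big(\mu_t(\mathbf{x}_i)\big) + \big(y_i - \mu_t(\mathbf{x}_i)\big)\left.\frac{d\,\mu_t^{-1}}{dp}\right|_{p=\mu_t(\mathbf{x}_i)} = \mathbf{w}_t'\mathbf{x}_i + \frac{y_i - \mu_t(\mathbf{x}_i)}{\mu_t(\mathbf{x}_i)\big(1-\mu_t(\mathbf{x}_i)\big)} = (\mathbf{b}_t)_i$$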
IRLS intuition: Equalising label variance
IRLS recap: $\mathbf{w}_{t+1} = \left(\mathbf{X}'\mathbf{M}_t\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{M}_t\mathbf{b}_t$, with $\mathbf{b}_t$, $\mathbf{M}_t$ and $\mu_t$ as defined on the previous slide
• In linear regression, each $y_i$ has equal variance $\sigma^2$
• Our $y_i$ are Bernoulli, with variance $\mu_t(\mathbf{x}_i)\left(1 - \mu_t(\mathbf{x}_i)\right)$
• Our reweighting standardises: it divides by the variances!
Fun exercise: Show that Newton-Raphson for
linear regression gives you the normal equations!
Mini Summary
• Training logistic regression
∗ No analytical solution
∗ Gradient descent possible, but convergence rate not ideal
∗ Newton-Raphson: iteratively reweighted least squares
Next time: Regularised linear regression for avoiding
overfitting and ill-posed optimisation