Lecture 3. Linear Regression. Optimisation.
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Linear regression
∗ Simple model (convenient maths at expense of flexibility)
∗ Often needs less data, “interpretable”, lifts to non-linear
∗ Derivable under all Statistical Schools: Lect 2 case study
• Today: Frequentist + Decision theory derivations
Later in semester: Bayesian approach
• Optimisation for ML (first of 2 parts or so)
∗ Analytic solutions
∗ Gradient descent
∗ Convexity
Later: Lagrangian duality
Linear Regression via Decision Theory
A warm-up example
Example: Predict humidity from temperature
In regression, the task is to predict a numeric response (aka dependent variable) from features (aka predictors or independent variables)
Assume a linear relation: $H = a + bT$
($H$ – humidity; $T$ – temperature; $a$, $b$ – parameters)
TRAINING DATA

Temperature   Humidity
85            85
80            90
83            86
70            96
68            80
65            70
64            65
72            95
69            70
75            80

TEST DATA

Temperature   Humidity
75            70
Example: Problem statement
• The model is $H = a + bT$
• Fitting the model = finding “best” 𝑎𝑎, 𝑏𝑏 values for data at hand
• Popular criterion: minimise the sum of squared errors (aka residual sum of squares)
Example: Minimise Sum Squared Errors
To find $a, b$ that minimise $L = \sum_{i=1}^{10} \left( H_i - (a + bT_i) \right)^2$, set derivatives to zero:

$$\frac{\partial L}{\partial a} = -2 \sum_{i=1}^{10} \left( H_i - a - bT_i \right) = 0$$
if we know $b$, then $\hat{a} = \frac{1}{10} \sum_{i=1}^{10} \left( H_i - bT_i \right)$

$$\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{10} T_i \left( H_i - a - bT_i \right) = 0$$
if we know $a$, then $\hat{b} = \frac{\sum_{i=1}^{10} T_i (H_i - a)}{\sum_{i=1}^{10} T_i^2}$

High-school optimisation:
• Write derivative
• Set to zero
• Solve for model
• (Check 2nd derivatives)
Will cover again later
Can we be more systematic?
Example: Analytic solution
• We have two equations and two unknowns $a, b$
• Rewrite as a system of linear equations

$$\begin{bmatrix} 10 & \sum_{i=1}^{10} T_i \\ \sum_{i=1}^{10} T_i & \sum_{i=1}^{10} T_i^2 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{10} H_i \\ \sum_{i=1}^{10} T_i H_i \end{bmatrix}$$

• Analytic solution: $a = 25.3$, $b = 0.77$
• (Solve using numpy.linalg.solve or similar)
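As a minimal sketch (using the training pairs from the table above), the 2×2 system can be set up and solved with numpy.linalg.solve:

```python
import numpy as np

# Training data from the warm-up example (temperature T, humidity H)
T = np.array([85, 80, 83, 70, 68, 65, 64, 72, 69, 75], dtype=float)
H = np.array([85, 90, 86, 96, 80, 70, 65, 95, 70, 80], dtype=float)

# The 2x2 system obtained by setting both derivatives to zero
A = np.array([[len(T), T.sum()],
              [T.sum(), (T ** 2).sum()]])
rhs = np.array([H.sum(), (T * H).sum()])

a, b = np.linalg.solve(A, rhs)
print(a, b)  # approximately 25.3 and 0.77
```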
More general decision rule
• Adopt a linear relationship between response $y \in \mathbb{R}$ and an instance with features $x_1, \ldots, x_m \in \mathbb{R}$
$$\hat{y} = w_0 + \sum_{i=1}^{m} x_i w_i$$
• Here $w_1, \ldots, w_m \in \mathbb{R}$ denote weights (model parameters)
• Trick: add a dummy feature $x_0 = 1$ and use vector notation
$$\hat{y} = \sum_{i=0}^{m} x_i w_i = \boldsymbol{x}'\boldsymbol{w}$$
A lowercase symbol in bold face indicates a vector; $\boldsymbol{x}'$ denotes transpose
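A small illustration of the dummy-feature trick (the feature vector and weights below are made up purely for the example):

```python
import numpy as np

# Made-up instance with m = 3 features, and a weight vector that includes the bias w_0
x = np.array([2.0, -1.0, 0.5])
w = np.array([1.0, 0.3, -0.2, 0.8])   # w[0] is the bias weight w_0

# Prepend the dummy feature x_0 = 1 so the bias folds into a single dot product
x_aug = np.concatenate(([1.0], x))
y_hat = x_aug @ w                      # equals w_0 + sum_i x_i * w_i
print(y_hat)                           # 1.0 + 0.3*2.0 - 0.2*(-1.0) + 0.8*0.5 = 2.2
```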
Linear Regression via Frequentist Probabilistic Model
Max Likelihood Estimation
Data is noisy!
Example: predict mark for Statistical Machine Learning (SML) from mark for Knowledge Technologies (KT)
Training data* (* synthetic data 🙂)
[Two scatter plots: SML mark vs KT mark]
Regression as a probabilistic model
• Assume a probabilistic model: $Y = \boldsymbol{X}'\boldsymbol{w} + \varepsilon$
∗ Here $\boldsymbol{X}$, $Y$ and $\varepsilon$ are r.v.’s
∗ Variable $\varepsilon$ encodes noise
• Next, assume Gaussian noise (indep. of $\boldsymbol{X}$): $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
• Recall that $\mathcal{N}(x; \mu, \sigma^2) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
• Therefore $p_{\boldsymbol{w},\sigma^2}(y|\boldsymbol{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\boldsymbol{x}'\boldsymbol{w})^2}{2\sigma^2}\right)$ — note the exponent contains a squared error!
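A quick sketch of sampling from this generative model; the weights reuse the $(a, b)$ fitted in the warm-up example, and the noise level is purely illustrative:

```python
import numpy as np

# Sample a noisy response from Y = x'w + eps, with eps ~ N(0, sigma^2)
rng = np.random.default_rng(0)
w = np.array([25.3, 0.77])            # e.g. the (a, b) fitted in the warm-up example
sigma = 2.0                           # illustrative noise level (an assumption)

x = np.array([1.0, 75.0])             # dummy feature x_0 = 1, temperature 75
y = x @ w + rng.normal(0.0, sigma)    # one noisy draw of the humidity
print(y)                              # near 25.3 + 0.77 * 75 ≈ 83
```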
Parametric probabilistic model
• Using simplified notation, discriminative model is:
$$p(y|\boldsymbol{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\boldsymbol{x}'\boldsymbol{w})^2}{2\sigma^2}\right)$$
• Unknown parameters: $\boldsymbol{w}$, $\sigma^2$
• Given observed data $\{(\boldsymbol{X}_1, Y_1), \ldots, (\boldsymbol{X}_n, Y_n)\}$, we want to find parameter values that “best” explain the data
• Maximum likelihood estimation: choose parameter values that maximise the probability of observed data
Maximum likelihood estimation
• Assuming independence of data points, the probability of data is
$$p(y_1, \ldots, y_n | \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = \prod_{i=1}^{n} p(y_i|\boldsymbol{x}_i)$$
• For $p(y_i|\boldsymbol{x}_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i-\boldsymbol{x}_i'\boldsymbol{w})^2}{2\sigma^2}\right)$
• “Log trick”: instead of maximising this quantity, we can maximise its logarithm (why?)
$$\sum_{i=1}^{n} \log p(y_i|\boldsymbol{x}_i) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y_i - \boldsymbol{x}_i'\boldsymbol{w}\right)^2 + C$$
here $C$ doesn’t depend on $\boldsymbol{w}$ (it’s a constant), and the first term contains the sum of squared errors!
• Under this model, maximising log-likelihood as a function of 𝒘𝒘 is equivalent to minimising the sum of squared errors
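For completeness, the step hiding inside the constant $C$ just expands the log of each Gaussian factor:
$$\sum_{i=1}^{n} \log p(y_i|\boldsymbol{x}_i) = \sum_{i=1}^{n} \left[ -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - \boldsymbol{x}_i'\boldsymbol{w})^2}{2\sigma^2} \right] = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y_i - \boldsymbol{x}_i'\boldsymbol{w}\right)^2,$$
so $C = -\frac{n}{2}\log(2\pi\sigma^2)$, which indeed does not depend on $\boldsymbol{w}$.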
Non-linear Continuous Optimisation
Brief summary of a few basic optimisation methods vital for ML
Optimisation formulations in ML
• Training = Fitting = Parameter estimation
• Typical formulation
$$\hat{\boldsymbol{\theta}} \in \operatorname*{argmin}_{\boldsymbol{\theta} \in \Theta} L(data, \boldsymbol{\theta})$$
∗ argmin because we want a minimiser, not the minimum
• Note: argmin can return a set (minimiser not always unique!)
∗ $\Theta$ denotes a model family (including constraints)
∗ $L$ denotes some objective function to be optimised
• E.g. MLE: (conditional) likelihood
• E.g. Decision theory: (regularised) empirical risk
Two solution approaches
• Analytic (aka closed form) solution
∗ Known only in limited number of cases
∗ Use 1st-order necessary condition for optimality*:
$$\frac{\partial L}{\partial \theta_1} = \cdots = \frac{\partial L}{\partial \theta_p} = 0$$
• Approximate iterative solution (assuming unconstrained, differentiable $L$)
1. Initialisation: choose starting guess $\boldsymbol{\theta}^{(1)}$, set $i = 1$
2. Update: $\boldsymbol{\theta}^{(i+1)} \leftarrow \mathit{SomeRule}\big(\boldsymbol{\theta}^{(i)}\big)$, set $i \leftarrow i + 1$
3. Termination: decide whether to Stop
4. Go to Step 2
5. Stop: return $\hat{\boldsymbol{\theta}} \approx \boldsymbol{\theta}^{(i)}$

* Note: to check for local minimum, need positive 2nd derivative (or Hessian positive definite); this assumes unconstrained – in general need to also check boundaries. See also Lagrangian techniques later in subject.
Finding the deepest point
[Figure: objective surface over parameters $\theta_1$, $\theta_2$]
• In this example, a model has 2 parameters
• Each location corresponds to a particular combination of parameter values
• Depth indicates objective value (e.g. loss) of that candidate model on data
Coordinate descent
• Suppose $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)'$
1. Choose $\boldsymbol{\theta}^{(1)}$ and some $T$
2. For $i$ from 1 to $T$ (*)
   1. $\boldsymbol{\theta}^{(i+1)} \leftarrow \boldsymbol{\theta}^{(i)}$
   2. For $j$ from 1 to $K$
      1. Fix components of $\boldsymbol{\theta}^{(i+1)}$, except $j$-th component
      2. Find $\hat{\theta}_j^{(i+1)}$ that minimises $L\big(\theta_j^{(i+1)}\big)$
      3. Update $j$-th component of $\boldsymbol{\theta}^{(i+1)}$
3. Return $\hat{\boldsymbol{\theta}} \approx \boldsymbol{\theta}^{(i)}$

* Other stopping criteria can be used

[Figure: coordinate descent path over $\theta_1$, $\theta_2$. Wikimedia Commons. Author: Nicoguaro (CC4)]
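A minimal sketch of coordinate descent applied to the least-squares objective $L(\boldsymbol{w}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|^2$, with made-up data; for this particular objective each coordinate update has a closed form:

```python
import numpy as np

# Made-up regression data, used only to illustrate the algorithm
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w = np.zeros(X.shape[1])          # theta^(1)
T = 20                            # number of outer iterations
for i in range(T):
    for j in range(X.shape[1]):
        # Fix all components except the j-th; minimise L over w_j alone
        residual = y - X @ w + X[:, j] * w[j]   # remove the j-th contribution
        w[j] = X[:, j] @ residual / (X[:, j] @ X[:, j])

print(w)  # approaches the least-squares solution
```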
Reminder: The gradient
• Gradient at $\boldsymbol{\theta}$ defined as $\left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_p}\right)'$ evaluated at $\boldsymbol{\theta}$
• The gradient points in the direction of maximal change of $L(\boldsymbol{\theta})$ when departing from point $\boldsymbol{\theta}$
• Shorthand notation
∗ $\nabla L \triangleq \left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_p}\right)'$ computed at point $\boldsymbol{\theta}$
∗ Here $\nabla$ is the “nabla” symbol
Gradient descent
Assuming $L$ is differentiable:
1. Choose $\boldsymbol{\theta}^{(1)}$ and some $T$
2. For $i$ from 1 to $T$*
   1. $\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \eta \nabla L\big(\boldsymbol{\theta}^{(i)}\big)$
3. Return $\hat{\boldsymbol{\theta}} \approx \boldsymbol{\theta}^{(i)}$

• Note: $\eta$ is dynamically updated in each step
• Variants: Stochastic GD, mini-batches, momentum, AdaGrad, …

* Other stopping criteria can be used

[Figure: gradient descent steps from a starting point $\boldsymbol{\theta}^{(0)}$ down a surface. Wikimedia Commons. Authors: Olegalexandrov, Zerodamage]
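A minimal sketch of (batch) gradient descent on the same least-squares objective, with made-up data and a fixed step size $\eta$ (rather than a dynamically updated one):

```python
import numpy as np

# Made-up regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w = np.zeros(X.shape[1])                 # theta^(1)
eta = 0.002                              # fixed step size, chosen small enough for this data
for i in range(500):
    grad = -2 * X.T @ (y - X @ w)        # gradient of the sum of squared errors
    w = w - eta * grad

print(w)  # close to the least-squares solution
```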
Convex objective functions
• ‘Bowl shaped’ functions
• Informally: if line segment between any two points on graph of function lies above or on graph
• Formally*: $f: D \to \mathbb{R}$ is convex if $\forall \boldsymbol{a}, \boldsymbol{b} \in D$, $t \in [0,1]$:
$$f\big(t\boldsymbol{a} + (1-t)\boldsymbol{b}\big) \le t f(\boldsymbol{a}) + (1-t) f(\boldsymbol{b})$$
Strictly convex if inequality is strict (<)
• Gradient descent on (strictly) convex function guaranteed to find a (unique) global minimum!

* Aside: Equivalently we can look to the second derivative. For f defined on scalars, it should be non-negative; for multivariate f, the Hessian matrix should be positive semi-definite (see linear algebra supplemental deck).
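As a concrete instance of the Hessian criterion in the aside, the sum-of-squared-errors objective used throughout this lecture is convex:
$$L(\boldsymbol{w}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2, \qquad \nabla^2 L(\boldsymbol{w}) = 2\boldsymbol{X}'\boldsymbol{X}, \qquad \boldsymbol{v}'\big(2\boldsymbol{X}'\boldsymbol{X}\big)\boldsymbol{v} = 2\|\boldsymbol{X}\boldsymbol{v}\|_2^2 \ge 0 \;\; \forall \boldsymbol{v},$$
so the Hessian is positive semi-definite; the objective is strictly convex (unique minimiser) when $\boldsymbol{X}'\boldsymbol{X}$ is invertible.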
L1 and L2 norms
• Throughout the course we will often encounter norms that are functions $\mathbb{R}^n \to \mathbb{R}$ of a particular form
∗ Intuitively, norms measure lengths of vectors in some sense
∗ Often component of objectives or stopping criteria in optimisation-for-ML
• More specifically, we will often use the $L_2$ norm (aka Euclidean distance)
$$\|\boldsymbol{a}\| = \|\boldsymbol{a}\|_2 \equiv \sqrt{a_1^2 + \cdots + a_n^2}$$
• And also the $L_1$ norm (aka absolute norm or Manhattan distance)
$$\|\boldsymbol{a}\|_1 \equiv |a_1| + \cdots + |a_n|$$
• For example, the sum of squared errors is a squared norm
$$L = \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{m} X_{ij} w_j \right)^2 = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|^2$$
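A quick numpy sketch of the two norms and of the sum of squared errors as a squared norm (the vectors, matrix and weights below are made up):

```python
import numpy as np

a = np.array([3.0, -4.0])
print(np.linalg.norm(a))         # L2 norm: sqrt(3^2 + 4^2) = 5.0
print(np.linalg.norm(a, ord=1))  # L1 norm: |3| + |-4| = 7.0

# Made-up X, y, w purely for illustration; first column of X is the dummy feature
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([4.0, 6.0, 9.0])
w = np.array([1.0, 1.5])
sse = np.sum((y - X @ w) ** 2)
assert np.isclose(sse, np.linalg.norm(y - X @ w) ** 2)
```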
...And much much more
• What if you have constraints?
• What about speed of convergence?
• Is there anything faster than gradient descent? (yes, but can be expensive)
• Do you really need differentiable objectives? (no)
• Are there more tricks? (hell yeah!) We’ll see Lagrangian duality later on
Convex Optimization (Boyd & Vandenberghe), free at http://web.stanford.edu/~boyd/cvxbook/
One we’ve seen: Log trick
• Instead of optimising $L(\theta)$, try the more convenient $\log L(\theta)$
• Strictly monotonic function: $a > b \Rightarrow f(a) > f(b)$
∗ Example: log function!
• Why are we allowed to do this?
• Lemma: Consider any objective function $L(\theta)$ and any strictly monotonic $f$. $\theta^*$ is an optimiser of $L(\theta)$ if and only if it is an optimiser of $f(L(\theta))$.
∗ Proof: Try it at home for fun!
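Not a substitute for that proof, but a quick numerical sanity check with a made-up positive objective:

```python
import numpy as np

# The maximiser of L and of log L coincide (log is strictly increasing on L > 0)
theta = np.linspace(0.01, 5.0, 1000)
L = np.exp(-(theta - 2.0) ** 2)          # made-up positive objective
assert np.argmax(L) == np.argmax(np.log(L))
print(theta[np.argmax(L)])               # ~2.0 either way
```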
Linear Regression Optimisation
For either MLE/decision-theoretic derivations
Method of least squares
Analytic solution:
• Write derivative
• Set to zero
• Solve for model

• Training data: $\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\}$. Note bold face in $\boldsymbol{x}_i$
• For convenience, place instances in rows (so attributes go in columns), representing training data as an $n \times (m+1)$ matrix $\boldsymbol{X}$, and $n$-vector $\boldsymbol{y}$
• Probabilistic model/decision rule assumes $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$
• To find $\boldsymbol{w}$, minimise the sum of squared errors
$$L = \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{m} X_{ij} w_j \right)^2$$
• Setting derivative to zero and solving for $\boldsymbol{w}$ yields
$$\hat{\boldsymbol{w}} = \left(\boldsymbol{X}'\boldsymbol{X}\right)^{-1} \boldsymbol{X}'\boldsymbol{y}$$
∗ This system of equations is called the normal equations
∗ System is well defined only if the inverse exists

In this slide: UPPERCASE symbol in bold face means matrix; $\boldsymbol{X}'$ denotes transpose; $\boldsymbol{X}^{-1}$ denotes matrix inverse
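A minimal sketch of the normal-equations solution with made-up data; np.linalg.lstsq is a more numerically robust alternative when $\boldsymbol{X}'\boldsymbol{X}$ is close to singular:

```python
import numpy as np

# Made-up training data with a dummy feature x_0 = 1 in the first column
rng = np.random.default_rng(0)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ np.array([25.0, 1.0, -2.0, 0.5]) + rng.normal(size=n)

# w_hat = (X'X)^{-1} X'y; solve the linear system rather than forming the inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer via least squares directly on X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
print(w_hat)
```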
Wherefore art thou: Bayesian derivation?
• Later in the semester: return of linear regression
• Fully Bayesian, with a posterior:
∗ Bayesian linear regression
• Bayesian (MAP) point estimate of weight vector:
∗ Adds a penalty term to sum of squared losses
∗ Equivalent to L2 “regularisation” to be covered soon!
∗ Called: ridge regression
Summary
• Linear regression
∗ Probabilistic frequentist derivation
∗ Decision-theoretic frequentist derivation
Later in semester: Bayesian approaches
• Optimisation for ML
Next time:
logistic regression – linear probabilistic model for classification; basis expansion for non-linear extensions