
Lecturer: Ben Rubinstein

Lecture 3. Linear Regression

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne


This lecture

• Linear regression
∗ Simple model (convenient maths at expense of flexibility)
∗ Often needs less data, “interpretable”, lifts to non-linear
∗ Derivable under all Statistical Schools: Lect 2 case study
  · This week: Frequentist + Decision theory derivations
  · Later in semester: Bayesian approach
∗ Convenient optimisation: Training by “analytic” (exact) solution

• Basis expansion: Data transform for more expressive models


Linear Regression
via Decision Theory

A warm-up example


Example: Predict humidity from temperature
TRAINING DATA

Temperature   Humidity
85            85
80            90
83            86
70            96
68            80
65            70
64            65
72            95
69            70
75            80

TEST DATA

Temperature   Humidity
75            70

In regression, the task is to predict numeric
response (aka dependent variable) from features
(aka predictors or independent variables)
Assume a linear relation: H = a + bT
(H – humidity; T – temperature; a, b – parameters)


Example: Problem statement

• The model is H = a + bT

• Fitting the model = finding “best” a, b values for the data at hand

• Important criterion: minimise the sum of squared errors (aka residual sum of squares)


Example: Minimise Sum Squared Errors

High-school optimisation:
• Write derivative
• Set to zero
• Solve for model
• (Check 2nd derivatives)

To find a, b that minimise L = ∑_{i=1}^{10} (H_i − (a + b T_i))², set derivatives to zero:

∂L/∂a = −2 ∑_{i=1}^{10} (H_i − a − b T_i) = 0

  if we know b, then â = (1/10) ∑_{i=1}^{10} (H_i − b T_i)

∂L/∂b = −2 ∑_{i=1}^{10} T_i (H_i − a − b T_i) = 0

  if we know a, then b̂ = (∑_{i=1}^{10} T_i (H_i − a)) / (∑_{i=1}^{10} T_i²)

Can we be more systematic?


Example: Analytic solution

• We have two equations and two unknowns a, b

• Rewrite as a system of linear equations

  [ 10                 ∑_{i=1}^{10} T_i  ] [ a ]   [ ∑_{i=1}^{10} H_i     ]
  [ ∑_{i=1}^{10} T_i   ∑_{i=1}^{10} T_i² ] [ b ] = [ ∑_{i=1}^{10} T_i H_i ]

• Analytic solution: a = 25.3, b = 0.77

• (Solve using numpy.linalg.solve or similar.)
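A minimal sketch of this solve in NumPy, using the training data from the warm-up example (the printed values should come out near the analytic solution above):

import numpy as np

# Training data from the slide: temperature (T) and humidity (H)
T = np.array([85, 80, 83, 70, 68, 65, 64, 72, 69, 75], dtype=float)
H = np.array([85, 90, 86, 96, 80, 70, 65, 95, 70, 80], dtype=float)

# Coefficient matrix and right-hand side of the 2x2 linear system above
A = np.array([[len(T), T.sum()],
              [T.sum(), (T ** 2).sum()]])
c = np.array([H.sum(), (T * H).sum()])

a, b = np.linalg.solve(A, c)
print(a, b)   # approximately a = 25.3, b = 0.77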


More general decision rule

• Adopt a linear relationship between response y ∈ ℝ and an instance with features x_1, …, x_m ∈ ℝ

  ŷ = w_0 + ∑_{i=1}^{m} x_i w_i

  Here w_0, …, w_m ∈ ℝ denote weights (model parameters)

• Trick: add a dummy feature x_0 = 1 and use vector notation

  ŷ = ∑_{i=0}^{m} x_i w_i = x′w

A lowercase symbol in bold face indicates a vector; x′ denotes transpose
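A tiny NumPy sketch of the dummy-feature trick (the numbers are made up): prepending x_0 = 1 folds the bias w_0 into a single dot product.

import numpy as np

x = np.array([2.0, 0.5, 3.0])            # an instance with m = 3 features (illustrative values)
w = np.array([1.0, 0.2, -0.4, 0.7])      # weights w_0, w_1, ..., w_m

x_aug = np.concatenate(([1.0], x))       # prepend dummy feature x_0 = 1
y_hat = x_aug @ w                        # x'w = w_0 + sum_i x_i * w_i
print(y_hat)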


Mini Summary

• Linear regression
∗ Simple, effective, “interpretable”, basis for many approaches
∗ Decision-theoretic frequentist derivation

Next:
Frequentist derivation; Solution/training approach


Linear Regression
via Frequentist Probabilistic Model

Max-Likelihood Estimation


Data is noisy!


Example: predict mark for Statistical Machine Learning (SML)
from mark for Intro ML (IML aka KT)

[Scatter plots of training data*: SML mark vs IML mark]

* synthetic data 🙂

Regression as a probabilistic model

• Assume a probabilistic model: Y = X′w + ε
∗ Here X, Y and ε are r.v.’s
∗ Variable ε encodes noise

• Next, assume Gaussian noise (indep. of X): ε ~ N(0, σ²)

• Recall that N(x; μ, σ²) ≡ (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

• Therefore

  p_{w,σ²}(y|x) = (1/√(2πσ²)) exp(−(y − x′w)²/(2σ²))

  (the exponent contains a squared error!)


Parametric probabilistic model

• Using simplified notation, the discriminative model is:

  p(y|x) = (1/√(2πσ²)) exp(−(y − x′w)²/(2σ²))

• Unknown parameters: w, σ²

• Given observed data {(X_1, Y_1), …, (X_n, Y_n)}, we want to find parameter values that “best” explain the data

• Maximum-likelihood estimation: choose parameter values that
maximise the probability of observed data


Maximum likelihood estimation
• Assuming independence of data points, the probability of the data is

  p(y_1, …, y_n | x_1, …, x_n) = ∏_{i=1}^{n} p(y_i | x_i)

• For p(y_i | x_i) = (1/√(2πσ²)) exp(−(y_i − x_i′w)²/(2σ²))

• “Log trick”: Instead of maximising this quantity, we can maximise its logarithm (Why? Explained soon)

  ∑_{i=1}^{n} log p(y_i | x_i) = −(1/(2σ²)) ∑_{i=1}^{n} (y_i − x_i′w)² + C

  here C doesn’t depend on w (it’s a constant), and the remaining sum is the sum of squared errors!

• Under this model, maximising the log-likelihood as a function of w is equivalent to minimising the sum of squared errors
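A small numerical check of this equivalence (synthetic data and names made up for illustration): for any two weight vectors, the Gaussian negative log-likelihood and the sum of squared errors differ only by the factor 1/(2σ²) and a constant, so they share the same minimiser.

import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 50, 3, 0.5
X = rng.normal(size=(n, m + 1)); X[:, 0] = 1.0                       # design matrix with dummy feature
y = X @ rng.normal(size=m + 1) + rng.normal(scale=sigma, size=n)     # Gaussian-noise model

def neg_log_lik(w):
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + ((y - X @ w) ** 2).sum() / (2 * sigma**2)

def sse(w):
    return ((y - X @ w) ** 2).sum()

w1, w2 = rng.normal(size=m + 1), rng.normal(size=m + 1)
print(np.isclose(neg_log_lik(w1) - neg_log_lik(w2),
                 (sse(w1) - sse(w2)) / (2 * sigma**2)))   # True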


Method of least squares

Analytic solution: • Write derivative • Set to zero • Solve for model

• Training data: {(x_1, y_1), …, (x_n, y_n)}. Note bold face in x_i

• For convenience, place instances in rows (so attributes go in columns), representing the training data as an n × (m + 1) matrix X, and an n-vector y

• Probabilistic model/decision rule assumes y ≈ Xw

• To find w, minimise the sum of squared errors

  L = ∑_{i=1}^{n} (y_i − ∑_{j=0}^{m} X_{ij} w_j)²

• Setting the derivative to zero and solving for w yields

  ŵ = (X′X)⁻¹ X′y

∗ This system of equations is called the normal equations
∗ The system is well defined only if the inverse exists

In this slide: UPPERCASE symbol in bold face means matrix; X′ denotes transpose; X⁻¹ denotes matrix inverse
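A minimal NumPy sketch of this fit (the helper name and toy data are illustrative), solving the normal equations directly rather than forming the inverse:

import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X'X w = X'y (X must already include the dummy x_0 = 1 column)."""
    # Solving the linear system is preferred over computing (X'X)^{-1} explicitly;
    # this raises an error when X'X is singular, i.e. when the inverse does not exist.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage example with made-up data
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.9, 5.1])
w_hat = fit_least_squares(X, y)   # approximately [0.95, 2.05]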


Wherefore art thou: Bayesian derivation?

• Later in the semester: return of linear regression

• Fully Bayesian, with a posterior:
∗ Bayesian linear regression

• Bayesian (MAP) point estimate of weight vector:
∗ Adds a penalty term to sum of squared losses
∗ Equivalent to L2 “regularisation” to be covered next week
∗ Called: ridge regression


Mini Summary

• Linear regression
∗ Simple, effective, “interpretable”, basis for many approaches
∗ Probabilistic frequentist derivation
∗ Solution by normal equations
Later in semester: Bayesian approaches

Next: Basis expansion for non-linear regression


Basis Expansion

Extending the utility of models via
data transformation


Basis expansion for linear regression
• Real data is likely to be non-linear

• What if we still wanted to use a linear regression?
∗ Simple, easy to understand, computationally efficient, etc.

• How to marry non-linear data to a linear method?


If you can’t beat’em, join’em


Transform the data
• The trick is to transform the data: map the data into another feature space, s.t. the data is linear in that space

• Denote this transformation φ: ℝ^m → ℝ^k. If x is the original set of features, φ(x) denotes the new feature set

• Example: suppose there is just one feature x, and the data is scattered around a parabola rather than a straight line



Example: Polynomial regression
• No worries, mate: define

  φ_1(x) = x
  φ_2(x) = x²

• Next, apply linear regression to φ_1, φ_2

  y = w_0 + w_1 φ_1(x) + w_2 φ_2(x) = w_0 + w_1 x + w_2 x²

  and here you have quadratic regression

• More generally, we obtain polynomial regression if the new set of attributes are powers of x (see the sketch below)

• A similar idea is the basis of autoregression for time series
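A minimal sketch of quadratic regression via basis expansion in NumPy (synthetic data, illustrative names): plain least squares on the expanded features recovers the quadratic coefficients.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)   # roughly parabolic data

# Basis expansion: dummy feature, phi_1(x) = x, phi_2(x) = x^2
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary linear regression on the expanded features = quadratic regression
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)   # close to [1.0, 0.5, 2.0]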


Example: linear classification
• Example binary classification problem: dataset not linearly separable

  x_1   x_2   y
  0     0     Class A
  0     1     Class B
  1     0     Class B
  1     1     Class A

• Define the transformation as φ_i(x) = ‖x − z_i‖, where the z_i are some pre-defined constants

• Choose z_1 = (0,0)′, z_2 = (0,1)′, z_3 = (1,0)′, z_4 = (1,1)′

• The transformed data is linearly separable! There exist weights that make the new data separable, e.g. w_1 = 1, w_2 = 0, w_3 = 0, w_4 = 1:

  φ_1   φ_2   φ_3   φ_4    φ′w    y
  0     1     1     √2     √2     Class A
  1     0     √2    1      2      Class B
  1     √2    0     1      2      Class B
  √2    1     1     0      √2     Class A
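A quick NumPy check of this transformation (assuming, as on this slide, that φ_i(x) = ‖x − z_i‖ is the Euclidean distance):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # the four instances
Z = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # the chosen z_1, ..., z_4

# phi_i(x) = ||x - z_i||: distance from each instance to each reference point
Phi = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)

w = np.array([1.0, 0.0, 0.0, 1.0])
print(Phi @ w)   # class A rows give sqrt(2) ~ 1.41, class B rows give 2.0 -> separable by a threshold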


Radial basis functions
• Previous example: motivated by approximation theory, where sums of RBFs approximate functions

• A radial basis function is a function of the form φ(x) = ψ(‖x − z‖), where z is a constant

• Examples:
  ∗ φ(x) = ‖x − z‖
  ∗ φ(x) = exp(−(1/σ)‖x − z‖²)
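As a small illustration (function names are made up), these two radial basis functions in NumPy:

import numpy as np

def rbf_distance(x, z):
    """phi(x) = ||x - z||"""
    return np.linalg.norm(x - z)

def rbf_gaussian(x, z, sigma=1.0):
    """phi(x) = exp(-(1/sigma) * ||x - z||^2)"""
    return np.exp(-np.sum((x - z) ** 2) / sigma)

x, z = np.array([0.5, 1.0]), np.array([0.0, 0.0])
print(rbf_distance(x, z), rbf_gaussian(x, z))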



Challenges of basis expansion
• Basis expansion can significantly increase the utility of methods, especially linear methods

• In the above examples, one limitation is that the transformation needs to be defined beforehand
∗ Need to choose the size of the new feature set
∗ If using RBFs, need to choose z_i

• Regarding z_i, one can choose uniformly spaced points, or cluster training data and use cluster centroids

• Another popular idea is to use the training data itself: z_i ≡ x_i
∗ E.g., φ_i(x) = ψ(‖x − x_i‖)
∗ However, for large datasets, this results in a large number of features → computational hurdle (see the sketch below)
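A rough sketch of why this becomes expensive (synthetic data, Gaussian RBF assumed): centring one basis function on every training point yields an n × n feature matrix.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))                          # n = 1000 training points, m = 5 features

# Centre one Gaussian RBF on every training point: z_i = x_i
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-sq_dists)                                 # n x n feature matrix

print(Phi.shape)   # (1000, 1000): grows quadratically with n, hence the computational hurdle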


Further directions
• There are several avenues for taking the idea of basis expansion to the next level
∗ Will be covered later in this subject

• One idea is to learn the transformation 𝜑𝜑 from data
∗ E.g., Artificial Neural Networks

• Another powerful extension is the use of the kernel trick
∗ “Kernelised” methods, e.g., kernelised perceptron

• Finally, in sparse kernel machines, training depends only
on a few data points
∗ E.g., SVM


Mini Summary

• Basis expansion
∗ Extending model expressiveness via data transformation
∗ Examples for linear regression and classification
∗ Theoretical notes

Next time:
First/second-order iterative optimisation;
Logistic regression – a linear probabilistic model for classification.
