Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part V
Linear Classification 1
Strategy in this course
Estimate the best predictor = training = learning.
Given data (x_1, y_1), ..., (x_n, y_n), find a predictor f_w(·).
1. Identify the type of the input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameters w
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm, in Python)
7. Interpret and diagnose the results
Classification
Goal: given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K.
Divide the input space into different regions.
Equivalently: map each point to a categorical label.
Figure: petal length [cm] vs sepal length [cm] for three types of flowers (Iris Setosa, Iris Versicolor, Iris Virginica).
How to represent binary class labels?
Class labels are no longer real values as in regression, but a discrete set.
Two classes: t ∈ {0, 1}
(t = 1 represents class C_1 and t = 0 represents class C_2).
We can interpret the value of t as the probability of class C_1, with only two possible values for the probability, 0 or 1.
Note: other conventions for mapping classes to integers are possible; check the setup.
How to represent multi-class labels?
If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for t_j = 1, where j is the index of the class C_j being encoded.
Example: given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector
t = (0, 1, 0, 0, 0)^T
Note: other conventions for mapping multiple classes to integers are possible; check the setup.
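A minimal numpy sketch of the 1-of-K encoding (the function name and the 0-indexed labels are illustrative assumptions, not part of the slides):

```python
import numpy as np

def one_of_k(labels, K):
    # Encode integer class labels 0, ..., K-1 as 1-of-K (one-hot) target vectors.
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1.0
    return T

# Membership in class C_2 (index 1 when counting from 0) out of K = 5 classes:
print(one_of_k([1], K=5))   # [[0. 1. 0. 0. 0.]]
```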
Linear Model
Idea: use a linear model again, as in regression: y(x, w) is a linear function of the parameters w,
y(x_n, w) = w^T φ(x_n).
But generally y(x_n, w) ∈ R.
Example: which class is y(x, w) = 0.71623?
Generalised Linear Model
Apply a mapping f : R → Z to the linear model to get the discrete class labels.
Generalised linear model:
y(x_n, w) = f(w^T φ(x_n))
Activation function: f(·)
Link function: f^{-1}(·)
Figure: example of an activation function, f(z) = sign(z).
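A minimal sketch of such a generalised linear model in Python (the identity-plus-bias feature map and the weight values are illustrative assumptions):

```python
import numpy as np

def phi(x):
    # Fixed basis: here simply the input with a bias element phi_0(x) = 1.
    return np.concatenate(([1.0], np.atleast_1d(x)))

def predict(x, w):
    # Generalised linear model: activation f(z) = sign(z) applied to the linear model.
    z = w @ phi(x)
    return 1 if z >= 0 else -1

w = np.array([-0.5, 1.0, 2.0])            # illustrative weights (bias first)
print(predict(np.array([0.3, 0.2]), w))   # -> +1, since -0.5 + 0.3 + 0.4 >= 0
```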
Three Models for Decision Problems
Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.
Generative Models
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.
Two Classes
Definition: A discriminant is a function that maps an input vector x to one of K classes, denoted by C_k.
Consider first two classes (K = 2).
Construct a linear function of the inputs x,
y(x) = w^T x + w_0,
such that x is assigned to class C_1 if y(x) ≥ 0, and to class C_2 otherwise.
weight vector w
bias w_0 (sometimes −w_0 is called the threshold)
Linear Functions
Figure: heat maps of w^T x over the (x_1, x_2) plane for w = (1, 2), w = (−1, −2) and w = (2, 4).
Gradient = direction of steepest ascent: ∇_x (w^T x) = w.
The set {x : w^T x + w_0 = 0} is a hyperplane.
Projecting x onto that hyperplane means finding argmin_{x_⊥} ‖x − x_⊥‖ subject to the constraint w^T x_⊥ + w_0 = 0. Geometrically: move in the direction w/‖w‖.
Moving a distance a in that direction changes the function value by (a w/‖w‖)^T w = a‖w‖.
The length of that step is ‖a w/‖w‖‖ = (a/‖w‖) ‖w‖ = a.
So for a fixed change in w^T (a w/‖w‖), we have a ∝ 1/‖w‖.
Two Classes
The decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in the D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: assume x_A and x_B are two points lying in the decision surface. Then
0 = y(x_A) − y(x_B) = w^T (x_A − x_B).
Two Classes
y(x) gives a signed measure of the perpendicular distance r from the decision surface to x, that is, r = y(x)/‖w‖.
Write x = x_⊥ + r w/‖w‖, where x_⊥ is the projection of x onto the decision surface. Then
y(x) = w^T (x_⊥ + r w/‖w‖) + w_0 = r w^T w/‖w‖ + (w^T x_⊥ + w_0) = r ‖w‖,
since w^T x_⊥ + w_0 = 0.
Figure: geometry of the linear discriminant in two dimensions: decision surface y = 0 with regions R_1 (y > 0) and R_2 (y < 0), a point x with projection x_⊥, signed distance y(x)/‖w‖, and offset −w_0/‖w‖ of the surface from the origin.
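A small numerical check of r = y(x)/‖w‖ (the particular values of w, w_0 and x are arbitrary illustrative choices):

```python
import numpy as np

def signed_distance(x, w, w0):
    # Signed perpendicular distance r = y(x) / ||w|| from the decision surface to x.
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([1.0, 2.0]), -1.0
x = np.array([3.0, 0.5])
r = signed_distance(x, w, w0)
x_perp = x - r * w / np.linalg.norm(w)   # projection of x onto the decision surface
print(r)                                 # signed distance
print(w @ x_perp + w0)                   # ~0: x_perp lies on the surface
```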
Two Classes
The normal distance from the origin to the decision surface is therefore
−y(0)/‖w‖ = −w_0/‖w‖.
(Same geometry figure as on the previous slide.)
Two Classes
More compact notation: add an extra dimension to the input space and set its value to x_0 = 1.
Also define w̃ = (w_0, w) and x̃ = (1, x), so that
y(x) = w̃^T x̃
(if it helps, you may think of w̃^T as a function).
The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.
Multi-Class
Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?
Figure: regions R_1, R_2, R_3 produced by two one-versus-the-rest classifiers (C_1 / not C_1 and C_2 / not C_2); the region marked "?" is ambiguously classified.
Multi-Class
Number of classes K > 2.
Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?
Figure: regions R_1, R_2, R_3 produced by pairwise classifiers (C_1 vs C_2, C_1 vs C_3, C_2 vs C_3); the region marked "?" is ambiguously classified.
Multi-Class
Number of classes K > 2.
Solution: use K linear functions
y_k(x) = w_k^T x + w_k0.
Assign input x to class C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x).
Figure: decision regions R_i, R_j, R_k, with points x_A, x_B and a point x̂ on the line segment between them.
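A minimal sketch of this decision rule in numpy (the weight vectors and biases are arbitrary illustrative values):

```python
import numpy as np

def predict_class(x, W, w0):
    # Assign x to the class k with the largest y_k(x) = w_k^T x + w_k0.
    scores = W @ x + w0          # one row of W per class
    return int(np.argmax(scores))

W = np.array([[ 1.0,  0.0],      # w_1, w_2, w_3 for K = 3 classes
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(predict_class(np.array([0.2, 0.9]), W, w0))   # -> 1, i.e. class C_2
```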
Least Squares for Classification
Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes C_k.
Use the 1-of-K binary coding scheme.
Each class is described by its own linear model
y_k(x) = w_k^T x + w_k0,   k = 1, ..., K.
Least Squares for Classification
With the conventions
w̃_k = [w_k0, w_k]^T ∈ R^{D+1},   x̃ = [1, x]^T ∈ R^{D+1},   W̃ = [w̃_1 ... w̃_K] ∈ R^{(D+1)×K},
we get for the (vector-valued) discriminant function
y(x) = W̃^T x̃ ∈ R^K
(if it helps, you may think of W̃^T as a vector-valued function).
For a new input x, the class is then given by the index of the largest value in the vector y(x).
Determine W̃
Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class in the 1-of-K coding scheme.
Define a matrix T whose n-th row is t_n^T, and a matrix X̃ whose n-th row is x̃_n^T.
The sum-of-squares error can now be written as (check that tr{A^T A} = ‖A‖_F^2)
E_D(W̃) = (1/2) tr{(X̃W̃ − T)^T (X̃W̃ − T)}.
The minimum of E_D(W̃) is reached for
W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T,
where X̃^† is the pseudo-inverse of X̃.
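A minimal numpy sketch of this closed-form solution (the function names and the use of np.linalg.pinv are illustrative; any equivalent solver works):

```python
import numpy as np

def fit_least_squares(X, T):
    # X: N x D inputs, T: N x K one-hot targets.  Returns W_tilde of shape (D+1) x K.
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with x_0 = 1
    return np.linalg.pinv(Xt) @ T                   # W_tilde = pseudo-inverse(Xt) @ T

def ls_predict(X, W_tilde):
    # Class = index of the largest component of y(x) = W_tilde^T x_tilde.
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xt @ W_tilde, axis=1)
```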
Discriminant Function for Multi-Class
The discriminant function y(x) is therefore
y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃,
where X̃ is given by the training data and x̃ is the new input.
Interesting property: if every t_n satisfies the same linear constraint a^T t_n + b = 0, then the prediction y(x) also obeys the same constraint,
a^T y(x) + b = 0.
For the 1-of-K coding scheme, the sum of all components of t_n is one, and therefore all components of y(x) sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).
Deficiencies of the Least Squares Approach
Magenta curve: decision boundary for the least squares approach.
Green curve: decision boundary for logistic regression (described later).
(Figure: two scatter plots of two-class data, each showing both decision boundaries.)
(Imagine heat maps of the quadratic penalty function, similar to those of the linear functions earlier in the slides.)
Deficiencies of the Least Squares Approach
Left plot: decision boundaries for least squares.
Right plot: decision boundaries for logistic regression (described later).
Fisher’s Linear Discriminant
View linear classification as dimensionality reduction.
y(x) = w^T x
If y ≥ −w_0 then class C_1, otherwise C_2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information.
For classification, we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation?
Fisher’s Linear Discriminant
Figure: samples from two classes in a two-dimensional input space, and their histograms when projected onto two different one-dimensional spaces.
Fisher’s Linear Discriminant – First Try
Given N_1 input data points of class C_1 and N_2 input data points of class C_2, calculate the centres of the two classes:
m_1 = (1/N_1) ∑_{n ∈ C_1} x_n,   m_2 = (1/N_2) ∑_{n ∈ C_2} x_n.
Choose w so as to maximise the separation of the projected class means,
m̂_1 − m̂_2 = w^T (m_1 − m_2),   where m̂_k = w^T m_k is the projected class mean.
Problem with non-uniform within-class covariance.
Fisher’s Linear Discriminant
Measure also the within-class variance for each class,
s_k^2 = ∑_{n ∈ C_k} (y_n − m̂_k)^2,
where y_n = w^T x_n.
Maximise the Fisher criterion
J(w) = (m̂_2 − m̂_1)^2 / (s_1^2 + s_2^2).
Primer: Bilinear form with a Covariance Matrix
Let
μ = E[x],
Σ = cov[x] = E[(x − μ)(x − μ)^T].
Then
var[w^T x] = E[(w^T x − E[w^T x])^2]
           = E[(w^T x − w^T E[x])^2]
           = E[(w^T x − w^T μ)^2]
           = E[(w^T x − w^T μ)(w^T x − w^T μ)]
           = E[(w^T x − w^T μ)(x^T w − μ^T w)]
           = E[w^T (x − μ)(x − μ)^T w]
           = w^T E[(x − μ)(x − μ)^T] w
           = w^T Σ w.
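A quick Monte Carlo sanity check of var[w^T x] = w^T Σ w (the particular mean, covariance and w are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=200_000)
w = np.array([0.5, -1.5])

lhs = np.var(X @ w)                       # var[w^T x] estimated from samples
rhs = w @ np.cov(X, rowvar=False) @ w     # w^T Sigma w with the sample covariance
print(lhs, rhs)                           # the two agree up to sampling noise
```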
Fisher’s Linear Discriminant
The Fisher criterion can be rewritten as
J(w) = (w^T S_B w) / (w^T S_W w).
S_B is the between-class covariance,
S_B = (m_2 − m_1)(m_2 − m_1)^T,
so by the previous slide, the numerator of J(w) is the variance of the projection of the means.
S_W is the within-class covariance,
S_W = ∑_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^T + ∑_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^T,
so by the previous slide and w^T (A + B) w = w^T A w + w^T B w, the denominator of J(w) is
(the variance of the projection of the points in class C_1) + (the variance of the projection of the points in class C_2).
Fisher’s Linear Discriminant
The Fisher criterion
J(w) = (w^T S_B w) / (w^T S_W w)
has a maximum for Fisher's linear discriminant
w ∝ S_W^{-1} (m_2 − m_1).
Fisher's linear discriminant is NOT itself a discriminant, but can be used to construct one by choosing a threshold y_0 in the projection space.
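A minimal numpy sketch of Fisher's linear discriminant direction for two classes (the helper name and the normalisation are illustrative choices):

```python
import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: arrays of shape (N_k, D) holding the points of classes C_1 and C_2.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance
    w = np.linalg.solve(S_W, m2 - m1)                          # w proportional to S_W^{-1}(m2 - m1)
    return w / np.linalg.norm(w)

# A classifier is then obtained by thresholding the projection y = w^T x at some y_0.
```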
Fisher’s Discriminant For Multi-Class
Assume that the dimensionality of the input space D is greater than the number of classes K.
Use D′ > 1 linear 'features' y_k = w_k^T x and write everything in vector form (with no bias term):
y = W^T x.
The within-class covariance is then the sum of the covariances for all K classes,
S_W = ∑_{k=1}^{K} S_k,
where
S_k = ∑_{n ∈ C_k} (x_n − m_k)(x_n − m_k)^T,   m_k = (1/N_k) ∑_{n ∈ C_k} x_n.
Fisher’s Discriminant For Multi-Class
Between-class covariance:
S_B = ∑_{k=1}^{K} N_k (m_k − m)(m_k − m)^T,
where m is the total mean of the input data,
m = (1/N) ∑_{n=1}^{N} x_n.
One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is
J(W) = tr{ (W^T S_W W)^{-1} (W^T S_B W) }.
The maximum of J(W) is determined by the D′ eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
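A numpy sketch of the multi-class case (the function name is illustrative; it assumes S_W is invertible and keeps the D′ leading eigenvectors):

```python
import numpy as np

def fisher_multiclass_projection(X, y, n_components):
    # X: N x D inputs, y: length-N integer class labels.  Returns the projected data.
    classes = np.unique(y)
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W, S_B = np.zeros((D, D)), np.zeros((D, D))
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                  # within-class covariance S_k
        S_B += len(Xk) * np.outer(mk - m, mk - m)       # between-class covariance term
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]              # largest eigenvalues first
    W = eigvecs[:, order[:n_components]].real           # D x D' projection matrix
    return X @ W
```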
The Perceptron Algorithm
Frank Rosenblatt (1928 – 1971)
“Principles of neurodynamics: Perceptrons and the theory
of brain mechanisms” (Spartan Books, 1962)
The Perceptron Algorithm
The perceptron ("Mark 1") was the first computer which could learn new skills by trial and error.
The Perceptron Algorithm
Two-class model.
Create a feature vector φ(x) by a fixed nonlinear transformation of the input x.
Generalised linear model
y(x) = f(w^T φ(x)),
with φ(x) containing some bias element φ_0(x) = 1.
Nonlinear activation function:
f(a) = +1 if a ≥ 0,  −1 if a < 0.
Target coding for the perceptron:
t = +1 if C_1,  −1 if C_2.
The Perceptron Algorithm - Error Function
Idea: minimise the total number of misclassified patterns.
Problem: as a function of w, this is piecewise constant, and therefore the gradient is zero almost everywhere.
Better idea: using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0.
Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M,
E_P(w) = − ∑_{n ∈ M} w^T φ(x_n) t_n.
Perceptron – Stochastic Gradient Descent
Perceptron criterion (with the notation φ_n = φ(x_n)):
E_P(w) = − ∑_{n ∈ M} w^T φ_n t_n,   with per-pattern error E_P^{(n)}(w) = − w^T φ_n t_n.
One iteration at step τ:
1. Choose a training datapoint index n (uniformly at random, or by cycling through the data).
2. Update the weight vector w by
w^{(τ+1)} = w^{(τ)} − η ∇E_P^{(n)}(w),
where
∇E_P^{(n)}(w) = −φ_n t_n  if  w^{(τ)T} φ(x_n) t_n ≤ 0,   and 0 otherwise.
As y(x, w) is invariant to the norm of w, we may set η = 1.
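A minimal implementation of this update rule (η = 1, cycling through the data; the epoch cap is an added safeguard, since convergence is only guaranteed for linearly separable data):

```python
import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    # Phi: N x M feature vectors (including a bias element), t: labels in {-1, +1}.
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        n_misclassified = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:        # pattern n is misclassified
                w = w + phi_n * t_n           # w^(tau+1) = w^(tau) + phi_n t_n
                n_misclassified += 1
        if n_misclassified == 0:              # all patterns satisfy w^T phi_n t_n > 0
            return w
    return w
```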
The Perceptron Algorithm – Update 1
Update of the perceptron weights from a misclassified pattern (green):
w^{(τ+1)} = w^{(τ)} + φ_n t_n.
Figure: data and decision boundary before and after the update.
The Perceptron Algorithm – Update 2
Update of the perceptron weights from a misclassified pattern (green):
w^{(τ+1)} = w^{(τ)} + φ_n t_n.
Figure: data and decision boundary before and after the update.
The Perceptron Algorithm – Convergence
Does the algorithm converge?
For a single update step, letting η = 1 and considering the error from a single point,
−w^{(τ+1)T} φ_n t_n = −w^{(τ)T} φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^{(τ)T} φ_n t_n,
because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖^2 > 0. In other words, a gradient step on a linear function decreases that function.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.