
Statistical Machine Learning

©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part V

Linear Classification 1

Topics: Classification, Generalised Linear Model, Discriminant Functions,
Fisher's Linear Discriminant, The Perceptron Algorithm

Strategy in this course

Estimate best predictor = training = learning
Given data (x1, y1), . . . , (xn, yn), find a predictor fw(·).

1 Identify the type of input x and output y data
2 Propose a (linear) mathematical model for fw
3 Design an objective function or likelihood
4 Calculate the optimal parameter (w)
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm in python)
7 Interpret and diagnose results


Classification

Goal : Given input data x, assign it to one of K discrete
classes Ck where k = 1, . . . ,K.
Divide the input space into different regions.
Equivalently: map each point to a categorical label.

Figure: length of petal [cm] vs. length of sepal [cm] for three types of
flowers (Iris Setosa, Iris Versicolor, Iris Virginica).


How to represent binary class labels?

Class labels are no longer real values as in regression, but
come from a discrete set.
Two classes: t ∈ {0, 1}
(t = 1 represents class C1 and t = 0 represents class C2)
Can interpret the value of t as the probability of class C1,
with only two values possible for the probability, 0 or 1.
Note: other conventions for mapping classes to integers are
possible; check the setup.


How to represent multi-class labels?

If there are more than two classes (K > 2), we call it a
multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of
length K with all entries 0 except for tj = 1, where j is the
index of the class Cj being encoded.
Example: Given 5 classes, {C1, . . . , C5}, membership in
class C2 will be encoded as the target vector

t = (0, 1, 0, 0, 0)>

Note: other conventions for mapping multiple classes to
integers are possible; check the setup.
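To make the coding scheme concrete, here is a minimal NumPy sketch (my addition, not from the slides) that builds 1-of-K target vectors from integer class labels; the function name one_hot is illustrative only.

import numpy as np

def one_hot(labels, K):
    """Encode integer class labels (0-based) as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Membership in class C2 (index 1) among K = 5 classes -> (0, 1, 0, 0, 0)
print(one_hot([1], K=5))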


Linear Model

Idea: Use again a Linear Model as in regression: y(x,w) is
a linear function of the parameters w

y(xn,w) = w>φ(xn)

But generally y(xn,w) ∈ R.
Example: Which class is y(x,w) = 0.71623 ?


Generalised Linear Model

Apply a mapping f : R→ Z to the linear model to get the
discrete class labels.
Generalised Linear Model

y(xn,w) = f (w>φ(xn))

Activation function: f (·)
Link function : f−1(·)

Figure: example of an activation function, f(z) = sign(z).
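As a concrete illustration (my addition, not from the slides), a minimal sketch of a generalised linear model with the sign activation; here φ(x) is taken to be (1, x) and the weights are arbitrary example values.

import numpy as np

def glm_predict(X, w_tilde):
    """y(x) = f(w^T phi(x)) with phi(x) = (1, x) and f(a) = sign(a); labels in {-1, +1}."""
    Phi = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias element phi_0(x) = 1
    a = Phi @ w_tilde                            # linear model w^T phi(x)
    return np.where(a >= 0, 1, -1)               # activation f maps R to discrete labels

X = np.array([[1.0, 2.0], [-1.0, -0.5]])
print(glm_predict(X, w_tilde=np.array([-0.5, 1.0, 2.0])))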


Three Models for Decision Problems

Discriminant Functions
Find a discriminant function f(x) which maps each input
directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior
class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the
classes.

Generative Models
1 Solve the inference problem of determining the
class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the
classes.


Two Classes

Definition

A discriminant is a function that maps from an input vector x to
one of K classes, denoted by Ck.

Consider first two classes ( K = 2 ).
Construct a linear function of the inputs x

y(x) = w>x + w0

such that x is assigned to class C1 if y(x) ≥ 0, and to
class C2 otherwise.
weight vector w
bias w0 (sometimes −w0 is called the threshold)


Linear Functions

Figure: heat-maps of w>x over the (x1, x2) plane for w = (1, 2),
w = (−1,−2) and w = (2, 4); doubling w doubles the range of values.
Gradient = direction of steepest ascent = ∇x w>x = w.
The set w>x + w0 = 0 is a hyper-plane.
Projecting x onto that hyper-plane means finding
arg min_{x⊥} ‖x − x⊥‖ subject to the constraint w>x⊥ + w0 = 0.
Geometrically: move in the direction w/‖w‖.
Moving a distance a in that direction changes the function value by
(a w/‖w‖)>w = a‖w‖, so the rate of change per unit distance is ‖w‖.
The length ‖a w/‖w‖‖ = (a/‖w‖)·‖w‖ = a.
For a fixed change a‖w‖ in w>x, the required distance a ∝ 1/‖w‖.
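A tiny numerical sketch of the projection described above (my addition; w, w0 and x are arbitrary example values): x⊥ = x − ((w>x + w0)/‖w‖²) w lies on the hyper-plane.

import numpy as np

w, w0 = np.array([1.0, 2.0]), -0.5
x = np.array([3.0, 1.0])

# Orthogonal projection of x onto the hyper-plane w^T x + w0 = 0
x_perp = x - (w @ x + w0) / (w @ w) * w

print(x_perp, w @ x_perp + w0)   # the second value is ~0, so x_perp lies on the plane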


Two Classes

Decision boundary y(x) = 0 is a (D− 1)-dimensional
hyperplane in a D-dimensional input space (decision
surface).
w is orthogonal to any vector lying in the decision surface.
Proof: Assume xA and xB are two points lying in the
decision surface. Then,

0 = y(xA)− y(xB) = w>(xA − xB)


Two Classes

y(x) gives a signed measure of the perpendicular distance
r from the decision surface to x, that is r = y(x)/‖w‖.

Writing x = x⊥ + r w/‖w‖, where x⊥ is the orthogonal projection
of x onto the decision surface,

y(x) = w>(x⊥ + r w/‖w‖) + w0 = (w>x⊥ + w0) + r w>w/‖w‖ = 0 + r‖w‖ = r‖w‖

Figure: geometry of the two-class linear discriminant: decision
surface y = 0 with normal w, regions R1 (y > 0) and R2 (y < 0),
the signed distance y(x)/‖w‖ of x from the surface, and the offset
−w0/‖w‖ of the surface from the origin.


Two Classes

The normal distance from the origin to the decision
surface is therefore

−y(0)/‖w‖ = −w0/‖w‖

(see the figure on the previous slide).


Two Classes

More compact notation : Add an extra dimension to the
input space and set the value to x0 = 1.
Also define w̃ = (w0,w) and x̃ = (1, x)

y(x) = w̃>x̃

(if it helps, you may think of w̃> as a function).
Decision surface is now a D-dimensional hyperplane in a
(D + 1)-dimensional expanded input space.


Multi-Class

Number of classes K > 2
Can we combine a number of two-class discriminant
functions using K − 1 one-versus-the-rest classifiers?

Figure: regions R1, R2, R3 produced by one-versus-the-rest
classifiers (C1 vs. not C1, C2 vs. not C2), with an ambiguous
region (?).


Multi-Class

Number of classes K > 2
Can we combine a number of two-class discriminant
functions using K(K − 1)/2 one-versus-one classifiers?

Figure: regions R1, R2, R3 produced by one-versus-one classifiers
(C1 vs. C2, C1 vs. C3, C2 vs. C3), with an ambiguous region (?).


Multi-Class

Number of classes K > 2
Solution: Use K linear functions

yk(x) = w>k x + wk0

Assign input x to class Ck if yk(x) > yj(x) for all j ≠ k.
Decision boundary between class Ck and Cj given by
yk(x) = yj(x).

Figure: decision regions Ri, Rj, Rk of the multi-class linear
discriminant, with two points xA and xB lying in region Rk.
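A minimal sketch of this decision rule (my addition; the weights W and biases are made up): compute yk(x) = wk>x + wk0 for all k and assign x to the class with the largest value.

import numpy as np

W = np.array([[ 1.0,  0.0],       # w_1
              [ 0.0,  1.0],       # w_2
              [-1.0, -1.0]])      # w_3, shape (K, D)
w0 = np.array([0.0, -0.5, 0.2])   # biases w_k0

def predict(x):
    y = W @ x + w0                # y_k(x) = w_k^T x + w_k0 for k = 1..K
    return int(np.argmax(y)) + 1  # class index C_k (1-based)

print(predict(np.array([2.0, 0.5])))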


Least Squares for Classification

Regression with a linear function of the model parameters
and minimisation of a sum-of-squares error function resulted
in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes Ck.
Use 1-of-K binary coding scheme.
Each class is described by its own linear model

yk(x) = w>k x + wk0 k = 1, . . . ,K


Least Squares for Classification

With the conventions

w̃k = (wk0, wk)> ∈ R^{D+1},   x̃ = (1, x)> ∈ R^{D+1},   W̃ = [w̃1 . . . w̃K] ∈ R^{(D+1)×K}

we get for the (vector-valued) discriminant function

y(x) = W̃>x̃ ∈ R^K.

(if it helps, you may think of W̃> as a vector-valued function).
For a new input x, the class is then given by the index of the
largest value in the vector y(x).


Determine W̃

Given a training set {xn, tn} where n = 1, . . . ,N, and tn is
the class in the 1-of-K coding scheme.
Define a matrix T where row n corresponds to t>n .
The sum-of-squares error can now be written as
(check that tr{A>A} = ‖A‖_F²)

ED(W̃) = (1/2) tr{(X̃W̃ − T)>(X̃W̃ − T)}

The minimum of ED(W̃) will be reached for

W̃ = (X̃>X̃)−1X̃>T = X̃†T

where X̃† is the pseudo-inverse of X̃.
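A small end-to-end sketch of this closed-form solution (my addition; the toy data is made up): build X̃ with a leading column of ones, put the 1-of-K targets in the rows of T, compute W̃ = X̃†T, and classify a new input by the largest component of y(x) = W̃>x̃.

import numpy as np

# Toy training data: N points in D = 2 dimensions, labels in {0, 1, 2} (K = 3)
X = np.array([[0.0, 0.2], [1.0, 0.9], [2.1, 2.0], [3.0, 3.2]])
labels = np.array([0, 0, 1, 2])
K = 3

X_tilde = np.hstack([np.ones((len(X), 1)), X])   # rows are x_tilde_n = (1, x_n)
T = np.eye(K)[labels]                            # rows are 1-of-K targets t_n

W_tilde = np.linalg.pinv(X_tilde) @ T            # W_tilde = pinv(X_tilde) @ T

def predict(x):
    y = W_tilde.T @ np.concatenate(([1.0], x))   # y(x) = W_tilde^T x_tilde
    return int(np.argmax(y))

print(predict(np.array([2.8, 3.0])))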


Discriminant Function for Multi-Class

The discriminant function y(x) is therefore

y(x) = W̃>x̃ = T>(X̃†)>x̃,

where X̃ is given by the training data, and x̃ is the new
input.
Interesting property: If for every tn the same linear
constraint a>tn + b = 0 holds, then the prediction y(x) will
also obey the same constraint

a>y(x) + b = 0.

For the 1-of-K coding scheme, the sum of all components
of tn is one, and therefore the components of y(x) will also
sum to one. BUT: the components are not probabilities, as
they are not constrained to the interval (0, 1).


Deficiencies of the Least Squares Approach

Magenta curve: decision boundary for the least-squares approach.
Green curve: decision boundary for logistic regression (described later).

Figure: two-class data with the least-squares (magenta) and
logistic-regression (green) decision boundaries, shown for two
versions of the data set.

(Imagine heat-maps of the quadratic penalty function, similarly
to those of the linear functions earlier in the slides.)


Deficiencies of the Least Squares Approach

Left plot: decision boundaries for least squares.
Right plot: boundaries for logistic regression (described later).


Fisher’s Linear Discriminant

View linear classification as dimensionality reduction.

y(x) = w>x

If y ≥ −w0 then class C1, otherwise C2.
But there are many projections from a D-dimensional input
space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation
in one dimension.
Can we find a projection which maximally preserves the
class separation ?


Fisher’s Linear Discriminant

Figure: samples from two classes in a two-dimensional input space
and their histograms when projected onto two different
one-dimensional spaces.


Fisher’s Linear Discriminant – First Try

Given N1 input data of class C1, and N2 input data of class
C2, calculate the centres of the two classes

m1 = (1/N1) Σ_{n∈C1} xn,   m2 = (1/N2) Σ_{n∈C2} xn

Choose w so as to maximise the separation of the projected
class means

m1 − m2 = w>(m1 − m2)

(on the left, mk denotes the projected scalar mean w>mk; on
the right, mk is the class-mean vector).
Problem: when the within-class covariance is non-uniform, the
projected classes can still overlap considerably.

Figure: samples from two classes whose projection onto the line
joining the class means leaves the classes strongly overlapping.


Fisher’s Linear Discriminant

Measure also the within-class variance for each class

sk² = Σ_{n∈Ck} (yn − mk)²

where yn = w>xn (and mk here denotes the projected class mean).
Maximise the Fisher criterion

J(w) = (m2 − m1)² / (s1² + s2²)

Figure: the projection chosen by the Fisher criterion, showing
improved class separation.


Primer: Bilinear form with a Covariance Matrix

Let

µ = E[x],   Σ = cov[x] = E[(x − µ)(x − µ)>].

Then

var[w>x] = E[(w>x − E[w>x])²]
         = E[(w>x − w>E[x])²]
         = E[(w>x − w>µ)²]
         = E[(w>x − w>µ)(w>x − w>µ)]
         = E[(w>x − w>µ)(x>w − µ>w)]
         = E[w>(x − µ)(x − µ)>w]
         = w>E[(x − µ)(x − µ)>] w
         = w>Σw.
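A quick numerical sanity check of this identity (my addition; the data and w are arbitrary): the sample variance of the projections w>xn should match w>Σw computed from the sample covariance Σ.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3)) @ np.array([[1.0, 0.2, 0.0],
                                            [0.0, 1.0, 0.5],
                                            [0.0, 0.0, 1.0]])   # correlated samples
w = np.array([0.5, -1.0, 2.0])

Sigma = np.cov(X, rowvar=False)
print(np.var(X @ w), w @ Sigma @ w)   # the two values agree closely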


Fisher’s Linear Discriminant

The Fisher criterion can be rewritten as

J(w) = (w>SBw) / (w>SWw)

SB is the between-class covariance

SB = (m2 − m1)(m2 − m1)>

so by the previous slide, the numerator of J(w) is
the variance of the projection of the means.
SW is the within-class covariance

SW = Σ_{n∈C1} (xn − m1)(xn − m1)> + Σ_{n∈C2} (xn − m2)(xn − m2)>

so by the previous slide and w>(A + B)w = w>Aw + w>Bw,
the denominator of J(w) is
(the variance of the projection of the points in class C1) +
(the variance of the projection of the points in class C2).


Fisher’s Linear Discriminant

The Fisher criterion

J(w) = (w>SBw) / (w>SWw)

has a maximum for Fisher's linear discriminant

w ∝ SW^{-1}(m2 − m1)

Fisher’s linear discriminant is NOT a discriminant, but can
be used to construct one by choosing a threshold y0 in the
projection space.
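A compact sketch of this construction for two classes (my addition; the toy data is made up): compute the class means, the within-class scatter SW, and w ∝ SW^{-1}(m2 − m1).

import numpy as np

# Toy two-class data, shape (N_k, D)
X1 = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]])
X2 = np.array([[3.0, 0.5], [3.5, 0.2], [2.8, 0.9]])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class covariance S_W (unnormalised scatter, as on the slide)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m2 - m1)   # w proportional to S_W^{-1} (m2 - m1)
w /= np.linalg.norm(w)              # normalise; only the direction matters

print(w, X1 @ w, X2 @ w)            # the two classes project to well-separated values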


Fisher’s Discriminant For Multi-Class

Assume that the dimensionality of the input space D is
greater than the number of classes K.
Use D′ > 1 linear ’features’ yk = wk>x and write everything
in vector form (with no bias term)

y = W>x.

The within-class covariance is then the sum of the
covariances for all K classes

SW = Σ_{k=1..K} Sk

where

Sk = Σ_{n∈Ck} (xn − mk)(xn − mk)>,   mk = (1/Nk) Σ_{n∈Ck} xn


Fisher’s Discriminant For Multi-Class

Between-class covariance

SB = Σ_{k=1..K} Nk (mk − m)(mk − m)>

where m is the total mean of the input data

m = (1/N) Σ_{n=1..N} xn.

One possible way to define a function of W which is large
when the between-class covariance is large and the
within-class covariance is small is given by

J(W) = tr{(W>SWW)^{-1}(W>SBW)}

The maximum of J(W) is determined by the D′ eigenvectors
of SW^{-1}SB with the largest eigenvalues.
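A sketch of the multi-class case (my addition; the random toy data and D′ = 2 are arbitrary): build SW and SB as above and take the D′ leading eigenvectors of SW^{-1}SB as the columns of W.

import numpy as np

rng = np.random.default_rng(1)
K, D, D_prime = 3, 4, 2
Xs = [rng.normal(loc=k, size=(30, D)) for k in range(K)]     # one array per class

m = np.concatenate(Xs).mean(axis=0)                          # total mean
S_W = sum((X - X.mean(0)).T @ (X - X.mean(0)) for X in Xs)   # within-class scatter
S_B = sum(len(X) * np.outer(X.mean(0) - m, X.mean(0) - m) for X in Xs)

# Eigenvectors of S_W^{-1} S_B with the largest eigenvalues give the projection W
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:D_prime]].real                         # shape (D, D')

Y = np.concatenate(Xs) @ W                                   # projected features y = W^T x
print(W.shape, Y.shape)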


The Perceptron Algorithm

Frank Rosenblatt (1928 – 1969)
“Principles of neurodynamics: Perceptrons and the theory
of brain mechanisms” (Spartan Books, 1962)


The Perceptron Algorithm

Perceptron (“MARK 1”) was the first computer which could
learn new skills by trial and error


The Perceptron Algorithm

Two class model
Create feature vector φ(x) by a fixed nonlinear
transformation of the input x.
Generalised linear model

y(x) = f (w>φ(x))

with φ(x) containing some bias element φ0(x) = 1.
nonlinear activation function

f(a) = +1 if a ≥ 0, −1 if a < 0

Target coding for the perceptron: t = +1 if C1, t = −1 if C2.

The Perceptron Algorithm – Error Function

Idea: minimise the total number of misclassified patterns.
Problem: as a function of w, this is piecewise constant and
therefore the gradient is zero almost everywhere.
Better idea: using the (−1,+1) target coding scheme, we want
all patterns to satisfy w>φ(xn)tn > 0.
Perceptron Criterion: add the errors for all patterns
belonging to the set of misclassified patterns M

EP(w) = − Σ_{n∈M} w>φ(xn)tn


Perceptron – Stochastic Gradient Descent

Perceptron Criterion (with notation φn = φ(xn))

EP(w) = − Σ_{n∈M} w>φntn = Σ_{n∈M} E_P^{(n)}(w),   where E_P^{(n)}(w) ≡ −w>φntn

One iteration at step τ:
1 Choose a training datapoint index n
(uniformly at random or by cycling through the data)
2 Update the weight vector w by

w(τ+1) = w(τ) − η ∇E_P^{(n)}(w)

where

∇E_P^{(n)}(w) = −φntn if w(τ)>φn tn ≤ 0, and 0 otherwise.

As y(x,w) is invariant to the norm of w, we may set η = 1.
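A minimal sketch of this stochastic update (my addition; the toy data and the epoch cap are arbitrary), with η = 1 and φ(x) = (1, x): whenever w>φntn ≤ 0 the pattern is misclassified and w is updated by +φntn.

import numpy as np

# Linearly separable toy data with targets in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
t = np.array([1, 1, -1, -1])

Phi = np.hstack([np.ones((len(X), 1)), X])   # phi(x) = (1, x), bias element phi_0 = 1
w = np.zeros(Phi.shape[1])

for epoch in range(100):                     # cycle through the data
    updated = False
    for phi_n, t_n in zip(Phi, t):
        if w @ phi_n * t_n <= 0:             # misclassified (or on the boundary)
            w = w + phi_n * t_n              # update with eta = 1
            updated = True
    if not updated:                          # all patterns satisfy w^T phi_n t_n > 0
        break

print(w, np.sign(Phi @ w))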


The Perceptron Algorithm – Update 1

Update of the perceptron weights from a misclassified pattern
(green)

w(τ+1) = w(τ) + φntn



The Perceptron Algorithm – Update 2

Update of the perceptron weights from a misclassified pattern
(green)

w(τ+1) = w(τ) + φntn



The Perceptron Algorithm – Convergence

Does the algorithm converge?
For a single update step, letting η = 1 and considering the
error from a single point,

−w(τ+1)>φntn = −w(τ)>φntn − (φntn)>φntn < −w(τ)>φntn

because (φntn)>φntn = ‖φntn‖² > 0. In other words, gradient
descent on a linear function decreases that function.
BUT: contributions to the error from the other misclassified
patterns might have increased.
AND: some correctly classified patterns might now be
misclassified.
Perceptron Convergence Theorem: if the training set is
linearly separable, the perceptron algorithm is guaranteed
to find a solution in a finite number of steps.
