
Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part V
Linear Classification 1
Classification
Generalised Linear Model
Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm

Strategy in this course
Estimate best predictor = training = learning
Given data (x1, y1), . . . , (xn, yn), find a predictor fw(·).
1 Identify the type of input x and output y data
2 Propose a (linear) mathematical model for fw
3 Design an objective function or likelihood
4 Calculate the optimal parameter (w)
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm in python)
7 Interpret and diagnose results

Classification
Goal : Given input data x, assign it to one of K discrete classes Ck where k = 1,…,K.
Divide the input space into different regions. Equivalently: map each point to a categorical label.
Length of petal [in cm] vs sepal [cm] for three types of flowers (Iris Setosa, Iris Versicolor, Iris Virginica).

How to represent binary class labels?
Class labels are no longer real values as in regression, but a discrete set.
Two classes : t ∈ {0, 1}
( t = 1 represents class C1 and t = 0 represents class C2)
Can interpret the value of t as the probability of class C1, with only two values possible for the probability, 0 or 1.
Note: other conventions for mapping classes to integers are possible; check the setup.

How to represent multi-class labels?
If there are more than two classes ( K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for tj = 1, where Cj is the class to be encoded.
Example: Given 5 classes, {C1 , . . . , C5 }. Membership in class C2 will be encoded as the target vector
t = (0,1,0,0,0)T
Note: other conventions for mapping multiple classes to integers are possible; check the setup.
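As a concrete illustration (not from the slides), a minimal numpy sketch of the 1-of-K coding; the function name one_of_k is made up, and classes are indexed from 0 here rather than 1:

import numpy as np

def one_of_k(labels, K):
    """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# Membership in class C2 (index 1) among 5 classes -> (0, 1, 0, 0, 0)
print(one_of_k([1], K=5))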

Linear Model
Idea: Use again a Linear Model as in regression: y(x, w) is a linear function of the parameters w
y(xn, w) = w⊤φ(xn)
But generally y(xn, w) ∈ R.
Example: Which class is y(x, w) = 0.71623 ?

Generalised Linear Model
Apply a mapping f : R → Z to the linear model to get the discrete class labels.
Generalised Linear Model
y(xn, w) = f(w⊤φ(xn))
Activation function: f(·)
Link function: f−1(·)
Figure: example of an activation function, f(z) = sign(z).
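A minimal sketch (not part of the original slides) of such a generalised linear model with f = sign, using an assumed basis function phi that just prepends a bias term; all names and values are illustrative:

import numpy as np

def phi(x):
    """Hypothetical basis function: prepend a bias term to the raw input."""
    return np.concatenate(([1.0], np.atleast_1d(x)))

def y(x, w):
    """Generalised linear model y(x, w) = f(w^T phi(x)) with f = sign."""
    return np.sign(w @ phi(x))

w = np.array([-0.5, 1.0])    # example parameters (illustrative only)
print(y(0.3, w), y(0.7, w))  # -> -1.0, 1.0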

Three Models for Decision Problems
Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.
Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes’ theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck ) directly.
5 Use decision theory to assign each new x to one of the classes.

Two Classes
Definition
A discriminant is a function that maps from an input vector x to one of K classes, denoted by Ck.
Consider first two classes ( K = 2 ). Construct a linear function of the inputs x
y(x) = w⊤x + w0
such that x is assigned to class C1 if y(x) ≥ 0, and to class C2 otherwise.
weight vector w
bias w0 (sometimes −w0 is called the threshold)

Linear Functions
Figure: heat maps of the linear function w⊤x for three different weight vectors w (axes x1 and x2).
The set w⊤x + w0 = 0 is a hyper-plane.
Projecting x onto that hyper-plane means finding arg min_{x⊥} ∥x − x⊥∥ subject to the constraint w⊤x⊥ + w0 = 0. Geometrically: move in the direction w/∥w∥.
Gradient = direction of steepest ascent = ∇x w⊤x = w.
Rate of change of the function value in that direction: moving by a w/∥w∥ changes the value by (a w/∥w∥)⊤ w = a ∥w∥.
The length ∥a w/∥w∥∥ = a ∥w∥/∥w∥ = a.
For a fixed change in w⊤(a w/∥w∥), a ∝ 1/∥w∥.
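These geometric facts are easy to check numerically. A small sketch with made-up values for w, w0 and x (purely illustrative):

import numpy as np

w, w0 = np.array([2.0, 4.0]), -1.0
x = np.array([3.0, 1.0])

y = w @ x + w0                           # signed value of the linear function at x
r = y / np.linalg.norm(w)                # signed perpendicular distance to the hyperplane
x_perp = x - r * w / np.linalg.norm(w)   # projection of x onto the hyperplane

print(r)
print(w @ x_perp + w0)                   # ~0: x_perp lies on the hyperplane w^T x + w0 = 0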

Two Classes
Decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane in a D-dimensional input space (decision surface).
w is orthogonal to any vector lying in the decision surface. Proof: Assume xA and xB are two points lying in the
decision surface. Then,
0 = y(xA) − y(xB) = w⊤(xA − xB)

Two Classes
y(x) gives a signed measure of the perpendicular distance r from the decision surface to x, that is r = y(x)/∥w∥.
Write x = x⊥ + r w/∥w∥, where x⊥ is the orthogonal projection of x onto the decision surface. Then
y(x) = w⊤(x⊥ + r w/∥w∥) + w0 = r (w⊤w)/∥w∥ + w⊤x⊥ + w0 = r ∥w∥,
since w⊤x⊥ + w0 = 0.
Figure: geometry of the decision surface y = 0 separating regions R1 (y > 0) and R2 (y < 0), with normal direction w, distance y(x)/∥w∥ from x, and distance −w0/∥w∥ from the origin.

Two Classes

The normal distance from the origin to the decision surface is therefore
−y(0)/∥w∥ = −w0/∥w∥

Two Classes

More compact notation: add an extra dimension to the input space and set its value to x0 = 1. Also define w̃ = (w0, w) and x̃ = (1, x) (if it helps, you may think of w̃⊤ as a function).
y(x) = w̃⊤x̃
The decision surface is now a D-dimensional hyperplane in a (D + 1)-dimensional expanded input space.

Multi-Class

Number of classes K > 2
Can we combine a number of two-class discriminant functions using K − 1 one-versus-the-rest classifiers?
Figure: K − 1 one-versus-the-rest classifiers (C1 vs not C1, C2 vs not C2) leave an ambiguous region (marked '?').

Multi-Class
Number of classes K > 2
Can we combine a number of two-class discriminant functions using K(K − 1)/2 one-versus-one classifiers?
Figure: one-versus-one classifiers between C1, C2 and C3 also leave an ambiguous region (marked '?').

Multi-Class
Number of classes K > 2
Solution: Use K linear functions
yk(x) = w⊤k x + wk0
Assign input x to class Ck if yk(x) > yj(x) for all j ≠ k.
The decision boundary between class Ck and Cj is given by yk(x) = yj(x).
Figure: decision regions Ri, Rj, Rk of the multi-class linear discriminant.
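A minimal numpy sketch of this decision rule (weights and inputs are made-up illustrations, not course data):

import numpy as np

def predict(X, W, w0):
    """Assign each row of X to the class k with the largest y_k(x) = w_k^T x + w_k0."""
    Y = X @ W + w0           # shape (N, K): y_k(x_n) in column k
    return np.argmax(Y, axis=1)

# toy example: D = 2 inputs, K = 3 classes (weights are illustrative only)
W = np.array([[ 1.0, -1.0, 0.0],
              [ 0.0,  1.0, 1.0]])   # columns are w_k
w0 = np.array([0.0, 0.0, -0.5])
X = np.array([[ 2.0, 0.0],
              [-1.0, 3.0]])
print(predict(X, W, w0))  # class index for each input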

Least Squares for Classification
Regression with a linear function of the model parameters and minimisation of the sum-of-squares error function resulted in a closed-form solution for the parameter values.
Is this also possible for classification?
Given input data x belonging to one of K classes Ck. Use 1-of-K binary coding scheme.
Each class is described by its own linear model
yk(x) = w⊤k x + wk0,   k = 1, . . . , K

Least Squares for Classification
With the conventions
w̃k = (wk0, wk) ∈ R^{D+1},   x̃ = (1, x) ∈ R^{D+1},   W̃ = (w̃1 . . . w̃K) ∈ R^{(D+1)×K}
we get for the (vector valued) discriminant function
y(x) = W̃⊤x̃ ∈ R^K
(if it helps, you may think of W̃⊤ as a vector-valued function).
For a new input x, the class is then defined by the index of the largest value in the vector y(x).

Determine W̃
Given a training set {xn,tn} where n = 1,…,N, and tn is the class in the 1-of-K coding scheme.
Define a matrix T whose n-th row is t⊤n, and a matrix X̃ whose n-th row is x̃⊤n. The sum-of-squares error can now be written as (check that tr{A⊤A} = ∥A∥²_F)
ED(W̃) = (1/2) tr{(X̃W̃ − T)⊤(X̃W̃ − T)}
The minimum of ED(W̃) is reached for
W̃ = (X̃⊤X̃)^{-1} X̃⊤T = X̃†T
where X̃† is the pseudo-inverse of X̃.
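A possible numpy sketch of this least-squares solution, assuming a design matrix X̃ with a leading column of ones and 1-of-K targets T; the synthetic data and function names are illustrative only:

import numpy as np

def fit_least_squares(X, T):
    """Least-squares classification: W_tilde = pinv(X_tilde) @ T,
    where X_tilde has a leading column of ones and T is 1-of-K coded."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(X_tilde) @ T           # shape (D+1, K)

def predict(X, W_tilde):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_tilde @ W_tilde                        # y(x) = W_tilde^T x_tilde, row-wise
    return np.argmax(Y, axis=1)

# tiny synthetic 3-class problem (data are illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in ([0, 0], [2, 0], [0, 2])])
labels = np.repeat([0, 1, 2], 20)
T = np.eye(3)[labels]                            # 1-of-K targets
W_tilde = fit_least_squares(X, T)
print((predict(X, W_tilde) == labels).mean())    # training accuracy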

Discriminant Function for Multi-Class
The discriminant function y(x) is therefore
y(x) = W̃⊤x̃ = T⊤(X̃†)⊤x̃,
where X̃ is given by the training data, and x̃ is the new input.
Interesting property: if for every tn the same linear constraint a⊤tn + b = 0 holds, then the prediction y(x) will also obey the same constraint
a⊤y(x) + b = 0.
For the 1-of-K coding scheme, the sum of all components of tn is one, and therefore all components of y(x) will sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

Deficiencies of the Least Squares Approach
Magenta curve :
Decision Boundary for the least squares approach
Green curve :
Decision boundary for the logistic regression (described later)
(Imagine heat-maps of the quadratic penalty function, similarly to those of the linear functions earlier in the slides.)

Deficiencies of the Least Squares Approach
Left plot : Decision Boundary for least squares
Right plot : Boundaries for logistic regression (described later)

Fisher’s Linear Discriminant
View linear classification as dimensionality reduction.
y(x) = w⊤x
If y ≥ −w0 then class C1, otherwise C2.
But there are many projections from a D-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Can we find a projection which maximally preserves the class separation ?

Fisher’s Linear Discriminant
Samples from two classes in a two-dimensional input space and their histogram when projected to two different one-dimensional spaces.

Fisher’s Linear Discriminant – First Try
Given N1 input data of class C1, and N2 input data of class C2, calculate the centres of the two classes
m1 = (1/N1) Σ_{n∈C1} xn,   m2 = (1/N2) Σ_{n∈C2} xn
Choose w so as to maximise the separation of the projected class means
m1 − m2 = w⊤(m1 − m2)
Problem with non-uniform covariance.
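A small sketch of this "first try" (toy data with deliberately non-uniform covariance; all values illustrative):

import numpy as np

def mean_difference_direction(X1, X2):
    """Project onto w = m2 - m1, the direction that maximises the separation
    of the projected class means (first try; ignores within-class covariance)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    w = m2 - m1
    return w / np.linalg.norm(w)

# toy data with strongly non-uniform covariance (illustrative only)
rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], [[4.0, 3.5], [3.5, 4.0]], size=200)
X2 = rng.multivariate_normal([2, 1], [[4.0, 3.5], [3.5, 4.0]], size=200)
w = mean_difference_direction(X1, X2)
print(w, (X1 @ w).std(), (X2 @ w).std())  # projected classes may still overlap heavily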

Fisher’s Linear Discriminant
Measure also the within-class variance for each class
s²_k = Σ_{n∈Ck} (yn − mk)²
where yn = w⊤xn.
Maximise the Fisher criterion
J(w) = (m2 − m1)² / (s²_1 + s²_2)

Primer: Bilinear form with a Covariance Matrix
Let
μ = E[x],   Σ = cov[x] = E[(x − μ)(x − μ)⊤].
Then
var[w⊤x] = E[(w⊤x − E[w⊤x])²]
         = E[(w⊤x − w⊤E[x])²]
         = E[(w⊤x − w⊤μ)²]
         = E[(w⊤x − w⊤μ)(x⊤w − μ⊤w)]
         = E[w⊤(x − μ)(x − μ)⊤w]
         = w⊤E[(x − μ)(x − μ)⊤]w
         = w⊤Σw.
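A quick numerical check of this identity (sample size, mean and covariance are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=100_000)
w = np.array([0.7, -1.3])

lhs = np.var(X @ w)                 # empirical var(w^T x)
Sigma = np.cov(X, rowvar=False)     # empirical covariance of x
rhs = w @ Sigma @ w                 # w^T Sigma w
print(lhs, rhs)                     # nearly identical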

Fisher’s Linear Discriminant
The Fisher criterion can be rewritten as
J(w) = (w⊤SBw) / (w⊤SWw)
SB is the between-class covariance
SB = (m2 − m1)(m2 − m1)⊤
so by the previous slide, the numerator of J(w) is the variance of the projection of the means.
SW is the within-class covariance
SW = Σ_{n∈C1} (xn − m1)(xn − m1)⊤ + Σ_{n∈C2} (xn − m2)(xn − m2)⊤
so by the previous slide and w⊤(A + B)w = w⊤Aw + w⊤Bw, the denominator of J(w) is (the variance of the projection of the points in class C1) + (the variance of the projection of the points in class C2).

Fisher’s Linear Discriminant
The Fisher criterion
J(w) = (w⊤SBw) / (w⊤SWw)
has a maximum for Fisher's linear discriminant
w ∝ S_W^{-1}(m2 − m1)
Fisher's linear discriminant is NOT a discriminant, but can be used to construct one by choosing a threshold y0 in the projection space.
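A minimal numpy sketch of Fisher's direction w ∝ S_W^{-1}(m2 − m1), reusing the kind of toy data from the sketch above; the threshold y0 below is just one simple choice, not prescribed by the slides:

import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = ((X1 - m1).T @ (X1 - m1)) + ((X2 - m2).T @ (X2 - m2))  # within-class covariance
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# toy data (illustrative only)
rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], [[4.0, 3.5], [3.5, 4.0]], size=200)
X2 = rng.multivariate_normal([2, 1], [[4.0, 3.5], [3.5, 4.0]], size=200)
w = fisher_direction(X1, X2)
y0 = 0.5 * (X1 @ w).mean() + 0.5 * (X2 @ w).mean()   # one simple threshold in projection space
pred_X2 = (X2 @ w) > y0                              # classify projections against y0
print(w, pred_X2.mean())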

Fisher’s Discriminant For Multi-Class
Assume that the dimensionality of the input space D is greater than the number of classes K.
Use D′ > 1 linear 'features' yk = w⊤k x and write everything in vector form (with no bias term)
y = W⊤x.
The within-class covariance is then the sum of the covariances for all K classes
SW = Σ_{k=1}^{K} Sk
where
Sk = Σ_{n∈Ck} (xn − mk)(xn − mk)⊤,   mk = (1/Nk) Σ_{n∈Ck} xn

Fisher’s Discriminant For Multi-Class
Between-class covariance
SB = Σ_{k=1}^{K} Nk (mk − m)(mk − m)⊤
where m is the total mean of the input data
m = (1/N) Σ_{n=1}^{N} xn.
One possible way to define a function of W which is large when the between-class covariance is large and the within-class covariance is small is given by
J(W) = tr{(W⊤SW W)^{-1}(W⊤SB W)}
The maximum of J(W) is determined by the D′ eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
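A possible numpy sketch of the multi-class case, taking the D′ leading eigenvectors of S_W^{-1} S_B; the data and function name are illustrative, and no regularisation of S_W is attempted:

import numpy as np

def fisher_multiclass(X, labels, K, d_prime):
    """Multi-class Fisher features: columns of W are the d' leading
    eigenvectors of S_W^{-1} S_B (a sketch without regularisation)."""
    m = X.mean(axis=0)
    S_W = np.zeros((X.shape[1], X.shape[1]))
    S_B = np.zeros_like(S_W)
    for k in range(K):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)
        S_B += len(Xk) * np.outer(mk - m, mk - m)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:d_prime]].real          # projection matrix W, shape (D, d')

# example usage on random 3-class data (illustrative only)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 50)
W = fisher_multiclass(X, labels, K=3, d_prime=2)
print((X @ W).shape)   # projected features y = W^T x, here (150, 2)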

The Perceptron Algorithm
Frank Rosenblatt (1928 – 1969)
“Principles of neurodynamics: Perceptrons and the theory of brain mechanisms” (Spartan Books, 1962)

The Perceptron Algorithm
Perceptron (“MARK 1”) was the first computer which could learn new skills by trial and error

The Perceptron Algorithm
Two class model
Create feature vector φ(x) by a fixed nonlinear
transformation of the input x. Generalised linear model
y(x) = f(w⊤φ(x))
with φ(x) containing some bias element φ0(x) = 1.
Nonlinear activation function
f(a) = +1 if a ≥ 0,  −1 if a < 0
Target coding for perceptron
t = +1 if C1,  −1 if C2

The Perceptron Algorithm – Error Function

Idea: Minimise the total number of misclassified patterns.
Problem: As a function of w, this is piecewise constant and therefore the gradient is zero almost everywhere.
Better idea: Using the (−1, +1) target coding scheme, we want all patterns to satisfy w⊤φ(xn)tn > 0.
Perceptron Criterion: Add the errors for all patterns belonging to the set of misclassified patterns M
EP(w) = − Σ_{n∈M} w⊤φ(xn)tn

Perceptron – Stochastic Gradient Descent
Perceptron Criterion (with notation φn = φ(xn)):
EP(w) = − Σ_{n∈M} w⊤φntn,   where EP(n)(w) = −w⊤φntn is the contribution of pattern n.
One iteration at step τ:
1 Choose a training datapoint index n (uniformly at random or by cycling through the data).
2 Update the weight vector w by
w(τ+1) = w(τ) − η ∇EP(n)(w)
where
∇EP(n)(w) = −φntn if w(τ)⊤φ(xn) · tn ≤ 0, and 0 otherwise.
As y(x, w) is invariant to the norm of w, we may set η = 1.
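A minimal sketch of the perceptron algorithm with η = 1, cycling through the data as described above; the toy data are illustrative only:

import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    """Perceptron learning: for each misclassified pattern (w^T phi_n * t_n <= 0),
    update w <- w + phi_n * t_n (eta = 1). Targets t_n are in {-1, +1};
    Phi already contains a bias feature phi_0 = 1."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        misclassified = 0
        for phi_n, t_n in zip(Phi, t):           # cycle through the data
            if (w @ phi_n) * t_n <= 0:           # misclassified (or on the boundary)
                w = w + phi_n * t_n
                misclassified += 1
        if misclassified == 0:                   # converged: all patterns satisfied
            break
    return w

# linearly separable toy data (illustrative only)
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.5, size=(30, 2)), rng.normal(2, 0.5, size=(30, 2))])
t = np.concatenate([-np.ones(30), np.ones(30)])
Phi = np.hstack([np.ones((60, 1)), X])           # identity basis plus bias feature
w = train_perceptron(Phi, t)
print(np.all(np.sign(Phi @ w) == t))             # True on separable data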

The Perceptron Algorithm – Update 1
Update of the perceptron weights from a misclassified pattern (green)
w(τ+1) = w(τ) + φntn

The Perceptron Algorithm – Update 2
Update of the perceptron weights from a misclassified pattern (green)
w(τ+1) = w(τ) + φntn

The Perceptron Algorithm – Convergence
Does the algorithm converge ?
For a single update step, letting η = 1 and considering the error from a single point,
−w(τ+1)⊤φntn = −w(τ)⊤φntn − (φntn)⊤φntn < −w(τ)⊤φntn
because (φntn)⊤φntn = ∥φntn∥² > 0. In other words, a gradient-descent step on a linear function decreases that function.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.
Perceptron Convergence Theorem : If the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.