
Statistical Machine Learning

© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University

Outline

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part II

Introduction


Flavour of this course

Formalise intuitions about problems
Use the language of mathematics to express models
Geometry, vectors, and linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in Python
Understand the choices made when designing machine learning methods


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.


Polynomial Curve Fitting

some artificial data created from the function sin(2πx) plus random noise, for x ∈ [0, 1]

[Figure: the data points (x, t) drawn from sin(2πx) with added noise, plotted over x ∈ [0, 1] with t ∈ [−1, 1]]



Polynomial Curve Fitting – Input Specification

N = 10

$$\mathbf{x} \equiv (x_1, \ldots, x_N)^T, \qquad \mathbf{t} \equiv (t_1, \ldots, t_N)^T$$
$$x_i \in \mathbb{R}, \quad t_i \in \mathbb{R}, \qquad i = 1, \ldots, N$$


Polynomial Curve Fitting – Model Specification

M : order of the polynomial

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{m=0}^{M} w_m x^m$$

nonlinear function of x
linear function of the unknown model parameters w
How can we find good parameters $\mathbf{w} = (w_0, w_1, \ldots, w_M)^T$?
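As a concrete counterpart (a sketch, not part of the slides): evaluating y(x, w) with NumPy. The coefficient values in `w` below are made-up illustrative numbers.

```python
import numpy as np

def poly(x, w):
    """Evaluate y(x, w) = sum_m w_m x^m for an array of inputs x."""
    powers = np.arange(len(w))             # exponents 0, 1, ..., M
    return np.power.outer(x, powers) @ w   # shape (len(x),)

x = np.linspace(0, 1, 5)
w = np.array([0.5, -1.0, 2.0])             # illustrative coefficients (M = 2)
print(poly(x, w))
```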


Learning is Improving Performance

[Figure: a training point (x_n, t_n) and the model prediction y(x_n, w), with the error shown as the vertical distance between them]

Performance measure: error between target and prediction of the model for the training data

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$$

unique minimum of E(w) for argument w⋆ under certain conditions (what are they?)
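A minimal sketch (not the course code) of computing E(w) and finding the least-squares w⋆ with NumPy. The data mirror the earlier slide, sin(2πx) plus Gaussian noise; the noise level 0.3 and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

Phi = np.power.outer(x, np.arange(M + 1))         # design matrix, shape (N, M+1)
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # coefficients minimising E(w)

def sum_of_squares_error(w, x, t):
    y = np.power.outer(x, np.arange(len(w))) @ w
    return 0.5 * np.sum((y - t) ** 2)

print(w_star)
print(sum_of_squares_error(w_star, x, t))
```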



Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=0} = w_0$$

[Figure: the M = 0 fit, a constant, plotted against the data over x ∈ [0, 1]]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=1} = w_0 + w_1 x$$

[Figure: the M = 1 fit, a straight line, plotted against the data]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

[Figure: the M = 3 fit plotted against the data]


Model Comparison or Model Selection

$$y(x, \mathbf{w}) = \left. \sum_{m=0}^{M} w_m x^m \right|_{M=9} = w_0 + w_1 x + \cdots + w_8 x^8 + w_9 x^9$$

overfitting

[Figure: the M = 9 fit plotted against the data — it passes through the data points but oscillates between them]


Testing the Model

Train the model and get w⋆
Get 100 new data points
Root-mean-square (RMS) error

$$E_{\mathrm{RMS}} = \sqrt{2\, E(\mathbf{w}^\star)/N}$$

[Figure: training and test E_RMS as a function of the polynomial order M = 0, ..., 9]
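A sketch of this experiment (noise level, seeds, and data-generation details are assumptions): fit polynomials of increasing order on N = 10 training points and report E_RMS on the training set and on 100 fresh test points.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N):
    x = np.sort(rng.uniform(0, 1, N))
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

def fit(x, t, M):
    Phi = np.power.outer(x, np.arange(M + 1))
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def e_rms(w, x, t):
    y = np.power.outer(x, np.arange(len(w))) @ w
    return np.sqrt(np.mean((y - t) ** 2))    # equals sqrt(2 E(w) / N)

x_tr, t_tr = make_data(10)
x_te, t_te = make_data(100)
for M in (0, 1, 3, 9):
    w = fit(x_tr, t_tr, M)
    print(M, e_rms(w, x_tr, t_tr), e_rms(w, x_te, t_te))
```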


Testing the Model

          M = 0     M = 1     M = 3         M = 9
w⋆_0       0.19      0.82      0.31          0.35
w⋆_1                -1.27      7.99        232.37
w⋆_2                         -25.43      -5321.83
w⋆_3                          17.37      48568.31
w⋆_4                                    -231639.30
w⋆_5                                     640042.26
w⋆_6                                   -1061800.52
w⋆_7                                    1042400.18
w⋆_8                                    -557682.99
w⋆_9                                     125201.43

Table: Coefficients w⋆ for polynomials of various order.


More Data

N = 15

[Figure: the polynomial fit with N = 15 data points over x ∈ [0, 1]]


More Data

N = 100
heuristic: have no fewer than 5 to 10 times as many data points as parameters
but the number of parameters is not necessarily the most appropriate measure of model complexity!
later: Bayesian approach

[Figure: the polynomial fit with N = 100 data points over x ∈ [0, 1]]


Regularisation

How can we constrain the growth of the coefficients w?
Add a regularisation term to the error function

$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$

Squared norm of the parameter vector w:

$$\|\mathbf{w}\|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$$

unique minimum of Ẽ(w) for argument w⋆ under certain conditions (what are they for λ = 0? for λ > 0?)
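The regularised error has a standard closed-form minimiser, $\mathbf{w}^\star = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$ with design matrix $\Phi_{nm} = x_n^m$. Below is a minimal sketch of it; whether w_0 is also penalised varies between treatments, and here the whole vector is penalised, matching the ‖w‖² above. The data setup and noise level are assumptions; ln λ = −18 follows the next slide.

```python
import numpy as np

def ridge_fit(x, t, M, lam):
    """Solve (lambda*I + Phi^T Phi) w = Phi^T t for the regularised fit."""
    Phi = np.power.outer(x, np.arange(M + 1))
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)
w_reg = ridge_fit(x, t, M=9, lam=np.exp(-18))   # ln(lambda) = -18
print(w_reg)
```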


Regularisation

M = 9

[Figure: the regularised M = 9 fit with ln λ = −18, plotted against the data over x ∈ [0, 1]]


Regularisation

M = 9

[Figure: the regularised M = 9 fit with ln λ = 0, plotted against the data over x ∈ [0, 1]]


Regularisation

M = 9

[Figure: training and test E_RMS as a function of ln λ, for ln λ between −35 and −20]


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.

Task: regression
Experience: x input examples, t output labels
Performance: squared error
Model choice
Regularisation
Do not train on the test set!


Probability Theory

[Figure: the joint distribution p(X, Y) over the values of X, shown separately for Y = 1 and Y = 2]


Probability Theory

Y vs. X    a   b   c   d   e   f   g   h   i   sum
      2    0   0   0   1   4   5   8   6   2    26
      1    3   6   8   8   5   3   1   0   0    34
    sum    3   6   8   9   9   8   9   6   2    60

[Figure: the joint distribution p(X, Y) corresponding to the table]


Sum Rule

Y vs. X    a   b   c   d   e   f   g   h   i   sum
      2    0   0   0   1   4   5   8   6   2    26
      1    3   6   8   8   5   3   1   0   0    34
    sum    3   6   8   9   9   8   9   6   2    60

$$p(X = d, Y = 1) = 8/60$$
$$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60$$
$$p(X = d) = \sum_{Y} p(X = d, Y)$$
$$p(X) = \sum_{Y} p(X, Y)$$


Sum Rule

Y vs. X    a   b   c   d   e   f   g   h   i   sum
      2    0   0   0   1   4   5   8   6   2    26
      1    3   6   8   8   5   3   1   0   0    34
    sum    3   6   8   9   9   8   9   6   2    60

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(Y) = \sum_{X} p(X, Y)$$

[Figure: the marginal distributions p(X) and p(Y) obtained from the table]


Product Rule

Y vs. X    a   b   c   d   e   f   g   h   i   sum
      2    0   0   0   1   4   5   8   6   2    26
      1    3   6   8   8   5   3   1   0   0    34
    sum    3   6   8   9   9   8   9   6   2    60

Conditional probability

$$p(X = d \mid Y = 1) = 8/34$$

Calculate p(Y = 1):

$$p(Y = 1) = \sum_{X} p(X, Y = 1) = 34/60$$

$$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1)$$
$$p(X, Y) = p(X \mid Y)\, p(Y)$$

Another intuitive view is renormalisation of relative frequencies:

$$p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$$
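The count table lends itself to a direct numerical check. A short NumPy sketch (not from the slides) applying the sum and product rules to the counts above:

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
joint = counts / counts.sum()        # p(X, Y); 60 observations in total

p_X = joint.sum(axis=0)              # sum rule: p(X) = sum_Y p(X, Y)
p_Y = joint.sum(axis=1)              # p(Y)
p_X_given_Y1 = joint[0] / p_Y[0]     # product rule: p(X | Y=1) = p(X, Y=1) / p(Y=1)

d = 3                                # column index of X = d
print(p_X[d])                        # 9/60 = 0.15
print(p_X_given_Y1[d])               # 8/34 ≈ 0.235
```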


Sum and Product Rules

Y vs. X    a   b   c   d   e   f   g   h   i   sum
      2    0   0   0   1   4   5   8   6   2    26
      1    3   6   8   8   5   3   1   0   0    34
    sum    3   6   8   9   9   8   9   6   2    60

$$p(X) = \sum_{Y} p(X, Y) \qquad\qquad p(X \mid Y) = \frac{p(X, Y)}{p(Y)}$$

[Figure: the marginal p(X) and the conditional p(X | Y = 1)]


Sum Rule and Product Rule

Sum Rule
$$p(X) = \sum_{Y} p(X, Y)$$

Product Rule
$$p(X, Y) = p(X \mid Y)\, p(Y)$$

These rules form the basis of Bayesian machine learning, and this course!


Bayes Theorem

Use the product rule

$$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$$

Bayes' Theorem

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

only defined for p(X) > 0, and

$$p(X) = \sum_{Y} p(X, Y) = \sum_{Y} p(X \mid Y)\, p(Y)$$

(sum rule, then product rule)
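A small sketch (not from the slides) checking Bayes' theorem on the same count table: p(Y = 1 | X = d) computed directly from the joint, and via p(X = d | Y = 1) p(Y = 1) / p(X = d).

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
joint = counts / counts.sum()
d = 3                                             # column of X = d

p_X_d = joint[:, d].sum()                         # sum rule
p_Y1 = joint[0].sum()
p_Xd_given_Y1 = joint[0, d] / p_Y1                # product rule

direct = joint[0, d] / p_X_d                      # p(Y=1 | X=d) from the joint
bayes = p_Xd_given_Y1 * p_Y1 / p_X_d              # via Bayes' theorem
print(direct, bayes)                              # both 8/9 ≈ 0.889
```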


Probability Densities

Real-valued variable x ∈ R
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.

$$p(x \in (a, b)) = \int_{a}^{b} p(x)\, dx$$

[Figure: a density p(x) and its cumulative distribution function P(x), with the probability mass of a small interval δx shaded]


Constraints on p(x)

Nonnegative
$$p(x) \ge 0$$

Normalisation
$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$


Cumulative distribution function P(x)

$$P(x) = \int_{-\infty}^{x} p(z)\, dz \qquad \text{or equivalently} \qquad \frac{d}{dx} P(x) = p(x)$$


Multivariate Probability Density

Vector $\mathbf{x} \equiv (x_1, \ldots, x_D)^T$

Nonnegative
$$p(\mathbf{x}) \ge 0$$

Normalisation
$$\int_{-\infty}^{\infty} p(\mathbf{x})\, d\mathbf{x} = 1$$

This means
$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\mathbf{x})\, dx_1 \ldots dx_D = 1$$


Sum and Product Rule for Probability Densities

Sum Rule
$$p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$$

Product Rule
$$p(x, y) = p(y \mid x)\, p(x)$$


Expectations

Weighted average of a function f(x) under the probability distribution p(x):

$$\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \qquad \text{(discrete distribution } p(x))$$
$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{(probability density } p(x))$$


How to approximate E [f ]

Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:

$$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

How do we draw points from a probability distribution p(x)? A later lecture covers sampling.
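A minimal sketch of this Monte Carlo approximation, assuming NumPy; here p(x) is taken to be a standard normal and f(x) = x², so the exact answer E[f] = 1 is known.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2                 # E[f] = 1 when p(x) is the standard normal

for N in (10, 1_000, 100_000):
    x = rng.standard_normal(N)       # x_n ~ p(x)
    print(N, np.mean(f(x)))          # (1/N) sum_n f(x_n), approaching 1
```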


Expectation of a function of several variables

arbitrary function f(x, y)

$$\mathbb{E}_x[f(x, y)] = \sum_{x} p(x)\, f(x, y) \qquad \text{(discrete distribution } p(x))$$
$$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, dx \qquad \text{(probability density } p(x))$$

Note that $\mathbb{E}_x[f(x, y)]$ is a function of y.


Conditional Expectation

arbitrary function f(x)

$$\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x) \qquad \text{(discrete distribution } p(x))$$
$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, dx \qquad \text{(probability density } p(x))$$

Note that $\mathbb{E}_x[f \mid y]$ is a function of y.
Other notation used in the literature: $\mathbb{E}_{x|y}[f]$.
What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it?
This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$. (Why?)

$$\mathbb{E}_y\!\left[\mathbb{E}_x[f(x) \mid y]\right] = \sum_{y} p(y)\, \mathbb{E}_x[f \mid y] = \sum_{y} p(y) \sum_{x} p(x \mid y)\, f(x) = \sum_{x,y} f(x)\, p(x, y) = \sum_{x} f(x)\, p(x) = \mathbb{E}_x[f(x)]$$
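This identity is easy to check numerically. A sketch (not from the slides) on the discrete count table from the earlier slides, with an arbitrary made-up f over the nine values of X:

```python
import numpy as np

counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])  # Y = 2
joint = counts / counts.sum()        # p(x, y); rows index y
f = np.arange(9.0) ** 2              # an arbitrary f(x) over the nine values of X

p_y = joint.sum(axis=1)
p_x_given_y = joint / p_y[:, None]   # p(x | y), one row per value of y
inner = p_x_given_y @ f              # E_x[f | y], one entry per value of y
print(p_y @ inner)                   # E_y[ E_x[f | y] ]
print(joint.sum(axis=0) @ f)         # E_x[f], the same number
```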


Variance

arbitrary function f(x)

$$\mathrm{var}[f] = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right] = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2$$

Special case: f(x) = x

$$\mathrm{var}[x] = \mathbb{E}\left[ (x - \mathbb{E}[x])^2 \right] = \mathbb{E}\left[ x^2 \right] - \mathbb{E}[x]^2$$


Covariance

Two random variables x ∈ R and y ∈ R

$$\mathrm{cov}[x, y] = \mathbb{E}_{x,y}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}_{x,y}[x\, y] - \mathbb{E}[x]\,\mathbb{E}[y]$$

With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$:

$$\begin{aligned}
\mathrm{cov}[x, y] &= \mathbb{E}_{x,y}[(x - a)(y - b)] \\
&= \mathbb{E}_{x,y}[x\, y] - \mathbb{E}_{x,y}[x\, b] - \mathbb{E}_{x,y}[a\, y] + \mathbb{E}_{x,y}[a\, b] \\
&= \mathbb{E}_{x,y}[x\, y] - b\, \underbrace{\mathbb{E}_{x,y}[x]}_{=\mathbb{E}_x[x]} - a\, \underbrace{\mathbb{E}_{x,y}[y]}_{=\mathbb{E}_y[y]} + a\, b\, \underbrace{\mathbb{E}_{x,y}[1]}_{=1} \\
&= \mathbb{E}_{x,y}[x\, y] - a\, b - a\, b + a\, b = \mathbb{E}_{x,y}[x\, y] - a\, b \\
&= \mathbb{E}_{x,y}[x\, y] - \mathbb{E}[x]\,\mathbb{E}[y]
\end{aligned}$$

Expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
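A quick sketch estimating cov[x, y] from samples in both equivalent forms; the particular dependence between x and y below (and hence the value 0.6) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(10_000)
y = 0.6 * x + 0.8 * rng.standard_normal(10_000)   # correlated with x; cov[x, y] = 0.6

cov1 = np.mean((x - x.mean()) * (y - y.mean()))   # E[(x - E[x])(y - E[y])]
cov2 = np.mean(x * y) - x.mean() * y.mean()       # E[xy] - E[x] E[y]
print(cov1, cov2)                                 # both close to 0.6
```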


Covariance for Vector Valued Variables

Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$

$$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T]) \right] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \mathbf{x}\, \mathbf{y}^T \right] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^T]$$


The Gaussian Distribution

x ∈ R
Gaussian distribution with mean µ and variance σ²:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$$

[Figure: the Gaussian density N(x | µ, σ²), centred at the mean µ]


The Gaussian Distribution

$\mathcal{N}(x \mid \mu, \sigma^2) > 0$

$$\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, dx = 1$$

Expectation over x
$$\mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x\, dx = \mu$$

Expectation over x²
$$\mathbb{E}\left[ x^2 \right] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2$$

Variance of x
$$\mathrm{var}[x] = \mathbb{E}\left[ x^2 \right] - \mathbb{E}[x]^2 = \sigma^2$$
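These moments can be checked by sampling. A sketch assuming NumPy, with illustrative values of µ and σ:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, 1_000_000)   # samples from N(x | mu, sigma^2)

print(x.mean())                        # approximately mu
print(np.mean(x ** 2))                 # approximately mu^2 + sigma^2 = 2.74
print(x.var())                         # approximately sigma^2 = 0.49
```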


Strategy in this course

Estimate the best predictor = training = learning
Given data $(x_1, y_1), \ldots, (x_n, y_n)$, find a predictor $f_{\mathbf{w}}(\cdot)$.

1 Identify the type of input x and output y data
2 Propose a (linear) mathematical model for $f_{\mathbf{w}}$
3 Design an objective function or likelihood
4 Calculate the optimal parameters w
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm, in Python)
7 Interpret and diagnose results

Overview
Administration
Examples
What is common to these examples?
Definition
Notions of Learning
Python
Mathematics for Machine Learning
Human Learning