CS计算机代考程序代写 python database chain Bayesian algorithm Microsoft PowerPoint - 01

Microsoft PowerPoint – 01

Lecturer: Ben Rubinstein

Lecture 1. StatML Welcome
and Maths Review

COMP90051 Statistical Machine Learning

This lecture

• About COMP90051

• Review: Probability theory

• Review: Linear algebra

• Review: Sequences and limits

COMP90051 Statistical Machine Learning

Subject objectives

• Develop an appreciation for the role of statistical ML,
advanced foundations and applications

• Gain an understanding of a representative selection
of ML techniques – how ML works

• Be able to design, implement and evaluate ML
systems

• Become a discerning ML consumer

COMP90051 Statistical Machine Learning

Subject content

• The subject will cover topics from

Foundations of statistical learning, linear models, non‐linear
bases, regularised linear regression, generalisation theory,
kernel methods, deep neural nets, multi‐armed bandits,
Bayesian learning, probabilistic models

• Theory in lectures; hands‐on experience with range
of toolkits in workshop pracs and projects

• vs COMP90049: much depth, much rigor, so wow

COMP90051 Statistical Machine Learning

Subject staff / Contact hours

Contacting
staff

Piazza first; then combined staff email
comp90051‐2021s2‐ .edu.au

Lecturer &
Coordinator

Ben Rubinstein
Prof, Computing & Information Systems
Associate Dean (Research), Faculty of Engineering & IT
Statistical Machine Learning, ML + Privacy/Security/Databases

Lecturer Qiuhong Ke
Lecturer, Computing & Information Systems
Computer Vision, ML, Deep Learning

Tutors: Jun Wang (Head Tutor)
Shijie Liu, Jiangrong Ouyang, Justin Tan.
See Canvas for latest list and contact details.

Zoom Contact: Weekly, please attend: 2nd Lecture (live discussion), 1 Workshop

Pre‐recorded
Lectures:

Posted to Canvas for you to view safely at home.
Strongly recommend that you keep up, weekly. (viz. quizzes)

COMP90051 Statistical Machine Learning

About me (Ben)

• PhD 2010 – Berkeley, USA

• 4 years in industry research
 Silicon Valley: Google Research, Yahoo! Research, Intel Labs,

Microsoft Research
 Australia: IBM Research
 Patented & Published, Developed & Tested, Recruited

• Impact: Xbox, Bing (MS), Firefox (Mozilla), Kaggle, ABS,
Medicare and Myki data privacy

• Interests: machine learning theory; adversarial ML;
differential privacy; statistical record linkage

COMP90051 Statistical Machine Learning

AdvancedML: Expected Background

• Why a challenge: Diverse math + CS + coding

• ML: COMP90049 either 2020s1 “new” or earlier

• Alg & complexity: big‐oh, termination; basic data
structures & algorithms; solid coding ideally
experience in Python

…and more…

COMP90051 Statistical Machine Learning

Advanced ML: Needed Background
…and more…

• Maths: Review next videos, but ideally seen most before
“Matrix A is symmetric & positive definite, hence its eigenvalues…”

• Probability theory: probability calculus; discrete/continuous
distributions; multivariate; exponential families; Bayes rule

• Sequences: sequences, limits, supremum

• Linear algebra: vector inner products & norms; orthonormal
bases; matrix operations, inverses, eigenvectors/values

• Calculus & optimisation: partial derivatives; gradient descent;
convexity; Lagrange multipliers

COMP90051 Statistical Machine Learning

Textbooks
• We don’t have only one reference. We prefer to pick good bits

from several. We may also supplement with other readings as
we go.

• All are available free online or through the library digitally. See
the Canvas lecture outline for links. Therefore, no need to buy.

• Primarily we refer to (good all rounder): Bishop (2007) Pattern
Recognition and Machine Learning

• Practical Deep Nets: Chollet (2017) Deep learning with Python

• More deep learning detail: Goodfellow, Bengio, Courville (2016)
Deep learning

• For more on PGMs/Bayesian inference: Murphy (2012) Machine
Learning: A Probabilistic Perspective

• For reference on frequentist ideas, SVMs, lasso, etc.:
Hastie, Tibshirani, Friedman (2001) The Elements of Statistical
Learning: Data Mining, Inference and Prediction

COMP90051 Statistical Machine Learning

Assessment

• Assessment components
 Two projects – one group (w4‐7), one individual (w8‐10)

• Each (25%)
• Each has ~3 weeks to complete
• Note: Proj2 timing differs from handbook due to non‐teaching week

 Quizzes 15min fortnightly (10% ‐ best 5 of 6)
 Final Exam (40%)

• 50% hurdles applied to
both exam and combined project

COMP90051 Statistical Machine Learning

Probability theory

COMP90051 Statistical Machine Learning

Data is noisy (almost always)

• Example:
 given mark for Intro ML (IML)
 predict mark for Stat Machine
Learning (SML)

IML mark

SM
L
m
ar
k

* synthetic data 🙂

Training
data*

COMP90051 Statistical Machine Learning

Types of models

𝑦 𝑓 𝑥

IntroML mark was 95,
SML mark is predicted

to be 95

𝑃 𝑦 𝑥

IntroML mark was 95,
SML mark is likely to

be in (92, 97)

𝑃 𝑥, 𝑦

probability of having
(𝐼𝑀𝐿 𝑥, 𝑆𝑀𝐿 𝑦)

𝑥

COMP90051 Statistical Machine Learning

Basics of probability theory

• A probability space:
 Set  of possible
outcomes

 Set F of events
(subsets of outcomes)

 Probability measure
P: F R

• Example: a die roll
 {1, 2, 3, 4, 5, 6}

 { , {1}, …, {6}, {1,2}, …,
{5,6}, …, {1,2,3,4,5,6} }

 P()=0, P({1})=1/6,
P({1,2})=1/3, …

COMP90051 Statistical Machine Learning

Axioms of probability*

contains all of: ; all complements , ;
the union of any countable set of events in .

2. for every event .

3. for all countable sets of
pairwise disjoint events.

* We won’t delve further into advanced probability theory, which starts with measure
theory – a beautiful subject and the only way to “fully” formulate probability.

COMP90051 Statistical Machine Learning

Random variables (r.v.’s)

• A random variable X is a
numeric function of
outcome

• denotes the
probability of the
outcome being such that
X falls in the range A

• Example: X winnings on
$5 bet on even die roll
 Xmaps 1,3,5 to ‐5
Xmaps 2,4,6 to 5

 P(X=5) = P(X=‐5) = ½

COMP90051 Statistical Machine Learning

Discrete vs. continuous distributions

• Discrete distributions
 Govern r.v. taking discrete
values

 Described by probability
mass function p(x) which is
P(X=x)

 𝑃 𝑋 𝑥 ∑ 𝑝 𝑎
 Examples: Bernoulli,
Binomial, Multinomial,
Poisson

• Continuous distributions
 Govern real‐valued r.v.

 Cannot talk about PMF but
rather probability density
function p(x)

 𝑃 𝑋 𝑥 𝑝 𝑎 𝑑𝑎
 Examples: Uniform,
Normal, Laplace, Gamma,
Beta, Dirichlet

COMP90051 Statistical Machine Learning

-4 -2 0 2 4

0.
0

0.
1

0.
2

0.
3

0.
4

p(
x)

Expectation

• Expectation E[X] is the r.v. X’s “average” value
 Discrete: 𝐸 𝑋 ∑ 𝑥 𝑃 𝑋 𝑥

 Continuous: 𝐸 𝑋 𝑥 𝑝 𝑥 𝑑𝑥

• Properties
 Linear: 𝐸 𝑎𝑋 𝑏 𝑎𝐸 𝑋 𝑏

𝐸 𝑋 𝑌 𝐸 𝑋 𝐸 𝑌
 Monotone: 𝑋 𝑌 ⇒ 𝐸 𝑋 𝐸 𝑌

• Variance:

COMP90051 Statistical Machine Learning

Multivariate distributions

• Specify joint distribution over multiple variables

• Probabilities are computed as in univariate case, we now just
have repeated summations or repeated integrals

• Discrete: 𝑃 𝑋,𝑌 ∈ 𝐴 ∑ 𝑝 𝑥,𝑦, ∈

• Continuous: 𝑃 𝑋,𝑌 ∈ 𝐴 𝑝 𝑥,𝑦 𝑑𝑥𝑑𝑦

CCA‐3.0 wikipedia

COMP90051 Statistical Machine Learning

Independence and conditioning

• X, Y are independent if
 𝑃 𝑋 ∈ 𝐴,𝑌 ∈ 𝐵
𝑃 𝑋 ∈ 𝐴 𝑃 𝑌 ∈ 𝐵

 Similarly for densities:
𝑝 , 𝑥,𝑦 𝑝 𝑥 𝑝 𝑦

 Intuitively: knowing value of
Y reveals nothing about X

 Algebraically: the joint on X,Y
factorises!

• Conditional probability

 𝑃 𝐴 𝐵 ∩

 Similarly for densities
𝑝 𝑦 𝑥 ,

 Intuitively: probability event
A will occur given we know
event B has occurred

 X,Y independent equiv to
𝑃 𝑌 𝑦 𝑋 𝑥 𝑃 𝑌 𝑦

COMP90051 Statistical Machine Learning

Inverting conditioning: Bayes’ Theorem

• In terms of events A, B
 𝑃 𝐴 ∩ 𝐵 𝑃 𝐴 𝐵 𝑃 𝐵 𝑃 𝐵 𝐴 𝑃 𝐴

 𝑃 𝐴 𝐵 |

• Simple rule that lets us swap conditioning order

• Probabilistic and Bayesian inference make heavy use
 Marginals: probabilities of individual variables
 Marginalisation: summing away all but r.v.’s of interest
𝑃 𝐴 ∑ 𝑃 𝐴,𝐵 𝑏

Bayes

COMP90051 Statistical Machine Learning

Mini Summary

• Probability spaces, axioms of probability

• Discrete vs continuous; Univariate vs multivariate

• Expectation, Variance

• Independence and conditioning

• Bayes rule and marginalisation

Next: Linear algebra primer/review

COMP90051 Statistical Machine Learning

Vectors

Link between geometric and algebraic
interpretation of ML methods

COMP90051 Statistical Machine Learning

What are vectors?
Suppose . What does really
represent?

Ordered set of numbers

Cartesian coordinates of a point

A direction

𝑢

𝑢
0

art: OpenClipartVectors at
pixabay.com (CC0)

COMP90051 Statistical Machine Learning

Dot product: Algebraic definition

• Given two ‐dimensional vectors and , their dot
product is
 E.g., weighted sum of terms is a dot product 𝒙 𝒘

• If is a scalar, are vectors then

COMP90051 Statistical Machine Learning

Dot product: Geometric definition
• Given two ‐dimensional Euclidean vectors and ,
their dot product is
 𝒖 , 𝒗 are L2 norms for 𝒖,𝒗 also written as 𝒖 𝟐
 𝜃 is the angle between the vectors

𝒖

𝒗

𝜃

The scalar projection of
𝒖 onto 𝒗 is given by

𝑢𝒗 𝒖 cos 𝜃

Thus dot product is
𝒖 𝒗 𝑢𝒗 𝒗 𝑣𝒖 𝒖

COMP90051 Statistical Machine Learning

Geometric properties of the dot product
• If the two vectors are orthogonal then

• If the two vectors are parallel then , if
they are anti‐parallel then

• , so defines the
Euclidean vector length

𝒖

𝒗

𝜃

COMP90051 Statistical Machine Learning

Hyperplanes and normal vectors
• A hyperplane defined by parameters and is a set
of points that satisfy

• In 2D, a hyperplane is a line: a line is a set of points
that satisfy

𝑥

𝑥 A normal vector for a
hyperplane is a vector
perpendicular to that
hyperplane

COMP90051 Statistical Machine Learning

Hyperplanes and normal vectors
• Consider a hyperplane defined by parameters and

. Note that is itself a vector

• Lemma: Vector is normal to the hyperplane

• Proof sketch:
 Choose any two points 𝒖 and 𝒗 on the hyperplane. Note
that vector 𝒖 𝒗 lies parallel to the hyperplane

 Consider dot product 𝒖 𝒗 𝒘 𝒖 𝒘 𝒗 𝒘
𝒖 𝒘 𝑏 𝒗 𝒘 𝑏 0

 Thus 𝒖 𝒗 is parallel to the hyperplane, but is
perpendicular to 𝒘, and so 𝒘 is a vector normal

COMP90051 Statistical Machine Learning

Example in 2D
• Consider a line defined by , and

• Vector is a normal vector

𝑤

𝑤
𝑥

𝑤
𝑤

𝑥
𝑏
𝑤

𝑥

COMP90051 Statistical Machine Learning

• Throughout the subject we will often encounter norms
that are functions of a particular form
 Intuitively, norms measure lengths of vectors in some sense
 Often component of objectives or stopping criteria in

optimisation‐for‐ML

• More specifically, we will often use the L2 norm (aka
Euclidean distance)

• And also the L1 norm (aka absolute norm or Manhattan
distance)

L1 and L2 norms

COMP90051 Statistical Machine Learning

Vector Spaces and Bases

Useful in interpreting matrices and some
algorithms like PCA

COMP90051 Statistical Machine Learning

Linear combinations, Independence

• For formal definition of vector spaces:
https://en.wikipedia.org/wiki/Vector_space#Definition

• A linear combination of vectors some
vector space, is a new vector for some scalars

• A set of vectors is called linearly
dependent if one element can be written as a linear
combination of the other elements

• A set that isn’t linearly dependent is linearly independent

COMP90051 Statistical Machine Learning

Spans, Bases

• The span of vectors is the set of all
obtainable linear combinations (ranging over all scalar
coefficients) of the vectors

• A set of vectors is called a basis for a
vector subspace if
1. The set is linearly independent; and
2. Every v ∈ 𝑉′ is a linear combination of the set.

• An orthonormal basis is a basis in which each
1. Pair of basis vectors are orthogonal (zero dot prod); and
2. Basis vector has norm equal to 1.

COMP90051 Statistical Machine Learning

Matrices

Some useful facts for ML

COMP90051 Statistical Machine Learning

Basic matrices

• See more: https://en.wikipedia.org/wiki/Matrix_(mathematics)
 Including matrix‐matrix and matrix‐vector products

• A rectangular array, often denoted by upper‐case, with two indices
first for row, second for column

• Square matrix has equal dimensions (numbers of rows and
columns)

• Matrix transpose A’ or AT of m by nmatrix A is an n by mmatrix
with entries A’ij=Aji

• A square matrix A with A=A’ is called symmetric

• The (square) identity matrix I has 1 on the diagonal, 0 off‐diagonal

• Matrix inverse A‐1 of square matrix A (if it exists) satisfies A‐1A=I

COMP90051 Statistical Machine Learning

Matrix eigenspectrum

• Scalar, vector pair are called an eigenvalue‐
eigenvector pair of a square matrix A if
 Intuition: matrix A doesn’t rotate v it just stretches it
 Intuition: the eigenvalue represents stretching factor

• In general eigenvalues may be zero, negative or even
complex (imaginary) – we’ll only encounter reals

COMP90051 Statistical Machine Learning

Spectra of common matrices

• Eigenvalues of symmetric matrices are always real
(no imaginary component)

• A matrix with linear dependent columns has some
zero eigenvalues (called rank deficient)  no matrix
inverse exists

COMP90051 Statistical Machine Learning

Positive (semi)definite matrices

• A symmetric square matrix A is called positive
semidefinite if for all vectors v we have .
 Then A has non‐negative eigenvalues
 For example, any 𝐀 𝐗 𝐗 since: 𝐯 𝐗 𝐗𝐯 𝐗𝐯 0

• Further if holds as a strict inequality then
A is called positive definite
 Then A has (strictly) positive eigenvalues

COMP90051 Statistical Machine Learning

Mini Summary

• Vectors: Vector spaces, dot products, independence,
hyperplanes

• Matrices: Eigenvalues, positive semidefinite matrices

Next: Sequences and limits review/primer

COMP90051 Statistical Machine Learning

Sequences and Limits

Sequences arise whenever we have
iterations (e.g. training loops, growing
data sample size). Limits tell us about

where sequences tend towards.

COMP90051 Statistical Machine Learning

Infinite Sequences

• Written like or ∈ℕ
• Formally: a function from the
positive (from 1) or non‐negative
(from 0) integers

• Index set: subscript set e.g.

• Sequences allow us to reason
about test error when training
data grows indefinitely, or training error (or a
stopping criterion) when training runs arbitrarily long

Wikipedia public domain

COMP90051 Statistical Machine Learning

Limits and Convergence

• A sequence ∈ℕ converges if its elements become and
remain arbitrarily close to a fixed limit point .

• Formally: if, for all , there exists
such that for all we have

Notes:
• Epsilon 𝜀 represents distance of

sequence to limit point
• Distance can be arbitrarily small
• Definition says we eventually get

that close (at some finite 𝑁) and we
stay at least that close for ever more

Wikipedia public domain

COMP90051 Statistical Machine Learning

Supremum

Generalising the maximum: When a
sequence never quite peaks.

COMP90051 Statistical Machine Learning

When does the Maximum Exist?

• Can you always take a max of a set?

• Finite sets: what’s the max of {1, 7, 3, 2, 9}?

• Closed, bounded intervals: what’s the max of [0,1]?

• Open, bounded intervals: what’s the max of [0,1)?

• Open, unbounded intervals: what’s the max of [0,∞)?

COMP90051 Statistical Machine Learning

What about “Least Upper Bound”?

• Can you always take a least‐upper‐bound of a set? (much more often!)

• Finite sets: what’s the max of {1, 7, 3, 2, 9}?

max=9 LUB=9

• Closed, bounded intervals: what’s the max of [0,1]?

max=1 LUB=1

• Open, bounded intervals: what’s the max of [0,1)?

max=N/A LUB=1

• Open, unbounded intervals: what’s the max of [0,∞)?

max=N/A LUB=∞

COMP90051 Statistical Machine Learning

The Supremum

• Consider any subset 𝑆 of the reals

• Upper bound 𝑢 ∈ ℝ of set 𝑆 has: 𝑢 𝑥 for all 𝑥 ∈ 𝑆

• If 𝑢 is no bigger than any other upper bound of 𝑆
then it’s called a least upper bound or supremum
of 𝑆, written as sup 𝑆 and pronounced “soup”:
 𝑧 𝑢 for all upper bounds 𝑧 ∈ ℝ of 𝑆

• When we don’t know, or can’t guarantee, that a set or
sequence has a max, it is better to use its sup

47Wikipedia public domain

FreeSVG public domain

willoweit.
Typewritten Text
上确界

COMP90051 Statistical Machine Learning

Infimum

• The greatest lower bound or infimum is
generalisation of the minimum

• Written inf(S) pronounced “inf”

• Useful if we’re minimising training error but don’t
know if the minimum is ever attained.

COMP90051 Statistical Machine Learning

Stochastic Convergence

When random events or quantities can
sometimes be expected to converge (e.g.
test error likely drops to a minimal value)

COMP90051 Statistical Machine Learning

Why Simple Limits Aren’t Enough

• Consider running your favourite learner on varying
numbers of training examples giving classifier

• If your learner minimises training error, you’d wish its
test error wasn’t much bigger than its training error

• If , you’d wish for
as this would mean eventually tiny test error

• But both training data and test data are random!

• Even if usually happens, it won’t always!!

COMP90051 Statistical Machine Learning

Stochastic Convergence

• A sequence 𝑋 of random variables (CDFs 𝐹 )
converges in distribution to random variable 𝑋 (CDF 𝐹) if
𝐹 𝑥 → 𝐹 𝑥 for all constants 𝑥

• A sequence 𝑋 of random variables converges in probability
to random variable 𝑋 if for all 𝜀 0: Pr 𝑋 𝑋 𝜀 → 0

• A sequence 𝑋 of random variables converges almost surely
to random variable 𝑋 if:Pr 𝑋 → 𝑋 1

• Chain of implications:

almost sure (strongest) ⟹ in probability ⟹ in distribution (weakest)

COMP90051 Statistical Machine Learning

But don’t worry…

• We’re not going to do any
calculations with stochastic
convergence

• Close understanding of it won’t be
necessary in this subject

• But it’s good to be aware that its
“out there” and we may refer to it
(v briefly) within StatML theory

CCA4.0 Vincent Le Moign

COMP90051 Statistical Machine Learning

Mini Summary

• Sequences

• Limits of sequences

• Supremum is the new maximum

• Stochastic convergence

Next time: L02 Statistical schools

Homework week #1: Watch all week 1 recordings.
Jupyter notebooks setup and launch (at home)

Related Posts