Lecturer: Ben Rubinstein
Lecture 1. StatML Welcome
and Maths Review
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
COMP90051 Statistical Machine Learning
This lecture
• About COMP90051
• Review: Probability theory
• Review: Linear algebra
• Review: Sequences and limits
2
COMP90051 Statistical Machine Learning
Subject objectives
• Develop an appreciation for the role of statistical ML,
advanced foundations and applications
• Gain an understanding of a representative selection
of ML techniques – how ML works
• Be able to design, implement and evaluate ML
systems
• Become a discerning ML consumer
3
COMP90051 Statistical Machine Learning
Subject content
• The subject will cover topics from
Foundations of statistical learning, linear models, non‐linear
bases, regularised linear regression, generalisation theory,
kernel methods, deep neural nets, multi‐armed bandits,
Bayesian learning, probabilistic models
• Theory in lectures; hands‐on experience with range
of toolkits in workshop pracs and projects
• vs COMP90049: much depth, much rigor, so wow
4
COMP90051 Statistical Machine Learning
Subject staff / Contact hours
5
Contacting staff: Piazza first; then combined staff email comp90051‐2021s2‐ .edu.au
Lecturer & Coordinator: Ben Rubinstein
Prof, Computing & Information Systems
Associate Dean (Research), Faculty of Engineering & IT
Statistical Machine Learning, ML + Privacy/Security/Databases
Lecturer: Qiuhong Ke
Lecturer, Computing & Information Systems
Computer Vision, ML, Deep Learning
Tutors: Jun Wang (Head Tutor), Shijie Liu, Jiangrong Ouyang, Justin Tan.
See Canvas for latest list and contact details.
Zoom contact: Weekly, please attend: 2nd Lecture (live discussion), 1 Workshop
Pre‐recorded lectures: Posted to Canvas for you to view safely at home.
Strongly recommend that you keep up, weekly. (viz. quizzes)
COMP90051 Statistical Machine Learning
About me (Ben)
6
• PhD 2010 – Berkeley, USA
• 4 years in industry research
Silicon Valley: Google Research, Yahoo! Research, Intel Labs,
Microsoft Research
Australia: IBM Research
Patented & Published, Developed & Tested, Recruited
• Impact: Xbox, Bing (MS), Firefox (Mozilla), Kaggle, ABS,
Medicare and Myki data privacy
• Interests: machine learning theory; adversarial ML;
differential privacy; statistical record linkage
COMP90051 Statistical Machine Learning
Advanced ML: Expected Background
• Why a challenge: Diverse math + CS + coding
• ML: COMP90049 either 2020s1 “new” or earlier
• Alg & complexity: big‐oh, termination; basic data
structures & algorithms; solid coding ideally
experience in Python
…and more…
7
COMP90051 Statistical Machine Learning
Advanced ML: Needed Background
…and more…
• Maths: Review next videos, but ideally seen most before
“Matrix A is symmetric & positive definite, hence its eigenvalues…”
• Probability theory: probability calculus; discrete/continuous
distributions; multivariate; exponential families; Bayes rule
• Sequences: sequences, limits, supremum
• Linear algebra: vector inner products & norms; orthonormal
bases; matrix operations, inverses, eigenvectors/values
• Calculus & optimisation: partial derivatives; gradient descent;
convexity; Lagrange multipliers
8
COMP90051 Statistical Machine Learning
Textbooks
• We don’t have only one reference. We prefer to pick good bits
from several. We may also supplement with other readings as
we go.
• All are available free online or through the library digitally. See
the Canvas lecture outline for links. Therefore, no need to buy.
• Primarily we refer to (good all rounder): Bishop (2007) Pattern
Recognition and Machine Learning
• Practical Deep Nets: Chollet (2017) Deep learning with Python
• More deep learning detail: Goodfellow, Bengio, Courville (2016)
Deep learning
• For more on PGMs/Bayesian inference: Murphy (2012) Machine
Learning: A Probabilistic Perspective
• For reference on frequentist ideas, SVMs, lasso, etc.:
Hastie, Tibshirani, Friedman (2001) The Elements of Statistical
Learning: Data Mining, Inference and Prediction
9
COMP90051 Statistical Machine Learning
Assessment
• Assessment components
Two projects – one group (w4‐7), one individual (w8‐10)
• Each worth 25%
• Each has ~3 weeks to complete
• Note: Proj2 timing differs from handbook due to non‐teaching week
Quizzes: 15 min fortnightly (10% – best 5 of 6)
Final Exam (40%)
• 50% hurdles applied to
both exam and combined project
10
COMP90051 Statistical Machine Learning
Probability theory
11
COMP90051 Statistical Machine Learning
Data is noisy (almost always)
12
• Example:
given mark for Intro ML (IML)
predict mark for Stat Machine
Learning (SML)
[Scatter plot: training data* of SML mark (vertical axis) vs IML mark (horizontal axis); * synthetic data 🙂]
COMP90051 Statistical Machine Learning
Types of models
13
𝑦 = 𝑓(𝑥): IntroML mark was 95, SML mark is predicted to be 95
𝑃(𝑦|𝑥): IntroML mark was 95, SML mark is likely to be in (92, 97)
𝑃(𝑥, 𝑦): probability of having (IML = 𝑥, SML = 𝑦)
COMP90051 Statistical Machine Learning
Basics of probability theory
• A probability space:
Set Ω of possible outcomes
Set F of events (subsets of outcomes)
Probability measure P: F → ℝ
• Example: a die roll
Ω = {1, 2, 3, 4, 5, 6}
F = { ∅, {1}, …, {6}, {1,2}, …, {5,6}, …, {1,2,3,4,5,6} }
P(∅) = 0, P({1}) = 1/6, P({1,2}) = 1/3, …
14
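The die-roll example above can be written out directly. Below is a minimal Python sketch (not from the slides; the names omega, events and P are illustrative) that enumerates the event set F as all subsets of Ω and defines the probability measure P as a function on events.

```python
from itertools import combinations

# Outcome set for a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}

# Event set F: all subsets of omega (the power set), including the empty set
events = [set(c) for r in range(len(omega) + 1)
          for c in combinations(sorted(omega), r)]

def P(event):
    """Probability measure: uniform over the 6 outcomes."""
    return len(event) / len(omega)

print(P(set()))        # 0.0         (empty event)
print(P({1}))          # 0.1666...   (1/6)
print(P({1, 2}))       # 0.3333...   (1/3)
print(P(omega))        # 1.0         (axiom 4: P(Omega) = 1)
```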
COMP90051 Statistical Machine Learning
Axioms of probability*
1. F contains all of: Ω; all complements A̅ (for A in F); the union of any countable set of events in F.
2. P(A) ≥ 0 for every event A in F.
3. P(⋃ⱼ Aⱼ) = Σⱼ P(Aⱼ) for all countable sets of pairwise disjoint events Aⱼ.
4. P(Ω) = 1.
15
* We won’t delve further into advanced probability theory, which starts with measure
theory – a beautiful subject and the only way to “fully” formulate probability.
COMP90051 Statistical Machine Learning
Random variables (r.v.’s)
• A random variable X is a numeric function X(ω) of the outcome ω
• P(X ∈ A) denotes the probability of the outcome being such that X falls in the range A
• Example: X is the winnings on a $5 bet on an even die roll
X maps 1, 3, 5 to −5
X maps 2, 4, 6 to 5
P(X = 5) = P(X = −5) = ½
16
COMP90051 Statistical Machine Learning
Discrete vs. continuous distributions
• Discrete distributions
Govern r.v. taking discrete values
Described by probability mass function p(x) which is P(X = x)
P(X ≤ x) = Σ_{a ≤ x} p(a)
Examples: Bernoulli, Binomial, Multinomial, Poisson
• Continuous distributions
Govern real‐valued r.v.
Cannot talk about PMF but rather probability density function p(x)
P(X ≤ x) = ∫_{−∞}^x p(a) da
Examples: Uniform, Normal, Laplace, Gamma, Beta, Dirichlet
17
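To make the PMF/PDF distinction concrete, here is a small sketch using scipy.stats (assuming SciPy is available; the chosen distributions are just two of the examples named above). Discrete distributions expose a pmf, continuous ones a pdf, and both expose a cdf corresponding to the sum/integral above.

```python
from scipy.stats import binom, norm

# Discrete: Binomial(n=10, p=0.3) has a probability MASS function
print(binom.pmf(3, n=10, p=0.3))   # P(X = 3)
print(binom.cdf(3, n=10, p=0.3))   # P(X <= 3) = sum of pmf over a <= 3

# Continuous: standard Normal has a probability DENSITY function
print(norm.pdf(0.5))               # density at x = 0.5, NOT a probability
print(norm.cdf(0.5))               # P(X <= 0.5) = integral of pdf up to 0.5
```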
COMP90051 Statistical Machine Learning
[Plot: a probability density function p(x) over x]
Expectation
• Expectation E[X] is the r.v. X's "average" value
Discrete: E[X] = Σ_x x P(X = x)
Continuous: E[X] = ∫ x p(x) dx
• Properties
Linear: E[aX + b] = aE[X] + b
E[X + Y] = E[X] + E[Y]
Monotone: X ≤ Y ⇒ E[X] ≤ E[Y]
• Variance: Var(X) = E[(X − E[X])²]
18
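As a check on these formulas, the sketch below (illustrative only) computes E[X] and Var(X) exactly for the earlier $5-bet random variable, then compares against a Monte Carlo estimate from simulated die rolls.

```python
import numpy as np

# X: winnings on a $5 bet that a fair die comes up even (-5 for odd, +5 for even)
values = np.array([-5.0, 5.0])
probs = np.array([0.5, 0.5])

E_X = np.sum(values * probs)                 # E[X] = sum_x x P(X = x) = 0
Var_X = np.sum((values - E_X) ** 2 * probs)  # Var(X) = E[(X - E[X])^2] = 25
print(E_X, Var_X)

# Monte Carlo check: simulate many die rolls and average the winnings
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)
winnings = np.where(rolls % 2 == 0, 5.0, -5.0)
print(winnings.mean(), winnings.var())       # close to 0 and 25
```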
COMP90051 Statistical Machine Learning
Multivariate distributions
• Specify joint distribution over multiple variables
• Probabilities are computed as in the univariate case; we now just have repeated summations or repeated integrals
• Discrete: P((X, Y) ∈ A) = Σ_{(x,y) ∈ A} p(x, y)
• Continuous: P((X, Y) ∈ A) = ∫∫_A p(x, y) dx dy
19
CCA‐3.0 wikipedia
COMP90051 Statistical Machine Learning
Independence and conditioning
• X, Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
Similarly for densities:
p_{X,Y}(x, y) = p_X(x) p_Y(y)
Intuitively: knowing the value of Y reveals nothing about X
Algebraically: the joint on X, Y factorises!
• Conditional probability
P(A|B) = P(A ∩ B) / P(B)
Similarly for densities: p(y|x) = p(x, y) / p(x)
Intuitively: probability event A will occur given we know event B has occurred
X, Y independent is equivalent to P(Y = y | X = x) = P(Y = y)
20
COMP90051 Statistical Machine Learning
Inverting conditioning: Bayes’ Theorem
• In terms of events A, B
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
P(A|B) = P(B|A) P(A) / P(B)
• Simple rule that lets us swap conditioning order
• Probabilistic and Bayesian inference make heavy use of
Marginals: probabilities of individual variables
Marginalisation: summing away all but the r.v.'s of interest
P(A) = Σ_b P(A, B = b)
21
[Portrait: Thomas Bayes]
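The sketch below (a made-up binary two-variable example, not from the slides) applies marginalisation and Bayes' theorem to a tiny joint table P(A, B).

```python
import numpy as np

# Joint distribution P(A, B) for binary A (rows) and B (columns); entries sum to 1
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Marginalisation: sum away the variable we don't care about
P_A = joint.sum(axis=1)          # P(A) = sum_b P(A, B = b)
P_B = joint.sum(axis=0)          # P(B) = sum_a P(A = a, B)

# Conditioning and Bayes' theorem
P_B_given_A = joint / P_A[:, None]               # P(B|A) = P(A, B) / P(A)
P_A_given_B = P_B_given_A * P_A[:, None] / P_B   # P(A|B) = P(B|A) P(A) / P(B)

print(P_A_given_B)
print(joint / P_B)               # same result, directly from P(A, B) / P(B)
```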
COMP90051 Statistical Machine Learning
Mini Summary
• Probability spaces, axioms of probability
• Discrete vs continuous; Univariate vs multivariate
• Expectation, Variance
• Independence and conditioning
• Bayes rule and marginalisation
Next: Linear algebra primer/review
22
COMP90051 Statistical Machine Learning
Vectors
Link between geometric and algebraic
interpretation of ML methods
23
COMP90051 Statistical Machine Learning
What are vectors?
Suppose 𝒖 = [u₁, u₂]′. What does 𝒖 really represent?
Ordered set of numbers
Cartesian coordinates of a point
A direction
24
[Figure: the point (u₁, u₂) in the plane, drawn both as a point and as an arrow from the origin 0]
art: OpenClipartVectors at
pixabay.com (CC0)
COMP90051 Statistical Machine Learning
Dot product: Algebraic definition
• Given two m‐dimensional vectors 𝒖 and 𝒗, their dot product is 𝒖 · 𝒗 ≡ 𝒖′𝒗 ≡ Σⱼ uⱼvⱼ
E.g., a weighted sum of terms is a dot product 𝒙′𝒘
• If k is a scalar and 𝒂, 𝒃, 𝒄 are vectors, then
(k𝒂)′𝒃 = k(𝒂′𝒃) = 𝒂′(k𝒃)
𝒂′(𝒃 + 𝒄) = 𝒂′𝒃 + 𝒂′𝒄
25
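Here is a quick numpy sketch (illustrative values only) of the algebraic dot product, plus a preview of the geometric view on the next slide: recovering the angle θ from 𝒖 · 𝒗 = ‖𝒖‖ ‖𝒗‖ cos θ.

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

dot = u @ v                      # algebraic: sum_j u_j v_j
print(dot, np.sum(u * v))        # identical

# Geometric view: u.v = ||u|| ||v|| cos(theta)
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
print(np.arccos(cos_theta))      # angle between u and v, in radians

# Weighted-sum example: a prediction x'w (hypothetical bias term and weights)
x = np.array([1.0, 95.0])        # e.g. constant 1 and an IML mark
w = np.array([10.0, 0.9])
print(x @ w)                     # weighted sum of terms is a dot product
```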
COMP90051 Statistical Machine Learning
Dot product: Geometric definition
• Given two m‐dimensional Euclidean vectors 𝒖 and 𝒗, their dot product is 𝒖 · 𝒗 ≡ ‖𝒖‖ ‖𝒗‖ cos θ
‖𝒖‖, ‖𝒗‖ are the L2 norms for 𝒖, 𝒗, also written as ‖𝒖‖₂
θ is the angle between the vectors
26
[Figure: vectors 𝒖 and 𝒗 separated by angle θ]
The scalar projection of 𝒖 onto 𝒗 is given by u_𝒗 = ‖𝒖‖ cos θ
Thus the dot product is 𝒖 · 𝒗 = u_𝒗 ‖𝒗‖ = v_𝒖 ‖𝒖‖
COMP90051 Statistical Machine Learning
Geometric properties of the dot product
• If the two vectors are orthogonal then 𝒖 · 𝒗 = 0
• If the two vectors are parallel then 𝒖 · 𝒗 = ‖𝒖‖ ‖𝒗‖; if they are anti‐parallel then 𝒖 · 𝒗 = −‖𝒖‖ ‖𝒗‖
• 𝒖 · 𝒖 = ‖𝒖‖₂², so ‖𝒖‖ = √(𝒖 · 𝒖) defines the Euclidean vector length
27
[Figure: vectors 𝒖 and 𝒗 separated by angle θ]
COMP90051 Statistical Machine Learning
Hyperplanes and normal vectors
• A hyperplane defined by parameters 𝒘 and b is a set of points 𝒙 that satisfy 𝒙′𝒘 + b = 0
• In 2D, a hyperplane is a line: a line is a set of points that satisfy w₁x₁ + w₂x₂ + b = 0
28
[Figure: a line in the (x₁, x₂) plane] A normal vector for a hyperplane is a vector perpendicular to that hyperplane
COMP90051 Statistical Machine Learning
Hyperplanes and normal vectors
• Consider a hyperplane defined by parameters 𝒘 and b. Note that 𝒘 is itself a vector
• Lemma: Vector 𝒘 is normal to the hyperplane
• Proof sketch:
Choose any two points 𝒖 and 𝒗 on the hyperplane. Note that vector 𝒖 − 𝒗 lies parallel to the hyperplane
Consider the dot product (𝒖 − 𝒗) · 𝒘 = 𝒖 · 𝒘 − 𝒗 · 𝒘 = (𝒖 · 𝒘 + b) − (𝒗 · 𝒘 + b) = 0
Thus 𝒖 − 𝒗 is parallel to the hyperplane but perpendicular to 𝒘, and so 𝒘 is a normal vector
29
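The lemma can be sanity-checked numerically. The following sketch (with hypothetical values for 𝒘 and b) constructs two points on the hyperplane 𝒙′𝒘 + b = 0 and confirms that their difference is orthogonal to 𝒘.

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 3.0

def point_on_hyperplane(x1):
    """Return a 2D point x with x.w + b = 0, by solving for the second coordinate."""
    x2 = -(w[0] * x1 + b) / w[1]
    return np.array([x1, x2])

u = point_on_hyperplane(0.0)
v = point_on_hyperplane(5.0)

print(u @ w + b, v @ w + b)      # both ~0: u and v lie on the hyperplane
print((u - v) @ w)               # ~0: u - v is perpendicular to the normal w
```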
COMP90051 Statistical Machine Learning
Example in 2D
• Consider a line defined by w₁, w₂ and b
• Vector [w₁, w₂]′ is a normal vector
30
[Plot: the line x₂ = −(w₁/w₂) x₁ − b/w₂ in the (x₁, x₂) plane, with normal vector [w₁, w₂]′]
COMP90051 Statistical Machine Learning
L1 and L2 norms
• Throughout the subject we will often encounter norms, functions ‖·‖: ℝⁿ → ℝ of a particular form
Intuitively, norms measure lengths of vectors in some sense
Often a component of objectives or stopping criteria in optimisation‐for‐ML
• More specifically, we will often use the L2 norm (aka Euclidean distance): ‖𝒙‖ ≡ ‖𝒙‖₂ ≡ √(x₁² + … + xₙ²)
• And also the L1 norm (aka absolute norm or Manhattan distance): ‖𝒙‖₁ ≡ |x₁| + … + |xₙ|
31
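A minimal numpy sketch (illustrative) of the two norms used throughout the subject; np.linalg.norm computes the L2 norm by default and the L1 norm with ord=1.

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                # sqrt(3^2 + (-4)^2) = 5.0 (Euclidean length)
l1 = np.linalg.norm(x, ord=1)         # |3| + |-4| = 7.0 (Manhattan length)

print(l2, np.sqrt(np.sum(x ** 2)))    # same value
print(l1, np.sum(np.abs(x)))          # same value
```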
COMP90051 Statistical Machine Learning
Vector Spaces and Bases
Useful in interpreting matrices and some
algorithms like PCA
32
COMP90051 Statistical Machine Learning
Linear combinations, Independence
• For formal definition of vector spaces:
https://en.wikipedia.org/wiki/Vector_space#Definition
• A linear combination of vectors 𝒗₁, …, 𝒗ₖ ∈ V, some vector space, is a new vector Σᵢ aᵢ𝒗ᵢ for some scalars a₁, …, aₖ
• A set of vectors {𝒗₁, …, 𝒗ₖ} is called linearly dependent if one element can be written as a linear combination of the other elements
• A set that isn't linearly dependent is linearly independent
33
COMP90051 Statistical Machine Learning
Spans, Bases
• The span of vectors 𝒗₁, …, 𝒗ₖ is the set of all obtainable linear combinations (ranging over all scalar coefficients) of the vectors
• A set of vectors {𝒗₁, …, 𝒗ₖ} is called a basis for a vector subspace V′ if
1. The set is linearly independent; and
2. Every 𝒗 ∈ V′ is a linear combination of the set.
• An orthonormal basis is a basis in which each
1. Pair of basis vectors are orthogonal (zero dot prod); and
2. Basis vector has norm equal to 1.
34
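One practical way to test linear independence (a sketch, not the only way) is via matrix rank: stack the vectors as columns and compare the rank to the number of vectors. Orthonormality can be checked by verifying V′V = I. Values below are illustrative.

```python
import numpy as np

# Columns are the candidate vectors; third column = first + second
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])
print(np.linalg.matrix_rank(V) == V.shape[1])   # False: linearly dependent set

# The standard basis of R^2 is an orthonormal basis: E'E is the identity
E = np.eye(2)
print(np.allclose(E.T @ E, np.eye(2)))          # True
```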
COMP90051 Statistical Machine Learning
Matrices
Some useful facts for ML
35
COMP90051 Statistical Machine Learning
Basic matrices
• See more: https://en.wikipedia.org/wiki/Matrix_(mathematics)
Including matrix‐matrix and matrix‐vector products
• A rectangular array, often denoted by upper‐case, with two indices
first for row, second for column
• Square matrix has equal dimensions (numbers of rows and
columns)
• Matrix transpose A′ or Aᵀ of an m by n matrix A is an n by m matrix with entries A′ᵢⱼ = Aⱼᵢ
• A square matrix A with A = A′ is called symmetric
• The (square) identity matrix I has 1 on the diagonal, 0 off‐diagonal
• Matrix inverse A⁻¹ of a square matrix A (if it exists) satisfies A⁻¹A = I
36
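A short numpy refresher (illustrative matrix, not from the slides) for the facts above: transpose, symmetry, identity and inverse.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

print(A.T)                              # transpose: (A.T)[i, j] = A[j, i]
print(np.allclose(A, A.T))              # False: this A is not symmetric

I = np.eye(2)                           # identity matrix
A_inv = np.linalg.inv(A)                # inverse (exists since A is non-singular)
print(np.allclose(A_inv @ A, I))        # True: A^{-1} A = I
```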
COMP90051 Statistical Machine Learning
Matrix eigenspectrum
• A scalar, vector pair (λ, 𝒗) is called an eigenvalue‐eigenvector pair of a square matrix A if A𝒗 = λ𝒗
Intuition: matrix A doesn't rotate 𝒗, it just stretches it
Intuition: the eigenvalue λ represents the stretching factor
• In general eigenvalues may be zero, negative or even complex (imaginary) – we'll only encounter reals
37
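numpy's eigendecomposition makes the stretching intuition easy to verify; the sketch below (illustrative matrix) checks that A𝒗 = λ𝒗 holds for each returned pair.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are the eigenvectors
for lam, v in zip(eigvals, eigvecs.T):
    print(lam, np.allclose(A @ v, lam * v))   # A v = lambda v holds for each pair
```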
COMP90051 Statistical Machine Learning
Spectra of common matrices
• Eigenvalues of symmetric matrices are always real
(no imaginary component)
• A matrix with linearly dependent columns has some zero eigenvalues (it is called rank deficient) ⇒ no matrix inverse exists
38
COMP90051 Statistical Machine Learning
Positive (semi)definite matrices
• A symmetric square matrix A is called positive semidefinite if for all vectors 𝒗 we have 𝒗′A𝒗 ≥ 0.
Then A has non‐negative eigenvalues
For example, any A = X′X is positive semidefinite, since 𝒗′X′X𝒗 = ‖X𝒗‖² ≥ 0
• Further, if 𝒗′A𝒗 > 0 holds as a strict inequality for all 𝒗 ≠ 0, then A is called positive definite
Then A has (strictly) positive eigenvalues
39
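The X′X construction can be checked directly; the sketch below (random X, illustrative only) confirms the resulting matrix is symmetric with non-negative eigenvalues and that 𝒗′A𝒗 ≥ 0 for a random 𝒗.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
A = X.T @ X                               # A = X'X is symmetric positive semidefinite

print(np.allclose(A, A.T))                # symmetric
print(np.linalg.eigvalsh(A) >= -1e-10)    # eigenvalues non-negative (up to rounding)

v = rng.normal(size=3)
print(v @ A @ v >= 0)                     # v'Av = ||Xv||^2 >= 0
```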
COMP90051 Statistical Machine Learning
Mini Summary
• Vectors: Vector spaces, dot products, independence,
hyperplanes
• Matrices: Eigenvalues, positive semidefinite matrices
Next: Sequences and limits review/primer
40
COMP90051 Statistical Machine Learning
Sequences and Limits
Sequences arise whenever we have
iterations (e.g. training loops, growing
data sample size). Limits tell us about
where sequences tend towards.
41
COMP90051 Statistical Machine Learning
Infinite Sequences
• Written like x₁, x₂, x₃, … or {xᵢ}ᵢ∈ℕ
• Formally: a function from the positive (from 1) or non‐negative (from 0) integers
• Index set: the subscript set, e.g. ℕ
• Sequences allow us to reason about test error when training data grows indefinitely, or training error (or a stopping criterion) when training runs arbitrarily long
42
Wikipedia public domain
COMP90051 Statistical Machine Learning
Limits and Convergence
• A sequence {xᵢ}ᵢ∈ℕ converges if its elements become and remain arbitrarily close to a fixed limit point x ∈ ℝ.
• Formally: xₙ → x if, for all ε > 0, there exists N such that for all n ≥ N we have |xₙ − x| < ε
Notes:
• Epsilon ε represents the distance of the sequence to the limit point
• This distance can be arbitrarily small
• The definition says we eventually get that close (at some finite N) and we stay at least that close for ever more
43
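The ε–N definition can be explored numerically. This sketch (illustrative, and necessarily a finite check) takes the sequence xₙ = 1/n, which converges to 0, and finds an N beyond which every computed term stays within a chosen ε of the limit.

```python
import numpy as np

def x(n):
    return 1.0 / n               # a sequence converging to the limit 0

eps = 1e-3
n = np.arange(1, 10_001)
within = np.abs(x(n) - 0.0) < eps
N = n[within][0]                 # first index from which terms are within eps...
print(N, within[N - 1:].all())   # ...and they stay within eps (checked up to n = 10000)
```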
Wikipedia public domain
COMP90051 Statistical Machine Learning
Supremum
Generalising the maximum: When a
sequence never quite peaks.
44
COMP90051 Statistical Machine Learning
When does the Maximum Exist?
• Can you always take a max of a set?
• Finite sets: what’s the max of {1, 7, 3, 2, 9}?
• Closed, bounded intervals: what’s the max of [0,1]?
• Open, bounded intervals: what’s the max of [0,1)?
• Open, unbounded intervals: what’s the max of [0,∞)?
45
COMP90051 Statistical Machine Learning
What about “Least Upper Bound”?
• Can you always take a least‐upper‐bound of a set? (much more often!)
• Finite sets: what’s the max of {1, 7, 3, 2, 9}?
max=9 LUB=9
• Closed, bounded intervals: what’s the max of [0,1]?
max=1 LUB=1
• Open, bounded intervals: what’s the max of [0,1)?
max=N/A LUB=1
• Open, unbounded intervals: what’s the max of [0,∞)?
max=N/A LUB=∞
46
COMP90051 Statistical Machine Learning
The Supremum
• Consider any subset 𝑆 of the reals
• An upper bound u ∈ ℝ of set S has: u ≥ x for all x ∈ S
• If u is no bigger than any other upper bound of S then it's called a least upper bound or supremum of S, written as sup S and pronounced "soup":
u ≤ z for all upper bounds z ∈ ℝ of S
• When we don’t know, or can’t guarantee, that a set or
sequence has a max, it is better to use its sup
47
Wikipedia public domain
FreeSVG public domain
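A tiny numerical illustration (approximating the set [0, 1) by the sequence xₙ = 1 − 1/n; values are illustrative): no element ever attains 1, so there is no maximum, yet 1 is the least upper bound.

```python
import numpy as np

n = np.arange(1, 1_000_001)
x = 1.0 - 1.0 / n            # all elements lie in [0, 1): none equals 1

print(x.max())               # largest element seen, still strictly below 1
print((x < 1.0).all())       # True: 1 is never attained (no maximum of [0, 1))
# Yet 1 is an upper bound and no smaller number is, so sup = 1.
```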
COMP90051 Statistical Machine Learning
Infimum
• The greatest lower bound or infimum is
generalisation of the minimum
• Written inf(S) pronounced “inf”
• Useful if we’re minimising training error but don’t
know if the minimum is ever attained.
48
COMP90051 Statistical Machine Learning
Stochastic Convergence
When random events or quantities can
sometimes be expected to converge (e.g.
test error likely drops to a minimal value)
49
COMP90051 Statistical Machine Learning
Why Simple Limits Aren’t Enough
• Consider running your favourite learner on varying numbers of training examples, giving classifiers f₁, f₂, …
• If your learner minimises training error, you'd wish its test error wasn't much bigger than its training error
• If the training error goes to 0, you'd wish for the test error to go to 0 as well, as this would mean eventually tiny test error
• But both training data and test data are random!
• Even if convergence usually happens, it won't always!!
50
COMP90051 Statistical Machine Learning
Stochastic Convergence
• A sequence X₁, X₂, … of random variables (with CDFs F₁, F₂, …) converges in distribution to a random variable X (with CDF F) if Fₙ(x) → F(x) for all constants x
• A sequence X₁, X₂, … of random variables converges in probability to a random variable X if for all ε > 0: Pr(|Xₙ − X| > ε) → 0
• A sequence X₁, X₂, … of random variables converges almost surely to a random variable X if: Pr(Xₙ → X) = 1
• Chain of implications:
almost sure (strongest) ⟹ in probability ⟹ in distribution (weakest)
51
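As a concrete (and standard) instance of convergence in probability, the sketch below simulates the weak law of large numbers: the sample mean of n fair coin flips concentrates around 0.5, so the estimated Pr(|sample mean − 0.5| > ε) shrinks as n grows. The numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05

for n in [10, 100, 1000, 10_000]:
    # 2000 repetitions: estimate Pr(|sample mean of n coin flips - 0.5| > eps)
    flips = rng.integers(0, 2, size=(2000, n))
    sample_means = flips.mean(axis=1)
    prob_far = np.mean(np.abs(sample_means - 0.5) > eps)
    print(n, prob_far)   # shrinks towards 0 as n grows (convergence in probability)
```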
COMP90051 Statistical Machine Learning
But don’t worry…
• We’re not going to do any
calculations with stochastic
convergence
• Close understanding of it won’t be
necessary in this subject
• But it's good to be aware that it's
"out there" and we may refer to it
(v briefly) within StatML theory
52
CCA4.0 Vincent Le Moign
COMP90051 Statistical Machine Learning
Mini Summary
• Sequences
• Limits of sequences
• Supremum is the new maximum
• Stochastic convergence
Next time: L02 Statistical schools
Homework week #1: Watch all week 1 recordings.
Jupyter notebooks setup and launch (at home)
53