Machine Learning: Mathematical Background
Calculus
Dariush Hosseini
Department of Computer Science University College London
Lecture Overview
1 Lecture Overview
2 Derivatives & Taylor Series
3 Vector & Matrix Derivatives
4 Matrix Calculus
5 Optimisation
6 Convexity
7 Quadratic Functions
8 Constrained Optimisation & Lagrange Multipliers
9 Integral Calculus
10 Summary
Maths & Machine Learning
Much of machine learning is concerned with:
Solving systems of linear equations −→ Linear Algebra
Minimising cost functions (scalar functions of several variables that typically measure how poorly our model fits the data).
To this end we are often interested in studying the continuous change of such functions −→ (Differential) Calculus
Characterising uncertainty in our learning environments stochastically −→ Probability
Drawing conclusions based on the analysis of data −→ Statistics

Much of this lecture is drawn from ‘Mathematics for Machine Learning’ by Deisenroth, Faisal & Ong
Learning Outcomes for Today’s Lecture
By the end of this lecture you should be familiar with some fundamental objects in, and results of, calculus
For the most part we will concentrate on the statement of results which will be of use in the main body of this module
However, we will not be so concerned with the proofs of these results
Derivatives & Taylor Series
Derivatives
For a function, f : R → R, the derivative is defined as:
df/dx = lim_{δ→0} [f(x + δ) − f(x)] / δ = f′(x)
The second derivative is defined to be the derivative of the
derivative:
d²f/dx² = lim_{δ→0} [f′(x + δ) − f′(x)] / δ = f′′(x)
Taylor Expansion
For small changes, δ, about a point x = x̄, any smooth function, f, can be written as:

f(x̄ + δ) = Σ_{i=0}^{∞} (δⁱ / i!) dⁱf/dxⁱ |_{x=x̄}
         = f(x̄) + δ df/dx |_{x=x̄} + (δ²/2) d²f/dx² |_{x=x̄} + …
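As a quick numerical sanity check of this expansion, the following minimal sketch (in Python/NumPy; the choice f(x) = eˣ and the expansion point are arbitrary illustrative choices) compares f(x̄ + δ) with its first- and second-order truncations:

```python
import numpy as np

# Illustrative choice: f(x) = exp(x), for which f'(x) = f''(x) = exp(x)
f = np.exp
x_bar, delta = 1.0, 1e-2

exact = f(x_bar + delta)
first_order = f(x_bar) + delta * f(x_bar)                 # f(x_bar) + delta * f'(x_bar)
second_order = first_order + 0.5 * delta**2 * f(x_bar)    # ... + (delta^2 / 2) * f''(x_bar)

print(abs(exact - first_order))    # error of order delta^2 (roughly 1e-4 here)
print(abs(exact - second_order))   # error of order delta^3 (roughly 5e-7 here)
```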
Rules for Combining Functions
Linearity
∀ functions f, g and ∀ α, β ∈ R:
(αf(x) + βg(x))′ = αf ′(x) + βg′(x)
Product Rule
∀ functions f , g:
(f(x)g(x))′ = f ′(x)g(x) + f(x)g′(x)
Chain Rule
if f(x) = h(g(x)) then:
f′(x) = h′(g(x)) · g′(x)
Common Derivatives
f(x)    | f′(x)
xⁿ      | n xⁿ⁻¹
e^{kx}  | k e^{kx}
ln x    | 1/x

Where n, k are constants.
Partial Derivatives
For a function that depends on n variables, x1, x2, …, xn, f : (x1, x2, …, xn) → f(x1, x2, …, xn), the partial derivative wrt xi is defined as:

∂f/∂xi = lim_{δ→0} [f(x1, x2, …, xi + δ, …, xn) − f(x1, x2, …, xi, …, xn)] / δ
So the partial derivative wrt xi holds all the other variables fixed
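The limiting quotient above can be approximated numerically by perturbing one coordinate at a time. A minimal NumPy sketch (the example function is an arbitrary illustrative choice):

```python
import numpy as np

# Example: f(x1, x2) = x1^2 * x2, so df/dx1 = 2*x1*x2 and df/dx2 = x1^2
def f(x):
    return x[0]**2 * x[1]

x = np.array([1.5, -2.0])
delta = 1e-6

# Perturb each coordinate in turn, holding the others fixed
for i in range(len(x)):
    e_i = np.zeros_like(x)
    e_i[i] = delta
    approx = (f(x + e_i) - f(x)) / delta
    print(i, approx)   # approximately -6.0 and 2.25, matching the analytic partials
```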
Vector & Matrix Derivatives
Gradient
The gradient of f : Rn → R, denoted by ∇xf, is given by the vector of partial derivatives:

∇xf = [∂f/∂x1, …, ∂f/∂xn]^T
Directional Derivative
The directional derivative in direction u, (where ∥u∥2 = 1), is the
slope of f (x) in the direction of u:
∇xf · u

By the definition of angle, this can be re-written as:

∇xf · u = ∥∇xf∥2 ∥u∥2 cos θ = ∥∇xf∥2 cos θ
Where θ is the angle between the gradient vector and u
Thus the directional derivative is maximal when ∇xf and u are aligned. In other words ∇xf points in the direction of steepest ascent on a surface.
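A minimal NumPy sketch of this fact (the function, its gradient, and the evaluation point are arbitrary illustrative choices): the directional derivative along random unit vectors never exceeds the value attained along the normalised gradient, which equals ∥∇xf∥2.

```python
import numpy as np

# Example: f(x) = x1^2 + 3*x2, with gradient [2*x1, 3]
x = np.array([1.0, 2.0])
grad = np.array([2 * x[0], 3.0])

# Directional derivative grad . u along a few random unit vectors u
rng = np.random.default_rng(0)
for _ in range(3):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    print(grad @ u)                              # always <= ||grad||_2

# The maximal slope is attained along the gradient direction itself
u_star = grad / np.linalg.norm(grad)
print(grad @ u_star, np.linalg.norm(grad))       # both equal ||grad||_2
```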
Total Derivative
For a function that depends on n variables, x1, x2, …, xn, f : (x1, x2, …, xn) → f(x1, x2, …, xn), where each xi = xi(t) is a function of t, the total derivative wrt t is:

df/dt = Σ_{i=1}^{n} (∂f/∂xi)(dxi/dt)
      = [dx1/dt, …, dxn/dt]^T · ∇xf
This follows from the chain rule.
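A small numerical check of the total derivative (the functions x1(t), x2(t) and f are arbitrary illustrative choices):

```python
import numpy as np

# f(x1, x2) = x1 * x2 with x1(t) = t^2 and x2(t) = sin(t)
t = 0.7
x1, x2 = t**2, np.sin(t)
dx1_dt, dx2_dt = 2 * t, np.cos(t)

# Total derivative: sum of partial derivatives times the time-derivatives of each argument
total = x2 * dx1_dt + x1 * dx2_dt

# Compare with differentiating the composite g(t) = t^2 * sin(t) directly
eps = 1e-6
g = lambda s: s**2 * np.sin(s)
print(total, (g(t + eps) - g(t)) / eps)   # the two values agree to within about 1e-5
```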
Jacobian
The Jacobian matrix of a vector-valued function, f : Rn → Rm, defined such that f(x) = [f1(x), …, fm(x)]^T, denoted by ∇xf or ∂(f1, …, fm)/∂(x1, …, xn), is the matrix of all its first order partial derivatives:

∇xf = ∂(f1, …, fm)/∂(x1, …, xn) =
    [ ∂f1/∂x1 … ∂f1/∂xn ]
    [    ⋮    ⋱    ⋮    ]
    [ ∂fm/∂x1 … ∂fm/∂xn ]
Hessian
The Hessian matrix of f : Rn → R, denoted by ∇²xf or H(x), is the matrix of second order partial derivatives:

∇²xf =
    [ ∂²f/∂x1²    …  ∂²f/∂x1∂xn ]
    [     ⋮       ⋱      ⋮      ]
    [ ∂²f/∂xn∂x1  …  ∂²f/∂xn²   ]
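Both the Jacobian and the Hessian can be approximated by finite differences, which is a useful check on hand-derived expressions. A minimal NumPy sketch (the example functions are arbitrary illustrative choices; the Hessian is obtained as the Jacobian of the gradient):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of a vector-valued f: R^n -> R^m."""
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.asarray(f(x + step)) - fx) / eps
    return J

# Example: f(x) = [x1*x2, sin(x1)]; Jacobian rows are [x2, x1] and [cos(x1), 0]
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([1.0, 2.0])
print(jacobian(f, x))

# Hessian of a scalar g as the Jacobian of its gradient
g_grad = lambda x: np.array([2 * x[0] * x[1], x[0]**2])   # gradient of g(x) = x1^2 * x2
print(jacobian(g_grad, x))   # approximately [[2*x2, 2*x1], [2*x1, 0]]
```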
Scalar-by-Matrix Derivative
The derivative of a scalar, y, with respect to the elements of a matrix, A ∈ Rn×m, is defined to be:

∂y/∂A =
    [ ∂y/∂A11 … ∂y/∂A1m ]
    [    ⋮    ⋱    ⋮    ]
    [ ∂y/∂An1 … ∂y/∂Anm ]
Matrix Calculus
Let a, x ∈ Rn and A ∈ Rn×n:

∇x(a^T x) = a

∇x(x^T A x) = (A + A^T) x
In particular, if A is symmetric:
∇x(xT Ax) = 2Ax
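These identities can be sanity-checked with finite differences. A minimal NumPy sketch (random a, x and A, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a, x = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))

def num_grad(f, x, eps=1e-6):
    """Forward-difference approximation to the gradient of a scalar-valued f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x)) / eps
    return g

print(np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-4))                  # grad of a.x is a
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4))  # grad of x'Ax is (A+A')x
```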
Let a, c ∈ Rn, b ∈ Rm, A ∈ Rn×n, and B ∈ Rn×m:

∂(a^T B b)/∂B = a b^T

∂(a^T A⁻¹ c)/∂A = −(A^T)⁻¹ a c^T (A^T)⁻¹

∂ log|A| / ∂A = (A^T)⁻¹
Let B, X ∈ Rn×n:

∂/∂X tr(B X X^T) = B X + B^T X

∂/∂X tr(B X^T X) = X B^T + X B
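The trace identities can be checked in the same way, now differentiating with respect to each matrix entry. A minimal NumPy sketch (random B and X, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
B, X = rng.normal(size=(n, n)), rng.normal(size=(n, n))

def num_matrix_grad(f, X, eps=1e-6):
    """Forward-difference derivative of a scalar f with respect to the matrix X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X)) / eps
    return G

f = lambda M: np.trace(B @ M @ M.T)
print(np.allclose(num_matrix_grad(f, X), B @ X + B.T @ X, atol=1e-4))   # True
```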
Multivariate Taylor Expansion
For small changes, δ, about a point x = x̄, for a scalar function of a vector argument which is at least twice differentiable:

f(x̄ + δ) ≈ f(x̄) + δ · ∇xf(x̄) + ½ δ^T H(x̄) δ
Optimisation
For a function f : Rn → R and a feasible set X ⊆ Rn over which we are interested in optimising:

A point x is a local minimum (or local maximum) of f if f(x) ≤ f(y) (or f(x) ≥ f(y)) for all y in some neighbourhood, N ⊆ X, about x

If f(x) ≤ f(y) (or f(x) ≥ f(y)) for all y ∈ X then x is a global minimum (or global maximum) of f in X
Stationary Points & Saddle Points
Points where the gradient vanishes, i.e. where ∇xf = 0, are
stationary points, also known as critical points.
It can be proved that a necessary condition for a point to be a
maximum or minimum is that the point is stationary
However this is not a sufficient condition, and points for which ∇xf = 0 but where there is no local maximum or minimum are called saddle points
For example:
f(x1, x2) = x1² − x2²  =⇒  ∇xf = [2x1, −2x2]^T

x = [x1, x2]^T = 0  =⇒  ∇xf = 0
But at this point we have a minimum in the x1 direction and a maximum in the x2 direction
Saddle Points: Example
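A quick numerical look at this example (a minimal NumPy sketch): the Hessian of f(x1, x2) = x1² − x2² is constant and indefinite, and moving away from the origin increases f in the x1 direction but decreases it in the x2 direction.

```python
import numpy as np

# f(x1, x2) = x1^2 - x2^2: gradient [2*x1, -2*x2], Hessian diag(2, -2)
H = np.diag([2.0, -2.0])
print(np.linalg.eigvalsh(H))     # [-2., 2.]: one negative and one positive eigenvalue

# The origin is stationary but is neither a minimum nor a maximum
f = lambda x: x[0]**2 - x[1]**2
print(f(np.array([0.1, 0.0])))   #  0.01: f increases along the x1 axis
print(f(np.array([0.0, 0.1])))   # -0.01: f decreases along the x2 axis
```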
Further Conditions for Local Extrema
Recall that by Taylor’s theorem, for sufficiently small δ, and twice
differentiable f about x∗:
f(x∗ + δ) ≈ f(x∗) + δ · ∇xf(x∗) + ½ δ^T H(x∗) δ

If we are at a point for which ∇xf(x∗) = 0, then:

If H(x∗) ≻ 0 then x∗ is a local minimum
If H(x∗) ≺ 0 then x∗ is a local maximum
If H(x∗) is indefinite then x∗ is a saddle point
Otherwise we need to investigate things further…
Conditions for Global Extrema
These are harder to state, however a class of functions for which it is more straightforward to discern global extrema are twice differentiable convex functions
These are functions where ∇²xf ≽ 0 globally
Many of the learning tasks which we will perform during this module will involve optimisation over convex functions
Convexity
Convex Sets
Definition:
A set Ω is convex if, for any x, y ∈ Ω and any θ ∈ [0, 1], θx + (1 − θ)y ∈ Ω.
In other words, if we take any two elements in Ω, and draw a line segment between these two elements, then every point on that line segment also belongs to Ω.
[Boyd & Vandenberghe, ‘Convex Optimization’, 2004]
Convex Functions
Definition:
A function f : Rn → R is convex if its domain is a convex set and if,
for all x,y in its domain, and all λ ∈ [0,1], we have:
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

(Figures: a convex function, where the chord λf(x) + (1 − λ)f(y) lies above f(λx + (1 − λ)y), and a non-convex function, where it does not.)
Concave Functions
Definition:
A function f : Rn → R is concave if −f is convex
1st & 2nd Order Characterisations of Convex, Differentiable Functions
Theorem A.1:
Suppose f is twice differentiable over an open domain. Then, the following are equivalent:
f is convex
f(y) ≥ f(x) + ∇xf(x) · (y − x) ∀ x, y ∈ dom(f)

∇²xf(x) ≽ 0 ∀ x ∈ dom(f)
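A minimal numerical illustration of the second-order characterisation (using NumPy; the log-sum-exp function is my choice of example, a standard convex function with a closed-form Hessian): its Hessian is positive semi-definite at every point we test.

```python
import numpy as np

def hessian_logsumexp(x):
    """Analytic Hessian of f(x) = log(sum(exp(x))): diag(p) - p p^T with p = softmax(x)."""
    p = np.exp(x - np.max(x))
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(3)
for _ in range(3):
    H = hessian_logsumexp(rng.normal(size=2))
    print(np.all(np.linalg.eigvalsh(H) >= -1e-12))   # True: all eigenvalues are non-negative
```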
Global Optimality
Theorem A.2:
Consider an unconstrained optimisation problem: min f (x)
subject to: x ∈ Rn
If f : Rn → R is convex, then any point that is locally optimal is globally optimal
Furthermore, if f is also differentiable then any point x that satisfies ∇xf(x) = 0 is a globally optimal solution
Strict Convexity
Definition:
A function f : Rn → R is strictly convex if its domain is a convex set and if, for all x, y (x ≠ y) in its domain, and all λ ∈ (0, 1), we have:

f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)

First Order Characterisation:
A function f is strictly convex on Ω ⊆ Rn, if and only if:
f (y) > f (x) + ∇xf (x) · (y − x) ∀x, y ∈ Ω, x ̸= y
Second Order Sufficient Condition:
A function f is strictly convex on Ω ⊆ Rn, if:
∇²xf(x) ≻ 0 ∀ x ∈ Ω
Strict Convexity and Uniqueness of Optimal Solutions
Theorem A.3:
Consider an optimisation problem:

min_x f(x)
subject to: x ∈ Ω

where f : Rn → R is strictly convex on Ω and Ω is a convex set. Then the optimal solution must be unique
Sums of Convex Functions
If f(·) is a convex function, and g(·) is a convex function, then:
αf(·) + βg(·) is also a convex function if α, β > 0.
If f(·) is a convex function, and h(·) is a strictly convex function, then:
αf(·) + βh(·) is a strictly convex function if α, β > 0.
Proofs follow from an application of the definitions of convexity and strict convexity
Quadratic Functions
Consider the following quadratic function, f:

f(x) = x^T A x + b · x + c
Where: A ∈ Rn×n, b ∈ Rn, c ∈ R, and the variable x ∈ Rn.
From the Linear Algebra lecture, note that, w.l.o.g., we can write:

f(x) = ½ x^T B x + b · x + c
Where: B = A + AT , and is hence symmetric.
Convexity of Quadratic Functions
B ≽ 0 ⇐⇒ Convexity of f
B ≻ 0 ⇐⇒ Strict Convexity of f
Proof (for the first result; an analogous proof holds for the second):

For any 0 ≤ λ ≤ 1, and for any x, y ∈ Rn:

λf(x) = λ ½ x^T B x + λ b · x + λc
(1 − λ)f(y) = (1 − λ) ½ y^T B y + (1 − λ) b · y + (1 − λ)c

f(λx + (1 − λ)y) = ½ (λx + (1 − λ)y)^T B (λx + (1 − λ)y) + b · (λx + (1 − λ)y) + c
                 = ½ λ² x^T B x + ½ (1 − λ)² y^T B y + λ(1 − λ) x^T B y + λ b · x + (1 − λ) b · y + c

f(λx + (1 − λ)y) − λf(x) − (1 − λ)f(y)
    = (λ² − λ) ½ x^T B x + [(1 − λ)² − (1 − λ)] ½ y^T B y + λ(1 − λ) x^T B y
    = λ(λ − 1) ½ x^T B x + λ(λ − 1) ½ y^T B y − λ(λ − 1) x^T B y
    = ½ λ(λ − 1) (x − y)^T B (x − y)

So: Convexity =⇒ LHS ≤ 0 ⇐⇒ (x − y)^T B (x − y) ≥ 0 =⇒ B ≽ 0
Optimisation of Quadratic Functions
Consider the following unconstrained optimisation problem:
min_x f(x)    (1)

Where: f(x) = ½ x^T B x + b · x + c

And: B is real and symmetric
Consider the following possible forms of f :
Strictly Convex f
If B ≻ 0 then f is strictly convex
Thus, by Theorem A.3, there is a unique solution to problem (1), x∗:

x∗ = −B⁻¹b

Note that this solution must exist because:

B ≻ 0 =⇒ det B ≠ 0 =⇒ ∃ B⁻¹
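A minimal NumPy sketch of this case (the particular positive definite B and vector b are arbitrary illustrative choices): the gradient vanishes at x∗ = −B⁻¹b, and nearby points are no better.

```python
import numpy as np

# Illustrative strictly convex quadratic: B is symmetric positive definite
B = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
c = 0.0

f = lambda x: 0.5 * x @ B @ x + b @ x + c

x_star = -np.linalg.solve(B, b)          # x* = -B^{-1} b (prefer solve over an explicit inverse)
print(x_star)
print(np.allclose(B @ x_star + b, 0))    # True: the gradient vanishes at the minimiser
print(f(x_star) <= f(x_star + np.array([0.1, -0.2])))   # True: a nearby point is no better
```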
Strictly Convex f : Example
f(x) = x1² + x2²,  i.e.:  B = [2 0; 0 2],  b = 0,  c = 0
Convex & Bounded f
If B ≽ 0, B ⊁ 0, and b ∈ range(B), then f is convex but not strictly convex, and is bounded below

Proof:

Solutions to the problem satisfy the following condition:

∇xf(x) = Bx + b = 0 =⇒ Bx = −b

From the Linear Algebra lecture, this system of equations has a solution iff it is consistent, i.e.:

rank(B) = rank(B | −b)

Because b ∈ range(B), b is some linear combination of the columns of B, so:

columnspace(B) = columnspace(B | −b)
=⇒ range(B) = range(B | −b)
=⇒ rank(B) = rank(B | −b)
Convex & Bounded f (Cont.)
There are infinitely many solutions to problem (1), since:

B ≽ 0, B ⊁ 0 =⇒ at least one eigenvalue = 0
             =⇒ det B = 0
             =⇒ B is not full rank

Thus the system of equations, Bx = −b, is underdetermined
Convex & Bounded f : Example
f(x) = x1² + x2² − 2x1x2,  i.e.:  B = [2 −2; −2 2],  b = 0,  c = 0
Convex & Unbounded f
If B ≽ 0, B ⊁ 0, and b ∉ range(B), then f is convex but not strictly convex, and is unbounded below

Proof:

As before, solutions to the problem exist iff:

rank(B) = rank(B | −b)

But because b ∉ range(B), b cannot be written as a linear combination of the columns of B, so:

columnspace(B) ≠ columnspace(B | −b)
=⇒ range(B) ≠ range(B | −b)
=⇒ rank(B) ≠ rank(B | −b)
Convex & Unbounded f (Cont.)
Thus the system of equations is inconsistent and no solutions to problem (1) exist
Convex & Unbounded f : Example
f(x) = x2² − x1,  i.e.:  B = [0 0; 0 2],  b = [−1, 0]^T,  c = 0
Non-Convex f
If B ⋡ 0 then f is non-convex

There is no solution to problem (1) since f is unbounded below

Proof:

Consider an eigenvector of B, x, with an eigenvalue, λ < 0:

=⇒ Bx = λx
=⇒ x^T B x = λ x^T x < 0
=⇒ f(αx) = ½ α² λ x^T x + α b · x + c

Thus f(αx) → −∞ as α → ∞
Non-Convex f : Example
f(x) = x1² − x2²,  i.e.:  B = [2 0; 0 −2],  b = 0,  c = 0
Optimisation of Quadratic Functions

Characteristic properties of:

argmin_x f(x),  where: f(x) = ½ x^T B x + b · x + c

Parameters of f              | f                        | argmin_x f(x) | (Bx = −b)
B ≻ 0                        | Strictly Convex          | 1 solution    | consistent & exactly determined
B ≽ 0, B ⊁ 0, b ∈ range(B)   | Convex & Bounded Below   | ∞ solutions   | consistent & underdetermined
B ≽ 0, B ⊁ 0, b ∉ range(B)   | Convex & Unbounded Below | 0 solutions   | inconsistent
B ⋡ 0                        | Non-Convex               | 0 solutions   | f unbounded below
Example: Ordinary Least Squares

Consider:

argmin_w f(w)

f(w) = ½ ∥y − Xw∥2²
     = ½ w^T X^T X w − (X^T y) · w + ½ ∥y∥2²

w ←− x,   X^T X ←− B,   −X^T y ←− b,   ½ ∥y∥2² ←− c
Also, recall from the Linear Algebra lecture:
rank(X^T X | X^T y) = rank(X^T X)
=⇒ X^T y ∈ range(X^T X)
=⇒ −X^T y ∈ range(X^T X)
Thus the OLS problem always has at least one solution:

Either:  X^T X ≻ 0  =⇒  1 solution

Or:  X^T X ≽ 0, X^T X ⊁ 0, −X^T y ∈ range(X^T X)  =⇒  ∞ solutions
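In practice the OLS minimiser can be computed directly from the normal equations X^T X w = X^T y. A minimal NumPy sketch (the data are randomly generated purely for illustration; here X has full column rank, so the solution is unique):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))             # design matrix (full column rank here)
y = rng.normal(size=50)

# Normal equations: X^T X w = X^T y (unique solution when X^T X is positive definite)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq also returns a minimiser, and still works when X^T X is singular
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))    # True in the full-rank case
```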
Constrained Optimisation & Lagrange Multipliers
Equality Constraints
f : Rn → R is the function which we wish to optimise over x

g(x) = 0 represents an (n − 1) dimensional surface constraint

n = 2 dimensional illustration (with g(xA) = 0, and g(xB) < 0):

(Figure: the region g(x) < 0, the constraint surface g(x) = 0, points xA and xB, and the gradient ∇xg(xA).)
[Content and illustrations based on Bishop, ‘Pattern Recognition & Machine Learning’, 2008]
Equality Constraints: Problem
min_{x ∈ Rn} f(x)
subject to: g(x) = 0
Note that the functions f and g can be convex or nonconvex in general.
Equality Constraints: Observations
∇xg(x) is orthogonal to the surface defined by g(x):
Because, if we denote any direction along the surface by u, then the directional derivative along that direction must be zero: ∇xg(x) · u = 0.
The optimal point, x∗ must have the property that ∇xf(x∗) is orthogonal to the constraint surface:
Because, otherwise f (x) could decrease for movements along the surface.
Equality Constraints: Lagrange Multipliers
Thus ∇xf(x) and ∇xg(x) must be parallel, i.e.:
∇xf(x) + λ∇xg(x) = 0 for some: λ ̸= 0
Here λ is a so-called Lagrange multiplier
Equality Constraints: Lagrangian
Let us define the Lagrangian function, L, as follows:
L(x, λ) = f (x) + λg(x)
∇xL = ∇xf(x) + λ∇xg(x)
∇λL = g(x)
Equality Constraints: Problem reformulation
Seek stationary solutions (x∗, λ∗) which satisfy the following:
∇xL = 0
∇λL = 0
Thus we have transformed our problem into an unconstrained optimisation problem
Furthermore, we have re-phrased this problem as a well posed one involving the solution of a set of simultaneous equations
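As an illustration of this reformulation, the sketch below solves a small equality-constrained problem of my own choosing (minimise f(x) = x1² + x2² subject to g(x) = x1 + x2 − 1 = 0): setting ∇xL = 0 and ∇λL = 0 gives a linear system in (x1, x2, λ), which NumPy solves directly.

```python
import numpy as np

# Minimise f(x) = x1^2 + x2^2 subject to g(x) = x1 + x2 - 1 = 0.
# Stationarity of the Lagrangian L(x, lambda) = f(x) + lambda * g(x) gives:
#   2*x1 + lambda = 0
#   2*x2 + lambda = 0
#   x1 + x2       = 1
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(K, rhs)
print(x1, x2, lam)   # 0.5 0.5 -1.0: the closest point on the line x1 + x2 = 1 to the origin
```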
Equality Constraints: Problem reformulation
Note that these conditions characterise a stationary point associated with the function f…

…but we have said nothing about whether such a point is a maximum, a minimum, or a saddle point

It can be proved that if both f and g are convex functions then the stationary point is a minimum

And if f is concave and g is convex then the stationary point is a maximum
Equality Constraints: Saddle Point
Note that the optimal solution to our constrained optimisation
problem will be a saddle point...
Consider the Hessian matrix for the Lagrangian:

H(x, λ) = [ ∇²xf(x) + λ∇²xg(x)   ∇xg(x) ]
          [ ∇xg(x)^T              0     ]

Now consider the quadratic form α^T H(x, λ) α, where α = [a, b]^T, for all a ∈ Rn and b ∈ R:

α^T H(x, λ) α = a^T (∇²xf(x) + λ∇²xg(x)) a + 2b a · ∇xg(x)
Clearly if ∇xg(x) is finite then it is always