Vector Calculus
Liang Zheng
Australian National University
liang. .au
5 Vector Calculus
• We discuss functions
  $$f : \mathbb{R}^D \to \mathbb{R}, \qquad \mathbf{x} \mapsto f(\mathbf{x}),$$
  where $\mathbb{R}^D$ is the domain of $f$, and the function values $f(\mathbf{x})$ are the image/codomain of $f$.
• Example (dot product)
  • Previously, we wrote the dot product as
    $$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{x}, \qquad \mathbf{x} \in \mathbb{R}^2.$$
  • In this chapter, we write it as
    $$f : \mathbb{R}^2 \to \mathbb{R}, \qquad \mathbf{x} \mapsto x_1^2 + x_2^2.$$
5.1 Differentiation of Univariate Functions
• Given $y = f(x)$, the difference quotient is defined as
  $$\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}$$
• It computes the slope of the secant line through two points on the graph of $f$, namely the points with $x$-coordinates $x_0$ and $x_0 + \delta x$.
• In the limit $\delta x \to 0$, we obtain the tangent of $f$ at $x$ (if $f$ is differentiable). The slope of this tangent is the derivative of $f$ at $x$.
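• A quick numerical illustration (a minimal sketch, not part of the original slides; NumPy assumed, and $f = \sin$ is an arbitrary example): the difference quotient approaches the derivative as $\delta x$ shrinks.

import numpy as np

def difference_quotient(f, x, dx):
    # Slope of the secant line through (x, f(x)) and (x + dx, f(x + dx)).
    return (f(x + dx) - f(x)) / dx

f = np.sin          # example function; its derivative at x is cos(x)
x = 1.0
for dx in [1e-1, 1e-3, 1e-5]:
    print(dx, difference_quotient(f, x, dx), np.cos(x))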
5.1 Differentiation of Univariate Functions
• For $h > 0$, the derivative of $f$ at $x$ is defined as the limit
  $$\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
• The derivative of 𝑓 points in the direction of steepest ascent of 𝑓.
• Example – Derivative of a Polynomial
• Compute the derivative of $f(x) = x^n$, $n \in \mathbb{N}$. (From our high school knowledge, the derivative is $n x^{n-1}$.)
  $$\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{(x + h)^n - x^n}{h} = \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^i - x^n}{h}$$
  We see that $x^n = \binom{n}{0} x^{n-0} h^0$, so by starting the sum at $i = 1$ the $x^n$ cancels, and we obtain
  $$\frac{\mathrm{d}f}{\mathrm{d}x} = \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} = \binom{n}{1} x^{n-1} = n x^{n-1}.$$
You will need the Taylor expansion of $\exp(x)$ to do this: $\exp(x) = \sum_{k=0}^{\infty} \frac{x^k}{k!}$.
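• A small sanity check of the polynomial derivative (a sketch, not part of the slides; NumPy assumed, with $n = 5$ and $x = 1.7$ as arbitrary test values):

import numpy as np

n, x, h = 5, 1.7, 1e-6
numeric = ((x + h)**n - x**n) / h   # difference quotient with small h
analytic = n * x**(n - 1)           # the formula n * x^(n-1)
print(numeric, analytic)            # the two values agree closely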
5.1.2 Differentiation Rules
• Product rule:
  $$(f(x)g(x))' = f'(x)g(x) + f(x)g'(x)$$
• Quotient rule:
  $$\left( \frac{f(x)}{g(x)} \right)' = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}$$
• Sum rule:
  $$(f(x) + g(x))' = f'(x) + g'(x)$$
• Chain rule:
  $$(g(f(x)))' = (g \circ f)'(x) = g'(f(x))\, f'(x)$$
  Here, $g \circ f$ denotes function composition: $g(f(x))$.
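• To illustrate the chain rule numerically (a minimal sketch, not from the slides; NumPy assumed, and the choices $f(x) = x^2$, $g(u) = \sin u$ are our own):

import numpy as np

f = lambda x: x**2                 # inner function f
g = lambda u: np.sin(u)            # outer function g
gf = lambda x: g(f(x))             # composition g ∘ f

x, h = 0.8, 1e-6
numeric = (gf(x + h) - gf(x)) / h  # derivative of g(f(x)) by finite differences
chain   = np.cos(f(x)) * 2 * x     # g'(f(x)) * f'(x)
print(numeric, chain)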
5.2 Partial Differentiation and Gradients
• Instead of considering $x \in \mathbb{R}$, we consider $\mathbf{x} \in \mathbb{R}^n$, e.g., $f(\mathbf{x}) = f(x_1, x_2)$.
• The generalization of the derivative to functions of several variables is the
gradient.
• We find the gradient of the function 𝑓 with respect to 𝒙 by
• varying one variable at a time and keeping the others constant.
• The gradient is the collection of these partial derivatives.
• For a function $f : \mathbb{R}^n \to \mathbb{R}$, $\mathbf{x} \mapsto f(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^n$ of $n$ variables $x_1, \ldots, x_n$, we define the partial derivatives as
  $$\frac{\partial f}{\partial x_1} := \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(\mathbf{x})}{h}$$
  $$\vdots$$
  $$\frac{\partial f}{\partial x_n} := \lim_{h \to 0} \frac{f(x_1, \ldots, x_{n-1}, x_n + h) - f(\mathbf{x})}{h}$$
  and collect them in the row vector
  $$\nabla_{\mathbf{x}} f = \operatorname{grad} f = \frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \left[ \frac{\partial f(\mathbf{x})}{\partial x_1} \;\; \frac{\partial f(\mathbf{x})}{\partial x_2} \;\; \cdots \;\; \frac{\partial f(\mathbf{x})}{\partial x_n} \right] \in \mathbb{R}^{1 \times n}$$
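• A generic finite-difference gradient, varying one variable at a time (a sketch, not from the slides; NumPy assumed, and the helper name numerical_gradient as well as the test function are our own):

import numpy as np

def numerical_gradient(f, x, h=1e-6):
    # Approximate the 1 x n gradient row vector by perturbing one variable at a time.
    grad = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[0, i] = (f(x + e) - f(x)) / h
    return grad

f = lambda x: x[0]**2 + x[1]**3                        # example f(x1, x2) = x1^2 + x2^3
print(numerical_gradient(f, np.array([1.0, 2.0])))     # ≈ [[2., 12.]]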
5.2 Partial Differentiation and Gradients
• $$\nabla_{\mathbf{x}} f = \operatorname{grad} f = \frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \left[ \frac{\partial f(\mathbf{x})}{\partial x_1} \;\; \frac{\partial f(\mathbf{x})}{\partial x_2} \;\; \cdots \;\; \frac{\partial f(\mathbf{x})}{\partial x_n} \right] \in \mathbb{R}^{1 \times n}$$
• $n$ is the number of variables and $1$ is the dimension of the image/range/codomain of $f$.
• The row vector $\nabla_{\mathbf{x}} f \in \mathbb{R}^{1 \times n}$ is called the gradient of $f$ or the Jacobian.
• Example – Partial Derivatives Using the Chain Rule
• For $f(x, y) = (x + 2y^3)^2$, we obtain the partial derivatives
  $$\frac{\partial f(x, y)}{\partial x} = 2(x + 2y^3) \frac{\partial}{\partial x}(x + 2y^3) = 2(x + 2y^3)$$
  $$\frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3) \frac{\partial}{\partial y}(x + 2y^3) = 12(x + 2y^3)\, y^2$$
5.2 Partial Differentiation and Gradients
• For $f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$, the partial derivatives (i.e., the derivatives of $f$ with respect to $x_1$ and $x_2$) are
  $$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2$$
• and the gradient is then
  $$\frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \left[ \frac{\partial f(x_1, x_2)}{\partial x_1} \;\; \frac{\partial f(x_1, x_2)}{\partial x_2} \right] = \left[ 2 x_1 x_2 + x_2^3 \;\;\; x_1^2 + 3 x_1 x_2^2 \right] \in \mathbb{R}^{1 \times 2}$$
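• This gradient can be checked numerically (a sketch, not from the slides; NumPy assumed, test point chosen arbitrarily):

import numpy as np

f = lambda x1, x2: x1**2 * x2 + x1 * x2**3
x1, x2, h = 1.5, -0.5, 1e-6
num_dx1 = (f(x1 + h, x2) - f(x1, x2)) / h     # ≈ 2*x1*x2 + x2^3
num_dx2 = (f(x1, x2 + h) - f(x1, x2)) / h     # ≈ x1^2 + 3*x1*x2^2
print(num_dx1, 2*x1*x2 + x2**3)
print(num_dx2, x1**2 + 3*x1*x2**2)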
5.2.1 Basic Rules of Partial Differentiation
• Product rule:
  $$\frac{\partial}{\partial \mathbf{x}} \big( f(\mathbf{x})\, g(\mathbf{x}) \big) = \frac{\partial f}{\partial \mathbf{x}}\, g(\mathbf{x}) + f(\mathbf{x})\, \frac{\partial g}{\partial \mathbf{x}}$$
• Sum rule:
  $$\frac{\partial}{\partial \mathbf{x}} \big( f(\mathbf{x}) + g(\mathbf{x}) \big) = \frac{\partial f}{\partial \mathbf{x}} + \frac{\partial g}{\partial \mathbf{x}}$$
• Chain rule:
  $$\frac{\partial}{\partial \mathbf{x}} (g \circ f)(\mathbf{x}) = \frac{\partial}{\partial \mathbf{x}} \big( g(f(\mathbf{x})) \big) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial \mathbf{x}}$$
5.2.2 Chain Rule
• Consider a function $f : \mathbb{R}^2 \to \mathbb{R}$ of two variables $x_1$ and $x_2$.
• $x_1(t)$ and $x_2(t)$ are themselves functions of $t$.
• To compute the gradient of $f$ with respect to $t$, we apply the chain rule:
  $$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial t} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1(t)}{\partial t} \\[4pt] \frac{\partial x_2(t)}{\partial t} \end{bmatrix} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}$$
  where $\mathrm{d}$ denotes the gradient (total derivative) and $\partial$ denotes partial derivatives.
• Example
• Consider $f(x_1, x_2) = x_1^2 + 2x_2$, where $x_1 = \sin t$ and $x_2 = \cos t$. Then
  $$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} = 2 \sin t\, \frac{\partial \sin t}{\partial t} + 2\, \frac{\partial \cos t}{\partial t} = 2 \sin t \cos t - 2 \sin t = 2 \sin t\, (\cos t - 1)$$
• The above is the corresponding derivative of $f$ with respect to $t$.
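• A numerical check of this example (a sketch, not from the slides; NumPy assumed):

import numpy as np

f = lambda t: np.sin(t)**2 + 2*np.cos(t)   # f(x1(t), x2(t)) with x1 = sin t, x2 = cos t
t, h = 1.2, 1e-6
numeric = (f(t + h) - f(t)) / h
chain   = 2*np.sin(t)*(np.cos(t) - 1)      # result from the chain rule above
print(numeric, chain)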
5.2.2 Chain Rule
• If $f(x_1, x_2)$ is a function of $x_1$ and $x_2$, where $f : \mathbb{R}^2 \to \mathbb{R}$, and $x_1(s, t)$ and $x_2(s, t)$ are themselves functions of two variables $s$ and $t$, the chain rule yields the partial derivatives
  $$\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s}, \qquad \frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}$$
• The gradient can be obtained by the matrix multiplication
  $$\frac{\mathrm{d}f}{\mathrm{d}(s, t)} = \frac{\partial f}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial (s, t)} = \underbrace{\left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right]}_{= \frac{\partial f}{\partial \mathbf{x}}} \underbrace{\begin{bmatrix} \frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\[4pt] \frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial \mathbf{x}}{\partial (s, t)}}$$
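• The matrix-multiplication form of the chain rule, checked numerically (a sketch, not from the slides; NumPy assumed, and the concrete choices $f(x_1, x_2) = x_1^2 + 2x_2$, $x_1 = st$, $x_2 = s + t$ are our own):

import numpy as np

def h(s, t):
    x1, x2 = s*t, s + t
    return x1**2 + 2*x2

s, t, eps = 0.7, 1.3, 1e-6
x1, x2 = s*t, s + t

df_dx  = np.array([[2*x1, 2.0]])        # 1x2 row vector [df/dx1, df/dx2]
dx_dst = np.array([[t,   s],            # 2x2 Jacobian: rows x1, x2; columns s, t
                   [1.0, 1.0]])
chain = df_dx @ dx_dst                  # 1x2 gradient with respect to (s, t)

numeric = np.array([[(h(s + eps, t) - h(s, t)) / eps,
                     (h(s, t + eps) - h(s, t)) / eps]])
print(chain, numeric)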
5.3 Gradients of Vector-Valued Functions
• We discussed partial derivatives and gradients of functions $f : \mathbb{R}^n \to \mathbb{R}$.
• We will generalize the concept of the gradient to vector-valued functions (vector fields) $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $n \geq 1$ and $m > 1$.
• For a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ and a vector $\mathbf{x} = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n$, the corresponding vector of function values is given as
  $$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^m$$
• Writing the vector-valued function in this way allows us to view it as a vector of functions $[f_1, \ldots, f_m]^\top$, $f_i : \mathbb{R}^n \to \mathbb{R}$, that each map onto $\mathbb{R}$.
• The differentiation rules for every $f_i$ are exactly the ones we discussed before.
5.3 Gradients of Vector-Valued Functions
• The partial derivative of a vector-valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x_i \in \mathbb{R}$, $i = 1, \ldots, n$, is given as the vector
  $$\frac{\partial \mathbf{f}}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} = \begin{bmatrix} \lim_{h \to 0} \frac{f_1(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_1(\mathbf{x})}{h} \\ \vdots \\ \lim_{h \to 0} \frac{f_m(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_m(\mathbf{x})}{h} \end{bmatrix} \in \mathbb{R}^m$$
• In the above, every partial derivative $\frac{\partial \mathbf{f}}{\partial x_i}$ is a column vector.
• Recall that the gradient of $f$ with respect to a vector is the row vector of the partial derivatives.
• Therefore, we obtain the gradient of $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $\mathbf{x} \in \mathbb{R}^n$ by collecting these partial derivatives:
  $$\frac{\mathrm{d}\mathbf{f}(\mathbf{x})}{\mathrm{d}\mathbf{x}} = \left[ \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} \;\; \cdots \;\; \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$
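• A minimal numerical Jacobian built column by column from these partial derivatives (a sketch, not from the slides; NumPy assumed, and the helper name numerical_jacobian plus the test function are our own):

import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    # Column i is the partial derivative ∂f/∂x_i, so the result is m x n.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        J[:, i] = (f(x + e) - fx) / h
    return J

f = lambda x: np.array([x[0]*x[1], np.sin(x[0]), x[1]**2])   # f: R^2 -> R^3
print(numerical_jacobian(f, np.array([1.0, 2.0])))
# analytic Jacobian: [[x2, x1], [cos(x1), 0], [0, 2*x2]]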
5.3 Gradients of Vector-Valued Functions
• The collection of all first-order partial derivatives of a vector-valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is called the Jacobian. The Jacobian $\mathbf{J}$ is an $m \times n$ matrix, which we define and arrange as follows:
  $$\mathbf{J} = \nabla_{\mathbf{x}} \mathbf{f} = \frac{\mathrm{d}\mathbf{f}(\mathbf{x})}{\mathrm{d}\mathbf{x}} = \left[ \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} \;\; \cdots \;\; \frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \mathbf{J}(i, j) = \frac{\partial f_i}{\partial x_j}$$
• The elements of $\mathbf{f}$ define the rows and the elements of $\mathbf{x}$ define the columns of the corresponding Jacobian.
• Special case: for a function $f : \mathbb{R}^n \to \mathbb{R}^1$ which maps a vector $\mathbf{x} \in \mathbb{R}^n$ onto a scalar, i.e., $m = 1$, the Jacobian is a row vector of dimension $1 \times n$.
5.3 Gradients of Vector-Valued Functions
• If $f : \mathbb{R} \to \mathbb{R}$, the gradient is a scalar.
• If $f : \mathbb{R}^D \to \mathbb{R}$, the gradient is a $1 \times D$ row vector.
• If $\mathbf{f} : \mathbb{R} \to \mathbb{R}^E$, the gradient is an $E \times 1$ column vector.
• If $\mathbf{f} : \mathbb{R}^D \to \mathbb{R}^E$, the gradient is an $E \times D$ matrix.
Example – Gradient of a Vector-Valued Function
• We are given $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$, with $\mathbf{f}(\mathbf{x}) \in \mathbb{R}^M$, $\mathbf{A} \in \mathbb{R}^{M \times N}$, $\mathbf{x} \in \mathbb{R}^N$.
• To compute the gradient $\mathrm{d}\mathbf{f}/\mathrm{d}\mathbf{x}$, we first determine the dimension of $\mathrm{d}\mathbf{f}/\mathrm{d}\mathbf{x}$: since $\mathbf{f} : \mathbb{R}^N \to \mathbb{R}^M$, it follows that $\mathrm{d}\mathbf{f}/\mathrm{d}\mathbf{x} \in \mathbb{R}^{M \times N}$.
• Then, we determine the partial derivatives of $\mathbf{f}$ with respect to every $x_j$:
  $$f_i(\mathbf{x}) = \sum_{j=1}^{N} A_{ij} x_j \;\;\Rightarrow\;\; \frac{\partial f_i}{\partial x_j} = A_{ij}$$
• We collect the partial derivatives in the Jacobian and obtain the gradient
  $$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = \mathbf{A} \in \mathbb{R}^{M \times N}$$
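• The result $\mathrm{d}\mathbf{f}/\mathrm{d}\mathbf{x} = \mathbf{A}$ can be confirmed numerically (a sketch, not from the slides; NumPy assumed, random test data):

import numpy as np

rng = np.random.default_rng(0)
M, N, h = 3, 4, 1e-6
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

J = np.zeros((M, N))
for j in range(N):
    e = np.zeros(N)
    e[j] = h
    J[:, j] = (A @ (x + e) - A @ x) / h   # j-th column: ∂(Ax)/∂x_j

print(np.allclose(J, A, atol=1e-4))        # True: the Jacobian of Ax is A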
Example – Chain Rule
• Consider the function $h : \mathbb{R} \to \mathbb{R}$, $h(t) = (f \circ g)(t)$ with
  $$f : \mathbb{R}^2 \to \mathbb{R}, \quad f(\mathbf{x}) = \exp(x_1 x_2^2), \qquad g : \mathbb{R} \to \mathbb{R}^2, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix}$$
• We compute the gradient of $h$ with respect to $t$. Since $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$, we note that
  $$\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{1 \times 2}, \qquad \frac{\partial g}{\partial t} \in \mathbb{R}^{2 \times 1}$$
• The desired gradient is computed by applying the chain rule:
  $$\frac{\mathrm{d}h}{\mathrm{d}t} = \frac{\partial f}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial t} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1}{\partial t} \\[4pt] \frac{\partial x_2}{\partial t} \end{bmatrix} = \left[ \exp(x_1 x_2^2)\, x_2^2 \;\;\; 2 \exp(x_1 x_2^2)\, x_1 x_2 \right] \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix}$$
  $$= \exp(x_1 x_2^2) \big( x_2^2 (\cos t - t \sin t) + 2 x_1 x_2 (\sin t + t \cos t) \big),$$
  where $x_1 = t \cos t$ and $x_2 = t \sin t$.
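• A numerical check of this chain-rule computation (a sketch, not from the slides; NumPy assumed, $t = 0.9$ chosen arbitrarily):

import numpy as np

def h(t):
    x1, x2 = t*np.cos(t), t*np.sin(t)
    return np.exp(x1 * x2**2)

t, eps = 0.9, 1e-7
x1, x2 = t*np.cos(t), t*np.sin(t)
analytic = np.exp(x1*x2**2) * (x2**2*(np.cos(t) - t*np.sin(t))
                               + 2*x1*x2*(np.sin(t) + t*np.cos(t)))
numeric = (h(t + eps) - h(t)) / eps
print(analytic, numeric)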
Example – Gradient of a Least-Squares Loss in a Linear Model
• Let us consider the linear model
  $$\mathbf{y} = \boldsymbol{\Phi} \boldsymbol{\theta},$$
  where $\boldsymbol{\theta} \in \mathbb{R}^D$ is a parameter vector, $\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$ are input features, and $\mathbf{y} \in \mathbb{R}^N$ are the corresponding observations. We define the functions
  $$L(\mathbf{e}) := \| \mathbf{e} \|^2, \qquad \mathbf{e}(\boldsymbol{\theta}) := \mathbf{y} - \boldsymbol{\Phi} \boldsymbol{\theta}.$$
• We seek $\frac{\partial L}{\partial \boldsymbol{\theta}}$, and we will use the chain rule for this purpose. $L$ is called a least-squares loss function.
• First, we determine the dimensionality of the gradient as
  $$\frac{\partial L}{\partial \boldsymbol{\theta}} \in \mathbb{R}^{1 \times D}$$
• The chain rule allows us to compute the gradient as
  $$\frac{\partial L}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \boldsymbol{\theta}}$$
Example – Gradient of a Least-Squares Loss in a Linear Model
• We know that $\| \mathbf{e} \|^2 = \mathbf{e}^\top \mathbf{e}$ and determine
  $$\frac{\partial L}{\partial \mathbf{e}} = 2 \mathbf{e}^\top \in \mathbb{R}^{1 \times N}$$
• Further, we obtain
  $$\frac{\partial \mathbf{e}}{\partial \boldsymbol{\theta}} = -\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$$
• Our desired derivative is
  $$\frac{\partial L}{\partial \boldsymbol{\theta}} = -2 \mathbf{e}^\top \boldsymbol{\Phi} = -\underbrace{2 \left( \mathbf{y}^\top - \boldsymbol{\theta}^\top \boldsymbol{\Phi}^\top \right)}_{1 \times N} \underbrace{\boldsymbol{\Phi}}_{N \times D} \in \mathbb{R}^{1 \times D}$$
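• The closed-form gradient $-2\mathbf{e}^\top \boldsymbol{\Phi}$ matches a finite-difference gradient (a sketch, not from the slides; NumPy assumed, with random $\boldsymbol{\Phi}$, $\mathbf{y}$, $\boldsymbol{\theta}$ as test data):

import numpy as np

rng = np.random.default_rng(1)
N, D, h = 5, 3, 1e-6
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

L = lambda th: np.sum((y - Phi @ th)**2)      # least-squares loss ||e||^2

e = y - Phi @ theta
analytic = -2 * e @ Phi                        # 1 x D gradient from the chain rule

numeric = np.array([(L(theta + h*np.eye(D)[i]) - L(theta)) / h for i in range(D)])
print(analytic, numeric)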
5.4 Gradients of Matrices
• Consider the following example:
  $$\mathbf{f} = \mathbf{A}\mathbf{x}, \qquad \mathbf{f} \in \mathbb{R}^M, \quad \mathbf{A} \in \mathbb{R}^{M \times N}, \quad \mathbf{x} \in \mathbb{R}^N$$
• We seek the gradient $\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}}$.
• First, we determine the dimension of the gradient:
  $$\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}} \in \mathbb{R}^{M \times (M \times N)}$$
• By definition, the gradient is the collection of the partial derivatives:
  $$\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}} = \begin{bmatrix} \frac{\partial f_1}{\partial \mathbf{A}} \\ \vdots \\ \frac{\partial f_M}{\partial \mathbf{A}} \end{bmatrix}, \qquad \frac{\partial f_i}{\partial \mathbf{A}} \in \mathbb{R}^{1 \times (M \times N)}$$
• To compute the partial derivatives, we explicitly write out the matrix-vector multiplication
  $$f_i = \sum_{j=1}^{N} A_{ij} x_j, \qquad i = 1, \ldots, M.$$
• The partial derivatives are then given as
  $$\frac{\partial f_i}{\partial A_{iq}} = x_q$$
• Partial derivatives of $f_i$ with respect to a row of $\mathbf{A}$ are given as
  $$\frac{\partial f_i}{\partial A_{i,:}} = \mathbf{x}^\top \in \mathbb{R}^{1 \times 1 \times N}, \qquad \frac{\partial f_i}{\partial A_{k \neq i,:}} = \mathbf{0}^\top \in \mathbb{R}^{1 \times 1 \times N}$$
• Since $f_i$ maps onto $\mathbb{R}$ and each row of $\mathbf{A}$ is of size $1 \times N$, we obtain a $1 \times 1 \times N$-sized tensor as the partial derivative of $f_i$ with respect to a row of $\mathbf{A}$.
• We stack the partial derivatives and get the desired gradient
  $$\frac{\partial f_i}{\partial \mathbf{A}} = \begin{bmatrix} \mathbf{0}^\top \\ \vdots \\ \mathbf{0}^\top \\ \mathbf{x}^\top \\ \mathbf{0}^\top \\ \vdots \\ \mathbf{0}^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)},$$
  where $\mathbf{x}^\top$ occupies the $i$th row and all other rows are zero.
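• This $M \times (M \times N)$ gradient can be assembled and checked numerically (a sketch, not from the slides; NumPy assumed, and the choice to store the tensor with shape (M, M, N) is our own):

import numpy as np

rng = np.random.default_rng(2)
M, N, h = 3, 4, 1e-6
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Analytic gradient: ∂f_i/∂A is zero everywhere except row i, which equals x^T.
grad = np.zeros((M, M, N))
for i in range(M):
    grad[i, i, :] = x

# Finite-difference check of df/dA, entry by entry.
num = np.zeros((M, M, N))
for k in range(M):
    for j in range(N):
        E = np.zeros((M, N))
        E[k, j] = h
        num[:, k, j] = ((A + E) @ x - A @ x) / h

print(np.allclose(grad, num, atol=1e-4))   # True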
Example – Gradient of Matrices with Respect to Matrices
• Consider a matrix $\mathbf{R} \in \mathbb{R}^{M \times N}$ and $\mathbf{f} : \mathbb{R}^{M \times N} \to \mathbb{R}^{N \times N}$ with
  $$\mathbf{f}(\mathbf{R}) = \mathbf{R}^\top \mathbf{R} =: \mathbf{K} \in \mathbb{R}^{N \times N}$$
• We seek the gradient $\frac{\mathrm{d}\mathbf{K}}{\mathrm{d}\mathbf{R}}$.
• First, the dimension of the gradient is given as
  $$\frac{\mathrm{d}\mathbf{K}}{\mathrm{d}\mathbf{R}} \in \mathbb{R}^{(N \times N) \times (M \times N)}, \qquad \frac{\mathrm{d}K_{pq}}{\mathrm{d}\mathbf{R}} \in \mathbb{R}^{1 \times M \times N}$$
  for $p, q = 1, \ldots, N$, where $K_{pq}$ is the $(p, q)$th entry of $\mathbf{K} = \mathbf{f}(\mathbf{R})$.
• Denoting the $i$th column of $\mathbf{R}$ by $\mathbf{r}_i$, every entry of $\mathbf{K}$ is given by the dot product of two columns of $\mathbf{R}$, i.e.,
  $$K_{pq} = \mathbf{r}_p^\top \mathbf{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq}$$
Example – Gradient of Matrices with Respect to Matrices
• We now compute the partial derivative $\frac{\partial K_{pq}}{\partial R_{ij}}$ and obtain
  $$\frac{\partial K_{pq}}{\partial R_{ij}} = \sum_{m=1}^{M} \frac{\partial}{\partial R_{ij}} \left( R_{mp} R_{mq} \right) = \partial_{pqij}$$
  $$\partial_{pqij} = \begin{cases} R_{iq} & \text{if } j = p,\; p \neq q \\ R_{ip} & \text{if } j = q,\; p \neq q \\ 2 R_{iq} & \text{if } j = p,\; p = q \\ 0 & \text{otherwise} \end{cases}$$
• The desired gradient has the dimension $(N \times N) \times (M \times N)$, and every single entry of this tensor is given by $\partial_{pqij}$, where $p, q, j = 1, \ldots, N$ and $i = 1, \ldots, M$.
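• The case distinction for $\partial_{pqij}$ can be checked against finite differences (a sketch, not from the slides; NumPy assumed, random $\mathbf{R}$ as test data):

import numpy as np

rng = np.random.default_rng(3)
M, N, h = 4, 3, 1e-6
R = rng.standard_normal((M, N))

def d_pqij(p, q, i, j):
    # Closed-form partial derivative of K_pq = (R^T R)_pq with respect to R_ij.
    if j == p and p != q:
        return R[i, q]
    if j == q and p != q:
        return R[i, p]
    if j == p and p == q:
        return 2 * R[i, q]
    return 0.0

K = R.T @ R
ok = True
for p in range(N):
    for q in range(N):
        for i in range(M):
            for j in range(N):
                E = np.zeros((M, N)); E[i, j] = h
                num = (((R + E).T @ (R + E))[p, q] - K[p, q]) / h
                ok &= abs(num - d_pqij(p, q, i, j)) < 1e-4
print(ok)   # True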
5.5 Useful Identities for Computing Gradients
• Some useful gradients that are frequently required in machine learning
• Notation: $\operatorname{tr}(\cdot)$ denotes the trace, $\det(\cdot)$ the determinant, and $\mathbf{f}(\mathbf{X})^{-1}$ the inverse of $\mathbf{f}(\mathbf{X})$.
  $$\frac{\partial \mathbf{x}^\top \mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}^\top$$
  $$\frac{\partial \mathbf{a}^\top \mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}^\top$$
  $$\frac{\partial \mathbf{a}^\top \mathbf{X} \mathbf{b}}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^\top$$
  $$\frac{\partial \mathbf{x}^\top \mathbf{B} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^\top (\mathbf{B} + \mathbf{B}^\top)$$
  $$\frac{\partial}{\partial \mathbf{s}} (\mathbf{x} - \mathbf{A}\mathbf{s})^\top \mathbf{W} (\mathbf{x} - \mathbf{A}\mathbf{s}) = -2 (\mathbf{x} - \mathbf{A}\mathbf{s})^\top \mathbf{W} \mathbf{A} \quad \text{for symmetric } \mathbf{W}$$
You should be able to calculate these gradients
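• Two of these identities checked numerically (a sketch, not from the slides; NumPy assumed, random test data):

import numpy as np

rng = np.random.default_rng(4)
n, h = 4, 1e-6
x = rng.standard_normal(n)
B = rng.standard_normal((n, n))

# d(x^T B x)/dx = x^T (B + B^T), checked via finite differences.
f = lambda v: v @ B @ v
numeric = np.array([(f(x + h*np.eye(n)[i]) - f(x)) / h for i in range(n)])
print(np.allclose(numeric, x @ (B + B.T), atol=1e-4))   # True

# d(a^T x)/dx = a^T.
a = rng.standard_normal(n)
g = lambda v: a @ v
numeric = np.array([(g(x + h*np.eye(n)[i]) - g(x)) / h for i in range(n)])
print(np.allclose(numeric, a, atol=1e-4))               # True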