
Vector Calculus

Liang Zheng
Australian National University
liang. .au

5 Vector Calculus

• We discuss functions

$f : \mathbb{R}^D \to \mathbb{R}$

$\boldsymbol{x} \mapsto f(\boldsymbol{x})$

where $\mathbb{R}^D$ is the domain of $f$, and the function values $f(\boldsymbol{x})$ are the
image/codomain of $f$.

• Example (dot product)

• Previously, we wrote the dot product as

$f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{x}, \quad \boldsymbol{x} \in \mathbb{R}^2$

• In this chapter, we write it as

$f : \mathbb{R}^2 \to \mathbb{R}$

$\boldsymbol{x} \mapsto x_1^2 + x_2^2$

5.1 Differentiation of Univariate Functions

• Given $y = f(x)$, the difference quotient is defined as

$\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}$

• It computes the slope of the secant line through two points on the graph of $f$: the points with $x$-coordinates $x_0$ and $x_0 + \delta x$.

• In the limit for 𝛿𝑥 → 0, we obtain the tangent of 𝑓 at 𝑥 (if 𝑓 is differentiable).
The tangent is then the derivative of 𝑓 at 𝑥.

5.1 Differentiation of Univariate Functions

• For $h > 0$, the derivative of $f$ at $x$ is defined as the limit

$\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

• The derivative of 𝑓 points in the direction of steepest ascent of 𝑓.

• Example – Derivative of a Polynomial

• Compute the derivative of $f(x) = x^n$, $n \in \mathbb{N}$. (From our high-school knowledge, the derivative is $n x^{n-1}$.)

$\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = \lim_{h \to 0} \frac{(x+h)^n - x^n}{h} = \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^i - x^n}{h}$

We see that $x^n = \binom{n}{0} x^{n-0} h^0$. By starting the sum at 1, the $x^n$ cancels, and we obtain

$\frac{\mathrm{d}f}{\mathrm{d}x} = \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^i}{h} = \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} = \binom{n}{1} x^{n-1} = n x^{n-1},$

since every term with $i \geq 2$ still contains a factor of $h$ and vanishes in the limit.
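• As a quick numerical sanity check (not part of the slides; the values of $n$, $x$, and the step size $h$ are illustrative), a minimal NumPy sketch comparing the difference quotient of $f(x) = x^n$ with the analytic result $n x^{n-1}$:

```python
import numpy as np

# Compare the difference quotient of f(x) = x^n with the analytic derivative n*x^(n-1).
n, x, h = 5, 1.7, 1e-6

difference_quotient = ((x + h)**n - x**n) / h   # (f(x+h) - f(x)) / h for small h
analytic = n * x**(n - 1)

print(difference_quotient, analytic)                           # the two values agree closely
print(np.isclose(difference_quotient, analytic, rtol=1e-4))    # True
```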

You will need the Taylor expansion of exp(x) to do this: $\exp(x) = \sum_{k=0}^{\infty} \frac{x^k}{k!}$

5.1.2 Differentiation Rules

• Product rule:

$(f(x)g(x))' = f'(x)g(x) + f(x)g'(x)$

• Quotient rule:

$\left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}$

• Sum rule:

$(f(x) + g(x))' = f'(x) + g'(x)$

• Chain rule:

$(g(f(x)))' = (g \circ f)'(x) = g'(f(x))\,f'(x)$

Here, $g \circ f$ denotes function composition: $(g \circ f)(x) = g(f(x))$
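• As an aside (not from the slides), a minimal NumPy sketch checking the product and chain rules against a central finite-difference approximation; the chosen functions ($\sin$, $\exp$) and the helper name `central_diff` are illustrative:

```python
import numpy as np

def central_diff(func, x, h=1e-5):
    """Central finite-difference approximation of d func / dx at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

f, df = np.sin, np.cos   # f(x) = sin(x), f'(x) = cos(x)
g, dg = np.exp, np.exp   # g(x) = exp(x), g'(x) = exp(x)
x = 0.7

# Product rule: (f g)' = f' g + f g'
print(np.isclose(central_diff(lambda t: f(t) * g(t), x),
                 df(x) * g(x) + f(x) * dg(x)))          # True

# Chain rule: (g ∘ f)' = g'(f(x)) f'(x)
print(np.isclose(central_diff(lambda t: g(f(t)), x),
                 dg(f(x)) * df(x)))                     # True
```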

5.2 Partial Differentiation and Gradients

• Instead of considering $x \in \mathbb{R}$, we consider $\boldsymbol{x} \in \mathbb{R}^n$, e.g., $f(\boldsymbol{x}) = f(x_1, x_2)$.

• The generalization of the derivative to functions of several variables is the
gradient.

• We find the gradient of the function 𝑓 with respect to 𝒙 by
• varying one variable at a time and keeping the others constant.

• The gradient is the collection of these partial derivatives.

• For a function $f : \mathbb{R}^n \to \mathbb{R}$, $\boldsymbol{x} \mapsto f(\boldsymbol{x})$, $\boldsymbol{x} \in \mathbb{R}^n$ of $n$ variables $x_1, \ldots, x_n$, we define the partial derivatives as

$\frac{\partial f}{\partial x_1} := \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(\boldsymbol{x})}{h}$

$\vdots$

$\frac{\partial f}{\partial x_n} := \lim_{h \to 0} \frac{f(x_1, \ldots, x_{n-1}, x_n + h) - f(\boldsymbol{x})}{h}$

and collect them in the row vector

$\nabla_{\boldsymbol{x}} f = \operatorname{grad} f = \frac{\mathrm{d}f}{\mathrm{d}\boldsymbol{x}} = \left[ \frac{\partial f(\boldsymbol{x})}{\partial x_1} \;\; \frac{\partial f(\boldsymbol{x})}{\partial x_2} \;\; \cdots \;\; \frac{\partial f(\boldsymbol{x})}{\partial x_n} \right] \in \mathbb{R}^{1 \times n}$

5.2 Partial Differentiation and Gradients

• $\nabla_{\boldsymbol{x}} f = \operatorname{grad} f = \frac{\mathrm{d}f}{\mathrm{d}\boldsymbol{x}} = \left[ \frac{\partial f(\boldsymbol{x})}{\partial x_1} \;\; \frac{\partial f(\boldsymbol{x})}{\partial x_2} \;\; \cdots \;\; \frac{\partial f(\boldsymbol{x})}{\partial x_n} \right] \in \mathbb{R}^{1 \times n}$

• $n$ is the number of variables and 1 is the dimension of the image/range/codomain of $f$.

• The row vector $\nabla_{\boldsymbol{x}} f \in \mathbb{R}^{1 \times n}$ is called the gradient of $f$ or the Jacobian.

• Example – Partial Derivatives Using the Chain Rule

• For $f(x, y) = (x + 2y^3)^2$, we obtain the partial derivatives

$\frac{\partial f(x, y)}{\partial x} = 2(x + 2y^3) \cdot \frac{\partial}{\partial x}(x + 2y^3) = 2(x + 2y^3)$

$\frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3) \cdot \frac{\partial}{\partial y}(x + 2y^3) = 12(x + 2y^3)\,y^2$
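• A minimal NumPy sketch (not from the slides; test point and step size are illustrative) checking these two partial derivatives against central finite differences, varying one variable at a time:

```python
import numpy as np

# f(x, y) = (x + 2 y^3)^2 and its analytic partial derivatives
f     = lambda x, y: (x + 2 * y**3) ** 2
df_dx = lambda x, y: 2 * (x + 2 * y**3)
df_dy = lambda x, y: 12 * (x + 2 * y**3) * y**2

x, y, h = 1.3, -0.4, 1e-6

num_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # vary x, keep y constant
num_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # vary y, keep x constant

print(np.isclose(num_dx, df_dx(x, y)))  # True
print(np.isclose(num_dy, df_dy(x, y)))  # True
```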

5.2 Partial Differentiation and Gradients

• For $f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$, the partial derivatives (i.e., the derivatives of $f$ with respect to $x_1$ and $x_2$) are

$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3$

$\frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2$

• and the gradient is then

$\frac{\mathrm{d}f}{\mathrm{d}\boldsymbol{x}} = \left[ \frac{\partial f(x_1, x_2)}{\partial x_1} \;\; \frac{\partial f(x_1, x_2)}{\partial x_2} \right] = \left[ 2 x_1 x_2 + x_2^3 \;\;\; x_1^2 + 3 x_1 x_2^2 \right] \in \mathbb{R}^{1 \times 2}$
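• A short NumPy check (illustrative, not part of the slides): the numerically estimated gradient row vector agrees with the analytic one above.

```python
import numpy as np

f = lambda x1, x2: x1**2 * x2 + x1 * x2**3

def numeric_gradient(f, x1, x2, h=1e-6):
    """Row vector of central-difference partial derivatives of f at (x1, x2)."""
    return np.array([(f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h),
                     (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)])

x1, x2 = 0.8, -1.5
analytic = np.array([2 * x1 * x2 + x2**3, x1**2 + 3 * x1 * x2**2])

print(np.allclose(numeric_gradient(f, x1, x2), analytic))  # True
```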

5.2.1 Basic Rules of Partial Differentiation

• Product rule:

$\frac{\partial}{\partial \boldsymbol{x}} \big( f(\boldsymbol{x}) g(\boldsymbol{x}) \big) = \frac{\partial f}{\partial \boldsymbol{x}} g(\boldsymbol{x}) + f(\boldsymbol{x}) \frac{\partial g}{\partial \boldsymbol{x}}$

• Sum rule:

$\frac{\partial}{\partial \boldsymbol{x}} \big( f(\boldsymbol{x}) + g(\boldsymbol{x}) \big) = \frac{\partial f}{\partial \boldsymbol{x}} + \frac{\partial g}{\partial \boldsymbol{x}}$

• Chain rule:

$\frac{\partial}{\partial \boldsymbol{x}} (g \circ f)(\boldsymbol{x}) = \frac{\partial}{\partial \boldsymbol{x}} \big( g(f(\boldsymbol{x})) \big) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial \boldsymbol{x}}$

5.2.2 Chain Rule
• Consider a function $f : \mathbb{R}^2 \to \mathbb{R}$ of two variables $x_1$ and $x_2$.

• $x_1(t)$ and $x_2(t)$ are themselves functions of $t$.

• To compute the gradient of $f$ with respect to $t$, we apply the chain rule:

$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial t} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1(t)}{\partial t} \\ \frac{\partial x_2(t)}{\partial t} \end{bmatrix} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}$

where $\mathrm{d}$ denotes the gradient (total derivative) and $\partial$ partial derivatives.

• Example

• Consider $f(x_1, x_2) = x_1^2 + 2 x_2$, where $x_1 = \sin t$ and $x_2 = \cos t$. Then

$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} = 2 \sin t \, \frac{\partial \sin t}{\partial t} + 2 \, \frac{\partial \cos t}{\partial t} = 2 \sin t \cos t - 2 \sin t = 2 \sin t \,(\cos t - 1)$

• The above is the corresponding derivative of 𝑓 with respect to 𝑡.
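• A minimal NumPy check of this result (illustrative test point, not part of the slides): differentiate $h(t) = f(x_1(t), x_2(t))$ numerically and compare with $2 \sin t (\cos t - 1)$.

```python
import numpy as np

f = lambda x1, x2: x1**2 + 2 * x2
h_of_t = lambda t: f(np.sin(t), np.cos(t))          # f as a function of t
dh_dt  = lambda t: 2 * np.sin(t) * (np.cos(t) - 1)  # chain-rule result from the slide

t, h = 1.1, 1e-6
numeric = (h_of_t(t + h) - h_of_t(t - h)) / (2 * h)

print(np.isclose(numeric, dh_dt(t)))  # True
```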

5.2.2 Chain Rule

• If $f(x_1, x_2)$ is a function of $x_1$ and $x_2$, where $f : \mathbb{R}^2 \to \mathbb{R}$, and $x_1(s, t)$ and $x_2(s, t)$ are themselves functions of two variables $s$ and $t$, the chain rule yields the partial derivatives

$\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s}$

$\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}$

• The gradient can be obtained by the matrix multiplication

$\frac{\mathrm{d}f}{\mathrm{d}(s, t)} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial (s, t)} = \underbrace{\left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right]}_{= \frac{\partial f}{\partial \boldsymbol{x}}} \underbrace{\begin{bmatrix} \frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial \boldsymbol{x}}{\partial (s, t)}}$
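• A small NumPy sketch of this matrix-multiplication view, under illustrative assumptions not taken from the slides: $f(x_1, x_2) = x_1^2 + 2x_2$ with $x_1 = s \cos t$ and $x_2 = s \sin t$. The $1 \times 2$ gradient times the $2 \times 2$ Jacobian matches a direct finite-difference gradient with respect to $(s, t)$.

```python
import numpy as np

# Illustrative choices: f(x1, x2) = x1^2 + 2 x2, x1 = s*cos(t), x2 = s*sin(t)
f_of_st = lambda s, t: (s * np.cos(t))**2 + 2 * (s * np.sin(t))

def grad_via_chain_rule(s, t):
    x1 = s * np.cos(t)
    df_dx  = np.array([[2 * x1, 2.0]])                  # 1x2 row vector df/dx
    dx_dst = np.array([[np.cos(t), -s * np.sin(t)],     # 2x2 Jacobian dx/d(s,t)
                       [np.sin(t),  s * np.cos(t)]])
    return df_dx @ dx_dst                               # 1x2 row vector df/d(s,t)

s, t, h = 1.4, 0.6, 1e-6
numeric = np.array([[(f_of_st(s + h, t) - f_of_st(s - h, t)) / (2 * h),
                     (f_of_st(s, t + h) - f_of_st(s, t - h)) / (2 * h)]])

print(np.allclose(grad_via_chain_rule(s, t), numeric))  # True
```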

5.3 Gradients of Vector-Valued Functions

• We discussed partial derivatives and gradients of functions $f : \mathbb{R}^n \to \mathbb{R}$.

• We will generalize the concept of the gradient to vector-valued functions (vector fields) $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $n \geq 1$ and $m > 1$.

• For a function $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ and a vector $\boldsymbol{x} = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n$, the corresponding vector of function values is given as

$\boldsymbol{f}(\boldsymbol{x}) = \begin{bmatrix} f_1(\boldsymbol{x}) \\ \vdots \\ f_m(\boldsymbol{x}) \end{bmatrix} \in \mathbb{R}^m$

• Writing the vector-valued function in this way allows us to view a vector-valued function $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ as a vector of functions $[f_1, \ldots, f_m]^\top$, $f_i : \mathbb{R}^n \to \mathbb{R}$, that map onto $\mathbb{R}$.

• The differentiation rules for every $f_i$ are exactly the ones we discussed before.

5.3 Gradients of Vector-Valued Functions

• The partial derivative of a vector-valued function $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x_i \in \mathbb{R}$, $i = 1, \ldots, n$, is given as the vector

$\frac{\partial \boldsymbol{f}}{\partial x_i} = \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} = \begin{bmatrix} \lim_{h \to 0} \frac{f_1(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_1(\boldsymbol{x})}{h} \\ \vdots \\ \lim_{h \to 0} \frac{f_m(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_m(\boldsymbol{x})}{h} \end{bmatrix} \in \mathbb{R}^m$

• In the above, every partial derivative $\frac{\partial \boldsymbol{f}}{\partial x_i}$ is a column vector.

• Recall that the gradient of $f$ with respect to a vector is the row vector of the partial derivatives.

• Therefore, we obtain the gradient of $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $\boldsymbol{x} \in \mathbb{R}^n$ by collecting these partial derivatives:

$\frac{\mathrm{d}\boldsymbol{f}(\boldsymbol{x})}{\mathrm{d}\boldsymbol{x}} = \left[ \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_1} \;\; \cdots \;\; \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial f_1(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\boldsymbol{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\boldsymbol{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$

5.3 Gradients of Vector-Valued Functions

• The collection of all first-order partial derivatives of a vector-valued function $\boldsymbol{f} : \mathbb{R}^n \to \mathbb{R}^m$ is called the Jacobian. The Jacobian $\boldsymbol{J}$ is an $m \times n$ matrix, which we define and arrange as follows:

$\boldsymbol{J} = \nabla_{\boldsymbol{x}} \boldsymbol{f} = \frac{\mathrm{d}\boldsymbol{f}(\boldsymbol{x})}{\mathrm{d}\boldsymbol{x}} = \left[ \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_1} \;\; \cdots \;\; \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_n} \right] = \begin{bmatrix} \frac{\partial f_1(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\boldsymbol{x})}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(\boldsymbol{x})}{\partial x_1} & \cdots & \frac{\partial f_m(\boldsymbol{x})}{\partial x_n} \end{bmatrix}$

$\boldsymbol{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad J(i, j) = \frac{\partial f_i}{\partial x_j}$

• The elements of $\boldsymbol{f}$ define the rows and the elements of $\boldsymbol{x}$ define the columns of the corresponding Jacobian.

• Special case: for a function $f : \mathbb{R}^n \to \mathbb{R}^1$ which maps a vector $\boldsymbol{x} \in \mathbb{R}^n$ onto a scalar, i.e., $m = 1$, the Jacobian is a row vector of dimension $1 \times n$.

5.3 Gradients of Vector-Valued Functions

• If $f : \mathbb{R} \to \mathbb{R}$, the gradient is a scalar.

• If $f : \mathbb{R}^D \to \mathbb{R}$, the gradient is a $1 \times D$ row vector.

• If $\boldsymbol{f} : \mathbb{R} \to \mathbb{R}^E$, the gradient is an $E \times 1$ column vector.

• If $\boldsymbol{f} : \mathbb{R}^D \to \mathbb{R}^E$, the gradient is an $E \times D$ matrix.

Example – Gradient of a Vector-Valued Function

• We are given $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}$, $\boldsymbol{f}(\boldsymbol{x}) \in \mathbb{R}^M$, $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, $\boldsymbol{x} \in \mathbb{R}^N$.

• To compute the gradient $\mathrm{d}\boldsymbol{f}/\mathrm{d}\boldsymbol{x}$, we first determine its dimension: since $\boldsymbol{f} : \mathbb{R}^N \to \mathbb{R}^M$, it follows that $\mathrm{d}\boldsymbol{f}/\mathrm{d}\boldsymbol{x} \in \mathbb{R}^{M \times N}$.

• Then, we determine the partial derivatives of $\boldsymbol{f}$ with respect to every $x_j$:

$f_i(\boldsymbol{x}) = \sum_{j=1}^{N} A_{ij} x_j \;\;\Rightarrow\;\; \frac{\partial f_i}{\partial x_j} = A_{ij}$

• We collect the partial derivatives in the Jacobian and obtain the gradient

$\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = \boldsymbol{A} \in \mathbb{R}^{M \times N}$
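• A minimal NumPy sketch (sizes and random seed are illustrative, not from the slides) confirming that the numerically estimated Jacobian of $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}$ equals $\boldsymbol{A}$:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4
A = rng.normal(size=(M, N))

f = lambda x: A @ x   # f(x) = A x, f: R^N -> R^M

def numeric_jacobian(f, x, h=1e-6):
    """M x N matrix of central-difference partial derivatives df_i/dx_j."""
    J = np.zeros((len(f(x)), len(x)))
    for j in range(len(x)):
        e = np.zeros(len(x)); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = rng.normal(size=N)
print(np.allclose(numeric_jacobian(f, x), A))  # True: the Jacobian of Ax is A
```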

Example – Chain Rule

• Consider the function $h : \mathbb{R} \to \mathbb{R}$, $h(t) = (f \circ g)(t)$ with

$f : \mathbb{R}^2 \to \mathbb{R}, \quad g : \mathbb{R} \to \mathbb{R}^2$

$f(\boldsymbol{x}) = \exp(x_1 x_2^2), \quad \boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix}$

• We compute the gradient of $h$ with respect to $t$. Since $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$, we note that

$\frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{1 \times 2}, \quad \frac{\partial g}{\partial t} \in \mathbb{R}^{2 \times 1}$

• The desired gradient is computed by applying the chain rule:

$\frac{\mathrm{d}h}{\mathrm{d}t} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial t} = \left[ \frac{\partial f}{\partial x_1} \;\; \frac{\partial f}{\partial x_2} \right] \begin{bmatrix} \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial t} \end{bmatrix} = \left[ \exp(x_1 x_2^2)\, x_2^2 \;\;\; 2 \exp(x_1 x_2^2)\, x_1 x_2 \right] \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix}$

$= \exp(x_1 x_2^2) \big( x_2^2 (\cos t - t \sin t) + 2 x_1 x_2 (\sin t + t \cos t) \big)$

where $x_1 = t \cos t$ and $x_2 = t \sin t$.
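• A quick NumPy check of this chain-rule result (the test point is illustrative, not from the slides):

```python
import numpy as np

f = lambda x: np.exp(x[0] * x[1]**2)
g = lambda t: np.array([t * np.cos(t), t * np.sin(t)])
h_fn = lambda t: f(g(t))                       # h = f ∘ g

def dh_dt(t):
    """Chain-rule expression from the slide."""
    x1, x2 = g(t)
    return np.exp(x1 * x2**2) * (x2**2 * (np.cos(t) - t * np.sin(t))
                                 + 2 * x1 * x2 * (np.sin(t) + t * np.cos(t)))

t, eps = 0.9, 1e-6
numeric = (h_fn(t + eps) - h_fn(t - eps)) / (2 * eps)

print(np.isclose(numeric, dh_dt(t)))  # True
```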

Example – Gradient of a Least-Squares Loss in a Linear Model

• Let us consider the linear model

$\boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{\theta}$

where $\boldsymbol{\theta} \in \mathbb{R}^D$ is a parameter vector, $\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$ are input features, and $\boldsymbol{y} \in \mathbb{R}^N$ are the corresponding observations. We define the functions

$L(\boldsymbol{e}) := \|\boldsymbol{e}\|^2, \quad \boldsymbol{e}(\boldsymbol{\theta}) := \boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}$

• We seek $\frac{\partial L}{\partial \boldsymbol{\theta}}$, and we will use the chain rule for this purpose. $L$ is called a least-squares loss function.

• First, we determine the dimensionality of the gradient as

$\frac{\partial L}{\partial \boldsymbol{\theta}} \in \mathbb{R}^{1 \times D}$

• The chain rule allows us to compute the gradient as

$\frac{\partial L}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \boldsymbol{e}} \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}}$

Example – Gradient of a Least-Squares Loss in a Linear Model

• We know that $\|\boldsymbol{e}\|^2 = \boldsymbol{e}^\top \boldsymbol{e}$ and determine

$\frac{\partial L}{\partial \boldsymbol{e}} = 2\boldsymbol{e}^\top \in \mathbb{R}^{1 \times N}$

• Further, we obtain

$\frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}} = -\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$

• Our desired derivative is

$\frac{\partial L}{\partial \boldsymbol{\theta}} = -2\boldsymbol{e}^\top \boldsymbol{\Phi} = -\underbrace{2(\boldsymbol{y}^\top - \boldsymbol{\theta}^\top \boldsymbol{\Phi}^\top)}_{1 \times N} \underbrace{\boldsymbol{\Phi}}_{N \times D} \in \mathbb{R}^{1 \times D}$
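• A minimal NumPy sketch (random data and sizes are illustrative, not from the slides) checking the least-squares gradient $-2\boldsymbol{e}^\top \boldsymbol{\Phi}$ against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 3
Phi   = rng.normal(size=(N, D))
y     = rng.normal(size=N)
theta = rng.normal(size=D)

L = lambda th: np.sum((y - Phi @ th) ** 2)     # least-squares loss ||e||^2

# Analytic gradient from the slides: dL/dtheta = -2 e^T Phi
e = y - Phi @ theta
analytic = -2 * e @ Phi                        # shape (D,), read as a 1xD row vector

# Central finite differences, one parameter at a time
h = 1e-6
numeric = np.array([(L(theta + h * np.eye(D)[j]) - L(theta - h * np.eye(D)[j])) / (2 * h)
                    for j in range(D)])

print(np.allclose(analytic, numeric))  # True
```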

5.4 Gradients of Matrices
• Consider the following example

$\boldsymbol{f} = \boldsymbol{A}\boldsymbol{x}, \quad \boldsymbol{f} \in \mathbb{R}^M, \; \boldsymbol{A} \in \mathbb{R}^{M \times N}, \; \boldsymbol{x} \in \mathbb{R}^N$

• We seek the gradient $\frac{\mathrm{d}\boldsymbol{f}}{\mathrm{d}\boldsymbol{A}}$.

• First, we determine the dimension of the gradient:

$\frac{\mathrm{d}\boldsymbol{f}}{\mathrm{d}\boldsymbol{A}} \in \mathbb{R}^{M \times (M \times N)}$

• By definition, the gradient is the collection of the partial derivatives:

$\frac{\mathrm{d}\boldsymbol{f}}{\mathrm{d}\boldsymbol{A}} = \begin{bmatrix} \frac{\partial f_1}{\partial \boldsymbol{A}} \\ \vdots \\ \frac{\partial f_M}{\partial \boldsymbol{A}} \end{bmatrix}, \quad \frac{\partial f_i}{\partial \boldsymbol{A}} \in \mathbb{R}^{1 \times (M \times N)}$

• To compute the partial derivatives, we explicitly write out the matrix–vector multiplication

$f_i = \sum_{j=1}^{N} A_{ij} x_j, \quad i = 1, \ldots, M$

• The partial derivatives are then given as

$\frac{\partial f_i}{\partial A_{iq}} = x_q$

• Partial derivatives of $f_i$ with respect to a row of $\boldsymbol{A}$ are given as

$\frac{\partial f_i}{\partial A_{i,:}} = \boldsymbol{x}^\top \in \mathbb{R}^{1 \times 1 \times N}, \quad \frac{\partial f_i}{\partial A_{k \neq i,:}} = \boldsymbol{0}^\top \in \mathbb{R}^{1 \times 1 \times N}$

• Since $f_i$ maps onto $\mathbb{R}$ and each row of $\boldsymbol{A}$ is of size $1 \times N$, we obtain a $1 \times 1 \times N$-sized tensor as the partial derivative of $f_i$ with respect to a row of $\boldsymbol{A}$.

• We stack the partial derivatives and get the desired gradient

$\frac{\partial f_i}{\partial \boldsymbol{A}} = \begin{bmatrix} \boldsymbol{0}^\top \\ \vdots \\ \boldsymbol{0}^\top \\ \boldsymbol{x}^\top \\ \boldsymbol{0}^\top \\ \vdots \\ \boldsymbol{0}^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)}$
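• A short NumPy check (sizes, seed, and the probed indices are illustrative): perturbing the single entry $A_{iq}$ changes $f_i$ at rate $x_q$, while rows $k \neq i$ are unaffected.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 4
A = rng.normal(size=(M, N))
x = rng.normal(size=N)

i, q, h = 1, 2, 1e-6

# Perturb the single entry A[i, q] and look at the effect on f = A x
A_plus, A_minus = A.copy(), A.copy()
A_plus[i, q]  += h
A_minus[i, q] -= h
numeric = ((A_plus @ x)[i] - (A_minus @ x)[i]) / (2 * h)

print(np.isclose(numeric, x[q]))   # True: df_i / dA_iq = x_q
print(np.isclose(((A_plus @ x)[0] - (A_minus @ x)[0]) / (2 * h), 0.0))  # rows k != i unaffected
```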

Example – Gradient of Matrices with Respect to Matrices

• Consider a matrix $\boldsymbol{R} \in \mathbb{R}^{M \times N}$ and $\boldsymbol{f} : \mathbb{R}^{M \times N} \to \mathbb{R}^{N \times N}$ with

$\boldsymbol{f}(\boldsymbol{R}) = \boldsymbol{R}^\top \boldsymbol{R} =: \boldsymbol{K} \in \mathbb{R}^{N \times N}$

• We seek the gradient $\frac{\mathrm{d}\boldsymbol{K}}{\mathrm{d}\boldsymbol{R}}$.

• First, the dimension of the gradient is given as

$\frac{\mathrm{d}\boldsymbol{K}}{\mathrm{d}\boldsymbol{R}} \in \mathbb{R}^{(N \times N) \times (M \times N)}, \quad \frac{\mathrm{d}K_{pq}}{\mathrm{d}\boldsymbol{R}} \in \mathbb{R}^{1 \times M \times N}$

for $p, q = 1, \ldots, N$, where $K_{pq}$ is the $(p, q)$th entry of $\boldsymbol{K} = \boldsymbol{f}(\boldsymbol{R})$.

• Denoting the $i$th column of $\boldsymbol{R}$ by $\boldsymbol{r}_i$, every entry of $\boldsymbol{K}$ is given by the dot product of two columns of $\boldsymbol{R}$, i.e.,

$K_{pq} = \boldsymbol{r}_p^\top \boldsymbol{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq}$

Example – Gradient of Matrices with Respect to Matrices

• Denoting the $i$th column of $\boldsymbol{R}$ by $\boldsymbol{r}_i$, every entry of $\boldsymbol{K}$ is given by the dot product of two columns of $\boldsymbol{R}$, i.e.,

$K_{pq} = \boldsymbol{r}_p^\top \boldsymbol{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq}$

• We now compute the partial derivative $\frac{\partial K_{pq}}{\partial R_{ij}}$ and obtain

$\frac{\partial K_{pq}}{\partial R_{ij}} = \sum_{m=1}^{M} \frac{\partial}{\partial R_{ij}} R_{mp} R_{mq} = \partial_{pqij}$

$\partial_{pqij} = \begin{cases} R_{iq} & \text{if } j = p,\; p \neq q \\ R_{ip} & \text{if } j = q,\; p \neq q \\ 2R_{iq} & \text{if } j = p,\; p = q \\ 0 & \text{otherwise} \end{cases}$

• The desired gradient has the dimension $(N \times N) \times (M \times N)$, and every single entry of this tensor is given by $\partial_{pqij}$, where $p, q, j = 1, \ldots, N$ and $i = 1, \ldots, M$.
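• A brute-force NumPy check of the case distinction above (sizes and seed are illustrative): every entry $\partial_{pqij}$ is compared with a finite-difference estimate of $\partial K_{pq} / \partial R_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3
R = rng.normal(size=(M, N))

def dK_dR_entry(R, p, q, i, j):
    """Analytic partial derivative dK_pq / dR_ij from the case distinction."""
    if j == p and p != q:
        return R[i, q]
    if j == q and p != q:
        return R[i, p]
    if j == p and p == q:
        return 2 * R[i, q]
    return 0.0

# Compare against central finite differences for every (p, q, i, j)
h, ok = 1e-6, True
for p in range(N):
    for q in range(N):
        for i in range(M):
            for j in range(N):
                Rp, Rm = R.copy(), R.copy()
                Rp[i, j] += h
                Rm[i, j] -= h
                numeric = ((Rp.T @ Rp)[p, q] - (Rm.T @ Rm)[p, q]) / (2 * h)
                ok &= np.isclose(numeric, dK_dR_entry(R, p, q, i, j))

print(ok)  # True
```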

5.5 Useful Identities for Computing Gradients

• Some useful gradients that are frequently required in machine learning:

• Notation: $\operatorname{tr}(\cdot)$ is the trace, $\det(\cdot)$ the determinant, and $\boldsymbol{f}(\boldsymbol{X})^{-1}$ the inverse of $\boldsymbol{f}(\boldsymbol{X})$.

$\frac{\partial \boldsymbol{x}^\top \boldsymbol{a}}{\partial \boldsymbol{x}} = \boldsymbol{a}^\top$

$\frac{\partial \boldsymbol{a}^\top \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{a}^\top$

$\frac{\partial \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}}{\partial \boldsymbol{X}} = \boldsymbol{a}\boldsymbol{b}^\top$

$\frac{\partial \boldsymbol{x}^\top \boldsymbol{B} \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{x}^\top (\boldsymbol{B} + \boldsymbol{B}^\top)$

$\frac{\partial}{\partial \boldsymbol{s}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = -2(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} \boldsymbol{A} \quad \text{for symmetric } \boldsymbol{W}$

You should be able to calculate these gradients
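• If it helps when practising, here is a minimal NumPy sketch (random test data, not part of the slides) verifying two of the identities above numerically; the remaining ones can be checked the same way.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
a = rng.normal(size=n)
B = rng.normal(size=(n, n))
x = rng.normal(size=n)

def numeric_grad(f, x, h=1e-6):
    """Row vector of central-difference partial derivatives of a scalar function f."""
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# d(x^T a)/dx = a^T
print(np.allclose(numeric_grad(lambda v: v @ a, x), a))                  # True

# d(x^T B x)/dx = x^T (B + B^T)
print(np.allclose(numeric_grad(lambda v: v @ B @ v, x), x @ (B + B.T)))  # True
```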