
Vector Calculus
Australian National University

5 Vector Calculus
• We discuss functions
$$f : \mathbb{R}^D \to \mathbb{R}, \quad \mathbf{x} \mapsto f(\mathbf{x})$$
where $\mathbb{R}^D$ is the domain of $f$, and the function values $f(\mathbf{x})$ are the image/codomain of $f$.
• Example (dot product)
• Previously, we wrote the dot product as
$$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{x}, \quad \mathbf{x} \in \mathbb{R}^2.$$
• In this chapter, we write it as
$$f : \mathbb{R}^2 \to \mathbb{R}, \quad \mathbf{x} \mapsto x_1^2 + x_2^2.$$

5.1 Differentiation of Univariate Functions
• Given $y = f(x)$, the difference quotient is defined as
$$\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}$$
• It computes the slope of the secant line through two points on the graph of $f$. In the accompanying figure (not shown here), these are the points with $x$-coordinates $x_0$ and $x_0 + \delta x_0$.
• In the limit $\delta x \to 0$, we obtain the tangent of $f$ at $x$ (if $f$ is differentiable). The slope of the tangent is then the derivative of $f$ at $x$.

5.1 Differentiation of Univariate Functions
• For h > 0, the derivative of 𝑓 at 𝑥 is defined as the limit
$$\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
• The derivative of 𝑓 points in the direction of steepest ascent of 𝑓.
• Example – Derivative of a Polynomial
• Compute the derivative of $f(x) = x^n$, $n \in \mathbb{N}$. (From our high-school knowledge, we expect the derivative to be $n x^{n-1}$.)
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{(x + h)^n - x^n}{h} = \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^{i} - x^n}{h}$$
We see that $x^n = \binom{n}{0} x^{n-0} h^{0}$. By starting the sum at $i = 1$, the $x^n$ cancels and we obtain
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} = \binom{n}{1} x^{n-1} = n x^{n-1}.$$
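A quick numerical sanity check of this result (a minimal sketch in Python; the values of $n$, $x$, and the step size are illustrative choices, not from the slides):

```python
# Forward difference quotient should approach n * x**(n - 1) as h shrinks.
def derivative(f, x, h=1e-6):
    """Approximate df/dx via the difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

n, x = 5, 2.0
print(derivative(lambda t: t**n, x))  # approximately 80.0
print(n * x**(n - 1))                 # exactly 80.0
```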

• Exercise: compute the derivative of $f(x) = \exp(x)$ from the difference quotient. You will need the series expansion of $\exp(x)$ to do this: $\exp(x) = \sum_{k=0}^{\infty} \frac{x^k}{k!}$.

5.1.2 Differentiation Rules
• Product rule: $(f(x)g(x))' = f'(x)g(x) + f(x)g'(x)$
• Quotient rule: $\left(\dfrac{f(x)}{g(x)}\right)' = \dfrac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}$
• Sum rule: $(f(x) + g(x))' = f'(x) + g'(x)$
• Chain rule: $(g(f(x)))' = (g \circ f)'(x) = g'(f(x))\, f'(x)$. Here, $g \circ f$ denotes function composition $x \mapsto g(f(x))$.
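These rules can be checked numerically with a difference quotient. The following is a minimal sketch; the example functions $f = \sin$ and $g(x) = x^2$ are illustrative choices, not from the slides:

```python
import math

def derivative(f, x, h=1e-6):
    # Central difference quotient, accurate to O(h^2).
    return (f(x + h) - f(x - h)) / (2 * h)

f, g, x = math.sin, lambda t: t**2, 1.3

# Product rule: (f g)' = f' g + f g'
lhs = derivative(lambda t: f(t) * g(t), x)
rhs = derivative(f, x) * g(x) + f(x) * derivative(g, x)
print(abs(lhs - rhs) < 1e-5)  # True

# Chain rule: (g o f)' = g'(f(x)) f'(x)
lhs = derivative(lambda t: g(f(t)), x)
rhs = derivative(g, f(x)) * derivative(f, x)
print(abs(lhs - rhs) < 1e-5)  # True
```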

5.2 Partial Differentiation and Gradients
• Instead of considering $x \in \mathbb{R}$, we consider $\mathbf{x} \in \mathbb{R}^n$, e.g., $f(\mathbf{x}) = f(x_1, x_2)$.
• The generalization of the derivative to functions of several variables is the
gradient.
• We find the gradient of the function 𝑓 with respect to 𝒙 by
• varying one variable at a time and keeping the others constant.
• The gradient is the collection of these partial derivatives.
• For a function $f : \mathbb{R}^n \to \mathbb{R}$, $\mathbf{x} \mapsto f(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^n$, of $n$ variables, we define the partial derivatives as
$$\frac{\partial f}{\partial x_1} := \lim_{h \to 0} \frac{f(x_1 + h, x_2, \ldots, x_n) - f(\mathbf{x})}{h}$$
$$\vdots$$
$$\frac{\partial f}{\partial x_n} := \lim_{h \to 0} \frac{f(x_1, \ldots, x_{n-1}, x_n + h) - f(\mathbf{x})}{h}$$
and collect them in the row vector
$$\nabla_{\mathbf{x}} f = \operatorname{grad} f = \frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \begin{bmatrix} \dfrac{\partial f(\mathbf{x})}{\partial x_1} & \dfrac{\partial f(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}$$

5.2 Partial Differentiation and Gradients
• $\nabla_{\mathbf{x}} f = \operatorname{grad} f = \dfrac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \begin{bmatrix} \dfrac{\partial f(\mathbf{x})}{\partial x_1} & \dfrac{\partial f(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}$
• $n$ is the number of variables and $1$ is the dimension of the image/range/codomain of $f$.
• The row vector $\nabla_{\mathbf{x}} f \in \mathbb{R}^{1 \times n}$ is called the gradient of $f$ or the Jacobian.
• Example – Partial Derivatives Using the Chain Rule
• For $f(x, y) = (x + 2y^3)^2$, we obtain the partial derivatives
$$\frac{\partial f(x, y)}{\partial x} = 2(x + 2y^3)\, \frac{\partial}{\partial x}(x + 2y^3) = 2(x + 2y^3)$$
$$\frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3)\, \frac{\partial}{\partial y}(x + 2y^3) = 12(x + 2y^3)\, y^2$$
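A small numerical check of these two partial derivatives (a sketch; the evaluation point is an arbitrary illustrative choice):

```python
def partial(f, args, i, h=1e-6):
    """Central-difference approximation of the i-th partial derivative of f."""
    a_plus, a_minus = list(args), list(args)
    a_plus[i] += h
    a_minus[i] -= h
    return (f(*a_plus) - f(*a_minus)) / (2 * h)

f = lambda x, y: (x + 2 * y**3) ** 2
x, y = 0.7, -1.2
print(partial(f, (x, y), 0), 2 * (x + 2 * y**3))           # df/dx
print(partial(f, (x, y), 1), 12 * (x + 2 * y**3) * y**2)   # df/dy
```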

5.2 Partial Differentiation and Gradients
• For $f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3 \in \mathbb{R}$, the partial derivatives (i.e., the derivatives of $f$ with respect to $x_1$ and $x_2$) are
$$\frac{\partial f(x_1, x_2)}{\partial x_1} = 2 x_1 x_2 + x_2^3, \qquad \frac{\partial f(x_1, x_2)}{\partial x_2} = x_1^2 + 3 x_1 x_2^2,$$
• and the gradient is then
$$\frac{\mathrm{d}f}{\mathrm{d}\mathbf{x}} = \begin{bmatrix} \dfrac{\partial f(x_1, x_2)}{\partial x_1} & \dfrac{\partial f(x_1, x_2)}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2 x_1 x_2 + x_2^3 & x_1^2 + 3 x_1 x_2^2 \end{bmatrix} \in \mathbb{R}^{1 \times 2}$$
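The gradient can again be verified numerically (a sketch; the test point is arbitrary):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Row vector of central-difference partial derivatives of a scalar-valued f."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: x[0]**2 * x[1] + x[0] * x[1]**3
x = np.array([1.5, -0.5])
print(numerical_gradient(f, x))
print(np.array([2*x[0]*x[1] + x[1]**3, x[0]**2 + 3*x[0]*x[1]**2]))  # analytic gradient
```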

5.2.1 Basic Rules of Partial Differentiation
• Product rule:
$$\frac{\partial}{\partial \mathbf{x}} \bigl( f(\mathbf{x})\, g(\mathbf{x}) \bigr) = \frac{\partial f}{\partial \mathbf{x}}\, g(\mathbf{x}) + f(\mathbf{x})\, \frac{\partial g}{\partial \mathbf{x}}$$
• Sum rule:
$$\frac{\partial}{\partial \mathbf{x}} \bigl( f(\mathbf{x}) + g(\mathbf{x}) \bigr) = \frac{\partial f}{\partial \mathbf{x}} + \frac{\partial g}{\partial \mathbf{x}}$$
• Chain rule:
$$\frac{\partial}{\partial \mathbf{x}} (g \circ f)(\mathbf{x}) = \frac{\partial}{\partial \mathbf{x}} g\bigl( f(\mathbf{x}) \bigr) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial \mathbf{x}}$$

5.2.2 Chain Rule
• Consider a function $f : \mathbb{R}^2 \to \mathbb{R}$ of two variables $x_1$ and $x_2$.
• $x_1(t)$ and $x_2(t)$ are themselves functions of $t$.
• To compute the gradient of $f$ with respect to $t$, we apply the chain rule:
$$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial t} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \dfrac{\partial x_1(t)}{\partial t} \\[2mm] \dfrac{\partial x_2(t)}{\partial t} \end{bmatrix} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}$$
where $\mathrm{d}$ denotes the total derivative (gradient) and $\partial$ denotes partial derivatives.
• Example
• Consider $f(x_1, x_2) = x_1^2 + 2 x_2$, where $x_1 = \sin t$ and $x_2 = \cos t$. Then
$$\frac{\mathrm{d}f}{\mathrm{d}t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} = 2 \sin t\, \frac{\partial \sin t}{\partial t} + 2\, \frac{\partial \cos t}{\partial t} = 2 \sin t \cos t - 2 \sin t = 2 \sin t\, (\cos t - 1)$$
• The above is the corresponding derivative of 𝑓 with respect to 𝑡.
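A short numerical check of this result (a sketch; the value of $t$ is arbitrary):

```python
import math

t, h = 0.8, 1e-6
f_of_t = lambda t: math.sin(t)**2 + 2 * math.cos(t)   # f composed with x1(t), x2(t)

print((f_of_t(t + h) - f_of_t(t - h)) / (2 * h))      # numerical df/dt
print(2 * math.sin(t) * (math.cos(t) - 1))            # analytic result from the chain rule
```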

5.2.2 Chain Rule
• If $f(x_1, x_2)$ is a function of $x_1$ and $x_2$, where $f : \mathbb{R}^2 \to \mathbb{R}$, and $x_1(s, t)$ and $x_2(s, t)$ are themselves functions of two variables $s$ and $t$, the chain rule yields the partial derivatives
$$\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s}, \qquad \frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t}$$
• The gradient can be obtained by the matrix multiplication
$$\frac{\mathrm{d}f}{\mathrm{d}(s, t)} = \frac{\partial f}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial (s, t)} = \underbrace{\begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix}}_{= \frac{\partial f}{\partial \mathbf{x}}} \underbrace{\begin{bmatrix} \dfrac{\partial x_1}{\partial s} & \dfrac{\partial x_1}{\partial t} \\[2mm] \dfrac{\partial x_2}{\partial s} & \dfrac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial \mathbf{x}}{\partial (s, t)}}$$
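The following sketch illustrates this matrix form with an illustrative choice of $f$, $x_1(s,t)$, and $x_2(s,t)$ (these concrete functions are assumptions for the example, not from the slides):

```python
# f(x1, x2) = x1**2 + 2*x2,  x1 = s*t,  x2 = s + t.
import numpy as np

def grad_via_chain_rule(s, t):
    x1, x2 = s * t, s + t
    df_dx = np.array([[2 * x1, 2.0]])    # 1x2: [df/dx1, df/dx2]
    dx_dst = np.array([[t, s],           # 2x2: [[dx1/ds, dx1/dt],
                       [1.0, 1.0]])      #       [dx2/ds, dx2/dt]]
    return df_dx @ dx_dst                # 1x2: [df/ds, df/dt]

def grad_numerical(s, t, h=1e-6):
    f = lambda s, t: (s * t)**2 + 2 * (s + t)
    return np.array([[(f(s + h, t) - f(s - h, t)) / (2 * h),
                      (f(s, t + h) - f(s, t - h)) / (2 * h)]])

print(grad_via_chain_rule(1.2, -0.7))
print(grad_numerical(1.2, -0.7))         # the two rows should agree closely
```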

5.3 Gradients of Vector-Valued Functions
• We discussed partial derivatives and gradients of functions $f : \mathbb{R}^n \to \mathbb{R}$.
• We will generalize the concept of the gradient to vector-valued functions (vector fields) $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $n \geq 1$ and $m > 1$.
• For a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ and a vector $\mathbf{x} = [x_1, \ldots, x_n]^\top \in \mathbb{R}^n$, the corresponding vector of function values is given as
$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^m$$
• Writing the vector-valued function in this way allows us to view it as a vector of functions $[f_1, \ldots, f_m]^\top$, $f_i : \mathbb{R}^n \to \mathbb{R}$, that each map onto $\mathbb{R}$.
• The differentiation rules for every $f_i$ are exactly the ones we discussed before.

5.3 Gradients of Vector-Valued Functions
• The partial derivative of a vector-valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x_i \in \mathbb{R}$, $i = 1, \ldots, n$, is given as the vector
$$\frac{\partial \mathbf{f}}{\partial x_i} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_i} \\ \vdots \\ \dfrac{\partial f_m}{\partial x_i} \end{bmatrix} = \begin{bmatrix} \lim_{h \to 0} \dfrac{f_1(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_1(\mathbf{x})}{h} \\ \vdots \\ \lim_{h \to 0} \dfrac{f_m(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f_m(\mathbf{x})}{h} \end{bmatrix} \in \mathbb{R}^m$$
• In the above, every partial derivative $\dfrac{\partial \mathbf{f}}{\partial x_i}$ is a column vector.
• Recall that the gradient of 𝑓 with respect to a vector is the row vector of the partial derivatives
• Therefore, we obtain the gradient of $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $\mathbf{x} \in \mathbb{R}^n$ by collecting these partial derivatives:
$$\frac{\mathrm{d}\mathbf{f}(\mathbf{x})}{\mathrm{d}\mathbf{x}} = \begin{bmatrix} \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial f_m(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

5.3 Gradients of Vector-Valued Functions
• The collection of all first-order partial derivatives of a vector-valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is called the Jacobian. The Jacobian $\mathbf{J}$ is an $m \times n$ matrix, which we define and arrange as follows:
$$\mathbf{J} = \nabla_{\mathbf{x}} \mathbf{f} = \frac{\mathrm{d}\mathbf{f}(\mathbf{x})}{\mathrm{d}\mathbf{x}} = \begin{bmatrix} \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial f_m(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix},$$
$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad J(i, j) = \frac{\partial f_i}{\partial x_j}$$
• The elements of $\mathbf{f}$ define the rows and the elements of $\mathbf{x}$ define the columns of the corresponding Jacobian.
• Special case: for a function $f : \mathbb{R}^n \to \mathbb{R}^1$ which maps a vector $\mathbf{x} \in \mathbb{R}^n$ onto a scalar, i.e., $m = 1$, the Jacobian is a row vector of dimension $1 \times n$.

5.3 Gradients of Vector-Valued Functions
• If $f : \mathbb{R} \to \mathbb{R}$, the gradient is a scalar.
• If $f : \mathbb{R}^D \to \mathbb{R}$, the gradient is a $1 \times D$ row vector.
• If $\mathbf{f} : \mathbb{R} \to \mathbb{R}^E$, the gradient is an $E \times 1$ column vector.
• If $\mathbf{f} : \mathbb{R}^D \to \mathbb{R}^E$, the gradient is an $E \times D$ matrix.

Example – Gradient of a Vector-Valued Function
• We are given $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$, $\mathbf{f}(\mathbf{x}) \in \mathbb{R}^M$, $\mathbf{A} \in \mathbb{R}^{M \times N}$, $\mathbf{x} \in \mathbb{R}^N$.
• To compute the gradient $\mathrm{d}\mathbf{f}/\mathrm{d}\mathbf{x}$, we first determine its dimension: since $\mathbf{f} : \mathbb{R}^N \to \mathbb{R}^M$, it follows that $\mathrm{d}\mathbf{f}/\mathrm{d}\mathbf{x} \in \mathbb{R}^{M \times N}$.
• Then, we determine the partial derivatives of $\mathbf{f}$ with respect to every $x_j$:
$$f_i(\mathbf{x}) = \sum_{j=1}^{N} A_{ij} x_j \;\Rightarrow\; \frac{\partial f_i}{\partial x_j} = A_{ij}$$
• We collect the partial derivatives in the Jacobian and obtain the gradient
$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \dfrac{\partial f_M}{\partial x_1} & \cdots & \dfrac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = \mathbf{A} \in \mathbb{R}^{M \times N}$$
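This can be confirmed with a numerical Jacobian (a sketch; the matrix sizes and random seed are arbitrary):

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Numerical Jacobian: one column of central-difference partials per input dimension."""
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
print(np.allclose(jacobian(lambda v: A @ v, x), A, atol=1e-5))  # True: df/dx = A
```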

Example – Chain Rule
• Consider the function $h : \mathbb{R} \to \mathbb{R}$, $h(t) = (f \circ g)(t)$ with
$$f : \mathbb{R}^2 \to \mathbb{R}, \quad f(\mathbf{x}) = \exp(x_1 x_2^2),$$
$$g : \mathbb{R} \to \mathbb{R}^2, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix}.$$
• We compute the gradient of $h$ with respect to $t$. Since $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$, we note that
$$\frac{\partial f}{\partial \mathbf{x}} \in \mathbb{R}^{1 \times 2}, \qquad \frac{\partial g}{\partial t} \in \mathbb{R}^{2 \times 1}.$$
• The desired gradient is computed by applying the chain rule:
$$\frac{\mathrm{d}h}{\mathrm{d}t} = \frac{\partial f}{\partial \mathbf{x}} \frac{\partial \mathbf{x}}{\partial t} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \dfrac{\partial x_1}{\partial t} \\[2mm] \dfrac{\partial x_2}{\partial t} \end{bmatrix} = \begin{bmatrix} \exp(x_1 x_2^2)\, x_2^2 & 2 \exp(x_1 x_2^2)\, x_1 x_2 \end{bmatrix} \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix}$$
$$= \exp(x_1 x_2^2) \bigl( x_2^2 (\cos t - t \sin t) + 2 x_1 x_2 (\sin t + t \cos t) \bigr),$$
where $x_1 = t \cos t$ and $x_2 = t \sin t$.
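A numerical cross-check of this derivative (a sketch; the value of $t$ is arbitrary):

```python
import math

def dh_dt_chain_rule(t):
    x1, x2 = t * math.cos(t), t * math.sin(t)
    df_dx = [math.exp(x1 * x2**2) * x2**2, 2 * math.exp(x1 * x2**2) * x1 * x2]
    dx_dt = [math.cos(t) - t * math.sin(t), math.sin(t) + t * math.cos(t)]
    return df_dx[0] * dx_dt[0] + df_dx[1] * dx_dt[1]

def dh_dt_numerical(t, h=1e-6):
    ht = lambda s: math.exp(s * math.cos(s) * (s * math.sin(s))**2)
    return (ht(t + h) - ht(t - h)) / (2 * h)

t = 0.9
print(dh_dt_chain_rule(t), dh_dt_numerical(t))  # the two values should agree closely
```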

Example – Gradient of a Least-Squares Loss in a Linear Model
• Let us consider the linear model
$$\mathbf{y} = \boldsymbol{\Phi}\boldsymbol{\theta}$$
where $\boldsymbol{\theta} \in \mathbb{R}^D$ is a parameter vector, $\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$ are input features and $\mathbf{y} \in \mathbb{R}^N$ are the corresponding observations. We define the functions
$$L(\mathbf{e}) := \|\mathbf{e}\|^2, \qquad \mathbf{e}(\boldsymbol{\theta}) := \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta}$$
• We seek $\dfrac{\partial L}{\partial \boldsymbol{\theta}}$, and we will use the chain rule for this purpose. $L$ is called a least-squares loss function.
• First, we determine the dimensionality of the gradient as
$$\frac{\partial L}{\partial \boldsymbol{\theta}} \in \mathbb{R}^{1 \times D}$$
• The chain rule allows us to compute the gradient as
$$\frac{\partial L}{\partial \boldsymbol{\theta}} = \frac{\partial L}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \boldsymbol{\theta}}$$

Example – Gradient of a Least-Squares Loss in a Linear Model
• We know that $\|\mathbf{e}\|^2 = \mathbf{e}^\top \mathbf{e}$ and determine
$$\frac{\partial L}{\partial \mathbf{e}} = 2\mathbf{e}^\top \in \mathbb{R}^{1 \times N}$$
• Further, we obtain
$$\frac{\partial \mathbf{e}}{\partial \boldsymbol{\theta}} = -\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$$
• Our desired derivative is
$$\frac{\partial L}{\partial \boldsymbol{\theta}} = -2\mathbf{e}^\top \boldsymbol{\Phi} = -2 \underbrace{(\mathbf{y}^\top - \boldsymbol{\theta}^\top \boldsymbol{\Phi}^\top)}_{1 \times N}\; \underbrace{\boldsymbol{\Phi}}_{N \times D} \in \mathbb{R}^{1 \times D}$$
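The following sketch verifies this gradient against a finite-difference approximation (shapes and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 3
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

L = lambda th: np.sum((y - Phi @ th) ** 2)   # least-squares loss

e = y - Phi @ theta
grad_analytic = -2 * e @ Phi                 # -2 e^T Phi, a 1 x D row vector

h = 1e-6
grad_numeric = np.array([(L(theta + h * np.eye(D)[i]) - L(theta - h * np.eye(D)[i])) / (2 * h)
                         for i in range(D)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```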

5.4 Gradients of Matrices
• Consider the following example
$$\mathbf{f} = \mathbf{A}\mathbf{x}, \quad \mathbf{f} \in \mathbb{R}^M, \quad \mathbf{A} \in \mathbb{R}^{M \times N}, \quad \mathbf{x} \in \mathbb{R}^N$$
• We seek the gradient $\dfrac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}}$.
• First, we determine the dimension of the gradient:
$$\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}} \in \mathbb{R}^{M \times (M \times N)}$$
• By definition, the gradient is the collection of the partial derivatives:
$$\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}\mathbf{A}} = \begin{bmatrix} \dfrac{\partial f_1}{\partial \mathbf{A}} \\ \vdots \\ \dfrac{\partial f_M}{\partial \mathbf{A}} \end{bmatrix}, \qquad \frac{\partial f_i}{\partial \mathbf{A}} \in \mathbb{R}^{1 \times (M \times N)}$$
• To compute the partial derivatives, we explicitly write out the matrix-vector multiplication:
$$f_i = \sum_{j=1}^{N} A_{ij} x_j, \quad i = 1, \ldots, M$$

$$f_i = \sum_{j=1}^{N} A_{ij} x_j, \quad i = 1, \ldots, M$$
• The partial derivatives are then given as
$$\frac{\partial f_i}{\partial A_{iq}} = x_q$$
• Partial derivatives of $f_i$ with respect to a row of $\mathbf{A}$ are given as
$$\frac{\partial f_i}{\partial A_{i,:}} = \mathbf{x}^\top \in \mathbb{R}^{1 \times 1 \times N}, \qquad \frac{\partial f_i}{\partial A_{k \neq i,:}} = \mathbf{0}^\top \in \mathbb{R}^{1 \times 1 \times N}$$
• Since $f_i$ maps onto $\mathbb{R}$ and each row of $\mathbf{A}$ is of size $1 \times N$, we obtain a $1 \times 1 \times N$ sized tensor as the partial derivative of $f_i$ with respect to a row of $\mathbf{A}$.
• We stack the partial derivatives and get the desired gradient, where $\mathbf{x}^\top$ sits in the $i$-th row:
$$\frac{\partial f_i}{\partial \mathbf{A}} = \begin{bmatrix} \mathbf{0}^\top \\ \vdots \\ \mathbf{0}^\top \\ \mathbf{x}^\top \\ \mathbf{0}^\top \\ \vdots \\ \mathbf{0}^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)}$$
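A sketch that builds $\partial f_i/\partial \mathbf{A}$ numerically (reshaped to $M \times N$ for readability) and compares it with the stacked form above; the sizes and the row index $i$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 3, 4
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

i = 1                      # which output f_i we differentiate
h = 1e-6
df_i_dA = np.zeros((M, N))
for k in range(M):
    for j in range(N):
        E = np.zeros((M, N))
        E[k, j] = h
        df_i_dA[k, j] = (((A + E) @ x)[i] - ((A - E) @ x)[i]) / (2 * h)

expected = np.zeros((M, N))
expected[i, :] = x         # x^T in row i, zero rows elsewhere
print(np.allclose(df_i_dA, expected, atol=1e-5))  # True
```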

Example – Gradient of Matrices with Respect to Matrices
• Consider a matrix $\mathbf{R} \in \mathbb{R}^{M \times N}$ and $\mathbf{f} : \mathbb{R}^{M \times N} \to \mathbb{R}^{N \times N}$ with
$$\mathbf{f}(\mathbf{R}) = \mathbf{R}^\top \mathbf{R} =: \mathbf{K} \in \mathbb{R}^{N \times N}$$
• We seek the gradient $\dfrac{\mathrm{d}\mathbf{K}}{\mathrm{d}\mathbf{R}}$.
• First, the dimension of the gradient is given as
$$\frac{\mathrm{d}\mathbf{K}}{\mathrm{d}\mathbf{R}} \in \mathbb{R}^{(N \times N) \times (M \times N)}, \qquad \frac{\mathrm{d}K_{pq}}{\mathrm{d}\mathbf{R}} \in \mathbb{R}^{1 \times M \times N}$$
for $p, q = 1, \ldots, N$, where $K_{pq}$ is the $(p, q)$-th entry of $\mathbf{K} = \mathbf{f}(\mathbf{R})$.
• Denoting the $i$-th column of $\mathbf{R}$ by $\mathbf{r}_i$, every entry of $\mathbf{K}$ is given by the dot product of two columns of $\mathbf{R}$, i.e.,
$$K_{pq} = \mathbf{r}_p^\top \mathbf{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq}$$

Example – Gradient of Matrices with Respect to Matrices
• Denoting the $i$-th column of $\mathbf{R}$ by $\mathbf{r}_i$, every entry of $\mathbf{K}$ is given by the dot product of two columns of $\mathbf{R}$, i.e.,
$$K_{pq} = \mathbf{r}_p^\top \mathbf{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq}$$
• We now compute the partial derivative $\dfrac{\partial K_{pq}}{\partial R_{ij}}$ and obtain
$$\frac{\partial K_{pq}}{\partial R_{ij}} = \sum_{m=1}^{M} \frac{\partial}{\partial R_{ij}} \bigl( R_{mp} R_{mq} \bigr) = \partial_{pqij}, \qquad \partial_{pqij} = \begin{cases} R_{iq} & \text{if } j = p,\; p \neq q \\ R_{ip} & \text{if } j = q,\; p \neq q \\ 2 R_{iq} & \text{if } j = p,\; p = q \\ 0 & \text{otherwise} \end{cases}$$
• The desired gradient has the dimension $(N \times N) \times (M \times N)$, and every single entry of this tensor is given by $\partial_{pqij}$, where $p, q, j = 1, \ldots, N$ and $i = 1, \ldots, M$.
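A sketch that verifies the piecewise formula for $\partial K_{pq}/\partial R_{ij}$ against central differences (matrix sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 3
R = rng.standard_normal((M, N))

def dK_dR_analytic(p, q, i, j):
    # Piecewise formula for the derivative of K_pq = sum_m R_mp R_mq w.r.t. R_ij.
    if j == p and p != q:
        return R[i, q]
    if j == q and p != q:
        return R[i, p]
    if j == p and p == q:
        return 2 * R[i, q]
    return 0.0

def dK_dR_numeric(p, q, i, j, h=1e-6):
    E = np.zeros_like(R)
    E[i, j] = h
    Kp = (R + E).T @ (R + E)
    Km = (R - E).T @ (R - E)
    return (Kp[p, q] - Km[p, q]) / (2 * h)

ok = all(np.isclose(dK_dR_analytic(p, q, i, j), dK_dR_numeric(p, q, i, j), atol=1e-5)
         for p in range(N) for q in range(N) for i in range(M) for j in range(N))
print(ok)  # True
```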

5.5 Useful Identities for Computing Gradients
• Some useful gradients that are frequently required in machine learning (notation: $\operatorname{tr}(\cdot)$ is the trace, $\det(\cdot)$ the determinant, and $\mathbf{f}(\mathbf{X})^{-1}$ the inverse of $\mathbf{f}(\mathbf{X})$):
$$\frac{\partial \mathbf{x}^\top \mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}^\top$$
$$\frac{\partial \mathbf{a}^\top \mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}^\top$$
$$\frac{\partial \mathbf{a}^\top \mathbf{X} \mathbf{b}}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^\top$$
$$\frac{\partial \mathbf{x}^\top \mathbf{B} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^\top (\mathbf{B} + \mathbf{B}^\top)$$
$$\frac{\partial}{\partial \mathbf{s}} (\mathbf{x} - \mathbf{A}\mathbf{s})^\top \mathbf{W} (\mathbf{x} - \mathbf{A}\mathbf{s}) = -2 (\mathbf{x} - \mathbf{A}\mathbf{s})^\top \mathbf{W} \mathbf{A} \quad \text{for symmetric } \mathbf{W}$$
• You should be able to calculate these gradients.
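Two of these identities checked numerically (a minimal sketch; shapes and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
a = rng.standard_normal(n)
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar-valued function of a vector."""
    return np.array([(f(x + h * np.eye(x.size)[i]) - f(x - h * np.eye(x.size)[i])) / (2 * h)
                     for i in range(x.size)])

# d(x^T a)/dx = a^T
print(np.allclose(grad(lambda v: v @ a, x), a, atol=1e-5))                   # True

# d(x^T B x)/dx = x^T (B + B^T)
print(np.allclose(grad(lambda v: v @ B @ v, x), x @ (B + B.T), atol=1e-4))   # True
```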