Gradient Descent
Gradient descent in one dimension
gradient_descent_1d(θ_init, f, f′, η, ε):
    θ^(0) = θ_init
    t = 0
    loop:
        t = t + 1
        θ^(t) = θ^(t−1) − η f′(θ^(t−1))        (η is the step size)
    until |f(θ^(t)) − f(θ^(t−1))| < ε           (convergence criterion)
    return θ^(t)
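A minimal runnable sketch of this procedure in Python (the function names and the example f(θ) = (θ − 2)² are illustrative assumptions, not from the slides):

    def gradient_descent_1d(theta_init, f, df, eta, eps):
        """1D gradient descent: step against the derivative until f stops changing."""
        theta = theta_init
        f_prev = f(theta)
        while True:
            theta = theta - eta * df(theta)   # theta^(t) = theta^(t-1) - eta * f'(theta^(t-1))
            f_curr = f(theta)
            if abs(f_curr - f_prev) < eps:    # convergence criterion
                return theta
            f_prev = f_curr

    # Example: f(theta) = (theta - 2)^2 is minimized at theta = 2.
    theta_star = gradient_descent_1d(0.0, lambda th: (th - 2)**2,
                                     lambda th: 2*(th - 2), eta=0.1, eps=1e-12)
    print(theta_star)  # ~ 2.0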
Convergence
[Figure: a non-convex function with both a local minimum and a global minimum]
• Theorem: If f is convex, then for any desired accuracy ε, there is some step size η such that gradient descent will converge to a θ within ε of the optimum.
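Without convexity, the answer depends on where we start: gradient descent can settle into a local minimum. A small experiment using gradient_descent_1d from the sketch above (the function f(θ) = θ⁴ − 3θ² + θ is an illustrative assumption, not from the slides):

    f  = lambda th: th**4 - 3*th**2 + th      # non-convex: one local and one global minimum
    df = lambda th: 4*th**3 - 6*th + 1

    # Same step size, different starting points, different answers:
    print(gradient_descent_1d(-2.0, f, df, eta=0.01, eps=1e-12))  # ~ -1.30, the global minimum
    print(gradient_descent_1d(+2.0, f, df, eta=0.01, eps=1e-12))  # ~ +1.13, a local minimum only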
Gradient descent in multiple dimensions
gradient_descent(θ_init, f, ∇θf, η, ε):
    ...
    θ^(t) = θ^(t−1) − η ∇θf(θ^(t−1))
    ...

The algorithm is the same as in one dimension, with the derivative f′ replaced by the gradient ∇θf. Here f is a function of θ = (θ1, …, θm), and θ, θ_init, and ∇θf are all m × 1 column vectors.
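A minimal NumPy sketch of the multi-dimensional version (the quadratic example f(θ) = ‖θ − b‖² is an illustrative assumption):

    import numpy as np

    def gradient_descent(theta_init, f, grad_f, eta, eps):
        """Gradient descent on f: R^m -> R; theta_init is an (m, 1) column vector."""
        theta = theta_init
        f_prev = f(theta)
        while True:
            theta = theta - eta * grad_f(theta)   # theta^(t) = theta^(t-1) - eta * grad f(theta^(t-1))
            f_curr = f(theta)
            if abs(f_curr - f_prev) < eps:        # same convergence criterion as in 1D
                return theta
            f_prev = f_curr

    # Example: f(theta) = ||theta - b||^2 is minimized at theta = b.
    b = np.array([[1.0], [3.0]])
    theta_star = gradient_descent(np.zeros((2, 1)),
                                  lambda th: float(np.sum((th - b)**2)),
                                  lambda th: 2 * (th - b), eta=0.1, eps=1e-12)
    print(theta_star.ravel())  # ~ [1. 3.]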
∇θf = ( ∂f/∂θ1, ∂f/∂θ2, …, ∂f/∂θm )ᵀ

the m × 1 column vector of partial derivatives of f with respect to each component of θ.
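Each entry ∂f/∂θi can be checked numerically with a central finite difference, a common way to verify a hand-derived gradient. A sketch (the helper numerical_gradient is hypothetical, not from the slides):

    import numpy as np

    def numerical_gradient(f, theta, h=1e-6):
        """Approximate the gradient of f at theta, one coordinate at a time."""
        grad = np.zeros_like(theta)
        for i in range(theta.shape[0]):
            e = np.zeros_like(theta)
            e[i] = h
            grad[i] = (f(theta + e) - f(theta - e)) / (2 * h)   # ~ df/dtheta_i
        return grad

    # Check against the analytic gradient of f(theta) = ||theta - b||^2, which is 2(theta - b).
    b = np.array([[1.0], [3.0]])
    f = lambda th: float(np.sum((th - b)**2))
    theta = np.array([[0.5], [0.5]])
    print(numerical_gradient(f, theta).ravel())  # ~ [-1. -5.]
    print((2 * (theta - b)).ravel())             #   [-1. -5.]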