Stochastic Search and Stochastic Gradients
Multivariate Optimisation
Let f : Rd → R and suppose that all of the first- and second-order partial derivatives of f exist and are continuous everywhere. We write x = (x1, . . . , xd)T for an element of Rd and ei for the i-th co-ordinate vector: x=x1e1 +···+xded.
The i-th partial derivative at x will be denoted fi(x) = ∂f(x)/∂xi and we define the gradient
∇f(x) = (f1(x),…,fd(x))T and the Hessian
H(x) = [ ∂²f(x)/(∂x_i ∂x_j) ]_{i,j=1,…,d},
the d × d matrix whose (i, j) entry is ∂²f(x)/(∂x_i ∂x_j).
For any vector v ̸= 0 the slope at x in direction v is given by
vT ∇f(x)/∥v∥.
The curvature at x in direction v is given by
vT H(x)v/∥v∥2.
A necessary (but not sufficient) condition for a local maximum at x is ∇f(x) = 0 = (0,…,0)T and for all v ̸= 0, the curvature at x in direction v is ≤ 0.
A sufficient (but not necessary) condition for f to have a local maximum at x is that ∇f (x) = 0 and the curvature in all directions is < 0.
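For instance, here is a minimal R sketch of the slope and curvature formulas, using the illustrative quadratic f(x) = −∥x∥² (whose gradient −2x and Hessian −2I are known in closed form); the point x and direction v are arbitrary choices, not from the slides.

```r
## Slope and curvature at x in direction v, for f(x) = -sum(x^2),
## which has gradient -2x and Hessian -2*I.
slope     <- function(g, v) sum(v * g) / sqrt(sum(v^2))        # v' grad f(x) / ||v||
curvature <- function(H, v) drop(t(v) %*% H %*% v) / sum(v^2)  # v' H(x) v / ||v||^2

x <- c(1, 2)
v <- c(1, -1)
g <- -2 * x          # gradient of f at x
H <- -2 * diag(2)    # Hessian of f at x
slope(g, v)          # positive: f increases from x in this direction
curvature(H, v)      # -2: negative in every direction (f is concave, max at 0)
```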
Steepest ascent
Put x(n + 1) = x(n) + αv, where α is a positive scalar and the direction v is the direction with largest slope. That is, v maximises vT ∇f (x(n))/∥v∥. Consider
(∂/∂v_i) ( vT∇f(x)/∥v∥ ) = f_i(x)/∥v∥ − (vT∇f(x)) v_i/∥v∥³.
Setting this to 0 we get vi ∝ fi(x), from which we see that at the point x the direction with largest slope is ∇f (x). Thus, the steepest ascent method has the form
x(n + 1) = x(n) + α∇f (x(n)), for some α ≥ 0.
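As an illustration, here is a minimal R sketch of the steepest ascent iteration for the test function used in the example below; the analytic gradient, step size α and starting point are my own illustrative choices and are not taken from the course script ascent.r.

```r
## Steepest ascent for the test function used in the example slides.
f <- function(x, y) sin(x^2/2 - y^2/4) * cos(2*x - exp(y))

## Analytic gradient of f, via the chain rule with u = x^2/2 - y^2/4, v = 2x - exp(y).
grad_f <- function(x, y) {
  u <- x^2/2 - y^2/4
  v <- 2*x - exp(y)
  c( x   * cos(u) * cos(v) - 2      * sin(u) * sin(v),
    -y/2 * cos(u) * cos(v) + exp(y) * sin(u) * sin(v))
}

steepest_ascent <- function(z0, alpha = 0.2, n_iter = 50) {
  z <- z0
  path <- matrix(NA, n_iter + 1, 2)
  path[1, ] <- z
  for (n in 1:n_iter) {
    z <- z + alpha * grad_f(z[1], z[2])   # x(n+1) = x(n) + alpha * grad f(x(n))
    path[n + 1, ] <- z
  }
  path
}

path <- steepest_ascent(c(1.6, 1.2))  # arbitrary starting point
tail(path, 1)                         # last iterate: an approximate local maximiser
```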
Example: ascent.r, α = 0.2
f(x,y) = sin(x^2/2 − y^2/4) * cos(2*x − exp(y))
[Figure: iterates of ascent.r plotted over f]
Kiefer-Wolfowitz 1952
Suppose that f(x) = EΦ(x) for x ∈ R where f has a unique global maximum at θ, Var Φ(x) is bounded, and
1. f is Lipschitz in a neighborhood of θ
2. f has bounded growth
3. f ′ is bounded away from 0 outside any neighborhood of θ
For arbitrary Z_0 put
Z_{n+1} = Z_n + a_n (Φ_{2n} − Φ_{2n−1}) / c_n,
where Φ_{2n−1} ∼ Φ(Z_n − c_n) and Φ_{2n} ∼ Φ(Z_n + c_n) are independent. Then Z_n → θ when
(a) c_n → 0
(b) Σ_n a_n = ∞
(c) Σ_n a_n c_n < ∞
(d) Σ_n a_n²/c_n² < ∞
For example, a_n = n^{-1} and c_n = n^{-1/4} satisfy these conditions.
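A minimal R sketch of this recursion, using the gains a_n = n^{-1}, c_n = n^{-1/4} from the example above; the quadratic f and the noise level are illustrative assumptions, not from the slides.

```r
## Kiefer-Wolfowitz recursion for a toy problem: f(x) = -(x - 2)^2 has its
## maximum at theta = 2, and we only observe Phi(x) = f(x) + N(0, 0.1^2) noise.
Phi <- function(x) -(x - 2)^2 + rnorm(1, sd = 0.1)

kw <- function(z0, n_iter = 5000) {
  z <- z0
  for (n in 1:n_iter) {
    a_n <- 1 / n        # a_n = n^-1
    c_n <- n^(-1/4)     # c_n = n^-1/4
    # two independent noisy evaluations either side of the current point
    z <- z + a_n * (Phi(z + c_n) - Phi(z - c_n)) / c_n
  }
  z
}

set.seed(1)
kw(z0 = 0)   # should end up close to theta = 2
```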
Zn+1 = Zn + an(Φ2n − Φ2n−1)/cn
We show that b_n := E(Z_n − θ)² → 0 as n → ∞.
b_{n+1} = b_n + 2 (a_n/c_n) E(U_n^+ − U_n^-) + (a_n²/c_n²) E(Φ_{2n} − Φ_{2n−1})²,
where U_n^+ and U_n^- are the positive and negative parts of (Z_n − θ)(f(Z_n + c_n) − f(Z_n − c_n)).
Iterating, we get
b_{n+1} = b_0 + 2 Σ_{j≤n} (a_j/c_j) E U_j^+ − 2 Σ_{j≤n} (a_j/c_j) E U_j^- + Σ_{j≤n} (a_j²/c_j²) E(Φ_{2j} − Φ_{2j−1})².
Consider the last term.
E(Φ_{2n} − Φ_{2n−1})² = Var Φ_{2n} + Var Φ_{2n−1} + E(f(Z_n + c_n) − f(Z_n − c_n))².
By assumption the RHS is bounded; thus if Σ_n a_n²/c_n² < ∞ then the last term remains bounded.
Consider the second term:
E U_n^+ = E[ ((Z_n − θ)(f(Z_n + c_n) − f(Z_n − c_n)))^+ ].
Since f has a unique maximum at θ, the integrand can only be strictly positive when Z_n is within distance c_n of θ (draw a picture).
Since f is Lipschitz in a neighborhood of θ (by assumption), we then have E[ ((Z_n − θ)(f(Z_n + c_n) − f(Z_n − c_n)))^+ ] ≤ α c_n² for some constant α. Thus if Σ_n a_n c_n < ∞ then the second term remains bounded.
b_{n+1} = b_0 + 2 Σ_{j≤n} (a_j/c_j) E U_j^+ − 2 Σ_{j≤n} (a_j/c_j) E U_j^- + Σ_{j≤n} (a_j²/c_j²) E(Φ_{2j} − Φ_{2j−1})²
The LHS is non-negative.
The first, second and fourth terms on the RHS are non-negative and bounded, so the third term (which is non-positive) must be bounded below. That is, this sum also converges.
So, considering the second and third terms, both Σ_j (a_j/c_j) E U_j^+ and Σ_j (a_j/c_j) E U_j^- converge, and hence so does
Σ_n a_n E|Z_n − θ| |f(Z_n + c_n) − f(Z_n − c_n)| / c_n.
Thus, if Σ_n a_n diverges, there must be a subsequence {n_j} such that
E|Z_{n_j} − θ| |f(Z_{n_j} + c_{n_j}) − f(Z_{n_j} − c_{n_j})| / c_{n_j} → 0.
E|Z_{n_j} − θ| |f(Z_{n_j} + c_{n_j}) − f(Z_{n_j} − c_{n_j})| / c_{n_j} → 0.
By assumption, f ′ is bounded away from 0 outside of any neighborhood of θ.
It follows that E|Z_{n_j} − θ| → 0.
Returning to our sum
b_n ≤ b_m + 2 Σ_{j≥m} (a_j/c_j) E U_j^+ + Σ_{j≥m} (a_j²/c_j²) E(Φ_{2j} − Φ_{2j−1})²  for n > m.
We can choose m = n_j so that b_m is arbitrarily small with arbitrarily large probability, and we have already seen that the two sums go to 0 as m → ∞.
Improving Kiefer-Wolfowitz
Later work (1954) showed that the condition Σ_n a_n c_n < ∞ is not needed for convergence.
However, Broadie, Cicek & Zeevi (2011) showed that asymptotically the MSE is minimised by taking a_n ∼ n^{-1} and c_n ∼ n^{-1/4}, and for this choice Σ_n a_n c_n < ∞ anyway.
Stochastic Gradient Descent
For f : Rd → R a stochastic gradient descent algorithm for finding a local minimum has the form
Zn+1 = Zn − anGn(Zn)
where Gn(x) is a (stochastic) approximation to ∇f(x).
If we can sample from Φ(x) where EΦ(x) = f (x) then we can use finite differences to estimate ∇f (x).
That is, for the i-th element of Gn we have
[G_n(x)]_i = (Φ(x + c_n e_i) − Φ(x − c_n e_i)) / (2 c_n)
where ei is the i-th co-ordinate vector and cn is small.
As for K-W we can obtain conditions on {an} and {cn} for convergence.
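A minimal R sketch of stochastic gradient descent with finite-difference gradient estimates; the quadratic target, noise level and gain sequences are illustrative assumptions, not from the slides.

```r
## Stochastic gradient descent with finite-difference gradient estimates.
## Illustrative target: f(x) = sum((x - theta)^2), observed only through
## Phi(x) = f(x) + noise, minimised at theta.
theta <- c(1, -2, 3)
Phi <- function(x) sum((x - theta)^2) + rnorm(1, sd = 0.1)

fd_gradient <- function(x, c_n) {
  d <- length(x)
  g <- numeric(d)
  for (i in 1:d) {
    e_i <- numeric(d); e_i[i] <- 1
    g[i] <- (Phi(x + c_n * e_i) - Phi(x - c_n * e_i)) / (2 * c_n)
  }
  g   # uses 2d evaluations of Phi in total
}

sgd_fd <- function(z0, n_iter = 2000) {
  z <- z0
  for (n in 1:n_iter) {
    a_n <- 1 / n
    c_n <- n^(-1/4)
    z <- z - a_n * fd_gradient(z, c_n)   # Z_{n+1} = Z_n - a_n G_n(Z_n)
  }
  z
}

set.seed(1)
sgd_fd(c(0, 0, 0))   # should end up close to theta
```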
Example: ascent_stoch_grad.r, added white noise with variance 0.1², a_n = 0.2, c_n = 0.3
f(x,y) = sin(x^2/2 − y^2/4) * cos(2*x − exp(y))
[Figure: iterates of ascent_stoch_grad.r plotted over f]
Simultaneous Perturbations (Spall 1992)
Zn+1 = Zn − anGn(Zn)
Using finite differences to estimate ∇f requires 2d evaluations of Φ, which is unnecessarily expensive if the dimension d is large.
Let ∆_n be i.i.d. random vectors in R^d whose components [∆_n]_i are themselves i.i.d., each equally likely to take the values ±1.
The Simultaneous Perturbations estimate Gn(x) of ∇f(x) has i-th component
[G_n(x)]_i = (Φ(x + c_n ∆_n) − Φ(x − c_n ∆_n)) / (2 c_n [∆_n]_i)
This requires only 2 evaluations of Φ, yet it turns out to give the same asymptotic convergence rate as using finite differences to estimate ∇f.
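A minimal R sketch of the simultaneous perturbations estimate inside a descent loop; the quadratic target, noise level and gain constants are illustrative assumptions (the gains follow the a/(k+1+A)^α, c/(k+1)^γ form discussed below).

```r
## Simultaneous perturbations gradient estimate inside a descent loop.
theta <- c(1, -2, 3, 0.5)                                   # illustrative minimiser
Phi <- function(x) sum((x - theta)^2) + rnorm(1, sd = 0.1)  # noisy evaluation

sp_gradient <- function(x, c_n) {
  Delta <- sample(c(-1, 1), length(x), replace = TRUE)  # i.i.d. +/-1 components
  # only two evaluations of Phi, whatever the dimension d
  diff <- Phi(x + c_n * Delta) - Phi(x - c_n * Delta)
  diff / (2 * c_n * Delta)          # i-th component: diff / (2 c_n [Delta]_i)
}

spsa <- function(z0, n_iter = 5000, a = 0.1, A = 10, alpha = 0.602,
                 c0 = 0.1, gamma = 0.101) {
  z <- z0
  for (n in 1:n_iter) {
    a_n <- a  / (n + A)^alpha   # a_k = a/(k+1+A)^alpha with k = n - 1
    c_n <- c0 / n^gamma         # c_k = c/(k+1)^gamma  with k = n - 1
    z <- z - a_n * sp_gradient(z, c_n)
  }
  z
}

set.seed(1)
spsa(rep(0, 4))   # should end up close to theta
```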
Simultaneous Perturbations
Convergence conditions (simplified)
Zn+1 = Zn − anGn(Zn),
[Gn(x)]i = (Φ(x + cn∆n) − Φ(x − cn∆n))/(2cn[∆n]i)
Suppose that f (x) = EΦ(x), our observations of Φ(·) are independent, and Var Φ(x) is bounded.
Also suppose that f is three times continuously differentiable, has a unique minimum at θ, and that
dx(t)/dt = −∇f(x(t))
has an asymptotically stable equilibrium point at θ.
Then Z_n → θ a.s. when
(a) a_n ↓ 0, c_n ↓ 0
(b) Σ_n a_n = ∞
(c) Σ_n a_n²/c_n² < ∞
Simultaneous Perturbations
Convergence rate
Zn+1 = Zn − anGn(Zn),
[Gn(x)]i = (Φ(x + cn∆n) − Φ(x − cn∆n))/(2cn[∆n]i)
Suppose that in addition to the convergence conditions we have
a_n = a/(n + 1 + A)^α,  c_n = c/(n + 1)^γ,
where a, c, α, γ > 0, A ≥ 0 and 6γ ≥ α ≥ 2γ. Then, as n → ∞,
n^{α/2−γ} (Z_n − θ) →_d N(μ, Σ) for some μ and Σ.
Simultaneous Perturbations
Suggested gain sequences
a_k = a/(k + 1 + A)^α,  c_k = c/(k + 1)^γ. Asymptotically it is optimal to take
α=1, γ=1/6
which gives asymptotic convergence speed n^{-1/3}. In practice we only have a finite number of samples, and Spall (2003) recommends
α = 0.602, γ = 0.101
which gives asymptotic convergence speed n^{-1/5}.
This choice gives a slower decay of a_n than the asymptotically optimal choice. Spall also suggests switching from his recommendation to the asymptotically optimal choice after some initial number of iterations.
Simultaneous Perturbations
Suggested gain sequences
a_k = a/(k + 1 + A)^α,  c_k = c/(k + 1)^γ
Spall (2003) gives as a rule of thumb:
c ≈ √(Var Φ(θ)), the standard deviation of the noise in Φ;
A ≤ (1/10) × the maximum number of iterations;
choose a so that a_0 G_0(Z_0) is a sensible initial step size.
In general with more noise we should decrease the step size (that is, decrease a).
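A small R helper sketching these rules of thumb; max_iter, noise_sd, g0 (a typical initial gradient magnitude) and desired_step are placeholder inputs the user would supply, not quantities from the slides.

```r
## Gain sequences a_k = a/(k+1+A)^alpha, c_k = c/(k+1)^gamma following the
## rules of thumb above.
spall_gains <- function(max_iter, noise_sd, g0, desired_step,
                        alpha = 0.602, gamma = 0.101) {
  A    <- 0.1 * max_iter                     # A at most ~10% of the iteration budget
  cval <- noise_sd                           # c roughly the noise standard deviation
  a    <- desired_step * (1 + A)^alpha / g0  # so that a_0 * g0 = desired_step
  k <- 0:(max_iter - 1)
  list(a_k = a / (k + 1 + A)^alpha, c_k = cval / (k + 1)^gamma)
}

gains <- spall_gains(max_iter = 1000, noise_sd = 0.1, g0 = 1, desired_step = 0.1)
head(gains$a_k)
head(gains$c_k)
```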
Example: ascent_sim_perturb.r, added white noise with variance 0.1², a = 0.1 × 11^0.602, c = 0.1, A = 10
f(x,y) = sin(x^2/2 − y^2/4) * cos(2*x − exp(y))
[Figure: iterates of ascent_sim_perturb.r plotted over f]
Common random numbers
[G_n(x)]_i = (Φ(x + c_n ∆_n) − Φ(x − c_n ∆_n)) / (2 c_n [∆_n]_i)
If Φ(x) is generated by simulation then we can write it as
Φ(x) = φ(x;V1,V2,…) = φ(x;V)
where the Vi are a sequence of computer generated
pseudo-random variables.
Typically we find that if we use the same random sequence V and choose x and y close together, then φ(x; V) and φ(y; V) will be positively correlated.
Thus, we expect that φ(x + h; V) − φ(x − h; V) will have the same mean but smaller variance than φ(x + h; V) − φ(x − h; W), where W is distributed identically to, and independently of, V.
At the very least we can achieve this by resetting the uniform random number generator to the same place before generating Φ(x + h) and Φ(x − h).
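A minimal R illustration of the "reset the generator" version of common random numbers; the toy simulation Φ used here (a Monte Carlo estimate of E min(x, E) for E ∼ Exp(1)) is a placeholder assumption.

```r
## Common random numbers: reset the generator to the same state before the
## two evaluations used in a finite-difference estimate.
Phi <- function(x) mean(pmin(x, rexp(100)))   # toy simulation estimate

crn_diff <- function(x, h, seed) {
  set.seed(seed); up   <- Phi(x + h)   # same pseudo-random draws V ...
  set.seed(seed); down <- Phi(x - h)   # ... reused for the second evaluation
  (up - down) / (2 * h)
}
indep_diff <- function(x, h) (Phi(x + h) - Phi(x - h)) / (2 * h)

var(sapply(1:1000, function(i) crn_diff(1, 0.1, seed = i)))
set.seed(1)
var(replicate(1000, indep_diff(1, 0.1)))   # typically much larger
```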
Zn+1 = Zn − anGn(Zn)
An easy way to reduce the variance of Φ(x) is to sample multiple times and average. There is heuristic evidence that this can speed up the algorithm, especially in the initial stages, but no theoretical guidelines as to how much (if any) averaging to use.
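For example, a short R sketch of this averaging idea; Φ here is an arbitrary noisy evaluator and the number of replicates m is a tuning choice, both illustrative assumptions.

```r
## Replace a single noisy evaluation of Phi by the average of m evaluations:
## the variance drops by a factor of m, at m times the simulation cost.
Phi     <- function(x) sum(x^2) + rnorm(1, sd = 0.1)   # placeholder noisy evaluator
Phi_avg <- function(x, m = 10) mean(replicate(m, Phi(x)))

set.seed(1)
c(var(replicate(2000, Phi(1))), var(replicate(2000, Phi_avg(1))))
```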
Conversely, if f is highly multi-modal or multi-saddled, then there is an advantage to having a noisy estimate of ∇f , in that it gives you a way of escaping local stationary points. In fact, you can show that by deliberately adding extra noise to a stochastic gradient you can guarantee convergence to a global minimum (see references in Spall 2003).
Iterate Averaging
Zn+1 = Zn − anGn(Zn)
A modification that has good (asymptotic) theoretical properties but is often underwhelming in (finite) practice, is iterate averaging.
We calculate the Z_n precisely as before, but as our n-th estimate of θ we take the average of the iterates after a burn-in of n_0 steps:
Z̄_n := (1/(n − n_0)) Σ_{k=n_0+1}^{n} Z_k.
Spall 2003 gives a good discussion of why theory and practice don’t match.
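A small R sketch of the iterate-averaged estimate, assuming the iterates Z_1, …, Z_n are stored as the rows of a matrix Z and n_0 is the chosen burn-in (both assumptions for illustration).

```r
## Iterate averaging: average the iterates Z_k after a burn-in of n0 steps.
iterate_average <- function(Z, n0) colMeans(Z[(n0 + 1):nrow(Z), , drop = FALSE])

Z <- matrix(rnorm(200), ncol = 2)   # dummy iterates, one per row, for illustration
iterate_average(Z, n0 = 50)
```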
Momentum (Gradient Smoothing)
Zn+1 = Zn − anGn(Zn)
Suppose γ_n ↓ 0, and replace G_n(Z_n) in the update by the smoothed gradient
Ḡ_n := (1 − γ_n) G_n(Z_n) + γ_n Ḡ_{n−1},
so that Z_{n+1} = Z_n − a_n Ḡ_n.
This is particularly useful when G_n(x) is volatile, and it can help mitigate the effect of banana-shaped trenches; that is, it can help you move away from points that are at, or close to, local stationary points.
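A minimal R sketch of gradient smoothing inside the descent loop; the finite-difference estimator G_n, the quadratic target and the γ_n sequence are all illustrative assumptions.

```r
## Stochastic gradient descent with gradient smoothing (momentum).
theta <- c(1, -2)
Phi <- function(x) sum((x - theta)^2) + rnorm(1, sd = 0.1)
Gn <- function(x, c_n) {              # placeholder finite-difference gradient estimate
  sapply(seq_along(x), function(i) {
    e <- numeric(length(x)); e[i] <- 1
    (Phi(x + c_n * e) - Phi(x - c_n * e)) / (2 * c_n)
  })
}

momentum_sgd <- function(z0, n_iter = 2000) {
  z <- z0
  G_bar <- numeric(length(z0))        # smoothed gradient, started at 0
  for (n in 1:n_iter) {
    a_n <- 1 / n
    c_n <- n^(-1/4)
    gamma_n <- (1 + n)^(-0.3)         # illustrative gamma_n, decreasing to 0
    G_bar <- (1 - gamma_n) * Gn(z, c_n) + gamma_n * G_bar
    z <- z - a_n * G_bar
  }
  z
}

set.seed(1)
momentum_sgd(c(5, 5))   # should end up close to theta
```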
Neural Nets
Zn+1 = Zn − anGn(Zn)
The back-propagation algorithm used to fit (feed-forward) neural nets is a steepest descent algorithm where the chain rule is used to calculate the gradient.
When the training set is uncomfortably large¹, calculating the gradient can be overly time-consuming, which suggests taking successive samples from the training set. This gives stochastic estimates of the gradient, so we have stochastic gradient descent.
This approach also allows you to use ideas from sampling theory to reduce the variance of the gradient estimates, for example by stratifying the training set and then sampling across strata.
¹ big data 🙂
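A minimal R sketch of the minibatch idea, using least-squares linear regression as a stand-in for a network (the data X, y, the batch size and the step size are all illustrative assumptions, not from the slides).

```r
## Minibatch stochastic gradient descent for least squares,
## f(w) = mean((y - X w)^2), as a stand-in for back-propagation on a network.
set.seed(1)
n <- 10000; d <- 5
X <- matrix(rnorm(n * d), n, d)
w_true <- rnorm(d)
y <- drop(X %*% w_true) + rnorm(n, sd = 0.1)

minibatch_sgd <- function(batch_size = 32, n_iter = 2000, a = 0.05) {
  w <- numeric(d)
  for (k in 1:n_iter) {
    idx <- sample(n, batch_size)        # successive samples from the training set
    Xb <- X[idx, , drop = FALSE]
    yb <- y[idx]
    grad <- drop(-2 * crossprod(Xb, yb - Xb %*% w) / batch_size)  # stochastic gradient
    w <- w - a * grad
  }
  w
}

cbind(estimate = minibatch_sgd(), truth = w_true)
```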