
Stochastic Search and Stochastic Gradients
Outline: Steepest ascent; Kiefer-Wolfowitz; Stochastic gradients; Simultaneous perturbations; Common random numbers; Averaging; Momentum; Neural nets.


Multivariate Optimisation
Let f : Rd → R and suppose that all of the first- and second-order partial derivatives of f exist and are continuous everywhere. We write x = (x1, . . . , xd)T for an element of Rd and ei for the i-th co-ordinate vector, so that x = x1 e1 + ··· + xd ed.
The i-th partial derivative at x will be denoted fi(x) = ∂f(x)/∂xi and we define the gradient
∇f(x) = (f1(x),…,fd(x))T and the Hessian
H(x), the d × d matrix whose (i, j)-th entry is ∂²f(x)/∂xi∂xj.
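As a quick check on these definitions, the gradient and Hessian can be approximated numerically by central differences. The R sketch below is not from the original notes; it uses the test function f that appears in the later examples, and the step sizes h are arbitrary illustrative choices.

# Central-difference approximations to the gradient and Hessian.
# f takes a vector x = (x1, x2); h is a small step size.
f <- function(x) sin(x[1]^2/2 - x[2]^2/4) * cos(2*x[1] - exp(x[2]))

num_grad <- function(f, x, h = 1e-5) {
  d <- length(x)
  g <- numeric(d)
  for (i in 1:d) {
    e <- rep(0, d); e[i] <- h
    g[i] <- (f(x + e) - f(x - e)) / (2*h)
  }
  g
}

num_hess <- function(f, x, h = 1e-4) {
  d <- length(x)
  H <- matrix(0, d, d)
  for (i in 1:d) for (j in 1:d) {
    ei <- rep(0, d); ei[i] <- h
    ej <- rep(0, d); ej[j] <- h
    H[i, j] <- (f(x + ei + ej) - f(x + ei - ej) -
                f(x - ei + ej) + f(x - ei - ej)) / (4*h^2)
  }
  H
}

num_grad(f, c(0.5, 0.5))
num_hess(f, c(0.5, 0.5))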

For any vector v ≠ 0, the slope at x in direction v is given by
vT∇f(x)/∥v∥.
The curvature at x in direction v is given by
vTH(x)v/∥v∥².
A necessary (but not sufficient) condition for a local maximum at x is that ∇f(x) = 0 = (0, …, 0)T and, for all v ≠ 0, the curvature at x in direction v is ≤ 0.
A sufficient (but not necessary) condition for f to have a local maximum at x is that ∇f(x) = 0 and the curvature at x in every direction is < 0.

Steepest ascent
Put x(n + 1) = x(n) + αv, where α is a positive scalar and the direction v is the direction with largest slope. That is, v maximises vT∇f(x(n))/∥v∥. Consider
∂/∂vi [vT∇f(x)/∥v∥] = fi(x)/∥v∥ − (vT∇f(x)) vi/∥v∥³.
Setting this to 0 we get vi ∝ fi(x), from which we see that at the point x the direction with largest slope is ∇f(x). Thus, the steepest ascent method has the form
x(n + 1) = x(n) + α∇f(x(n)), for some α ≥ 0.

Example: ascent.r
[Figure: output of ascent.r, steepest ascent applied to f(x,y) = sin(x^2/2 − y^2/4)*cos(2*x − exp(y)) with α = 0.2, shown on a contour plot.]

Kiefer-Wolfowitz 1952
Suppose that f(x) = EΦ(x) for x ∈ R, where f has a unique global maximum at θ, Var Φ(x) is bounded, and
1. f is Lipschitz in a neighborhood of θ,
2. f has bounded growth,
3. f′ is bounded away from 0 outside any neighborhood of θ.
For arbitrary Z0 put
Zn+1 = Zn + an (Φ2n − Φ2n−1)/cn,
where Φ2n−1 ∼ Φ(Zn − cn) and Φ2n ∼ Φ(Zn + cn) are independent. Then Zn → θ when
(a) cn → 0,
(b) Σ an = ∞,
(c) Σ an cn < ∞,
(d) Σ an²/cn² < ∞.
For example, these conditions hold when an = n^(−1) and cn = n^(−1/4).

Proof sketch
Zn+1 = Zn + an (Φ2n − Φ2n−1)/cn.
We show that bn := E(Zn − θ)² → 0 as n → ∞. Squaring the recursion and taking expectations,
bn+1 = bn + 2 (an/cn) E(Un⁺ + Un⁻) + (an²/cn²) E(Φ2n − Φ2n−1)²,
where Un⁺ and Un⁻ are the positive and negative parts of (Zn − θ)(f(Zn + cn) − f(Zn − cn)), with the sign convention Un⁺ ≥ 0 ≥ Un⁻. Iterating we get
bn+1 = b0 + 2 Σj≤n (aj/cj) E Uj⁺ + 2 Σj≤n (aj/cj) E Uj⁻ + Σj≤n (aj²/cj²) E(Φ2j − Φ2j−1)².

Consider the last term. We have
E(Φ2n − Φ2n−1)² = Var Φ2n + Var Φ2n−1 + E(f(Zn + cn) − f(Zn − cn))².
By assumption the right-hand side is bounded, thus if Σ an²/cn² < ∞ then the last term is finite.

Consider the second term:
E Un⁺ = E[(Zn − θ)(f(Zn + cn) − f(Zn − cn))]⁺.
Since f has a unique maximum at θ, if the integrand is strictly positive then Zn must be within distance cn of θ (draw a picture). Since f is Lipschitz near θ (by assumption), we have that
E[(Zn − θ)(f(Zn + cn) − f(Zn − cn))]⁺ ≤ α cn², for some α.
Thus if Σ an cn < ∞ then the second term is finite.

The left-hand side bn+1 is non-negative. The first, second and fourth terms on the right-hand side are non-negative and bounded, so the third term (which is non-positive) must be bounded below; that is, the series Σ (aj/cj) E Uj⁻ also converges. Combining the second and third terms, the series
Σ (aj/cj) (E Uj⁺ − E Uj⁻) = Σ (aj/cj) E|Zj − θ| |f(Zj + cj) − f(Zj − cj)|
converges. Thus, since Σ an diverges, there must be a subsequence {nj} along which
E|Znj − θ| |f(Znj + cnj) − f(Znj − cnj)| / cnj → 0.
By assumption, f′ is bounded away from 0 outside of any neighborhood of θ. It follows that E|Znj − θ| → 0.
Returning to our sum, for n > m we have (dropping the non-positive Uj⁻ terms)
bn ≤ bm + 2 Σj≥m (aj/cj) E Uj⁺ + Σj≥m (aj²/cj²) E(Φ2j − Φ2j−1)².
We can choose m = nj so that bm is arbitrarily small, and we have already seen that the two sums go to 0 as m → ∞, so bn → 0.

Improving Kiefer-Wolfowitz
It was later shown (1954) that the condition Σn an cn < ∞ is not needed for convergence. However, Broadie, Cicek and Zeevi (2011) showed that asymptotically the MSE is minimised by taking an ∼ n^(−1) and cn ∼ n^(−1/4), and for this choice Σn an cn < ∞ anyway.

Stochastic Gradient Descent
For f : Rd → R, a stochastic gradient descent algorithm for finding a local minimum has the form
Zn+1 = Zn − an Gn(Zn),
where Gn(x) is a (stochastic) approximation to ∇f(x).
If we can sample from Φ(x), where EΦ(x) = f(x), then we can use finite differences to estimate ∇f(x). That is, for the i-th element of Gn we have
[Gn(x)]i = (Φ(x + cn ei) − Φ(x − cn ei)) / (2 cn),
where ei is the i-th co-ordinate vector and cn is small.
As for Kiefer-Wolfowitz, we can obtain conditions on {an} and {cn} for convergence.

Example: ascent_stoch_grad.r
[Figure: output of ascent_stoch_grad.r, stochastic gradient ascent on f(x,y) = sin(x^2/2 − y^2/4)*cos(2*x − exp(y)) with added white noise of variance 0.1², an = 0.2, cn = 0.3, shown on a contour plot.]

Simultaneous Perturbations
Spall 1992
Zn+1 = Zn − an Gn(Zn).
Using finite differences to estimate ∇f requires 2d evaluations of Φ per iteration, which is unnecessarily expensive if the dimension d is large.
Let ∆n be i.i.d. random vectors in Rd, where each has i.i.d. components [∆n]i, each equally likely to take the values ±1. The simultaneous perturbations estimate Gn(x) of ∇f(x) has i-th component
[Gn(x)]i = (Φ(x + cn ∆n) − Φ(x − cn ∆n)) / (2 cn [∆n]i).
This requires only 2 evaluations of Φ, yet it turns out that it gives the same asymptotic convergence as using finite differences to estimate ∇f.

Simultaneous Perturbations
Convergence conditions (simplified)
Zn+1 = Zn − an Gn(Zn), [Gn(x)]i = (Φ(x + cn ∆n) − Φ(x − cn ∆n))/(2 cn [∆n]i).
Suppose that f(x) = EΦ(x), our observations of Φ(·) are independent, and Var Φ(x) is bounded. Also suppose that f is three times continuously differentiable, has a unique minimum at θ, and that
dx(t)/dt = −∇f(x(t))
has an asymptotically stable equilibrium point at θ. Then Zn → θ almost surely when
(a) an ↓ 0 and cn ↓ 0,
(b) Σn an = ∞,
(c) Σn an²/cn² < ∞.

Simultaneous Perturbations
Convergence rate
Zn+1 = Zn − an Gn(Zn), [Gn(x)]i = (Φ(x + cn ∆n) − Φ(x − cn ∆n))/(2 cn [∆n]i).
Suppose that in addition to the convergence conditions we have
an = a/(n + 1 + A)^α, cn = c/(n + 1)^γ,
where a, c, α, γ > 0, A ≥ 0 and 6γ ≥ α ≥ 2γ. Then as n → ∞,
n^(α/2 − γ) (Zn − θ) converges in distribution to N(μ, Σ), for some μ and Σ.
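Before turning to how the gain sequences should be tuned, here is a minimal R sketch of the finite-difference stochastic gradient scheme described above (the coordinate-wise analogue of the Kiefer-Wolfowitz recursion). The quadratic objective and the noise level are illustrative choices, not from the notes; the gains an = n^(−1), cn = n^(−1/4) are those from the Kiefer-Wolfowitz example.

# Finite-difference stochastic gradient descent: minimise f(x) = E Phi(x)
# using 2d noisy evaluations of Phi per iteration.
theta <- c(1, -2)                                # true minimiser (illustrative)
Phi   <- function(x) sum((x - theta)^2) + rnorm(1, sd = 0.1)

fd_grad <- function(x, c_n) {
  d <- length(x)
  sapply(1:d, function(i) {
    e <- rep(0, d); e[i] <- c_n
    (Phi(x + e) - Phi(x - e)) / (2 * c_n)
  })
}

sgd_fd <- function(z0, N = 2000) {
  z <- z0
  for (n in 1:N) {
    a_n <- 1 / n          # a_n = n^(-1)
    c_n <- n^(-1/4)       # c_n = n^(-1/4)
    z <- z - a_n * fd_grad(z, c_n)
  }
  z
}

set.seed(1)
sgd_fd(z0 = c(0, 0))   # should end up near theta = (1, -2)

Note that each iteration uses 2d evaluations of Φ, which is exactly the cost that simultaneous perturbations avoids.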

Simultaneous Perturbations
Suggested gain sequences
ak = a/(k + 1 + A)^α, ck = c/(k + 1)^γ.
Asymptotically it is optimal to take
α = 1, γ = 1/6,
which gives asymptotic convergence speed n^(−1/3). In practice we only have a finite number of samples, and Spall (2003) recommends
α = 0.602, γ = 0.101,
which gives asymptotic convergence speed n^(−1/5).
This choice gives a slower decay of an than the asymptotically optimal choice. Spall also suggests switching from his recommendation to the asymptotically optimal exponents after some initial number of iterations.

Simultaneous Perturbations
Suggested gain sequences
ak = a/(k + 1 + A)^α, ck = c/(k + 1)^γ.
Spall (2003) gives as a rule of thumb:
• c ≈ √(Var Φ(θ)), that is, roughly the standard deviation of the noise in Φ;
• A ≤ 1/10 of the maximum number of iterations;
• choose a so that a0 G0(Z0) is a sensible initial step size.
In general with more noise we should decrease the step size (that is, decrease a).
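Putting the pieces together, here is a minimal R sketch of the simultaneous-perturbation recursion with Spall's practical exponents α = 0.602 and γ = 0.101. The quadratic objective, the noise level and the constants a, c and A below are illustrative choices (A is no more than 10% of the number of iterations, following the rule of thumb above); this is a sketch only, not the ascent_sim_perturb.r used in the example that follows.

# SPSA (Spall 1992): minimise f(x) = E Phi(x) using only two noisy
# evaluations of Phi per iteration, whatever the dimension d.
theta <- seq(-2, 2, length.out = 10)             # true minimiser (illustrative)
Phi   <- function(x) sum((x - theta)^2) + rnorm(1, sd = 0.1)

spsa <- function(z0, N = 2000, a = 0.1, c = 0.1, A = 100,
                 alpha = 0.602, gamma = 0.101) {
  z <- z0
  d <- length(z0)
  for (n in 0:(N - 1)) {
    a_n   <- a / (n + 1 + A)^alpha
    c_n   <- c / (n + 1)^gamma
    delta <- sample(c(-1, 1), d, replace = TRUE)   # Rademacher perturbation
    g     <- (Phi(z + c_n * delta) - Phi(z - c_n * delta)) / (2 * c_n * delta)
    z     <- z - a_n * g
  }
  z
}

set.seed(1)
spsa(z0 = rep(0, 10))   # should end up near theta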

Example: ascent_sim_perturb.r
[Figure: output of ascent_sim_perturb.r, simultaneous-perturbation ascent on f(x,y) = sin(x^2/2 − y^2/4)*cos(2*x − exp(y)) with added white noise of variance 0.1², a = 0.1 × 11^0.602, c = 0.1, A = 10, shown on a contour plot.]

Common random numbers
[Gn(x)]i = (Φ(x + cn ∆n) − Φ(x − cn ∆n))/(2 cn [∆n]i)
If Φ(x) is generated by simulation then we can write it as
Φ(x) = φ(x;V1,V2,…) = φ(x;V)
where the Vi are a sequence of computer generated
pseudo-random variables.
Typically we find that if we use the same random sequence V and choose x and y close together, then φ(x; V) and φ(y; V) will be positively correlated.
Thus, we expect that φ(x + h; V) − φ(x − h; V) will have the same mean but smaller variance than φ(x + h; V) − φ(x − h; W), where W is independent of V and has the same distribution.
At the very least we can achieve this by resetting the uniform random number generator to the same place before generating Φ(x + h) and Φ(x − h).
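A minimal R illustration of the common random numbers idea: resetting the generator to the same state before the two evaluations makes φ(x + h; V) and φ(x − h; V) share the same underlying randomness. The simulation Phi below is an arbitrary illustrative example.

# Phi(x) is an illustrative simulation: a Monte Carlo estimate of
# E[(x + V)^2] based on 1000 standard normal draws.
Phi <- function(x) mean((x + rnorm(1000))^2)

x <- 1; h <- 0.01

# Independent randomness: the difference is dominated by simulation noise.
d_indep <- replicate(1000, Phi(x + h) - Phi(x - h))

# Common random numbers: reset the generator to the same state before
# each of the two evaluations.
d_crn <- replicate(1000, {
  seed <- sample.int(.Machine$integer.max, 1)
  set.seed(seed); up   <- Phi(x + h)
  set.seed(seed); down <- Phi(x - h)
  up - down
})

mean(d_indep); var(d_indep)   # same mean ...
mean(d_crn);   var(d_crn)     # ... but much smaller variance with CRN

The two estimators have the same mean, but the common-random-numbers version has a far smaller variance because most of the simulation noise cancels in the difference.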

Averaging
Zn+1 = Zn − an Gn(Zn)
An easy way to reduce the variance of Φ(x) is to sample multiple times and average. There is heuristic evidence that this can speed up the algorithm, especially in the initial stages, but no theoretical guidelines as to how much (if any) averaging to use.
Conversely, if f is highly multi-modal or multi-saddled, then there is an advantage to having a noisy estimate of ∇f , in that it gives you a way of escaping local stationary points. In fact, you can show that by deliberately adding extra noise to a stochastic gradient you can guarantee convergence to a global minimum (see references in Spall 2003).
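As a sketch of the averaging idea above: replace each evaluation of Φ by the average of m independent evaluations, which reduces the variance by a factor of m at the cost of m times as many simulations per gradient estimate. The objective and the choice m = 10 below are illustrative.

# Illustrative noisy objective with unit-variance noise.
Phi <- function(x) sum(x^2) + rnorm(1)

# Average m independent evaluations of Phi at x.
Phi_avg <- function(x, m = 10) mean(replicate(m, Phi(x)))

var(replicate(2000, Phi(c(1, 2))))       # roughly 1
var(replicate(2000, Phi_avg(c(1, 2))))   # roughly 1/10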

Iterate Averaging
Zn+1 = Zn − anGn(Zn)
A modification that has good (asymptotic) theoretical properties but is often underwhelming in (finite) practice, is iterate averaging.
We calculate the Zn precisely as before, but as our n-th estimate of θ we take
Z̄n := (Zn0+1 + Zn0+2 + ··· + Zn)/(n − n0),
for some burn-in point n0 < n.
Spall 2003 gives a good discussion of why theory and practice don’t match.
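A minimal sketch of iterate averaging, assuming the iterates Z1, …, Zn have been stored as the rows of a matrix Z and a burn-in point n0 has been chosen; both are assumed inputs here, not prescriptions from the notes.

# Iterate averaging: run the usual recursion, keep the iterates, and
# report the average of Z_{n0+1}, ..., Z_n as the estimate of theta.
iterate_average <- function(Z, n0) {
  n <- nrow(Z)
  colMeans(Z[(n0 + 1):n, , drop = FALSE])
}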

Momentum (Gradient Smoothing)
Zn+1 = Zn − anGn(Zn)
Suppose γn ↓ 0, then replace Gn(Zn) by
Ḡn := (1 − γn) Gn(Zn) + γn Ḡn−1.
Gradient smoothing is particularly useful when Gn(x) is volatile, and can help mitigate the effect of banana-shaped trenches; that is, it can help you escape from local stationary points (or points that are close to stationary).
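A minimal R sketch of gradient smoothing, assuming grad_est(z) returns a noisy estimate of the gradient at z; the gains an = a/n and the smoothing weights γn = 1/(n + 1) are illustrative choices, not part of the notes.

# Stochastic descent with gradient smoothing:
# G_bar_n = (1 - gamma_n) * G_n(Z_n) + gamma_n * G_bar_{n-1}.
sgd_momentum <- function(z0, grad_est, N = 1000, a = 0.5) {
  z <- z0
  g_smooth <- rep(0, length(z0))
  for (n in 1:N) {
    a_n     <- a / n
    gamma_n <- 1 / (n + 1)                 # gamma_n decreasing to 0
    g_smooth <- (1 - gamma_n) * grad_est(z) + gamma_n * g_smooth
    z <- z - a_n * g_smooth
  }
  z
}

# Example with a noisy quadratic gradient (illustrative):
set.seed(1)
grad_est <- function(z) 2 * z + rnorm(length(z), sd = 0.5)
sgd_momentum(z0 = c(3, -3), grad_est)   # ends up close to the minimiser (0, 0)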

Neural Nets
Zn+1 = Zn − anGn(Zn)
The back-propagation algorithm used to fit (feed-forward) neural nets is a steepest descent algorithm where the chain rule is used to calculate the gradient.
When the training set is uncomfortably large¹, calculating the gradient can be overly time consuming, which suggests taking successive samples from the training set. This gives stochastic estimates of the gradient, so we have stochastic descent.
This approach also allows you to use ideas from sampling theory to reduce the variance of the gradient estimates, for example stratifying the training set and then sampling across strata.
¹ big data 🙂
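A sketch of the minibatch idea, assuming a training set of n_obs examples, a per-example gradient function grad_i(z, i), and a vector of stratum labels strata (for example, class labels); all of these names and the batch sizes are illustrative, not part of the notes.

# Minibatch stochastic gradient: at each step estimate the full-data
# gradient from a small random sample of the training set.

# Plain minibatch: sample b indices uniformly without replacement.
minibatch_grad <- function(z, grad_i, n_obs, b = 32) {
  idx <- sample.int(n_obs, b)
  Reduce(`+`, lapply(idx, function(i) grad_i(z, i))) / b
}

# Stratified minibatch: sample the same number of examples from each
# stratum, then combine the per-stratum averages.
stratified_grad <- function(z, grad_i, strata, b_per_stratum = 8) {
  groups <- split(seq_along(strata), strata)
  n_obs  <- length(strata)
  # Weight each stratum's average gradient by the stratum's share of the
  # data, so the estimate is unbiased for the full-data gradient.
  Reduce(`+`, lapply(groups, function(s) {
    idx <- sample(s, b_per_stratum)
    (length(s) / n_obs) *
      Reduce(`+`, lapply(idx, function(i) grad_i(z, i))) / b_per_stratum
  }))
}

The stratified version weights each stratum's average by its share of the data, so the estimate remains unbiased for the full-data gradient while typically having smaller variance than uniform sampling.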
