J. C. SPALL
An Overview of the Simultaneous Perturbation Method for Efficient Optimization
Multivariate stochastic optimization plays a major role in the analysis and control of many engineering systems. In almost all real-world optimization problems, it is necessary to use a mathematical algorithm that iteratively seeks out the solution because an analytical (closed-form) solution is rarely available. In this spirit, the “simultaneous perturbation stochastic approximation (SPSA)” method for difficult multivariate optimization problems has been developed. SPSA has recently attracted considerable international attention in areas such as statistical parameter estimation, feedback control, simulation-based optimization, signal and image processing, and experimental design. The essential feature of SPSA—which accounts for its power and relative ease of implementation—is the underlying gradient approximation that requires only two measurements of the objective function regardless of the dimension of the optimization problem. This feature allows for a significant decrease in the cost of optimization, especially in problems with a large number of variables to be optimized. (Keywords: Gradient approximation, Multivariate optimization, Simulation-based optimization, Simultaneous perturbation stochastic approximation, SPSA, Stochastic optimization.)
INTRODUCTION
Copyright By PowCoder代写 加微信 powcoder
This article is an introduction to the simultaneous perturbation stochastic approximation (SPSA) algo- rithm for stochastic optimization of multivariate systems. Optimization algorithms play a critical role in the design, analysis, and control of most engineering systems and are in widespread use in the work of APL and other organizations:
The future, in fact, will be full of [optimization] algorithms. They are becoming part of almost everything. They are moving up the complexity chain to make entire companies more efficient. They also are moving down the chain as computers spread. (USA Today, 31 Dec 1997)
Before presenting the SPSA algorithm, we provide some general background on the stochastic optimiza- tion context of interest here.
482 JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 19, NUMBER 4 (1998)
SIMULTANEOUS PERTURBATION METHOD FOR OPTIMIZATION
The mathematical representation of most optimiza- tion problems is the minimization (or maximization) of some scalar-valued objective function with respect to a vector of adjustable parameters. The optimization algorithm is a step-by-step procedure for changing the adjustable parameters from some initial guess (or set of guesses) to a value that offers an improvement in the objective function. Figure 1 depicts this process for a very simple case of only two variables, 1 and 2, where our objective function is a loss function to be minimized (without loss of generality, we will discuss optimization in the context of minimization because a maximization problem can be trivially converted to a minimization problem by changing the sign of the objective func- tion). Most real-world problems would have many more variables. The illustration in Fig. 1 is typical of a sto- chastic optimization setting with noisy input informa- tion because the loss function value does not uniformly decrease as the iteration process proceeds (note the temporary increase in the loss value in the third step of the algorithm). Many optimization algorithms have been developed that assume a deterministic setting and that assume information is available on the gradient vector associated with the loss function (i.e., the gra- dient of the loss function with respect to the parameters being optimized). However, there has been a growing interest in recursive optimization algorithms that do not depend on direct gradient information or measure- ments. Rather, these algorithms are based on an ap- proximation to the gradient formed from (generally noisy) measurements of the loss function. This interest has been motivated, for example, by problems in the adaptive control and statistical identification of com- plex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy
sensor data, and the design of complex queuing and discrete-event systems. This article focuses on the case where such an approximation is going to be used as a result of direct gradient information not being readily available.
Overall, gradient-free stochastic algorithms exhibit convergence properties similar to the gradient-based stochastic algorithms [e.g., Robbins-Monro1 stochastic approximation (R-M SA)] while requiring only loss function measurements. A main advantage of such algorithms is that they do not require the detailed knowledge of the functional relationship between the parameters being adjusted (optimized) and the loss function being minimized that is required in gradient- based algorithms. Such a relationship can be notorious- ly difficult to develop in some areas (e.g., nonlinear feedback controller design), whereas in other areas (such as Monte Carlo optimization or recursive statis- tical parameter estimation), there may be large compu- tational savings in calculating a loss function relative to that required in calculating a gradient.
Let us elaborate on the distinction between algo- rithms based on direct gradient measurements and those based on gradient approximations from measurements of the loss function. The prototype gradient-based algorithm is R-M SA, which may be considered a gen- eralization of such techniques as deterministic steepest descent and Newton–Raphson, neural network back- propagation, and infinitesimal perturbation analysis– based optimization for discrete-event systems. The gra- dient-based algorithms rely on direct measurements of the gradient of the loss function with respect to the parameters being optimized. These measurements typ- ically yield an estimate of the gradient because the underlying data generally include added noise. Because it is not usually the case that one would obtain direct measurements of the gradient (with or without added noise) naturally in the course of operating or simulating a system, one must have detailed knowledge of the underlying system input–output relationships to calcu- late the R-M gradient estimate from basic system output measurements. In contrast, the approaches based on gradient approximations require only conversion of the basic output measurements to sample values of the loss function, which does not require full knowledge of the system input–output relationships. The classical method for gradient-free stochastic optimization is the Kiefer– Wolfowitz finite-difference SA (FDSA) algorithm.2
Because of the fundamentally different information needed in implementing these gradient-based (R-M) and gradient-free algorithms, it is difficult to construct meaningful methods of comparison. As a general rule, however, the gradient-based algorithms will be faster to converge than those using loss function–based gradient approximations when speed is measured in number of iterations. Intuitively, this result is not surprising given
Height of vertical line after each optimization step denotes loss function value
Initial guess
L ( 1 , 2 )
Figure 1. Example of stochastic optimization algorithm minimiz- ing loss function L(1, 2).
JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 19, NUMBER 4 (1998) 483
J. C. SPALL
the additional information required for the gradient-
based algorithms. In particular, on the basis of asymptotic
theory, the optimal rate of convergence measured in
terms of the deviation of the parameter estimate from
the true optimal parameter vector is of order k–1/2 for
the gradient-based algorithms and of order k–1/3 for the
algorithms based on gradient approximations, where k
represents the number of iterations. (Special cases exist
where the maximum rate of convergence for a non-
gradient algorithm is arbitrarily close to, or equal to, k–1/2.)
In practice, of course, many other factors must be considered in determining which algorithm is best for a given circumstance for the following three reasons: (1) It may not be possible to obtain reliable knowledge of the system input–output relationships, implying that the gradient-based algorithms may be either infeasible (if no system model is available) or undependable (if a poor system model is used). (2) The total cost to achieve effective convergence depends not only on the number of iterations required, but also on the cost needed per iteration, which is typically greater in gradient-based algorithms. (This cost may include greater computational burden, additional human effort required for determining and coding gradients, and experimental costs for model building such as labor, materials, and fuel.) (3) The rates of convergence are based on asymptotic theory and may not be represen- tative of practical convergence rates in finite samples. For these reasons, one cannot say in general that a gradient-based search algorithm is superior to a gradient approximation-based algorithm, even though the gradient-based algorithm has a faster asymptotic rate of convergence (and with simulation-based optimization such as infinitesimal perturbation analysis requires only one system run per iteration, whereas the approximation- based algorithm may require multiple system runs per iteration). As a general rule, however, if direct gradient information is conveniently and reliably available, it is generally to one’s advantage to use this information in the optimization process. The focus in this article is the case where such information is not readily available.
FDSA AND SPSA ALGORITHMS
This article considers the problem of minimizing a (scalar) differentiable loss function L(), where is a p-dimensional vector and where the optimization problem can be translated into finding the minimizing * such that ∂L/∂ = 0. This is the classical formulation of (local) optimization for differentiable loss functions. It is assumed that measurements of L() are available at various values of . These measurements may or may not include added noise. No direct measurements of ∂L/∂ are assumed available, in contrast to the R-M framework. This section will describe the FDSA and SPSA algorithms. Although the emphasis of this article is SPSA, the FDSA discussion is included for compar- ison because FDSA is a classical method for stochastic optimization.
The SPSA and FDSA procedures are in the general recursive SA form:
The next section describes SPSA and the related
FDSA algorithm. Then some of the theory associated
with the convergence and efficiency of SPSA is
summarized. The following section is an illustration of
the implications of the theory in an example related to
neural networks. Then practical guidelines for
implementation are presented, followed by a summary
of some ancillary results and some extensions of the
algorithm. Not covered here are global optimization
methods such as genetic algorithms and simulated
annealing; Spall presents some discussion of such gˆk(k) (i = 1, 2,…, p) for a two-sided finite-difference methods in the context of stochastic approximation. approximation is given by
k + 1 = k – a k gˆ k ( k ) ,
where gˆk(k) is the estimate of the gradient g()
∂L/∂ at the iterate ˆk based on the previously men- tioned measurements of the loss function. Under appro- priate conditions, the iteration in Eq. 1 will converge to * in some stochastic sense (usually “almost surely”) (see, e.g., Fabian4 or Kushner and Yin5).
The essential part of Eq. 1 is the gradient approx-
imation gˆk(k). We discuss the two forms of interest
here. Let y(·) denote a measurement of L(·) at a design
level represented by the dot (i.e., y(·) =L(·) + noise)
and ck be some (usually small) positive number. One-
sided gradient approximations involve measurements
y(k) and y(k+perturbation), whereas two-sided
gradient approximations involve measurements of the form y(ˆk ± perturbation). The two general forms of gradient approximations for use in FDSA and SPSA are finite difference and simultaneous perturbation, respec- tively, which are discussed in the following paragraphs.
For the finite-difference approximation, each com-
ponent of ˆk is perturbed one at a time, and corre-
sponding measurements y(·) are obtained. Each com-
ponent of the gradient estimate is formed by
differencing the corresponding y(·) values and then
dividing by a difference interval. This is the standard
approach to approximating gradient vectors and is
motivated directly from the definition of a gradient as
a vector of p partial derivatives, each constructed as the
limit of the ratio of a change in the function value over
a corresponding change in one component of the
argument vector. Typically, the ith component of 3ˆ
484 JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 19, NUMBER 4 (1998)
ˆ y(k +ckei)–y(k –ckei)
at APL, where SPSA is providing a solution to a prob- lem that appeared intractable using other available methods. In addition, to illustrate some of the other possible applications, we close with a brief summary of some additional projects based on SPSA, most of which have been developed at other institutions.
Signal Timing for Vehicle Traffic Control
A long-standing problem in traffic engineering is to optimize the flow of vehicles through a given road network (Fig. 2). Improving the timing of the traffic signals at intersections in a network is generally the most powerful and cost-effective means of achieving this goal. However, because of the many complex as- pects of a traffic system—human behavioral consider- ations, vehicle flow interactions within the network, weather effects, traffic accidents, long-term (e.g., sea- sonal) variation, etc.—it has been notoriously difficult to determine the optimal signal timing, especially on a system-wide (multiple intersection) basis. Much of this difficulty has stemmed from the need to build extremely complex models of the traffic dynamics as a component of the control strategy. A “strategy” in this context is a set of rules providing real-time signal tim- ing in response to minute-by-minute changes in the traffic conditions. The APL approach is fundamentally different from those in existence in that it eliminates the need for such complex models. SPSA is central to the approach by providing a means for making small simultaneous changes to all the signal timings in a network and using the information gathered in this way to update the system-wide timing strategy. By avoiding conventional “one-signal-at-a-time” changes to the signal timing strategies, the time it would take to pro- duce an overall optimal strategy for the system is re- duced from years or decades (obviously impractical!) to several months (quite reasonable). Note that, unlike the two examples that follow, the savings here is not computational per se, but is inherent in the need for data on a daily basis (and hence represents a reduction in physical experimental costs such as labor and time). This approach is described in detail in Spall and Chin6 and Chin et al.,7 including realistic simulations of a 9- intersection network within the central business dis- trict of Manhattan, , and a 12-intersection network in , Maryland.
Optimal Targeting of Weapon Systems
This is an example of the use of simulations to optimize processes, something done in a wide range of DoD and non-DoD applications. More specifically, given a number of projectiles that are going to be di- rected at a target, the problem is to optimally select a set of aim points with the goal of maximizing damage to the target while minimizing so-called collateral
gˆki(k)= 2c , (2) k
2c , k ki
SIMULTANEOUS PERTURBATION METHOD FOR OPTIMIZATION
where ei denotes a vector with a one in the ith place and zeros elsewhere (an obvious analogue holds for the one-sided version; likewise for the simultaneous pertur- bation form below), and ck denotes a small positive number that usually gets smaller as k gets larger.
The simultaneous perturbation approximation has
all elements of ˆk randomly perturbed together to ob-
tain two measurements of y(·), but each component ˆ
gˆki(k) is formed from a ratio involving the individual components in the perturbation vector and the differ- ence in the two corresponding measurements. For two- sided simultaneous perturbation, we have
ˆ gˆki(k)=
y(k +ckk)–y(k –ckk)
where the distribution of the user-specified p-dimen- sional random perturbation vector, k = (k1, k2,…, kp)T, satisfies conditions discussed later in this article (superscript T denotes vector transpose).
Note that the number of loss function measurements y(·) needed in each iteration of FDSA grows with p, whereas with SPSA, only two measurements are need- ed independent of p because the numerator is the same in all p components. This circumstance, of course, pro- vides the potential for SPSA to achieve a large savings (over FDSA) in the total number of measurements required to estimate when p is large. This potential is realized only if the number of iterations required for effective convergence to * does not increase in a way to cancel the measurement savings per gradient approx- imation at each iteration. A later section of this article will discuss this efficiency issue further, demonstrating when this potential can be realized by establishing that:
SELECTED APPLICATIONS OF SPSA
The efficiency issue mentioned in the preceding section (and treated in more detail in the next section) has profound implications for practical multivariate optimization. Many problems that formerly may have been considered intractable with conventional (say, FDSA) methods, may now be solvable. In this section, we summarize three distinct examples, based on work
Under reasonably general conditions, SPSA and FDSA achieve the same level of statistical accuracy for a given number of iterations, even though SPSA uses p times fewer function evaluations than FDSA (because each gradient approximation uses only 1/p the number of function evaluations).
JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 19, NUMBER 4 (1998) 485
J. C. SPALL
Figure 2. Overall system-wide traffic control concept. Traffic control center provides timing information to signals in traffic network; information on traffic flow is fed back to traffic control center.
damage (damage to sensitive locations not directly associated with the military mission, e.g., schools and hospitals). The projectiles have inherent random vari- ation and may be statistically dependent. So the tar- geter must allow for the statistical variation between the aim point and actual impact point in developing a strategy for determining aim points. In such cases it is desirable to use patterning of multiple projectiles. “Patterning” in this context means aiming the projec- tiles at a set of points that may not overlap each other or be within the target boundaries. Figure 3 illustrates the concept for one target and five projectiles; a bias away from the “stay-out zone” (e.g., a civilian facility) is apparent in this case where it is desired to destroy the target while minimizing the chances of producing dam- age to the stay-out zone. For scenarios with many pro- jectiles that are independently targeted, the damage function—which must be evaluated to find the best aim point pattern—is likely to be analytically unavailable and will require estimation by Monte Carlo simulation. In particular, to evaluate the effectiveness of a given set of aim points, it is necessary to run one or more
simulations (recall that there is random variation in the outcome of one “volley” of projectiles corresponding to one simulation). Commonly used techniques for solv- ing the optimization problem by simulation are com- putationally intensive and prohibitivel
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com