Methodology and Computing in Applied Probability, 2:3, 271–291, 2000
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.
Nested Partitions Method for Stochastic Optimization
LEYUAN SHI leyuan@ie.engr.wisc.edu
Department of Industrial Engineering, University of Wisconsin–Madison, 1513 University Avenue, Madison, WI 53706
SIGURDUR ÓLAFSSON olafsson@iastate.edu
Department of Industrial and Manufacturing Systems Engineering, Iowa State University, 2019 Black Engineering, Ames, IA 50011
Received September 28, 1998; Revised June 26, 2000; Accepted July 18, 2000
Abstract. The nested partitions (NP) method is a recently proposed new alternative for global optimization. Primarily aimed at problems with large but finite feasible regions, the method employs a global sampling strategy that is continuously adapted via a partitioning of the feasible region. In this paper we adapt the original NP method to stochastic optimization where the performance is estimated using simulation. We prove asymptotic convergence of the new method and present a numerical example to illustrate its potential.
Keywords: simulation-based optimization, combinatorial optimization, Markov chain Monte Carlo.
AMS 1991 Subject Classification: 78M50, 78M35
1. Introduction
Optimizing the configuration of complex discrete event systems such as computer systems, communication networks, and manufacturing systems is often a daunting task. Many factors contribute to this difficulty, including the combinatorial explosion of alternatives and lack of structure available to guide the search. In the deterministic context, this often leads to NP-hard optimization problems. In the case of stochastic optimization, the difficulty is exacerbated by the added randomness, and simulation is often the only tool available for optimizing such stochastic systems.
Simulation optimization methods can be traced to two seminal papers by Robbins and Monro (1951) and Kiefer and Wolfowitz (1952). These papers gave rise to the family of techniques called stochastic approximation (SA) methods, and the interested reader is referred to Benveniste, Métivier, and Priouret (1990) for a discussion of such methods. In recent years, simulation-based optimization has been greatly advanced through the use of two classes of methods that extract a significantly increased amount of information from a single simulation run. These techniques, perturbation analysis (PA) methods (Ho and Cao, 1991; Glasserman, 1991; Fu and Hu, 1997), and score function (SF) or likelihood ratio (LR) methods (Glynn, 1990; Rubinstein and Shapiro, 1993), make it possible to obtain not
only a system performance estimate, but also all sensitivities from a single sample path. The PA methods may be used along with a SA type procedure for simulation-based stochastic optimization (L’Ecuyer and Glynn, 1994), and the SF technique can be applied for stochastic optimization, without having to resort to an iterative procedure and multiple simulations, by replacing the optimization problem with its stochastic counterpart, which can then be solved using the powerful arsenal of mathematical programming (Rubinstein, 1991; Rubinstein and Shapiro, 1990). Similar ideas are employed in the sample-path optimization method, that also uses mathematical programming methods to solve stochastic optimization problems with only a single sample path evaluated (Robinson, 1996). All of these methods are primarily intended for continuous stochastic optimization.
In this paper, we consider discrete stochastic optimization problems, in particular problems with a finite feasible region. Applications that involve discrete parameters and an uncertain performance measure are abundant in numerous areas, but despite its practical interest, the general discrete stochastic optimization problem has received relatively little attention to date. Recent methods proposed for this problem include simulated annealing (Gelfand and Mitter, 1989), the stochastic ruler method (Yan and Mukai, 1992), the stochastic comparison method (Gong, Ho, and Zhai, 1992), the cross-entropy method (Rubinstein, 1998), and the method of Andradóttir (1995, 1996). In each of the above methods a Markov chain is constructed and convergence almost surely or in probability is proved by analyzing the long-run behavior of the Markov chain. Such optimization methods can be traced back to the Metropolis method (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, 1953), which was later generalized by Hastings (1970). The well-known simulated annealing algorithm, which imitates the physical process of annealing, is based on the same concept (Kirkpatrick, Gelatt, and Vecchi, 1983). The stochastic ruler method resembles simulated annealing, but instead of solving the original problem directly, it solves an alternative problem that agrees with the original under certain conditions. If the two problems agree, the method converges in probability to a global optimum. The stochastic comparison method is also based on ideas similar to the stochastic ruler and simulated annealing algorithms. The method of Andradóttir (1995) can be considered a discrete stochastic approximation method. It compares two neighboring points in each iteration and moves to the point that is found to be better.
Other related literature, which also deals with finite stochastic optimization, includes ranking-and-selection and multiple-comparison methods (Bechhofer, Santner, and Goldsman, 1995), which are typically used for comparing a small number of alternatives, multi-armed bandit methods (Berry and Fristedt, 1985), learning automata (Narendra and Thathachar, 1989), and the low-dispersion point set method of Yakowitz, L'Ecuyer, and Vázquez-Abad (1999). Also of relevance to the analysis here is the recent work of Norkin, Pflug, and Ruszczyński (1998), where the authors extend the classical deterministic branch-and-bound method to stochastic problems. Finally, related to ranking-and-selection procedures, but aimed at large-scale optimization, is ordinal optimization (Ho, Sreenivas, and Vakili, 1992; Dai, 1996). This methodology differs from traditional cardinal optimization methods in that its aim is only to find feasible points with performance that places them in a given percentile of all feasible points.
In a recent paper, Shi and Ólafsson (2000) introduce a new optimization method, the nested partitions (NP) method, for global optimization when the objective function can be
evaluated deterministically. This method has been found to be quite efficient for various combinatorial optimization problems (Shi, Ólafsson, and Sun, 1999; Ólafsson and Shi, 2000). In this paper we present a stochastic NP method that can be applied when no analytical expression exists for the objective function, and it is estimated through simulation or real-time observation. The only modification needed from the original algorithm is in the way the optimum is estimated, and it is hence readily adapted from the deterministic algorithm. The basic idea behind the NP method is to adaptively sample from the feasible region. To adapt the sampling, the feasible region is systematically partitioned and the sampling concentrated in the subset that is considered the most promising. A somewhat similar partitioning and random sampling scheme is proposed in Tang (1994), but that approach is directed towards continuous problems in the deterministic context, and does not move between fixed regions as the NP method does. We show that the NP algorithm generates an ergodic Markov chain and use this to develop conditions under which the NP method converges to an optimal solution with probability one.
The remainder of this paper is organized as follows. In Section 2 we give a complete description of the NP method. In Section 3 we prove the asymptotic convergence of the method. In Section 4 we present a numerical example to illustrate the method, and finally, Section 5 contains some concluding remarks.
2. The Nested Partitions Method
In mathematical notation, we want to solve the problem
min_{θ∈Θ} J(θ),   (1)

where Θ is a finite feasible region, and J : Θ → R is a performance function that is subject to noise. In other words, for any feasible point θ ∈ Θ, J(θ) cannot be evaluated analytically. Often J(θ) is an expectation of some random estimate of the performance of a complex stochastic system given a parameter θ,

J(θ) = E[L(θ)].   (2)

Here L(θ) is a random variable which depends on the parameter θ ∈ Θ. We assume that L(θ) is an unbiased discrete event simulation estimate of the true performance, and refer to it as the sample performance.
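To make the setting concrete, the sample performance L(θ) is typically one replication of a discrete-event simulation. The sketch below uses a hypothetical single-server queue (Lindley recursion, with the service rate as the decision parameter θ); the queue model and all parameter names are illustrative assumptions, not part of the paper:

```python
import random

def sample_performance(theta, n_customers=500, arrival_rate=0.8, seed=None):
    """One replication L(theta): mean waiting time in a single-server queue
    with exponential interarrival and service times, where the service rate
    is theta (Lindley recursion W_{n+1} = max(0, W_n + S_n - A_{n+1}))."""
    rng = random.Random(seed)
    wait = 0.0
    total = 0.0
    for _ in range(n_customers):
        service = rng.expovariate(theta)
        interarrival = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - interarrival)
        total += wait
    return total / n_customers

def estimate_J(theta, reps=20):
    """J(theta) = E[L(theta)] cannot be evaluated analytically here; it can
    only be approximated by averaging independent replications."""
    return sum(sample_performance(theta) for _ in range(reps)) / reps
```

Each call to `sample_performance` is one noisy observation L(θ); only repeated replications reveal J(θ).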
2.1. The NP Methodology
We now describe a new approach applicable to simulation-based optimization: the nested partitions (NP) method. The basic idea of the NP method is very simple. In the k-th iteration there is a region σ(k) ⊆ Θ that is considered the most promising. In the first iteration nothing is assumed to be known about where good solutions are located, so the entire feasible region σ(0) = Θ is taken as the most promising region. The most promising
region is then partitioned into M_{σ(k)} subregions, where M_{σ(k)} may depend on the subset σ(k) but not the iteration. What remains of the feasible region, Θ \ σ(k), is aggregated into one region called the surrounding region. Therefore, at the k-th iteration M_{σ(k)} + 1 disjoint subsets that cover the feasible region are considered. Each of these regions is sampled using some random sampling scheme, and the samples are used to estimate the promising index for each region. This index is a set performance function that determines which region becomes the most promising region in the next iteration. If one of the subregions is found to be best, this region becomes the most promising region. If the surrounding region is found to be best, the method backtracks to a larger region. To choose this larger region a fixed backtracking rule is used. The new most promising region is partitioned and sampled in a similar fashion. This generates a sequence of set partitions, with each partition nested within the last. We now assume that such a partitioning scheme has been fixed. The following definitions will be used throughout the analysis.
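One concrete fixed partitioning scheme, given purely for illustration (the paper leaves the scheme abstract): if Θ = {0,1}^n, a valid region can be encoded as a fixed prefix of the bit string, and partitioning fixes the next coordinate, so M_σ = 2 for every non-singleton region:

```python
def subregions(prefix, n):
    """Partition the region of all n-bit strings beginning with `prefix`
    by fixing the next bit; a singleton (maximum depth) is not partitioned."""
    if len(prefix) == n:
        return [prefix]
    return [prefix + (0,), prefix + (1,)]

def depth(prefix):
    """The depth d(sigma) equals the number of fixed coordinates."""
    return len(prefix)

def superregion(prefix):
    """The superregion function s(sigma): unfix the last coordinate;
    by convention s(Theta) = Theta (the empty prefix)."""
    return prefix[:-1] if prefix else prefix
```

A move from Θ down to a singleton then corresponds to fixing the coordinates one at a time.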
DEFINITION 1 A region constructed using a fixed partitioning scheme is called a valid region given the fixed partition. The collection of all valid regions is denoted by Σ. Singleton regions are of special interest, and Σ₀ denotes the collection of all such valid regions.
DEFINITION 2 We define the depth, d : Σ → N₀, of any valid region iteratively, with Θ having depth zero, subregions of Θ having depth one, and so forth. Since they cannot be partitioned further, we call the singleton regions in Σ₀ regions of maximum depth.
Ultimately only the maximum depth regions are of interest, that is, we want to find a region that contains only one point. Therefore, using the estimate proposed by Andradóttir (1995, 1996), we let the estimated best region in the k-th iteration be the maximum depth region, σ_opt(k) ∈ Σ₀, that has been most frequently considered the most promising region. Consequently, the method must keep track of the number of times, n_k(σ), a region σ ∈ Σ₀ has been visited by the k-th iteration. Note that it suffices to keep track of this for regions that have been visited at least once.
DEFINITION 3 We let the estimate of the best solution be

σ_opt(k) ∈ arg max_{σ∈Σ₀} n_k(σ),   (3)

the most frequently visited singleton region by the k-th iteration, that is, the region that has most often been the most promising region.
We note that in general σ_opt(k) ≠ σ(k), which is the key modification of the deterministic algorithm. As is shown in Shi and Ólafsson (1999), if the performance measure is deterministic, then σ(k) converges with probability one to a global optimum and hence we can let σ_opt(k) = σ(k) in the deterministic context. As we will see in the next section, however, σ(k) does not converge at all for stochastic problems.
DEFINITION 4 If a valid region σ ∈ Σ is formed by partitioning a valid region η ∈ Σ, then σ is called a subregion of region η, and region η is called a superregion of region σ. We define the superregion function s : Σ → Σ as follows. Let σ ∈ Σ \ {Θ}. Define s(σ) = η ∈ Σ if and only if σ ⊂ η and, if σ ⊆ ξ ⊂ η where ξ ∈ Σ, then ξ = η or ξ = σ. For completeness we define s(Θ) = Θ.
The NP method shifts the focus from specific points in the feasible region to a space of subsets, namely the space of all valid regions. Consequently, a set performance function I : Σ → R is needed. This set function can then be used to select the most promising region and is therefore called the promising index of the region. The only condition imposed on such a promising index function is that if a valid region is a singleton then the promising index agrees with the original performance function, as is stated in the final definition of this section.
DEFINITION 5 A set function I : Σ → R is a valid promising index if and only if for all σ = {θ} ∈ Σ₀, I(σ) = J(θ).
Finally note that since the performance of each point must be estimated, the promising index must also be estimated.
2.2. The NP Algorithm
We can now describe a generic implementation of the NP algorithm. Note that special consideration has to be given in the partitioning step to the two special cases of the most promising region being σ(k) = Θ or σ(k) ∈ Σ₀ for any k ≥ 0.
NP Algorithm
Step 0. Initializing. Let k = 0, n₀(σ) = 0 ∀σ ∈ Σ₀, σ_opt(0) = Θ, and σ(0) = Θ.
Step 1. Partitioning. First, unless the current most promising region is a singleton, partition it into disjoint subsets. In other words, if σ(k) ∉ Σ₀, partition σ(k) into M_{σ(k)} subregions σ₁(k), ..., σ_{M_{σ(k)}}(k). Here we allow M_{σ(k)} to depend on the region σ(k), but not the iteration. Otherwise, if σ(k) ∈ Σ₀, then M_{σ(k)} = 1 and σ₁(k) = σ(k).
Secondly, if σ(k) ≠ Θ, aggregate the surrounding region Θ \ σ(k) into one region σ_{M_{σ(k)}+1}(k) and let M = M_{σ(k)} + 1. Otherwise, if σ(k) = Θ, let M = M_{σ(k)}.
Step 2. Random Sampling. Use a random sampling procedure to select N_{σ_j(k)} independent points θ^j_1, θ^j_2, ..., θ^j_{N_{σ_j(k)}} from each of the regions σ_j(k), and calculate the corresponding sample performance values,

L(θ^j_1), L(θ^j_2), ..., L(θ^j_{N_{σ_j(k)}}),   j = 1, 2, ..., M.

The only requirements on the random sampling procedure are that it be independent of previous iterations and that each point in the region have a positive probability of being selected. As above, the number of sample points may depend on the region.
Step 3. Estimating the Promising Index. Given a promising index function, I : Σ → R, for each region σ_j(k), j = 1, 2, ..., M, calculate the estimated promising index Î(σ_j(k)) of the region. In this paper we assume that the promising index has been selected to be

I(η) = min_{θ∈η} J(θ),

and is estimated using

Î(σ_j(k)) = min_{i∈{1,2,...,N_{σ_j(k)}}} L(θ^j_i),   j = 1, 2, ..., M.   (4)
It is clear from equation (4) that accurately estimating the promising index, and hence the performance function, is not critical. Only the ordinal values affect how the NP algorithm proceeds. If subregion σ_{j*}(k) contains the true global optimum, then it is sufficient that

Î(σ_{j*}(k)) ≤ Î(σ_j(k)),   ∀j ≠ j*.

If this inequality holds then the correct subregion is identified.
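A minimal sketch of the estimator (4), assuming a region given as an explicit list of feasible points and uniform random sampling (both illustrative assumptions):

```python
import random

def estimated_promising_index(region_points, L, N, rng=random):
    """Estimate I(sigma) = min_{theta in sigma} J(theta) by drawing N points
    uniformly (with replacement) from the region and taking the smallest
    observed sample performance, as in equation (4). `L` is the noisy
    sample-performance function."""
    sampled = [rng.choice(region_points) for _ in range(N)]
    return min(L(theta) for theta in sampled)
```

Since only the ordering of the estimated indices matters, this crude minimum-of-samples statistic suffices to drive the search.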
Step 4. Selecting the Most Promising Region. Select the index of the region with the best promising index,

ĵ_k ∈ arg min_{j=1,...,M} Î(σ_j(k)).   (5)

If more than one region is equally promising, the tie should be broken such that each of these regions has equal probability of being selected. If the index (5) corresponds to a region that is a subregion of σ(k), then let this be the most promising region in the next iteration. Otherwise, if the index corresponds to the surrounding region, then backtrack to a larger region containing the current most promising region. In this paper we consider two options for backtracking:

(1) Backtracking to the superregion of the current most promising region. That is, let

σ(k+1) = σ_{ĵ_k}(k) if ĵ_k < M, and σ(k+1) = s(σ(k)) otherwise.   (6)

(2) Backtracking to the entire feasible region. That is, let

σ(k+1) = σ_{ĵ_k}(k) if ĵ_k < M, and σ(k+1) = Θ otherwise.   (7)
We refer to the first option as the NP I algorithm, and the latter as the NP II algorithm. Further discussion of these two variations may be found in Shi and Ólafsson (2000).
Step 5. Updating Counters. If σ = σ(k+1) ∈ Σ₀, then let n_{k+1}(σ) = n_k(σ) + 1, and otherwise let n_{k+1}(σ) = n_k(σ) for all σ ∈ Σ₀, σ ≠ σ(k+1). If there exists σ ∈ Σ₀ such that n_{k+1}(σ) > n_{k+1}(σ_opt(k)), then let σ_opt(k+1) = σ, and otherwise let σ_opt(k+1) = σ_opt(k). Let k = k + 1. Go back to Step 1.
The implementation of this algorithm differs from its previously proposed deterministic
version only in that we use estimation in Step 3, and in the addition of the last step. The latter is necessary because the estimate of the best region is not σ(k) but the most frequently visited singleton, and hence the counters {n_k(σ)}_{σ∈Σ₀} must be maintained and the most frequently visited singleton updated.
It is worthwhile to point out some computational issues related to the implementation of the NP algorithm. First, for the NP I algorithm, notice that in order to backtrack in Step 4 the superregion of the current most promising region must be available, but this imposes minimal memory overhead because we only need to keep track of those regions that lead from the current most promising region back to the entire feasible region. This is generally a very small fraction of the number of valid regions. Secondly, for both algorithms, from Step 5 we see that we must keep track of the counters n_k(σ) for valid regions σ ∈ Σ₀. This could impose a large memory burden on the algorithm. However, in practice we are not likely to visit all regions of maximum depth. Therefore, we can use a dynamic data structure to keep track of the counters n_k(σ) for only the maximum depth regions σ ∈ Σ₀ that are the most promising region at least once.
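Collecting Steps 0–5 into code, the following is one possible sketch of the NP II variant, with Θ = {0,1}^n, regions encoded as bit-string prefixes, uniform sampling, a dynamic dictionary for the singleton counters, and ties broken by lowest index rather than uniformly at random (a simplification of Step 4); all of these concrete choices are illustrative assumptions rather than the paper's prescription:

```python
import random
from collections import defaultdict

def np_ii(L, n, n_samples=10, iterations=2000, seed=0):
    """Nested partitions (NP II variant) over Theta = {0,1}^n with a noisy
    objective L. Regions are bit-string prefixes; partitioning fixes the next
    bit; backtracking returns to the entire feasible region. Returns the most
    frequently visited singleton (Definition 3)."""
    rng = random.Random(seed)
    visits = defaultdict(int)          # n_k(sigma) for singletons only
    sigma = ()                         # sigma(0) = Theta (empty prefix)

    def sample_point(prefix):
        # Uniform sample from a region: keep the prefix, randomize the rest.
        return prefix + tuple(rng.randint(0, 1) for _ in range(n - len(prefix)))

    def sample_point_outside(prefix):
        # Rejection sampling from the surrounding region Theta \ sigma.
        while True:
            theta = tuple(rng.randint(0, 1) for _ in range(n))
            if theta[:len(prefix)] != prefix:
                return theta

    for _ in range(iterations):
        # Step 1: partition sigma(k) and aggregate the surrounding region.
        if len(sigma) < n:
            regions = [sigma + (0,), sigma + (1,)]
        else:
            regions = [sigma]          # singleton: M_sigma = 1
        samplers = [lambda r=r: sample_point(r) for r in regions]
        if sigma:                      # sigma != Theta: add surrounding region
            regions = regions + [None]  # None marks Theta \ sigma
            samplers.append(lambda: sample_point_outside(sigma))

        # Steps 2-3: sample each region, estimate promising indices (eq. (4)).
        indices = [min(L(s()) for _ in range(n_samples)) for s in samplers]

        # Step 4: move to the best subregion, or backtrack to Theta (NP II);
        # ties go to the lowest index here instead of a uniform draw.
        best = min(range(len(regions)), key=lambda j: indices[j])
        sigma = () if regions[best] is None else regions[best]

        # Step 5: update counters for singleton regions.
        if len(sigma) == n:
            visits[sigma] += 1

    return max(visits, key=visits.get) if visits else None
```

With the noiseless objective L(θ) = Σᵢ θᵢ the sketch quickly reaches the all-zeros singleton and, having the strictly best index there, never leaves it, so that singleton dominates the visit counts.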
We finally note that for most problems there are many valid partitions, and the partitioning implicitly imposes a structure on the feasible region. If this structure is such that good solutions tend to be clustered together, the NP method is likely to rapidly concentrate the search in these regions and converge quickly. The partitioning is thus a means of concentrating the sampling effort, and similarly the backtracking is a means of making the sampling less concentrated. Consequently, the partitioning, random sampling, and backtracking constitute the global search phase of the NP method. The estimation of a promising index, on the other hand, is of a different nature. In its simplest form, the estimated promising index is a summary statistic for the sampling information. For example, it can be an average or a minimum for a given region. However, it is possible to incorporate local search methods or heuristics into the computation of the promising index. That is, when estimating the promising index of any region, a local search may be started from each sample point and constrained to be within the region. To keep the discussion simple we only consider the best performance in the region as a promising index in this paper. For examples of other promising indices where local search techniques are successfully used in the deterministic context, we refer the reader to Shi, Ólafsson, and Sun (1999) and Ólafsson and Shi (2000). Next we address the asymptotic convergence of the method.
3. Convergence
The NP method uses the set performance function I : Σ → R to guide its movements, and this leads us to shift our focus to finding an element σ_opt ∈ S, where

S = arg min_{σ∈Σ₀} I(σ).   (8)

Given certain assumptions, we show that the NP method identifies an element in S, that is, an optimal solution for the problem (8) above. This, however, is equivalent to solving the
original problem (1) since the original performance function J : Θ → R and the set performance function I : Σ → R agree on singletons.
3.1. Basic Properties
In this section we show that the NP method generates an ergodic Markov chain. Furthermore, we show that the estimate of the best singleton region {σ_opt(k)}_{k=1}^∞ converges with probability one to a maximizer of the stationary distribution of the Markov chain. The stationary distribution can hence be used as a basis for inference. Since the same method is used for estimating the global optimum, some of the proofs in this section are similar to those of Andradóttir (1995, 1996). For any σ, η ∈ Σ, we use P(σ, η) to denote the transition probability of the Markov chain moving from state σ to state η, and P^n(σ, η) to denote the n-step transition probability. We also let p(A) denote the probability of any event A.
Since in the k-th iteration the next most promising region σ(k+1) only depends on the current most promising region σ(k) and the sampling information obtained in the k-th iteration, the sequence of most promising regions {σ(k)}_{k=1}^∞ is a Markov chain with state space Σ. We refer to this Markov chain as the NP Markov chain when referring to either the NP I or NP II algorithm, and the NP I Markov chain or the NP II Markov chain when referring to the NP I or NP II algorithm specifically. To give an expression for the transition probabilities of the NP Markov chain we require the following definition.
DEFINITION 6 Let b(σ) denote the region that the algorithm backtracks to from state σ ∈ Σ \ {Θ}, that is, b : Σ → Σ where

b(σ) = s(σ) for NP I, and b(σ) = Θ for NP II.   (9)

Also, let

Γ(σ) = {η ∈ Σ : s(η) = σ}   (10)

denote the set of subregions sampled if σ ∈ Σ is the most promising region.
In each iteration the NP method selects the next most promising region based on which region has the best estimated promising index. Therefore, the transition probabilities may be written in terms of the estimated promising index. We need to consider three cases:
(1) If σ = Θ, then the NP algorithm moves to one of the subregions in the next iteration, and the set of subregions that have the best estimated performance is given by

B₁ = {ξ ∈ Γ(σ) : Î(ξ) ≤ Î(η), ∀η ∈ Γ(σ)}.   (11)

The transition probabilities are given by

P(σ, ξ) = Σ_{i=1}^{|Γ(σ)|} (1/i) · p(ξ ∈ B₁ | |B₁| = i) · p(|B₁| = i)   (12)

for ξ ∈ Γ(σ), and P(σ, ξ) = 0 for ξ ∉ Γ(σ). Note that this equation uses the fact that ties are broken in a uniform manner.

(2) If σ ∈ Σ \ ({Θ} ∪ Σ₀), then the NP algorithm either moves to a subregion or backtracks in the next iteration, and the set of regions that have the best estimated performance is given by

B₂ = {ξ ∈ Γ(σ) ∪ {Θ \ σ} : Î(ξ) ≤ Î(η), ∀η ∈ Γ(σ) ∪ {Θ \ σ}}.   (13)

The transition probabilities are thus given by

P(σ, ξ) = Σ_{i=1}^{|Γ(σ)|+1} (1/i) · p(ξ ∈ B₂ | |B₂| = i) · p(|B₂| = i)   (14)

for ξ ∈ Γ(σ) ∪ {Θ \ σ}, and P(σ, ξ) = 0 for ξ ∉ Γ(σ) ∪ {Θ \ σ}.

(3) If σ ∈ Σ₀, then the NP algorithm either stays in the current most promising region or backtracks in the next iteration, and the transition probabilities are given by

P(σ, ξ) = p(Î(Θ \ σ) < Î(σ)) + p(Î(Θ \ σ) = Î(σ))/2,   if ξ = b(σ),
P(σ, ξ) = p(Î(Θ \ σ) > Î(σ)) + p(Î(Θ \ σ) = Î(σ))/2,   if ξ = σ,
P(σ, ξ) = 0,   otherwise.   (15)
Equations (11)–(15) completely define the transition probabilities for both NP algorithms. We will show that the NP Markov chain has a unique stationary distribution, but first we exclude certain trivial cases from the analysis. In particular, we assume that there is no region σ ∈ Σ \ {Θ} such that, despite noisy estimates, every point in σ has, with probability one, better estimated performance than all the points outside of σ, that is, in Θ \ σ. Formally, we state the assumption as follows.

ASSUMPTION 1 For all σ ∈ Σ \ {Θ}, P(σ, b(σ)) > 0.

We note that this assumption implies that for any η ∈ Σ, P^k(η, Θ) > 0, where k = d(η) or k = 1 for the NP I or NP II algorithm, respectively. Also, since the random sampling scheme assumes that each point in the sampled region is sampled with positive probability, Assumption 1 is equivalent to assuming that for all valid regions σ ∈ Σ, there exists at least one pair of points θ₁ ∈ σ and θ₂ ∈ Θ \ σ such that p(L(θ₁) > L(θ₂)) > 0. Using this observation it can be seen that this assumption imposes no loss of generality. Indeed, if it does not hold, then we can let σ₀ be the smallest region for which it is not satisfied. Then it is straightforward to verify that σ₀ is closed, all the other states are transient, and the Markov chain eventually enters σ₀ and never leaves. Thus, σ₀ contains all the interesting solutions, since, with probability one, all the points in σ₀ have better performance than all points outside σ₀. If σ₀ is a singleton it is an absorbing state which the NP method converges to in finite time, and no further analysis is necessary. If σ₀ is not a singleton, the analysis below can be applied by replacing Θ with σ₀ as the feasible region. Hence, Assumption 1 imposes no loss of generality.
PROPOSITION 1 If Assumption 1 holds, then the NP Markov chain has a unique stationary distribution {π(σ)}_{σ∈Σ}.

Proof: Let Σ_P denote the set of positive recurrent states of the Markov chain. Since Σ is finite, Σ_P ≠ ∅, which implies that the NP Markov chain has at least one stationary distribution. To show that it is also unique, we must show that Σ_P is irreducible. We first show that Θ ∈ Σ_P. Let σ ∈ Σ_P. Then by Assumption 1 there exists some k such that P^k(σ, Θ) > 0 and, since σ is positive recurrent, Θ is also positive recurrent. Now to prove irreducibility, let σ, η ∈ Σ_P. Again by Assumption 1 there exist some k₁ and k₂ such that P^{k₁}(σ, Θ) > 0 and P^{k₂}(η, Θ) > 0. Furthermore, since η ∈ Σ_P is recurrent and P^{k₂}(η, Θ) > 0, there exists some k₃ such that P^{k₃}(Θ, η) > 0. Hence P^{k₁+k₃}(σ, η) > 0. This holds for any σ, η ∈ Σ_P, so Σ_P is irreducible, and hence the Markov chain has a unique stationary distribution {π(σ)}_{σ∈Σ}. □
We note that since a unique stationary distribution exists, the average number of visits to each state converges to this distribution with probability one (w.p.1), that is,

lim_{k→∞} n_k(σ)/k = π(σ),   ∀σ ∈ Σ, w.p.1.   (16)

Furthermore, as P(σ, σ) > 0 for some σ ∈ Σ₀, the chain is aperiodic and hence the k-step transition probabilities converge pointwise to the stationary distribution, lim_{k→∞} P^k(Θ, σ) = π(σ), ∀σ ∈ Σ. The next step in establishing the convergence of the NP method is to use (16) to show that the estimate of the best solution converges to a maximum stationary probability of the NP Markov chain, and hence that the stationary distribution can be used for inference.
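Equation (16) is the standard ergodic-average property of a finite chain with a unique stationary distribution, and it is easy to verify numerically; the 3-state transition matrix used in the example below is an arbitrary illustrative choice, not the NP chain itself:

```python
import random

def visit_frequencies(P, steps=200_000, seed=0):
    """Simulate a finite Markov chain with transition matrix P (rows sum to 1)
    starting in state 0, and return the empirical visit frequencies n_k(s)/k
    appearing in equation (16)."""
    rng = random.Random(seed)
    n = len(P)
    counts = [0] * n
    state = 0
    for _ in range(steps):
        counts[state] += 1
        u, acc = rng.random(), 0.0
        for j in range(n):       # inverse-CDF draw of the next state
            acc += P[state][j]
            if u < acc:
                state = j
                break
    return [c / steps for c in counts]

def stationary(P, iters=10_000):
    """Approximate the stationary distribution by iterating pi <- pi P
    from the uniform distribution (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

For example, for P = [[0.5, 0.5, 0], [0.25, 0.5, 0.25], [0, 0.5, 0.5]], detailed balance gives π = (0.25, 0.5, 0.25), and the empirical visit frequencies converge to these values.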
THEOREM 1 Assume that Assumption 1 holds. Then the estimate of the best region σ_opt(k) converges to a maximizer of the stationary distribution of the NP Markov chain, that is,

lim_{k→∞} σ_opt(k) ∈ arg max_{σ∈Σ₀} π(σ),   w.p.1.   (17)
Proof: Let k₀ = min{k ≥ 1 : σ(k) ∈ Σ₀} be the first time the NP method reaches maximum depth. If σ_opt ∈ S is a global optimum, then there is a positive probability that σ_opt is sampled for d(σ_opt) consecutive iterations, which together with Θ ∈ Σ_P implies that P^{d(σ_opt)}(Θ, σ_opt) > 0, i.e., σ_opt ∈ Σ_P. Hence there is at least one positive recurrent state in Σ₀, and consequently k₀ < ∞ almost surely.

Now combine this with the fact that equation (16) holds, and let A be a set such that p(A) = 1 and, for all ω ∈ A, k₀(ω) < ∞ and lim_{k→∞} n_k(σ, ω)/k = π(σ). Let ω ∈ A be a fixed sample point. Since Σ₀ is finite, there exists a finite constant K_ω ≥ k₀(ω) such that for all k ≥ K_ω,

arg max_{σ∈Σ₀} n_k(σ, ω) ⊆ arg max_{σ∈Σ₀} π(σ).

Now by definition σ_opt(k, ω) ∈ arg max_{σ∈Σ₀} n_k(σ, ω) for all k ≥ k₀, so σ_opt(k, ω) ∈ arg max_{σ∈Σ₀} π(σ) for all k ≥ K_ω. This completes the proof. □
To prove asymptotic global convergence, we need to show that the singleton maximizer of the stationary distribution can be used as an estimate of the global optimum. As will be explained shortly, without some conditions about the problem structure this may not hold.
3.2. Global Convergence
We now show that given certain regularity conditions the observation that the NP method converges to a maximizer of the stationary distribution implies that it converges to a global optimum. In particular, by showing that the maximum depth region with the largest stationary probability is also a global optimum, it follows that the NP method converges to the global optimum.
Without assumptions about the problem structure this does not necessarily hold. For example, assume that σ_opt is a unique global optimum, but that there exists a region η ∈ Σ such that σ_opt ⊂ η and J(θ) > J(φ) for all θ ∈ η \ σ_opt, φ ∈ Θ \ η. Then we expect the transition probability P^{d(η)}(Θ, η) into η to be small, which implies that P^{d(σ_opt)}(Θ, σ_opt) into σ_opt is also likely to be small. This indicates that π(σ_opt) may be quite small. Therefore, if the performance measure changes drastically from the global optimum to points that are close in the partitioning metric, the NP method is not guaranteed to find the global optimum. Consequently, we must exclude such problem structures. We start with a definition.
DEFINITION 7 Let σ and η be any valid regions. Then there exists some sequence of regions σ = ξ₀, ξ₁, ..., ξ_n = η that the Markov chain can move along to get from state σ to state η. We call the shortest such sequence the shortest path from σ to η. We also define κ(σ, η) to be the length of the shortest path.
Note that the shortest path is unique and can be found by moving directly from σ to the smallest valid region that contains both σ and η, and then proceeding directly from this region to η. The following condition on transition probabilities along shortest paths connecting the global optimum to other maximum depth regions turns out to be a sufficient condition for global convergence. The interpretation of this assumption is discussed further at the end of the section.
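If, purely for illustration, the regions of Θ = {0,1}^n are encoded as fixed bit-string prefixes (an assumption, not the paper's general setting), the shortest NP I path climbs from σ to the smallest valid region containing both σ and η (their longest common prefix) and then descends to η, so κ(σ, η) can be computed directly:

```python
def kappa(sigma, eta):
    """kappa(sigma, eta): length of the shortest NP I path between two
    prefix-encoded regions -- up from sigma to the longest common prefix,
    then down to eta (Definition 7)."""
    lcp = 0
    while lcp < min(len(sigma), len(eta)) and sigma[lcp] == eta[lcp]:
        lcp += 1
    return (len(sigma) - lcp) + (len(eta) - lcp)
```

Under NP II the backtracking move jumps directly to Θ, so the shortest path from σ to any region η not below σ instead has length 1 + d(η).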
ASSUMPTION 2 The set

S₀ = {ξ ∈ Σ₀ : P^{κ(η,ξ)}(η, ξ) ≥ P^{κ(ξ,η)}(ξ, η), ∀η ∈ Σ₀}   (18)

satisfies S₀ ⊆ S, that is, it is a subset of the set of global optimizers.
The regularity condition in Equation (18), needed for global convergence of both the NP I and NP II algorithms, guarantees that the transition probability from any part of the feasible region to the global optimum is at least as large as that of going from the global optimum back to that region. This ensures that the points close to the global optimum are sufficiently good. We now use this assumption to prove global convergence of the NP I algorithm.

THEOREM 2 Assume that the NP I algorithm is applied and Assumptions 1–2 hold. Then

arg max_{σ∈Σ₀} π(σ) ⊆ S,   (19)
and consequently the NP I algorithm converges with probability one to a global optimum.
Proof: Let η₁, η₂ be any regions of maximum depth. Let ξ₀, ξ₁, ..., ξ_{n−1}, ξ_n be the shortest path from η₁ to η₂, where n = κ(η₁, η₂). It is clear from the Kolmogorov criterion (Wolff, 1989) that the NP I Markov chain is reversible, and we have that

P(ξ_i, ξ_{i+1}) π(ξ_i) = P(ξ_{i+1}, ξ_i) π(ξ_{i+1}),   i = 0, 1, 2, ..., n−1.

We can rewrite this as

π(ξ_i) = [P(ξ_{i+1}, ξ_i) / P(ξ_i, ξ_{i+1})] π(ξ_{i+1}),   i = 0, 1, 2, ..., n−1.

By using this equation iteratively, and using ξ₀ = η₁ and ξ_n = η₂, we can rewrite this as

π(η₁) = [P(ξ_n, ξ_{n−1}) ··· P(ξ₂, ξ₁) P(ξ₁, ξ₀)] / [P(ξ₀, ξ₁) P(ξ₁, ξ₂) ··· P(ξ_{n−1}, ξ_n)] · π(η₂).   (20)

Furthermore, since ξ₀, ξ₁, ..., ξ_{n−1}, ξ_n is the unique shortest path from ξ₀ to ξ_n, we also have that

P^n(η₁, η₂) = P^n(ξ₀, ξ_n) = P(ξ₀, ξ₁) P(ξ₁, ξ₂) ··· P(ξ_{n−1}, ξ_n),

and

P^n(η₂, η₁) = P^n(ξ_n, ξ₀) = P(ξ_n, ξ_{n−1}) ··· P(ξ₂, ξ₁) P(ξ₁, ξ₀).

We can therefore rewrite Equation (20) as

π(η₁) = [P^n(η₂, η₁) / P^n(η₁, η₂)] π(η₂) = [P^{κ(η₂,η₁)}(η₂, η₁) / P^{κ(η₁,η₂)}(η₁, η₂)] π(η₂).   (21)

Now if η₂ ∈ arg max_{σ∈Σ₀} π(σ), then by definition π(η₁) ≤ π(η₂), so by Equation (21) we have

P^{κ(η₂,η₁)}(η₂, η₁) / P^{κ(η₁,η₂)}(η₁, η₂) ≤ 1,

that is, P^{κ(η₁,η₂)}(η₁, η₂) ≥ P^{κ(η₂,η₁)}(η₂, η₁). Since η₁ can be chosen arbitrarily, Assumption 2 now implies that η₂ ∈ S₀ ⊆ S, which proves the theorem. □

Similarly, the following theorem establishes global convergence for the NP II algorithm.
THEOREM 3 Assume that the NP II algorithm is applied and Assumptions 1–2 hold. Then

arg max_{σ∈Σ₀} π(σ) ⊆ S,   (22)

and consequently the NP II algorithm converges with probability one to a global optimum.
Proof: We start by looking at a state η ∈ Σ \ (Σ₀ ∪ {Θ}). Since the chain leaves this state with probability one, and the state can only be entered by transitions from state s(η), the balance equations are given by

π(η) = P(s(η), η) π(s(η)).

These equations can be solved iteratively to obtain

π(η) = P^{d(η)}(Θ, η) π(Θ).   (23)

Now assume σ ∈ Σ₀. The balance equations are given by

π(σ) P(σ, Θ) = π(s(σ)) P(s(σ), σ).

Rewriting and using Equation (23) above, we get

π(σ) = [P(s(σ), σ) / P(σ, Θ)] π(s(σ))
     = [P(s(σ), σ) / P(σ, Θ)] P^{d(s(σ))}(Θ, s(σ)) π(Θ)
     = [P^{d(σ)}(Θ, σ) / P(σ, Θ)] π(Θ).

Now take another η ∈ Σ₀; then

π(σ)/π(η) = [P^{d(σ)}(Θ, σ) / P(σ, Θ)] · [P(η, Θ) / P^{d(η)}(Θ, η)]
          = P^{d(σ)+1}(η, σ) / P^{d(η)+1}(σ, η)
          = P^{κ(η,σ)}(η, σ) / P^{κ(σ,η)}(σ, η).   (24)

The remainder of the proof follows the last paragraph of the proof of Theorem 2 above and is omitted. □
We note that a sufficient condition for Equation (18) to be satisfied is that in each iteration the probability of making the "correct" decision is at least one half, and that the condition (18) depends on the partitioning, sampling, and method of estimation. This condition needed for global convergence is in general difficult to verify, but to gain some insight the remainder of the section is devoted to obtaining conditions on the partitioning, sampling, and estimation elements that are sufficient for convergence. These conditions are considerably stronger than (18) but are somewhat more intuitive. For simplicity we assume that there is a unique global optimum σ_opt ∈ S. We start with a definition.
DEFINITION 8 Let \bar\Sigma_g = \{\sigma \in \bar\Sigma : \sigma_{opt} \in \sigma\} denote all the regions containing the unique global optimum \sigma_{opt} and \bar\Sigma_b = \bar\Sigma \setminus \bar\Sigma_g denote the remaining regions. Define a function Y : \bar\Sigma_g \times \bar\Sigma_b \to \mathbb{R} by

Y(\sigma_g, \sigma_b) = J(y^1_{\sigma_b}) - J(y^1_{\sigma_g}),   (25)

where y^1_\sigma denotes the random sample point from \sigma \in \bar\Sigma that has the estimated best performance and is generated in Step 2 of the NP algorithm. Furthermore, let

\hat Y(\sigma_g, \sigma_b) = L(y^1_{\sigma_b}) - L(y^1_{\sigma_g})   (26)

denote the corresponding simulation estimate.

We note that for each \sigma \in \bar\Sigma, y^1_\sigma \in \sigma is a random variable that is defined by the sampling strategy, and that

L(y^1_\sigma) = \hat I(\sigma).   (27)

Also note that for any \sigma_g \in \bar\Sigma_g, \sigma_b \in \bar\Sigma_b, the random variable Y(\sigma_g, \sigma_b) is defined by the partitioning and sampling strategies, whereas if we condition the random variable \hat Y(\sigma_g, \sigma_b) on the value of Y(\sigma_g, \sigma_b), its outcome depends only on the estimated sample performance, that is, on the randomness inherent in the simulation estimate. We now obtain the following theorem for the NP I algorithm:
THEOREM 4 Assume that \sigma_{opt} is unique and

P\big(\hat Y(\sigma_g, \sigma_b) \ge 0 \mid Y(\sigma_g, \sigma_b) = y\big) \ge \frac{y}{2 E[Y(\sigma_g, \sigma_b)]}   (28)

for all \min_{y \in \sigma_b} J(y) - \max_{y \in \sigma_g} J(y) \le y \le \max_{y \in \sigma_b} J(y) - \min_{y \in \sigma_g} J(y), \sigma_g \in \bar\Sigma_g, and \sigma_b \in \bar\Sigma_b. Then the NP I algorithm converges to a global optimum.
Proof: First note that the assumption of this theorem can be satisfied only if

\frac{\bar y}{2 E[Y(\sigma_g, \sigma_b)]} \le 1,

where \bar y = \max_{y \in \sigma_b} J(y) - \min_{y \in \sigma_g} J(y). Thus, we must have

E[Y(\sigma_g, \sigma_b)] \ge \frac{1}{2}\Big(\max_{y \in \sigma_b} J(y) - \min_{y \in \sigma_g} J(y)\Big) = \frac{1}{2}\Big(\max_{y \in \sigma_b} J(y) - J(y_{opt})\Big),   (29)

and thus E[Y(\sigma_g, \sigma_b)] > 0. Now let \sigma_g \in \bar\Sigma_g, \sigma_b \in \bar\Sigma_b, let F_Y = F_{Y(\sigma_g,\sigma_b)} denote the distribution function of Y(\sigma_g, \sigma_b), and condition on the value of Y(\sigma_g, \sigma_b):

P\big(\hat Y(\sigma_g, \sigma_b) \ge 0\big) = \int_y P\big(\hat Y(\sigma_g, \sigma_b) \ge 0 \mid Y(\sigma_g, \sigma_b) = y\big)\, dF_Y(y)
 \ge \int_y \frac{y}{2 E[Y(\sigma_g, \sigma_b)]}\, dF_Y(y)
 = \frac{1}{2 E[Y(\sigma_g, \sigma_b)]} \int_y y\, dF_Y(y)
 = \frac{1}{2}.

Now since by definition \hat Y(\sigma_g, \sigma_b) = L(y^1_{\sigma_b}) - L(y^1_{\sigma_g}), this implies

P\big[L(y^1_{\sigma_b}) \ge L(y^1_{\sigma_g})\big] \ge \frac{1}{2},

that is, by Equation (27) we have P[\hat I(\sigma_b) \ge \hat I(\sigma_g)] \ge \frac{1}{2}, which implies

P[\hat I(\sigma_b) \ge \hat I(\sigma_g)] \ge P[\hat I(\sigma_b) < \hat I(\sigma_g)].   (30)

As before, let \sigma_{opt} denote the unique global optimum. Let \eta \in \Sigma_0, and let \eta_0, \eta_1, \ldots, \eta_{n-1}, \eta_n be the shortest path from \sigma_{opt} to \eta, where n = k(\sigma_{opt}, \eta). Take any i such that 1 \le i \le n-1. We need to consider three cases here:

(1) If \sigma_{opt} \in \eta_{i-1} and \sigma_{opt} \in \eta_{i+1}, then \eta_{i-1} \in \bar\Sigma_g and \Theta \setminus \eta_i \in \bar\Sigma_b. We also note that, according to Definition 6, \eta_{i-1} \subset \eta_i and \eta_{i+1} \in h(\eta_i). Thus, according to Equation (14) and the inequality (30) derived above, P(\eta_i, \eta_{i-1}) \ge P(\eta_i, \eta_{i+1}).

(2) If \sigma_{opt} \in \eta_{i-1} and \sigma_{opt} \notin \eta_{i+1}, then \eta_{i-1} \in \bar\Sigma_g and \eta_{i+1} \in \bar\Sigma_b. Furthermore, \eta_{i-1} \subset \eta_i \supset \eta_{i+1}. Similarly to the first case we can use (14), or (12) if \eta_i = \Theta, to show that P(\eta_i, \eta_{i-1}) \ge P(\eta_i, \eta_{i+1}).

(3) If \sigma_{opt} \notin \eta_{i-1} and \sigma_{opt} \notin \eta_{i+1}, then \Theta \setminus \eta_{i+1} \in \bar\Sigma_g and \eta_{i+1} \in \bar\Sigma_b. Furthermore, \eta_{i+1} \subset \eta_i and \eta_{i-1} \in h(\eta_i). Again, Equation (14) can be used to show that P(\eta_i, \eta_{i-1}) \ge P(\eta_i, \eta_{i+1}).

Thus, we have that P(\eta_i, \eta_{i-1}) \ge P(\eta_i, \eta_{i+1}), and since this holds for any i it is clear that

P(\eta_{n-1}, \eta_{n-2}) \cdots P(\eta_1, \eta_0) \ge P(\eta_1, \eta_2) \cdots P(\eta_{n-1}, \eta_n).   (31)

Now we note that, using a similar argument as above and noting that \eta_n, \Theta \setminus \eta_0 \in \bar\Sigma_b, whereas \eta_0, \Theta \setminus \eta_n \in \bar\Sigma_g, Equations (30) and (15) imply that P(\eta_n, \eta_{n-1}) \ge P(\eta_0, \eta_1), and putting this together with (31) we have

P^{k(\eta, \sigma_{opt})}(\eta, \sigma_{opt}) \ge P^{k(\sigma_{opt}, \eta)}(\sigma_{opt}, \eta).   (32)

Now, finally, to verify that Equation (18) of Assumption 2 holds, let x \in S_0. Then by definition P^{k(\eta,x)}(\eta, x) \ge P^{k(x,\eta)}(x, \eta), \forall \eta \in \Sigma_0. If x \ne \sigma_{opt} then we can select \eta = \sigma_{opt} and obtain P^{k(\sigma_{opt},x)}(\sigma_{opt}, x) \ge P^{k(x,\sigma_{opt})}(x, \sigma_{opt}), which contradicts Equation (32) above. Thus, we must have x = \sigma_{opt}, which implies that S_0 = S, so Assumption 2 is satisfied and global convergence follows by Theorem 2 above.
The assumption of Theorem 4 illustrates sufficient global convergence conditions on the partitioning and sampling on the one hand, and on the estimation on the other. In particular, E[Y(\sigma_g, \sigma_b)] depends only on the partitioning and sampling, and the inequality (29), which is implicit in the assumption, can be satisfied by partitioning such that there is a certain amount of separation, or partial non-overlap, between the good and bad sets, and then using enough sample points. On the other hand, given a fixed value of E[Y(\sigma_g, \sigma_b)], the inequality (28) depends on the sample performance being sufficiently accurate and can be satisfied by increasing the simulation time. However, verifying how much sample effort is "enough" and when the accuracy of the sample performance is "sufficient" may still be fairly difficult. Finally, we note that although this theorem is only stated for NP I, similar conditions can be developed for the NP II algorithm.
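The interplay between the two conditions can be illustrated with a small Monte Carlo sketch. All numbers and the linear performance function below are hypothetical: the separation between a region containing the optimum and a worse region fixes E[Y(\sigma_g, \sigma_b)], while the number of simulation replications controls how often the comparison \hat Y(\sigma_g, \sigma_b) \ge 0 comes out correct.

```python
import random

# Hypothetical setup: true performance J(theta) = theta (to be minimized),
# estimated by averaging n_reps noisy observations.
def simulate(theta, n_reps, noise=3.0):
    return sum(theta + random.uniform(-noise, noise)
               for _ in range(n_reps)) / n_reps

def comparison_correct(n_reps):
    # sigma_g holds low-cost points and sigma_b higher-cost points, so the
    # separation E[Y(sigma_g, sigma_b)] of inequality (29) is positive.
    pts_g = [random.uniform(0.0, 4.0) for _ in range(5)]
    pts_b = [random.uniform(3.0, 8.0) for _ in range(5)]
    est_g = {p: simulate(p, n_reps) for p in pts_g}
    est_b = {p: simulate(p, n_reps) for p in pts_b}
    y1_g = min(est_g, key=est_g.get)   # estimated-best point in each region
    y1_b = min(est_b, key=est_b.get)
    # the event {Y_hat(sigma_g, sigma_b) >= 0}, i.e. the comparison is correct
    return est_b[y1_b] - est_g[y1_g] >= 0.0

random.seed(0)
p_few = sum(comparison_correct(n_reps=1) for _ in range(2000)) / 2000
p_many = sum(comparison_correct(n_reps=25) for _ in range(2000)) / 2000
# More replications make the estimates more accurate, so the comparison
# should come out correct more often, helping satisfy condition (28).
```

Here p_many should be close to one, while p_few reflects the selection bias incurred by choosing y^1_\sigma from only a few noisy replications.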
4. Numerical Example
In this section we illustrate the methodology presented in this paper via a numerical example. A classical combinatorial optimization problem is the symmetric traveling salesman problem (TSP), which involves finding the shortest Hamiltonian cycle in an undirected graph with n nodes. This problem has many well-known practical applications, such as routing robots through automatic warehouses, scheduling couriers for pickups at ATMs, and drilling holes in printed circuit boards. Furthermore, even for moderately small n, the number of alternative cycles is prohibitively large for enumerative methods, and since the optimization problem is NP-hard, no efficient solution method is known to exist. If the "travel time" on the edges is random, this difficulty is exacerbated even further. The deterministic TSP is often used as a benchmark for combinatorial optimization algorithms, and in Shi, Ólafsson, and Sun (1999) a detailed deterministic NP algorithm is developed and shown to perform very competitively when applied to this problem. The results of this section demonstrate the performance of the stochastic NP method when applied to the stochastic TSP.
Assume that there are n cities that must be visited and each city must be visited exactly once. Let T_{ij} denote the random travel time between cities i and j, that is, the length of the edge between nodes i and j, for i, j \in \{1, 2, \ldots, n\}. We assume that

T_{ij} = c_{ij} + X_{ij},   (33)

where X_{ij} \sim U(-a, a), a > 0, is a uniform random variable with mean zero, which implies that E[T_{ij}] = c_{ij}. We refer to the optimal tour as the tour that has the shortest average length; that is, if y = (i_1, i_2, \ldots, i_n) is a feasible tour then the performance of this tour is

J(i_1, i_2, \ldots, i_n) = \sum_{k=1}^{n-1} c_{i_k i_{k+1}} + c_{i_n i_1}.

We estimate this performance by averaging r replications, that is,

L_r(i_1, i_2, \ldots, i_n) = \frac{1}{r} \sum_{s=1}^{r} \left[ \sum_{k=1}^{n-1} \big( c_{i_k i_{k+1}} + x^s_{i_k i_{k+1}} \big) + c_{i_n i_1} + x^s_{i_n i_1} \right],

where x^s_{ij} is the outcome of the s-th replication of the noise on the edge between cities i and j. This estimate is unbiased since E[L_r(i_1, i_2, \ldots, i_n)] = J(i_1, i_2, \ldots, i_n), and consistent since, by the strong law of large numbers, \lim_{r \to \infty} L_r(i_1, i_2, \ldots, i_n) = J(i_1, i_2, \ldots, i_n) with probability one.
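The performance J and its estimator L_r above can be sketched directly (the small cost matrix below is hypothetical, chosen only so the example is self-contained):

```python
import random

def true_length(tour, c):
    # J(i_1, ..., i_n): the expected tour length, since E[T_ij] = c_ij
    n = len(tour)
    return sum(c[tour[k]][tour[(k + 1) % n]] for k in range(n))

def estimated_length(tour, c, r, a):
    # L_r: average over r replications of the noisy tour length, where the
    # edge noise x_ij is drawn from U(-a, a) independently per replication
    n = len(tour)
    total = 0.0
    for _ in range(r):
        total += sum(c[tour[k]][tour[(k + 1) % n]] + random.uniform(-a, a)
                     for k in range(n))
    return total / r

random.seed(0)
n = 6
c = [[0 if i == j else abs(i - j) + 1 for j in range(n)] for i in range(n)]
tour = list(range(n))        # the tour 1 -> 2 -> ... -> n -> 1
J = true_length(tour, c)     # here J = 16
L25 = estimated_length(tour, c, r=25, a=1.0)
# By unbiasedness and the strong law of large numbers, L_r is centered on J
# and converges to J as r grows.
```

With a = 1 each edge noise has mean zero, so L_r fluctuates around J with standard deviation shrinking like 1/\sqrt{r}.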
As mentioned above, the details of how to implement the NP method for the deterministic TSP can be found in Shi, Ólafsson, and Sun (1999), and the only implementation differences between the deterministic and stochastic NP algorithms are the updating of the number of times singleton regions are visited and the use of estimation for the performance evaluation. We will therefore not repeat those implementation details here. This implementation, although specific to the TSP in that it exploits some of its special structure, can quite easily be generalized to other combinatorial optimization problems, as is illustrated by Ólafsson and Shi (2000). The TSP test problems employed are selected from the TSPLIB library of test problems (Reinelt, 1992), and are called eil51, eil76, and eil101, with 51, 76, and 101 cities, respectively. These problems are selected because they have known optimal solutions and are three difficult problems with similar structure but of different sizes, which allows us to evaluate how the method performs as the problem size increases. As the competitiveness of the NP method with respect to classical methods has already been established in the deterministic TSP context, the focus of our experiments is to evaluate the performance and robustness of the stochastic NP method with respect to the problem size and the variability in the performance estimates (sample performance). All of the experiments are performed on a Dell Dimension XPS R450 PC with 256MB RAM.
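As a hedged illustration (this is not the exact scheme of Shi, Ólafsson, and Sun (1999), only one natural possibility in its spirit), the tour space can be partitioned by fixing the cities in the first positions of the tour; each subregion is then generated by choosing the next city, and a region is sampled by completing the partial tour at random:

```python
import random

def partition(prefix, n):
    """Split the region of tours beginning with `prefix` (a partial tour)
    into one subregion per feasible next city."""
    return [prefix + [city] for city in range(n) if city not in prefix]

def sample_tour(prefix, n, rng):
    """Sample a random complete tour from the region of tours that start
    with `prefix`: fill the remaining positions in random order."""
    rest = [city for city in range(n) if city not in prefix]
    rng.shuffle(rest)
    return prefix + rest

rng = random.Random(0)
n = 8
subregions = partition([0], n)   # depth-1 subregions: second city fixed
tours = [sample_tour(sub, n, rng) for sub in subregions]
```

Each sampled tour can then be evaluated with the estimator L_r, and the subregion containing the tour with the best estimated performance becomes the next most promising region.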
For these experiments we use two noise distributions, X_{ij} \sim U(-1, 1) and X_{ij} \sim U(-2, 2). This is compared to distances between nodes that range from 2 to 86 for eil51, from 2 to 85 for eil76, and from 1 to 92 for eil101; this is thus very significant noise, especially for shorter arcs. We perform a fixed number of 300 iterations of the NP algorithm in every experiment. Table 1 shows the results for eil51 and the various numbers of replications r used for estimating the performance of each path. The numbers reported in the table are the average, standard error, minimum, and maximum percentage over the true optimum for 20 replications of the NP algorithm itself. As indicated by the upper half of the table, if X_{ij} \sim U(-1, 1), then the number of replications used for estimation does not significantly affect the performance of the algorithm: it consistently obtains tours that are on average within 3% of the optimal tour, and never worse than 6% above the length of the optimal tour. If the noise is greater, that is, X_{ij} \sim U(-2, 2), then the NP algorithm can still find tours of comparable quality; however, the performance of the algorithm is now more sensitive to the number of replications. Indeed, as might be expected, the fewer the replications the worse the performance of the algorithm. The interpretation is that if the noise is relatively high, then too few replications do not ensure that the estimated promising index is sufficiently representative of the region. The same pattern can be seen for the average, the best tour, and the worst tour.

Table 1. Percentage over optimum for eil51.

                                        Number of replications (r)
Error                                   1       5       10      25
U(-1, 1)   Average tour length          2.54    2.91    2.84    2.77
           Standard error (S.E.)        1.04    1.07    0.96    1.09
           Best tour length             1.17    1.41    0.94    1.41
           Worst tour length            5.40    5.87    5.16    5.40
           CPU time (seconds)           92      86      95      94
U(-2, 2)   Average tour length          3.58    3.12    2.86    2.80
           Standard error (S.E.)        1.85    1.28    1.36    0.80
           Best tour length             1.17    1.17    0.70    1.41
           Worst tour length            7.98    5.16    6.57    4.23
           CPU time (seconds)           91      97      88      100
Table 2 compares the results for the different problem sizes, that is, the 51-, 76-, and 101-city problems. These results are based on 20 replications of the algorithm and the noise distribution X_{ij} \sim U(-1, 1). Although the performance degrades somewhat as the problem size increases, for example the average percentage over optimum is 2.77% for eil51 but 5.38% for eil101, this is rather small when we consider that the number of possible tours is of the order 10^{65} for eil51 but 10^{159} for eil101. Also, the same pattern can be seen for the best and worst tours found. Thus, the stochastic NP method appears to perform relatively well for large problems.

Table 2. Percentage over optimum for 51, 76, and 101 cities.

         Average   S.E.    Minimum   Maximum   CPU Time
eil51    2.77      1.03    0.94      5.87      92
eil76    3.12      0.77    1.12      5.39      575
eil101   5.38      1.34    1.75      8.27      1967

Finally, we focus on the variability in the performance estimates; Table 3 reports the summary results of these experiments. This table compares X_{ij} \sim U(-1, 1) and X_{ij} \sim U(-2, 2) for all three problems and is again based on 20 replications. As is to be expected, the performance of the algorithm degrades as the noise increases, although this change is rather slight and it is smaller for the larger problems: 7.4% for eil101 versus 11.6% for eil51. Thus, again these results indicate that the NP algorithm performs rather well for large problems, and that it can obtain good solutions despite significant noise.

Table 3. Percentage over optimum for variable noise in performance estimates.

            eil51                eil76                eil101
            Average   CPU Time   Average   CPU Time   Average   CPU Time
U(-1, 1)    2.77      92         3.12      575        5.38      1967
U(-2, 2)    3.09      94         3.43      571        5.78      1987
Change      11.6%                9.9%                 7.4%
5. Discussion and Future Research
We have adapted the nested partitions (NP) method of Shi and Ólafsson (2000) to stochastic optimization. The basic idea behind the NP method is simple. It samples from the entire feasible region in each iteration, but concentrates the sampling effort by iteratively partitioning the part of the feasible region that is considered the most likely to contain the global optimum. Each partition is hence nested within the last, eventually reducing the most promising region to a singleton. The partitioning method, and a backtracking feature that makes the sampling less concentrated, are part of a global search phase. However, local search methods can also be incorporated into the method by using them to help decide which region should be partitioned next, and hence where the sampling should be concentrated.
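The generic iteration described above, partition the most promising region, sample each subregion and the surrounding region, then move to the winner or backtrack, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a hypothetical halving partition of a finite list of points and NP II-style backtracking to the full region.

```python
import random

def nested_partitions(theta, estimate, n_sample=3, iterations=100, seed=0):
    """Minimal NP-style search over a finite list `theta`. `estimate` is a
    (possibly noisy) performance function to be minimized."""
    rng = random.Random(seed)
    sigma = list(theta)                  # current most promising region
    best_point, best_val = None, float("inf")
    for _ in range(iterations):
        if len(sigma) == 1:              # singleton: backtrack to full region
            sigma = list(theta)
        half = len(sigma) // 2
        subs = [sigma[:half], sigma[half:]]            # partition step
        surround = [p for p in theta if p not in sigma]
        regions = subs + ([surround] if surround else [])
        # Estimated promising index of each region: the best estimated
        # performance among a few sampled points.
        winner, winner_pt, winner_val = None, None, float("inf")
        for region in regions:
            for p in rng.sample(region, min(n_sample, len(region))):
                v = estimate(p)
                if v < winner_val:
                    winner, winner_pt, winner_val = region, p, v
        if winner_val < best_val:
            best_point, best_val = winner_pt, winner_val
        # Move to the winning subregion, or backtrack if the surrounding
        # region had the best index.
        sigma = winner if winner in subs else list(theta)
    return best_point

best = nested_partitions(list(range(16)), lambda p: (p - 9) ** 2, n_sample=4)
```

A real implementation would replace the halving partition with a problem-specific one, as in the TSP study of Section 4, and would keep the visit counts of singleton regions to form the final estimate.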
It is easy to see that the partitioning in the NP method plays an important role. It is not assumed that the feasible region has any particular structure, but the partitioning implicitly imposes a structure on the feasible region. If a rich underlying structure is available, it should of course be utilized. The convergence and efficiency of the NP method rest on the partitioning being performed such that good solutions are clustered together in the same subregions. If this is accomplished, the NP method rapidly concentrates the search in those regions and converges quickly. We showed that under certain conditions the NP method converges almost surely to a global optimum. However, such asymptotic results do not provide information concerning how long the algorithm must run to achieve this convergence. Therefore, an important future research topic is to investigate bounds on the rate of convergence and how such bounds can be used to derive stopping criteria for the NP method (Ólafsson and Shi, 1998).
Acknowledgments
The authors would like to thank the associate editor and anonymous referee for comments and suggestions that greatly improved the content and presentation of this paper. The authors would also like to acknowledge the assistance of Mr. Yunpeng Pan in generating the numerical results presented in this paper.
References
S. AndradoÂttir, “A method for discrete stochastic optimization,” Management Science vol. 41 pp. 1946±1961, 1995.
S. AndradoÂttir, “A global search method for discrete stochastic optimization,” SIAM Journal on Optimization vol. 6 pp. 513±530, 1996.
D. A. Barry and B. Fristedt, Bandit Problems, Chapman and Hall: London, 1985.
R. E. Bechhofer, T. J. Santner, and D. Goldsman, Design and Analysis for Statistical Selection, Screening, and
Multiple Comparisons, John Wiley: New York, 1995.
L. Dai, “Convergence properties of ordinal comparison in the simulation of discrete event dynamic systems,”
Journal of Optimization Theory and Applications vol. 19 pp. 363±388, 1996.
M. C. Fu and J. Q. Hu, Conditional Monte Carlo: Gradient Estimation and Optimization Applications, Kluwer
Academic Press: 1997.
S. B. Gelfand and S. K. Mitter, “Simulated annealing with noisy or imprecise energy measurements,” Journal of
Optimization Theory and Applications vol. 62 pp. 49±62, 1989.
W. B. Gong, Y. C. Ho, and W. Zhai, “Stochastic comparison algorithm for discrete optimization with
estimation,” Proceedings of the 31st IEEE Conference on Decision and Control pp. 795±800, 1992.
P. Glasserman, Gradient Estimation Techniques for Discrete Event Systems, Kluwer: Boston, MA, 1991.
P. W. Glynn, “Likelihood ratio gradient estimation for stochastic systems,” Communications of the ACM vol. 33
pp. 75±84, 1990.
W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika vol.
57 pp. 92±109, 1970.
Y. C. Ho and X. R. Cao, Discrete Event Systems and Perturbation Analysis, Kluwer: Boston, MA, 1991.
Y. C. Ho, R. S. Sreenivas, and P. Vakili, “Ordinal optimization of DEDS,” Discrete Event Dynamic Systems:
Theory and Applications vol. 2 pp. 61±88, 1992.
S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by simulated annealing,” Science vol. 220 pp. 671±680,
1983.
J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” Annals of
Mathematical Statistics vol. 23 pp. 462±466, 1952.
P. L’Ecuyer and P. W. Glynn, “Stochastic optimization by simulation: convergence proofs for the GI/G/1 queue in
steady-state,” Management Science vol. 40 pp. 1562±1578, 1994.
M. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equations of state calculations
by fast computing machines,” J. of Chemical Physics vol. 21 pp. 1087±1092, 1953.
K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction, Prentice Hall: Englewood Cliffs,
1989.
W. I. Norkin, G. C. P ̄ug, and A. RuszczynÂski, “A branch and bound method for stochastic global optimization,”
Mathematical Programming vol. 83 pp. 425±450, 1998.
S. OÂ lafsson and L. Shi, “Stopping criteria for a simulation-based optimization method,” in D. J. Medeiros, E. F.
Watson, E. F. Manivannan, and J. Carson (eds.), Proceedings of the 1998 Winter Simulation Conference,
pp. 743±750, 1998.
S. OÂ lafsson and L. Shi, “A method for scheduling in parallel manufacturing systems with ̄exible resources,” IIE
Transactions vol. 32 pp. 135±146, 2000.
G. Reinelt, “TSPLIB – A traveling salesman problem library,” ORSA Journal on Computing vol. 3 pp. 376±84,
1992.
NESTED PARTITIONS METHOD 291
H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics vol. 22 pp. 400±407, 1951.
S. M. Robinson, “Analysis of sample-path optimization,” Mathematics of Operations Research vol. 21 pp. 513± 528, 1996.
R. Y. Rubinstein, “How to optimize discrete-event systems from a single sample path by the score function method,” Annals of Operations Research vol. 27 pp. 175±212, 1991.
R. Y. Rubinstein and A. Shapiro, “Optimization of simulation models by the score function method,” Mathematics and Computers in Simulation vol. 32 pp. 373±392, 1990.
R. Y. Rubinstein and A. Shapiro, Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method, Wiley: Chichester, England, 1993.
R. Y. Rubinstein, “The cross-entropy method for combinatorial and continuous optimization,” Methodology and Computing in Applied Probability vol. 1 pp. 127±190, 1999.
A. Shapiro, “Simulation-based optimization ± convergence analysis and statistical inference,” Communications in Statistics ± Stochastic Models vol. 12 pp. 425±454, 1996.
L. Shi and S. OÂlafsson, “An integrated framework for deterministic and stochastic optimization,” in S. AndradoÂttir, K. J. Healy, D. H. Withers, and B. L. Nelson (eds.), Proceedings of the 1997 Winter Simulation Conference, pp. 358±365, 1997.
L. Shi and S. OÂ lafsson, “Nested partitions method for global optimization,” Operations Research vol. 48(3) p. 390, 2000.
L. Shi, S. OÂ lafsson, and N. Sun, “New parallel randomized algorithms for the traveling salesman problem,” Computers & Operations Research vol. 26 pp. 371±394, 1999.
Z. B. Tang, “Adaptive partitioned random search to global optimization,” IEEE Trans. on Automatic Control vol. 39 pp. 2235±2244, 1994.
S. Yakowitz, P. L’Ecuyer, and F. VaÂzquez-Abad, “Global stochastic optimization with low-dispersion point sets,” Operations Research in press, 1999.
D. Yan and H. Mukai, “Stochastic discrete optimization,” SIAM J. Control and Optimization vol. 30 pp. 594± 612, 1992.