
Methodology and Computing in Applied Probability, 2:3, 271–291, 2000. © 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.
Nested Partitions Method for Stochastic Optimization
LEYUAN SHI leyuan@ie.engr.wisc.edu
Department of Industrial Engineering, University of Wisconsin–Madison, 1513 University Avenue, Madison, WI 53706
SIGURDUR ÓLAFSSON olafsson@iastate.edu
Department of Industrial and Manufacturing Systems Engineering, Iowa State University, 2019 Black Engineering, Ames, IA 50011
Received September 28, 1998; Revised June 26, 2000; Accepted July 18, 2000
Abstract. The nested partitions (NP) method is a recently proposed alternative for global optimization. Primarily aimed at problems with large but finite feasible regions, the method employs a global sampling strategy that is continuously adapted via a partitioning of the feasible region. In this paper we adapt the original NP method to stochastic optimization where the performance is estimated using simulation. We prove asymptotic convergence of the new method and present a numerical example to illustrate its potential.
Keywords: simulation-based optimization, combinatorial optimization, Markov chain Monte Carlo.
AMS 1991 Subject Classification: 78M50, 78M35
1. Introduction
Optimizing the configuration of complex discrete event systems such as computer systems, communication networks, and manufacturing systems is often a daunting task. Many factors contribute to this difficulty, including the combinatorial explosion of alternatives and the lack of structure available to guide the search. In the deterministic context, this often leads to NP-hard optimization problems. In the case of stochastic optimization, the difficulty is exacerbated by the added randomness, and simulation is often the only tool available for optimizing such stochastic systems.
Simulation optimization methods can be traced to two seminal papers by Robbins and Monro (1951) and Kiefer and Wolfowitz (1952). These papers gave rise to the family of techniques called stochastic approximation (SA) methods, and the interested reader is referred to Benveniste, Métivier, and Priouret (1990) for a discussion of such methods. In recent years, simulation-based optimization has been greatly advanced through the use of two classes of methods that extract a significantly increased amount of information from a single simulation run. These techniques, perturbation analysis (PA) methods (Ho and Cao, 1991; Glasserman, 1991; Fu and Hu, 1997) and score function (SF) or likelihood ratio (LR) methods (Glynn, 1990; Rubinstein and Shapiro, 1993), make it possible to obtain not

272 SHI AND ÓLAFSSON
only a system performance estimate, but also all sensitivities from a single sample path. The PA methods may be used along with an SA-type procedure for simulation-based stochastic optimization (L'Ecuyer and Glynn, 1994), and the SF technique can be applied to stochastic optimization, without resorting to an iterative procedure and multiple simulations, by replacing the optimization problem with its stochastic counterpart, which can then be solved using the powerful arsenal of mathematical programming (Rubinstein, 1991; Rubinstein and Shapiro, 1990). Similar ideas are employed in the sample-path optimization method, which also uses mathematical programming to solve stochastic optimization problems with only a single sample path evaluated (Robinson, 1996). All of these methods are primarily intended for continuous stochastic optimization.
In this paper, we consider discrete stochastic optimization problems, in particular problems with a finite feasible region. Applications that involve discrete parameters and an uncertain performance measure are abundant in numerous areas, but despite its practical interest, the general discrete stochastic optimization problem has received relatively little attention to date. Recent methods proposed for this problem include simulated annealing (Gelfand and Mitter, 1989), the stochastic ruler method (Yan and Mukai, 1992), the stochastic comparison method (Gong, Ho, and Zhai, 1992), the cross-entropy method (Rubinstein, 1998), and the method of Andradóttir (1995, 1996). In each of the above methods a Markov chain is constructed, and convergence almost surely or in probability is proved by analyzing the long-run behavior of the Markov chain. Such optimization methods can be traced back to the Metropolis method (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller, 1953), which was later generalized by Hastings (1970). The well-known simulated annealing algorithm, which imitates the physical process of annealing, is based on the same concept (Kirkpatrick, Gelatt, and Vecchi, 1983). The stochastic ruler method resembles simulated annealing, but instead of solving the original problem directly, it solves an alternative problem that agrees with the original under certain conditions. If the two problems agree, the method converges in probability to a global optimum. The stochastic comparison method is also based on ideas similar to the stochastic ruler and simulated annealing algorithms. The method of Andradóttir (1995) can be considered a discrete stochastic approximation method; it compares two neighboring points in each iteration and moves to the point that is found to be better.
Other related literature that also deals with finite stochastic optimization includes ranking-and-selection and multiple-comparison methods (Bechhofer, Santner, and Goldsman, 1995), which are typically used for comparing a small number of alternatives, multi-armed bandit methods (Barry and Fristedt, 1985), learning automata (Narendra and Thathachar, 1989), and the low-dispersion point set method of Yakowitz, L'Ecuyer, and Vázquez-Abad (1999). Also of relevance to the analysis here is the recent work of Norkin, Pflug, and Ruszczyński (1998), where the authors extend the classical deterministic branch-and-bound method to stochastic problems. Finally, related to ranking-and-selection procedures, but aimed at large-scale optimization, is ordinal optimization (Ho, Sreenivas, and Vakili, 1992; Dai, 1996). This methodology differs from traditional cardinal optimization methods in that its aim is only to find feasible points whose performance places them in a given percentile of all feasible points.
In a recent paper, Shi and Ólafsson (2000) introduce a new optimization method, the nested partitions (NP) method, for global optimization when the objective function can be

NESTED PARTITIONS METHOD 273
evaluated deterministically. This method has been found to be quite efficient for various combinatorial optimization problems (Shi, Ólafsson, and Sun, 1999; Ólafsson and Shi, 2000). In this paper we present a stochastic NP method that can be applied when no analytical expression exists for the objective function and it must be estimated through simulation or real-time observation. The only modification needed from the original algorithm is in the way the optimum is estimated, and the method is hence readily adapted from the deterministic algorithm. The basic idea behind the NP method is to adaptively sample from the feasible region. To adapt the sampling, the feasible region is systematically partitioned and the sampling concentrated in the subset that is considered the most promising. A somewhat similar partitioning and random sampling scheme is proposed in Tang (1994), but that approach is directed towards continuous problems in the deterministic context and does not move between fixed regions as the NP method does. We show that the NP algorithm generates an ergodic Markov chain, and we use this to develop conditions under which the NP method converges to an optimal solution with probability one.
The remainder of this paper is organized as follows. In Section 2 we give a complete description of the NP method. In Section 3 we prove the asymptotic convergence of the method. In Section 4 we present a numerical example to illustrate the method, and finally, Section 5 contains some concluding remarks.
2. The Nested Partitions Method
In mathematical notation, we want to solve the problem
min_{θ∈Θ} J(θ),   (1)

where Θ is a finite feasible region and J: Θ → ℝ is a performance function that is subject to noise. In other words, for any feasible point θ ∈ Θ, J(θ) cannot be evaluated analytically. Often J(θ) is the expectation of some random estimate of the performance of a complex stochastic system given a parameter θ,

J(θ) = E[L(θ)].   (2)

Here L(θ) is a random variable which depends on the parameter θ ∈ Θ. We assume that L(θ) is an unbiased discrete-event simulation estimate of the true performance and refer to it as the sample performance.
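To make the setting of equation (2) concrete, the expectation is typically approximated by averaging independent simulation replications. The following sketch is illustrative only: the quadratic performance J(θ) = θ² and the Gaussian noise are assumptions for the example, not taken from the paper.

```python
import random

def sample_performance(theta, rng):
    """Hypothetical sample performance L(theta): the unknown true
    performance J(theta) = theta**2 observed with zero-mean noise,
    so that E[L(theta)] = J(theta) as in equation (2)."""
    return theta ** 2 + rng.gauss(0.0, 1.0)

def estimate_J(theta, n, rng):
    """Unbiased Monte Carlo estimate of J(theta) from n replications."""
    return sum(sample_performance(theta, rng) for _ in range(n)) / n

rng = random.Random(0)
estimate = estimate_J(3, 10_000, rng)   # close to J(3) = 9
```

The NP method never needs such a high-precision point estimate; as discussed below, only the ordinal comparison of noisy estimates drives the search.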
2.1. The NP Methodology
We now describe a new approach applicable to simulation-based optimization: the nested partitions (NP) method. The basic idea of the NP method is very simple. In the k-th iteration there is a region σ(k) ⊆ Θ that is considered the most promising. In the first iteration nothing is assumed to be known about where good solutions are located, so the entire feasible region σ(0) = Θ is taken as the most promising region. The most promising region is then partitioned into M_{σ(k)} subregions, where M_{σ(k)} may depend on the subset σ(k) but not on the iteration. What remains of the feasible region, Θ \ σ(k), is aggregated into one region called the surrounding region. Therefore, at the k-th iteration M_{σ(k)} + 1 disjoint subsets that cover the feasible region are considered. Each of these regions is sampled using some random sampling scheme, and the samples are used to estimate the promising index of each region. This index is a set performance function that determines which region becomes the most promising region in the next iteration. If one of the subregions is found to be best, this region becomes the most promising region. If the surrounding region is found to be best, the method backtracks to a larger region, chosen according to a fixed backtracking rule. The new most promising region is partitioned and sampled in a similar fashion. This generates a sequence of set partitions, with each partition nested within the last. We now assume that such a partitioning scheme has been fixed. The following definitions will be used throughout the analysis.
DEFINITION 1 A region constructed using a fixed partitioning scheme is called a valid region given the fixed partition. The collection of all valid regions is denoted by Σ. Singleton regions are of special interest, and Σ₀ ⊆ Σ denotes the collection of all such valid regions.
DEFINITION 2 We define the depth, d: Σ → ℕ₀, of any valid region iteratively, with Θ having depth zero, subregions of Θ having depth one, and so forth. Since they cannot be partitioned further, we call the singleton regions in Σ₀ regions of maximum depth.
Ultimately only the maximum depth regions are of interest, that is, we want to find a region that contains only one point. Therefore, using the estimate proposed by Andradóttir (1995, 1996), we let the estimated best region in the k-th iteration be the maximum depth region, σ̂_opt(k) ∈ Σ₀, that has been most frequently considered the most promising region. Consequently, the method must keep track of the number of times, N_k(σ), that a region σ ∈ Σ₀ has been visited by the k-th iteration. Note that it suffices to keep track of this for regions that have been visited at least once.
DEFINITION 3 We let the estimate of the best solution be

σ̂_opt(k) ∈ argmax_{σ∈Σ₀} N_k(σ),   (3)

the most frequently visited singleton region by the k-th iteration, that is, the region that has most often been the most promising region.
We note that in general σ̂_opt(k) ≠ σ(k), which is the key modification of the deterministic algorithm. As is shown in Shi and Ólafsson (1999), if the performance measure is deterministic, then σ(k) converges with probability one to a global optimum, and hence we can let σ̂_opt(k) = σ(k) in the deterministic context. As we will see in the next section, however, σ(k) does not converge at all for stochastic problems.
DEFINITION 4 If a valid region σ ∈ Σ is formed by partitioning a valid region η ∈ Σ, then σ is called a subregion of region η, and region η is called a superregion of region σ. We define the superregion function s: Σ → Σ as follows. Let σ ∈ Σ \ {Θ}. Define s(σ) = η ∈ Σ if and only if σ ⊂ η and, whenever σ ⊆ ξ ⊆ η for some ξ ∈ Σ, then ξ = η or ξ = σ. For completeness we define s(Θ) = Θ.
The NP method shifts the focus from specific points in the feasible region Θ to a space of subsets, namely the space of all valid regions. Consequently, a set performance function I: Σ → ℝ is needed. This set function can then be used to select the most promising region and is therefore called the promising index of the region. The only condition imposed on such a promising index function is that if a valid region is a singleton, then the promising index agrees with the original performance function, as stated in the final definition of this section.
DEFINITION 5 A set function I: Σ → ℝ is a valid promising index if and only if I(σ) = J(θ) for all σ = {θ} ∈ Σ₀.
Finally note that since the performance of each point must be estimated, the promising index must also be estimated.
2.2. The NP Algorithm
We can now describe a generic implementation of the NP algorithm. Note that special consideration has to be given in the partitioning step to the two special cases of the most promising region being σ(k) = Θ or σ(k) ∈ Σ₀ for some k ≥ 0.
NP Algorithm
Step 0. Initializing. Let k = 0, N₀(σ) = 0 for all σ ∈ Σ₀, σ̂_opt(0) = Θ, and σ(0) = Θ.
Step 1. Partitioning. First, unless the current most promising region is a singleton, partition it into disjoint subsets. In other words, if σ(k) ∉ Σ₀, partition σ(k) into M_{σ(k)} subregions σ₁(k), …, σ_{M_{σ(k)}}(k). Here we allow M_{σ(k)} to depend on the region σ(k), but not on the iteration. Otherwise, if σ(k) ∈ Σ₀, then M_{σ(k)} = 1 and σ₁(k) = σ(k).
Secondly, if σ(k) ≠ Θ, aggregate the surrounding region Θ \ σ(k) into one region σ_{M_{σ(k)}+1}(k) and let M = M_{σ(k)} + 1. Otherwise, if σ(k) = Θ, let M = M_{σ(k)}.
Step 2. Random Sampling. Use a random sampling procedure to select N_{σ_j(k)} independent points θ^{j1}, θ^{j2}, …, θ^{jN_{σ_j(k)}} from each of the regions σ_j(k), and calculate the corresponding sample performance values,

L(θ^{j1}), L(θ^{j2}), …, L(θ^{jN_{σ_j(k)}}),  j = 1, 2, …, M.

The only requirements on the random sampling procedure are that it be independent of previous iterations and that each point in the region have a positive probability of being selected. As above, the number of sample points may depend on the region.
Step 3. Estimating the Promising Index. Given a promising index function I: Σ → ℝ, for each region σ_j(k), j = 1, 2, …, M, calculate the estimated promising index Î(σ_j(k)) of the region. In this paper we assume that the promising index has been selected to be

I(η) = min_{θ∈η} J(θ),

and is estimated using

Î(σ_j(k)) = min_{i∈{1,2,…,N_{σ_j(k)}}} L(θ^{ji}),  j = 1, 2, …, M.   (4)
It is clear from equation (4) that accurately estimating the promising index, and hence the performance function, is not critical. Only the ordinal values affect how the NP algorithm proceeds. If subregion σ_{j*}(k) contains the true global optimum, then it is sufficient that

Î(σ_{j*}(k)) < Î(σ_j(k)),  ∀j ≠ j*.

If this inequality holds, then the correct subregion is identified.
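The ordinal nature of equation (4) can be illustrated with a small sketch. The quadratic performance, the noise level, and the two example subregions below are assumptions for illustration, not taken from the paper.

```python
import random

def estimated_promising_index(points, n_samples, L, rng):
    """Estimate the promising index I(sigma) = min over theta in sigma of
    J(theta) by the minimum sample performance over n_samples points drawn
    uniformly (with replacement) from the region -- equation (4)."""
    return min(L(rng.choice(points)) for _ in range(n_samples))

# Ordinal comparison of two subregions under noise: the subregion whose
# best point is better should usually receive the smaller estimated index.
rng = random.Random(1)
noisy_L = lambda t: t * t + rng.gauss(0.0, 0.5)   # J(theta) = theta^2 plus noise
i_good = estimated_promising_index(range(0, 5), 5, noisy_L, rng)
i_bad = estimated_promising_index(range(5, 10), 5, noisy_L, rng)
```

Here the exact values of `i_good` and `i_bad` fluctuate with the noise, but their ordering is almost always correct, which is all the NP algorithm requires.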
Step 4. Selecting the Most Promising Region. Select the index of the region with the best estimated promising index,

ĵ_k ∈ argmin_{j=1,…,M} Î(σ_j(k)).   (5)

If more than one region is equally promising, the tie is broken such that each of these regions has an equal probability of being selected. If the index (5) corresponds to a region that is a subregion of σ(k), then this becomes the most promising region in the next iteration. Otherwise, if the index corresponds to the surrounding region, the method backtracks to a larger region containing the current most promising region. In this paper we consider two options for backtracking:

(1) Backtracking to the superregion of the current most promising region. That is, let

σ(k+1) = σ_{ĵ_k}(k) if ĵ_k < M, and σ(k+1) = s(σ(k)) otherwise.   (6)

(2) Backtracking to the entire feasible region. That is, let

σ(k+1) = σ_{ĵ_k}(k) if ĵ_k < M, and σ(k+1) = Θ otherwise.   (7)
We refer to the first option as the NP I algorithm and the latter as the NP II algorithm. Further discussion of these two variations may be found in Shi and Ólafsson (2000).
Step 5. Updating Counters. If σ = σ(k+1) ∈ Σ₀, then let N_{k+1}(σ) = N_k(σ) + 1, and let N_{k+1}(σ) = N_k(σ) for all σ ∈ Σ₀ with σ ≠ σ(k+1). If there exists σ ∈ Σ₀ such that N_{k+1}(σ) > N_{k+1}(σ̂_opt(k)), then let σ̂_opt(k+1) = σ; otherwise let σ̂_opt(k+1) = σ̂_opt(k). Let k = k + 1 and go back to Step 1.
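Steps 0–5 can be sketched end to end. The sketch below instantiates the generic algorithm on the hypothetical feasible region Θ = {0, …, n_points − 1} (n_points a power of two) with a fixed halving partition, uniform sampling, the minimum-based index of equation (4), and NP I or NP II backtracking. Ties are broken by lowest index rather than uniformly, a simplification; all names and the problem instance are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def np_algorithm(n_points, L, iterations, n_samples, rng, backtrack="NP I"):
    """Sketch of the stochastic NP algorithm (Steps 0-5) on the feasible
    region Theta = {0, ..., n_points - 1}, n_points a power of two.  A valid
    region is a half-open interval (lo, hi); Step 1 always splits it in
    half, which fixes the partitioning tree.  L is the noisy sample
    performance to be minimized."""
    whole = (0, n_points)
    sigma = whole                          # Step 0: sigma(0) = Theta
    visits = defaultdict(int)              # counters N_k(sigma), singletons only

    def sample_point(region, lo, hi):
        """Uniform sampling; 'surrounding' means Theta minus [lo, hi)."""
        if region == "surrounding":
            t = rng.randrange(n_points - (hi - lo))
            return t if t < lo else t + (hi - lo)
        return rng.randrange(region[0], region[1])

    for _ in range(iterations):
        lo, hi = sigma
        # Step 1: partition sigma (a singleton is kept whole) and aggregate
        # the surrounding region unless sigma is the entire feasible region.
        mid = (lo + hi) // 2
        regions = [(lo, mid), (mid, hi)] if hi - lo > 1 else [sigma]
        if sigma != whole:
            regions.append("surrounding")
        # Steps 2-3: sample each region; the estimated promising index is
        # the minimum sampled performance, equation (4).
        index = [min(L(sample_point(r, lo, hi)) for _ in range(n_samples))
                 for r in regions]
        # Step 4: move to the best subregion, or backtrack.  (Ties are
        # broken by lowest index here, not uniformly as the paper requires.)
        best = regions[index.index(min(index))]
        if best == "surrounding":
            size = hi - lo
            if backtrack == "NP I":        # move to the superregion s(sigma)
                sigma = (lo - size, hi) if (lo // size) % 2 else (lo, hi + size)
            else:                          # NP II: backtrack to Theta
                sigma = whole
        else:
            sigma = best
        # Step 5: update counters for singleton most promising regions.
        if sigma[1] - sigma[0] == 1:
            visits[sigma[0]] += 1
    return max(visits, key=visits.get)     # most visited singleton, Definition 3
```

For a noisy performance such as `lambda t: (t - 13) ** 2 + rng.gauss(0, 0.1)` on 32 points, the returned estimate is the most frequently visited singleton per Definition 3, which typically locks onto the minimizer even though individual samples are noisy.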
The implementation of this algorithm differs from its previously proposed deterministic version only in the use of estimation in Step 3 and in the addition of the last step. The latter is necessary because the estimate of the best region is not σ(k) but the most frequently visited singleton, and hence the counters {N_k(σ)}_{σ∈Σ₀} must be maintained and the most frequently visited singleton updated.
It is worthwhile to point out some computational issues related to the implementation of the NP algorithm. First, for the NP I algorithm, notice that in order to backtrack in Step 4 the superregion of the current most promising region must be available. This imposes minimal memory overhead, because we only need to keep track of the regions that lead from the current most promising region back to the entire feasible region, generally a very small fraction of the number of valid regions. Secondly, for both algorithms, from Step 5 we see that we must keep track of the counters N_k(σ) for valid regions σ ∈ Σ₀. This could impose a large memory burden on the algorithm. In practice, however, we are not likely to visit all regions of maximum depth. We can therefore use a dynamic data structure that maintains the counters N_k(σ) only for those maximum depth regions σ ∈ Σ₀ that become the most promising region at least once.
We finally note that for most problems there are many valid partitions, and the partitioning implicitly imposes a structure on the feasible region. If this structure is such that good solutions tend to be clustered together, the NP method is likely to rapidly concentrate the search in these regions and converge quickly. The partitioning is thus a means of concentrating the sampling effort, and similarly the backtracking is a means of making the sampling less concentrated. Consequently, the partitioning, random sampling, and backtracking constitute the global search phase of the NP method. The estimation of a promising index, on the other hand, is of a different nature. In its simplest form, the estimated promising index is a summary statistic of the sampling information, for example the average or the minimum performance in a given region. However, it is also possible to incorporate local search methods or heuristics into the computation of the promising index. That is, when estimating the promising index of a region, a local search may be started from each sample point and constrained to stay within the region. To keep the discussion simple, we only consider the best sampled performance in the region as the promising index in this paper. For examples of other promising indices, where local search techniques are successfully used in the deterministic context, we refer the reader to Shi, Ólafsson, and Sun (1999) and Ólafsson and Shi (2000). Next we address the asymptotic convergence of the method.
3. Convergence
The NP method uses the set performance function I: Σ → ℝ to guide its movements, and this leads us to shift our focus to finding an element σ_opt ∈ S, where

S = argmin_{σ∈Σ₀} I(σ).   (8)

Given certain assumptions, we show that the NP method identifies an element of S, that is, an optimal solution of problem (8) above. This, however, is equivalent to solving the original problem (1), since the original performance function J: Θ → ℝ and the set performance function I: Σ → ℝ agree on singletons.
3.1. Basic Properties
In this section we show that the NP method generates an ergodic Markov chain. Furthermore, we show that the estimate of the best singleton region, {σ̂_opt(k)}_{k=1}^∞, converges with probability one to a maximizer of the stationary distribution of the Markov chain. The stationary distribution can hence be used as a basis for inference. Since the same method is used for estimating the global optimum, some of the proofs in this section are similar to those of Andradóttir (1995, 1996). For any σ, η ∈ Σ, we use P(σ, η) to denote the transition probability of the Markov chain moving from state σ to state η, and P^n(σ, η) to denote the n-step transition probability. We also let P[A] denote the probability of an event A.
Since in the k-th iteration the next most promising region σ(k+1) depends only on the current most promising region σ(k) and the sampling information obtained in the k-th iteration, the sequence of most promising regions {σ(k)}_{k=1}^∞ is a Markov chain with state space Σ. We refer to this Markov chain as the NP Markov chain when referring to either the NP I or NP II algorithm, and as the NP I Markov chain or the NP II Markov chain when referring to the NP I or NP II algorithm specifically. To give an expression for the transition probabilities of the NP Markov chain we require the following definition.
DEFINITION 6 Let b(σ) denote the region that the algorithm backtracks to from state σ ∈ Σ \ {Θ}, that is, b: Σ → Σ where

b(σ) = s(σ) for NP I, and b(σ) = Θ for NP II.   (9)

Also, let

h(σ) = {η ∈ Σ : s(η) = σ}   (10)

denote the set of subregions sampled if σ ∈ Σ is the most promising region.
In each iteration the NP method selects the next most promising region based on which region has the best estimated promising index. Therefore, the transition probabilities may be written in terms of the estimated promising index. We need to consider three cases:
(1) If σ = Θ, then the NP algorithm moves to one of the subregions in the next iteration, and the set of subregions that have the best estimated performance is given by

B₁ = {ξ ∈ h(σ) : Î(ξ) ≤ Î(η), ∀η ∈ h(σ)}.   (11)

The transition probabilities are given by

P(σ, ξ) = Σ_{i=1}^{|h(σ)|} (1/i) P[ξ ∈ B₁ | |B₁| = i] P[|B₁| = i]   (12)

for ξ ∈ h(σ), and P(σ, ξ) = 0 for ξ ∉ h(σ). Note that this equation uses the fact that ties are broken in a uniform manner.
(2) If σ ∈ Σ \ ({Θ} ∪ Σ₀), then the NP algorithm either moves to a subregion or backtracks in the next iteration, and the set of regions that have the best estimated performance is given by

B₂ = {ξ ∈ h(σ) ∪ {Θ \ σ} : Î(ξ) ≤ Î(η), ∀η ∈ h(σ) ∪ {Θ \ σ}}.   (13)

The transition probabilities are given by

P(σ, ξ) = Σ_{i=1}^{|h(σ)|+1} (1/i) P[ξ ∈ B₂ | |B₂| = i] P[|B₂| = i]   (14)

for ξ ∈ h(σ) ∪ {Θ \ σ}, and P(σ, ξ) = 0 for ξ ∉ h(σ) ∪ {Θ \ σ}.

(3) If σ ∈ Σ₀, then the NP algorithm either stays in the current most promising region or backtracks in the next iteration, and the transition probabilities are given by

P(σ, ξ) = P[Î(Θ \ σ) < Î(σ)] + P[Î(Θ \ σ) = Î(σ)]/2,  if ξ = b(σ),
P(σ, ξ) = P[Î(Θ \ σ) > Î(σ)] + P[Î(Θ \ σ) = Î(σ)]/2,  if ξ = σ,
P(σ, ξ) = 0,  otherwise.   (15)
Equations (11)–(15) completely define the transition probabilities for both NP algorithms. We will show that the NP Markov chain has a unique stationary distribution, but first we exclude certain trivial cases from the analysis. In particular, we assume that there is no region σ ∈ Σ \ {Θ} such that, despite the noisy estimates, every point in σ has, with probability one, better estimated performance than all the points outside of σ, that is, in Θ \ σ. Formally, we state the assumption as follows.

ASSUMPTION 1 For all σ ∈ Σ \ {Θ}, P(σ, b(σ)) > 0.
We note that this assumption implies that for any η ∈ Σ, P^k(η, Θ) > 0, where k = d(η) for the NP I algorithm and k = 1 for the NP II algorithm. Also, since the random sampling scheme assumes that each point in the sampled region is selected with positive probability, Assumption 1 is equivalent to assuming that for every valid region σ ∈ Σ there exists at least one pair of points θ₁ ∈ σ and θ₂ ∈ Θ \ σ such that P[L(θ₁) > L(θ₂)] > 0. Using this observation, it can be seen that the assumption imposes no loss of generality. Indeed, if it does not hold, let σ₀ be the smallest region for which it fails. It is then straightforward to verify that σ₀ is closed, that all the other states are transient, and that the Markov chain eventually enters σ₀ and never leaves. Thus σ₀ contains all the interesting solutions, since, with probability one, all the points in σ₀ have better performance than all points outside σ₀. If σ₀ is a singleton, it is an absorbing state to which the NP method converges in finite time, and no further analysis is necessary. If σ₀ is not a singleton, the analysis below can be applied with σ₀ replacing Θ as the feasible region. Hence, Assumption 1 imposes no loss of generality.

PROPOSITION 1 If Assumption 1 holds, then the NP Markov chain has a unique stationary distribution {π(σ)}_{σ∈Σ}.
Proof: Let Σ_P denote the set of positive recurrent states of the Markov chain. Since Σ is finite, Σ_P ≠ ∅, which implies that the NP Markov chain has at least one stationary distribution. To show that it is also unique, we must show that Σ_P is irreducible. We first show that Θ ∈ Σ_P. Let σ ∈ Σ_P. Then by Assumption 1 there exists some k such that P^k(σ, Θ) > 0, and since σ is positive recurrent, Θ is also positive recurrent. Now, to prove irreducibility, let σ, η ∈ Σ_P. Again by Assumption 1 there exist k₁ and k₂ such that P^{k₁}(σ, Θ) > 0 and P^{k₂}(η, Θ) > 0. Furthermore, since η ∈ Σ_P is recurrent and P^{k₂}(η, Θ) > 0, there exists some k₃ such that P^{k₃}(Θ, η) > 0. Hence P^{k₁+k₃}(σ, η) > 0. This holds for any σ, η ∈ Σ_P, so Σ_P is irreducible, and hence the Markov chain has a unique stationary distribution {π(σ)}_{σ∈Σ}. ∎
We note that since a unique stationary distribution exists, the average number of visits to each state converges to this distribution with probability one (w.p.1), that is,

lim_{k→∞} N_k(σ)/k = π(σ), ∀σ ∈ Σ, w.p.1.   (16)

Furthermore, since P(σ, σ) > 0 for some σ ∈ Σ₀, the chain is aperiodic, and hence the k-step transition probabilities converge pointwise to the stationary distribution: lim_{k→∞} P^k(Θ, σ) = π(σ), ∀σ ∈ Σ. The next step in establishing the convergence of the NP method is to use (16) to show that the estimate of the best solution converges to a maximizer of the stationary distribution of the NP Markov chain, and hence that the stationary distribution can be used for inference.
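The visit-frequency property (16) is easy to observe numerically. The 3-state chain below is an arbitrary ergodic illustration, not the NP chain itself: the long-run visit frequencies settle at a vector that is invariant under the transition matrix.

```python
import random

# Arbitrary ergodic 3-state transition matrix (rows sum to one).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def visit_frequencies(P, steps, rng):
    """Simulate the chain and return the visit frequencies N_k(state)/k."""
    state, counts = 0, [0] * len(P)
    for _ in range(steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]

freqs = visit_frequencies(P, 200_000, random.Random(0))
# freqs is approximately stationary: freqs[j] ~ sum_i freqs[i] * P[i][j]
```

Checking invariance of the empirical frequencies avoids having to solve for the stationary distribution explicitly.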
THEOREM 1 Assume that Assumption 1 holds. Then the estimate of the best region σ̂_opt(k) converges to a maximizer of the stationary distribution of the NP Markov chain, that is,

lim_{k→∞} σ̂_opt(k) ∈ argmax_{σ∈Σ₀} π(σ), w.p.1.   (17)
Proof: Let k₀ = min{k ≥ 1 : σ(k) ∈ Σ₀} be the first time the NP method reaches maximum depth. If σ_opt ∈ S is a global optimum, then there is a positive probability that σ_opt is sampled for d(σ_opt) consecutive iterations, which together with Θ ∈ Σ_P implies that P^{d(σ_opt)}(Θ, σ_opt) > 0, i.e., σ_opt ∈ Σ_P. Hence there is at least one positive recurrent state in Σ₀, and consequently k₀ < ∞ almost surely.
Now combine this with the fact that equation (16) holds, and let A be a set of sample points such that P[A] = 1 and, for all ω ∈ A, k₀(ω) < ∞ and lim_{k→∞} N_k(σ, ω)/k = π(σ). Let ω ∈ A be a fixed sample point. Since Σ₀ is finite, there exists a finite constant K_ω ≥ k₀(ω) such that for all k ≥ K_ω,

argmax_{σ∈Σ₀} N_k(σ) ⊆ argmax_{σ∈Σ₀} π(σ).

Now by definition σ̂_opt(k, ω) ∈ argmax_{σ∈Σ₀} N_k(σ, ω) for all k ≥ k₀, so σ̂_opt(k, ω) ∈ argmax_{σ∈Σ₀} π(σ) for all k ≥ K_ω. This completes the proof. ∎

To prove asymptotic global convergence, we need to show that a singleton maximizer of the stationary distribution can be used as an estimate of the global optimum. As will be explained shortly, without some conditions on the problem structure this may not hold.
3.2. Global Convergence
We now show that, given certain regularity conditions, the observation that the NP method converges to a maximizer of the stationary distribution implies that it converges to a global optimum. In particular, we show that a maximum depth region with the largest stationary probability is also a global optimum, from which it follows that the NP method converges to a global optimum.
Without assumptions about the problem structure this does not necessarily hold. For example, assume that σ_opt is a unique global optimum, but that there exists a set η ∈ Σ such that σ_opt ⊂ η and J(θ) > J(φ) for all θ ∈ η \ σ_opt and φ ∈ Θ \ η. Then we expect the transition probability P^{d(η)}(Θ, η) into η to be small, which implies that the transition probability P^{d(σ_opt)}(Θ, σ_opt) into σ_opt is also likely to be small. This indicates that π(σ_opt) may be quite small. Therefore, if the performance measure changes drastically from the global optimum to points that are close in the partitioning metric, the NP method is not guaranteed to find the global optimum. Consequently, we must exclude such problem structures. We start with a definition.
DEFINITION 7 Let σ and η be any valid regions. Then there exists some sequence of regions σ = ξ₀, ξ₁, …, ξ_n = η that the Markov chain can move along to get from state σ to state η. We call the shortest such sequence the shortest path from σ to η, and we define k(σ, η) to be the length of the shortest path.

Note that the shortest path is unique and can be found by moving directly from σ to the smallest valid region that contains both σ and η, and then proceeding directly from this region to η. The following condition, on transition probabilities along the shortest paths connecting the global optimum to the other maximum depth regions, turns out to be sufficient for global convergence. The interpretation of this assumption is discussed further at the end of the section.
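The up-then-down structure of the shortest path can be made concrete for a halving partition of an interval. The interval encoding below is an illustrative assumption, not the paper's notation.

```python
def shortest_path(sigma, eta):
    """Shortest path between valid regions of a binary halving partition of
    Theta = [0, 2**D): move up from sigma to the smallest valid region
    containing both sigma and eta, then down to eta.  Regions are (lo, hi)
    half-open intervals whose sizes are powers of two."""
    def parent(region):
        lo, hi = region
        size = hi - lo
        return (lo - size, hi) if (lo // size) % 2 else (lo, hi + size)

    up = [sigma]                       # climb until eta is contained
    while not (up[-1][0] <= eta[0] and eta[1] <= up[-1][1]):
        up.append(parent(up[-1]))
    down = [eta]                       # eta's ancestors up to the common region
    while down[-1] != up[-1]:
        down.append(parent(down[-1]))
    return up + down[-2::-1]           # k(sigma, eta) = len(path) - 1

path = shortest_path((0, 1), (3, 4))   # in Theta = [0, 4)
# path == [(0, 1), (0, 2), (0, 4), (2, 4), (3, 4)], so k((0,1), (3,4)) = 4
```

For two singletons in opposite halves of Θ, the path climbs to the full feasible region and back down, matching the moves available to the NP I chain.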
ASSUMPTION 2 The set

S₀ = {ξ ∈ Σ₀ : P^{k(η,ξ)}(η, ξ) ≥ P^{k(ξ,η)}(ξ, η), ∀η ∈ Σ₀}   (18)

satisfies S₀ ⊆ S, that is, it is a subset of the set of global optimizers.
The regularity condition in Equation (18), needed for the global convergence of both the NP I and NP II algorithms, guarantees that the transition probability from any part of the feasible region to the global optimum is at least as large as the transition probability from the global optimum back to that region. This ensures that the points close to the global optimum are sufficiently good. We now use this assumption to prove global convergence of the NP I algorithm.

THEOREM 2 Assume that the NP I algorithm is applied and Assumptions 1–2 hold. Then

argmax_{σ∈Σ₀} π(σ) ⊆ S,   (19)

and consequently the NP I algorithm converges with probability one to a global optimum.
Proof: Let η₁, η₂ be any regions of maximum depth, and let ξ₀, ξ₁, …, ξ_{n−1}, ξ_n be the shortest path from η₁ to η₂, where n = k(η₁, η₂). It is clear from the Kolmogorov criterion (Wolff, 1989) that the NP I Markov chain is reversible, and we have that

P(ξ_i, ξ_{i+1}) π(ξ_i) = P(ξ_{i+1}, ξ_i) π(ξ_{i+1}),  i = 0, 1, …, n−1.

We can rewrite this as

π(ξ_i) = [P(ξ_{i+1}, ξ_i)/P(ξ_i, ξ_{i+1})] π(ξ_{i+1}),  i = 0, 1, …, n−1.

Applying this equation iteratively, and using ξ₀ = η₁ and ξ_n = η₂, we obtain

π(η₁) = [P(ξ_n, ξ_{n−1}) ⋯ P(ξ₂, ξ₁) P(ξ₁, ξ₀)] / [P(ξ₀, ξ₁) P(ξ₁, ξ₂) ⋯ P(ξ_{n−1}, ξ_n)] π(η₂).   (20)

Furthermore, since ξ₀, ξ₁, …, ξ_{n−1}, ξ_n is the shortest path from ξ₀ to ξ_n, we also have

P^n(η₁, η₂) = P^n(ξ₀, ξ_n) = P(ξ₀, ξ₁) P(ξ₁, ξ₂) ⋯ P(ξ_{n−1}, ξ_n),

and

P^n(η₂, η₁) = P^n(ξ_n, ξ₀) = P(ξ_n, ξ_{n−1}) ⋯ P(ξ₂, ξ₁) P(ξ₁, ξ₀).

We can therefore rewrite Equation (20) as

π(η₁) = [P^n(η₂, η₁)/P^n(η₁, η₂)] π(η₂) = [P^{k(η₂,η₁)}(η₂, η₁)/P^{k(η₁,η₂)}(η₁, η₂)] π(η₂).   (21)

Now if η₂ ∈ argmax_{σ∈Σ₀} π(σ), then by definition π(η₁) ≤ π(η₂), so by Equation (21) we have

P^{k(η₂,η₁)}(η₂, η₁)/P^{k(η₁,η₂)}(η₁, η₂) ≤ 1,

that is, P^{k(η₁,η₂)}(η₁, η₂) ≥ P^{k(η₂,η₁)}(η₂, η₁). Since η₁ can be chosen arbitrarily, Assumption 2 now implies that η₂ ∈ S₀ ⊆ S, which proves the theorem. ∎

Similarly, the following theorem establishes global convergence for the NP II algorithm.
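The detailed-balance step used in the proof can be checked numerically. Because the NP I chain only moves between a region and its sub- or superregions, i.e., along the edges of the partitioning tree, the Kolmogorov criterion gives reversibility. The birth-death chain below is a generic tree-structured stand-in with arbitrary probabilities, not the NP chain itself.

```python
from fractions import Fraction as F

# A 4-state birth-death chain: P(i, i+1) = p[i], P(i+1, i) = q[i].
# Like the NP I chain, it only moves along tree edges, so it is reversible.
p = [F(1, 2), F(1, 3), F(1, 4)]
q = [F(1, 4), F(1, 2), F(1, 2)]
n = 4

# Build pi from detailed balance, pi(i) P(i, i+1) = pi(i+1) P(i+1, i),
# the relation used to derive equation (20), and then normalize.
pi = [F(1)]
for i in range(n - 1):
    pi.append(pi[-1] * p[i] / q[i])
total = sum(pi)
pi = [x / total for x in pi]

def P(i, j):
    """Entries of the full transition matrix (self-loops absorb the slack)."""
    if j == i + 1:
        return p[i]
    if j == i - 1:
        return q[i - 1]
    if j == i:
        return 1 - (p[i] if i < n - 1 else F(0)) - (q[i - 1] if i > 0 else F(0))
    return F(0)
```

With exact rational arithmetic one can verify that this π also satisfies the global balance equations π = πP, the form of balance used for the NP II chain in the proof of Theorem 3.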
THEOREM 3 Assume that the NP II algorithm is applied and Assumptions 1-2 hold. Then

\[ \arg\max_{\sigma \in \Sigma_0} \pi(\sigma) \subseteq S, \tag{22} \]

and consequently the NP II algorithm converges with probability one to a global optimum.
Proof: We start by looking at a state $\sigma \in \Sigma \setminus (\Sigma_0 \cup \{\Theta\})$. Since the chain leaves this state with probability one, and the state can only be entered by transitions from state $s(\sigma)$, the balance equations are given by

\[ \pi(\sigma) = P(s(\sigma), \sigma)\,\pi(s(\sigma)). \]

These equations can be solved iteratively to obtain

\[ \pi(\sigma) = P^{d(\sigma)}(\Theta, \sigma)\,\pi(\Theta). \tag{23} \]

Now assume $\sigma \in \Sigma_0$. The balance equations are given by

\[ \pi(\sigma)\,P(\sigma, \Theta) = \pi(s(\sigma))\,P(s(\sigma), \sigma). \]

Rewrite and use Equation (23) above to get

\[ \pi(\sigma) = \frac{P(s(\sigma), \sigma)}{P(\sigma, \Theta)}\,\pi(s(\sigma)) = \frac{P(s(\sigma), \sigma)}{P(\sigma, \Theta)}\,P^{d(s(\sigma))}(\Theta, s(\sigma))\,\pi(\Theta) = \frac{P^{d(\sigma)}(\Theta, \sigma)}{P(\sigma, \Theta)}\,\pi(\Theta). \]

Now take another $\eta \in \Sigma_0$; then

\[ \pi(\sigma) = \frac{P^{d(\sigma)}(\Theta, \sigma)}{P(\sigma, \Theta)}\,\pi(\Theta) = \frac{P^{d(\sigma)}(\Theta, \sigma)}{P(\sigma, \Theta)} \cdot \frac{P(\eta, \Theta)}{P^{d(\eta)}(\Theta, \eta)}\,\pi(\eta) = \frac{P^{d(\sigma)+1}(\eta, \sigma)}{P^{d(\eta)+1}(\sigma, \eta)}\,\pi(\eta) = \frac{P^{k(\eta, \sigma)}(\eta, \sigma)}{P^{k(\sigma, \eta)}(\sigma, \eta)}\,\pi(\eta). \tag{24} \]

The remainder of the proof follows the last paragraph of the proof of Theorem 2 above and is omitted. ∎
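The reversibility invoked at the start of the proof of Theorem 2 follows from the Kolmogorov criterion, which can also be checked numerically for any small chain: a chain is reversible exactly when every cycle of states has the same product of transition probabilities traversed in either direction. A minimal sketch, using illustrative three-state chains rather than the NP chain itself:

```python
import itertools

# Kolmogorov criterion: reversible iff every cycle has equal forward and
# backward transition-probability products. Both chains below are toy examples.
def satisfies_kolmogorov(P, tol=1e-12):
    n = len(P)
    for length in range(3, n + 1):
        for cycle in itertools.permutations(range(n), length):
            states = list(cycle) + [cycle[0]]   # close the cycle
            fwd = bwd = 1.0
            for a, b in zip(states, states[1:]):
                fwd *= P[a][b]
                bwd *= P[b][a]
            if abs(fwd - bwd) > tol:
                return False
    return True

# random walk on a path graph 0-1-2: reversible
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
assert satisfies_kolmogorov(P)

# a chain with a preferred rotation direction: not reversible
Q = [[0.0, 0.9, 0.1],
     [0.1, 0.0, 0.9],
     [0.9, 0.1, 0.0]]
assert not satisfies_kolmogorov(Q)
```

The brute-force cycle enumeration is only feasible for small state spaces, but it makes the content of the criterion concrete.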
We note that a sufficient condition for Equation (18) to be satisfied is that in each iteration the probability of making the "correct" decision is at least one half, and that condition (18) depends on the partitioning, sampling, and method of estimation. This condition, needed for global convergence, is in general difficult to verify, but to gain some insight the remainder of the section is devoted to obtaining conditions on the partitioning, sampling, and estimation elements that are sufficient for convergence. These conditions are considerably stronger than (18) but are somewhat more intuitive. For simplicity we assume that there is a unique global optimum $\sigma_{opt} \in S$. We start with a definition.
DEFINITION 8 Let $\Sigma_g = \{\sigma \in \Sigma : \sigma_{opt} \in \sigma\}$ denote all the regions containing the unique global optimum $\sigma_{opt}$ and $\Sigma_b = \Sigma \setminus \Sigma_g$ denote the remaining regions. Define a function $Y : \Sigma_g \times \Sigma_b \to \mathbb{R}$ by

\[ Y(\sigma_g, \sigma_b) = J\big(y^{(1)}_{\sigma_b}\big) - J\big(y^{(1)}_{\sigma_g}\big), \tag{25} \]

where $y^{(1)}_{\sigma}$ denotes the random sample point from $\sigma \in \Sigma$ that has the estimated best performance and is generated in Step 2 of the NP algorithm. Furthermore, let

\[ \hat{Y}(\sigma_g, \sigma_b) = L\big(y^{(1)}_{\sigma_b}\big) - L\big(y^{(1)}_{\sigma_g}\big) \tag{26} \]

denote the corresponding simulation estimate.

We note that for each $\sigma \in \Sigma$, $y^{(1)}_{\sigma} \in \sigma$ is a random variable that is defined by the sampling strategy, and that

\[ \hat{I}(\sigma) = L\big(y^{(1)}_{\sigma}\big). \tag{27} \]

Also note that for any $\sigma_g \in \Sigma_g$, $\sigma_b \in \Sigma_b$, the random variable $Y(\sigma_g, \sigma_b)$ is defined by the partitioning and sampling strategies, whereas if we condition the random variable $\hat{Y}(\sigma_g, \sigma_b)$ on the value of $Y(\sigma_g, \sigma_b)$, its outcome depends only on the estimated sample performance, that is, the randomness inherent in the simulation estimate. We now obtain the following theorem for the NP I algorithm:
THEOREM 4 Assume that $\sigma_{opt}$ is unique and

\[ P\big[\hat{Y}(\sigma_g, \sigma_b) > 0 \,\big|\, Y(\sigma_g, \sigma_b) = y\big] > \frac{y}{2E[Y(\sigma_g, \sigma_b)]} \tag{28} \]

for all $\min_{y \in \sigma_b} J(y) - \max_{y \in \sigma_g} J(y) \le y \le \max_{y \in \sigma_b} J(y) - \min_{y \in \sigma_g} J(y)$, $\sigma_g \in \Sigma_g$, and $\sigma_b \in \Sigma_b$. Then the NP I algorithm converges to a global optimum.
Proof: First note that the assumption of this theorem can be satisfied only if

\[ \frac{y}{2E[Y(\sigma_g, \sigma_b)]} < 1, \]

where we have $y \le \max_{y \in \sigma_b} J(y) - \min_{y \in \sigma_g} J(y)$. Thus, we must have

\[ E[Y(\sigma_g, \sigma_b)] > \frac{1}{2}\Big(\max_{y \in \sigma_b} J(y) - \min_{y \in \sigma_g} J(y)\Big) = \frac{1}{2}\Big(\max_{y \in \sigma_b} J(y) - J(y_{opt})\Big), \tag{29} \]

and thus $E[Y(\sigma_g, \sigma_b)] > 0$. Now let $\sigma_g \in \Sigma_g$, $\sigma_b \in \Sigma_b$, let $F_Y = F_{Y(\sigma_g, \sigma_b)}$ denote the distribution function of $Y(\sigma_g, \sigma_b)$, and condition on the value of $Y(\sigma_g, \sigma_b)$:

\[ P\big[\hat{Y}(\sigma_g, \sigma_b) > 0\big] = \int P\big[\hat{Y}(\sigma_g, \sigma_b) > 0 \,\big|\, Y(\sigma_g, \sigma_b) = y\big]\, dF_Y(y) > \int \frac{y}{2E[Y(\sigma_g, \sigma_b)]}\, dF_Y(y) = \frac{1}{2E[Y(\sigma_g, \sigma_b)]} \int y\, dF_Y(y) = \frac{1}{2}. \]

Now since by definition $\hat{Y}(\sigma_g, \sigma_b) = L\big(y^{(1)}_{\sigma_b}\big) - L\big(y^{(1)}_{\sigma_g}\big)$, this implies that

\[ P\Big[L\big(y^{(1)}_{\sigma_b}\big) > L\big(y^{(1)}_{\sigma_g}\big)\Big] > \frac{1}{2}, \]

that is, by Equation (27) we have $P[\hat{I}(\sigma_b) > \hat{I}(\sigma_g)] > \frac{1}{2}$, which implies

\[ P\big[\hat{I}(\sigma_b) > \hat{I}(\sigma_g)\big] > P\big[\hat{I}(\sigma_b) \le \hat{I}(\sigma_g)\big]. \tag{30} \]

As before, let $\sigma_{opt}$ denote the unique global optimum. Let $\eta \in \Sigma_0$, and let $x_0, x_1, \ldots, x_{n-1}, x_n$ be the shortest path from $\sigma_{opt}$ to $\eta$, where $n = k(\sigma_{opt}, \eta)$. Take any $i$ such that $1 \le i \le n-1$. We need to consider three cases here:

(1) If $\sigma_{opt} \in x_{i-1}$ and $\sigma_{opt} \in x_{i+1}$ then $x_{i-1} \in \Sigma_g$ and $\Theta \setminus x_i \in \Sigma_b$. We also note that according to Definition 6, $x_{i+1} = b(x_i)$ and $x_{i-1} \in h(x_i)$. Thus, according to Equation (14) and the inequality (30) derived above:

\[
\begin{aligned}
P(x_i, x_{i-1}) &= \sum_{l=1}^{|h(x_i)|+1} P\big[x_{i-1} \in B_2 \,\big|\, |B_2| = l\big] \cdot P\big[|B_2| = l\big]/l \\
&= \sum_{l=1}^{|h(x_i)|+1} P\big[\hat{I}(x_{i-1}) \le \hat{I}(\eta'),\ \forall \eta' \in h(x_i) \,\big|\, |B_2| = l\big] \cdot P\big[|B_2| = l\big]/l \\
&> \sum_{l=1}^{|h(x_i)|+1} P\big[\hat{I}(\Theta \setminus x_i) \le \hat{I}(\eta'),\ \forall \eta' \in h(x_i) \,\big|\, |B_2| = l\big] \cdot P\big[|B_2| = l\big]/l \\
&= \sum_{l=1}^{|h(x_i)|+1} P\big[x_{i+1} \in B_2 \,\big|\, |B_2| = l\big] \cdot P\big[|B_2| = l\big]/l \\
&= P(x_i, x_{i+1}).
\end{aligned}
\]

(2) If $\sigma_{opt} \in x_{i-1}$ and $\sigma_{opt} \notin x_{i+1}$ then $x_{i-1} \in \Sigma_g$ and $x_{i+1} \in \Sigma_b$. Furthermore, $x_i = b(x_{i-1}) = b(x_{i+1})$. Similarly to the first case we can use (14), or (12) if $x_i = \Theta$, to show that $P(x_i, x_{i-1}) > P(x_i, x_{i+1})$.

(3) If $\sigma_{opt} \notin x_{i-1}$ and $\sigma_{opt} \notin x_{i+1}$ then $\Theta \setminus x_{i-1} \in \Sigma_g$ and $x_{i+1} \in \Sigma_b$. Furthermore, $x_{i-1} = b(x_i)$ and $x_{i+1} \in h(x_i)$. Again, Equation (14) can be used to show that $P(x_i, x_{i-1}) > P(x_i, x_{i+1})$.

Thus, we have that $P(x_i, x_{i-1}) > P(x_i, x_{i+1})$, and since this holds for any $i$ then it is clear that

\[ P(x_{n-1}, x_{n-2}) \cdots P(x_1, x_0) > P(x_1, x_2) \cdots P(x_{n-1}, x_n). \tag{31} \]

Now we note that, using a similar argument as above and noting that $x_n, \Theta \setminus x_0 \in \Sigma_b$, whereas $x_0, \Theta \setminus x_n \in \Sigma_g$, Equations (30) and (15) imply that $P(x_n, x_{n-1}) > P(x_0, x_1)$, and putting this together with (31) we have

\[ P^{k(\eta, \sigma_{opt})}(\eta, \sigma_{opt}) > P^{k(\sigma_{opt}, \eta)}(\sigma_{opt}, \eta). \tag{32} \]

Now finally, to verify that Equation (18) of Assumption 2 holds, let $x \in S_0$. Then by definition $P^{k(\eta, x)}(\eta, x) \ge P^{k(x, \eta)}(x, \eta)$, $\forall \eta \in \Sigma_0$. If $x \ne \sigma_{opt}$ then we can select $\eta = \sigma_{opt}$ and obtain $P^{k(\sigma_{opt}, x)}(\sigma_{opt}, x) \ge P^{k(x, \sigma_{opt})}(x, \sigma_{opt})$, which contradicts Equation (32) above. Thus, we must have $x = \sigma_{opt}$, which implies that $S_0 = S$, so Assumption 2 is satisfied and global convergence follows by Theorem 2 above. ∎
The assumption of Theorem 4 illustrates sufficient global convergence conditions on the partitioning and sampling on the one hand, and the estimation on the other. In particular, $E[Y(\sigma_g, \sigma_b)]$ depends only on the partitioning and sampling, and the inequality (29), which is implicit in the assumption, can be satisfied by partitioning such that there is a certain amount of separation, or partial non-overlap, between the good and bad sets, and then using enough sample points. On the other hand, given a fixed value of $E[Y(\sigma_g, \sigma_b)]$, the inequality (28) depends on the sample performance being sufficiently accurate and can be satisfied by increasing the simulation time. However, verifying how much sample effort is "enough" and when the accuracy of the sample performance is "sufficient" may still be fairly difficult. Finally, we note that although this theorem is only stated for NP I, similar conditions can be developed for the NP II algorithm.
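To see the mechanism behind condition (28) concretely, the sketch below uses Gaussian simulation noise (an assumption made for illustration; the paper does not specify a noise distribution) and shows that averaging more replications pushes the probability of the estimate falling on the correct side of zero well above one half:

```python
import random
import statistics

# Toy illustration of condition (28): the estimate is an average of r noisy
# observations of the true difference y. More replications shrink the noise
# and raise P[estimate > 0 | Y = y].
def prob_correct(y, sigma, r, trials=5000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # one simulation estimate: the mean of r noisy replications
        est = statistics.fmean(rng.gauss(y, sigma) for _ in range(r))
        hits += est > 0
    return hits / trials

p1 = prob_correct(y=0.5, sigma=2.0, r=1)    # roughly 0.60
p25 = prob_correct(y=0.5, sigma=2.0, r=25)  # roughly 0.89
assert p25 > p1 > 0.5
```

With a single replication the correct comparison is only slightly more likely than a coin flip; with twenty-five it is nearly certain, which is the sense in which increasing simulation effort makes (28) easier to satisfy.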
4. Numerical Example
In this section we illustrate the methodology presented in this paper via a numerical example. A classical combinatorial optimization problem is the symmetric traveling salesman problem (TSP), which involves finding the shortest Hamiltonian cycle in an undirected graph with n nodes. This problem has many well known practical applications, such as routing robots through automatic warehouses, scheduling couriers for pickups at ATMs, and drilling holes through printed circuit boards. Furthermore, even for moderately small n, the number of alternative cycles is prohibitively large for enumerative methods, and since the optimization problem is NP-complete, no efficient solution method is known to exist. If the "travel time" on the edges is random, this difficulty is exacerbated even further. The deterministic TSP is often used as a benchmark for combinatorial optimization algorithms, and in Shi, Ólafsson, and Sun (1999) a detailed deterministic NP algorithm is developed and shown to perform very competitively when applied to this problem. The results of this section demonstrate the performance of the stochastic NP method when applied to the stochastic TSP.
Assume that there are $n$ cities that must be visited and each city must be visited exactly once. Let $T_{ij}$ denote the random travel time between cities $i$ and $j$, that is, the length of the edge between nodes $i$ and $j$, for $i, j \in \{1, 2, \ldots, n\}$. We assume that

\[ T_{ij} = c_{ij} + X_{ij}, \tag{33} \]

where $X_{ij} \sim U(-a, a)$, $a > 0$, is a uniform random variable with mean zero, which implies that $E[T_{ij}] = c_{ij}$. We refer to the optimal tour as the tour that has the shortest average length, that is, if $y = (i_1, i_2, \ldots, i_n)$ is a feasible tour then the performance of this tour is

\[ J(i_1, i_2, \ldots, i_n) = \sum_{k=1}^{n-1} c_{i_k i_{k+1}} + c_{i_n i_1}. \]

We estimate this performance by averaging $r$ replications, that is,

\[ L_r(i_1, i_2, \ldots, i_n) = \frac{1}{r} \sum_{s=1}^{r} \left[ \sum_{k=1}^{n-1} \Big( c_{i_k i_{k+1}} + x^{(s)}_{i_k i_{k+1}} \Big) + \Big( c_{i_n i_1} + x^{(s)}_{i_n i_1} \Big) \right], \]

where $x^{(s)}_{ij}$ is the outcome of the $s$-th replication of the noise on the edge between cities $i$ and $j$. This estimate is unbiased since $E[L_r(i_1, i_2, \ldots, i_n)] = J(i_1, i_2, \ldots, i_n)$, and consistent since by the strong law of large numbers, $\lim_{r \to \infty} L_r(i_1, i_2, \ldots, i_n) = J(i_1, i_2, \ldots, i_n)$ with probability one.
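The estimator $L_r$ is straightforward to realize in code. A minimal sketch, where the cost matrix, the tour, and all function names are illustrative choices rather than anything taken from the paper:

```python
import random

# Sketch of the tour-performance estimator: travel times are
# T_ij = c_ij + X_ij with X_ij ~ U(-a, a), and L_r averages r independent
# replications of the noisy tour length.
def tour_length(tour, c):
    # deterministic performance J: sum of mean edge costs around the cycle
    n = len(tour)
    return sum(c[tour[k]][tour[(k + 1) % n]] for k in range(n))

def estimate_tour_length(tour, c, a, r, rng):
    # L_r: average of r replications, with fresh uniform noise on every edge
    n = len(tour)
    total = 0.0
    for _ in range(r):
        total += sum(
            c[tour[k]][tour[(k + 1) % n]] + rng.uniform(-a, a)
            for k in range(n)
        )
    return total / r

c = [[0, 4, 7],
     [4, 0, 3],
     [7, 3, 0]]
tour = [0, 1, 2]
true_len = tour_length(tour, c)  # 4 + 3 + 7 = 14
est = estimate_tour_length(tour, c, a=1.0, r=10000, rng=random.Random(0))
# consistent, by the strong law of large numbers
assert abs(est - true_len) < 0.1
```

Because each replication's noise has mean zero, the average over replications concentrates around the deterministic tour length $J$, exactly as the unbiasedness and consistency statements above require.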
As mentioned above, the details of how to implement the NP method for the deterministic TSP can be found in Shi, Ólafsson, and Sun (1999), and the only implementation differences between the deterministic and stochastic NP algorithms are the updating of the number of times singleton regions are visited and the use of estimation for the performance evaluation. We will therefore not repeat those implementation details here. This implementation, although specific to the TSP in that it exploits some of its special structure, can quite easily be generalized to other combinatorial optimization problems, as is illustrated by Ólafsson and Shi (2000). The TSP test problems employed are selected from the TSPLIB library of test problems (Reinelt, 1992), and are called eil51, eil76, and eil101, with 51, 76, and 101 cities, respectively. These problems are selected because they have known optimal solutions and are three difficult problems with similar structure but of different sizes, which allows us to evaluate how the method performs as the problem size increases. As the competitiveness of the NP method with respect to classical methods has already been established in the deterministic TSP context, the focus of our experiments is to evaluate the performance and robustness of the stochastic NP method with respect to the problem size and the variability in the performance estimates (sample performance). All of the experiments are performed on a Dell Dimension XPS R450 PC with 256MB RAM.
For these experiments we use two noise distributions, $X_{ij} \sim U(-1, 1)$ and $X_{ij} \sim U(-2, 2)$. This is compared to distances between nodes that range from 2 to 86 for eil51, from 2 to 85 for eil76, and from 1 to 92 for eil101. This is thus very significant noise, especially for shorter arcs. We perform a fixed number of 300 iterations of the NP algorithm for every experiment. Table 1 shows the results for eil51 and various numbers of replications r used for estimating the performance of each path. The numbers reported in the table are the average, standard error, minimum, and maximum percentage over the true optimum for 20 replications of the NP algorithm itself. As indicated by the upper half of the table, if $X_{ij} \sim U(-1, 1)$, then the number of replications used for estimation does not significantly affect the performance of the algorithm, and it consistently obtains tours that are within 3% of the optimal tour, and never worse than 6% above the length of the optimal tour. If the noise is greater, that is, $X_{ij} \sim U(-2, 2)$, then the NP algorithm can still find tours of comparable quality; however, the performance of the algorithm is now more sensitive to the number of replications. Indeed, as might be expected, the fewer the replications the worse the performance of the algorithm. The interpretation is that if the noise is relatively high, then too few replications do not assure that the estimated promising index is sufficiently representative of the region. The same pattern can be seen for the average, the best tour, and the worst tour.

Table 1. Percentage over optimum for eil51.

                                          Number of replications (r)
Noise        Measure                         1       5      10      25
U(-1, 1)     Average tour length          2.54    2.91    2.84    2.77
             Standard error (S.E.)        1.04    1.07    0.96    1.09
             Best tour length             1.17    1.41    0.94    1.41
             Worst tour length            5.40    5.87    5.16    5.40
             CPU time (seconds)             92      86      95      94
U(-2, 2)     Average tour length          3.58    3.12    2.86    2.80
             Standard error (S.E.)        1.85    1.28    1.36    0.80
             Best tour length             1.17    1.17    0.70    1.41
             Worst tour length            7.98    5.16    6.57    4.23
             CPU time (seconds)             91      97      88     100
Table 2 compares the results for the different problem sizes, that is, the 51 city, 76 city, and 101 city problems. These results are based on 80 replications of the algorithm and a noise distribution $X_{ij} \sim U(-1, 1)$. Although the performance degenerates somewhat as the problem size increases, for example the average percentage over optimum is 2.77% for eil51 but 5.38% for eil101, this is rather small when we consider that the number of possible tours is of the order $10^{65}$ for eil51 but $10^{159}$ for eil101. Also, the same pattern can be seen for the best and worst tour found. Thus, the stochastic NP method appears to perform relatively well for large problems.

Table 2. Percentage over optimum for 51, 76, and 101 cities.

          Average   S.E.   Minimum   Maximum   CPU Time
eil51        2.77   1.03      0.94      5.87         92
eil76        3.12   0.77      1.12      5.39        575
eil101       5.38   1.34      1.75      8.27       1967

Finally, we focus on the variability in the performance estimates; Table 3 reports the summary results of these experiments. This table compares $X_{ij} \sim U(-1, 1)$ and $X_{ij} \sim U(-2, 2)$ for all three problems and is again based on 20 replications. As is to be expected, the performance of the algorithm degenerates as the noise increases, although this change is rather slight and it is less for the larger problems, 7.4% for eil101 versus 11.6% for eil51. Thus, again these results indicate that the NP algorithm performs rather well for large problems, and that it can obtain good solutions despite significant noise.

Table 3. Percentage over optimum for variable noise in performance estimates.

                  eil51               eil76               eil101
           Average  CPU Time   Average  CPU Time   Average  CPU Time
U(-1, 1)      2.77        92      3.12       575      5.38      1967
U(-2, 2)      3.09        94      3.43       571      5.78      1987
Change       11.6%                 9.9%                7.4%
5. Discussion and Future Research
We have adapted the nested partitions (NP) method of Shi and Ólafsson (2000) to stochastic optimization. The basic idea behind the NP method is simple. It samples from the entire feasible region in each iteration, but concentrates the sampling effort by iteratively partitioning the part of the feasible region that is considered the most likely to contain the global optimum. Each partitioning is hence nested within the last, eventually reducing the most promising region to a singleton. The partitioning method, and a backtracking feature that makes the sampling less concentrated, are part of a global search phase. However, local search methods can also be incorporated into the method by using them to help decide which region should be partitioned next, and hence where the sampling should be concentrated.
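The iteration just described can be sketched on a toy problem: minimizing the number of ones in a bit string, with regions taken to be bit-prefixes. The two-way partitioning scheme, the sample sizes, and the crude stand-in used for the surrounding region below are illustrative choices, not the implementation of the paper:

```python
import random

# A toy sketch of the four NP steps (partition, sample, estimate a promising
# index, move or backtrack) for minimizing f over bit strings of length n.
def np_search(f, n, iterations, samples=10, rng=None):
    rng = rng or random.Random()
    prefix = []                                # current most promising region
    best_x, best_val = None, float("inf")
    for _ in range(iterations):
        if len(prefix) < n:
            # Step 1: partition the most promising region into two subregions
            candidates = [prefix + [0], prefix + [1]]
        else:
            candidates = [prefix]              # maximum depth: a singleton
        if prefix:
            # stand-in for the surrounding region: flip the first fixed bit
            candidates.append([1 - prefix[0]])
        scores = []
        for region in candidates:
            # Steps 2-3: sample uniformly from each region and estimate its
            # promising index by the best sampled objective value
            vals = []
            for _ in range(samples):
                x = region + [rng.randint(0, 1) for _ in range(n - len(region))]
                v = f(x)
                vals.append(v)
                if v < best_val:
                    best_x, best_val = x, v
            scores.append(min(vals))
        # Step 4: move to the best subregion, or backtrack to the root if the
        # surrounding region has the best estimated promising index
        winner = candidates[scores.index(min(scores))]
        if prefix and winner == candidates[-1]:
            prefix = []
        else:
            prefix = winner
    return best_x, best_val

# toy objective: the number of ones; the global optimum is the all-zero string
x, v = np_search(lambda x: sum(x), n=12, iterations=60, rng=random.Random(1))
assert len(x) == 12 and v == sum(x)
```

The sketch shows how the nesting concentrates sampling: once a prefix is fixed, all sampled points share it, and backtracking is the only way the concentration is undone.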
It is easy to see that the partitioning in the NP method plays an important role. It is not assumed that the feasible region has any particular structure, but the partitioning implicitly imposes a structure on the feasible region. If a rich underlying structure is available, that should of course be utilized. The convergence and efficiency of the NP method rest on the partitioning being performed such that good solutions are clustered together in the same subregions. If this is accomplished, the NP method rapidly concentrates the search in those regions and converges quickly. We showed that under certain conditions the NP method converges almost surely to a global optimum. However, such asymptotic results do not provide information concerning how long the algorithm must be applied to achieve this convergence. Therefore, an important future research topic is to investigate bounds on the rate of convergence and how such bounds can be used to derive stopping criteria for the NP method (Ólafsson and Shi, 1998).

Acknowledgments
The authors would like to thank the associate editor and anonymous referee for comments and suggestions that greatly improved the content and presentation of this paper. The authors would also like to acknowledge the assistance of Mr. Yunpeng Pan in generating the numerical results presented in this paper.
References
S. Andradóttir, "A method for discrete stochastic optimization," Management Science vol. 41, pp. 1946–1961, 1995.
S. Andradóttir, "A global search method for discrete stochastic optimization," SIAM Journal on Optimization vol. 6, pp. 513–530, 1996.
D. A. Berry and B. Fristedt, Bandit Problems, Chapman and Hall: London, 1985.
R. E. Bechhofer, T. J. Santner, and D. Goldsman, Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons, John Wiley: New York, 1995.
L. Dai, "Convergence properties of ordinal comparison in the simulation of discrete event dynamic systems," Journal of Optimization Theory and Applications vol. 91, pp. 363–388, 1996.
M. C. Fu and J. Q. Hu, Conditional Monte Carlo: Gradient Estimation and Optimization Applications, Kluwer Academic Publishers: Boston, MA, 1997.
S. B. Gelfand and S. K. Mitter, "Simulated annealing with noisy or imprecise energy measurements," Journal of Optimization Theory and Applications vol. 62, pp. 49–62, 1989.
W. B. Gong, Y. C. Ho, and W. Zhai, "Stochastic comparison algorithm for discrete optimization with estimation," Proceedings of the 31st IEEE Conference on Decision and Control, pp. 795–800, 1992.
P. Glasserman, Gradient Estimation Techniques for Discrete Event Systems, Kluwer: Boston, MA, 1991.
P. W. Glynn, "Likelihood ratio gradient estimation for stochastic systems," Communications of the ACM vol. 33, pp. 75–84, 1990.
W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika vol. 57, pp. 97–109, 1970.
Y. C. Ho and X. R. Cao, Discrete Event Systems and Perturbation Analysis, Kluwer: Boston, MA, 1991.
Y. C. Ho, R. S. Sreenivas, and P. Vakili, "Ordinal optimization of DEDS," Discrete Event Dynamic Systems: Theory and Applications vol. 2, pp. 61–88, 1992.
S. Kirkpatrick, C. Gelatt, and M. Vecchi, "Optimization by simulated annealing," Science vol. 220, pp. 671–680, 1983.
J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Annals of Mathematical Statistics vol. 23, pp. 462–466, 1952.
P. L'Ecuyer and P. W. Glynn, "Stochastic optimization by simulation: convergence proofs for the GI/G/1 queue in steady-state," Management Science vol. 40, pp. 1562–1578, 1994.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics vol. 21, pp. 1087–1092, 1953.
K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction, Prentice Hall: Englewood Cliffs, 1989.
V. I. Norkin, G. C. Pflug, and A. Ruszczyński, "A branch and bound method for stochastic global optimization," Mathematical Programming vol. 83, pp. 425–450, 1998.
S. Ólafsson and L. Shi, "Stopping criteria for a simulation-based optimization method," in D. J. Medeiros, E. F. Watson, J. S. Carson, and M. S. Manivannan (eds.), Proceedings of the 1998 Winter Simulation Conference, pp. 743–750, 1998.
S. Ólafsson and L. Shi, "A method for scheduling in parallel manufacturing systems with flexible resources," IIE Transactions vol. 32, pp. 135–146, 2000.
G. Reinelt, "TSPLIB – A traveling salesman problem library," ORSA Journal on Computing vol. 3, pp. 376–384, 1992.
H. Robbins and S. Monro, "A stochastic approximation method," Annals of Mathematical Statistics vol. 22, pp. 400–407, 1951.
S. M. Robinson, "Analysis of sample-path optimization," Mathematics of Operations Research vol. 21, pp. 513–528, 1996.
R. Y. Rubinstein, "How to optimize discrete-event systems from a single sample path by the score function method," Annals of Operations Research vol. 27, pp. 175–212, 1991.
R. Y. Rubinstein and A. Shapiro, "Optimization of simulation models by the score function method," Mathematics and Computers in Simulation vol. 32, pp. 373–392, 1990.
R. Y. Rubinstein and A. Shapiro, Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method, Wiley: Chichester, England, 1993.
R. Y. Rubinstein, "The cross-entropy method for combinatorial and continuous optimization," Methodology and Computing in Applied Probability vol. 1, pp. 127–190, 1999.
A. Shapiro, "Simulation-based optimization – convergence analysis and statistical inference," Communications in Statistics – Stochastic Models vol. 12, pp. 425–454, 1996.
L. Shi and S. Ólafsson, "An integrated framework for deterministic and stochastic optimization," in S. Andradóttir, K. J. Healy, D. H. Withers, and B. L. Nelson (eds.), Proceedings of the 1997 Winter Simulation Conference, pp. 358–365, 1997.
L. Shi and S. Ólafsson, "Nested partitions method for global optimization," Operations Research vol. 48, pp. 390–407, 2000.
L. Shi, S. Ólafsson, and N. Sun, "New parallel randomized algorithms for the traveling salesman problem," Computers & Operations Research vol. 26, pp. 371–394, 1999.
Z. B. Tang, "Adaptive partitioned random search to global optimization," IEEE Transactions on Automatic Control vol. 39, pp. 2235–2244, 1994.
R. W. Wolff, Stochastic Modeling and the Theory of Queues, Prentice Hall: Englewood Cliffs, NJ, 1989.
S. Yakowitz, P. L'Ecuyer, and F. Vázquez-Abad, "Global stochastic optimization with low-dispersion point sets," Operations Research, in press, 1999.
D. Yan and H. Mukai, "Stochastic discrete optimization," SIAM Journal on Control and Optimization vol. 30, pp. 594–612, 1992.