8.2 Simulated Annealing-Based Global Search

The success of global search methods in locating a globally optimal solution (say, a global minimum) of a given function y(x) over x   hinges on a balance between an exploration process, a guidance process, and a convergence-inducing process. The exploration process gives the search a mechanism of sampling a sufficiently diverse set of points x in . This exploration process is usually stochastic in nature. The guidance process is an explicit or implicit process which evaluates the relative "quality" of search points (e.g., two consecutive search points) and biases the exploration process to move towards regions of high quality solutions in . Finally, the convergence-inducing process assures the ultimate convergence of the search to a fixed solution x*. The dynamic interaction among the above three processes is thus responsible for giving the search process its global optimizing character. As an exercise, one might consider identifying the above three processes in the global optimization method of stochastic gradient descent presented in the previous section. Here, the exploration process is realized by the noise term in Equation (8.1.4). The convergence-inducing process is realized effectively by the noise amplitude schedule c(t) and by the gradient term y(x). On the other hand, the search guidance process is not readily identifiable. In fact, this method lacks an effective guidance process and the only guidance available is the local guidance due to the gradient term. Note that gradient-based guidance is only effective when the function y(x) being minimized has its global minimum (or a near optimal minimum) located at the bottom of a wide "valley" relative to other shallow local minima. A good example of such a function is y(x) = x sin.

Another stochastic method for global optimization is (stochastic) simulated annealing (Kirkpatrick et al., 1983; Kirkpatrick, 1984; Aart and Korst, 1989). This method does not use gradient information explicitly and thus is applicable to a wider range of functions (specifically, functions whose gradients are expensive to compute or functions which are not differentiable) compared to the stochastic gradient method.

Simulated annealing is analogous to the physical behavior of annealing a molten metal (Metropolis et al, 1953; Kirkpatrick et al.,1983). Above its melting temperature, a metal enters a phase where atoms (particles) are positioned at random according to statistical mechanics. As with all physical systems, the particles of the molten metal seek minimum energy configurations (states) if allowed to cool. A minimum energy configuration means a highly ordered state such as a defect-free crystal lattice. In order to achieve the defect-free crystal, the metal is annealed: First, the metal is heated above its melting point and then cooled slowly until the metal solidifies into a "perfect" crystalline structure. Slow cooling (as opposed to quenching) is necessary to prevent dislocations of atoms and other crystal lattice disruption. The defect-free crystal state corresponds to the global minimum energy configuration.

Next, a brief presentation of related statistical mechanics concepts is given which will help us understand the underlying principles of the simulated annealing global optimization method. Statistical mechanics is the central discipline of condensed matter physics which deals with the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature [for a concise presentation of the basic ideas of statistical mechanics, see Shrödinger (1946)]. The starting point of statistical mechanics is an energy function E(x) which measures the thermal energy of a physical system in a given state x, where x belongs to a set of possible states . If the system's absolute temperature T is not zero (i.e., T > 0), then the state x will vary in time causing E to fluctuate. Being a physical system, the system will evolve its state in an average direction corresponding to that of decreasing energy E. This continues until no further decrease in the average of E is possible, which indicates that the system has reached thermal equilibrium. A fundamental result from physics is that at thermal equilibrium each of the possible states x occurs with probability


where k is Boltzmann's constant and the denominator is a constant which restricts P(x) between zero and one. Equation (8.2.1) is known as the Boltzmann-Gibbs distribution.

Now define a set of transition probabilities W(x  x') from a state x into x'. What is the condition on so that the system may reach and then remain in thermal equilibrium? A sufficient condition for maintaining equilibrium is that the average number of transitions from x to x' and from x' to x be equal:

P(xW(x  x') = P(x') W(x'  x) (8.2.2)

or, by dividing by W(x  x') and using Equation (8.2.1)


where E = E(x') - E(x). In simulating physical systems, a common choice is the Metropolis state transition probability (Metropolis et al., 1953) where


This has the advantage of making more transitions to lower energy states than those in Equation (8.2.3), and therefore reaches equilibrium more rapidly. Note that transitions from low to high energy states are possible except when T = 0.

Simulated annealing optimization for finding global minima of a function y(x) borrows from the above theory. Two operations are involved in simulated annealing: a thermostatic operation which schedules decreases in the "temperature" and a stochastic relaxation operation which iteratively finds the equilibrium solution at the new temperature using the final state of the system at the previous temperature as a starting point. Here, a function y of discrete or continuous variables can be thought of as the energy function E in the above analysis. Simulated annealing introduces artificial thermal noise which is gradually decreased in time. This noise is controlled by a new parameter T which replaces the constant kT in Equation (8.2.4). Noise allows occasional hill-climbing interspersed with descents. The idea is to apply uniform random perturbations x to the search point x and then determine the resulting change y = y(x + x- y(x). If the value of y is reduced (i.e., y < 0) the new search point x' = x + x is adopted. On the other hand, if the perturbation leads to an increase in y (i.e., y > 0) the new search point x' may or may not be adopted. In this case, the determination of whether to accept x' or not is stochastic, with probability . Hence, for large values of T, the probability of an uphill move in y is large. However, for small T the probability of an uphill move is low; i.e., as T decreases fewer uphill moves are allowed. This leads to an effective guidance of search since the uphill moves are done in a controlled fashion, so there is no danger of jumping out of a local minimum and falling into a worse one.

The following is a step-by-step statement of a general purpose simulated annealing optimization algorithm for finding the global minimum of a multi-variate function y(x), x  .

1. Initialize x to an arbitrary point in . Choose a "cooling" schedule for T. Initialize T at a sufficiently large value.

2. Compute x' = x + x, where x is a small uniform random perturbation.

3. Compute y = y(x') - y(x).

4. Use the Metropolis transition rule for deciding whether to accept x' as the new search point (or else remain at x). That is if y < 0, the search point becomes x', otherwise accept x' as the new point with a transition probability W(x  x') = . For this purpose select a uniform random number a between zero and one. If W(x  x') > a then x' becomes the new search point, otherwise the search point remains at x.

5. Repeat steps 2 through 4 until the system reaches an equilibrium. Equilibrium is reached when the number of accepted transitions becomes insignificant which happens when the search point is at or very close to a local minimum. In practice, steps 2 through 4 may be repeated for a fixed prespecified number of cycles.

6. Update T according to the annealing schedule chosen in step 1 and repeat steps 2 through 5. Stop when T reaches zero.

The effectiveness of simulated annealing in locating global minima and its speed of convergence critically depends on the choice of the cooling schedule for T. Generally speaking, if the cooling schedule is too fast, a premature convergence to a local minimum might occur, and if it is too slow, the algorithm will require an excessive amount of computation time to converge. Unfortunately, it has been found theoretically (Geman and Geman, 1984) that T must be reduced very slowly in proportion to the inverse log of time (processing cycle)


to guarantee that the simulated annealing search converges almost always to the global minimum. The problem of accelerating simulated annealing search has received increased attention, and a number of methods have been proposed to accelerate the search (e.g., Szu, 1986; Salamon et al., 1988). In practice, a suboptimal solution is sometimes sufficient and faster cooling schedules may be employed. For example, one may even try a schedule of the form T(t) =  T(t - 1) where 0.85    0.98, which reduces the temperature exponentially fast.

Because of its generality as a global optimization method, simulated annealing has been applied to many optimization problems. Example applications can be found in Geman and Geman (1984), Sontag and Sussann (1985), Ligthart et al. (1986), Romeo (1989), Rutenbar (1989), and Johnson et al. (1989; 1991).

By now the reader might be wondering about the applicability of simulating annealing to neural networks. As is shown in the next section, it turns out that simulated annealing can be naturally mapped onto recurrent neural networks with stochastic units, for the purpose of global retrieval and/or optimal learning. Simulated annealing may also be easily applied to the training of deterministic multilayer feedforward nets. Here, one can simply interpret the error function E(w) [e.g., the SSE function in Equation (5.1.13) or the "batch" version of the entropy function in Equation (5.2.16)] as the multi-variate scalar function y(x) in the above algorithm. However, because of its intrinsic slow search speed simulated annealing should only be considered, in the training of deterministic multilayer nets, if a global or near global solution is desired and if one suspects that E is a complex multimodal function.

Back to the Table of Contents

Back to Main Menu