The success of global search methods in locating
a globally optimal solution (say, a global minimum) of a given
function y(x) over x hinges on
a balance between an exploration process, a guidance process,
and a convergence-inducing process. The exploration process gives
the search a mechanism of sampling a sufficiently diverse set
of points x in . This exploration process is usually stochastic
in nature. The guidance process is an explicit or implicit process
which evaluates the relative "quality" of search points
(e.g., two consecutive search points) and biases the exploration
process to move towards regions of high quality solutions in .
Finally, the convergence-inducing process assures the ultimate
convergence of the search to a fixed solution x*.
The dynamic interaction among the above three processes is thus
responsible for giving the search process its global optimizing
character. As an exercise, one might consider identifying the
above three processes in the global optimization method of stochastic
gradient descent presented in the previous section. Here, the
exploration process is realized by the noise term in Equation
(8.1.4). The convergence-inducing process is realized effectively
by the noise amplitude schedule c(t) and by the
gradient term y(x). On the other hand, the search
guidance process is not readily identifiable. In fact, this method
lacks an effective guidance process and the only guidance available
is the local guidance due to the gradient term. Note that gradient-based
guidance is only effective when the function y(x)
being minimized has its global minimum (or a near optimal minimum)
located at the bottom of a wide "valley" relative to
other shallow local minima. A good example of such a function
is y(x) = x sin
.
Another stochastic method for global optimization
is (stochastic) simulated annealing (Kirkpatrick et al., 1983;
Kirkpatrick, 1984; Aart and Korst, 1989). This method does not
use gradient information explicitly and thus is applicable to
a wider range of functions (specifically, functions whose gradients
are expensive to compute or functions which are not differentiable)
compared to the stochastic gradient method.
Simulated annealing is analogous to the physical
behavior of annealing a molten metal (Metropolis et al, 1953;
Kirkpatrick et al.,1983). Above its melting temperature, a metal
enters a phase where atoms (particles) are positioned at random
according to statistical mechanics. As with all physical systems,
the particles of the molten metal seek minimum energy configurations
(states) if allowed to cool. A minimum energy configuration means
a highly ordered state such as a defect-free crystal lattice.
In order to achieve the defect-free crystal, the metal is annealed:
First, the metal is heated above its melting point and then cooled
slowly until the metal solidifies into a "perfect" crystalline
structure. Slow cooling (as opposed to quenching) is necessary
to prevent dislocations of atoms and other crystal lattice disruption.
The defect-free crystal state corresponds to the global minimum
energy configuration.
Next, a brief presentation of related statistical
mechanics concepts is given which will help us understand the
underlying principles of the simulated annealing global optimization
method. Statistical mechanics is the central discipline of condensed
matter physics which deals with the behavior of systems with many
degrees of freedom in thermal equilibrium at a finite temperature
[for a concise presentation of the basic ideas of statistical
mechanics, see Shrödinger (1946)]. The starting point of
statistical mechanics is an energy function E(x)
which measures the thermal energy of a physical system in a given
state x, where x belongs to a set of possible states
. If the system's absolute temperature T is not zero (i.e.,
T > 0), then the state x will vary
in time causing E to fluctuate. Being a physical system,
the system will evolve its state in an average direction corresponding
to that of decreasing energy E. This continues until no
further decrease in the average of E is possible, which
indicates that the system has reached thermal equilibrium. A
fundamental result from physics is that at thermal equilibrium
each of the possible states x occurs with probability
(8.2.1)
where k is Boltzmann's constant and the denominator
is a constant which restricts P(x) between zero
and one. Equation (8.2.1) is known as the Boltzmann-Gibbs distribution.
Now define a set of transition probabilities W(x x')
from a state x into x'. What is the condition on
so that the system may reach and then
remain in thermal equilibrium? A sufficient condition for maintaining
equilibrium is that the average number of transitions from x
to x' and from x' to x be equal:
P(x) W(x x') = P(x') W(x' x)
(8.2.2)
or, by dividing by W(x x')
and using Equation (8.2.1)
(8.2.3)
where E = E(x') - E(x).
In simulating physical systems, a common choice is the Metropolis
state transition probability (Metropolis et al., 1953) where
(8.2.4)
This has the advantage of making more transitions
to lower energy states than those in Equation (8.2.3), and therefore
reaches equilibrium more rapidly. Note that transitions from
low to high energy states are possible except when T = 0.
Simulated annealing optimization for finding global
minima of a function y(x) borrows from the above
theory. Two operations are involved in simulated annealing: a
thermostatic operation which schedules decreases in the "temperature"
and a stochastic relaxation operation which iteratively finds
the equilibrium solution at the new temperature using the final
state of the system at the previous temperature as a starting
point. Here, a function y of discrete or continuous variables
can be thought of as the energy function E in the above
analysis. Simulated annealing introduces artificial thermal noise
which is gradually decreased in time. This noise is controlled
by a new parameter T which replaces the constant kT
in Equation (8.2.4). Noise allows occasional hill-climbing interspersed
with descents. The idea is to apply uniform random perturbations
x to the search point x and then determine the resulting
change y = y(x + x) - y(x).
If the value of y is reduced (i.e., y < 0)
the new search point x' = x + x
is adopted. On the other hand, if the perturbation leads to an
increase in y (i.e., y > 0) the new
search point x' may or may not be adopted. In this case,
the determination of whether to accept x' or not is stochastic,
with probability
. Hence, for large values
of T, the probability of an uphill move in y is
large. However, for small T the probability of an uphill
move is low; i.e., as T decreases fewer uphill moves are
allowed. This leads to an effective guidance of search since
the uphill moves are done in a controlled fashion, so there is
no danger of jumping out of a local minimum and falling into a
worse one.
The following is a step-by-step statement of a general
purpose simulated annealing optimization algorithm for finding
the global minimum of a multi-variate function y(x),
x .
1. Initialize x to an arbitrary point in . Choose a "cooling" schedule for T. Initialize T at a sufficiently large value.
2. Compute x' = x + x, where x is a small uniform random perturbation.
3. Compute y = y(x') - y(x).
4. Use the Metropolis transition rule for deciding
whether to accept x' as the new search point (or else remain
at x). That is if y < 0, the search
point becomes x', otherwise accept x' as the new
point with a transition probability W(x x') =
.
For this purpose select a uniform random number a between
zero and one. If W(x x') > a
then x' becomes the new search point, otherwise the search
point remains at x.
5. Repeat steps 2 through 4 until the system reaches an equilibrium. Equilibrium is reached when the number of accepted transitions becomes insignificant which happens when the search point is at or very close to a local minimum. In practice, steps 2 through 4 may be repeated for a fixed prespecified number of cycles.
6. Update T according to the annealing schedule
chosen in step 1 and repeat steps 2 through 5. Stop when T
reaches zero.
The effectiveness of simulated annealing in locating
global minima and its speed of convergence critically depends
on the choice of the cooling schedule for T. Generally
speaking, if the cooling schedule is too fast, a premature convergence
to a local minimum might occur, and if it is too slow, the algorithm
will require an excessive amount of computation time to converge.
Unfortunately, it has been found theoretically (Geman and Geman,
1984) that T must be reduced very slowly in proportion
to the inverse log of time (processing cycle)
(8.2.5)
to guarantee that the simulated annealing search
converges almost always to the global minimum. The problem of
accelerating simulated annealing search has received increased
attention, and a number of methods have been proposed to accelerate
the search (e.g., Szu, 1986; Salamon et al., 1988). In practice,
a suboptimal solution is sometimes sufficient and faster cooling
schedules may be employed. For example, one may even try a schedule
of the form T(t) = T(t - 1)
where 0.85 0.98, which reduces the temperature
exponentially fast.
Because of its generality as a global optimization
method, simulated annealing has been applied to many optimization
problems. Example applications can be found in Geman and Geman
(1984), Sontag and Sussann (1985), Ligthart et al. (1986), Romeo
(1989), Rutenbar (1989), and Johnson et al. (1989; 1991).
By now the reader might be wondering about the applicability
of simulating annealing to neural networks. As is shown in the
next section, it turns out that simulated annealing can be naturally
mapped onto recurrent neural networks with stochastic units, for
the purpose of global retrieval and/or optimal learning. Simulated
annealing may also be easily applied to the training of deterministic
multilayer feedforward nets. Here, one can simply interpret the
error function E(w) [e.g., the SSE function in Equation
(5.1.13) or the "batch" version of the entropy function
in Equation (5.2.16)] as the multi-variate scalar function y(x)
in the above algorithm. However, because of its intrinsic slow
search speed simulated annealing should only be considered, in
the training of deterministic multilayer nets, if a global or
near global solution is desired and if one suspects that E
is a complex multimodal function.