This chapter discusses probabilistic global search
methods which are suited for neural network optimization. Global
search methods, as opposed to deterministic gradient-based search
methods, must be used in optimization problems where reaching
the global minimum (or maximum) is at a premium. However, the
price one pays for using global search methods is increased computational
and/or storage requirements compared to local search. The intrinsic
slowness of global search methods is mainly due to the slow but crucial
exploration mechanisms they employ.

Global search methods may be used as optimal learning
algorithms for neural networks. Some global search methods may
also be mapped onto recurrent neural networks such that the retrieval
dynamics of these networks escape local minima and evolve toward
the global minimum.

Two major probabilistic global search methods are
covered in this chapter. The first is stochastic simulated
annealing, which is motivated by statistical mechanics; the
second is genetic algorithms, motivated by the
mechanics of natural selection and natural genetics. The exploration
mechanism in simulated annealing is governed by the Boltzmann-Gibbs
probability distribution, and its convergence is determined by
a "cooling" schedule of slowly decreasing "temperature."
This method is especially appealing since it can be naturally
implemented by a stochastic neural network known as the Boltzmann
machine. The Boltzmann machine is a stochastic version of Hopfield's
energy minimizing net which is capable of almost guaranteed convergence
to the global minimum of an arbitrary bounded quadratic energy
function. Simulated annealing is also applied to optimal weight
learning in generally interconnected multilayer Boltzmann machines,
thus extending the applicability of the Boltzmann machine from
combinatorial optimization to optimal supervised learning of complex
binary mappings. However, these desirable features of Boltzmann
machines come with slow learning and/or retrieval.
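The annealing procedure just described can be sketched in a few lines of code. The energy function, neighbor move, and geometric cooling schedule below are illustrative choices rather than the specific schedule of Equation (8.2.5):

```python
import math
import random

def simulated_annealing(energy, neighbor, x0, t0=10.0, alpha=0.95,
                        steps_per_t=100, t_min=1e-3):
    """Minimize `energy` by Metropolis sampling under geometric cooling."""
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    while t > t_min:
        for _ in range(steps_per_t):
            x_new = neighbor(x)
            e_new = energy(x_new)
            # Downhill moves are always accepted; uphill moves are accepted
            # with the Boltzmann-Gibbs probability exp(-dE/T).
            if e_new <= e or random.random() < math.exp(-(e_new - e) / t):
                x, e = x_new, e_new
                if e < best_e:
                    best_x, best_e = x, e
        t *= alpha  # slowly decreasing "temperature"
    return best_x, best_e
```

At high temperature the walk explores freely (the slow but crucial exploration phase); as the temperature falls, the acceptance rule increasingly resembles greedy descent.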

Mean-field annealing is a deterministic approximation
(based on mean-field theory) to stochastic simulated annealing,
in which the mean behavior of the stochastic state transitions is
used to characterize the Boltzmann machine. This approximation
is found to preserve the optimal characteristics of the Boltzmann
machine but with a one to two order of magnitude speed advantage.
Mean-field annealing is applied in the context of retrieval dynamics
and weight learning in a Boltzmann machine. It is interesting
to see that applying mean-field theory to a single-layer Boltzmann
machine leads to the deterministic continuous Hopfield net.
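The idea can be sketched as follows: each stochastic binary unit is replaced by its mean activation, which at pseudo-temperature *T* satisfies the self-consistency relation *v* = tanh(*h*/*T*), and *T* is annealed downward. The symmetric-weight network and parameter values here are illustrative:

```python
import math

def mean_field_anneal(W, b, t0=5.0, alpha=0.9, t_min=0.05, sweeps=50):
    """Deterministic mean-field approximation to a Boltzmann machine:
    each stochastic unit's state is replaced by its thermal average
    v_i = tanh(h_i / T), with T annealed downward."""
    n = len(b)
    v = [0.0] * n  # high-temperature starting point
    t = t0
    while t > t_min:
        for _ in range(sweeps):
            for i in range(n):
                h = b[i] + sum(W[i][j] * v[j] for j in range(n))
                v[i] = math.tanh(h / t)
        t *= alpha
    return v
```

At a fixed low temperature the inner update is exactly the deterministic continuous Hopfield dynamics mentioned above, with tanh playing the role of the sigmoidal activation.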

Genetic algorithms (GA's) are introduced as another
method for optimal neural network design. They employ a parallel
multipoint probabilistic search strategy which is biased towards
reinforcing search points of high fitness. The most distinguishing
feature of GA's is their flexibility and applicability to a wide
range of optimization problems. In the domain of neural networks,
GA's are useful as global search methods for synthesizing the
weights of generally interconnected networks, optimal network
architectures and learning parameters, and optimal learning rules.
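A minimal version of such a GA is sketched below: fitness-proportional (roulette) selection, single-point crossover, and bitwise mutation. The parameter values and the bit-counting fitness used in the test are illustrative choices, not the chapter's examples:

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, p_c=0.85, p_m=0.02,
                      generations=60):
    """Standard GA over binary strings. `fitness` must be nonnegative
    (roulette selection is proportional to fitness)."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scores = [fitness(s) for s in pop]
        total = sum(scores)
        def select():
            # Roulette-wheel selection: probability proportional to fitness.
            r = random.uniform(0, total)
            acc = 0.0
            for s, f in zip(pop, scores):
                acc += f
                if acc >= r:
                    return s
            return pop[-1]
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = select()[:], select()[:]
            if random.random() < p_c:  # single-point crossover
                cut = random.randint(1, n_bits - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                child[:] = [bit ^ (random.random() < p_m) for bit in child]
                next_pop.append(child)
        pop = next_pop[:pop_size]
        best = max(pop + [best], key=fitness)
    return best
```

The bias toward reinforcing high-fitness search points enters only through the selection step; crossover and mutation supply the exploration.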

It is argued that, in order to make global search methods more speed-efficient, local gradient information (if available) could be used advantageously. In the context of GA optimization, it is possible to think of the GA as an evolutionary mechanism which could be accelerated by simple learning processes. This observation motivates the hybrid GA/gradient descent method for feedforward multilayer net training introduced at the end of the chapter.

† **8.1.1** Plot and
find analytically the global minima of the following functions:

a. *y*(*x*) = *x* sin(1/*x*);
*x* ∈ [0.05, 0.5]

b. *y*(*x*) = (*x*² + 2*x*) cos(*x*);
|*x*| < 5

c. *y*(*x*) = 10 sin²(*x*) + 0.2 (*x* + 3)²;
|*x*| < 20

d. *y*(*x*) = *x*⁶ - 15*x*⁴ + 27*x*² + 250;
|*x*| < 3.7

e. ; |*x*₁| < 1.5 and |*x*₂| < 1.5

† **8.1.2** Plot and
count the number of minima, maxima, and saddle points for the
following functions:

† **8.1.3** Employ the
gradient descent search rule given by

*x*(*t* + 1) = *x*(*t*) − η *y*′(*x*(*t*))

to find a minimum of the function *y*(*x*) = *x* sin(1/*x*)
starting from the initial search point: a. *x*(0) = 0.05;
b. *x*(0) = 0.1; c. *x*(0) = 0.15.
Assume η = 10⁻⁴.
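A minimal implementation of this descent rule is sketched below. The sine's argument is garbled in this copy; the code assumes the intended objective of Problem 8.1.1 (a) is *y*(*x*) = *x* sin(1/*x*), whose derivative is sin(1/*x*) − cos(1/*x*)/*x*:

```python
import math

def grad_descent(dy, x0, eta=1e-4, iters=10000):
    """Plain gradient descent: x(t+1) = x(t) - eta * y'(x(t))."""
    x = x0
    for _ in range(iters):
        x -= eta * dy(x)
    return x

# Assumed objective y(x) = x * sin(1/x); its derivative:
dy = lambda x: math.sin(1.0 / x) - math.cos(1.0 / x) / x
```

Starting from *x*(0) = 0.1, the iterate settles into the nearby local minimum (near *x* ≈ 0.092) and stays there; the inability to escape such a basin is exactly what motivates the global methods of this chapter.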

† **8.1.4** Repeat Problem
8.1.3 for the function in Problem 8.1.1 (c) with *x*(0) = -20
and η = 0.01.

† **8.1.5** Employ the
gradient descent/ascent global search strategy described in Section
8.1 to find the global minima of the functions (a) - (d) in Problem
8.1.1.

† **8.1.6** "Global
descent" is a global search method which was discussed in
Section 5.1.2. Implement the global descent method to search
for the global minimum of the function *y*(*x*) given
in Problem 8.1.1 (c). Assume , = 0.005, *x* = 0.01,
= 2, and experiment with different values of the repeller
parameter . Plot *y*[*x*(*t*)] versus *t*
for various values of *k*.

† **8.1.7** For the uni-variate
case, *y* = *y*(*x*), the global minimum
can be reached by gradient descent on the noisy function

ŷ(*x*, *t*) = *y*(*x*) + *c*(*t*)*N*(*t*)*x*

where *N*(*t*) is a "noise signal"
and *c*(*t*) is a parameter which controls the magnitude
of the noise. Apply the gradient descent rule (with η = 0.01)
of Problem 8.1.3 to ŷ. Use the resulting "stochastic"
gradient rule to find the global minimum of *y*(*x*)
in Problem 8.1.1 (c). Assume that *N*(*t*) is a normally
distributed sequence with a mean of zero and a variance of one,
and that *c*(*t*) = 150 exp(-*t*).
Start from *x*(0) = -20
and experiment with different values of η ∈ [0.001, 0.01].
Plot *x*(*t*) versus *t* (*t* = 0, 1, 2, ...).
What range of values of η is likely to lead to the global minimum
of *y*(*x*)?
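One possible reading of the resulting stochastic rule is sketched below. Both the form of the noisy objective (taken here so that the perturbed gradient is *y*′(*x*) + *c*(*t*)*N*(*t*)) and the Problem 8.1.1 (c) function (taken as *y*(*x*) = 10 sin²(*x*) + 0.2(*x* + 3)²) are assumptions, since both are garbled in this copy:

```python
import math
import random

def noisy_grad_descent(dy, x0, eta=0.01, iters=5000):
    """Stochastic gradient rule: x <- x - eta * (y'(x) + c(t) * N(t)),
    with c(t) = 150 * exp(-t) and N(t) ~ Normal(0, 1)."""
    x = x0
    for t in range(iters):
        x -= eta * (dy(x) + 150.0 * math.exp(-t) * random.gauss(0.0, 1.0))
    return x

# Assumed objective from Problem 8.1.1 (c): y(x) = 10*sin(x)**2 + 0.2*(x+3)**2
dy = lambda x: 10.0 * math.sin(2.0 * x) + 0.4 * (x + 3.0)
```

Because *c*(*t*) decays exponentially, the rule behaves like a noisy random walk for the first few steps and like plain gradient descent thereafter, eventually settling at a stationary point of *y*.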

† **8.2.1** Use the simulated
annealing algorithm described in Section 8.2 with the temperature
schedule in Equation (8.2.5) to find the global minima of the
functions in Problem 8.1.1 (a) - (c) and Problem 8.1.2 (a) and
(b). Make intelligent choices for the variance of the random
perturbation Δ**x**, taking into account the domain of the function
being optimized. Also, make use of the plots of these functions
to estimate the largest possible change in *y* due to Δ**x**,
and use this information to guide your estimate of *T*₀.

* **8.3.1** Consider the simple
model of a biological neuron with output *x* given by *x*(*net*) = sgn(*net*),
where *net* is the post-synaptic potential. Experimental
observations (Katz, 1966) show that this post-synaptic potential
is normally distributed; i.e.,

*P*(*net*) = [1/(σ√(2π))] exp[−(*net* − μ)²/(2σ²)]

where μ is the mean potential. The distribution width,
σ, is determined by the parameters of the noise sources associated
with synaptic junctions. Show that the probability that the neuron
fires (i.e., the probability of its output being equal to one)
is given by

*P*(*x* = 1) = ½ [1 + erf(μ/(σ√2))]

where the pseudo-temperature *T* ∝ σ. Note how the pseudo-temperature now has
the physical interpretation of being proportional to the fluctuation
of the post-synaptic potential of a real neuron. Next show that
the above probability can be roughly approximated by ½ [1 + tanh(μ/*T*)]. Hint:
Compare the series expansion of erf(*x*) to that of tanh(*x*).
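Numerically, the Gaussian firing probability and its tanh approximation track each other closely. The proportionality constant *T* = σ√(π/2) below is the choice that matches the leading terms of the two series expansions, and is offered here as an illustration:

```python
import math

def p_fire_exact(mu, sigma):
    # P(x = 1) = P(net > 0) for net ~ Normal(mu, sigma^2)
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))

def p_fire_approx(mu, sigma):
    # Sigmoidal approximation with pseudo-temperature T proportional to sigma;
    # T = sigma * sqrt(pi/2) matches the leading series terms of erf and tanh.
    T = sigma * math.sqrt(math.pi / 2.0)
    return 0.5 * (1.0 + math.tanh(mu / T))
```

The two curves differ by less than about 0.02 everywhere, which is why the tanh (Boltzmann-machine) form is a serviceable stand-in for the exact Gaussian integral.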

**8.3.2** Derive Equation
(8.3.4). (Hint: employ the thermal equilibrium condition of Equation
(8.2.3) and Equation (8.3.1), assume that all units have equal
probability of being selected for updating and that only one unit
updates its state at a given time). Show that according to Equation
(8.3.4) the probability of a transition (bit-flip) that increases
the energy *E* is always less than 0.5.

**8.3.3** Show that the relative-entropy
*H* in Equation (8.3.9) is positive or zero.

**8.3.4** Derive Equation
(8.3.12) starting from Equation (8.3.8).

* **8.3.5** Derive Equation
(8.3.16) by performing a gradient descent on *H* in Equation
(8.3.15).

**8.5.1** Employ Hardy's theorem
(see Equation (8.5.2) and associated discussion) in a simple iterative
procedure to find the largest number in the set *Q* = {1, 3, 4.5, 1.5, 4.2, 2}.
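One simple reading of such an iterative procedure is fitness-proportional reweighting, in which each element's weight is multiplied by its value and renormalized; iterating concentrates all of the weight on the largest element. The function below is a hypothetical minimal sketch, not the book's exact formulation:

```python
def iterative_max(values, iters=50):
    """Fitness-proportional reweighting: p_i <- p_i * f_i / (sum_j p_j * f_j).
    For positive values, iterating drives the weight onto the largest element."""
    p = [1.0 / len(values)] * len(values)
    for _ in range(iters):
        mean = sum(pi * fi for pi, fi in zip(p, values))
        p = [pi * fi / mean for pi, fi in zip(p, values)]
    return values[p.index(max(p))]
```

Even the close competitors 4.5 and 4.2 separate quickly, since their weight ratio grows geometrically by the factor 4.5/4.2 at every iteration.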

**8.5.2** Consider the ten
strings in population *S*(0) in Table 8.5.1 and the two schemata
*H*₁ = *11***
and *H*₂ = *01**0.
Which schemata are matched by which strings in the population
*S*(0)? What are the order and defining length of each of
*H*₁ and *H*₂?
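Matching, order, and defining length can each be checked mechanically (the helper names below are mine):

```python
def matches(schema, string):
    """A string matches a schema if it agrees at every fixed (non-*) position."""
    return all(s == '*' or s == c for s, c in zip(schema, string))

def order(schema):
    """Order o(H): the number of fixed (non-*) positions."""
    return sum(c != '*' for c in schema)

def defining_length(schema):
    """Defining length d(H): distance between the outermost fixed positions."""
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if fixed else 0
```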

**8.5.3** Consider the problem
of finding the global minimum of the function *y*(*x*) = *x* sin(1/*x*), *x* ∈ [0.05, 0.5],
using GA. Assume the initial population *S*(0) as in Table
8.5.1 and let *P*c = 0.8
and *P*m = 0.01
as in the first simulation of Example 8.5.2. Use Equation (8.5.4)
to compute a lower bound for the expected number of schemata of
the form *11***
in the generation at *t* = 1. Repeat using the
approximation of Equation (8.5.5). Next, compare these two bounds
to the actual number of schemata of the form *11***
in population *S*(1) in Table 8.5.1.

**8.5.4** Repeat Problem 8.5.3
with the schema *01**0.

† **8.5.5** Find the
global minimum of the functions in Problem 8.1.1 (a) - (c) and
Problem 8.1.2 (a) and (b) using the standard genetic algorithm
of Section 8.5.1. Use binary strings of dimension *n* = 8.
Assume *P*c = 0.85,
*P*m = 0.02,
and a uniformly distributed initial population of 10 strings.
Compare your results to those in Problem 8.2.1.

**8.5.6** Consider the two-layer
feedforward net shown in Figure 8.6.1. Assume that a binary coding
of weights is used where each weight is represented by a *b*-bit
substring for the purpose of representing the network as a GA
string **s** of contiguous weight substrings. Show that the
dimension of **s** is equal to [(*n* + 1)*J* + (*J* + 1)*L*]*b*,
where *n*, *J*, and *L* are the input vector dimension,
number of hidden units, and number of output units, respectively.
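The count follows directly from the two weight layers, with the "+1" terms accounting for the bias of each unit; a one-line check (the function name is mine):

```python
def ga_string_length(n, J, L, b):
    """Total bits in the GA string s: (n+1)*J weights into the hidden layer
    (n inputs plus one bias per hidden unit) plus (J+1)*L weights into the
    output layer (J hidden units plus one bias per output unit), b bits each."""
    return ((n + 1) * J + (J + 1) * L) * b
```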

† **8.5.7** Use the standard
genetic algorithm to find integer weights in the range [-15, +15]
for the neural network in Figure P8.5.7 such that the network
solves the XOR problem. Assume a signed-binary coding of 5 bits
(sign bit plus four magnitude bits) for each weight. Also, assume
*P*c = 0.85,
*P*m = 0.01,
and experiment with population sizes of *M* = 10,
20, 30, 40, and 50. The total number of correct responses may
be used as the fitness function.