8.7 Summary

This chapter discusses probabilistic global search methods suited to neural network optimization. Global search methods, as opposed to deterministic gradient-based search methods, must be used in optimization problems where reaching the global minimum (or maximum) is at a premium. The price one pays for using global search methods, however, is increased computational and/or storage requirements compared with local search. The intrinsic slowness of global search methods is mainly due to the slow but crucial exploration mechanisms they employ.

Global search methods may be used as optimal learning algorithms for neural networks. Some global search methods may also be mapped onto recurrent neural networks such that the retrieval dynamics of these networks escape local minima and evolve toward the global minimum.

Two major probabilistic global search methods are covered in this chapter. The first is stochastic simulated annealing, which is motivated by statistical mechanics; the second is genetic algorithms, motivated by the mechanics of natural selection and natural genetics. The exploration mechanism in simulated annealing is governed by the Boltzmann-Gibbs probability distribution, and its convergence is determined by a "cooling" schedule of slowly decreasing "temperature." This method is especially appealing since it can be naturally implemented by a stochastic neural network known as the Boltzmann machine. The Boltzmann machine is a stochastic version of Hopfield's energy-minimizing net which is capable of almost guaranteed convergence to the global minimum of an arbitrary bounded quadratic energy function. Simulated annealing is also applied to optimal weight learning in generally interconnected multilayer Boltzmann machines, thus extending the applicability of the Boltzmann machine from combinatorial optimization to optimal supervised learning of complex binary mappings. These desirable features of Boltzmann machines, however, come at the cost of slow learning and/or retrieval.
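
As a concrete illustration of the acceptance and cooling mechanics just described, the following is a minimal sketch of simulated annealing applied to a generic scalar cost function. It is not the chapter's algorithm verbatim: the geometric cooling schedule, the Gaussian neighborhood perturbation, and all parameter values are illustrative assumptions.

import math
import random

def simulated_annealing(cost, neighbor, x0, T0=1.0, alpha=0.95, steps_per_T=100, T_min=1e-3):
    # cost: energy function E(x) to be minimized
    # neighbor: returns a randomly perturbed copy of a state
    # T0, alpha: initial temperature and geometric cooling factor (illustrative)
    x, E = x0, cost(x0)
    best_x, best_E = x, E
    T = T0
    while T > T_min:
        for _ in range(steps_per_T):
            x_new = neighbor(x)
            dE = cost(x_new) - E
            # Downhill moves are always accepted; uphill moves are accepted with
            # Boltzmann-Gibbs probability exp(-dE/T), which shrinks as T decreases.
            if dE <= 0 or random.random() < math.exp(-dE / T):
                x, E = x_new, E + dE
                if E < best_E:
                    best_x, best_E = x, E
        T *= alpha  # slowly decrease the "temperature"
    return best_x, best_E

# Example: escape the local minimum at x = 0 of the function of Problem 8.1.1 (d)
y = lambda x: x**6 - 15*x**4 + 27*x**2 + 250
nbr = lambda x: min(3.7, max(-3.7, x + random.gauss(0.0, 0.3)))
print(simulated_annealing(y, nbr, x0=0.0, T0=100.0))

The high-temperature phase supplies the exploration mechanism referred to above; as T is lowered the acceptance rule becomes increasingly greedy, and the schedule must decrease slowly enough for the search not to freeze in a local minimum.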

Mean-field annealing is a deterministic approximation (based on mean-field theory) to stochastic simulated annealing in which the mean behavior of the stochastic state transitions is used to characterize the Boltzmann machine. This approximation is found to preserve the optimal characteristics of the Boltzmann machine but with a speed advantage of one to two orders of magnitude. Mean-field annealing is applied in the context of retrieval dynamics and weight learning in a Boltzmann machine. It is interesting to see that applying mean-field theory to a single-layer Boltzmann machine leads to the deterministic continuous Hopfield net.
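
A rough sketch of the corresponding retrieval computation follows, assuming a network of +/-1 units with symmetric weights and the standard mean-field update v = tanh(net/T). The annealing schedule, starting state, and parameter names are illustrative, not the chapter's exact formulation.

import numpy as np

def mean_field_anneal(W, I, T0=10.0, T_min=0.05, alpha=0.9, inner_iters=50):
    # Deterministic approximation to Boltzmann machine retrieval for the
    # quadratic energy E(v) = -0.5 v^T W v - I^T v, with W symmetric.
    n = W.shape[0]
    v = 0.01 * np.random.randn(n)    # mean activations start near zero at high T
    T = T0
    while T > T_min:
        for _ in range(inner_iters):
            net = W @ v + I          # mean "post-synaptic potentials"
            v = np.tanh(net / T)     # mean-field (deterministic) state update
        T *= alpha                   # anneal the temperature
    return np.sign(v)                # mean activations saturate toward +/-1 as T -> 0

For a fixed temperature this iteration is simply a deterministic continuous relaxation of the kind mentioned above; it is the gradual lowering of T that lets the fixed point settle into a low-energy (ideally global) minimum without the noise of stochastic state transitions.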

Genetic algorithms (GA's) are introduced as another method for optimal neural network design. They employ a parallel multipoint probabilistic search strategy which is biased towards reinforcing search points of high fitness. The most distinguishing feature of GA's is their flexibility and applicability to a wide range of optimization problems. In the domain of neural networks, GA's are useful as global search methods for synthesizing the weights of generally interconnected networks, optimal network architectures and learning parameters, and optimal learning rules.
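
A bare-bones version of the standard GA operators (fitness-proportional selection, one-point crossover, and bitwise mutation) over fixed-length binary strings might look as follows. This is an illustrative sketch with assumed defaults, not the precise algorithm of Section 8.5.1.

import random

def genetic_algorithm(fitness, n_bits, pop_size=10, Pc=0.85, Pm=0.01, generations=50):
    # fitness must return a nonnegative score; larger means fitter
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(s) for s in pop]
        total = sum(scores)

        def select():
            # roulette-wheel (fitness-proportional) selection
            r, acc = random.uniform(0, total), 0.0
            for s, f in zip(pop, scores):
                acc += f
                if acc >= r:
                    return s[:]
            return pop[-1][:]

        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = select(), select()
            if random.random() < Pc:                 # one-point crossover
                cut = random.randint(1, n_bits - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                child = [1 - b if random.random() < Pm else b for b in child]
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

To apply this to a neural-network problem, each string is decoded into the quantity being searched (a weight vector, an architecture description, or learning-rule parameters) and the fitness is defined as a decreasing function of the network error.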

It is argued that, in order to make global search methods more efficient in terms of speed, local gradient information (when available) can be used to advantage. In the context of GA optimization, it is possible to think of the GA as an evolutionary mechanism that can be accelerated by simple learning processes. This observation motivates the hybrid GA/gradient-descent method for feedforward multilayer net training introduced at the end of the chapter.
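
One way such a hybrid can be organized is sketched below, purely as an illustration: each candidate weight vector is refined by a few gradient-descent steps before its fitness is computed, and the improved weights are written back into the GA string (Lamarckian-style learning). The helper names decode, encode, loss, and grad are hypothetical stand-ins, and this is not a reproduction of the chapter's specific method.

def lamarckian_fitness(decode, encode, loss, grad, eta=0.05, k_steps=5):
    # decode/encode: map between a GA string and a real-valued weight vector (e.g., a NumPy array)
    # loss/grad: network error and its gradient with respect to the weights
    def fitness(string):
        w = decode(string)
        for _ in range(k_steps):        # brief local search accelerates evolution
            w = w - eta * grad(w)
        string[:] = encode(w)           # write the acquired improvement back into the string
        return 1.0 / (1.0 + loss(w))    # higher fitness for lower error
    return fitness

Whether the refined weights are written back into the string (Lamarckian) or used only to score it (Baldwinian) is a design choice; either realizes the acceleration idea described above.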

Problems

8.1.1 Plot and find analytically the global minima of the following functions:

a. y(x) = x sin(1/x); x ∈ [0.05, 0.5]

b. y(x) = (x² + 2x) cos(x); |x| < 5

c. y(x) = 10 sin²(x) + 0.2(x + 3)²; |x| < 20

d. y(x) = x⁶ - 15x⁴ + 27x² + 250; |x| < 3.7

e. ; |x1| < 1.5 and |x2| < 1.5

f. ; |x1| < 2 and |x2| < 2

8.1.2 Plot and count the number of minima, maxima, and saddle points for the following functions:

a. ; |x1| ≤ 1.25 and |x2| ≤ 1.5

b. ; |x1| ≤ 0.3 and |x2| ≤ 0.5

c. ; |x1| ≤ 0.12 and -2 ≤ x2 ≤ 0.8

8.1.3 Employ the gradient descent search rule given by

x(t + 1) = x(t) - η (dy/dx)|x = x(t);  t = 0, 1, 2, ...

to find a minimum of the function y(x) = x sin(1/x) starting from the initial search point: a. x(0) = 0.05; b. x(0) = 0.1; c. x(0) = 0.15. Assume η = 10⁻⁴.

8.1.4 Repeat Problem 8.1.3 for the function in Problem 8.1.1 (c) with x(0) = -20 and η = 0.01.

8.1.5 Employ the gradient descent/ascent global search strategy described in Section 8.1 to find the global minima of the functions (a) - (d) in Problem 8.1.1.

8.1.6 "Global descent" is a global search method which was discussed in Section 5.1.2. Implement the global descent method to search for the global minimum of the function y(x) given in Problem 8.1.1 (c). Assume ,  = 0.005, x = 0.01,  = 2, and experiment with different values of the repeller parameter . Plot y[x(t)] versus t for various values of k.

8.1.7 For the uni-variate case, y = y(x), the global minimum can be reached by gradient descent on the noisy function


where N(t) is a "noise signal" and c(t) is a parameter which controls the magnitude of the noise. Apply the gradient descent rule of Problem 8.1.3 (with η = 0.01) to this noisy function. Use the resulting "stochastic" gradient rule to find the global minimum of y(x) in Problem 8.1.1 (c). Assume that N(t) is a normally distributed sequence with a mean of zero and a variance of one, and that c(t) = 150 exp(-t). Start from x(0) = -20 and experiment with different values of η, η ∈ [0.001, 0.01]. Plot x(t) versus t (t = 0, 1, 2, ...). What range of values of η is likely to lead to the global minimum of y(x)?

8.2.1 Use the simulated annealing algorithm described in Section 8.2 with the temperature schedule in Equation (8.2.5) to find the global minima of the functions in Problem 8.1.1 (a) - (c) and Problem 8.1.2 (a) and (b). Make intelligent choices for the variance of the random perturbation Δx, taking into account the domain of the function being optimized. Also, make use of the plots of these functions to estimate the largest possible change in y due to Δx, and use this information to guide your estimate of the initial temperature T0.

* 8.3.1 Consider the simple model of a biological neuron with output x given by x(net) = sgn(net), where net is the post-synaptic potential. Experimental observations (Katz, 1966) show that this post-synaptic potential is normally distributed; i.e.,

p(net) = [1/(√(2π) σ)] exp[-(net - ⟨net⟩)² / (2σ²)]

where ⟨net⟩ is the mean potential. The distribution width, σ, is determined by the parameters of the noise sources associated with synaptic junctions. Show that the probability that the neuron fires (i.e., the probability of its output being equal to one) is given by

P(x = 1) = (1/2)[1 + erf(⟨net⟩ / (√2 σ))]

where erf(·) is the error function and the pseudo-temperature T is taken proportional to the width σ. Note how the pseudo-temperature now has the physical interpretation of being proportional to the fluctuation of the post-synaptic potential of a real neuron. Next, show that the above probability can be roughly approximated by a sigmoid (logistic) function of ⟨net⟩/T. Hint: Compare the series expansion of erf(x) to that of tanh(x).

8.3.2 Derive Equation (8.3.4). (Hint: employ the thermal equilibrium condition of Equation (8.2.3) and Equation (8.3.1), assume that all units have equal probability of being selected for updating and that only one unit updates its state at a given time). Show that according to Equation (8.3.4) the probability of a transition (bit-flip) that increases the energy E is always less than 0.5.

8.3.3 Show that the relative-entropy H in Equation (8.3.9) is positive or zero.

8.3.4 Derive Equation (8.3.12) starting from Equation (8.3.8).

* 8.3.5 Derive Equation (8.3.16) by performing a gradient descent on H in Equation (8.3.15).

8.5.1 Employ Hardy's theorem (see Equation (8.5.2) and associated discussion) in a simple iterative procedure to find the largest number in the set Q = {1, 3, 4.5, 1.5, 4.2, 2}.

8.5.2 Consider the ten strings in population S(0) in Table 8.5.1 and the two schemata H1 = *11*** and H2 = *01**0. Which schemata are matched by which strings in the population S(0)? What are the order and defining length of each of H1 and H2?

8.5.3 Consider the problem of finding the global minimum of the function y(x) = x sin(1/x), x ∈ [0.05, 0.5], using a GA. Assume the initial population S(0) as in Table 8.5.1 and let Pc = 0.8 and Pm = 0.01 as in the first simulation of Example 8.5.2. Use Equation (8.5.4) to compute a lower bound for the expected number of schemata of the form *11*** in the generation at t = 1. Repeat using the approximation of Equation (8.5.5). Next, compare these two bounds to the actual number of schemata of the form *11*** in population S(1) in Table 8.5.1.

8.5.4 Repeat Problem 8.5.3 with the schema *01**0.

8.5.5 Find the global minimum of the functions in Problem 8.1.1 (a) - (c) and Problem 8.1.2 (a) and (b) using the standard genetic algorithm of Section 8.5.1. Use binary strings of dimension n = 8. Assume Pc = 0.85, Pm = 0.02, and a uniformly distributed initial population of 10 strings. Compare your results to those in Problem 8.2.1.

8.5.6 Consider the two-layer feedforward net shown in Figure 8.6.1. Assume that a binary coding of weights is used where each weight is represented by a b-bit substring for the purpose of representing the network as a GA string s of contiguous weight substrings. Show that the dimension of s is equal to [(n + 1)J + (J + 1)L]b, where n, J, and L are the input vector dimension, number of hidden units, and number of output units, respectively.

8.5.7 Use the standard genetic algorithm to find integer weights in the range [-15, +15] for the neural network in Figure P8.5.7 such that the network solves the XOR problem. Assume a signed-binary coding of 5 bits (sign bit plus four magnitude bits) for each weight. Also, assume Pc = 0.85, Pm = 0.01, and experiment with population sizes of M = 10, 20, 30, 40, and 50. The total number of correct responses may be used as the fitness function.


Figure P8.5.7