8.3 Simulated Annealing for Stochastic Neural Networks

In a stochastic neural network the units have non-deterministic activation functions as discussed in Section 3.1.6. Here, units behave stochastically with the output (assumed to be bipolar binary) of the ith unit taking the value xi = +1 with probability f(neti) and value xi = -1 with probability 1 - f(neti), where neti is the weighted-sum (net input) of unit i and

$$P(x_i) = f(x_i\,\mathrm{net}_i) = \frac{1}{1 + e^{-2\lambda\, x_i\,\mathrm{net}_i}} \qquad (8.3.1)$$

where P(xi) represents the probability distribution of xi. There are several possible choices of f which could have been made in Equation (8.3.1), but the choice of the sigmoid function is motivated by statistical mechanics [Glauber, 1963; Little, 1974; see also Amari, 1971], where the units behave stochastically exactly like the spins in an Ising model of a magnetic material in statistical physics [for more details on the connections between stochastic units and the Ising model, the reader is referred to Hinton and Sejnowski (1983), Peretto (1984), Amit (1989), and Hertz et al. (1991)]. Equation (8.3.1) may also be derived from observations relating to the stochastic nature of the post-synaptic potential of biological neurons (Shaw and Vasudevan, 1974). Here, the neuron may be approximated as a linear threshold gate (LTG) [recall Equation (1.1.1)] with zero threshold, signum activation function, and with the net input (post-synaptic potential) being a Gaussian random variable, as explored in Problem 8.3.1.

The parameter λ in Equation (8.3.1) controls the steepness of the sigmoid f(net) at net = 0. We may think of λ as a reciprocal pseudo-temperature, λ = 1/T. When the "temperature" T approaches zero, the sigmoid becomes a step function and the stochastic unit becomes deterministic. As T increases, this sharp threshold is progressively "softened," making the unit's output increasingly random. Next, it is shown how stochastic neural nets with units described by Equation (8.3.1) and with controllable temperature T form a natural substrate for the implementation of simulated annealing-based optimization.
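As a concrete illustration, the stochastic update of Equation (8.3.1) can be sampled directly. The following is a minimal sketch; the function name and the overflow guard are our own additions:

```python
import math
import random

def stochastic_unit(net, T):
    """Sample a bipolar state per Equation (8.3.1):
    P(x = +1) = 1 / (1 + exp(-2*net/T)), P(x = -1) = 1 - P(x = +1).
    As T -> 0 the unit approaches the deterministic sign of net."""
    a = 2.0 * net / T
    # Guard against overflow in math.exp for extreme net/T ratios.
    if a > 700:
        p_plus = 1.0
    elif a < -700:
        p_plus = 0.0
    else:
        p_plus = 1.0 / (1.0 + math.exp(-a))
    return 1 if random.random() < p_plus else -1

# At a very low temperature the unit is effectively deterministic:
random.seed(0)
samples = [stochastic_unit(0.5, T=1e-6) for _ in range(100)]
```

At a high temperature the same unit would output +1 and -1 with nearly equal probability, regardless of the sign of its net input.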

Since simulated annealing is a global optimization method, one might be tempted to use it to enhance the convergence of gradient-type nets, such as the Hopfield net used for combinatorial optimization, toward globally optimal solutions. In combinatorial optimization problems one is interested in finding a solution vector x ∈ {-1, 1}^n (or {0, 1}^n) which best minimizes an objective function y(x). When y(x) is quadratic, the continuous Hopfield net may be used to search for local minimizers of y(x). This is achieved by first mapping y(x) onto the quadratic energy function E(x) of the Hopfield net and then using the net as a gradient descent algorithm to minimize E(x), thus minimizing y(x) (see Section 7.5 for details).

Rather than using the suboptimal Hopfield net approach, the desired global minimum of y(x) may be obtained by a direct application of the simulated annealing method of Section 8.2. However, the promise of fast analog hardware implementations of the Hopfield net leads us to take another look at the network optimization approach, but with modifications which improve convergence to global solutions. The following is an optimization method based on an efficient way of incorporating simulated annealing search in a discrete Hopfield net. This method, though, is only suitable for quadratic functions.

Consider the discrete-state discrete-time recurrent net (discrete Hopfield model) with ith unit dynamics described by Equation (7.1.37), repeated here for convenience:

$$x_i(k+1) = \mathrm{sgn}\big(\mathrm{net}_i(k)\big) = \mathrm{sgn}\Big(\sum_{j=1}^{n} w_{ij}\, x_j(k)\Big) \qquad (8.3.2)$$

This deterministic network has a quadratic Liapunov function (energy function) if its weight matrix is symmetric with positive (or zero) diagonal elements and if serial state update is used (recall Problem 7.1.18). In other words, this net will always converge to one of the local minima of its energy function E(x) given by

$$E(\mathbf{x}) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, x_i x_j = -\frac{1}{2}\,\mathbf{x}^{\mathrm{T}}\mathbf{W}\mathbf{x} \qquad (8.3.3)$$

Next, consider a stochastic version of the above net, referred to as the stochastic Hopfield net, where we replace the deterministic threshold units by stochastic units according to Equation (8.3.1). By employing Equation (8.3.1) and assuming "thermal" equilibrium one may find the transition probability from xi to -xi (i.e., the probability to flip unit i from +1 to -1 or vice versa) as

$$P(x_i \to -x_i) = \frac{1}{1 + e^{2 x_i \mathrm{net}_i / T}} = \frac{1}{1 + e^{\Delta E_i / T}} \qquad (8.3.4)$$

The rightmost term in Equation (8.3.4) is obtained by using Equation (8.3.3), which gives the energy change due to flipping unit i as ΔEi = 2 xi neti. The transition probabilities in Equation (8.3.4) give a complete description of the stochastic sequence of states in the stochastic Hopfield net. Note that if T = 0, the probability of a transition which increases the energy E(x) becomes zero and that of a transition which decreases E(x) becomes one; hence, the net reduces to the stable deterministic Hopfield net. On the other hand, for T > 0 a transition which increases E(x) is allowed, but with a probability smaller than that of a transition which decreases E(x). This last observation, coupled with the requirement that one and only one stochastic unit (chosen randomly and uniformly) is allowed to flip its state at a given time, guarantees that "thermal" equilibrium will be reached for any T > 0.
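A single serial update under Equation (8.3.4) can be sketched as follows (illustrative code; the tiny two-unit weight matrix at the end is an assumption used only for the limiting-case checks):

```python
import math
import random

def flip_probability(x, W, i, T):
    """P(x_i -> -x_i) = 1 / (1 + exp(dE_i / T)), with dE_i = 2 * x_i * net_i
    as given by Equations (8.3.3) and (8.3.4)."""
    net_i = sum(W[i][j] * x[j] for j in range(len(x)))
    dE = 2.0 * x[i] * net_i
    return 1.0 / (1.0 + math.exp(dE / T))

def serial_update(x, W, T):
    """Pick one unit uniformly at random and flip it with probability (8.3.4)."""
    i = random.randrange(len(x))
    if random.random() < flip_probability(x, W, i, T):
        x[i] = -x[i]
    return x

# Near T = 0, an energy-increasing flip is essentially never accepted:
W = [[0.0, 1.0], [1.0, 0.0]]
x = [1, 1]                    # flipping unit 0 here costs dE = +2
p = flip_probability(x, W, 0, T=0.01)
```

Note that the same expression covers both limits: as T → 0 uphill flips are rejected and downhill flips accepted, while for large T the flip probability approaches 1/2 regardless of ΔE.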

It may now be concluded that the serially updated stochastic Hopfield net with the stochastic dynamics in Equation (8.3.1) is stable (in an average sense) for T ≥ 0 as long as its interconnection matrix is symmetric with positive diagonal elements. In other words, this stochastic net will reach an equilibrium state where the average value of E is a constant when T is held fixed for a sufficiently long period of time. Now, if a slowly decreasing temperature T is used with an initially large value, the stochastic Hopfield net becomes equivalent to a simulated annealing algorithm; i.e., when initialized at a random binary state x(0) the net will perform a stochastic global search seeking the global minimum of E(x). As discussed in Section 8.2, at the beginning of the computation, a higher temperature should be used so that it is easier for the states to escape from local minima. Then, as the computation proceeds, the temperature is gradually decreased according to a pre-specified cooling schedule. Finally, as the temperature approaches zero, the state, now placed (hopefully) near the global minimum, will converge to this minimum.
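The annealed search described above can be sketched as follows; the geometric cooling schedule and all parameter values are illustrative assumptions, not prescriptions:

```python
import math
import random

def energy(x, W):
    """E(x) = -1/2 * sum_ij w_ij x_i x_j, Equation (8.3.3)."""
    n = len(x)
    return -0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def anneal(W, T0=10.0, cooling=0.95, steps_per_T=50, T_min=0.01, seed=1):
    """Stochastic Hopfield net driven by a slowly decreasing temperature."""
    rng = random.Random(seed)
    n = len(W)
    x = [rng.choice([-1, 1]) for _ in range(n)]   # random initial state x(0)
    T = T0
    while T > T_min:
        for _ in range(steps_per_T):
            i = rng.randrange(n)                  # one randomly chosen unit at a time
            net_i = sum(W[i][j] * x[j] for j in range(n))
            dE = 2.0 * x[i] * net_i
            if rng.random() < 1.0 / (1.0 + math.exp(dE / T)):
                x[i] = -x[i]                      # accept the flip per (8.3.4)
        T *= cooling                              # geometric cooling schedule
    return x

# Two mutually excitatory units settle into an aligned state (E = -1 at
# either of the two global minima):
W = [[0.0, 1.0], [1.0, 0.0]]
x = anneal(W)
```

For this two-unit example the energy has two equivalent global minima, x = (+1, +1) and x = (-1, -1); as T approaches T_min the dynamics become effectively deterministic and lock into one of them.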

The above stochastic net is usually referred to as the Boltzmann machine because, at equilibrium, the probability of the states of the net is given by the Boltzmann-Gibbs distribution of Equation (8.2.1), or equivalently

$$\frac{P(\mathbf{x})}{P(\mathbf{x}')} = e^{-\left[E(\mathbf{x}) - E(\mathbf{x}')\right]/T} \qquad (8.3.5)$$

where x and x' are two states in {-1, 1}n which differ in only one bit. Equation (8.3.5) can be easily derived by employing Equations (8.3.1) and (8.3.3).
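One way to check this behavior numerically is to simulate the stochastic net at a fixed temperature and compare the empirical state frequencies with the Boltzmann-Gibbs probabilities of Equation (8.2.1). The following is a small illustrative experiment; the two-unit network, temperature, and run length are assumptions:

```python
import math
import random
from collections import Counter

def energy(x, W):
    n = len(x)
    return -0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

# Two units with symmetric weight w_01 = w_10 = 1 at fixed temperature T = 1.
W = [[0.0, 1.0], [1.0, 0.0]]
T = 1.0
rng = random.Random(42)
x = [1, -1]
counts = Counter()
for step in range(50000):
    i = rng.randrange(2)
    net_i = sum(W[i][j] * x[j] for j in range(2))
    dE = 2.0 * x[i] * net_i
    if rng.random() < 1.0 / (1.0 + math.exp(dE / T)):   # transition rule (8.3.4)
        x[i] = -x[i]
    if step >= 1000:                                    # discard a short burn-in
        counts[tuple(x)] += 1

total = sum(counts.values())
states = [(a, b) for a in (-1, 1) for b in (-1, 1)]
Z = sum(math.exp(-energy(list(s), W) / T) for s in states)
empirical = {s: counts[s] / total for s in states}
boltzmann = {s: math.exp(-energy(list(s), W) / T) / Z for s in states}
```

The empirical frequencies should match the Boltzmann-Gibbs distribution to within sampling error, which also implies that the ratio of frequencies of two one-bit-different states approximates Equation (8.3.5).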

In the following, statistical mechanics ideas are extended to learning in stochastic recurrent networks, or "Boltzmann learning" (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985). These networks consist of n arbitrarily interconnected stochastic units where the state xi of the ith unit is +1 or -1 with probability f(xi neti) as in Equation (8.3.1). The units are divided into visible and hidden units as shown in Figure 8.3.1. The hidden units have no direct inputs from, nor do they supply direct outputs to, the outside world. The visible units may (but need not) be further divided into input units and output units. The units are interconnected in an arbitrary way, but whatever the interconnection pattern, all connections must be symmetric, wij = wji, with wii = 0. The net activity at unit i is given by neti = Σ_{j≠i} wij xj, where thresholds (biases) are omitted for convenience and self-feedback is not allowed. This leads to an energy function for this net given by

$$E(\mathbf{x}) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n} w_{ij}\, x_i x_j \qquad (8.3.6)$$

Figure 8.3.1. A stochastic net with visible and hidden units. Connections between any two units (if they exist) must be symmetric. The visible units may be further divided into input and output units as illustrated.

whose minima are the stable states x* characterized by xi* = sgn(neti*), i = 1, 2, ..., n. According to the discussion in Section 8.3.1, and because of the existence of an energy function, we find that the states of the present stochastic net are governed by the Boltzmann-Gibbs distribution of Equation (8.2.1), which may be adapted for the present network/energy function as

$$P(\mathbf{x}) = \frac{e^{-E(\mathbf{x})/T}}{\sum_{\mathbf{x}' \in \Omega} e^{-E(\mathbf{x}')/T}} \qquad (8.3.7)$$

where Ω = {-1, +1}^n is the set of all possible states. Thus far, we have designed a stable stochastic net (Boltzmann machine) which is capable of reaching the global minimum (or a good suboptimal minimum) of its energy function by slowly decreasing the pseudo-temperature T = 1/λ starting from a sufficiently high temperature. Therefore, we may view this net as an extension of the stochastic Hopfield net to include hidden units. The presence of hidden units has the advantage of theoretically allowing the net to represent (learn) arbitrary mappings/associations. In Boltzmann learning, the weights wij are adjusted to give the states of the visible units a particular desired probability distribution.

Next, the derivation of a Boltzmann learning rule is presented. This rule is derived by minimizing a measure of the difference between the probability of finding the states of the visible units in the freely running net and a set of desired probabilities for these states. Before proceeding any further, we denote the states of the visible units by the activity pattern α and those of the hidden units by β. Let N and K be the numbers of visible and hidden units, respectively. Then, the set A of all visible patterns has a total of 2^N members (state configurations), and the set G of all hidden patterns has 2^K members. The vector x still represents the state of the whole network, and it belongs to the set of 2^{N+K} = 2^n possible network states. The probability P(α) of finding the visible units in state α, irrespective of β, is then obtained as

$$P(\alpha) = \sum_{\beta \in G} P(\alpha, \beta) = \frac{1}{Z}\sum_{\beta \in G} e^{-E(\alpha,\beta)/T} \qquad (8.3.8)$$

Here, P(x) = P(α, β) denotes the joint probability that the visible units are in state α and the hidden units are in state β, given that the network is running freely. Also, E(x) = E(α, β) is the energy of the network when the visible units are in state α and the hidden units are jointly in state β. The term Z is the denominator in Equation (8.3.7) and should be interpreted as Z = Σ_{α∈A} Σ_{β∈G} e^{-E(α,β)/T}. Equation (8.3.8) gives the actual probability P(α) of finding the visible units in state α in the freely running network at "thermal" equilibrium. This probability is determined by the weights wij. Now, assume that we are given a set of desired probabilities R(α), independent of the wij's, for the visible states. Then, the objective is to bring the distribution P(α) as close as possible to R(α) by adjusting the wij's. A suitable measure of the difference between P(α) and R(α) is the relative entropy H (see Section 5.2.6)

$$H = \sum_{\alpha \in A} R(\alpha)\, \ln \frac{R(\alpha)}{P(\alpha)} \qquad (8.3.9)$$

which is positive or zero (H is zero only if R(α) = P(α) for all α). Therefore, we may arrive at a learning equation by performing gradient descent on H:

$$\Delta w_{ij} = -\varepsilon\, \frac{\partial H}{\partial w_{ij}} \qquad (8.3.10)$$

where ε is a small positive constant. Using Equations (8.3.6) through (8.3.8) and recalling that wij = wji, we find

$$\frac{\partial H}{\partial w_{ij}} = -\frac{1}{T}\sum_{\alpha \in A} \frac{R(\alpha)}{P(\alpha)} \Big[\sum_{\beta \in G} P(\alpha,\beta)\, x_i x_j \;-\; P(\alpha) \sum_{\mathbf{x} \in \Omega} P(\mathbf{x})\, x_i x_j\Big] \qquad (8.3.11)$$

where ⟨xi xj⟩ denotes the free-running average Σ_{x∈Ω} P(x) xi xj, and xi (xj) should be interpreted as the state of unit i (j), given that the visible units are in state α and the hidden units are jointly in state β. Next, using Equation (8.3.8) and noting that Σ_{α∈A} R(α) = 1, gives

$$\frac{\partial H}{\partial w_{ij}} = -\frac{1}{T}\Big[\sum_{\alpha \in A}\sum_{\beta \in G} R(\alpha)\, \frac{P(\alpha,\beta)}{P(\alpha)}\, x_i x_j \;-\; \langle x_i x_j \rangle\Big] \qquad (8.3.12)$$

Thus, substituting Equation (8.3.12) into Equation (8.3.10) leads to the Boltzmann learning rule:

$$\Delta w_{ij} = \eta \left( \langle x_i x_j \rangle^{\mathrm{clamped}} - \langle x_i x_j \rangle \right) \qquad (8.3.13)$$

where η = ε/T is used and

$$\langle x_i x_j \rangle^{\mathrm{clamped}} = \sum_{\alpha \in A}\sum_{\beta \in G} R(\alpha)\, P(\beta \,|\, \alpha)\, x_i x_j \qquad (8.3.14)$$

represents the value of ⟨xi xj⟩ when the visible units are clamped in state α, averaged over the α's according to their probabilities R(α). Note that in Equation (8.3.14), the term P(α, β)/P(α) was replaced by P(β | α) according to Bayes' rule.
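For a network small enough to enumerate all 2^n states, the two correlations appearing in Equations (8.3.12) through (8.3.14) can be computed exactly. The following is an illustrative sketch; in practice these sums are intractable, and they are replaced by annealed time averages as described below:

```python
import math
from itertools import product

def energy(x, W):
    n = len(x)
    return -0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def correlations(W, R, n_visible, T=1.0):
    """Exact clamped and free correlations <x_i x_j> for a tiny Boltzmann machine.

    R maps each visible pattern alpha (a tuple) to its desired probability R(alpha);
    the visible units are the first n_visible components of the full state x."""
    n = len(W)
    states = list(product((-1, 1), repeat=n))
    boltz = {x: math.exp(-energy(x, W) / T) for x in states}
    Z = sum(boltz.values())

    # Free-running average <x_i x_j> = sum_x P(x) x_i x_j, as in (8.3.12).
    free = [[sum(boltz[x] / Z * x[i] * x[j] for x in states)
             for j in range(n)] for i in range(n)]

    # Clamped average sum_alpha sum_beta R(alpha) P(beta|alpha) x_i x_j, as in (8.3.14).
    clamped = [[0.0] * n for _ in range(n)]
    for alpha, r in R.items():
        sub = [x for x in states if x[:n_visible] == alpha]
        Za = sum(boltz[x] for x in sub)        # so P(beta|alpha) = boltz[x] / Za
        for i in range(n):
            for j in range(n):
                clamped[i][j] += r * sum(boltz[x] / Za * x[i] * x[j] for x in sub)
    return clamped, free

# One visible and one hidden unit; the desired distribution is uniform over {-1, +1}.
W = [[0.0, 0.5], [0.5, 0.0]]
R = {(-1,): 0.5, (1,): 0.5}
clamped, free = correlations(W, R, n_visible=1)
```

In this symmetric example the free-running marginal over the visible unit is already uniform, so the clamped and free correlations coincide and Equation (8.3.13) gives Δwij = 0: learning has converged for this target distribution.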

The first term on the right-hand side of Equation (8.3.13) is essentially a Hebbian learning term, with the visible units clamped, while the second term corresponds to anti-Hebbian learning with the system free-running. Note that learning converges when the free unit-unit correlations are equal to the clamped ones. It is very important that the correlations in the Boltzmann learning rule be computed when the system is in thermal equilibrium at temperature T = 1/λ > 0, since the derivation of this rule hinges on the Boltzmann-Gibbs distribution of Equation (8.3.7). At equilibrium, the state x fluctuates, and we measure the correlations ⟨xi xj⟩ by taking a time average of xi xj. This must be done twice: once with the visible units clamped in each of the states α for which R(α) is non-zero, and once with the α's unclamped. Thermal equilibrium must be reached for each of these computations.

As the reader may have already suspected by examining Equations (8.3.13) and (8.3.14), Boltzmann learning is very computationally intensive. Usually, one starts with a high temperature (very small λ) and chooses a cooling schedule. At each of these temperatures the network is allowed to follow its stochastic dynamics according to Equation (8.3.1) or, equivalently, many units are sampled and updated according to Equation (8.3.4). The temperature is lowered slowly according to the preselected schedule until T approaches zero and equilibrium is reached. This simulated annealing search must be repeated with the visible units clamped to each desired pattern and with the visible units unclamped. In the computation of ⟨xi xj⟩clamped, the visible states are clamped to patterns drawn randomly from the training set according to the given probability distribution R(α). For each such training pattern, the network seeks equilibrium following the same annealing schedule. The weights are updated only after enough training patterns have been presented. This whole process is repeated many times to achieve convergence to a good set of weights wij.
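A toy end-to-end version of this training procedure is sketched below, with exhaustive enumeration standing in for the annealed sampling so that it runs quickly; the network size, target distribution, and learning rate are all illustrative assumptions:

```python
import math
from itertools import product

T = 1.0

def energy(x, W):
    n = len(x)
    return -0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def visible_marginal(W, n_visible):
    """P(alpha) of Equation (8.3.8), computed by exhaustive enumeration."""
    n = len(W)
    states = list(product((-1, 1), repeat=n))
    boltz = {x: math.exp(-energy(x, W) / T) for x in states}
    Z = sum(boltz.values())
    P = {}
    for x in states:
        alpha = x[:n_visible]
        P[alpha] = P.get(alpha, 0.0) + boltz[x] / Z
    return P

def relative_entropy(R, P):
    """H of Equation (8.3.9)."""
    return sum(r * math.log(r / P[a]) for a, r in R.items() if r > 0)

def boltzmann_step(W, R, n_visible, eta=0.1):
    """One batch update Delta w_ij = eta * (<x_i x_j>_clamped - <x_i x_j>_free),
    with both correlations computed from the weights at entry."""
    n = len(W)
    states = list(product((-1, 1), repeat=n))
    boltz = {x: math.exp(-energy(x, W) / T) for x in states}
    Z = sum(boltz.values())
    for i in range(n):
        for j in range(i + 1, n):
            free = sum(boltz[x] / Z * x[i] * x[j] for x in states)
            clamped = 0.0
            for alpha, r in R.items():
                sub = [x for x in states if x[:n_visible] == alpha]
                Za = sum(boltz[x] for x in sub)
                clamped += r * sum(boltz[x] / Za * x[i] * x[j] for x in sub)
            W[i][j] += eta * (clamped - free)
            W[j][i] = W[i][j]                 # keep the weights symmetric

# Two visible units plus one hidden; target: visible units perfectly correlated.
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
R = {(-1, -1): 0.5, (1, 1): 0.5}
H0 = relative_entropy(R, visible_marginal(W, 2))
for _ in range(200):
    boltzmann_step(W, R, 2)
H1 = relative_entropy(R, visible_marginal(W, 2))
```

Starting from zero weights the visible marginal is uniform, so H0 = ln 2; repeated applications of Equation (8.3.13) drive H toward zero as P(α) approaches R(α).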

The learning rule described above is compatible with pattern completion, in which a trained net is expected to fill in missing bits of a partial pattern when such a pattern is clamped on the visible nodes. This is reminiscent of retrieval in the dynamic associative memories of Chapter 7. Note that the presence of hidden units allows for high memory capacity. During retrieval, the weights derived in the training phase are held fixed and simulated annealing-based global retrieval is used as discussed in Section 8.3.1. Here, the visible units are clamped at corresponding known bits of a noisy/partial input pattern. Starting from a high temperature, the net follows the stochastic dynamics in Equation (8.3.1) or equivalently, state transitions are made according to Equation (8.3.4). The temperature is gradually lowered according to an appropriate schedule until the dynamics become deterministic at T = 0 and convergence to the "closest" pattern (global solution) is (hopefully) achieved.
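Pattern completion retrieval can be sketched in the same spirit: the known bits are clamped and only the remaining units follow the annealed stochastic dynamics. This is illustrative code; the outer-product weights used to store the pattern, and all annealing parameters, are assumptions:

```python
import math
import random

def complete(x, W, clamped, T0=5.0, cooling=0.9, steps_per_T=30, T_min=0.01, seed=3):
    """Anneal only the unclamped units; units whose indices appear in
    `clamped` keep their (known) states throughout retrieval."""
    rng = random.Random(seed)
    n = len(x)
    free = [i for i in range(n) if i not in clamped]
    T = T0
    while T > T_min:
        for _ in range(steps_per_T):
            i = rng.choice(free)
            net_i = sum(W[i][j] * x[j] for j in range(n))
            dE = 2.0 * x[i] * net_i
            if rng.random() < 1.0 / (1.0 + math.exp(dE / T)):
                x[i] = -x[i]                  # stochastic flip per (8.3.4)
        T *= cooling                          # gradually lower the temperature
    return x

# Store a single pattern p via a Hebbian outer product (zero diagonal), then
# clamp its first two bits and let annealing fill in the corrupted rest.
p = [1, -1, 1, 1]
n = len(p)
W = [[0.0 if i == j else p[i] * p[j] for j in range(n)] for i in range(n)]
x = [1, -1, -1, -1]                           # first two bits known, last two corrupted
x = complete(x, W, clamped={0, 1})
```

With the first two bits clamped to the stored pattern, the only energy minimum consistent with the clamped bits is p itself, so the annealed dynamics recover the missing bits.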

We may also extend Boltzmann learning to handle the association of input/output pairs of patterns, as in supervised learning in multilayer perceptron nets. Here, we need to distinguish between two types of visible units: input and output. Let us represent the states of the input units, the output units, and the hidden units by γ, α, and β, respectively. Then we want the network to learn associations γ → α. We may now pose the problem as follows: for each γ, we want to adjust the wij's such that the conditional probability distribution P(α | γ) is as close as possible to a desired distribution R(α | γ). Assuming that the γ's occur with probabilities p(γ), a suitable error measure is

$$H = \sum_{\gamma} p(\gamma) \sum_{\alpha} R(\alpha \,|\, \gamma)\, \ln \frac{R(\alpha \,|\, \gamma)}{P(\alpha \,|\, \gamma)} \qquad (8.3.15)$$

which leads to the Boltzmann learning rule (Hopfield, 1987)

$$\Delta w_{ij} = \eta \left( \langle x_i x_j \rangle^{\alpha\gamma\text{-clamped}} - \langle x_i x_j \rangle^{\gamma\text{-clamped}} \right) \qquad (8.3.16)$$

In Equation (8.3.16), both the inputs and outputs are clamped in the Hebbian term, while only the input states are clamped in the anti-Hebbian (unlearning) term, with averages over the inputs taken in both cases. Examples of applications of learning Boltzmann machines can be found in Ackley et al. (1985), Parks (1987), Sejnowski et al. (1986), Kohonen et al. (1988), and Lippmann (1989). Theoretically, Boltzmann machines with learning may outperform gradient-based learning such as backprop. However, the demanding computational overhead associated with these machines usually renders them impractical in software simulations. Specialized electronic (Alspector and Allen, 1987) and optoelectronic (Farhat, 1987; Ticknor and Barrett, 1987) hardware has been developed for the Boltzmann machine. However, such hardware implementations are still experimental in nature. Variations and related networks can be found in Derthick (1984), Hopfield et al. (1983), Smolensky (1986), van Hemmen et al. (1990), Galland and Hinton (1991), and Apolloni and De Falco (1991).