In a stochastic neural network the units have non-deterministic
activation functions as discussed in Section 3.1.6. Here, units
behave stochastically with the output (assumed to be bipolar binary)
of the *i*th unit taking the value *x**i* = +1
with probability *f*(*net**i*)
and value *x**i* = -1
with probability 1 - *f*(*net**i*),
where *net**i*
is the weighted-sum (net input) of unit *i* and

$$P(x_i) = f(net_i\, x_i) = \frac{1}{1 + e^{-\lambda\, net_i x_i}} \qquad (8.3.1)$$

where *P*(*x**i*)
represents the probability distribution of *x**i*.
There are several possible choices of *f* which could have
been made in Equation (8.3.1), but the choice of the sigmoid function
is motivated by statistical mechanics [Glauber, 1963; Little,
1974; see also Amari (1971)], where the units behave stochastically
exactly like the spins in an Ising model of a magnetic material
in statistical physics [for more details on the connections between
stochastic units and the Ising model, the reader is referred to
Hinton and Sejnowski (1983), Peretto (1984), Amit (1989), and
Hertz et al. (1991)]. Equation (8.3.1) may also be derived from
observations relating to the stochastic nature of the post-synaptic
potential of biological neurons (Shaw and Vasudevan, 1974). Here,
the neuron may be approximated as a linear threshold gate (LTG)
[recall Equation (1.1.1)] with zero threshold, signum activation
function, and with the net input (post-synaptic potential) being
a Gaussian random variable as explored in Problem 8.3.1.
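This Gaussian-net-input picture is easy to simulate. The sketch below is a hedged illustration of the setting of Problem 8.3.1 (not its solution; the function names are ours): a zero-threshold signum LTG whose net input is perturbed by zero-mean Gaussian noise fires +1 with probability Φ(*net*/σ), which is itself a smooth, sigmoid-shaped function of the net input.

```python
import math
import random

def noisy_ltg(net, sigma, rng):
    """Zero-threshold signum LTG whose net input carries Gaussian noise."""
    return 1 if net + rng.gauss(0.0, sigma) >= 0 else -1

def prob_fire(net, sigma):
    """Exact P(x = +1) = Phi(net / sigma): a smooth, sigmoid-shaped curve."""
    return 0.5 * (1.0 + math.erf(net / (sigma * math.sqrt(2.0))))
```

Sampling `noisy_ltg` many times at a fixed net input reproduces the firing probability `prob_fire`, illustrating how additive Gaussian noise turns a hard threshold into a graded stochastic response.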

The parameter λ in Equation (8.3.1) controls the
steepness of the sigmoid *f*(*net*) at *net* =
0. We may think of λ as a reciprocal pseudo-temperature, λ = 1/*T*.
When the "temperature" approaches zero, the sigmoid
becomes a step function and the stochastic unit becomes deterministic.
As *T* increases, this sharp threshold is "softened"
in a stochastic way, thus making the unit stochastic. Next, it
is shown how stochastic neural nets with units described by Equation
(8.3.1) and with controllable temperature *T* form a natural
substrate for the implementation of simulated annealing-based
optimization.
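A stochastic unit obeying Equation (8.3.1) can be sketched in a few lines of Python (a minimal illustration; the sigmoid form *f*(*net*) = 1/(1 + e^(−λ·*net*)) and the function names are our assumptions):

```python
import math
import random

def stochastic_unit(net, lam, rng=random):
    """Sample a bipolar state for a stochastic unit per Equation (8.3.1).

    P(x = +1) = f(net) = 1 / (1 + exp(-lam * net)), where lam = 1/T is
    the reciprocal pseudo-temperature (sigmoid form assumed).
    """
    p_plus = 1.0 / (1.0 + math.exp(-lam * net))
    return 1 if rng.random() < p_plus else -1
```

As `lam` grows (T → 0) the unit approaches the deterministic signum rule x = sgn(*net*); at high temperature (small `lam`) it approaches a fair coin flip regardless of the net input.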

**8.3.1 Global Convergence in a Stochastic Recurrent
Neural Net: The Boltzmann Machine**

**8.3.2 Learning in Boltzmann Machines**

In the following, statistical mechanics ideas are
extended to learning in stochastic recurrent networks or "Boltzmann
learning" (
Hinton and Sejnowski, 1983, 1986;
Ackley et al.,
1985). These networks consist of *n* arbitrarily interconnected
stochastic units, where the state *x**i*
of the *i*th unit is +1 or -1
with probability *f*(*net**i* *x**i*)
as in Equation (8.3.1). The units are divided into visible and
hidden units as shown in Figure 8.3.1. The hidden units have
no direct inputs from, nor do they supply direct outputs to, the
outside world. The visible units may (but need not) be further
divided into input units and output units. The units are interconnected
in an arbitrary way, but whatever the interconnection pattern,
all connections must be symmetric, *w**ij* = *w**ji*,
with *w**ii* = 0.
The net activity at unit *i* is given by $net_i = \sum_{j \neq i} w_{ij} x_j$,
where thresholds (biases) are omitted for convenience and self-feedback
is not allowed. This leads to an energy function for this net
given by

$$E(\mathbf{x}) = -\frac{1}{2} \sum_{i} \sum_{j \neq i} w_{ij}\, x_i x_j \qquad (8.3.6)$$

Figure 8.3.1. A stochastic net with visible and hidden
units. Connections between any two units (if they exist) must
be symmetric. The visible units may be further divided into input
and output units as illustrated.

whose minima are the stable states **x***
characterized by $x_i^* = \mathrm{sgn}(net_i)$,
*i* = 1, 2, ..., *n*. According
to the discussion in Section 8.3.1, and because of the existence
of an energy function, we find that the states of the present
stochastic net are governed by the Boltzmann-Gibbs distribution
of Equation (8.2.1), which may be adapted for the present network/energy
function as

$$P(\mathbf{x}) = \frac{e^{-E(\mathbf{x})/T}}{Z}, \qquad Z = \sum_{\mathbf{x}' \in S} e^{-E(\mathbf{x}')/T} \qquad (8.3.7)$$

where *S* = {-1, +1}^*n*
is the set of all possible states. Thus far, we have designed
a stable stochastic net (Boltzmann machine) which is capable of
reaching the global minimum (or a good suboptimal minimum) of
its energy function by slowly decreasing the pseudo-temperature
*T* = 1/λ, starting from a
sufficiently high temperature. Therefore, we may view this net
as an extension of the stochastic Hopfield net to include hidden
units. The presence of hidden units has the advantage of theoretically
allowing the net to represent (learn) arbitrary mappings/associations.
In Boltzmann learning, the weights *w**ij*
are adjusted to give the states of the visible units a particular
desired probability distribution.
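For a toy network, the Boltzmann-Gibbs distribution of Equation (8.3.7) can be computed exactly by enumerating all 2^*n* states (a sketch under our notational assumptions; `energy` implements the form of Equation (8.3.6)):

```python
import itertools
import math

def energy(x, w):
    """E(x) = -1/2 * sum_i sum_j w_ij x_i x_j, per Equation (8.3.6);
    w is assumed symmetric with zero diagonal."""
    n = len(x)
    return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def boltzmann(w, T):
    """Exact Boltzmann-Gibbs distribution, Equation (8.3.7), by
    enumerating all states in {-1, +1}^n (feasible only for toy sizes)."""
    states = list(itertools.product([-1, 1], repeat=len(w)))
    weights = [math.exp(-energy(x, w) / T) for x in states]
    Z = sum(weights)
    return {x: wt / Z for x, wt in zip(states, weights)}
```

For three mutually excitatory units, lowering *T* from 1.0 to 0.1 moves essentially all probability mass onto the two minimum-energy states (all +1 and all -1), which is precisely the behavior that simulated annealing exploits.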

Next, the derivation of a Boltzmann learning rule
is presented. This rule is derived by minimizing a measure of
difference between the probability of finding the states of visible
units in the freely running net and a set of desired probabilities
for these states. Before proceeding any further, we denote the
states of the visible units by the activity pattern α and those
of the hidden units by β. Let *N* and *K* be the numbers
of visible and hidden units, respectively. Then, the set *A*
of all visible patterns α has a total of 2^*N*
members (state configurations), and the set *G* of all hidden
patterns β has 2^*K*
members. The vector **x** still represents the state of the
whole network and it belongs to the set of 2^(*N*+*K*) = 2^*n*
possible network states. The probability *P*(α) of finding
the visible units in state α, irrespective of β, is then obtained
as

$$P(\alpha) = \sum_{\beta} P(\alpha, \beta) = \frac{1}{Z} \sum_{\beta} e^{-E(\alpha, \beta)/T} \qquad (8.3.8)$$

Here, *P*(**x**) = *P*(α, β)
denotes the joint probability that the visible units are in state
α and the hidden units are in state β, given that the network is
operating in its free-running condition. Also, *E*(**x**) = *E*(α, β)
is the energy of the network when the visible units are in state
α and the hidden units are jointly in state β. The term *Z*
is the denominator in Equation (8.3.7) and should be interpreted
as $Z = \sum_{\alpha, \beta} e^{-E(\alpha, \beta)/T}$. Equation (8.3.8) gives the actual
probability *P*(α) of finding the visible units in state α
in the freely running network at "thermal" equilibrium.
This probability is determined by the weights *w**ij*.
Now, assume that we are given a set of desired probabilities
*R*(α), independent of the *w**ij*'s,
for the visible states. Then, the objective is to bring the distribution
*P*(α) as close as possible to *R*(α) by
adjusting the *w**ij*'s.
A suitable measure of the difference between *P*(α) and *R*(α)
is the relative entropy *H* (see Section 5.2.6)

$$H = \sum_{\alpha} R(\alpha) \ln \frac{R(\alpha)}{P(\alpha)} \qquad (8.3.9)$$

which is positive or zero (*H* is zero only
if *R*(α) = *P*(α) for all α). Therefore, we
may arrive at a learning equation by performing gradient descent
on *H*:

$$\Delta w_{ij} = -\eta \frac{\partial H}{\partial w_{ij}} \qquad (8.3.10)$$

where η is a small positive constant. Using Equations
(8.3.6) through (8.3.8) and recalling that *w**ij* = *w**ji*,
we find

$$\frac{\partial H}{\partial w_{ij}} = -\frac{1}{T} \sum_{\alpha} \frac{R(\alpha)}{P(\alpha)} \left[ \sum_{\beta} x_i^{\alpha\beta} x_j^{\alpha\beta}\, P(\alpha, \beta) - P(\alpha) \sum_{\mathbf{x}} x_i x_j\, P(\mathbf{x}) \right] \qquad (8.3.11)$$

where $x_i^{\alpha\beta}$ ($x_j^{\alpha\beta}$)
should be interpreted as the state of unit *i* (*j*),
given that the visible units are in state α and the hidden units
are jointly in state β. Next, using Equation (8.3.8) and noting
that the quantity $\sum_{\mathbf{x}} x_i x_j P(\mathbf{x})$ is the average <*x**i* *x**j*>,
gives

$$\frac{\partial H}{\partial w_{ij}} = -\frac{1}{T} \left[ \sum_{\alpha} \sum_{\beta} \frac{R(\alpha)}{P(\alpha)}\, P(\alpha, \beta)\, x_i^{\alpha\beta} x_j^{\alpha\beta} - \langle x_i x_j \rangle \right] \qquad (8.3.12)$$

Thus, substituting Equation (8.3.12) in (8.3.10)
leads to the Boltzmann learning rule:

$$\Delta w_{ij} = \eta' \left[ \langle x_i x_j \rangle^{\text{clamped}} - \langle x_i x_j \rangle \right] \qquad (8.3.13)$$

where η' = η/*T* is used and

$$\langle x_i x_j \rangle^{\text{clamped}} = \sum_{\alpha} \sum_{\beta} R(\alpha)\, P(\beta|\alpha)\, x_i^{\alpha\beta} x_j^{\alpha\beta} \qquad (8.3.14)$$

represents the value of <*x**i* *x**j*>
when the visible units are clamped in state α, averaged over the α's
according to their probabilities *R*(α). Note that in Equation
(8.3.14), the term *R*(α)*P*(α, β)/*P*(α) was replaced by *R*(α)*P*(β|α)
according to Bayes' rule.

The first term in the right hand side of Equation
(8.3.13) is essentially a Hebbian learning term, with the visible
units clamped, while the second term corresponds to anti-Hebbian
learning with the system free-running. Note that learning converges
when the free unit-unit correlations are equal to the clamped
ones. It is very important that the correlations in the Boltzmann
learning rule be computed when the system is in thermal equilibrium
at temperature *T* = 1/λ > 0,
since the derivation of this rule hinges on the Boltzmann-Gibbs
distribution of Equation (8.3.7). At equilibrium, the state **x**
fluctuates and we measure the correlations <*x**i* *x**j*>
by taking a time average of *x**i* *x**j*.
This must be done twice: once with the visible units clamped
in each of their states α for which *R*(α) is
non-zero, and once with the visible units unclamped. Thermal equilibrium
must be reached for each of these computations.
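The time-averaging procedure just described can be sketched as heat-bath (Glauber) sampling. One assumption to flag: for the sampled states to equilibrate to the distribution (8.3.7) under the energy convention (8.3.6), the update probability below carries a factor of 2 in the exponent, since flipping *x**i* changes that energy by 2 *x**i* *net**i*; the function names and clamping interface are ours.

```python
import math
import random

def glauber_sweep(x, w, T, clamped, rng):
    """One asynchronous sweep of heat-bath (Glauber) updates.

    Units listed in `clamped` are held fixed.  The factor 2 in the
    exponent makes the stationary distribution the Boltzmann-Gibbs
    distribution for E(x) = -1/2 sum_ij w_ij x_i x_j, because flipping
    x_i changes that energy by 2 * x_i * net_i.
    """
    n = len(x)
    for i in range(n):
        if i in clamped:
            continue
        net = sum(w[i][j] * x[j] for j in range(n))
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * net / T))
        x[i] = 1 if rng.random() < p_plus else -1

def correlations(w, T, clamped, sweeps, rng):
    """Estimate <x_i x_j> at temperature T by time-averaging.

    `clamped` maps unit index -> fixed bipolar value (empty dict for the
    free-running phase).  The first `sweeps` sweeps are discarded as
    burn-in so that averages are taken near "thermal" equilibrium.
    """
    n = len(w)
    x = [clamped[i] if i in clamped else rng.choice([-1, 1]) for i in range(n)]
    corr = [[0.0] * n for _ in range(n)]
    for s in range(2 * sweeps):
        glauber_sweep(x, w, T, clamped, rng)
        if s >= sweeps:
            for i in range(n):
                for j in range(n):
                    corr[i][j] += x[i] * x[j] / sweeps
    return corr
```

For a two-unit net with *w*01 = *w*10 = 1 at *T* = 1, the free-running estimate converges to the exact value <*x*0 *x*1> = tanh(1), and clamping unit 0 to +1 gives the same value for the remaining unit, which can serve as a sanity check of the sampler.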

As the reader may have already suspected by examining
Equations (8.3.13) and (8.3.14), Boltzmann learning is very computationally
intensive. Usually, one starts with a high temperature (very
small λ) and chooses a cooling schedule. At each of these temperatures
the network is allowed to follow its stochastic dynamics according
to Equation (8.3.1) or, equivalently, many units are sampled and
are updated according to Equation (8.3.4). The temperature is
lowered slowly according to the preselected schedule until *T*
approaches zero and equilibrium is reached. This simulated annealing
search must be repeated with clamped visible units in each desired
pattern and with unclamped visible units. In the computation
of <*x**i* *x**j*>^clamped,
the visible units are clamped to randomly drawn patterns from the training
set according to a given probability distribution *R*(α).
For each such training pattern, the network seeks equilibrium
following the same annealing schedule. The weights are updated
only after enough training patterns are taken. This whole process
is repeated many times to achieve convergence to a good set of
weights *w**ij*.
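Boltzmann learning is expensive precisely because the two averages in Equation (8.3.13) must be estimated by annealed sampling. On a toy network, however, both can be computed exactly by enumeration, which exposes what the sampling is estimating. The sketch below (our function names; exact enumeration replaces the simulated-annealing estimates described above) performs one weight update and returns the relative entropy *H* of Equation (8.3.9) for monitoring:

```python
import itertools
import math

def boltzmann_learning_step(w, T, n_visible, R, eta):
    """One exact Boltzmann learning step, Equation (8.3.13), on a toy net.

    The clamped and free correlations are computed by exhaustive
    enumeration (feasible only for tiny n) instead of annealed sampling.
    Returns the relative entropy H of Equation (8.3.9) for the weights
    as they were on entry.
    """
    n = len(w)
    states = list(itertools.product([-1, 1], repeat=n))
    def E(x):
        return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    expw = {x: math.exp(-E(x) / T) for x in states}
    Z = sum(expw.values())
    P = {x: v / Z for x, v in expw.items()}                  # Eq. (8.3.7)
    Pv = {}                                                  # P(alpha), Eq. (8.3.8)
    for x, p in P.items():
        Pv[x[:n_visible]] = Pv.get(x[:n_visible], 0.0) + p
    free = [[sum(P[x] * x[i] * x[j] for x in states) for j in range(n)]
            for i in range(n)]
    clamped = [[0.0] * n for _ in range(n)]                  # Eq. (8.3.14)
    for alpha, r in R.items():
        for x in states:
            if x[:n_visible] == alpha:
                pb = P[x] / Pv[alpha]                        # P(beta|alpha)
                for i in range(n):
                    for j in range(n):
                        clamped[i][j] += r * pb * x[i] * x[j]
    for i in range(n):
        for j in range(n):
            if i != j:                                       # keep w_ii = 0
                w[i][j] += eta * (clamped[i][j] - free[i][j])
    return sum(r * math.log(r / Pv[a]) for a, r in R.items() if r > 0)
```

Repeated calls drive *H* toward zero as *P*(α) approaches the desired distribution *R*(α); for example, three units (two visible, one hidden) trained on *R* assigning probability 0.5 to each of (+1, +1) and (-1, -1) develop a positive *w*01.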

The learning rule described above is compatible
with pattern completion, in which a trained net is expected to
fill in missing bits of a partial pattern when such a pattern
is clamped on the visible nodes. This is reminiscent of retrieval
in the dynamic associative memories of Chapter 7. Note that the
presence of hidden units allows for high memory capacity. During
retrieval, the weights derived in the training phase are held
fixed and simulated annealing-based global retrieval is used as
discussed in Section 8.3.1. Here, the visible units are clamped
at corresponding known bits of a noisy/partial input pattern.
Starting from a high temperature, the net follows the stochastic
dynamics in Equation (8.3.1) or equivalently, state transitions
are made according to Equation (8.3.4). The temperature is gradually
lowered according to an appropriate schedule until the dynamics
become deterministic at *T* = 0 and convergence
to the "closest" pattern (global solution) is (hopefully)
achieved.
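Retrieval by annealing can be sketched as follows (a hypothetical `anneal_retrieve` with an assumed cooling schedule; the inner loop follows the stochastic rule of Equation (8.3.1) with λ = 1/*T*, and a final pass takes the *T* → 0 limit *x**i* = sgn(*net**i*)):

```python
import math
import random

def anneal_retrieve(w, known, schedule, rng, sweeps_per_T=20):
    """Pattern completion: clamp the known bits, anneal the rest.

    `known` maps unit index -> clamped bipolar value; `schedule` is a
    decreasing sequence of temperatures (an assumed choice).
    """
    n = len(w)
    x = [known[i] if i in known else rng.choice([-1, 1]) for i in range(n)]
    for T in schedule:
        for _ in range(sweeps_per_T):
            for i in range(n):
                if i in known:
                    continue
                net = sum(w[i][j] * x[j] for j in range(n))
                p_plus = 1.0 / (1.0 + math.exp(-net / T))  # Eq. (8.3.1), lam = 1/T
                x[i] = 1 if rng.random() < p_plus else -1
    for i in range(n):  # deterministic T -> 0 pass: x_i = sgn(net_i)
        if i not in known:
            net = sum(w[i][j] * x[j] for j in range(n))
            x[i] = 1 if net >= 0 else -1
    return x
```

For instance, with Hebbian weights *w**ij* = *x**i* *x**j*/*n* storing a single bipolar pattern, clamping three of five bits to their known values recovers the remaining two.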

We may also extend Boltzmann learning to handle
the association of input/output pairs of patterns, as in supervised
learning in multi-layer perceptron nets. Here, we need to distinguish
between two types of visible units: input and output. Let us
represent the states of the input units, the output units, and the hidden units
by α, γ, and β, respectively. Then we want the network
to learn the associations α → γ. We may now pose the problem
as follows: for each α we want to adjust the *w**ij*'s
such that the conditional probability distribution *P*(γ|α)
is as close as possible to a desired distribution *R*(γ|α).
Assuming that the α's occur with probabilities *p*(α), a suitable
error measure is

$$H = \sum_{\alpha} p(\alpha) \sum_{\gamma} R(\gamma|\alpha) \ln \frac{R(\gamma|\alpha)}{P(\gamma|\alpha)} \qquad (8.3.15)$$

which leads to the Boltzmann learning rule (Hopfield,
1987)

$$\Delta w_{ij} = \eta' \left[ \langle x_i x_j \rangle^{\alpha\gamma\text{-clamped}} - \langle x_i x_j \rangle^{\alpha\text{-clamped}} \right] \qquad (8.3.16)$$

In Equation (8.3.16), both the inputs and outputs
are clamped in the Hebbian term, while only the input states are
clamped in the anti-Hebbian (unlearning) term, with averages over
the inputs taken in both cases. Examples of applications of learning
Boltzmann machines can be found in
Ackley et al. (1985),
Parks
(1987),
Sejnowski et al. (1986),
Kohonen et al. (1988), and
Lippmann
(1989). Theoretically, Boltzmann machines with learning may outperform
gradient-based learning such as backprop. However, the demanding
computational overhead associated with these machines would usually
render them impractical in software simulations. Specialized
electronic (
Alspector and Allen, 1987) and optoelectronic (
Farhat,
1987;
Ticknor and Barrett, 1987) hardware has been developed for
the Boltzmann machine. However, such hardware implementations
are still experimental in nature. Variations and related networks
can be found in
Derthick (1984),
Hopfield et al. (1983),
Smolensky
(1986),
van Hemmen et al. (1990),
Galland and Hinton (1991), and
Apolloni and De Falco (1991).