In a stochastic neural network the units have non-deterministic
activation functions as discussed in Section 3.1.6. Here, units
behave stochastically with the output (assumed to be bipolar binary)
of the ith unit taking the value xi = +1
with probability f(neti)
and value xi = -1
with probability 1 - f(neti),
where neti
is the weighted-sum (net input) of unit i and
P(xi) = f(neti xi) = 1 / [1 + exp(-2λ neti xi)]          (8.3.1)
where P(xi)
represents the probability distribution of xi.
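As a minimal illustration of Equation (8.3.1), the output of a single
stochastic bipolar unit may be simulated as in the following Python
sketch; the name lam is an assumed stand-in for the gain λ = 1/T
discussed in the next paragraph, not notation from the text.

    import numpy as np

    def sample_stochastic_unit(net, lam=1.0, rng=np.random.default_rng()):
        """Draw a bipolar output x in {-1, +1} with P(x = +1) = f(net),
        where f(net) = 1 / (1 + exp(-2*lam*net)) as in Equation (8.3.1)."""
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * lam * net))
        return 1 if rng.random() < p_plus else -1

For large lam (low temperature) the unit behaves almost like a
deterministic signum unit, while for small lam (high temperature) its
output is nearly random.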
There are several possible choices of f which could have
been made in Equation (8.3.1), but the choice of the sigmoid function
is motivated by statistical mechanics [Glauber, 1963; Little,
1974. See also Amari (1971)] where the units behave stochastically
exactly like the spins in an Ising model of a magnetic material
in statistical physics [for more details on the connections between
stochastic units and the Ising model, the reader is referred to
Hinton and Sejnowski (1983), Peretto (1984), Amit (1989), and
Hertz et al. (1991)]. Equation (8.3.1) may also be derived based
on observations relating to the stochastic nature of the post-synaptic
potential of biological neurons (Shaw and Vasudevan, 1974). Here,
the neuron may be approximated as a linear threshold gate (LTG)
[recall Equation (1.1.1)] with zero threshold, signum activation
function, and with the net input (post-synaptic potential) being
a Gaussian random variable as explored in Problem 8.3.1.
The parameter λ in Equation (8.3.1) controls the
steepness of the sigmoid f(net) at net = 0. We may think of λ as
the reciprocal of a pseudo-temperature T; that is, λ = 1/T.
When the "temperature" T approaches zero, the sigmoid
becomes a step function and the stochastic unit becomes deterministic.
As T increases, this sharp threshold is "softened"
in a stochastic way, thus making the unit stochastic. Next, it
is shown how stochastic neural nets with units described by Equation
(8.3.1) and with controllable temperature T form a natural
substrate for the implementation of simulated annealing-based
optimization.
8.3.1 Global Convergence in a Stochastic Recurrent
Neural Net: The Boltzmann Machine
Since simulated annealing is a global optimization
method, one might be tempted to consider its use to enhance the
convergence to optimal minima of gradient-type nets such as the
Hopfield net used for combinatorial optimization. In combinatorial
optimization problems one is interested in finding a solution
vector x ∈ {-1, 1}^n
(or {0, 1}^n)
which best minimizes an objective function y(x).
When y(x) is quadratic, the continuous Hopfield
net may be used to search for local minimizers of y(x).
This is achieved by first mapping y(x) onto the
quadratic energy function of the Hopfield net and then using the
net as a gradient descent algorithm to minimize E(x),
thus minimizing y(x) (see Section 7.5 for details).
Rather than using the suboptimal Hopfield net approach,
the desired global minimum of y(x) may be obtained
by a direct application of the simulated annealing method of Section
8.2. However, the promise of fast analog hardware implementations
of the Hopfield net leads us to take another look at the network
optimization approach, but with modifications which improve convergence
to global solutions. The following is an optimization method
based on an efficient way of incorporating simulated annealing
search in a discrete Hopfield net. This method, though, is only
suitable for quadratic functions.
Consider the discrete-state discrete-time recurrent
net (discrete Hopfield model) with ith unit dynamics described
by Equation (7.1.37), repeated here for convenience:

xi(k + 1) = sgn[neti(k)] = sgn[ Σj wij xj(k) ],   i = 1, 2, ..., n          (8.3.2)
This deterministic network has a quadratic Liapunov
function (energy function) if its weight matrix is symmetric with
positive (or zero) diagonal elements and if serial state update
is used (recall Problem 7.1.18). In other words, this net will
always converge to one of the local minima of its energy function
E(x) given by

E(x) = -(1/2) Σi Σj wij xi xj          (8.3.3)
Next, consider a stochastic version of the above
net, referred to as the stochastic Hopfield net, where we replace
the deterministic threshold units by stochastic units according
to Equation (8.3.1). By employing Equation (8.3.1) and assuming
"thermal" equilibrium one may find the transition probability
from xi
to -xi
(i.e., the probability to flip unit i from +1 to -1
or vice versa) as

P(xi → -xi) = 1 - f(neti xi) = 1 / [1 + exp(ΔEi / T)]          (8.3.4)
The rightmost term in Equation (8.3.4) is obtained
by using Equation (8.3.3), which gives

ΔEi = E(x1, ..., -xi, ..., xn) - E(x1, ..., xi, ..., xn) = 2 xi neti .

The transition probabilities in Equation (8.3.4) give a complete
description of the stochastic sequence of states in the stochastic
Hopfield net. Note that if T = 0, the probability
of a transition which increases the energy E(x)
becomes zero and that of a transition which decreases E(x)
becomes one; hence, the net reduces to the stable deterministic
Hopfield net. On the other hand, for T > 0
a transition which increases E(x) is allowed, but
with a probability which is smaller than that of a transition
which decreases E(x). This last observation, coupled
with the requirement that one and only one stochastic unit (chosen
randomly and uniformly) is allowed to flip its state at a given
time, guarantees that "thermal" equilibrium will be
reached for any T > 0.
It may now be concluded that the serially updated
stochastic Hopfield net with the stochastic dynamics in Equation
(8.3.1) is stable (in an average sense) for T ≥ 0
as long as its interconnection matrix is symmetric with positive
diagonal elements. In other words, this stochastic net will reach
an equilibrium state where the average value of E is a
constant when T is held fixed for a sufficiently long period
of time. Now, if a slowly decreasing temperature T is
used with an initially large value, the stochastic Hopfield net
becomes equivalent to a simulated annealing algorithm; i.e., when
initialized at a random binary state x(0) the net will
perform a stochastic global search seeking the global minimum
of E(x). As discussed in Section 8.2, at the beginning
of the computation, a higher temperature should be used so that
it is easier for the states to escape from local minima. Then,
as the computation proceeds, the temperature is gradually decreased
according to a pre-specified cooling schedule. Finally, as the
temperature approaches zero, the state, now placed (hopefully)
near the global minimum, will converge to this minimum.
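The following Python sketch illustrates one way this annealing search
could be organized for the quadratic energy E(x) of Equation (8.3.3);
the initial temperature, geometric cooling factor, and sweep counts
are illustrative assumptions rather than values prescribed here.

    import numpy as np

    def anneal_stochastic_hopfield(W, T0=10.0, T_min=1e-3, cool=0.95,
                                   sweeps_per_T=20, rng=np.random.default_rng()):
        """Serial-update stochastic Hopfield net with a slowly decreasing
        temperature. W is symmetric with zero diagonal; flips are accepted
        with the probability of Equation (8.3.4)."""
        n = W.shape[0]
        x = rng.choice([-1, 1], size=n)              # random initial state x(0)
        T = T0
        while T > T_min:
            for _ in range(sweeps_per_T * n):
                i = rng.integers(n)                  # one randomly chosen unit
                dE = 2.0 * x[i] * (W[i] @ x)         # energy change if unit i flips
                if rng.random() < 1.0 / (1.0 + np.exp(np.clip(dE / T, -50.0, 50.0))):
                    x[i] = -x[i]                     # accept the flip
            T *= cool                                # pre-specified cooling schedule
        return x                                     # (hopefully) near the global minimum of E

Starting at a high T0 lets the state escape local minima easily; as T
falls toward zero the dynamics become deterministic and the state
settles into the nearest remaining minimum, as described above.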
The above stochastic net is usually referred to
as the Boltzmann machine because, at equilibrium, the probability
of the states of the net is given by the Boltzmann-Gibbs distribution
of Equation (8.2.1), or equivalently

P(x) / P(x') = exp{ -[E(x) - E(x')] / T }          (8.3.5)

where x and x' are two states in {-1, 1}^n
which differ in only one bit. Equation (8.3.5) can be easily
derived by employing Equations (8.3.1) and (8.3.3).
8.3.2 Learning in Boltzmann Machines
In the following, statistical mechanics ideas are
extended to learning in stochastic recurrent networks or "Boltzmann
learning" (
Hinton and Sejnowski, 1983, 1986;
Ackley et al.,
1985). These networks consist of n arbitrarily interconnected
stochastic units where the state xi
of the ith unit is 1 or -1
with probability f(neti xi)
as in Equation (8.3.1). The units are divided into visible and
hidden units as shown in Figure 8.3.1. The hidden units have
no direct inputs from, nor do they supply direct outputs to, the
outside world. The visible units may (but need not) be further
divided into input units and output units. The units are interconnected
in an arbitrary way, but whatever the interconnection pattern,
all connections must be symmetric, wij = wji,
with wii = 0.

Figure 8.3.1. A stochastic net with visible and hidden
units. Connections between any two units (if they exist) must
be symmetric. The visible units may be further divided into input
and output units as illustrated.

The net activity at unit i is given by

neti = Σj wij xj ,

where thresholds (biases) are omitted for convenience and self-feedback
is not allowed. This leads to an energy function for this net
given by

E(x) = -(1/2) Σi Σj wij xi xj          (8.3.6)

whose minima are the stable states x*
characterized by

xi* = sgn(neti*),   i = 1, 2, ..., n.

According to the discussion in Section 8.3.1, and because of the existence
of an energy function, we find that the states of the present
stochastic net are governed by the Boltzmann-Gibbs distribution
of Equation (8.2.1), which may be adapted for the present network/energy
function as

P(x) = exp[-E(x)/T] / Σx'∈Ω exp[-E(x')/T]          (8.3.7)

where Ω = {-1, +1}^n
is the set of all possible states. Thus far, we have designed
a stable stochastic net (Boltzmann machine) which is capable of
reaching the global minimum (or a good suboptimal minimum) of
its energy function by slowly decreasing the pseudo-temperature
T = 1/λ starting from a
sufficiently high temperature. Therefore, we may view this net
as an extension of the stochastic Hopfield net to include hidden
units. The presence of hidden units has the advantage of theoretically
allowing the net to represent (learn) arbitrary mappings/associations.
In Boltzmann learning, the weights wij
are adjusted to give the states of the visible units a particular
desired probability distribution.
Next, the derivation of a Boltzmann learning rule
is presented. This rule is derived by minimizing a measure of
difference between the probability of finding the states of visible
units in the freely running net and a set of desired probabilities
for these states. Before proceeding any further, we denote the
states of the visible units by the activity pattern α and those
of the hidden units by β. Let N and K be the numbers
of visible and hidden units, respectively. Then, the set A
of all visible patterns has a total of 2^N
members (state configurations), and the set G of all hidden
patterns has 2^K
members. The vector x still represents the state of the
whole network, and it belongs to the set Ω of 2^(N+K) = 2^n
possible network states. The probability P(α) of finding
the visible units in state α irrespective of β is then obtained
as

P(α) = Σβ P(α, β) = (1/Z) Σβ exp[-E(α, β)/T]          (8.3.8)
Here, P(x) = P(α, β)
denotes the joint probability that the visible units are in state
α and the hidden units are in state β, given that the network is
operating in its free-running condition. Also, E(x) = E(α, β)
is the energy of the network when the visible units are in state
α and the hidden units are jointly in state β. The term Z
is the denominator in Equation (8.3.7) and should be interpreted
as

Z = Σx∈Ω exp[-E(x)/T] = Σα Σβ exp[-E(α, β)/T].

Equation (8.3.8) gives the actual
probability P(α) of finding the visible units in state α
in the freely running network at "thermal" equilibrium.
This probability is determined by the weights wij.
Now, assume that we are given a set of desired probabilities
R(α), independent of the wij's,
for the visible states. Then, the objective is to bring the distribution
P(α) as close as possible to R(α) by
adjusting the wij's.
A suitable measure of the difference between P(α) and R(α)
is the relative-entropy H (see Section 5.2.6)

H = Σα R(α) ln[ R(α) / P(α) ]          (8.3.9)
which is positive or zero (H is zero only
if R(α) = P(α) for all α). Therefore, we
may arrive at a learning equation by performing gradient descent
on H:

Δwij = -η ∂H/∂wij          (8.3.10)

where η is a small positive constant. Using Equations
(8.3.6) through (8.3.8) and recalling that wij = wji,
we find

∂H/∂wij = -(1/T) Σα [R(α)/P(α)] { Σβ P(α, β) xi xj - P(α) Σx∈Ω P(x) xi xj }          (8.3.11)

where xi (xj)
should be interpreted as the state of unit i (j),
given that the visible units are in state α and the hidden units
are jointly in state β. Next, using Equation (8.3.8) and noting
that the quantity Σx∈Ω P(x) xi xj
is the average <xi xj>,
we obtain

∂H/∂wij = -(1/T) [ Σα Σβ R(α) [P(α, β)/P(α)] xi xj - <xi xj> ]          (8.3.12)

Thus, substituting Equation (8.3.12) in (8.3.10)
leads to the Boltzmann learning rule:

Δwij = (η/T) [ <xi xj>clamped - <xi xj> ]          (8.3.13)

where

<xi xj>clamped = Σα Σβ R(α) P(β|α) xi xj          (8.3.14)
represents the value of <xi xj>
when the visible units are clamped in state α, averaged over the α's
according to their probabilities R(α). Note that in Equation
(8.3.14), Bayes' rule was used and the term P(α, β)/P(α) of Equation
(8.3.12) was replaced by the conditional probability P(β|α).
The first term on the right-hand side of Equation
(8.3.13) is essentially a Hebbian learning term, with the visible
units clamped, while the second term corresponds to anti-Hebbian
learning with the system free running. Note that learning converges
when the free unit-unit correlations are equal to the clamped
ones. It is very important that the correlations in the Boltzmann
learning rule be computed when the system is in thermal equilibrium
at temperature T = 1/λ > 0,
since the derivation of this rule hinges on the Boltzmann-Gibbs
distribution of Equation (8.3.7). At equilibrium, the state x
fluctuates and we measure the correlations <xi xj>
by taking a time average of xi xj.
This must be done twice; once with the visible units clamped
in each of their states α for which R(α) is
non-zero, and once with the α's unclamped. Thermal equilibrium
must be reached for each of these computations.
As the reader may have already suspected by examining
Equations (8.3.13) and (8.3.14), Boltzmann learning is very computationally
intensive. Usually, one starts with a high temperature (very
small λ) and chooses a cooling schedule. At each of these temperatures
the network is allowed to follow its stochastic dynamics according
to Equation (8.3.1) or, equivalently, many units are sampled and
are updated according to Equation (8.3.4). The temperature is
lowered slowly according to the preselected schedule until T
approaches zero and equilibrium is reached. This simulated annealing
search must be repeated with clamped visible units in each desired
pattern and with unclamped visible units. In the computation
of <xi xj>clamped,
the states are clamped to randomly drawn patterns from the training
set according to a given probability distribution R(α).
For each such training pattern, the network seeks equilibrium
following the same annealing schedule. The weights are updated
only after enough training patterns are taken. This whole process
is repeated many times to achieve convergence to a good set of
weights wij.
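One pass of this procedure is outlined in the Python sketch below.
The helper equilibrium_correlations and its sampling parameters are
illustrative assumptions; for brevity it equilibrates at a single
temperature with a burn-in period rather than following a full cooling
schedule. The weight update itself is the rule of Equation (8.3.13).

    import numpy as np

    def equilibrium_correlations(W, clamp_idx, clamp_val, T=1.0, n_samples=200,
                                 burn_in=500, rng=np.random.default_rng()):
        """Estimate <xi xj> at temperature T by Gibbs sampling; units listed
        in clamp_idx are held fixed at clamp_val (empty list = free running)."""
        n = W.shape[0]
        x = rng.choice([-1, 1], size=n)
        x[clamp_idx] = clamp_val
        free = [i for i in range(n) if i not in set(clamp_idx)]
        corr = np.zeros((n, n))
        for t in range(burn_in + n_samples):
            for _ in range(len(free)):
                i = free[rng.integers(len(free))]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (W[i] @ x) / T))
                x[i] = 1 if rng.random() < p_plus else -1
            if t >= burn_in:
                corr += np.outer(x, x)
        return corr / n_samples

    def boltzmann_learning_step(W, patterns, visible_idx, eta=0.05, T=1.0):
        """One weight update of Equation (8.3.13): clamped minus free correlations.
        patterns: visible states assumed drawn according to R(α)."""
        clamped = np.mean([equilibrium_correlations(W, visible_idx, p, T)
                           for p in patterns], axis=0)
        free = equilibrium_correlations(W, [], [], T)
        dW = (eta / T) * (clamped - free)
        np.fill_diagonal(dW, 0.0)                    # keep wii = 0
        return W + (dW + dW.T) / 2.0                 # keep W symmetric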
The learning rule described above is compatible
with pattern completion, in which a trained net is expected to
fill in missing bits of a partial pattern when such a pattern
is clamped on the visible nodes. This is reminiscent of retrieval
in the dynamic associative memories of Chapter 7. Note that the
presence of hidden units allows for high memory capacity. During
retrieval, the weights derived in the training phase are held
fixed and simulated annealing-based global retrieval is used as
discussed in Section 8.3.1. Here, the visible units are clamped
at corresponding known bits of a noisy/partial input pattern.
Starting from a high temperature, the net follows the stochastic
dynamics in Equation (8.3.1) or equivalently, state transitions
are made according to Equation (8.3.4). The temperature is gradually
lowered according to an appropriate schedule until the dynamics
become deterministic at T = 0 and convergence
to the "closest" pattern (global solution) is (hopefully)
achieved.
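A retrieval pass of this kind might look as follows (Python sketch in
the same style as the annealing routine above; known_idx and known_val
are assumed argument names for the clamped bit positions and values).
Only the unknown units are updated while the known bits stay fixed.

    import numpy as np

    def complete_pattern(W, known_idx, known_val, T0=5.0, T_min=1e-3, cool=0.9,
                         sweeps_per_T=20, rng=np.random.default_rng()):
        """Fill in the missing bits of a partial pattern using the trained
        (fixed) weights W and simulated-annealing retrieval."""
        n = W.shape[0]
        x = rng.choice([-1, 1], size=n)
        x[known_idx] = known_val                     # clamp the known bits
        unknown = [i for i in range(n) if i not in set(known_idx)]
        T = T0
        while T > T_min:
            for _ in range(sweeps_per_T * len(unknown)):
                i = unknown[rng.integers(len(unknown))]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (W[i] @ x) / T))
                x[i] = 1 if rng.random() < p_plus else -1
            T *= cool
        return x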
We may also extend Boltzmann learning to handle
the association of input/output pairs of patterns, as in supervised
learning in multi-layer perceptron nets. Here, we need to distinguish
between two types of visible units: input and output. Let us
represent the input units, the output units, and the hidden units
by the states α, γ, and β, respectively. Then we want the network
to learn the associations α → γ. We may now pose the problem
as follows: for each α we want to adjust the wij's
such that the conditional probability distribution P(γ|α)
is as close as possible to a desired distribution R(γ|α).
Assuming that the α's occur with probabilities p(α), a suitable
error measure is

H = Σα p(α) Σγ R(γ|α) ln[ R(γ|α) / P(γ|α) ]          (8.3.15)

which leads to the Boltzmann learning rule (Hopfield,
1987)

Δwij = (η/T) [ <xi xj>(α, γ clamped) - <xi xj>(α clamped) ]          (8.3.16)
In Equation (8.3.16), both the inputs and outputs
are clamped in the Hebbian term, while only the input states are
clamped in the anti-Hebbian (unlearning) term, with averages over
the inputs taken in both cases.
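In code, the only difference from the earlier learning sketch is which
units are clamped in the two phases. A minimal illustration, reusing
the assumed helper equilibrium_correlations defined above, is:

    import numpy as np

    def io_boltzmann_step(W, pairs, input_idx, output_idx, eta=0.05, T=1.0):
        """One update of Equation (8.3.16): inputs and outputs clamped in the
        Hebbian phase, inputs alone clamped in the anti-Hebbian phase.
        pairs: (input pattern, desired output pattern) drawn with probabilities p(α)."""
        both = np.mean([equilibrium_correlations(W, list(input_idx) + list(output_idx),
                                                 np.concatenate([a, g]), T)
                        for a, g in pairs], axis=0)
        inputs_only = np.mean([equilibrium_correlations(W, input_idx, a, T)
                               for a, g in pairs], axis=0)
        dW = (eta / T) * (both - inputs_only)
        np.fill_diagonal(dW, 0.0)
        return W + (dW + dW.T) / 2.0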
Examples of applications of learning
Boltzmann machines can be found in Ackley et al. (1985), Parks
(1987), Sejnowski et al. (1986), Kohonen et al. (1988), and Lippmann
(1989). Theoretically, Boltzmann machines with learning may outperform
gradient-based learning such as backprop. However, the demanding
computational overhead associated with these machines would usually
render them impractical in software simulations. Specialized
electronic (Alspector and Allen, 1987) and optoelectronic (Farhat,
1987; Ticknor and Barrett, 1987) hardware has been developed for
the Boltzmann machine. However, such hardware implementations
are still experimental in nature. Variations and related networks
can be found in Derthick (1984), Hopfield et al. (1983), Smolensky
(1986), van Hemmen et al. (1990), Galland and Hinton (1991), and
Apolloni and De Falco (1991).