In the previous section, a method for training multilayer
neural nets was described which uses a GA to search for optimal
weight configurations. Here, an alternative learning method is
described which performs global search for finding optimal targets
for hidden units based on a hybrid GA/gradient descent search
strategy. This method is a supervised learning method which is
suited for arbitrarily interconnected feedforward neural nets.
In the following, this hybrid learning method is described in
the context of a multilayer feedforward net having a single hidden
layer.

In Section 2.3, the universal approximation capabilities
of single hidden layer feedforward nets were established for a
wide variety of hidden unit activation functions, including the
threshold activation function. This implies that an arbitrary
nonlinearly separable mapping can always be decomposed into two
linearly separable mappings which are realized as the cascade
of two single-layer neural nets, as long as the first layer (hidden
layer) has a sufficient number of nonlinear units (e.g., sigmoids
or LTG's).

In supervised learning, a desired target vector
(pattern) is specified for each input vector in a given training
set. A linearly separable set of training input/target pairs
can be efficiently learned in a single layer net using the gradient-based
LMS or delta learning rules [See Equations (3.1.35) and (3.1.53)].
On the other hand, a general complex training set may not be
linearly separable, which necessitates the use of a hidden layer.
Thus, more sophisticated learning rules must be used, which are
usually far less efficient (slower) than the LMS or delta rules and,
as in the case of backprop, may not always lead to satisfactory
solutions. Ideally, one would like to "decouple" the
training of a multiple-layer network into the training of two
(or more) single layer networks. This could be done if some method
of finding an appropriate set of hidden unit activations could
be found. These hidden unit activations will be called hidden
targets because they can be used as target vectors to train the
first layer. This approach would be useful if it could be guaranteed
that the mapping from the inputs to the hidden targets and the mapping
from those hidden targets to the desired output targets are both
linearly separable. Now, once those hidden targets are found, efficient
learning of the weights can proceed independently for the hidden
and output layers using the delta rule. In fact, backprop may
be thought of as employing a dynamic version of the above method
where hidden targets are estimated according to Equation (5.1.10)
for each training pair. However, because of its gradient descent
nature, backprop's estimate of the hidden targets does not guarantee
finding an optimal hidden target configuration. The following is
a more efficient method for training multilayer feedforward neural
nets, which utilizes the above hidden target-based supervised
learning strategy. Here, a GA is used to "evolve" proper
hidden targets, and a gradient descent search (LMS or delta rule)
is used to learn optimal network interconnection weights (Song,
1992; Hassoun and Song, 1993 a, b).

**8.6.1 Hybrid GA/Gradient Descent Method for Feedforward
Multilayer Net Training**

Figure 8.6.1. A two-layer fully interconnected feedforward
neural network.

Now, if we are given a set of hidden target column
vectors {**h**1, **h**2, ..., **h***m*},
**h***j* ∈ {0, 1}*J*
(or {−1, +1}*J*),
such that the mappings **x***k* → **h***k*
and **h***k* → **d***k*,
for *k* = 1, 2, ..., *m*, are
linearly separable, then gradient descent-based search (e.g.,
perceptron rule, LMS, delta-rule, or Ho-Kashyap rule) may be employed
to independently and quickly learn the optimal weights for both
hidden and output layers. Initially, though, we do not know the
proper set {**h***k*}
of hidden targets which solves the problem. Therefore, a GA will
be used to evolve such a set of hidden targets. In other words,
a GA search is used to explore the space of possible hidden targets
{**h**} (hidden target space) and converge to a global solution
which renders the mappings **x** → **h** and **h** → **d**
linearly separable. Since the hidden targets are binary-valued,
a natural coding of a set of hidden targets is the string **s** = (**s**1 **s**2 ... **s***m*)
where **s***i*
is a string formed from the bits of the vector **h***i*.
Equivalently, we may represent the search "point" as
a *J* × *m* array (matrix) **H** = [**h**1 **h**2 ... **h***m*].
This representation is particularly useful in the multipoint
crossover described next.
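As a concrete sketch of this coding (the dimensions and variable names below are illustrative, not from the original):

```python
import numpy as np

J, m = 4, 8          # J hidden units, m training pairs (illustrative values)
rng = np.random.default_rng(0)

# A search "point": a J x m binary array H = [h_1 h_2 ... h_m],
# one column per hidden target vector.
H = rng.integers(0, 2, size=(J, m))

# The equivalent string coding s = (s_1 s_2 ... s_m), where each s_i
# consists of the bits of column h_i (column-major flattening).
s = H.flatten(order="F")
assert s.size == J * m
assert np.array_equal(s[:J], H[:, 0])  # s_1 holds the first hidden target h_1
```

Either view describes the same search point; the array form is the more convenient one for the column-wise multipoint crossover described next.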

A population of *M* random binary arrays {**H***j*},
*j* = 1, 2, ..., *M*, is generated
as the initial population of search "points." Each
array has an associated network labeled *j* whose architecture
is shown in Figure 8.6.1, with all *M* nets initialized with
the same set of random weights. The fitness of the *j*th
search point (array **H***j*)
is determined by the output SSE of network *j*:

$$E_j = \sum_{k=1}^{m} \sum_{l=1}^{L} \left( d_l^k - y_l^{j,k} \right)^2 \qquad (8.6.1)$$

Here, $y_l^{j,k}$ is the output of the *l*th output unit
in network *j* due to the input **x***k*.
Now, any one of a number of fitness functions may be used. Examples
are $f(\mathbf{H}_j) = -E_j$, $f(\mathbf{H}_j) = 1/E_j$, or even
$f(\mathbf{H}_j) = 1/(E_j + \varepsilon)$, where $\varepsilon$ is a very
small positive number. Different fitness functions may, however, lead
to different performance.
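The fitness evaluation can be sketched as follows; the function names are illustrative, and the $\varepsilon$-guarded form mirrors one of the fitness choices mentioned above:

```python
import numpy as np

def output_sse(Y, D):
    """E_j: sum of squared errors of network j's outputs Y against the
    desired targets D, both arrays of shape (L, m)."""
    return float(np.sum((D - Y) ** 2))

def fitness(E_j, eps=1e-6):
    """One possible fitness function: f = 1/(E_j + eps), with eps a very
    small positive number guarding against division by zero.
    f(H_j) = -E_j would be another valid choice."""
    return 1.0 / (E_j + eps)

# A lower output SSE yields a higher fitness:
D = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_good = np.array([[0.9, 0.1], [0.1, 0.9]])
Y_bad = np.array([[0.1, 0.9], [0.9, 0.1]])
assert fitness(output_sse(Y_good, D)) > fitness(output_sse(Y_bad, D))
```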

Initially, starting from random weight values and
random hidden targets, LMS is used to adapt the weights of the
hidden layer in each of the *M* networks subject to the training
set {**x***k*, **h***k*},
*k* = 1, 2, ..., *m*. Here,
the threshold activation is removed during weight adaptation and
the hidden units are treated as linear units. Alternatively,
an adaptive version of the Ho-Kashyap algorithm or the perceptron
rule may be applied directly to the hidden LTG's. Similarly,
the weights of the output layer units are adapted subject to the
training set {**h***k*, **d***k*},
independent of the first hidden layer.
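A minimal sketch of this decoupled training, assuming a set of hidden targets is already given. The XOR training set, the OR/AND hidden targets, the learning rate, and the omission of bias inputs are all illustrative assumptions, not details from the original:

```python
import numpy as np

def lms_train(X, T, lr=0.05, epochs=500, seed=0):
    """LMS (delta rule) training of one linear layer so that W @ X
    approximates T. X: (n, m) input columns; T: (units, m) target columns."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((T.shape[0], X.shape[0]))
    for _ in range(epochs):
        for k in range(X.shape[1]):
            x, t = X[:, k], T[:, k]
            W += lr * np.outer(t - W @ x, x)  # delta rule, linear units
    return W

# Illustrative data: m = 4 patterns, n = 2 inputs, J = 2 hidden targets.
X = np.array([[0., 0., 1., 1.], [0., 1., 0., 1.]])
H = np.array([[0., 1., 1., 1.], [0., 0., 0., 1.]])  # hidden targets (OR, AND)
D = np.array([[0., 1., 1., 0.]])                    # desired outputs (XOR)

W1 = lms_train(X, H)  # hidden layer trained on pairs {x_k, h_k}
W2 = lms_train(H, D)  # output layer trained on pairs {h_k, d_k}, independently
```

The two calls to `lms_train` are completely independent, which is the point of the decoupling: since XOR = OR − AND is a linear map of the chosen hidden targets, the output layer converges to an exact solution here.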

After the weights are updated, each network is tested
by performing feedforward computations and its fitness is computed.
In these computations, the outputs of the hidden units (as opposed
to the hidden targets) serve as the inputs to the output layer.
Next, the GA operators of reproduction, cross-over, and mutation
are applied to evolve the next generation of hidden target sets
{**H***j*}. In
reproduction, the hidden target sets **H***j*
with the highest fitness are duplicated and are entered into a
temporary pool for cross-over. Cross-over is applied with a probability
*P**c* (*P**c*
is set close to 1). A pair {**H***i*, **H***j*}
is selected randomly without replacement from the temporary pool
just generated. If a training pair {**x***k*, **d***k*}
is poorly learned by network *i* (network *j*) during
the above learning phase (i.e., if the output error due to this
pair is substantially larger than the average output error of
network *i* (network *j*) on the whole training set)
then the corresponding column **h***k*
of **H***i* is
replaced by the *k*th column of **H***j*.
Here, cross-over can affect multiple pairs of columns in the
hidden target arrays. The above reproduction and cross-over operations
differ from those employed by the standard GA, and are motivated
by empirical results (Hassoun and Song, 1993b). On the other
hand, the standard mutation operation is used here, where each
bit of the **H***i*'s,
after cross-over, is flipped with a probability *P**m*
(typically, *P**m* = 0.01
is used).
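The modified cross-over and the standard mutation described above can be sketched as follows. The threshold quantifying "substantially larger than the average error" is an assumption for illustration, since the text does not fix a value:

```python
import numpy as np

def column_crossover(H_i, H_j, errors_i, factor=2.0):
    """For each training pair k poorly learned by network i (per-pattern
    error well above network i's average error over the training set),
    replace column h_k of H_i with the k-th column of H_j.
    `factor` quantifies 'substantially larger' (illustrative assumption)."""
    child = H_i.copy()
    poor = errors_i > factor * errors_i.mean()
    child[:, poor] = H_j[:, poor]
    return child

def mutate(H, p_m=0.01, rng=None):
    """Standard mutation: flip each bit of H with probability p_m."""
    if rng is None:
        rng = np.random.default_rng()
    flips = rng.random(H.shape) < p_m
    return np.where(flips, 1 - H, H)

rng = np.random.default_rng(0)
H_i = rng.integers(0, 2, size=(4, 6))
H_j = rng.integers(0, 2, size=(4, 6))
errors = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 5.0])  # pattern 6 poorly learned
child = column_crossover(H_i, H_j, errors)
assert np.array_equal(child[:, 5], H_j[:, 5])   # poor column replaced
assert np.array_equal(child[:, :5], H_i[:, :5])  # well-learned columns kept
```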

The above is a description of a single cycle of
the GA/GD learning method. This cycle is repeated until the population
{**H***i*} converges
to a dominant representation or until at least one network is
generated which has an output SSE less than a prespecified value.
During GA/GD learning, the weights of all *M* networks are
reinitialized at the beginning of each cycle to small random values
(one set of random weights may be used for all networks and for
all cycles). Hassoun and Song (1993b) reported several variations
of the above method including the use of sigmoidal hidden units,
using the outputs of the hidden layer instead of the hidden targets
to serve as the input pattern to the output layer during the training
phase, and using different fitness functions. The present
GA/GD method, though, showed the best overall performance on a range
of benchmark problems.

One of the main motivations behind applying GA to
the hidden target space as opposed to the weight space is the
possibility of the existence of a more dense set of solutions
for a given problem. That is, there may be many more optimal
hidden target sets {**H***}
in hidden target space which produce zero SSE than optimal
weight sets {**w***} in weight space. This hypothesis was validated on
a number of simple problems which are designed so that the weight
space and the hidden target space had the same dimensions. However,
further and more extensive testing is still required in this area.

In the architecture of Figure 8.6.1, the above GA/GD
method involves a GA search space of dimension *mJ*. On
the other hand, the GA search in weight space involves a search
space of dimension [(*n*+1)*J* + (*J*+1)*L*]*b*
where *b* is the number of binary bits chosen to encode each
weight in the network (see Problem 8.5.6). Since one would normally
choose a population size *M* proportional to the dimension
of the binary search space in GA applications, we may conclude
that the GA/GD method has a speed advantage over the other method
when the following condition is satisfied

*mJ* < [(*n*+1)*J* + (*J*+1)*L*]*b*
(8.6.2)
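As a quick numeric check of condition (8.6.2); the specific values of *n*, *J*, *L*, *b*, and *m* below are illustrative assumptions:

```python
# Illustrative values: n = 20 inputs, J = 10 hidden units, L = 2 outputs,
# b = 8 bits per weight, m = 50 training pairs.
n, J, L, b, m = 20, 10, 2, 8, 50

lhs = m * J                               # GA/GD hidden-target space dimension
rhs = ((n + 1) * J + (J + 1) * L) * b     # GA weight-space dimension

print(lhs, rhs, lhs < rhs)  # → 500 1856 True
```

With these values the hidden-target search space is roughly a quarter the size of the weight search space, so condition (8.6.2) holds.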

Equation (8.6.2) implies that the GA/GD method is
preferable over GA-based weight search in neural network learning
tasks when the size of the training set, *m*, is small compared
to the product of the dimension of the training patterns and the
bit accuracy, *nb* (here, it is assumed that *n* >> *L*).
Unfortunately, many practical problems (such as pattern recognition,
system identification, and function approximation problems) lead
to training sets characterized by *m* >> *n*,
which makes the GA/GD method less advantageous in term of computational
speed. However, one may alleviate this problem (e.g., in pattern
recognition applications) by partial preprocessing of the training
set using a fast clustering method which would substantially reduce
the size of the training set (refer to Section 6.1 for details)
and thus make the GA/GD method regain its speed advantage.

As is the case with the simulated annealing global
search method in the weight space, the GA/GD method may not compete
with backprop in computational speed. However, the GA/GD method
is an effective alternative to backprop in learning tasks that
involve complex multimodal criterion (error) functions, when optimal
solutions are at a premium.

[Table: benchmark comparison results; the original layout is not recoverable. Surviving entries: "Does not converge within 1 million trials"; "96 out of 100 runs do not converge within 1 million backprop cycles; the remaining 4 runs converged in an average of 3644 cycles"; "Does not converge within 1000 generations with population sizes of 64 and 132."]

We conclude this chapter by pointing out that genetic
algorithms may also be used as the evolutionary mechanism in the
context of more general learning systems. Holland (1986; also,
Holland and Reitman, 1978) introduced a classifier system which
is an adaptive parallel rule-based system that learns syntactically
simple string rules (called classifiers) to guide its performance
in an arbitrary environment. The classifier system develops a
sequence(s) of actions or decisions so that a particular objective
is achieved in a dynamically evolving environment. As an example,
one may think of the classifier system as a controller whose objective
is to regulate or control the state of a dynamical system. Here,
the fitness of a particular classifier (rule) is a function of
how well an individual classifier complements others in the population.
The heart of the classifier system is a reinforcement-type learning
mechanism assisted with GA exploration (see Goldberg (1989) for
an accessible reference on classifier systems and their applications).
Recently, Abu Zitar (1993; also, Abu Zitar and Hassoun, 1993a,
b) has developed a framework for synthesizing multilayer feedforward
neural net controllers for robust nonlinear control from binary
string rules generated by a classifier-like system.