8.6 Genetic Algorithm Assisted Supervised Learning

In the previous section, a method for training multilayer neural nets was described which uses a GA to search for optimal weight configurations. Here, an alternative learning method is described which performs a global search for optimal hidden-unit targets based on a hybrid GA/gradient descent search strategy. It is a supervised learning method suited to arbitrarily interconnected feedforward neural nets. In the following, this hybrid learning method is described in the context of a multilayer feedforward net having a single hidden layer.

In Section 2.3, the universal approximation capabilities of single hidden layer feedforward nets were established for a wide variety of hidden unit activation functions, including the threshold activation function. This implies that an arbitrary nonlinearly separable mapping can always be decomposed into two linearly separable mappings realized as the cascade of two single-layer neural nets, as long as the first layer (hidden layer) has a sufficient number of nonlinear units (e.g., sigmoids or LTG's).

In supervised learning, a desired target vector (pattern) is specified for each input vector in a given training set. A linearly separable set of training input/target pairs can be efficiently learned by a single-layer net using the gradient-based LMS or delta learning rules [see Equations (3.1.35) and (3.1.53)]. On the other hand, a general complex training set may not be linearly separable, which necessitates the use of a hidden layer. Thus, more sophisticated learning rules must be used which are usually far less efficient (slower) than the LMS or delta rules and, as in the case of backprop, may not always lead to satisfactory solutions. Ideally, one would like to "decouple" the training of a multiple-layer network into the training of two (or more) single-layer networks. This could be done if some method of finding an appropriate set of hidden unit activations could be found. These hidden unit activations will be called hidden targets, because they can be used as target vectors to train the first layer. This approach would be useful if it could be guaranteed that the mapping from the inputs to the hidden targets and that from the hidden targets to the desired output targets are both linearly separable. Once such hidden targets are found, efficient learning of the weights can proceed independently for the hidden and output layers using the delta rule. In fact, backprop may be thought of as employing a dynamic version of the above method, where hidden targets are estimated according to Equation (5.1.10) for each training pair. However, because of its gradient descent nature, backprop's estimate of the hidden targets does not guarantee finding an optimal hidden target configuration. The following is a method for training multilayer feedforward neural nets which utilizes the above hidden target-based supervised learning strategy and which, under conditions examined later in this section, is more efficient than GA search in weight space. Here, a GA is used to "evolve" proper hidden targets, and a gradient descent search (LMS or delta rule) is used to learn optimal network interconnection weights (Song, 1992; Hassoun and Song, 1993a, b).
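
As a simple illustration (a standard example, not taken from the simulations below), consider the XOR mapping x = (x_1, x_2) → d, with d = 1 if exactly one input equals 1 and d = 0 otherwise. XOR is not linearly separable, but assigning each input the hidden target h = (x_1 OR x_2, x_1 AND x_2) decomposes it into two linearly separable stages: both OR and AND are realizable by single LTG's, and the output mapping d = h_1 AND (NOT h_2) is again realizable by a single LTG.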

8.6.1 Hybrid GA/Gradient Descent Method for Feedforward Multilayer Net Training

The basics of the hybrid GA/gradient descent (GA/GD) learning method for a multilayer feedforward net are described next. The GA/GD method consists of two parts: genetic search in hidden target space and gradient-based weight update at each layer. Consider the fully interconnected feedforward single hidden layer net of Figure 8.6.1 with LTG hidden units. The output units can be linear units, sigmoid units, or LTG's, depending on the nature of the target vector d. Also, consider the training set {x_k, d_k}, k = 1, 2, ..., m.


Figure 8.6.1. A two-layer fully interconnected feedforward neural network.

Now, if we are given a set of hidden target column vectors {h_1, h_2, ..., h_m}, h_j ∈ {0, 1}^J (or {-1, +1}^J), such that the mappings x_k → h_k and h_k → d_k, for k = 1, 2, ..., m, are linearly separable, then gradient descent-based search (e.g., perceptron rule, LMS, delta rule, or Ho-Kashyap rule) may be employed to independently and quickly learn the optimal weights for both hidden and output layers. Initially, though, we do not know the proper set {h_k} of hidden targets which solves the problem. Therefore, a GA will be used to evolve such a set of hidden targets. In other words, a GA search is used to explore the space of possible hidden targets {h} (hidden target space) and converge to a global solution which renders the mappings x → h and h → d linearly separable. Since the hidden targets are binary-valued, a natural coding of a set of hidden targets is the string s = (s_1 s_2 ... s_m), where s_i is a string formed from the bits of the vector h_i. Equivalently, we may represent the search "point" as a J × m array (matrix) H = [h_1 h_2 ... h_m]. This representation is particularly useful in the multipoint crossover described next.
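
The following is a minimal sketch of this representation in Python (the dimension values and the NumPy-based encoding are illustrative assumptions, not part of the original method):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative dimensions: n inputs, J hidden units, L outputs,
    # m training pairs, and a population of M hidden-target arrays.
    n, J, L, m, M = 4, 4, 1, 16, 8

    # Each GA individual is a J x m array H = [h_1 h_2 ... h_m]: column k
    # holds the candidate hidden-target vector (bipolar here, in {-1,+1}^J)
    # for training pattern k.
    population = [rng.choice([-1, 1], size=(J, m)) for _ in range(M)]

The bipolar coding matches the simulations of Section 8.6.2; a {0, 1} coding works equally well.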

A population of M random binary arrays {H_j}, j = 1, 2, ..., M, is generated as the initial population of search "points." Each array has an associated network labeled j whose architecture is shown in Figure 8.6.1, with all M nets initialized with the same set of random weights. The fitness of the jth search point (array H_j) is determined by the output SSE of network j:

E_j = Σ_{k=1}^{m} Σ_{l=1}^{L} (d_l^k − y_{jl}^k)^2        (8.6.1)

Here, y_{jl}^k is the output of the lth output unit in network j due to the input x_k. Now, any one of a number of fitness functions may be used. Examples are f(H_j) = −E_j, f(H_j) = 1/E_j, or even f(H_j) = 1/(E_j + ε), where ε is a very small positive number. However, different fitness functions may lead to different performance.
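
As a sketch, the SSE of Equation (8.6.1) and the last of these fitness choices might be computed as follows (the array shapes are assumptions of this illustration):

    def output_sse(Y, D):
        """E_j of Equation (8.6.1): squared error summed over all m patterns
        and L output units; Y and D both have shape (m, L)."""
        return float(np.sum((D - Y) ** 2))

    def fitness(E, eps=1e-6):
        """One of the fitness choices above, f = 1/(E + eps); the small eps
        guards against division by zero when a network reaches zero SSE."""
        return 1.0 / (E + eps)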

Initially, starting from random weight values and random hidden targets, LMS is used to adapt the weights of the hidden layer in each of the M networks subject to the training set {x_k, h_k}, k = 1, 2, ..., m. Here, the threshold activation is removed during weight adaptation and the hidden units are treated as linear units. Alternatively, an adaptive version of the Ho-Kashyap algorithm or the perceptron rule may be applied directly to the hidden LTG's. Similarly, the weights of the output layer units are adapted subject to the training set {h_k, d_k}, independently of the hidden layer.
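
A minimal sketch of this decoupled, per-layer adaptation follows, using the delta rule with tanh units (the variant employed in the simulations of Section 8.6.2); the function signature and the omission of bias terms are simplifying assumptions:

    def delta_rule(X, T, W, lr=0.1, steps=10):
        """A few delta-rule steps on a single tanh layer so that tanh(X @ W) ~ T.
        X: (m, inputs), T: (m, units), W: (inputs, units). Bias terms are
        omitted for brevity; in practice X would carry an appended constant
        component."""
        for _ in range(steps):
            Y = np.tanh(X @ W)
            W = W + lr * X.T @ ((T - Y) * (1.0 - Y ** 2))  # grad of 0.5*||T - Y||^2
        return W

    # Decoupled training against a fixed hidden-target array H (shape (J, m)):
    #   W1 = delta_rule(X, H.T, W1)   # hidden layer learns  x_k -> h_k
    #   W2 = delta_rule(H.T, D, W2)   # output layer learns  h_k -> d_k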

After the weights are updated, each network is tested by performing feedforward computations and its fitness is computed. In these computations, the outputs of the hidden units (as opposed to the hidden targets) serve as the inputs to the output layer. Next, the GA operators of reproduction, cross-over, and mutation are applied to evolve the next generation of hidden target sets {H_j}. In reproduction, the hidden target sets H_j with the highest fitness are duplicated and entered into a temporary pool for cross-over. Cross-over is applied with a probability P_c (P_c is set close to 1). A pair {H_i, H_j} is selected randomly without replacement from the temporary pool just generated. If a training pair {x_k, d_k} is poorly learned by network i (network j) during the above learning phase (i.e., if the output error due to this pair is substantially larger than the average output error of network i (network j) on the whole training set), then the corresponding column h_k of H_i is replaced by the kth column of H_j. Here, cross-over can affect multiple pairs of columns in the hidden target arrays. The above reproduction and cross-over operations differ from those employed by the standard GA, and are motivated by empirical results (Hassoun and Song, 1993b). On the other hand, the standard mutation operation is used here, where each bit of the H_i's, after cross-over, is flipped with a probability P_m (typically, P_m = 0.01 is used).
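
The error-driven cross-over and the mutation might be sketched as follows; the "substantially larger" test is implemented here with a fixed threshold factor, and the symmetric treatment of the pair is an assumption of this sketch:

    def crossover(Hi, Hj, err_i, err_j, factor=2.0):
        """Column-wise cross-over: a pattern whose output error in one network
        is well above that network's mean error gets its hidden-target column
        replaced by the partner's column."""
        Ci, Cj = Hi.copy(), Hj.copy()
        bad_i = err_i > factor * err_i.mean()  # err_i: per-pattern errors, shape (m,)
        bad_j = err_j > factor * err_j.mean()
        Ci[:, bad_i] = Hj[:, bad_i]
        Cj[:, bad_j] = Hi[:, bad_j]
        return Ci, Cj

    def mutate(H, rng, pm=0.01):
        """Standard mutation: each bit flips with probability pm
        (a sign flip under the bipolar coding used here)."""
        return np.where(rng.random(H.shape) < pm, -H, H)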

The above is a description of a single cycle of the GA/GD learning method. This cycle is repeated until the population {H_j} converges to a dominant representation or until at least one network is generated which has an output SSE less than a prespecified value. During GA/GD learning, the weights of all M networks are reinitialized at the beginning of each cycle to small random values (one set of random weights may be used for all networks and for all cycles). Hassoun and Song (1993b) reported several variations of the above method, including the use of sigmoidal hidden units, using the outputs of the hidden layer instead of the hidden targets as the input patterns to the output layer during the training phase, and using different fitness functions. However, the present GA/GD method showed the best overall performance on a range of benchmark problems.
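
Putting the pieces together, one full GA/GD cycle might look as follows. The truncation selection used for reproduction and the random pairing are simplifying assumptions of this sketch; the fixed initial weights W1_0 and W2_0 are reused every cycle, as described above:

    def ga_gd_cycle(population, X, D, W1_0, W2_0, rng, pc=0.9, pm=0.01):
        """One GA/GD cycle: per-layer delta-rule training against each fixed
        hidden-target array, fitness evaluation by feedforward SSE, then
        reproduction, cross-over, and mutation of the hidden-target arrays."""
        fits, errs = [], []
        for H in population:
            W1 = delta_rule(X, H.T.astype(float), W1_0.copy())  # layer 1: x -> h
            W2 = delta_rule(H.T.astype(float), D, W2_0.copy())  # layer 2: h -> d
            Y = np.tanh(np.tanh(X @ W1) @ W2)  # hidden *outputs* feed the output layer
            e = np.sum((D - Y) ** 2, axis=1)   # per-pattern output errors
            errs.append(e)
            fits.append(fitness(e.sum()))
        # Reproduction: the fitter half of the arrays, duplicated, forms the pool.
        order = np.argsort(fits)[::-1][: len(population) // 2]
        pool = rng.permutation(np.concatenate([order, order]))
        nxt = []
        for a, b in zip(pool[0::2], pool[1::2]):
            Ha, Hb = population[a], population[b]
            if rng.random() < pc:
                Ha, Hb = crossover(Ha, Hb, errs[a], errs[b])
            nxt += [mutate(Ha, rng, pm), mutate(Hb, rng, pm)]
        return nxt, min(e.sum() for e in errs)

Repeated calls to ga_gd_cycle, each starting from the same W1_0 and W2_0, implement the full method; iteration stops when the best SSE falls below a prespecified value.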

One of the main motivations behind applying the GA to the hidden target space, as opposed to the weight space, is the possibly higher density of solutions there. That is, there may be many more optimal hidden target sets {H*} in hidden target space which produce zero SSE than optimal weight sets {w*} in weight space. This hypothesis was validated on a number of simple problems which were designed so that the weight space and the hidden target space have the same dimension. However, further and more extensive testing is still required in this area.

In the architecture of Figure 8.6.1, the above GA/GD method involves a GA search space of dimension mJ. On the other hand, GA search in weight space involves a search space of dimension [(n+1)J + (J+1)L]b, where b is the number of binary bits chosen to encode each weight in the network (see Problem 8.5.6). Since one would normally choose a population size M proportional to the dimension of the binary search space in GA applications, we may conclude that the GA/GD method has a speed advantage over weight-space GA search when the following condition is satisfied:

mJ < [(n+1)J + (J+1)L]b (8.6.2)

Equation (8.6.2) implies that the GA/GD method is preferable over GA-based weight search in neural network learning tasks when the size of the training set, m, is small compared to the product of the dimension of the training patterns and the bit accuracy, nb (here, it is assumed that n >> L). Unfortunately, many practical problems (such as pattern recognition, system identification, and function approximation problems) lead to training sets characterized by m >> n, which makes the GA/GD method less advantageous in terms of computational speed. However, one may alleviate this problem (e.g., in pattern recognition applications) by partial preprocessing of the training set using a fast clustering method, which would substantially reduce the size of the training set (refer to Section 6.1 for details) and thus allow the GA/GD method to regain its speed advantage.
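
As a concrete check, take the 4-bit parity problem of the next section with the full truth table as the training set (so n = 4, J = 4, L = 1, and m = 16) and b = 8 bits per weight: the left-hand side of (8.6.2) is mJ = (16)(4) = 64, while the right-hand side is [(4+1)(4) + (4+1)(1)](8) = 200, so the hidden target space is indeed the smaller search space in that experiment.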

As is the case with the simulated annealing global search method in weight space, the GA/GD method may not compete with backprop in computational speed. However, the GA/GD method is an effective alternative to backprop in learning tasks which involve complex multimodal criterion (error) functions, when optimal solutions are at a premium.

8.6.2 Simulations

The GA/GD method is tested on the 4-bit parity binary mapping and a continuous mapping that arises from a nonlinear system identification problem. The 4-bit parity (refer to the K-map in Figure 2.1.2) is chosen because it is known to pose a difficult problem to neural networks using gradient descent-based learning, due to multiple local minima. On the other hand, the nonlinear system identification problem is chosen to test the ability of the GA/GD method with binary hidden targets to approximate continuous nonlinear functions.

In both simulations, a two-layer feedforward net is used with sigmoidal hidden units employing the hyperbolic tangent activation. For the binary mapping problem, bipolar training data, bipolar hidden targets, and bipolar output targets are assumed, and a single sigmoidal unit is used for the output layer. The system identification problem, on the other hand, uses a single linear output unit. The delta rule with a learning rate of 0.1 is used to learn the weights at both the hidden and output layers. Only ten delta rule learning steps are allowed for each layer per full GA/GD training cycle.

The 4-bit parity is a binary mapping from 4-dimensional binary-valued input vectors to one binary-valued (desired) output. The desired output is taken as +1 if the number of +1 bits in the input vector is odd, and -1 otherwise. The networks used in the following simulations employ four hidden units. The GA/GD method is tested with population sizes of 8, 32, and 64 strings. For each population size, fifty trials are performed (each trial re-randomizes all initial weights and hidden target sets) and learning cycle statistics (mean value, standard deviation, maximum, and minimum) are computed. Simulation results are reported in Table 8.6.1 for the GA/GD method and three other methods: (1) a method similar to GA/GD but with the GA process replaced by reinitializing the search with random hidden targets and random weights at the onset of every learning cycle, referred to as the random hidden target/gradient descent (RH/GD) method (here, sufficient iterations of the delta rule are allowed in each cycle to rule out non-convergence); (2) incremental backprop (BP); and (3) standard GA learning in weight space (SGA).
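
Before turning to those results, here is a minimal construction of this bipolar training set in the sketch's notation (the function name and NumPy representation are illustrative assumptions):

    def parity4_bipolar():
        """All 16 bipolar 4-bit input vectors; target is +1 when the number
        of +1 components is odd, and -1 otherwise."""
        X = np.array([[1.0 if (i >> b) & 1 else -1.0 for b in range(4)]
                      for i in range(16)])
        D = np.where(np.sum(X == 1.0, axis=1) % 2 == 1, 1.0, -1.0).reshape(-1, 1)
        return X, D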

                   Learning cycle statistics (mean / std. dev. / max / min)

Learning method   Population size 8        Population size 32      Population size 64
--------------------------------------------------------------------------------------
GA/GD             437 / 530 / 1882 / 14    159 / 195 / 871 / 8     105 / 335 / 2401 / 5
RH/GD             Does not converge within 1 million trials.
BP                96 out of 100 runs do not converge within 1 million backprop cycles.
                  The remaining 4 runs converged in an average of 3644 cycles.
SGA               Does not converge within 1000 generations with population sizes
                  of 64 and 132.

Table 8.6.1. Simulation results for the 4-bit parity using the GA/gradient descent (GA/GD) method, the random hidden target/gradient descent (RH/GD) method, backprop (BP), and the standard GA (SGA) in weight space method. A four hidden unit feedforward neural net is used.

In all simulations, the GA/GD method led to successful convergence, with population sizes of 8, 32, and 64, in an average of a few hundred learning cycles or less. The RH/GD method could not find a single solution within one million trials. This shows clearly the difficulty of the task and verifies the effectiveness of the GA/GD method. As for backprop, only four out of 100 runs resulted in a solution, with the remaining 96 runs reaching a high error plateau and/or local minima. Finally, the SGA method neither converged nor was it able to find a solution within 1000 generations, with population sizes of 64 and 132; this being the case for codings of both 8 and 16 binary bits per weight.

The second simulation involves the identification of a nonlinear plant described by the discrete-time dynamics of Equation (5.4.4). A feedforward neural network with 20 hidden units is trained using the GA/GD method on a training set generated from a random input signal in a fashion similar to that described in Section 5.4.1. The size of the training set used here, though, is m = 100. Figure 8.6.2 shows a typical simulation result after 20 learning cycles with a population of 50 hidden target sets. The test input signal in this case is the one given in Equation (5.4.5). This result compares favorably to those in Figures 5.4.4 (a) and (b) for a two-hidden-layer feedforward net and a single-hidden-layer feedforward net, respectively, which required on the order of 10^5 to 10^6 training iterations of incremental backprop.

Figure 8.6.2. Nonlinear system identification results (dotted line) employing a single-hidden layer feedforward neural net trained with the GA/GD method. The exact dynamics are given by the solid line.

We conclude this chapter by pointing out that genetic algorithms may also be used as the evolutionary mechanism in the context of more general learning systems. Holland (1986; also, Holland and Reitman, 1978) introduced the classifier system, an adaptive parallel rule-based system that learns syntactically simple string rules (called classifiers) to guide its performance in an arbitrary environment. The classifier system develops sequences of actions or decisions so that a particular objective is achieved in a dynamically evolving environment. As an example, one may think of the classifier system as a controller whose objective is to regulate or control the state of a dynamical system. Here, the fitness of a particular classifier (rule) is a function of how well that classifier complements others in the population. The heart of the classifier system is a reinforcement-type learning mechanism assisted by GA exploration (see Goldberg (1989) for an accessible reference on classifier systems and their applications). Recently, Abu Zitar (1993; also, Abu Zitar and Hassoun, 1993a, b) has developed a framework for synthesizing multilayer feedforward neural net controllers for robust nonlinear control from binary string rules generated by a classifier-like system.
