8.5 Genetic Algorithms in Neural Network Optimization

Genetic algorithms are global optimization algorithms based on the mechanics of natural selection and natural genetics. They employ a structured yet randomized parallel multipoint search strategy which is biased towards reinforcing search points of "high fitness"; i.e., points at which the function being minimized have relatively low values. Genetic algorithms are similar to simulated annealing (Davis, 1987) in that they employ random (probabilistic) search strategies. However, one of the apparent distinguishing features of genetic algorithms is their effective implementation of parallel multipoint search. This section presents the fundamentals of genetic algorithms and shows how they may be used for neural networks training.

8.5.1 Fundamentals of Genetic Algorithms

The genetic algorithm (GA) as originally formulated by Holland (1975) was intended to be used as a modeling device for organic evolution. Later, De Jong (1975) demonstrated that the GA may also be used to solve optimization problems, and that globally optimal by results may be produced. Although there has been a lot of work done on modifications and improvements to the method, this section will present the standard genetic algorithm, and the analysis will follow the presentation given in Goldberg (1983).

In its simplest form, the standard genetic algorithm is a method of stochastic optimization for discrete programming problems of the form:

Maximize f(s) (8.5.1)

subject to s   = {0, 1}n

In this case, :   R is called the fitness function, and the n-dimensional binary vectors in are called strings. The most noticeable difference between the standard genetic algorithm and the methods of optimization discussed earlier is that at each stage (iteration) of the computation, genetic algorithms maintain a collection of samples from the search space rather than a single point. This collection of samples is called a population of strings.

To start the genetic search, an initial population of, say M binary strings: S(0) = {s1s2, ..., sM}  , each with n bits, is created. Usually, this initial population is created randomly because it is not known a priori where the globally optimal strings in are likely to be found. If such information is given, though, it may be used to bias the initial population towards the most promising regions of . From this initial population, subsequent populations S(1), S(2), ... S(t), ... will be computed employing the three genetic operators of selection, crossover, and mutation.

The standard genetic algorithm uses a roulette wheel method for selection, which is a stochastic version of the survival of the fittest mechanism. In this method of selection, candidate strings from the current generation S(t) are selected to survive to the next generation S(t+1) by designing a roulette wheel where each string in the population is represented on the wheel in proportion to its fitness value. So those strings which have a high fitness are given a large share of the wheel, while those strings with low fitness are given a relatively small portion of the roulette wheel. Finally, selections are made by spinning the roulette wheel M times and accepting as candidates those strings which are indicated at the completion of the spin.

Example 8.5.1: As an example, suppose M = 5, and consider the following initial population of strings: S(0) = {(10110), (11000), (11110), (01001), (00110)}. For each string si in the population, the fitness may be evaluated: f(si). The appropriate share of the roulette wheel to allot the ith string is obtained by dividing the fitness of the ith string by the sum of the fitnesses of the entire population: . Figure 8.5.1 shows a listing of the population with associated fitness values, and the corresponding roulette wheel.

(a) (b)

Figure 8.5.1. (a) A listing of the 5-string population and the associated fitness values. (b) Corresponding roulette wheel for string selection. The integers shown on the roulette wheel correspond to string labels.

To compute the next population of strings, the roulette wheel is spun five times. The strings which are chosen from this method of selection, though, are only candidate strings for the next population. Before actually being copied into the new population, these strings must undergo crossover and mutation.

Pairs of the M (assume M even) candidate strings which have survived selection are next chosen for crossover, which is a recombination mechanism. The probability that the crossover operator is applied will be denoted by Pc. Pairs of strings are selected randomly from S(t), without replacement, for crossover. A random integer k, called the crossing site, is chosen from {1, 2, ... - 1} and then the bits from the two chosen strings are swapped after the kth bit with a probability Pc, this is repeated until S(t) is empty. For example, Figure 8.5.2 illustrates a crossover for two 6-bit strings. In this case, the crossing site k is 4, so the bits from the two strings are swapped after the fourth bit.

(a) (b) (c)

Figure 8.5.2. An example of a crossover for two 6-bit strings. (a) Two strings are selected for crossover. (b) A crossover site is selected at random. In this case k = 4. (c) Now swap the two strings after the kth bit.

Finally, after crossover, mutation is applied to the candidate strings. The mutation operator is a stochastic bit-wise complementation, applied with uniform probability Pm. That is, for each single bit in the population, the value of the bit is flipped from 0 to 1 or from 1 to 0 with probability Pm. As an example, suppose Pm = 0.1, and the string s = 11100 is to undergo mutation. The easiest way to determine which bits, if any, to flip is to choose a uniform random number r  [0, 1] for each bit in the string. If r  Pm, then the bit is flipped, otherwise no action is taken. For the string s above, suppose the random numbers (0.91, 0.43, 0.03, 0.67, 0.29) were generated, then the resulting mutation is shown below. In this case, the third bit was flipped.

After mutation, the candidate strings are copied into the new population of strings: S(t+1), and the whole process is repeated by calculating the fitness of each string, using a roulette wheel method of selection, and applying the operators of crossover and mutation.

Although the next section presents an analytical analysis of the action of the genetic operators, some qualitative comments may be helpful first. In the roulette wheel method of selection, it is clear that only the above-average strings will tend to survive successive populations. Applying only selection to a population of strings results in a sort of MAX operation, that is, the operation of selecting the maximum from a set of numbers. It has been shown (Suter and Kabrisky, 1992) that a MAX operation may be implemented by successive application of a theorem by Hardy et al. (1952), which states that for a set of non-negative real numbers Q  R, then

min(Q) < ave(Q) < max(Q) (8.5.2)

(where it is assumed that not all of the elements of Q are equal). So the maximum value may be obtained by simply averaging the elements of Q and excluding all elements which are below average. The remaining subset may be averaged, and the process repeated until only the maximum value survives. The roulette wheel method of selection may be thought of as a "soft" stochastic version of this MAX operation, where strings with above-average strengths tend to survive successive roulette wheel spins.

The reason that the stochastic version is used, rather than just deterministically always choosing the best strings to survive, gets at the crux of the underlying theory and assumptions of genetic search. This theory is based on the notion that even strings with very low fitness may contain some useful partial information to the search. For this reason, though the survival probability is small, these lowly fit strings are not altogether discarded during the search.

Of the three genetic operators, the crossover operator is the most crucial in obtaining global results. Crossover is responsible for mixing the partial information contained in the strings of the population. In the next section, it will be conjectured that this type of mixing will lead to the formation of optimal strings. Based on empirical evidence, it has been found (De Jong, 1975; Grefenstette, 1986; Schaffer et al., 1989) that reasonable values for the probability of crossover are 0.6  Pc  0.99.

Unlike the previous two operators which are used to fully exploit and possibly improve the structure of the best strings in the current population, the purpose of the mutation operator is to diversify the search and introduce new strings into the population in order to fully explore the search space. The creation of these new strings is usually required because of the vast differences between the number of strings in the population, M, and the total number of strings in the entire search space , 2n. Typically, M is chosen to be orders of magnitude smaller than 2n, so by selecting and recombining (crossing) the M strings in a given population, only a fraction of the total search space is explored. So mutation forces diversity in the population and allows more of the search space to be sampled, thus allowing the search to overcome local minima.

Mutation, though, cuts with a double edge sword. Applying mutation too frequently will result in destroying the highly fit strings in the population, which may slow and impede convergence to a solution. Hence, although necessary, mutation is usually applied with a small probability. Empirically, it has been found (De Jong, 1975; Grefenstette, 1986; Schaffer et al., 1989) that reasonable values for the probability of mutation are . Bäck (1993) presented a theoretical analysis where he showed that Pm =  is the best choice when the fitness function f is unimodal. However, for a multimodal fitness function, Bäck shows that a dynamic mutation rate may overcome local minima, whereas a fixed mutation rate may not. The dynamic mutation rate may be implemented by following a schedule where Pm is slowly decreased towards from an initial value Pm(0), such that 1 > Pm(0) >> . This is analogous to decreasing the temperature in simulated annealing. Davis and Principe (1993) showed that the asymptotic convergence of a GA with a suitable mutation probability schedule can be faster than that of simulated annealing.

The Fundamental Theorem of Genetic Algorithms

We have just described the mechanisms of the standard genetic algorithm. Later, we will demonstrate by example that the GA actually works, i.e., global solutions to multimodal functions may be obtained. The question here is: Why does the standard GA work? Surprisingly, although the literature abounds with applications which demonstrate the effectiveness of GA search, the underlying theory is far less understood. The theoretical basis which has been established thus far (Goldberg, 1983; Thierens and Goldberg, 1993), though, will be described next.

To analyze the convergence properties of the GA, it is useful to define the notion of schema (plural, schemata). A schema H is a structured subset of the search space . The structure in H is provided by string similarities at certain fixed positions of all the strings in H. The string positions which aren't fixed in a given schema are usually denoted by *. For example, the schema H = *11*0 is the collection of all 5-bit binary strings which contain a 1 in the second and third string positions, and a 0 in the last position, that is,

H = {(01100), (01110), (11100), (11110)}

In total, there are 3n different schemas possible: all combinations of the symbols {0, 1 *}. Since there are only 2n different strings in , it is clear that a given string in will belong to several different schema. More precisely, each string in will belong to 2n different schema.

To prove the fundamental theorem of genetic algorithms, it is necessary to investigate the effect that selection, cross-over, and mutation has on a typical population of strings. More generally, it is possible to determine the effect of these genetic operators on the schemata of a typical population. The following notation will be useful: The order of a schema H is the number of fixed positions over the strings in H, and the defining length of a schema is the distance between the first and last fixed positions of the schema. The order and defining length are denoted by o(H) and (H), respectively. For example, for the schema *11**0, then o(H) = 3 because there are 3 fixed positions in wazzu H, and (H) = 6 - 2 = 4, because 2 and 6 are the indices of the first and last fixed string positions in H, respectively.

Consider S(t), the population of strings at time t, and consider the collection of schemata which contain one or more of the strings in this population. For each such schema H, denote by m(H, t) the number of strings in the population at time t which are also in H. We want to study the long term behavior of m(Ht) for those schema H which contain highly fit strings.

Using the roulette wheel method of selection outlined in the previous section, the expected number of strings in H  S(t + 1) given the quantity m(H, t) is easily seen to be:


where f(H) is the average fitness of the strings H  S(t + 1), and is the average fitness of all the strings in the population at time t. Assuming remains relatively constant, the above equation has the form of a linear difference equation: x(t + 1) = ax(t). The solution of this equation is well known and given by: x(t) = atx(0), t = 0, 1, 2, ..., which explodes if a > 1 and decays if a < 1. By comparing with Equation (8.5.3), we see that the number of strings in the population represented by H is expected to grow exponentially if , i.e., if the average fitness of the schema is higher than the average fitness of the entire population. Conversely, the number of strings in the population represented by H will decay exponentially if .

Now consider the effect of crossover on a schema H with m(H, t) samples in the population at time t. If a string belonging to H is selected for crossover, one of two possibilities may occur: (1) The crossover preserves the structure of H; in this case, the schema is said to have survived crossover, or (2) The crossover destroys the structure, and hence the resulting crossed string will not belong to H at time t + 1. It's easy to see by example which schema are likely to survive crossover. Consider the two schema A and B shown below.

A = 11, B = 01

Claim: Schema A will not survive crossover if the cross site k is 1, 2, 3, or 4. To see this, just take a representative example from A, say 100011. Making the reasonable assumption that the mating string is not identical to the example string at precisely the fixed string positions of A (i.e., the first and fifth positions), then upon crossover with cross site 1, 2, 3, or 4, the fixed 1 at the fifth string position will be lost, and the resulting string will not belong to A.

On the other hand, schema B may be crossed at sites 1, 3, 4, and 5 and still preserve the structure of B, because, in this case, the 01 fixed positions will lie on the same side of the crossing site and will be copied into the resulting string. The only crossing site which will destroy the structure of schema B would be k = 2.

By noticing the difference in defining length for these two schema: (A) = 4 and (B) = 1, the following conclusion may be made: A schema survives crossover when the cross site is chosen outside its defining length. Hence, the probability that a schema H will survive crossover is given by . But since the crossover operator is only applied with probability Pc, the following quantity is a lower bound for the crossover survival probability: .

Finally, the mutation operator destroys schema only if it is applied to the fixed positions in the schema. Hence the probability that a schema H will survive mutation is given by (1 - Pm)o(H). For small values of Pm, the binomial theorem may be employed to obtain the approximation: (1-Pm)o(H)  1 - o(H)Pm. The Fundamental Theorem of Genetic Algorithms may now be given.

Theorem. (Goldberg, 1983) By using the selection, crossover, and mutation of the standard genetic algorithm, then short, low order, and above average schemata receive exponentially increasing trials in subsequent populations.

Proof. Since the operations of selection, crossover, and mutation are applied independently, then the probability that a schema H will survive to the next generation may be obtained by a simple multiplication of the survival probabilities derived above:


By neglecting the cross product terms, the desired result is obtained:


The short, low order, and above average schemata are called building blocks, and the Fundamental Theorem indicates that building blocks are expected to dominate the population. Is this good or bad in terms of the original goal of function optimization? The above theorem does not answer this question. Rather, the connection between the Fundamental Theorem and the observed optimizing properties of the genetic algorithm is provided by the following conjecture:

The Building Block Hypothesis. The globally optimal strings in may be partitioned into substrings which are given by the bits of the fixed positions of building blocks.

Stated another way, the hypothesis is that the partial information contained in each of the building blocks may be combined to obtain globally optimal strings. If this hypothesis is correct, then the Fundamental Theorem implies that the GA is doing the right thing in allocating an exponentially increasing number of trials to the building blocks, because some arrangement of the building blocks is likely to produce a globally optimal string.

Unfortunately, although the building block hypothesis seems reasonable enough, it does not always hold true. Cases where the hypothesis fails can be constructed. It is believed (Goldberg, 1983), though, that such cases are of "needle in the haystack" type, where the globally optimal strings are surrounded (in a Hamming distance sense) by the worst strings in . Such problems are called GA-deceptive because by following the building block hypothesis, the GA is lead away from the globally optimal solutions rather than towards them. Current trends in GA research (Kuo and Hwang, 1993; Qi and Palmieri, 1993) include modifying the standard genetic operators in order to enable the GA to solve such "needle in the haystack" problems, and hence shrink in size the class of GA-deceptive problems.

The above analysis is based entirely on the schema in the population rather than the actual strings in the population. The GA, though, processes strings--not schema. This type of duality is called implicit parallelism by Holland (1975). The implicit parallelism notion means that a larger amount of information is obtained and processed at each generation by the GA than would appear by simply looking at the processing of the M strings. This additional information comes from the number of schema that the GA is processing per generation. The next question is: how many schema are actually processed per generation by the GA? Clearly, in every population of M strings, there are between 2n and 2nM schema present (if all the strings in the population are the same, then there are 2n schema; if all the strings are different, there may be at most 2nM schema). Because the selection, cross-over, and mutation operations tend to favor certain schema, then not all of the schema in the population will be processed by the GA. Holland (1975) estimated that O(M3) schemata per generation are actually processed in a useful manner (see also Goldberg, 1989). Hence, implicit parallelism implies that by processing M strings, the GA actually processes O(M3) schemata for free!

To apply the standard genetic algorithm to an arbitrary optimization problem of the form:


it is necessary to establish the following:

1. A correspondence between the search space and some space of binary strings . That is, an invertible mapping of the form D:  .

2. An appropriate fitness function f(s), such that the maximizers of f correspond to the minimizers of y.

This situation is shown schematically in Figure 8.5.3.

Figure 8.5.3. A schematic representation of the process of matching an optimization problem with the genetic algorithm framework.

Example 8.5.2: As an example of the solution process, consider the function shown in Figure 8.1.1, and the following optimization problem:


This is the same function considered earlier in Example 8.1.1 and plotted in Figure 8.1.1. This function has two local minima x  0.058, and x  0.091, as well as a global minimum at x*  0.223. The standard genetic algorithm will be used to obtain the global minimum of this function.

The first thing to notice is that the search space here is real-valued: = [0.05, 0.5]. As mentioned above, some transformation is needed in order to encode/decode the real-valued search space into some space of binary strings . In this example, a binary search space consisting of six-bit strings, i.e.,  = {0, 1}6 was used, with the decoding transformation given by


where d(s) is the ordinary decimal representation of the 6-bit binary string s. For example, the decimal representation of 000011 is 3, so D (000011) = 0.071; as would be expected, the two end-points are mapped in the following way: D (000000) = 0.05 and D (111111) = 0.5.

To establish an appropriate fitness function for this problem, recall that the problem is to minimize y(x), but maximize the fitness function f(s). So some sort of inverting transformation is required here. In this example, the following fitness function is used:


where z = D(s).

Before applying the standard genetic algorithm, values for M, Pc, and Pm must be chosen. The values for these parameters are usually determined empirically by running some trial simulations. However, one may first try to choose Pc and Pm in the ranges 0.6  Pc  0.99 and 0.01  Pm  0.001 respectively, as mentioned earlier. As for the value of M, empirical results suggest that n  M  2n is a good choice (Alander, 1992. See also Reeves, 1993). Besides the above parameters, some stopping or convergence criterion is required. Although there are several different convergence criteria which may be used, the criterion used here is to stop when the population is sufficiently dominated by a single string. In this case, convergence is obtained when a single string comprises 80 percent or more of the population.

Two simulations will be described below. In the first simulation, an initial population of strings was generated uniformly over the search space, and then, as usual, the genetic operators of selection, crossover, and mutation were applied until the convergence criterion was met. The following parameters were used: M = 10, P= 0.8, and P= 0.01, and the results of this simulation are shown in Table 8.5.1. In this table, a listing of the population is shown for the generations at times t = 0, 1, 2, and 3. The decoded real-value for each string is also shown, as well as the associated fitness values. The number in parenthesis beside each string shows the number of multiplicities of the string appearing in the total population of 10 strings. Notice how the population converges to populations dominated by highly fit strings. After the fourth iteration (t = 4), the population is dominated by the string s* = 011001, and the population has converged. The string s* is decoded to the value x* = 0.23, which is close to the globally optimal solution of x*  0.223. Note that better accuracy may be obtained by using a more accurate encoding of the real-valued search space. That is, by using a GA search space with strings of higher dimension, for example,  = {0,1}10, or  = {0, 1}20, whatever accuracy is required.

S(t) = {s1, ..., s10}
Decoded Value

x = D(s)

- y(x)

t = 0
000010 (1)

000110 (1)

001010 (1)

010001 (1)

011001 (1)

100000 (1)

100111 (1)

101110 (1)

110101 (1)

111100 (1)



















t = 1
010101 (1)

110101 (1)

101010 (1)

001010 (1)

011001 (3)

010001 (3)











t = 2
101001 (1)

011001 (4)

111001 (1)

010001 (3)

110101 (1)








t = 3
010001 (4)

011001 (6)



Table 8.5.1. A listing of the first four populations in the first simulation for Example 8.5.2. The numbers in parenthesis show the multiplicity of the string in the total population of 10 strings.

The second simulation of this problem demonstrates a case where mutation is necessary to obtain the global solution. In this simulation, the initial population was created with all strings near the left endpoint x = 0.05. The following parameters were used here: M = 10, Pc = 0.9, and Pm = 0.05. The increased mutation and crossover rates were used to encourage diversity in the population. This helps the genetic algorithm branch out to explore the entire space. In fact, if mutation was not used, i.e., Pm = 0, then the global solution could never be found by the GA. This is because the initial population is dominated by the schema 00000** which is not a desirable building block because the fitness of this schema is relatively small. Applying selection and crossover won't help because no new schemata would be generated. The results of the simulation are shown in Table 8.5.2. This time, the GA took 44 iterations to converge to the solution s* = 011000, with corresponding real-value: x* = 0.219.

P(t) = {s1, ..., s10}
Decoded Value

x = D(s)

- y(x)

t = 0
000000 (3)

000001 (1)

000010 (3)

000011 (3)







t = 5
000010 (2)

000110 (1)

011010 (2)

001011 (1)

100010 (1)

010010 (1)

001000 (1)

001110 (1)















t = 30
111000 (1)

010010 (4)

000010 (1)

011010 (2)

010000 (2)








t = 44
010100 (1)

011000 (9)



Table 8.5.2. A listing of the population at various stages of the computation for the second simulation of Example 8.5.2. In this simulation, the initial population of strings was concentrated at the left end-point of the search space .

Although the previous example used a one-dimensional objective function, multi-dimensional objective functions y: Rn  R may also be mapped onto the genetic algorithm framework by simply extending the length of the binary strings in to represent each component of the points x = (x1, ..., xn) in . That is, each string s will consist of n substrings s = (s1, ..., sn) , where si is the binary encoding for the ith component of x. A decoding transformation may then be applied to each substring separately: D(s) = (D1(s1), ..., Dn(sn)). Although not necessary, decoding each component separately might be desirable in certain applications. For example, suppose = (x1, x2), and = [0, 1]  [0, 100]. To obtain the same level of accuracy for the two variables, then more bits would have to be allotted the substring representing the second component of x, because it has a much larger range of values than the first component. Hence, in this case, the decoding transformations D1(s1) and D2(s2) would be different.

The crossover operator may also be slightly modified to exploit the structure of the substring decomposition of s. Instead of choosing a single crossing site over the entire string as shown below for a string of the form s = (s1, s2, s3, s4),

crossing sites may be chosen for each of the substrings, and the crossover occur locally at each substring. This type of crossover (known as multiple point crossover) is shown below:

A large number of other variations of and modifications to the standard GA have been reported in the literature. For examples, the reader is referred to Chapter 5 in Goldberg (1989) and to the proceedings of the International Conference on Genetic Algorithms (1989-1993).

The general-purpose nature of GAs allows them to be used in many different optimization tasks. As was discussed earlier, an arbitrary optimization problem with objective function y(x) can be mapped onto a GA as long as one can find an appropriate fitness function which is consistent with the optimization task. In addition, one needs to establish a correspondence (an invertible mapping) between the search space in x () and the GA search space () which is typically a space of binary strings. Both of these requirements are possible to satisfy in many optimization problems. For example, one may simply use the function y itself as the fitness function if y(x) is to be maximized, or the fitness function f = -y may be used if y(x) is to be minimized. On the other hand, the mapping between the original search space and the GA space can vary from a simple real-to-binary encoding, to more elaborate encoding schemes. Empirical evidence suggests that different choices/combinations of fitness functions and encoding schemes can have significant effect on the GA's convergence time and solution quality (Bäck, 1993). Unfortunately, theoretical results on the specification of the space to be explored by a GA are lacking (De Jong and Spears, 1993).

8.5.2 Application of Genetic Algorithms to Neural Networks

There are various ways of using GA-based optimization in neural networks. The most obvious way is to use a GA to search the weight space of a neural network with a predefined architecture (Caudell and Dolan, 1989; Miller et al., 1989; Montana and Davis, 1989; Whitley and Hanson, 1989). The use of GA-based learning methods may be justified for learning tasks which require neural nets with hidden units (e.g., nonlinearly separable classification tasks, nonlinear function approximation, etc.), since the GA is capable of global search and is not easily fooled by local minima. Also, GAs are useful for use in nets consisting of units with non-differentiable activation functions (e.g., LTG's), since the fitness function need not be differentiable.

In supervised learning one may readily identify a fitness function as -E where E = E(w) may be the sum of squared error criterion as in Equation (5.1.13) or the entropy criterion of Equation (5.2.16). As for specifying the search space for the GA, the complete set of network weights is coded as a binary string si with associated fitness f(si) = -E(D(si)). Here, D(si) is a decoding transformation.

An example of a simple two-input two-unit feedforward net is shown in Figure 8.5.4. In this example, each weight is coded as a 3-bit signed-binary sub-string where the left most bit encodes the sign of the weight (e.g., 110 represents -2 and 011 represents +3)

Figure 8.5.4. A simple two layer feed-forward net used to illustrate weight coding in a GA.

Now, we may generate an appropriate GA-compatible representation for the net in Figure 8.5.4 as a contiguous sequence of substrings s = (101010001110011) which corresponds to the real valued weight string (w11w12w13w21w22) = (-1, 2, 1, -2, 3). Starting with a random population of such strings (population of random nets), successive generations are constructed using the GA to evolve new strings out of old ones. Thus, strings whose fitness are above average (more specifically, strings which meet the criteria of the Fundamental Theorem of GAs) tend to survive, and ultimately, the population converges to the "fittest" string. This string represents, with a high probability, the optimal weight configuration for the learning task at hand for the predefined net architecture and predefined admissible weight values. It is interesting to note here how the cross-over operation may be interpreted as a swapping mechanism where parts (individual units, group of units, and/or a set of weights) of fit networks are interchanged in hope of producing a network with even higher fitness.

GA's are also able to deal with learning in generally interconnected networks including recurrent nets. Recurrent networks pose special problems for gradient descent learning techniques (refer to Section 5.4) that are not shared by GA's. With gradient descent learning it is generally necessary to correlate causes and effects in the network so that units and weights that cause the desired output are strengthened. But with recurrent networks the cause of a state may have occurred arbitrarily far in the past. On the other hand, the GA evolves weights based on a fitness measure of the whole network (a global performance measure), and the question of what caused any particular network state to occur is considered only in that the resulting state is desirable (Wieland, 1991). This inherent strength of GA's is in some ways also a weakness. By ignoring gradient (or more generally cause and effect) information when it does exist, the GA's can become slow and inefficient. There are also large costs in speed and storage for working with a whole population of networks, which can make standard GA's impractical for evolving optimal designs for large networks.

Thus, when gradient information exists, particularly if it is readily available, one can use such information to speedup the GA search. This leads to hybrid GA/gradient search where a gradient descent step may be included as one of the genetic operators (Montana and Davis, 1989). A more general view of the advantages of the marriage of GA and gradient descent can be seen based on the relationship between evolution and learning. Belew et al. (1990) have demonstrated the complementary nature of evolution and learning: the presence of learning facilitates the process of evolution [see also Smith (1987); Hinton and Nowlan (1987); Nolfi et al. (1990); Keeling and Stork (1991)]. In the context of our discussion, genetic algorithms can be used to provide a model of evolution, and supervised learning (or some other learning paradigm e.g., reinforcement or unsupervised learning) may be used to provide simple but powerful learning mechanisms. Thus, the presence of learning makes evolution much easier; all evolution has to do is to evolve (find) an appropriate initial state of a system, from which learning can do the rest (much like teaching a child who already has an "evolved" potential for learning!). These ideas motivate the use of hybrid learning methods which employ GA and gradient-based searches. A specific example of such a method is presented in the next section.

Potential applications of GA's in the context of neural networks include evolving appropriate network structures and learning parameters (Harp et al., 1989, 1990) which optimizes one or more network performance measure. These measures may include fast response (requires minimizing network size), VLSI hardware implementation compatibility (requires minimizing connectivity), and real-time learning (requires optimizing the learning rate). Another interesting application of GA's is to evolve learning mechanisms (rules) for neural networks. In other words, evolution is recruited to discover a process of learning. Chalmers (1991) reported an interesting experiment where a GA with proper string representation applied to a population of single-layer neural nets evolved the LMS learning rule [Equation (3.1.35)] as an optimal learning rule.

Back to the Table of Contents

Back to Main Menu