Genetic algorithms are global optimization algorithms
based on the mechanics of natural selection and natural genetics.
They employ a structured yet randomized parallel multipoint search
strategy which is biased towards reinforcing search points of
"high fitness"; i.e., points at which the function being
minimized has relatively low values. Genetic algorithms are
similar to simulated annealing (Davis, 1987) in that they employ
random (probabilistic) search strategies. However, one of the
apparent distinguishing features of genetic algorithms is their
effective implementation of parallel multipoint search. This
section presents the fundamentals of genetic algorithms and shows
how they may be used for neural network training.
8.5.1 Fundamentals of Genetic Algorithms
The genetic algorithm (GA) as originally
formulated by Holland (1975) was intended to be used as a modeling
device for organic evolution. Later, De Jong (1975) demonstrated
that the GA may also be used to solve optimization problems, and
that globally optimal results may be produced. Although there
has been a lot of work done on modifications and improvements
to the method, this section will present the standard genetic
algorithm, and the analysis will follow the presentation given
in Goldberg (1983).
In its simplest form, the standard genetic algorithm
is a method of stochastic optimization for discrete programming
problems of the form:
Maximize f(s) (8.5.1)
subject to s ∈ {0, 1}^n
In this case, f : {0, 1}^n → R
is called the fitness function, and the n-dimensional
binary vectors in {0, 1}^n are called strings. The most noticeable
difference between the standard genetic algorithm and the methods
of optimization discussed earlier is that at each stage (iteration)
of the computation, genetic algorithms maintain a collection of
samples from the search space rather than a single point. This
collection of samples is called a population of strings.
To start the genetic search, an initial population
of, say, M binary strings: S(0) = {s1, s2, ..., sM},
each with n bits, is created. Usually, this initial population
is created randomly because it is not known a priori where the
globally optimal strings in {0, 1}^n are likely to be found. If such
information is given, though, it may be used to bias the initial
population towards the most promising regions of {0, 1}^n. From this
initial population, subsequent populations S(1), S(2),
... S(t), ... will be computed employing the
three genetic operators of selection, crossover,
and mutation.
The standard genetic algorithm uses a roulette wheel
method for selection, which is a stochastic version of the survival
of the fittest mechanism. In this method of selection, candidate
strings from the current generation S(t) are selected
to survive to the next generation S(t+1) by designing
a roulette wheel where each string in the population is represented
on the wheel in proportion to its fitness value. So those strings
which have a high fitness are given a large share of the wheel,
while those strings with low fitness are given a relatively small
portion of the roulette wheel. Finally, selections are made by
spinning the roulette wheel M times and accepting as candidates
those strings which are indicated at the completion of the spin.
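This selection scheme is easy to sketch in Python. The sketch below is an illustration, not a prescribed implementation; it uses random.choices, which draws with probability proportional to the supplied weights, to perform the M spins:

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: M spins of a wheel on which each
    string's share is f(si) divided by the total population fitness."""
    weights = [fitness(s) for s in population]
    # random.choices normalizes the weights internally, so each spin
    # lands on string si with probability f(si) / sum_j f(sj).
    return random.choices(population, weights=weights, k=len(population))
```

Note that selection is done with replacement: a highly fit string may survive several times into the candidate pool.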
Example 8.5.1: As an example,
suppose M = 5, and consider the following initial population
of strings: S(0) = {(10110), (11000), (11110),
(01001), (00110)}. For each string si
in the population, the fitness may be evaluated: f(si).
The appropriate share of the roulette wheel to allot the ith
string is obtained by dividing the fitness of the ith string
by the sum of the fitnesses of the entire population: f(si) / [f(s1) + f(s2) + ... + f(sM)].
Figure 8.5.1 shows a listing of the population with associated
fitness values, and the corresponding roulette wheel.
Figure 8.5.1. (a) A listing of the 5-string population
and the associated fitness values. (b) Corresponding roulette
wheel for string selection. The integers shown on the roulette
wheel correspond to string labels.
To compute the next population of strings, the roulette
wheel is spun five times. The strings which are chosen from this
method of selection, though, are only candidate strings for the
next population. Before actually being copied into the new population,
these strings must undergo crossover and mutation.
Pairs of the M (assume M even) candidate
strings which have survived selection are next chosen for crossover,
which is a recombination mechanism. The probability that the
crossover operator is applied will be denoted by Pc.
Pairs of strings are selected randomly from S(t),
without replacement, for crossover. A random integer k,
called the crossing site, is chosen from {1, 2, ..., n − 1},
and then the bits from the two chosen strings are swapped after
the kth bit with probability Pc;
this is repeated until S(t) is empty. For example,
Figure 8.5.2 illustrates a crossover for two 6-bit strings. In
this case, the crossing site k is 4, so the bits from the
two strings are swapped after the fourth bit.
Figure 8.5.2. An example of a crossover for two 6-bit
strings. (a) Two strings are selected for crossover. (b) A crossover
site is selected at random. In this case k = 4. (c) Now
swap the two strings after the kth bit.
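A minimal Python sketch of this operator follows. The optional k argument fixes the crossing site so the case in Figure 8.5.2 can be reproduced; Pc = 0.8 is an assumed default:

```python
import random

def crossover(s1, s2, Pc=0.8, k=None):
    """Single-point crossover of two equal-length bit-strings.

    With probability Pc, swap all bits after the crossing site k
    (drawn from {1, ..., n-1} when not supplied); otherwise the
    strings pass through unchanged. A sketch of the operator
    described above, not a prescribed implementation.
    """
    if random.random() >= Pc:
        return s1, s2                       # no crossover this time
    n = len(s1)
    if k is None:
        k = random.randint(1, n - 1)        # crossing site
    return s1[:k] + s2[k:], s2[:k] + s1[k:]
```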
Finally, after crossover, mutation is applied to
the candidate strings. The mutation operator is a stochastic
bitwise complementation, applied with uniform probability Pm.
That is, for each single bit in the population, the value of
the bit is flipped from 0 to 1 or from 1 to 0 with probability
Pm. As
an example, suppose Pm
= 0.1, and the string s = 11100 is to undergo mutation.
The easiest way to determine which bits, if any, to flip is to
choose a uniform random number r ∈ [0, 1]
for each bit in the string. If r < Pm,
then the bit is flipped; otherwise no action is taken. For the
string s above, suppose the random numbers (0.91, 0.43,
0.03, 0.67, 0.29) were generated. Only the third random number
falls below Pm, so the third bit is flipped, giving the mutated
string s′ = 11000.
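The bit-by-bit procedure can be sketched as follows; the optional rs argument supplies the per-bit random numbers so the worked example can be reproduced exactly:

```python
import random

def mutate(s, Pm=0.1, rs=None):
    """Bitwise mutation: flip each bit independently with probability Pm.

    rs optionally supplies the uniform random numbers, one per bit;
    when omitted they are drawn fresh. A sketch of the operator
    described above.
    """
    if rs is None:
        rs = [random.random() for _ in s]
    return "".join(("1" if bit == "0" else "0") if r < Pm else bit
                   for bit, r in zip(s, rs))
```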
After mutation, the candidate strings are copied
into the new population of strings: S(t+1), and
the whole process is repeated by calculating the fitness of each
string, using a roulette wheel method of selection, and applying
the operators of crossover and mutation.
Although the next section presents an analysis
of the action of the genetic operators, some qualitative
comments may be helpful first. In the roulette wheel method of
selection, it is clear that only the above-average strings will
tend to survive successive populations. Applying only selection
to a population of strings results in a sort of MAX operation,
that is, the operation of selecting the maximum from a set of
numbers. It has been shown (Suter and Kabrisky, 1992) that a
MAX operation may be implemented by successive application of
a theorem by Hardy et al. (1952), which states that for a set
of nonnegative real numbers Q ⊂ R,
min(Q) < ave(Q) < max(Q)
(8.5.2)
(where it is assumed that not all of the elements
of Q are equal). So the maximum value may be obtained by
simply averaging the elements of Q and excluding all elements
which are below average. The remaining subset may be averaged,
and the process repeated until only the maximum value survives.
The roulette wheel method of selection may be thought of as a
"soft" stochastic version of this MAX operation, where
strings with above-average strengths tend to survive successive
roulette wheel spins.
The reason that the stochastic version is used,
rather than just deterministically always choosing the best strings
to survive, gets at the crux of the underlying theory and assumptions
of genetic search. This theory is based on the notion that even
strings with very low fitness may contain some partial
information useful to the search. For this reason, though their
survival probability is small, these low-fitness strings are not
altogether discarded during the search.
Of the three genetic operators, the crossover operator
is the most crucial in obtaining global results. Crossover is
responsible for mixing the partial information contained in the
strings of the population. In the next section, it will be conjectured
that this type of mixing will lead to the formation of optimal
strings. Based on empirical evidence, it has been found (De Jong,
1975; Grefenstette, 1986; Schaffer et al., 1989) that reasonable
values for the probability of crossover are 0.6 ≤ Pc ≤ 0.99.
Unlike the previous two operators which are used
to fully exploit and possibly improve the structure of the best
strings in the current population, the purpose of the mutation
operator is to diversify the search and introduce new strings
into the population in order to fully explore the search space.
The creation of these new strings is usually required because
of the vast differences between the number of strings in the population,
M, and the total number of strings in the entire search
space {0, 1}^n, which is 2^n. Typically,
M is chosen to be orders of magnitude smaller than 2^n,
so by selecting and recombining (crossing) the M strings
in a given population, only a fraction of the total search space
is explored. So mutation forces diversity in the population
and allows more of the search space to be sampled, thus allowing
the search to overcome local minima.
Mutation, though, is a double-edged sword.
Applying mutation too frequently will result in destroying the
highly fit strings in the population, which may slow and impede
convergence to a solution. Hence, although necessary, mutation
is usually applied with a small probability. Empirically, it
has been found (De Jong, 1975; Grefenstette, 1986; Schaffer et
al., 1989) that reasonable values for the probability of mutation
are 0.001 ≤ Pm ≤ 0.01. Bäck (1993) presented a theoretical
analysis where he showed that Pm = 1/n
is the best choice when the fitness function f is unimodal.
However, for a multimodal fitness function, Bäck shows that
a dynamic mutation rate may overcome local minima, whereas a fixed
mutation rate may not. The dynamic mutation rate may be implemented
by following a schedule where Pm
is slowly decreased towards 1/n from an initial
value Pm(0),
such that 1 > Pm(0) >> 1/n.
This is analogous to decreasing the temperature in simulated
annealing. Davis and Principe (1993) showed that the asymptotic
convergence of a GA with a suitable mutation probability schedule
can be faster than that of simulated annealing.
The Fundamental Theorem of Genetic Algorithms
We have just described the mechanisms of the standard
genetic algorithm. Later, we will demonstrate by example that
the GA actually works, i.e., global solutions to multimodal functions
may be obtained. The question here is: Why does the standard
GA work? Surprisingly, although the literature abounds with applications
which demonstrate the effectiveness of GA search, the underlying
theory is far less understood. The theoretical basis which has
been established thus far (Goldberg, 1983; Thierens and Goldberg,
1993), though, will be described next.
To analyze the convergence properties of the GA,
it is useful to define the notion of schema (plural, schemata).
A schema H is a structured subset of the search space
{0, 1}^n. The structure in H is provided by string similarities
at certain fixed positions of all the strings in H. The
string positions which aren't fixed in a given schema are usually
denoted by *.
For example, the schema H = *11*0
is the collection of all 5-bit binary strings which contain a
1 in the second and third string positions, and a 0 in the last
position, that is,
H = {01100, 01110, 11100, 11110}
In total, there are 3^n
different schemata possible: all combinations of the symbols {0,
1, *}. Since
there are only 2^n
different strings in {0, 1}^n, it is clear that a given string will
belong to several different schemata. More precisely, each string
belongs to 2^n
different schemata, since each of its n positions may either keep
its bit value or be replaced by *.
To prove the fundamental theorem of genetic algorithms,
it is necessary to investigate the effect that selection, crossover,
and mutation has on a typical population of strings. More generally,
it is possible to determine the effect of these genetic operators
on the schemata of a typical population. The following notation
will be useful: The order of a schema H is the number
of fixed positions over the strings in H, and the defining
length of a schema is the distance between the first and last
fixed positions of the schema. The order and defining length
are denoted by o(H) and δ(H), respectively.
For example, for the schema H = *11**0,
o(H) = 3 because there are 3 fixed positions
in H, and δ(H) = 6 −
2 = 4, because 2 and 6 are the indices of the first and last fixed
string positions in H, respectively.
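These two schema properties are straightforward to compute; a small Python sketch (positions are indexed from 1, matching the convention used in the text):

```python
def order(H):
    """o(H): the number of fixed (non-*) positions in schema H."""
    return sum(1 for c in H if c != "*")

def defining_length(H):
    """delta(H): distance between the first and last fixed positions."""
    fixed = [i for i, c in enumerate(H, start=1) if c != "*"]
    return fixed[-1] - fixed[0] if fixed else 0
```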
Consider S(t), the population of strings
at time t, and consider the collection of schemata which
contain one or more of the strings in this population. For each
such schema H, denote by m(H, t) the
number of strings in the population at time t which are
also in H. We want to study the long-term behavior of
m(H, t) for those schemata H which
contain highly fit strings.
Using the roulette wheel method of selection outlined
in the previous section, the expected number of strings of H
in S(t + 1), given the quantity m(H, t), is easily
seen to be:
m(H, t + 1) = m(H, t) f(H)/f̄   (8.5.3)
where f(H) is the average fitness of
the strings of H in S(t), and f̄
is the average fitness of all the strings in the population at
time t. Assuming f̄ remains relatively
constant, the above equation has the form of a linear difference
equation: x(t + 1) = ax(t). The solution
of this equation is well known and given by: x(t) = a^t x(0),
t = 0, 1, 2, ..., which explodes if a > 1 and
decays if a < 1. By comparing with Equation (8.5.3),
we see that the number of strings in the population represented
by H is expected to grow exponentially if f(H) > f̄,
i.e., if the average fitness of the schema is higher than the
average fitness of the entire population. Conversely, the number
of strings in the population represented by H will decay
exponentially if f(H) < f̄.
Now consider the effect of crossover on a schema
H with m(H, t) samples in the population
at time t. If a string belonging to H is selected
for crossover, one of two possibilities may occur: (1) The crossover
preserves the structure of H; in this case, the schema
is said to have survived crossover, or (2) The crossover
destroys the structure, and hence the resulting crossed string
will not belong to H at time t + 1. It's easy to
see by example which schema are likely to survive crossover.
Consider the two 6-bit schemata A = 1***1* and B = *01***.
Claim: Schema A will not survive crossover
if the cross site k is 1, 2, 3, or 4. To see this, just
take a representative example from A, say 100011. Making
the reasonable assumption that the mating string is not identical
to the example string at precisely the fixed string positions
of A (i.e., the first and fifth positions), then upon crossover
with cross site 1, 2, 3, or 4, the fixed 1 at the fifth string
position will be lost, and the resulting string will not belong
to A.
On the other hand, schema B may be crossed
at sites 1, 3, 4, and 5 and still preserve the structure of B,
because, in this case, the 01 fixed positions will lie on the
same side of the crossing site and will be copied into the resulting
string. The only crossing site which will destroy the structure
of schema B would be k = 2.
By noticing the difference in defining length for
these two schemata: δ(A) = 4 and δ(B) = 1,
the following conclusion may be made: A schema survives crossover
when the cross site is chosen outside its defining length. Hence,
the probability that a schema H will survive crossover
is given by 1 − δ(H)/(n − 1). But since the crossover
operator is only applied with probability Pc,
the following quantity is a lower bound for the crossover survival
probability: 1 − Pc δ(H)/(n − 1).
Finally, the mutation operator destroys a schema only
if it is applied to the fixed positions in the schema. Hence
the probability that a schema H will survive mutation is
given by (1 − Pm)^o(H).
For small values of Pm,
the binomial theorem may be employed to obtain the approximation:
(1 − Pm)^o(H) ≈ 1 − o(H)Pm.
The Fundamental Theorem of Genetic Algorithms may now
be given.
Theorem. (Goldberg,
1983) By using the selection, crossover, and mutation of the standard
genetic algorithm, short, low-order, and above-average schemata
receive exponentially increasing trials in subsequent populations.
Proof. Since the operations
of selection, crossover, and mutation are applied independently,
then the probability that a schema H will survive to the
next generation may be obtained by a simple multiplication of
the survival probabilities derived above:
m(H, t + 1) ≥ m(H, t) [f(H)/f̄] [1 − Pc δ(H)/(n − 1)] (1 − Pm)^o(H)   (8.5.4)
By neglecting the cross product terms, the desired
result is obtained:
m(H, t + 1) ≥ m(H, t) [f(H)/f̄] [1 − Pc δ(H)/(n − 1) − o(H)Pm]   (8.5.5)
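As a numeric illustration, the right-hand side of the bound (8.5.5) is easy to evaluate; all argument values used in the sketch below are hypothetical:

```python
def schema_bound(m, f_H, f_avg, delta, o, n, Pc, Pm):
    """Lower bound of the form (8.5.5) on the expected number of
    strings of a schema H in the next generation.

    m: current count m(H, t); f_H: average fitness of H's strings;
    f_avg: population-average fitness; delta, o: defining length and
    order of H; n: string length; Pc, Pm: crossover/mutation rates.
    """
    return m * (f_H / f_avg) * (1 - Pc * delta / (n - 1) - o * Pm)
```

Note how a short, low-order schema with above-average fitness (small delta and o, f_H > f_avg) yields a bound exceeding m, i.e., the schema is expected to gain representatives.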
The short, low-order, and above-average schemata
are called building blocks, and the Fundamental Theorem
indicates that building blocks are expected to dominate the population.
Is this good or bad in terms of the original goal of function
optimization? The above theorem does not answer this question.
Rather, the connection between the Fundamental Theorem and the
observed optimizing properties of the genetic algorithm is provided
by the following conjecture:
The Building Block Hypothesis.
The globally optimal strings in {0, 1}^n may be partitioned into substrings
which are given by the bits of the fixed positions of building
blocks.
Stated another way, the hypothesis is that the partial
information contained in each of the building blocks may be combined
to obtain globally optimal strings. If this hypothesis is correct,
then the Fundamental Theorem implies that the GA is doing the
right thing in allocating an exponentially increasing number of
trials to the building blocks, because some arrangement of the
building blocks is likely to produce a globally optimal string.
Unfortunately, although the building block hypothesis
seems reasonable enough, it does not always hold true. Cases
where the hypothesis fails can be constructed. It is believed
(Goldberg, 1983), though, that such cases are of "needle
in the haystack" type, where the globally optimal strings
are surrounded (in a Hamming distance sense) by the worst strings
in {0, 1}^n. Such problems are called GA-deceptive because, by
following the building block hypothesis, the GA is led away from
the globally optimal solutions rather than towards them. Current
trends in GA research (Kuo and Hwang, 1993; Qi and Palmieri, 1993)
include modifying the standard genetic operators in order to enable
the GA to solve such "needle in the haystack" problems,
and hence shrink in size the class of GA-deceptive problems.
The above analysis is based entirely on the schemata
in the population rather than the actual strings in the population.
The GA, though, processes strings, not schemata. This type of
duality is called implicit parallelism by Holland (1975).
The implicit parallelism notion means that a larger amount of
information is obtained and processed at each generation by the
GA than would appear by simply looking at the processing of the
M strings. This additional information comes from the
number of schema that the GA is processing per generation. The
next question is: how many schemata are actually processed per generation
by the GA? Clearly, in every population of M strings,
there are between 2^n
and M·2^n
schemata present (if all the strings in the population are the same,
then there are 2^n
schemata; if all the strings are different, there may be as many as
M·2^n
schemata). Because the selection, crossover, and mutation operations
tend to favor certain schemata, not all of the schemata in the
population will be processed by the GA. Holland (1975) estimated
that O(M^3)
schemata per generation are actually processed in a useful manner
(see also Goldberg, 1989). Hence, implicit parallelism implies
that by processing M strings, the GA actually processes
O(M^3) schemata
for free!
To apply the standard genetic algorithm to an arbitrary
optimization problem of the form:
Minimize y(x), x ∈ Ω   (8.5.6)
it is necessary to establish the following:
1. A correspondence between the search space Ω and
some space of binary strings {0, 1}^n. That is, an invertible mapping
of the form D: Ω → {0, 1}^n.
2. An appropriate fitness function f(s),
such that the maximizers of f correspond to the minimizers
of y.
This situation is shown schematically in Figure 8.5.3.
Figure 8.5.3. A schematic representation of the process
of matching an optimization problem with the genetic algorithm
framework.
Example 8.5.2: As an example
of the solution process, consider the function shown in Figure
8.1.1, and the following optimization problem:
Minimize y(x), x ∈ [0.05, 0.5]   (8.5.7)
This is the same function considered earlier in Example
8.1.1 and plotted in Figure 8.1.1. This function has two local
minima at x ≈ 0.058 and x ≈ 0.091,
as well as a global minimum at x* ≈ 0.223.
The standard genetic algorithm will be used to obtain the global
minimum of this function.
The first thing to notice is that the search space
here is real-valued: Ω = [0.05, 0.5]. As mentioned above, some
transformation is needed in order to encode/decode the real-valued
search space into some space of binary strings. In this example,
a binary search space consisting of six-bit strings, i.e., {0,
1}^6, was used, with the
decoding transformation given by
D(s) = 0.05 + [d(s)/63](0.5 − 0.05)   (8.5.8)
where d(s) is the ordinary decimal
representation of the 6-bit binary string s. For example,
the decimal representation of 000011 is 3, so D(000011)
= 0.071; as would be expected, the two endpoints are mapped in
the following way: D(000000) = 0.05 and D(111111)
= 0.5.
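This decoding can be sketched in Python; the sketch generalizes to any string length, and the constants 0.05 and 0.5 are the interval endpoints from the example:

```python
def decode(s, lo=0.05, hi=0.5):
    """Map a bit-string s to the interval [lo, hi]:
    D(s) = lo + d(s) * (hi - lo) / (2**len(s) - 1),
    where d(s) is the ordinary decimal value of s."""
    return lo + int(s, 2) * (hi - lo) / (2 ** len(s) - 1)
```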
To establish an appropriate fitness function for
this problem, recall that the problem is to minimize y(x),
but maximize the fitness function f(s). So some
sort of inverting transformation is required here. In this example,
the following fitness function is used:
(8.5.9)
where z = D(s).
Before applying the standard genetic algorithm,
values for M, Pc,
and Pm must
be chosen. The values for these parameters are usually determined
empirically by running some trial simulations. However, one may
first try to choose Pc
and Pm in
the ranges 0.6 ≤ Pc ≤ 0.99
and 0.001 ≤ Pm ≤ 0.01,
respectively, as mentioned earlier. As for the value of M,
empirical results suggest that n ≤ M ≤ 2n
is a good choice (Alander, 1992; see also Reeves, 1993). Besides
the above parameters, some stopping or convergence criterion is
required. Although there are several different convergence criteria
which may be used, the criterion used here is to stop when the
population is sufficiently dominated by a single string. In this
case, convergence is obtained when a single string comprises 80
percent or more of the population.
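Putting the pieces together, the whole procedure can be sketched as below. This is an illustration, not the book's implementation: the fitness inversion 1/(1 + y) is an assumed choice standing in for Equation (8.5.9), and the parameter defaults follow the ranges quoted above.

```python
import random
from collections import Counter

def ga_minimize(y, n=6, M=10, Pc=0.8, Pm=0.01, lo=0.05, hi=0.5,
                max_gen=200, dominance=0.8):
    """Standard-GA sketch for minimizing a nonnegative y over [lo, hi].

    Fitness inverts the objective via 1/(1 + y), an assumed choice.
    Stops when a single string makes up `dominance` of the population
    (the 80-percent criterion above) or after max_gen generations.
    """
    decode = lambda s: lo + int(s, 2) * (hi - lo) / (2 ** n - 1)
    fitness = lambda s: 1.0 / (1.0 + y(decode(s)))
    pop = ["".join(random.choice("01") for _ in range(n)) for _ in range(M)]
    for _ in range(max_gen):
        best, count = Counter(pop).most_common(1)[0]
        if count >= dominance * M:          # convergence criterion
            break
        # selection: M roulette-wheel spins
        w = [fitness(s) for s in pop]
        pop = random.choices(pop, weights=w, k=M)
        # crossover: pair candidates and swap tails with probability Pc
        random.shuffle(pop)
        nxt = []
        for a, b in zip(pop[0::2], pop[1::2]):
            if random.random() < Pc:
                k = random.randint(1, n - 1)
                a, b = a[:k] + b[k:], b[:k] + a[k:]
            nxt += [a, b]
        # mutation: flip each bit with probability Pm
        pop = ["".join(bit if random.random() >= Pm else "10"[int(bit)]
                       for bit in s) for s in nxt]
    return decode(Counter(pop).most_common(1)[0][0])
```

With a sharply peaked objective, repeated runs typically converge near the minimizer, though any single run may stall on a local solution, which is why mutation matters.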
Two simulations will be described below. In the
first simulation, an initial population of strings was generated
uniformly over the search space, and then, as usual, the genetic
operators of selection, crossover, and mutation were applied until
the convergence criterion was met. The following parameters were
used: M = 10, Pc = 0.8,
and Pm = 0.01,
and the results of this simulation are shown in Table 8.5.1.
In this table, a listing of the population is shown for the generations
at times t = 0, 1, 2, and 3. The decoded real value for
each string is also shown, as well as the associated fitness values.
The number in parentheses beside each string shows the
multiplicity of the string in the total population
of 10 strings. Notice how the population converges to populations
dominated by highly fit strings. After the fourth iteration (t
= 4), the population is dominated by the string s*
= 011001, and the population has converged. The string s*
is decoded to the value x* = 0.23,
which is close to the globally optimal solution of x* ≈ 0.223.
Note that better accuracy may be obtained by using a more accurate
encoding of the real-valued search space; that is, by using a
GA search space with strings of higher dimension, for example,
{0, 1}^10
or {0, 1}^20,
depending on the accuracy required.
Table 8.5.1. A listing of the first four populations in the first
simulation for Example 8.5.2. The numbers in parentheses show
the multiplicity of the string in the total population of 10 strings.
The second simulation of this problem demonstrates a case where
mutation is necessary to obtain the global solution. In this
simulation, the initial population was created with all strings
near the left endpoint x = 0.05. The following
parameters were used here: M = 10, Pc
= 0.9, and Pm
= 0.05. The increased mutation and crossover rates were used
to encourage diversity in the population. This helps the genetic
algorithm branch out to explore the entire space. In fact, if
mutation were not used, i.e., Pm
= 0, then the global solution could never be found by the GA.
This is because the initial population is dominated by the schema
0000**,
which is not a desirable building block because the fitness of
this schema is relatively small. Applying selection and crossover
alone would not help because no new schemata would be generated. The results
of the simulation are shown in Table 8.5.2. This time, the GA
took 44 iterations to converge to the solution s*
= 011000, with corresponding realvalue: x* = 0.219.
Table 8.5.2. A listing of the population at various stages of
the computation for the second simulation of Example 8.5.2. In
this simulation, the initial population of strings was concentrated
at the left endpoint of the search space Ω.
Although the previous example used a one-dimensional objective
function, multidimensional objective functions y: R^n → R
may also be mapped onto the genetic algorithm framework by simply
extending the length of the binary strings to represent each
component of the points x = (x1,
..., xn)
in Ω. That is, each string s will consist of n substrings
s = (s1,
..., sn),
where si
is the binary encoding for the ith component of x.
A decoding transformation may then be applied to each substring
separately: D(s) = (D1(s1),
..., Dn(sn)).
Although not necessary, decoding each component separately might
be desirable in certain applications. For example, suppose x = (x1,
x2), and Ω = [0,
1] × [0, 100]. To obtain the same level of accuracy for
the two variables, more bits would have to be allotted to the
substring representing the second component of x, because
it has a much larger range of values than the first component.
Hence, in this case, the decoding transformations D1(s1)
and D2(s2)
would be different.
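Such component-wise decoding can be sketched as follows; the helper name and argument layout are hypothetical:

```python
def decode_components(s, ranges, bits):
    """Decode a concatenated string s = (s1, ..., sn) component-wise.

    ranges: list of (lo, hi) intervals, one per component;
    bits: number of bits allotted to each substring si.
    Each substring is decoded by its own transformation D_i, so
    components with larger ranges can be given more bits.
    """
    x, pos = [], 0
    for (lo, hi), b in zip(ranges, bits):
        d = int(s[pos:pos + b], 2)
        x.append(lo + d * (hi - lo) / (2 ** b - 1))
        pos += b
    return x
```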
The crossover operator may also be slightly modified
to exploit the structure of the substring decomposition of s.
Instead of choosing a single crossing site over the entire string
s = (s1,
s2, s3,
s4),
separate crossing sites may be chosen for each of the substrings,
and the crossover occurs locally at each substring. This type
of crossover is known as multiple point crossover.
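A sketch of multiple point crossover over the substring decomposition; the per-substring site choice mirrors the single-point operator, and the details are an assumption:

```python
import random

def multipoint_crossover(s1, s2, bits):
    """Cross each substring pair at its own random site.

    s1, s2: concatenations of substrings whose lengths are given in
    `bits`. Each substring gets an independent crossing site in
    {1, ..., b-1}; one-bit substrings pass through unchanged.
    """
    c1, c2, pos = "", "", 0
    for b in bits:
        a, bb = s1[pos:pos + b], s2[pos:pos + b]
        k = random.randint(1, b - 1) if b > 1 else 1
        c1 += a[:k] + bb[k:]
        c2 += bb[:k] + a[k:]
        pos += b
    return c1, c2
```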
A large number of other variations of and modifications
to the standard GA have been reported in the literature. For
examples, the reader is referred to Chapter 5 in Goldberg (1989)
and to the proceedings of the International Conference on Genetic
Algorithms (1989-1993).
The generalpurpose nature of GAs allows them to
be used in many different optimization tasks. As was discussed
earlier, an arbitrary optimization problem with objective function
y(x) can be mapped onto a GA as long as one can
find an appropriate fitness function which is consistent with
the optimization task. In addition, one needs to establish a
correspondence (an invertible mapping) between the search space
in x (Ω) and the GA search space ({0, 1}^n), which is typically a
space of binary strings. Both of these requirements are possible
to satisfy in many optimization problems. For example, one may
simply use the function y itself as the fitness function
if y(x) is to be maximized, or the fitness function
f = −y
may be used if y(x) is to be minimized. On the
other hand, the mapping between the original search space and
the GA space can vary from a simple realtobinary encoding, to
more elaborate encoding schemes. Empirical evidence suggests
that different choices/combinations of fitness functions and encoding
schemes can have significant effect on the GA's convergence time
and solution quality (Bäck, 1993). Unfortunately, theoretical
results on the specification of the space to be explored by a
GA are lacking (De Jong and Spears, 1993).
8.5.2 Application of Genetic Algorithms to Neural Networks
Figure 8.5.4. A simple two-layer feedforward net
used to illustrate weight coding in a GA.
Now, we may generate an appropriate GA-compatible
representation for the net in Figure 8.5.4 as a contiguous sequence
of substrings s = (101010001110011), which, under a three-bit
sign-magnitude code for each weight (first bit the sign, remaining
bits the magnitude), corresponds to the real-valued weight string
(w11, w12, w13, w21, w22) = (−1, 2, 1, −2, 3).
Starting with a random population of such strings (population
of random nets), successive generations are constructed using
the GA to evolve new strings out of old ones. Thus, strings whose
fitness are above average (more specifically, strings which meet
the criteria of the Fundamental Theorem of GAs) tend to survive,
and ultimately, the population converges to the "fittest"
string. This string represents, with a high probability, the
optimal weight configuration for the learning task at hand for
the predefined net architecture and predefined admissible weight
values. It is interesting to note here how the crossover operation
may be interpreted as a swapping mechanism where parts (individual
units, group of units, and/or a set of weights) of fit networks
are interchanged in hope of producing a network with even higher
fitness.
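For concreteness, the weight decoding implied by the example string can be sketched as follows; the three-bit sign-magnitude code (first bit sign, remaining bits magnitude) is an assumption consistent with the string shown above:

```python
def decode_weights(s, bits=3):
    """Decode a weight string into real-valued weights.

    Each `bits`-bit substring is read as sign-magnitude: the first
    bit gives the sign (1 = negative), the remaining bits the
    integer magnitude. The coding is an assumed convention, not a
    standard prescribed by the GA itself.
    """
    w = []
    for i in range(0, len(s), bits):
        sub = s[i:i + bits]
        mag = int(sub[1:], 2)
        w.append(-mag if sub[0] == "1" else mag)
    return w
```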
GA's are also able to deal with learning in generally
interconnected networks including recurrent nets. Recurrent networks
pose special problems for gradient descent learning techniques
(refer to Section 5.4) that are not shared by GA's. With gradient
descent learning it is generally necessary to correlate causes
and effects in the network so that units and weights that cause
the desired output are strengthened. But with recurrent networks
the cause of a state may have occurred arbitrarily far in the
past. On the other hand, the GA evolves weights based on a fitness
measure of the whole network (a global performance measure), and
the question of what caused any particular network state to occur
is considered only in that the resulting state is desirable (Wieland,
1991). This inherent strength of GA's is in some ways also a
weakness. By ignoring gradient (or more generally cause and effect)
information when it does exist, the GA's can become slow and inefficient.
There are also large costs in speed and storage for working with
a whole population of networks, which can make standard GA's impractical
for evolving optimal designs for large networks.
Thus, when gradient information exists, particularly
if it is readily available, one can use such information to speed
up the GA search. This leads to hybrid GA/gradient search where
a gradient descent step may be included as one of the genetic
operators (Montana and Davis, 1989). A more general view of the
advantages of the marriage of GA and gradient descent can be seen
based on the relationship between evolution and learning. Belew
et al. (1990) have demonstrated the complementary nature of evolution
and learning: the presence of learning facilitates the process
of evolution [see also Smith (1987); Hinton and Nowlan (1987);
Nolfi et al. (1990); Keesing and Stork (1991)]. In the context
of our discussion, genetic algorithms can be used to provide a
model of evolution, and supervised learning (or some other learning
paradigm e.g., reinforcement or unsupervised learning) may be
used to provide simple but powerful learning mechanisms. Thus,
the presence of learning makes evolution much easier; all evolution
has to do is to evolve (find) an appropriate initial state of
a system, from which learning can do the rest (much like teaching
a child who already has an "evolved" potential for learning!).
These ideas motivate the use of hybrid learning methods which
employ GA and gradientbased searches. A specific example of
such a method is presented in the next section.
Potential applications of GA's in the context of
neural networks include evolving appropriate network structures
and learning parameters (Harp et al., 1989, 1990) which optimize
one or more network performance measures. These measures may include
fast response (requires minimizing network size), VLSI hardware
implementation compatibility (requires minimizing connectivity),
and real-time learning (requires optimizing the learning rate).
Another interesting application of GA's is to evolve learning
mechanisms (rules) for neural networks. In other words, evolution
is recruited to discover a process of learning. Chalmers (1991)
reported an interesting experiment where a GA with proper string
representation applied to a population of single-layer neural
nets evolved the LMS learning rule [Equation (3.1.35)] as an optimal
learning rule.