Learning with backprop is slow (Sutton, 1986; Huang and Lippmann, 1988). This slowness stems from the characteristics of the error surface, which contains numerous flat and steep regions as well as many troughs that are flat in the direction of search. These characteristics are particularly pronounced in classification problems, especially when the size of the training set is small (Hush et al., 1991).

Many enhancements of and variations to backprop
have been proposed. These are mostly heuristic modifications
with goals of increased speed of convergence,
avoidance of local
minima, and/or improvement in the network's ability to generalize.
In this section, we present some common heuristics which may
improve these aspects of backprop learning in multilayer feedforward
neural networks.

Due to its gradient descent nature, backprop is very sensitive to initial conditions. If the choice of the initial weight vector $\mathbf{w}_0$ (here **w** is a point in the weight space being searched by backprop) happens to be located within the attraction basin of a strong local-minimum attractor (one where the minimum lies at the bottom of a steep-sided valley of the criterion/error surface), then the convergence of backprop will be fast, and the solution quality will be determined by the depth of that valley relative to the depth of the global minimum. On the other hand, backprop converges very slowly if $\mathbf{w}_0$ starts the search in a relatively flat region of the error surface.

An alternative explanation for the sensitivity of
backprop to initial weights (as well as to other learning parameters)
is advanced by Kolen and Pollack
(1991). Using Monte Carlo simulations
on simple feedforward nets with incremental backprop learning
of simple functions, they discovered a complex fractal-like structure
for convergence as a function of initial weights. They reported
regions of high sensitivity in the weight space where two very
close initial points can lead to substantially different learning
curves. Thus, they hypothesize that these fractal-like structures
arise in backprop due to the nonlinear nature of the dynamic learning
equations which exhibit multiple attractors; rather than the gradient
descent metaphor with local valleys to get stuck in, they advance
a many-body metaphor where the search trajectory is determined
by complex interactions with the system's attractors.

In practice, the weights are normally initialized to small zero-mean random values (Rumelhart et al., 1986b). The motivation for starting from small weights is that large weights tend to prematurely saturate units in a network and render them insensitive to the learning process (Hush et al., 1991; Lee et al., 1991); this phenomenon, known as the "flat spot," is considered in Section 5.2.4. On the other hand, randomness is introduced as a symmetry-breaking mechanism; it prevents units from adopting similar functions and becoming redundant.

A sensible strategy for choosing the magnitudes of the initial weights so as to avoid premature saturation is to choose them such that an arbitrary unit *i* starts with a small and random weighted sum $net_i$. This may be achieved by setting the initial weights of unit *i* to be on the order of $1/\sqrt{f_i}$, where $f_i$ is the number of inputs (fan-in) for unit *i* (Wessels and Barnard, 1992). It can be easily shown that for zero-mean random uniform weights in $[-a, +a]$, and assuming normalized inputs which are randomly and uniformly distributed in the range $[-1, +1]$, $net_i$ has zero mean and standard deviation $\sigma = a\sqrt{f_i}/3$. Thus, by generating uniform random weights within the range $[-3/\sqrt{f_i}, +3/\sqrt{f_i}]$, the input $net_i$ to unit *i* is a random variable with zero mean and a standard deviation of unity, as desired.
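A minimal sketch of this initialization in Python/NumPy (the layer sizes and random seeds below are arbitrary illustrative choices, not from the text):

```python
import numpy as np

def fan_in_init(n_inputs, n_units, rng=np.random.default_rng(0)):
    """Initialize an n_units x n_inputs weight matrix with zero-mean
    uniform weights in [-3/sqrt(fan_in), +3/sqrt(fan_in)], so each unit's
    weighted sum net_i has roughly zero mean and unit standard deviation
    for inputs uniformly distributed in [-1, +1]."""
    a = 3.0 / np.sqrt(n_inputs)      # half-width of the uniform range
    return rng.uniform(-a, a, size=(n_units, n_inputs))

W = fan_in_init(n_inputs=100, n_units=10)
x = np.random.default_rng(1).uniform(-1, 1, size=100)  # normalized input
net = W @ x      # each component has ~zero mean and ~unit std deviation
print(net.std())  # roughly 1 (up to sampling noise over 10 units)
```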

In simulations involving single hidden layer feedforward
networks for pattern classification and function approximation
tasks, substantial improvements in backprop convergence speed
and avoidance of "bad" local minima are possible by
initializing the hidden unit weight vectors to normalized vectors
selected randomly from the training set (Denoeux and Lengellé,
1993).

The convergence speed of backprop is directly related to the learning rate parameter $\rho$ ($\rho_o$ and $\rho_h$ in Equations (5.1.2) and (5.1.9), respectively). If $\rho$ is small, the search path will closely approximate the gradient path, but convergence will be very slow due to the large number of update steps needed to reach a local minimum. On the other hand, if $\rho$ is large, convergence will initially be very fast, but the algorithm will eventually oscillate and thus fail to reach a minimum. In general, it is desirable to have large steps when the search point is far away from a minimum, with a decreasing step size as the search approaches a minimum. In this section, we give a sample of the various approaches for selecting the proper learning rate.

One early proposed heuristic (Plaut et al., 1986) is to use constant learning rates which are inversely proportional to the fan-in of the corresponding units. Extensions of the idea of fan-in dependence of the learning rate have also been proposed by Tesauro and Janssens (1988). The increased convergence speed obtained by setting the learning rate of each unit inversely proportional to the number of inputs to that unit has been theoretically justified by an analysis of the eigenvalue distribution of the Hessian matrix of the criterion function (Le Cun et al., 1991a). Such learning rate normalization can be intuitively thought of as maintaining balance between the learning speeds of units with different fan-in. Without this normalization, after each learning iteration, units with high fan-in have their input activity (*net*) changed by a larger amount than units with low fan-in. Thus, due to the nature of the sigmoidal activation function used, the units with large fan-in tend to commit their output to a saturated state prematurely and are rendered difficult to adapt (see Section 5.2.4 for additional discussion). Therefore, normalizing the learning rates of the various units by dividing by their corresponding fan-in helps speed up learning.

The optimal learning rate for fast convergence of backprop/gradient descent search is the inverse of the largest eigenvalue, $\lambda_{\max}$, of the Hessian matrix **H** of the error function *E*, evaluated at the search point **w**. Computing the full Hessian matrix is prohibitively expensive for large networks with thousands of parameters. Therefore, computing $\lambda_{\max}$ exactly just to speed up convergence seems rather inefficient. However, one may employ a shortcut to efficiently estimate $\lambda_{\max}$ (Le Cun et al., 1993). This shortcut is based on a simple way of approximating the product of **H** with an arbitrarily chosen (random) vector **z** through a Taylor expansion: $\mathbf{H}\mathbf{z} \approx \frac{\nabla E(\mathbf{w} + \alpha\mathbf{z}) - \nabla E(\mathbf{w})}{\alpha}$, where $\alpha$ is a small positive constant. Now, using the power method, which amounts to iterating the procedure $\mathbf{z} \leftarrow \mathbf{H}\frac{\mathbf{z}}{\|\mathbf{z}\|}$, the vector **z** converges to $\lambda_{\max}\mathbf{c}_{\max}$, where $\mathbf{c}_{\max}$ is the normalized eigenvector of **H** corresponding to $\lambda_{\max}$. Thus, the norm of the converged vector **z** gives a good estimate of $|\lambda_{\max}|$, and its reciprocal may now be used as the learning rate in backprop. An on-line version of this procedure is reported by Le Cun et al. (1993).
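A sketch of this estimation procedure (Python/NumPy, assuming a user-supplied `grad` routine returning $\nabla E(\mathbf{w})$; the perturbation constant and iteration count are illustrative choices):

```python
import numpy as np

def estimate_lambda_max(grad, w, n_iter=50, alpha=1e-4,
                        rng=np.random.default_rng(0)):
    """Estimate |lambda_max| of the Hessian of E at w using the
    finite-difference Hessian-vector product
        H z ~= (grad(w + alpha*z) - grad(w)) / alpha
    combined with the power method  z <- H (z / ||z||)."""
    g0 = grad(w)
    z = rng.standard_normal(w.shape)
    for _ in range(n_iter):
        z_unit = z / np.linalg.norm(z)
        z = (grad(w + alpha * z_unit) - g0) / alpha   # ~ H z_unit
    return np.linalg.norm(z)    # ~ |lambda_max|

# usage: rho = 1.0 / estimate_lambda_max(grad, w)
```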

Many heuristics have been proposed to adapt the learning rate $\rho$ automatically. Chan and Fallside (1987) proposed an adaptation rule for $\rho$ that is based on the cosine of the angle between the gradient vectors $\nabla E(t)$ and $\nabla E(t-1)$ (here, *t* is an integer which represents the iteration number). Sutton (1986) presented a method which can increase or decrease $\rho_i$ for each weight $w_i$ according to the number of sign changes observed in the associated partial derivative $\frac{\partial E}{\partial w_i}$. This method was also studied empirically by Jacobs (1988). Franzini (1987) investigated a technique that heuristically adjusts $\rho$, increasing it whenever the current gradient points in nearly the same direction as the previous one and decreasing it otherwise. Cater (1987) suggested using separate learning rate parameters $\rho_k$, one for each pattern $\mathbf{x}^k$. Silva and Almeida (1990; see also Vogl et al., 1988) used a method where the learning rate $\rho_i$ for a given weight $w_i$ is set to $u\,\rho_i$ if $\frac{\partial E}{\partial w_i}(t)$ and $\frac{\partial E}{\partial w_i}(t-1)$ have the same sign, with $u > 1$; if the partial derivatives have different signs, then a learning rate of $d\,\rho_i$ is used, with $0 < d < 1$. A similar, theoretically justified method for increasing the convergence speed of incremental gradient descent search is to keep the learning rate unchanged while successive gradient terms have the same sign, and to decrease it otherwise (Pflug, 1990).
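These sign-based rules share a common skeleton; the following sketch implements the Silva and Almeida variant (Python/NumPy; the factors `u = 1.2` and `d = 0.5` are typical but illustrative values, not taken from the original papers):

```python
import numpy as np

def silva_almeida_step(w, grad, rho, prev_grad, u=1.2, d=0.5):
    """One update with per-weight adaptive learning rates: grow a weight's
    rate by u (> 1) while its partial derivative keeps the same sign,
    shrink it by d (< 1) when the sign flips."""
    g = grad(w)
    same_sign = g * prev_grad > 0
    rho = np.where(same_sign, u * rho, d * rho)
    return w - rho * g, rho, g

# usage: rho = np.full_like(w, 0.01); prev_grad = np.zeros_like(w)
# then iterate: w, rho, prev_grad = silva_almeida_step(w, grad, rho, prev_grad)
```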

When the input vectors are assumed to be randomly and independently chosen from a probability distribution, we may view incremental backprop as a stochastic gradient descent algorithm, along the lines of the theory in Section 4.2.2. Thus, simply setting the learning rate $\rho$ to a constant results in persistent residual fluctuations around a local minimum $\mathbf{w}^*$. The variance of such fluctuations depends on the size of $\rho$, the criterion function being minimized, and the training set. Based on results from stochastic approximation theory (Ljung, 1977), the "running average" schedule $\rho(t) = \rho_0/t$, with sufficiently small $\rho_0$, guarantees asymptotic convergence to a local minimum $\mathbf{w}^*$. However, this schedule leads to very slow convergence. Here, one would like to start the search with a learning rate larger than $\rho_0/t$ but then ultimately converge to the $1/t$ rate as $\mathbf{w}^*$ is approached. Unfortunately, increasing $\rho_0$ can lead to instability for small *t*. Darken and Moody (1991) proposed the "search then converge" schedule $\rho(t) = \frac{\rho_0}{1 + t/\tau}$, which allows for faster convergence without compromising stability. In this schedule, the learning rate stays relatively high for a "search time" $\tau$, during which it is hoped that the weights will hover about a good minimum. Then, for times $t \gg \tau$, the learning rate decreases as $1/t$ and the learning converges. Note that for $\tau = 1$, this schedule reduces to the running average schedule. So a procedure for optimizing $\tau$ is needed. A completely automatic "search then converge" schedule can be found in Darken and Moody (1992).
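A sketch of the schedule (the values of `rho0` and `tau` below are illustrative):

```python
def search_then_converge(t, rho0=0.1, tau=100.0):
    """Darken-Moody 'search then converge' learning rate: stays near rho0
    for t << tau (search phase), then decays like 1/t for t >> tau
    (converge phase). tau = 1 recovers the running-average schedule."""
    return rho0 / (1.0 + t / tau)
```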

Another simple approach to speed up backprop is through the addition of a momentum term (Plaut et al., 1986) to the right-hand side of the weight update rules in Equations (5.1.2) and (5.1.9). Here, each weight change $\Delta w_i$ is given some momentum so that it accelerates in the average downhill direction, instead of fluctuating with every change in the sign of the associated partial derivative $\frac{\partial E}{\partial w_i}$. The addition of momentum to gradient search is formally stated as

$$\Delta w_i(t) = -\rho\,\frac{\partial E}{\partial w_i}(t) + \alpha\,\Delta w_i(t-1) \qquad (5.2.1)$$

where $\alpha$ is a momentum rate normally chosen between 0 and 1, and $\Delta w_i(t-1) = w_i(t) - w_i(t-1)$. Equation (5.2.1) is a special case of multi-stage gradient methods which have been proposed for accelerating convergence (Wegstein, 1958) and escaping local minima (Tsypkin, 1971).

The momentum term can also be viewed as a way of increasing the effective learning rate in almost-flat regions of the error surface while maintaining a learning rate close to $\rho$ in regions with high fluctuations. This can be seen by employing an *N*-step recursion and writing (5.2.1) as

$$\Delta w_i(t) = -\rho \sum_{k=0}^{N-1} \alpha^k\,\frac{\partial E}{\partial w_i}(t-k) + \alpha^N \Delta w_i(t-N) \qquad (5.2.2)$$

If the search point is caught in a flat region, then $\frac{\partial E}{\partial w_i}$ will be about the same at each time step and Equation (5.2.2) can be approximated as (with $0 < \alpha < 1$ and *N* large)

$$\Delta w_i(t) \approx -\frac{\rho}{1-\alpha}\,\frac{\partial E}{\partial w_i}(t) \qquad (5.2.3)$$

Thus, for flat regions, a momentum term leads to increasing the effective learning rate by a factor $\frac{1}{1-\alpha}$. On the other hand, if the search point is in a region of high fluctuation, the weight change will not gain momentum; i.e., the momentum effect vanishes. An empirical study of the effects of $\rho$ and $\alpha$ on the convergence of backprop and on its learning curve can be found in Tollenaere (1990).
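A sketch of Equation (5.2.1) as a per-iteration update (Python/NumPy; the `grad` function and the parameter values are assumptions for illustration):

```python
import numpy as np

def momentum_step(w, delta_w_prev, grad, rho=0.01, alpha=0.9):
    """Gradient descent with momentum, Eq. (5.2.1):
    delta_w(t) = -rho * dE/dw(t) + alpha * delta_w(t-1).
    In flat regions the effective rate approaches rho / (1 - alpha)."""
    delta_w = -rho * grad(w) + alpha * delta_w_prev
    return w + delta_w, delta_w

# usage: delta_w = np.zeros_like(w)
# then iterate: w, delta_w = momentum_step(w, delta_w, grad)
```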

Adaptive momentum rates may also be employed. Fahlman (1989) proposed and extensively simulated a heuristic variation of backprop, called quickprop, which employs a dynamic momentum rate given by

$$\alpha(t) = \frac{\dfrac{\partial E}{\partial w_i}(t)}{\dfrac{\partial E}{\partial w_i}(t-1) - \dfrac{\partial E}{\partial w_i}(t)} \qquad (5.2.4)$$

With this adaptive $\alpha(t)$ substituted in (5.2.1), if the current slope is persistently smaller than the previous one but has the same sign, then $\alpha(t)$ is positive and the weight change will accelerate. Here, the acceleration rate is determined by the magnitude of successive differences between slope values. If the current slope is in the opposite direction from the previous one, it signals that the weights are crossing over a minimum. In this case, $\alpha(t)$ has a negative sign and the weight change starts to decelerate. Additional heuristics are used to handle the undesirable case where the current slope is in the same direction as the previous one but has the same or larger magnitude; otherwise, this scenario would lead to taking an infinite step, or to moving the search point backwards, up the current slope and toward a local maximum. Substituting Equation (5.2.4) in (5.2.1) leads to the update rule

$$\Delta w_i(t) = -\rho\,\frac{\partial E}{\partial w_i}(t) - \frac{\dfrac{\partial E}{\partial w_i}(t)}{\left[\dfrac{\partial E}{\partial w_i}(t) - \dfrac{\partial E}{\partial w_i}(t-1)\right]\Big/\,\Delta w_i(t-1)} \qquad (5.2.5)$$

It is interesting to note that Equation (5.2.5) corresponds to steepest gradient descent-based adaptation with a dynamically changing effective learning rate $\rho(t)$. This learning rate is given by the sum of the original constant learning rate $\rho$ and the reciprocal of the denominator of the second term on the right-hand side of Equation (5.2.5).
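A per-weight sketch of this update (Python/NumPy; the growth cap and the small-denominator guard below stand in for Fahlman's full set of safeguard heuristics, and the parameter values are illustrative):

```python
import numpy as np

def quickprop_step(w, g, g_prev, dw_prev, rho=0.01, max_growth=1.75):
    """Quickprop, Eq. (5.2.5): a gradient step plus a secant-based momentum
    term  [g(t) / (g(t-1) - g(t))] * dw(t-1).  The guard and the growth cap
    keep a near-singular denominator from producing huge steps."""
    denom = g_prev - g
    denom = np.where(np.abs(denom) < 1e-12, 1e-12, denom)   # guard
    dw = -rho * g + (g / denom) * dw_prev
    limit = max_growth * np.abs(dw_prev) + 1e-12
    dw = np.clip(dw, -limit, limit)                          # cap step growth
    return w + dw, dw

# bootstrap with a plain gradient step so dw_prev is nonzero, then iterate:
# w, dw_prev = quickprop_step(w, grad(w), g_prev, dw_prev); g_prev = ...
```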

The use of error gradient information at two consecutive time steps in Equation (5.2.4) to improve convergence speed can be justified as being based on approximations of second-order search methods such as Newton's method. The Newton method (e.g., Dennis and Schnabel, 1983) is based on a quadratic model of the criterion *E*(**w**) and hence uses only the first three terms in a Taylor series expansion of *E* about the "current" weight vector $\mathbf{w}^c$:

$$E(\mathbf{w}) \approx E(\mathbf{w}^c) + \nabla E(\mathbf{w}^c)^{\mathrm T}(\mathbf{w} - \mathbf{w}^c) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^c)^{\mathrm T}\,\mathbf{H}\,(\mathbf{w} - \mathbf{w}^c)$$

This quadratic function is minimized by solving the equation $\nabla E(\mathbf{w}) = \mathbf{0}$, which leads to Newton's method: $\Delta\mathbf{w} = -\mathbf{H}^{-1}\,\nabla E(\mathbf{w}^c)$. Here **H** is the Hessian matrix with components $h_{ij} = \frac{\partial^2 E}{\partial w_i\,\partial w_j}$.

Newton's algorithm iteratively computes the weight changes $\Delta\mathbf{w}$ and works well when initialized within a convex region of *E*. In fact, the algorithm converges quickly if the search region is quadratic or nearly so. However, this method is very computationally expensive, since computing $\mathbf{H}^{-1}$ requires $O(N^3)$ operations at each iteration (here, *N* is the dimension of the search space). Several authors have suggested computationally efficient ways of approximating Newton's method (Parker, 1987; Ricotti et al., 1988; Becker and Le Cun, 1989). Becker and Le Cun proposed an approach whereby the off-diagonal elements of **H** are neglected, thus arriving at the approximation

$$\Delta w_i = -\rho\,\frac{\partial E/\partial w_i}{\partial^2 E/\partial w_i^2} \qquad (5.2.6)$$

which is a "decoupled" form of Newton's
rule where each weight is updated separately. The second term
in the right-hand side of Equation (5.2.5) can now be viewed as
an approximation of Newton's rule, since its denominator is a
crude approximation of the second derivative of *E* at step
*t*. In fact, this suggests that the weight update rule
in Equation (5.2.5) may be used with = 0.

As with Equation (5.2.4), special heuristics must be used in order to prevent the search from moving in the wrong gradient direction and in order to deal with regions of very small curvature, such as inflection points and plateaus, which cause $\Delta w_i$ in Equation (5.2.6) to blow up. A simple solution is to replace the $\frac{\partial^2 E}{\partial w_i^2}$ term in Equation (5.2.6) by $\frac{\partial^2 E}{\partial w_i^2} + \mu$, where $\mu$ is a small positive constant. The approximate Newton method described above is capable of scaling the descent step in each direction. However, because it neglects off-diagonal Hessian terms, it is not able to rotate the search direction as in the exact Newton's method. Thus, this approximate rule is only efficient if the directions of maximal and minimal curvature of *E* happen to be aligned with the weight space axes.
Bishop (1992) reported a somewhat efficient technique for computing the elements of the Hessian matrix exactly, using multiple forward propagations through the network, followed by multiple backward propagations.
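A sketch of the safeguarded decoupled Newton rule [Equation (5.2.6) with the $\mu$ fix]; here the diagonal curvature is approximated by a forward difference of a user-supplied gradient function, an illustrative assumption (in practice these second derivatives are obtained by an extra backward pass):

```python
import numpy as np

def pseudo_newton_step(w, grad, rho=1.0, mu=0.1, eps=1e-4):
    """Decoupled (diagonal) Newton rule:
    delta_w_i = -rho * (dE/dw_i) / (d2E/dw_i^2 + mu).
    mu keeps steps bounded near inflection points and plateaus; the
    absolute value guards against negative curvature reversing the
    descent direction (one of the safeguards mentioned in the text)."""
    g = grad(w)
    curv = np.empty_like(w)
    for i in range(w.size):              # O(N) extra gradient evaluations
        e = np.zeros_like(w)
        e[i] = eps
        curv[i] = (grad(w + e)[i] - g[i]) / eps
    return w - rho * g / (np.abs(curv) + mu)
```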

Another approach for deriving theoretically justifiable update schedules for the momentum rate in Equation (5.2.1) is to adjust $\alpha(t)$ at each update step such that the gradient descent search direction is "locally" optimal. In "optimum" steepest descent (also known as best-step steepest descent), the learning rate is set at time *t* such that it minimizes the criterion function *E* at time step *t* + 1; i.e., we desire a $\rho$ which minimizes $E[\mathbf{w}(t+1)]$. Unfortunately, this optimal learning step is impractical since it requires the computation of the Hessian $\nabla^2 E$ at each time step (refer to Problem 5.2.12 for an expression for the optimal $\rho$). However, we may still use some of the properties of the optimal $\rho$ in order to accelerate the search, as we demonstrate next.

When **w**(*t*) is specified, the necessary condition for minimizing *E*[**w**(*t* + 1)] is (Tompkins, 1956; Brown, 1959)

$$\nabla E(t+1)^{\mathrm T}\,\nabla E(t) = 0 \qquad (5.2.7)$$

This implies that the search directions in two successive steps of optimum steepest descent are orthogonal. The easiest method to enforce this orthogonality requirement is the Gram-Schmidt orthogonalization method. Suppose that we know the search direction at time *t* − 1, denoted **d**(*t* − 1), and that we compute the "exact" gradient $\nabla E(t)$ (used in batch backprop) at time step *t* [to simplify notation, we write $\nabla E[\mathbf{w}(t)]$ as $\nabla E(t)$]. Now, we can satisfy the condition of orthogonal consecutive search directions by computing a new search direction, employing Gram-Schmidt orthogonalization (Yu et al., 1993)

$$\mathbf{d}(t) = -\nabla E(t) + \frac{\nabla E(t)^{\mathrm T}\,\mathbf{d}(t-1)}{\mathbf{d}(t-1)^{\mathrm T}\,\mathbf{d}(t-1)}\,\mathbf{d}(t-1) \qquad (5.2.8)$$

Performing descent search in the direction **d**(*t*) in Equation (5.2.8) leads to the weight vector update rule

$$\Delta\mathbf{w}(t) = -\rho\,\nabla E(t) + \rho\,\frac{\nabla E(t)^{\mathrm T}\,\Delta\mathbf{w}(t-1)}{\|\Delta\mathbf{w}(t-1)\|^2}\,\Delta\mathbf{w}(t-1) \qquad (5.2.9)$$

where the relation $\Delta\mathbf{w}(t-1) = \mathbf{w}(t) - \mathbf{w}(t-1) = \rho\,\mathbf{d}(t-1)$ has been used. Comparing the component-wise weight update version of Equation (5.2.9) to Equation (5.2.1) reveals another adaptive momentum rate, given by

$$\alpha(t) = \rho\,\frac{\nabla E(t)^{\mathrm T}\,\Delta\mathbf{w}(t-1)}{\|\Delta\mathbf{w}(t-1)\|^2}$$

Another similar approach is to set the current search direction **d**(*t*) to be a compromise between the current "exact" gradient $\nabla E(t)$ and the previous search direction **d**(*t* − 1); i.e., $\mathbf{d}(t) = -\nabla E(t) + \beta\,\mathbf{d}(t-1)$, with $\beta \ge 0$. This is the basis for the conjugate gradient method, in which the search direction is chosen (by appropriately setting $\beta$) so that it distorts as little as possible the minimization achieved by the previous search step. Here, the current search direction is chosen to be conjugate (with respect to **H**) to the previous search direction. Analytically, we require $\mathbf{d}(t-1)^{\mathrm T}\,\mathbf{H}(t-1)\,\mathbf{d}(t) = 0$, where the Hessian is assumed to be positive-definite. In practice, $\beta$, which plays the role of an adaptive momentum, is chosen according to the Polak-Ribière rule (Polak and Ribière, 1969; Press et al., 1986):

$$\beta(t) = \frac{\left[\nabla E(t) - \nabla E(t-1)\right]^{\mathrm T}\,\nabla E(t)}{\nabla E(t-1)^{\mathrm T}\,\nabla E(t-1)}$$

Thus, the search direction in the conjugate gradient method at time *t* is given by

$$\mathbf{d}(t) = -\nabla E(t) + \beta(t)\,\mathbf{d}(t-1)$$

Now, using $\Delta\mathbf{w}(t-1) = \rho\,\mathbf{d}(t-1)$ and substituting the above expression for **d**(*t*) in $\Delta\mathbf{w}(t) = \rho\,\mathbf{d}(t)$ leads to the weight update rule:

$$\Delta\mathbf{w}(t) = -\rho\,\nabla E(t) + \beta(t)\,\Delta\mathbf{w}(t-1)$$
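A sketch of this scheme in Python/NumPy (assuming a user-supplied `grad` function; the fixed learning rate `rho` stands in for the line search used in practice, and the `max(0, ...)` restart is a common practical variant, not part of the original rule):

```python
import numpy as np

def conjugate_gradient_descent(grad, w, rho=0.01, n_iter=1000):
    """Conjugate-gradient-style backprop: d(t) = -grad(t) + beta(t) d(t-1),
    with beta from the Polak-Ribiere rule
    beta = [grad(t) - grad(t-1)]^T grad(t) / ||grad(t-1)||^2."""
    g_prev = grad(w)
    d = -g_prev                      # first step: plain steepest descent
    for _ in range(n_iter):
        w = w + rho * d
        g = grad(w)
        beta = max(0.0, (g - g_prev) @ g / (g_prev @ g_prev))
        d = -g + beta * d            # new conjugate search direction
        g_prev = g
    return w
```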

When *E* is quadratic, the conjugate gradient method theoretically converges in *N* or fewer iterations. In general, *E* is not quadratic, and therefore this method will be slower than the theory predicts. However, it is reasonable to assume that *E* is approximately quadratic near a local minimum. Therefore, conjugate gradient descent is expected to accelerate the convergence of backprop once the search enters a small neighborhood of a local minimum. As a general note, the basic idea of conjugate gradient search was introduced by Hestenes and Stiefel (1952); Beckman (1964) gives a good account of this method. Battiti (1992) and van der Smagt (1994) give additional characterizations of second-order backprop (such as conjugate gradient-based backprop) from the point of view of optimization. The conjugate gradient method has been applied to multilayer feedforward neural net training (Kramer and Sangiovanni-Vincentelli, 1989; Makram-Ebeid et al., 1989; van der Smagt, 1994) and has been shown to outperform backprop in speed of convergence.

It is important to note that the above second-order modifications to backprop improve the speed of convergence of the weights to the "closest" local minimum. This faster convergence to local minima is the direct result of employing a better search direction as compared to incremental backprop. On the other hand, the stochastic nature of the search directions of incremental backprop and its fixed learning rates can be an advantage, since it allows the search to escape shallow local minima, which generally leads to better solution quality. These observations suggest the use of hybrid learning algorithms (Møller, 1990; Gorse and Shepherd, 1992) where one starts with incremental backprop and then switches to conjugate gradient-based backprop for the final convergence phase. This hybrid method has its roots in a technique from numerical analysis known as Levenberg-Marquardt optimization (Press et al., 1986).

As a historical note, we mention that the concept
of gradient descent was first introduced by Cauchy (1847) for
use in the solution of simultaneous equations; the method has
enjoyed popularity ever since. It should also be noted that some of the above enhancements to gradient search date back to the fifties and sixties and are discussed in Tsypkin (1971). Additional modifications of the gradient descent method which enhance its convergence to global minima are discussed in Section 8.1. For a good survey of gradient search, the reader is referred to the book by Polyak (1987).

As indicated earlier in this section, backprop suffers
from premature convergence of some units to flat spots. During
training, if a unit in a multilayer network receives a weighted
signal *net* with a large magnitude, this unit outputs a
value close to one of the saturation levels of its activation
function. If the corresponding target value (desired target value
for an output unit, or an unknown "correct" hidden target
for a hidden unit) is substantially different from that of the
saturated unit, we say that the unit is incorrectly saturated
or has entered a flat spot. When this happens, the size of the
weight update due to backprop will be very small even though the
error is relatively large, and it will take an excessively long
time for such incorrectly saturated units to reverse their states.
This situation can be explained by referring to Figure 5.2.1, where the activation function $f(net) = \tanh(\beta\,net)$ and its derivative *f* ' [given in Equation (5.1.12)] are plotted for $\beta = 1$.

Here, when *net* has large magnitude, *f *'
approaches zero. Thus, the weight change approaches zero in Equations
(5.1.2) and (5.1.9) even when there is a large difference between
the actual and desired/correct output for a given unit.

A simple solution to the flat spot problem is to bias the derivative of the activation function (Fahlman, 1989); i.e., replace $f_o'$ and $f_h'$ in Equations (5.1.2) and (5.1.9) by $f_o' + \epsilon$ and $f_h' + \epsilon$, respectively (a typical value for $\epsilon$ is 0.1). Hinton (1987a) suggested the use of a nonlinear error function that goes to infinity at the points where *f* ' goes to zero, resulting in a finite non-zero error value (see Franzini (1987) for an example of using such an error function). The entropic criterion of Chapter 3, Equation (3.1.76), is a good choice for the error function since it leads to an output unit update rule similar to that of Equation (5.1.2) but without the $f_o'$ term (note, however, that the update rule for the hidden units would still have the derivative term; see Equation (5.2.18) in Section 5.2.6). One may also modify the basic sigmoid activation function in backprop in order to reduce flat spot effects. The use of the homotopy activation function $f_\lambda(net) = \lambda\,net + (1 - \lambda)\tanh(net)$, with $0 \le \lambda \le 1$, is one such example (Yang and Yu, 1993). Here, $f_\lambda$ forms a homotopy between a linear and a sigmoid function. Initially, $\lambda$ is set to 1; that is, all nodes have linear activations. Backprop is used to achieve a minimum in *E*, then $\lambda$ is decreased (monotonically) and backprop is continued until $\lambda$ is zero. That is, the activation function recovers its sigmoidal nature gradually as training progresses. Since $f_\lambda' \ge \lambda$, which is nonzero for most of the training phase, flat spot effects are eliminated. Besides reducing the effects of flat spot, the homotopy function also helps backprop escape some local minima. This can be seen by noting that when $\lambda = 1$, the error function $E_1$ is a polynomial in **w** which has a relatively smaller number of local minima than $E_0$. Because a minimum point of $E_\lambda$ is achieved first and provides a relatively better initial point for minimizing $E_{\lambda - \Delta\lambda}$, many unwanted local minima are avoided. An alternative explanation of the effect of a gradually increasing activation function slope on the avoidance of local minima is given in Section 8.4, based on the concept of mean-field annealing.
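Assuming the convex-combination form written above (the exact parameterization in Yang and Yu (1993) may differ), a sketch of the homotopy activation and its derivative:

```python
import numpy as np

def f_homotopy(net, lam):
    """Homotopy between a linear and a tanh activation: lam = 1 gives a
    purely linear unit, lam = 0 the usual sigmoidal (tanh) unit."""
    return lam * net + (1.0 - lam) * np.tanh(net)

def f_homotopy_prime(net, lam):
    """Derivative is bounded below by lam, so there is no flat spot
    while lam > 0."""
    return lam + (1.0 - lam) * (1.0 - np.tanh(net) ** 2)
```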

Another method for reducing flat spot effects involves dynamically updating the activation slope ($\lambda$ and $\beta$ in Equations (5.1.11) and (5.1.12), respectively) such that the slope of each unit is adjusted, independently, in the direction of reduced output error (Tawel, 1989; Kufudaki and Horejs, 1990; Rezgui and Tepedelenlioglu, 1990; Kruschke and Movellan, 1991; Sperduti and Starita, 1991, 1993). Gradient descent on the error surface in the activation function's slope space leads to the following update rules (assuming hyperbolic tangent activation functions)

$$\Delta\beta_l = \rho_o\,(d_l - y_l)\,(1 - y_l^2)\,net_l \qquad (5.2.10)$$

and

$$\Delta\beta_j = \rho_h\left[\sum_l (d_l - y_l)(1 - y_l^2)\,\beta_l\,w_{lj}\right](1 - z_j^2)\,net_j \qquad (5.2.11)$$

for the *l*th output unit and the *j*th hidden unit, respectively. Here, $\rho_o$ and $\rho_h$ are small positive constants. Typically, when initialized with slopes near unity, Equations (5.2.10) and (5.2.11) reduce the activation slopes toward zero, which increases the effective dynamic range of the activation function; this, in turn, reduces flat spot effects and therefore allows the weights to update rapidly in the initial stages of learning. As the algorithm begins to converge, the slope starts to increase and thus restores the saturation properties of the units. It is important to note here that the slope adaptation process just described becomes a part of the backprop weight update procedure; the slopes are updated after every weight update step.

Other nonsigmoid activation functions may be utilized
as long as they are differentiable (Robinson et al., 1989).
From
the discussion on the approximation capabilities of multilayer
feedforward networks in Section 2.3, a wide range of activation
functions may be employed without compromising the universal
approximation
capabilities of such networks. However, the advantages of choosing one particular class of activation functions (or a mixture of various functions) are not completely understood. Moody and Yarvin
(1992) reported an empirical study where they have compared
feedforward
networks with a single hidden layer feeding into a single linear output unit, each network employing a different type of differentiable nonlinear activation function. The types of activation functions
considered by Moody and Yarvin included the sigmoid logistic function,
polynomials, rational functions (ratios of polynomials), and Fourier
series (sums of cosines). Benchmark simulations on a few data
sets representing noisy data with only mild nonlinearity and noiseless
data with a high degree of nonlinearity were performed. It was
found that the networks with nonsigmoidal activations attained
superior performance on the highly nonlinear noiseless data.
On the set of noisy data with mild nonlinearity, however, polynomials
did poorly, whereas rationals and Fourier series showed better
performance and were comparable to sigmoids.

Other methods for improving the training speed of
feedforward multilayer networks involve replacing the sigmoid
units by Gaussian or other units. These methods are covered in
Chapter 6.

**5.2.5 Weight Decay, Weight Elimination, and Unit Elimination**

In Chapter 4 (Section 4.8), we saw that in order
to guarantee good generalization, the number of degrees of freedom
or number of weights (which determines a network's complexity)
must be considerably smaller than the amount of information available
for training. Some insight into this matter can be gained from
considering an analogous problem in curve fitting (Duda and Hart,
1973; Wieland and Leighton, 1987). For example, consider the rational function *g*(*x*) plotted in Figure 5.2.2 (solid line), and assume that we are given a set of 15 samples (shown as small circles) from which we are to find a "good" approximation to *g*(*x*). Two
polynomial
approximations are shown in Figure 5.2.2: An eleventh-order polynomial
(dashed line) and an eighth-order polynomial (dotted line). These
approximations are computed by minimizing the SSE criterion over
the sample points. The higher order polynomial has about the
same number of parameters as the number of training samples, and
thus is shown to give a very close fit to the data, this is referred
to as " memorization." However, it is clear from the
figure that this polynomial does not provide good
"generalization"
(i.e., it does not provide reliable interpolation and/or
extrapolation)
over the full range of the data. On the other hand, fitting the
data by an eighth-order polynomial leads to relatively better
overall interpolations over a wider range of *x* values (refer
to the dotted line in Figure 5.2.2). In this case, the number
of free parameters is equal to nine, which is smaller than the number of training samples. This "overdetermined" nature leads to an approximation function that better matches the
"smooth"
function *g*(*x*) being approximated. Trying to use
a yet lower order polynomial (e.g., fifth-order or less) leads
to a poor approximation because this polynomial would not have
sufficient "flexibility" to capture the nonlinear structure
in *g*(*x*).

Figure 5.2.2. Polynomial approximation for the function *g*(*x*) (shown as a solid line), based on the
15 samples shown (small circles). The objective of the approximation
is to minimize the sum of squared error criterion. The dashed
line represents an eleventh-order polynomial. A better overall
approximation for *g*(*x*) is given by an eighth-order
polynomial (dotted line).

The reader is advised to consider the nature and
complexity of this simple approximation problem by carefully studying
Figure 5.2.2. Here, the total number of possible training samples
of the form (*x*, *g*(*x*)) is uncountably
infinite. From this huge set of potential data, though, we chose only 15 samples to try to approximate the function. In this case,
the approximation involved minimizing an SSE criterion function
over these few sample points. Clearly, however, a solution which
is globally (or near globally) optimal in terms of sum-squared
error over the training set (for example, the eleventh order
polynomial)
may be hardly appropriate in terms of interpolation (generalization)
between data points. Thus, one should choose a class of approximation
functions which penalizes unnecessary fluctuations between training
sample points. Neural networks satisfy this approximation property
and are thus superior to polynomials in approximating arbitrary
nonlinear functions from sample points (see further discussion
given below). Figure 5.2.3 shows the results of simulations involving
the approximation of the function *g*(*x*), with the
same set of samples used in the above simulations, using single
hidden layer feedforward neural nets. Here, all hidden units
employ the hyperbolic tangent activation function (with a slope
of 1), and the output unit is linear. These nets are trained
using the incremental backprop algorithm [given by Equations (5.1.2)
and (5.1.9)] with $\rho_h = 0.01$.
Weights are initialized randomly and uniformly over the range
[-0.2, +0.2].
The training was stopped when the rate of change of the SSE became
insignificantly small. The dotted line in Figure 5.2.3 is for
a net with three hidden units (which amounts to 10 degrees of
freedom). Surprisingly, increasing the number of hidden units
to 12 units (37 degrees of freedom) improved the quality of the
fit as shown by the dashed line in the figure. By comparing Figures
5.2.3 and 5.2.2, it is clear that the neural net approximation
for *g*(*x*) is superior to that of polynomials in terms
of accurate interpolation and extrapolation.

Figure 5.2.3. Neural network approximation for the function *g*(*x*) (shown as a solid line). The
dotted line was generated by a 3-hidden unit feedforward net.
The dashed line, which is shown to have substantial overlap with
*g*(*x*), was generated by a 12-hidden unit feedforward
net. In both cases, standard incremental backprop training was
used.

The generalization superiority of the neural net
can be attributed to the bounded and smooth nature of the hidden
unit responses, as compared to the potentially divergent nature
of polynomials. The bounded unit response localizes the nonlinear
effects of individual hidden units in a neural network and allows
for the approximations in different regions of the input space
to be independently tuned. This approximation process is similar
in its philosophy to the traditional spline technique for curve
fitting (Schumaker,
1981). Hornik et al.
(1990) gave related
theoretical justification for the usefulness of feedforward neural
nets with sigmoidal hidden units in function approximation. They
showed that, in addition to approximating the training set, the
derivative of the output of the network evaluated at the training
data points is also a good approximation of the derivative of
the unknown function being approximated. This result explains
the good extrapolation capability of neural nets observed in
simulations.
For example, the behavior of the neural net output shown in Figure
5.2.3 for *x* > 10 and
*x* < -5
is a case in point. It should be noted, though, that in most
practical situations the training data is noisy. Hence, an exact
fit of this data must be avoided, which means that the degrees
of freedom of a neural net approximator must be constrained.
Otherwise, the net will have a tendency for overfitting. This
issue is explored next.

Once we decide on a particular approximation function
or network architecture, generalization can be improved if the
number of free parameters in the net is optimized. Since it is
difficult to estimate the optimal number of weights (or units)
a priori, there has been much interest in techniques that
automatically
remove excess weights and/or units from a network. These techniques
are sometimes referred to as network "pruning" algorithms and are surveyed in Reed (1993).

One of the earliest and simplest approaches to removing excess degrees of freedom from a neural network is the use of simple weight decay (Plaut et al., 1986; Hinton, 1986), in which each weight decays toward zero at a rate proportional to its magnitude, so that connections disappear unless reinforced. Hinton (1987b) gave empirical justification by showing that such weight decay improves generalization in feedforward networks, and Krogh and Hertz (1992) gave some theoretical justification for this generalization phenomenon.

Weight decay can be incorporated in the weight update equations of backprop by adding a complexity (regularization) term to the criterion function *E* that penalizes large weights,

$$J(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i w_i^2 \qquad (5.2.12)$$

Here, $\lambda$ represents the relative importance of the complexity term with respect to the error term $E(\mathbf{w})$ [note that the second term in Equation (5.2.12) is a regularization term as in Equation (4.1.3)]. Now, gradient search for minima of $J(\mathbf{w})$ leads to the following weight update rule

$$\Delta w_i = -\rho\,\frac{\partial E}{\partial w_i} - 2\rho\lambda\,w_i \qquad (5.2.13)$$

which shows an exponential decay in $w_i$ if no learning occurs. Because it penalizes more weights than necessary, the criterion function in Equation (5.2.12) overly discourages the use of large weights; a single large weight costs much more than many small ones. Weigend et al. (1991) proposed a procedure of weight elimination given by minimizing

$$J(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_i \frac{(w_i/w_0)^2}{1 + (w_i/w_0)^2} \qquad (5.2.14)$$

where the penalty term on the right-hand side helps regulate weight magnitudes and $w_0$ is a positive free parameter which must be determined. For large $w_0$, this procedure reduces to the weight decay procedure described above and hence favors many small weights, whereas if $w_0$ is small, fewer large weights are favored. Also, note that when $|w_i| \gg w_0$, the cost of a weight approaches one (times $\lambda$), which justifies the interpretation of the penalty term as a counter of large weights. In practice, a $w_0$ close to unity is used. It should be noted that the above weight elimination procedure is very sensitive to the choice of $\lambda$. A heuristic for adjusting $\lambda$ dynamically during learning is described in Weigend et al. (1991).
For yet other forms of the complexity
term, the reader is referred to Nowlan and Hinton (1992a and
1992b),
and Section 5.2.7.
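To make the penalty concrete, the following sketch (Python/NumPy; parameter values are illustrative) computes the gradient that the weight elimination term of Equation (5.2.14) adds to each weight update:

```python
import numpy as np

def weight_elimination_grad(w, lam=1e-4, w0=1.0):
    """Gradient of  lam * sum (w/w0)^2 / (1 + (w/w0)^2)  with respect to w.
    For |w| << w0 this behaves like plain weight decay (~ 2*lam*w/w0^2);
    for |w| >> w0 it vanishes, so large weights stop being penalized."""
    u = (w / w0) ** 2
    return lam * (2.0 * w / w0**2) / (1.0 + u) ** 2

# usage inside training:
#   w -= rho * (grad_E(w) + weight_elimination_grad(w))
```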

The above ideas have been extended to unit elimination (e.g., see Hanson and Pratt, 1989; Chauvin, 1989; Hassoun et al., 1990). Here, one would start with an excess of hidden units and dynamically discard redundant ones. As an example, one could penalize redundant units by replacing the weight decay term in Equation (5.2.13) with a decay term shared by all weights of a given hidden unit, which leads to the hidden unit update rule

(5.2.15)

Generalization in feedforward
networks can also
be improved by utilizing network construction procedures, as opposed
to weight or unit pruning. Here, we start with a small network
and allow it to grow gradually (add more units) in response to
incoming data. The idea is to keep the network as small as possible.
In Chapter 6 (Section 6.3) three adaptive networks having
unit-allocation
capabilities are discussed. Further details on network construction
procedures can be found in Marchand et al. (1990), Frean (1990), Fahlman and Lebiere (1990), and Mézard and Nadal (1989).

An alternative or complementary strategy to the
above methods for improving generalization in feedforward neural
networks is suggested by findings based on empirical results (Morgan and Bourlard, 1990; Weigend et al., 1991; Hergert et al., 1992).
In simulations involving backprop training of feedforward nets
on noisy data, it is found that the validation (generalization)
error decreases monotonically to a minimum but then starts to
increase, even as the training error continues to decrease. This
phenomenon is depicted in the conceptual plot in Figure 5.2.4,
and is illustrated through the computer simulation given next.

Figure 5.2.4. Training error (dashed curve) and
validation
error (solid
curve) encountered in training multilayer feedforward
neural nets using backprop.

Consider the problem of approximating the rational
function *g*(*x*) plotted in Figure 5.2.3, from a set
of noisy sample points. This set of points is generated from
the 15 perfect samples, shown in Figure 5.2.3, by adding zero-mean
normally distributed random noise whose variance is equal to 0.25.
A single hidden layer feedforward neural net is used with 12
sigmoidal hidden units and a single linear output unit. It employs
incremental backprop training with $\rho_o = 0.05$ and $\rho_h = 0.01$, and small initial random weights. After
80 training cycles on the 15 noisy samples, the net is tested
for uniformly sampled inputs *x* in the range [-8, 12].
The output of this 80-cycle net is shown as a dashed line in
Figure 5.2.5. Next, training was continued and then stopped after
10,000 cycles. The output of the resulting net is shown as a
dotted line in the figure. Comparing the two approximations in
Figure 5.2.5 leads to the conclusion that the partially trained
net is superior to the excessively trained net in terms of overall
interpolation and extrapolation capabilities. Further insight
into the dynamics of the generalization process for this problem
can be gained from Figure 5.2.6. Here, the validation RMS error
is monitored by testing the net on a validation set of 294 perfect
samples, uniformly spaced in the interval [-8, 12],
after every 10 training cycles. This validation error is shown
as the dashed line in Figure 5.2.6. The training error (RMS error
on the training set of 15 points) is also shown in the figure
as a solid line. Note that the optimal net in terms of overall
generalization capability is the one obtained after about 80 to
90 training cycles. Beyond this training point, the training
error keeps decreasing, while the validation error increases.
It is interesting to note the non-monotonic behavior of the
validation
error between training cycles 2000 and 7000. This suggests that,
in general, multiple local minima may exist in the validation
error curves of backprop trained feedforward neural networks.
The location of these minima is a complex function of the network
size, weight initialization, and learning parameters. To summarize,
when training with noisy data, excessive training usually leads
to overfitting. On the other hand, partial training may lead
to a better approximation of the unknown function in the sense
of improved interpolation and, possibly, improved extrapolation.

Figure 5.2.5. Two different neural network approximations
of the rational function *g*(*x*), shown as a solid
line, from noisy samples. The training samples shown are generated
from the 15 perfect samples in Figure 5.2.3 by adding zero-mean
normally distributed random noise with 0.25 variance. Both
approximations
resulted from the same net with 12 hidden units and incremental
backprop learning. The dashed line represents the output of the
net after 80 learning cycles. After completing 10,000 learning
cycles, the same net generates the dotted line output.

Figure 5.2.6. Training and validation RMS errors
for the neural net approximation of the function *g*(*x*).
The training set consists of the 15 noisy samples in Figure 5.2.5.
The validation set consists of 294 perfect samples uniformly
spaced in the interval [-8, 12].
The validation error starts lower than the training error mainly
because perfect samples are used for validation.

The
generalization phenomenon depicted in Figure
5.2.4 (and illustrated by the simulation in Figures 5.2.5 and
5.2.6) does not currently have a complete theoretical justification.
A qualitative explanation for it was advanced by Weigend et al.
(1991). They explain that to a first approximation, backprop
initially adapts the hidden units in the network such that they
all attempt to fit the major features of the data. Later, as
training proceeds, some of the units then start to fit the noise
in the data. This later process continues as long as there is
error and as long as training continues (this is exactly what
happens in the simulation of Figure 5.2.5). The overall process
suggests that the effective number of free parameters (weights)
starts small (even if the network is oversized) and gets larger
approaching the true number of adjustable parameters in the network
as training proceeds. Baldi and
Chauvin (1991) derived analytical
results on the behavior of the validation error in LMS-trained
single layer feedforward networks learning the identity map from
noisy autoassociation pattern pairs. Their results agree with
the above generalization phenomenon in nonlinear multilayer
feedforward
nets.

Therefore, a suitable strategy for improving
generalization
in networks of non-optimal size is to avoid " overtraining"
by carefully monitoring the evolution of the validation error
during training and stopping just before it starts to increase.
This strategy is based on one of the early criteria in model
evaluation known as cross-validation (e.g., see Stone, 1978).
Here, the whole available data set is split into three parts:
Training set, validation set, and prediction set. The training
set is used to determine the values of the weights of the network.
The validation set is used for deciding when to terminate training.
Training continues as long as the performance on the validation
set keeps improving. When it ceases to improve, training is stopped.
The third part of the data, the prediction set, is used to estimate
the expected performance (generalization) of the trained network
on new data. In particular, the prediction set should not be
used for validation during the training phase. Note that this
heuristic requires the application to be data-rich. Some
applications,
though, suffer from scarcity of training data which makes this
method inappropriate. The reader is referred to Finnoff et al. (1993) for an empirical study of cross-validation-based
generalization
and its comparison to weight decay and other generalization-inducing
methods.
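A sketch of this stopping procedure (Python; `train_epoch`, `rms_error`, and the `patience` parameter are hypothetical placeholders for the application's own training and evaluation routines):

```python
import copy

def train_with_early_stopping(net, train_set, val_set, train_epoch,
                              rms_error, max_epochs=10000, patience=20):
    """Cross-validation-based stopping: keep the weights that achieved
    the lowest validation error, and stop once validation error has
    failed to improve for `patience` consecutive epochs."""
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_epoch(net, train_set)        # one pass of backprop
        err = rms_error(net, val_set)      # monitor validation error
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # validation stopped improving
                break
    return best_net
```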


As seen earlier in Section 3.1.5, other criterion/error functions can be used from which new versions of the backprop weight update rules can be derived. Here, we consider two such criterion functions: (1) relative entropy and (2) Minkowski-*r*. Starting from the instantaneous entropy criterion (Baum and Wilczek, 1988)

$$E(\mathbf{w}) = \frac{1}{2}\sum_l \left[(1 + d_l)\ln\frac{1 + d_l}{1 + y_l} + (1 - d_l)\ln\frac{1 - d_l}{1 - y_l}\right] \qquad (5.2.16)$$

and employing gradient descent search, we may obtain the learning equations

$$\Delta w_{lj} = \rho_o\,(d_l - y_l)\,z_j \qquad (5.2.17)$$

and

$$\Delta w_{ji} = \rho_h\left[\sum_l (d_l - y_l)\,w_{lj}\right] f_h'(net_j)\,x_i \qquad (5.2.18)$$

for the output and hidden layer units, respectively (here, $w_{lj}$ denotes the weight from hidden unit *j* to output unit *l*, $w_{ji}$ the weight from input *i* to hidden unit *j*, and $z_j$ the output of hidden unit *j*). Equations (5.2.17) and (5.2.18) assume hyperbolic tangent activations at both layers.

From Equation (5.2.17) we see that the $f_o'$ term present in the corresponding equation of standard backprop [Equation (5.1.2)] has now been eliminated. Thus, the output units do not have a flat spot problem; on the other hand, $f_h'$ still appears in Equation (5.2.18) for the hidden units (this derivative appears implicitly as the $(1 - z_j^2)$ term in the standard backprop equation). Therefore, the flat spot problem is only partially solved by employing the entropy criterion.

The entropy-based backprop is well suited to probabilistic training data. It has a natural interpretation in terms of learning the correct probabilities of a set of hypotheses represented by the outputs of units in a multilayer neural network. Here, the probability that the *l*th hypothesis is true given an input pattern $\mathbf{x}^k$ is determined by the output of the *l*th unit as $(1 + y_l)/2$. The entropy criterion is a "well formed" error function (Wittner and Denker, 1988); the reader is referred to Section 3.1.5 for a definition and discussion of "well formed" error functions. Such functions have been shown in simulations to converge faster than standard backprop (Solla et al., 1988).

Another choice is the Minkowski-*r* criterion function (Hanson and Burr, 1988):

$$E(\mathbf{w}) = \frac{1}{r}\sum_l |d_l - y_l|^r \qquad (5.2.19)$$

which leads to the following weight update equations:

$$\Delta w_{lj} = \rho_o\,|d_l - y_l|^{r-1}\,\mathrm{sgn}(d_l - y_l)\,f_o'(net_l)\,z_j \qquad (5.2.20)$$

and

$$\Delta w_{ji} = \rho_h\left[\sum_l |d_l - y_l|^{r-1}\,\mathrm{sgn}(d_l - y_l)\,f_o'(net_l)\,w_{lj}\right] f_h'(net_j)\,x_i \qquad (5.2.21)$$

where sgn is the sign function. These equations reduce to those of standard backprop for the case *r* = 2.
The motivation behind the use of this criterion is that it can lead to maximum likelihood estimation of the weights for Gaussian and non-Gaussian input data distributions by appropriately choosing *r* (e.g., *r* = 1 for data with Laplace distributions). A small *r* ($1 \le r < 2$) gives less weight to large deviations and tends to reduce the influence of outlier points in the input space during learning. On the other hand, when noise is negligible, the sensitivity of the separating surfaces implemented by the hidden units to the geometry of the problem may be increased by employing *r* > 2. Here, fewer hidden units are recruited when learning complex nonlinearly separable mappings for larger *r* values (Hanson and Burr, 1988).
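Relative to standard backprop, only the error term attached to each output unit changes; a minimal sketch (Python/NumPy):

```python
import numpy as np

def minkowski_r_error(d, y, r=1.5):
    """Error term |d - y|^(r-1) * sgn(d - y) used in Eqs. (5.2.20) and
    (5.2.21); r = 2 recovers the usual linear error (d - y), while
    1 <= r < 2 damps the influence of large deviations (outliers)."""
    e = d - y
    return np.abs(e) ** (r - 1.0) * np.sign(e)
```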

If no a priori knowledge is available about the distribution of the training data, it would be difficult to estimate a value for *r*, unless extensive experimentation with various *r* values (e.g., *r* = 1.5, 2, 3) is done. Alternatively, an automatic method for estimating *r* is possible by adaptively updating *r* in the direction of decreasing *E*. Here, steepest gradient descent on *E*(*r*) results in the update rule

(5.2.22)

which, when restricting *r* to be strictly greater than 1 (the metric error measure case), may be approximated as

(5.2.23)

Note that it is important that the *r* update
rule be invoked much less frequently than the weight update rule
(for example, *r* is updated once every 10 training epochs
of backprop).

The idea of increasing the learning robustness of backprop in noisy environments can be placed in a more general statistical framework (White, 1989) where the technique of robust statistics (Huber, 1981) takes effect. Here, robustness of learning refers to insensitivity to small perturbations in the underlying probability distribution *p*(**x**) of the training set. These statistical techniques motivate the replacement of the linear error $(d_l - y_l)$ in Equations (5.1.2) and (5.1.9) by a nonlinear error suppressor function which is compatible with the underlying probability density function *p*(**x**). One example is to set the suppressor to $|d_l - y_l|^{r-1}\,\mathrm{sgn}(d_l - y_l)$ with $1 \le r < 2$. This error suppressor leads to exactly the Minkowski-*r* weight update rules of Equations (5.2.20) and (5.2.21). In fact, the case *r* = 1 is equivalent to minimizing the summed absolute error criterion, which is known to suppress outlier data points. Similarly, the suppressor selection of Kosko (1992) leads to robust backprop if *p*(**x**) has long tails, such as a Cauchy distribution or some other infinite-variance density.

Furthermore, regularization terms may be added to the above error functions *E*(**w**) in order to introduce desirable effects such as good generalization, smaller effective network size, smaller weight magnitudes, faster learning, etc. (Poggio and Girosi, 1990a; Mao and Jain, 1993). The regularization terms in Equations (5.2.12) and (5.2.14), used for enhancing generalization through weight pruning/elimination, are examples. Another possible regularization term penalizes the sensitivity of the error to the network inputs (Drucker and Le Cun, 1992); this term has been shown to improve backprop generalization by forcing the output to be insensitive to small changes in the input. It also helps speed up convergence by generating hidden layer weight distributions that have smaller variances than those generated by standard backpropagation. (Refer to Problem 5.2.11 for yet another form of regularization.)

Weight sharing, a method where several weights in a network are controlled by a single parameter, is another way of enhancing generalization (Rumelhart et al., 1986b; also see Section 5.3.3 for an application). It imposes equality constraints among weights, thus reducing the number of free (effective) parameters in the network, which leads to improved generalization. An automatic method for effecting weight sharing can be derived by adding the regularization term (Nowlan and Hinton, 1992a and 1992b)

$$R(\mathbf{w}) = -\sum_i \ln\!\left(\sum_j \pi_j\, p_j(w_i)\right) \qquad (5.2.24)$$

to the error function, where each $p_j(w_i)$ is a Gaussian density with mean $\mu_j$ and variance $\sigma_j^2$, $\pi_j$ is the mixing proportion of Gaussian $p_j$ (with $\sum_j \pi_j = 1$), and $w_i$ represents an arbitrary weight in the network. The $\mu_j$, $\sigma_j$, and $\pi_j$ parameters are assumed to adapt as the network learns. The use of multiple adaptive Gaussians allows the implementation of "soft weight sharing," in which the learning algorithm decides for itself which weights should be tied together. If the Gaussians all start with high variance, the initial grouping of weights into subsets will be very soft. As the network learns and the variances shrink, those groupings become more and more distinct and converge to subsets influenced by the task being learned.

For gradient descent-based adaptation, one may employ the partial derivatives

$$\frac{\partial R}{\partial w_i} = \sum_j r_j(w_i)\,\frac{w_i - \mu_j}{\sigma_j^2} \qquad (5.2.25)$$

$$\frac{\partial R}{\partial \mu_j} = -\sum_i r_j(w_i)\,\frac{w_i - \mu_j}{\sigma_j^2} \qquad (5.2.26)$$

$$\frac{\partial R}{\partial \sigma_j} = -\sum_i r_j(w_i)\left[\frac{(w_i - \mu_j)^2}{\sigma_j^3} - \frac{1}{\sigma_j}\right] \qquad (5.2.27)$$

and

$$\frac{\partial R}{\partial \pi_j} = \sum_i \frac{\pi_j - r_j(w_i)}{\pi_j} \qquad (5.2.28)$$

with

$$r_j(w_i) = \frac{\pi_j\, p_j(w_i)}{\sum_k \pi_k\, p_k(w_i)} \qquad (5.2.29)$$

It should be noted that the derivation of the partial derivative of *R* with respect to the mixing proportions is less straightforward than those in Equations (5.2.25) through (5.2.27), since we must maintain the sum of the $\pi_j$'s at 1. Thus, the result in Equation (5.2.28) has been obtained by appropriate use of a Lagrange multiplier method and a bit of algebraic manipulation. The term $r_j(w_i)$ in Equations (5.2.25) through (5.2.29) is the posterior probability of Gaussian *j* given weight $w_i$; i.e., it measures the responsibility of Gaussian *j* for the *i*th weight. Equation (5.2.25) attempts to pull the weights toward the center of the "responsible" Gaussian. It realizes a competition mechanism among the various Gaussians for taking on responsibility of weight $w_i$. The partial derivative in Equation (5.2.26) drives $\mu_j$ toward the weighted average of the set of weights for which Gaussian *j* is responsible. Similarly, one may come up with simple interpretations for the derivatives in Equations (5.2.27) and (5.2.28). To summarize, the penalty term in (5.2.24) leads to unsupervised clustering of weights (weight sharing) driven by the biases in the training set.
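A sketch of the responsibility computation of Equation (5.2.29) and the resulting penalty gradient of Equation (5.2.25) (Python/NumPy; the array shapes are illustrative assumptions):

```python
import numpy as np

def responsibilities(w, pi, mu, sigma):
    """r[j, i]: posterior probability that Gaussian j 'owns' weight w_i,
    Eq. (5.2.29), for mixing proportions pi, means mu, std devs sigma."""
    # Gaussian densities p_j(w_i), shape (n_gaussians, n_weights)
    p = np.exp(-0.5 * ((w[None, :] - mu[:, None]) / sigma[:, None]) ** 2) \
        / (np.sqrt(2.0 * np.pi) * sigma[:, None])
    joint = pi[:, None] * p
    return joint / joint.sum(axis=0, keepdims=True)

def dR_dw(w, pi, mu, sigma):
    """Penalty gradient of Eq. (5.2.25): pulls each weight toward the
    centers of the Gaussians responsible for it."""
    r = responsibilities(w, pi, mu, sigma)
    return (r * (w[None, :] - mu[:, None]) / sigma[:, None] ** 2).sum(axis=0)
```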