Consider the two-layer feedforward architecture
shown in Figure 5.1.1. This network receives a set of scalar
signals {x_0, x_1, ..., x_n},
where x_0 is a bias
signal equal to 1. This set of signals constitutes an input vector
**x**, **x** ∈ R^(n+1).
The layer receiving the input signal is called the hidden layer.
Figure 5.1.1 shows a hidden layer having *J* units. The
output of the hidden layer is a (J+1)-dimensional real-valued
vector **z**, **z** = [z_0, z_1, ..., z_J]^T.
Again, z_0 = 1
represents a bias input and can be thought of as being generated
by a "dummy" unit (with index zero) whose output
z_0 is clamped at 1. The vector **z** supplies the input for the
output layer of *L* units. The output layer generates an
*L*-dimensional vector **y** in response to the input
**x** which, when the network is fully trained, should
be identical
(or very close) to a "desired" output vector **d**
associated with **x**.

Figure 5.1.1. A two-layer, fully interconnected
feedforward
neural network architecture. For clarity, only selected connections
are drawn.

The activation function f_h
of the hidden units is assumed to be a differentiable nonlinear
function (typically, f_h
is the logistic
function defined by f_h(net) = 1/(1 + e^{−λ net}),
or a hyperbolic tangent function f_h(net) = tanh(β net),
with values for λ and β close to unity). Each
unit of the output layer is assumed to have the same activation
function, denoted f_o;
the functional form of f_o
is determined by the desired output signal/pattern representation
or the type of application. For example, if the desired output
is real-valued (as in some function approximation applications),
then a linear activation f_o(net) = net may be used. On the other hand, if
the network implements a pattern classifier with binary outputs,
then a saturating nonlinearity similar to f_h
may be used for f_o.
In this case, the components of the desired output vector **d**
must be chosen within the range of f_o.
It is important to note that if f_h
is linear, then one can always collapse the net in Figure 5.1.1
to a single-layer net and thus lose the universal
approximation/mapping
capabilities discussed in Chapter 2. Finally, we denote by
w_ji
the weight of the *j*th hidden unit associated with the input
signal x_i.
Similarly, w_lj
is the weight of the *l*th output unit associated with the
hidden signal z_j.
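The forward computation described above can be sketched in a few lines of Python (a minimal illustration; the function name, the tanh hidden activation, and the linear output activation are my own assumptions, not prescriptions from the text):

```python
import math

def forward(x, W_hidden, W_output, f_h=math.tanh, f_o=lambda net: net):
    """Propagate input x (without bias) through the two-layer network.

    W_hidden: J rows of (n+1) weights w_ji (column 0 multiplies the bias x_0 = 1).
    W_output: L rows of (J+1) weights w_lj (column 0 multiplies the bias z_0 = 1).
    Returns the hidden vector z (with its bias component) and the output vector y.
    """
    x = [1.0] + list(x)  # prepend the bias signal x_0 = 1
    z = [1.0] + [f_h(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    y = [f_o(sum(w * zj for w, zj in zip(row, z))) for row in W_output]
    return z, y
```

For example, with all-zero weights the hidden activations are tanh(0) = 0 and the (linear) output is 0, so the bias components are the only nonzero entries of **z**.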

Next, consider a set of *m* input/output pairs
{**x**_k, **d**_k},
where **d**_k
is an *L*-dimensional vector representing the desired network
output upon the presentation of **x**_k.
The objective here is to adaptively adjust the
J(n + 1) + L(J + 1)
weights of this network such that the underlying function/mapping
represented by the training set is approximated or learned. Since
the learning here is supervised, i.e., target outputs are available,
we may define an error function to measure the degree of approximation
for any given setting of the network's weights. A commonly used
error function is the SSE measure, but this is by no means the
only possibility, and later in this chapter, several other error
functions will be discussed. Once a suitable error function is
formulated, learning can be viewed (as was done in Chapters 3
and 4) as an optimization process. That is, the error function
serves as a criterion function, and the
learning algorithm seeks
to minimize the criterion function over the space of possible
weight settings. For instance, if a differentiable criterion
function is used, gradient descent on such a function will naturally
lead to a learning rule. This idea was arrived at independently
by Bryson and Ho (1969), Amari (1967; 1968), Werbos (1974), and
Parker (1985). Next, we
illustrate the above idea by deriving
a supervised learning rule for adjusting the weights
w_ji and w_lj
such that the following error function is minimized (in a local
sense) over the training set
(Rumelhart et al., 1986b):

E(**w**) = (1/2) Σ_{l=1}^{L} (d_l − y_l)^2        (5.1.1)

Here, **w** represents the set of all weights
in the network. Note that Equation (5.1.1) is the
"instantaneous"
SSE criterion of Equation (3.1.32) generalized for a multiple
output network.
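As a quick sketch, the instantaneous SSE measure for a single pattern follows directly from this definition (the function name is mine):

```python
def instantaneous_sse(d, y):
    """E(w) = (1/2) * sum_l (d_l - y_l)^2 for a single training pattern."""
    return 0.5 * sum((dl - yl) ** 2 for dl, yl in zip(d, y))
```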

**5.1.1 Error Backpropagation Learning Rule**

Since the targets for the output units are explicitly
specified, one can directly use the delta rule, derived in Section
3.1.3, for updating the w_lj
weights. That is,

w_lj^new = w_lj^current + ρ_o (d_l − y_l) f_o'(net_l) z_j        (5.1.2)

where net_l = Σ_{j=0}^{J} w_lj z_j is the weighted sum
for the *l*th output unit, f_o'(net_l) is the
derivative of f_o
with respect to net, and w_lj^new and w_lj^current
represent the updated (new) and current weight values, respectively.
The z_j's
are computed by propagating the input vector **x** through
the hidden layer according to:

z_j = f_h(Σ_{i=0}^{n} w_ji x_i),   j = 1, 2, ..., J        (5.1.3)
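A small sketch of the output-layer update in Equation (5.1.2) (the function name, the default linear output activation with f_o'(net) = 1, and the numbers below are my own illustrative assumptions):

```python
def delta_rule_step(W_output, z, d, y, rho_o=0.1, f_o_prime=lambda net: 1.0):
    """One delta-rule update: w_lj_new = w_lj + rho_o*(d_l - y_l)*f_o'(net_l)*z_j.

    W_output: L rows of (J+1) weights; z includes the bias component z_0 = 1.
    Returns the updated weight matrix without modifying the input.
    """
    new_W = []
    for l, row in enumerate(W_output):
        net_l = sum(w * zj for w, zj in zip(row, z))
        delta_l = (d[l] - y[l]) * f_o_prime(net_l)
        new_W.append([w + rho_o * delta_l * zj for w, zj in zip(row, z)])
    return new_W
```

With z = [1, 2], target 1, output 0, and rho_o = 0.5, the error signal is 1 and each weight moves by 0.5 times its input.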

The learning rule for the hidden-layer weights
w_ji
is not as obvious as that for the output layer since we do not
have available a set of target values (desired outputs) for hidden
units. However, one may derive a learning rule for hidden units
by attempting to minimize the output-layer error. This amounts
to propagating the output errors (d_l − y_l)
back through the output layer towards the hidden units in an attempt
to estimate "dynamic" targets for these units. Such
a learning rule is termed error backpropagation or the backprop
learning rule and may be viewed as an extension of the delta rule
(Equation 5.1.2) used for updating the output layer. To complete
the derivation
of backprop for the hidden layer weights, and similar
to the above derivation for the output layer weights, we perform
gradient descent on the criterion function in Equation (5.1.1),
but this time, the gradient is calculated with respect to the
hidden weights:

Δw_ji = −ρ_h ∂E/∂w_ji        (5.1.4)

where the partial derivative is to be evaluated at
the current weight values. Using the chain rule for differentiation,
one may express the partial derivative in Equation (5.1.4) as

∂E/∂w_ji = (∂E/∂z_j)(∂z_j/∂net_j)(∂net_j/∂w_ji)        (5.1.5)

with

∂E/∂z_j = −Σ_{l=1}^{L} (d_l − y_l) f_o'(net_l) w_lj        (5.1.6)

∂z_j/∂net_j = f_h'(net_j)        (5.1.7)

and

∂net_j/∂w_ji = x_i        (5.1.8)
Now, upon substituting Equations (5.1.6) through
(5.1.8) into Equation (5.1.5) and using Equation (5.1.4), we arrive
at the desired learning rule:

w_ji^new = w_ji^current + ρ_h [Σ_{l=1}^{L} (d_l − y_l) f_o'(net_l) w_lj] f_h'(net_j) x_i        (5.1.9)

By comparing Equation (5.1.9) to (5.1.2), one can
immediately define an "estimated target"
d_j
for the *j*th hidden unit implicitly
in terms of the backpropagated error signal (d_j − z_j)
as follows:

d_j − z_j = Σ_{l=1}^{L} (d_l − y_l) f_o'(net_l) w_lj        (5.1.10)

It is usually possible to express the derivatives
of the activation functions in Equations (5.1.2) and (5.1.9) in
terms of the activations themselves. For example, for the logistic
activation function, we have

f'(net) = λ f(net)[1 − f(net)]        (5.1.11)

and for the hyperbolic tangent function, we
have

f'(net) = β[1 − f^2(net)]        (5.1.12)
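Both identities are easy to check numerically; the sketch below compares each closed form against a central finite difference (the function names are mine, and λ and β default to 1):

```python
import math

def logistic(net, lam=1.0):
    """Logistic activation f(net) = 1 / (1 + e^(-lam*net))."""
    return 1.0 / (1.0 + math.exp(-lam * net))

def logistic_prime(net, lam=1.0):
    """Derivative expressed through the activation itself: lam * f * (1 - f)."""
    f = logistic(net, lam)
    return lam * f * (1.0 - f)

def tanh_prime(net, beta=1.0):
    """Derivative of tanh(beta*net) through the activation: beta * (1 - f^2)."""
    f = math.tanh(beta * net)
    return beta * (1.0 - f * f)

def central_diff(fn, net, h=1e-6):
    """Central finite-difference estimate of fn'(net), for checking."""
    return (fn(net + h) - fn(net - h)) / (2 * h)
```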

The above learning equations may also be extended
to feedforward nets with more than one hidden layer and/or nets
with connections that jump over one or
more layers (see Problems 5.1.2 and
5.1.3). The complete procedure for updating the weights in a
feedforward neural net utilizing the above rules is summarized
below for the two layer architecture of Figure 5.1.1. We will
refer to this learning procedure as incremental backprop or just
backprop:

1. Initialize all weights and refer to them as "current" weights w_lj^current and w_ji^current (see Section 5.2.1 for details).

2. Set the learning rates ρ_o
and ρ_h to small
positive values (refer to Section 5.2.2 for additional
details).

3. Select an input pattern **x**_k
from the training set (preferably at random) and propagate it
through the network, thus generating hidden and output unit
activities based on the current weight settings.

4. Use the desired target **d**_k
associated with **x**_k
and employ Equation (5.1.2) to compute the output-layer weight
changes Δw_lj.

5. Employ Equation (5.1.9) to compute the hidden-layer
weight changes Δw_ji. Normally, the
current weights are used in these computations. In general, enhanced
error correction may be achieved if one employs the updated
output-layer weights w_lj^new. However, this
comes at the added cost of recomputing y_l
and f_o'(net_l).

6. Update all weights according to w_lj^new = w_lj^current + Δw_lj and w_ji^new = w_ji^current + Δw_ji for the output and hidden layers, respectively.

7. Test for convergence. This is done by checking
some preselected function of the output errors to see if its magnitude
is below some preset threshold. If convergence is met, stop;
otherwise, set w_lj^current = w_lj^new and w_ji^current = w_ji^new
and go to step 3. It should be noted that backprop may fail
to find a solution which passes the convergence test. In this
case, one may try to reinitialize the search process, tune the
learning parameters, and/or use more hidden units.
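The seven-step procedure above can be sketched end to end in plain Python. This toy run is purely illustrative and rests on my own assumptions (a 2-input XOR-style task with ±1 targets, tanh activations at both layers, fixed learning rates and seed, and a fixed epoch count in place of a convergence test):

```python
import math
import random

def train_incremental_backprop(patterns, J=2, epochs=2000,
                               rho_o=0.1, rho_h=0.1, seed=0):
    """Incremental backprop for a two-layer feedforward net (tanh units)."""
    rng = random.Random(seed)
    n = len(patterns[0][0])          # inputs per pattern (excluding bias)
    L = len(patterns[0][1])          # outputs per pattern
    # Step 1: initialize "current" weights with small random values.
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(n + 1)] for _ in range(J)]
    W_o = [[rng.uniform(-0.5, 0.5) for _ in range(J + 1)] for _ in range(L)]
    f, fp = math.tanh, lambda v: 1.0 - math.tanh(v) ** 2
    for _ in range(epochs):
        # Step 3: select a pattern at random and propagate it forward.
        x, d = patterns[rng.randrange(len(patterns))]
        x = [1.0] + list(x)
        net_h = [sum(w * xi for w, xi in zip(row, x)) for row in W_h]
        z = [1.0] + [f(v) for v in net_h]
        net_o = [sum(w * zj for w, zj in zip(row, z)) for row in W_o]
        y = [f(v) for v in net_o]
        # Step 4: output-layer error signals, as in Equation (5.1.2).
        delta_o = [(d[l] - y[l]) * fp(net_o[l]) for l in range(L)]
        # Step 5: hidden-layer error signals, as in Equation (5.1.9).
        delta_h = [fp(net_h[j]) * sum(delta_o[l] * W_o[l][j + 1]
                                      for l in range(L)) for j in range(J)]
        # Step 6: update all weights.
        for l in range(L):
            W_o[l] = [w + rho_o * delta_o[l] * zj for w, zj in zip(W_o[l], z)]
        for j in range(J):
            W_h[j] = [w + rho_h * delta_h[j] * xi for w, xi in zip(W_h[j], x)]
    return W_h, W_o

def sse(patterns, W_h, W_o):
    """Total SSE of the network over a pattern set (tanh at both layers)."""
    total = 0.0
    for x, d in patterns:
        x = [1.0] + list(x)
        z = [1.0] + [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                     for row in W_h]
        y = [math.tanh(sum(w * zj for w, zj in zip(row, z))) for row in W_o]
        total += 0.5 * sum((dl - yl) ** 2 for dl, yl in zip(d, y))
    return total
```

Running this with a few thousand pattern presentations on the four XOR pairs typically drives the SSE well below its initial value, though, as noted in step 7, backprop is not guaranteed to pass any particular convergence test.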

The above procedure is based on "incremental"
learning, which means that the weights are updated after every
presentation of an input pattern. An alternative is to employ
"batch"
learning, where weight updating is performed
only after all patterns (assuming a finite training set) have
been presented. Batch learning is formally stated by summing
the right-hand sides of Equations (5.1.2) and (5.1.9) over all
patterns **x**_k.
This amounts to gradient descent on the criterion
function

E(**w**) = (1/2) Σ_{k=1}^{m} Σ_{l=1}^{L} (d_l^k − y_l^k)^2        (5.1.13)

Even though batch updating moves the search point
**w** in the direction of the true gradient at each update
step, the "approximate" incremental updating is more
desirable for two reasons: (1) It requires less storage, and (2)
it makes the search path in the weight space stochastic (here,
at each time step the input vector **x** is drawn at random)
which allows for a wider exploration of the search space and,
potentially, leads to better quality solutions. When backprop
converges, it converges to a local minimum of the criterion
function
(McInerny et al., 1989).
This fact is true of any gradient descent-based
learning rule when the surface being searched is nonconvex (Amari,
1990); i.e., it admits local minima. Using stochastic
approximation
theory, Finnoff (1993; 1994)
showed that for "very small"
learning rates (approaching zero), incremental backprop approaches
batch backprop and produces essentially the same results. However,
for small constant learning rates there is a nonnegligible stochastic
element in the training process which gives incremental backprop
a "quasiannealing" character in which the cumulative
gradient is continuously perturbed, allowing the search to escape
local minima with small shallow basins of attraction. Thus,
solutions
generated by incremental backprop are often practical ones. The
local minima problem can be further eased by heuristically adding
random noise to the weights (von Lehman et al., 1988) or by
adding
noise to the input patterns (Sietsma and Dow, 1988). In
both
cases, some noise reduction schedule should be employed to dynamically
reduce the added noise level towards zero as learning progresses.
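One simple way to realize such a noise reduction schedule is sketched below (the Gaussian noise and the linear decay are my assumptions; the text only requires that the added noise level fall toward zero as learning progresses):

```python
import random

def noisy_pattern(x, epoch, total_epochs, sigma0=0.2, rng=random):
    """Add zero-mean Gaussian noise to input pattern x; the noise level
    decays linearly from sigma0 at epoch 0 to zero at the final epoch."""
    sigma = sigma0 * max(0.0, 1.0 - epoch / total_epochs)
    return [xi + rng.gauss(0.0, sigma) for xi in x]
```

At the final epoch the schedule returns the pattern unchanged, so late training sees the clean data.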

Next, the incremental backprop learning
procedure
is applied to solve a two-dimensional, two-class pattern
classification
problem. This problem should help give a good feel for what is
learned by the hidden units in a feedforward neural network, and
how the various units work together to generate a desired solution.

**Example 5.1.1:** Consider
the two-class problem shown in Figure 5.1.2. The points inside
the shaded region belong to class B and all other points are in
class A. A three-layer feedforward neural network trained with
backprop is employed to learn to distinguish
between these two classes. The network consists of an 8-unit
first hidden layer, followed by a second hidden layer with 4 units,
followed by a 1-unit output layer. We will refer to such a network
as having an 8-4-1 architecture. All units employ a hyperbolic
tangent activation function. The output unit should encode
the class of each input vector; a positive output indicates class
B and a negative output indicates class A. Incremental backprop
was used with learning rates set to 0.1. The
training set consists
of 500 randomly chosen points, 250 from region A and another 250
from region B. In this training set, points representing class
B and class A were assigned desired output (target) values of
+1 and -1,
respectively. Training was performed for several hundred cycles
over the training set.

Figure 5.1.3 shows geometrical plots of all unit
responses upon testing the network with a new set of 1000 uniformly
randomly generated points inside the [−1, +1]^2 region. In generating
each plot, a black dot was placed at the exact coordinates of
the test point (input) in the input space if and only if the
corresponding
unit response is positive. The boundaries between the dotted
and the white regions in the plots represent approximate decision
boundaries learned by the various units in the network. Figure
5.1.3 (a)-(h) represent the decision boundaries learned by the
eight units in the first hidden layer. Figure 5.1.3 (i)-(l) shows
the decision boundaries learned by the four units of the second
hidden layer. Figure 5.1.3 (m) shows the decision boundary realized
by the output unit. Note the linear nature of the separating
surface realized by the first hidden layer units, from which complex
nonlinear separating surfaces are realized by the second hidden
layer units and ultimately by the output layer unit. This example
also illustrates how a single hidden layer feedforward net (counting
only the first two layers) is capable of realizing convex, concave,
as well as disjoint decision regions, as can be seen from Figure
5.1.3 (i)-(l). Here, we neglect the output unit and view the
remaining net as one with an 8-4 architecture.

The present problem can also be solved with smaller
networks (fewer hidden units or even a network with
a single hidden layer). However, the training of such smaller
networks with backprop may become more difficult. A smaller network
with a 5-3-1 architecture utilizing a variant of the backprop learning
procedure (Hassoun et al.,
1990) is reported in Song
(1992); it
has a separating surface comparable to the one in Figure 5.1.3
(m).

Figure 5.1.3. Separating surfaces generated by the various units in the 8-4-1 network of Example 5.1.1 (a)-(h): Separating surfaces realized by the units in the first hidden layer; (i)-(l): Separating surface realized by the units in the second hidden layer; and (m): Separating surface realized by the output unit.

Huang and Lippmann
(1988) employed Monte Carlo simulations
to investigate the capabilities of backprop in learning complex
decision regions (see Figure 2.3.3). They reported no significant
performance difference between two and three layer feedforward
nets when forming complex decision regions using backprop. They
also demonstrated that backprop's convergence time is excessive
for complex decision regions and the performance of such trained
classifiers is similar to that obtained with the *k*-nearest
neighbor classifier (Duda and
Hart, 1973). Villiers
and Barnard
(1993) reported similar simulations but on data sets which
consisted
of a "distribution of distributions" where a typical
class is a set of clusters (distributions) in the feature space,
each of which can be more or less spread out and which might involve
some or all of the dimensions of the feature space; the distribution
of distributions thus assigns a probability to each distribution
in the data set. It was found for networks of equal complexity
(same number of weights), that there is no significant difference
between the quality of "best" solutions generated by
two and three layer backprop-trained feedforward networks; actually,
the two layer nets demonstrated better performance, on the average.
As for the speed of convergence, three layer nets converged faster
if the number of units in the two hidden layers were roughly equal.

Gradient descent search may be eliminated
altogether
in favor of a stochastic global search procedure that guarantees
convergence to a global solution with high probability; genetic
algorithms and simulated annealing are examples of such procedures
and are considered in Chapter 8. However, the assured (in
probability)
optimality of these global search procedures comes at the expense
of slow convergence. Next, a deterministic search procedure termed
global descent is presented which helps backprop reach globally
optimal solutions.

**5.1.2 Global Descent-Based Error
Backpropagation**

Here, we describe a learning method in which the
gradient
descent rule in batch backprop is replaced with a "global
descent" rule (Cetin et al.
1993a). This methodology is
based on a global optimization scheme, acronymed
TRUST: terminal
repeller unconstrained subenergy tunneling, which formulates
optimization
in terms of the flow of a special deterministic dynamical system
(Cetin et al., 1993b).

Global descent is a gradient descent on a special
criterion function C(**w**, **w***)
given by

C(**w**, **w***) = log[1/(1 + e^{−(E(**w**) − E(**w***)) + a})] − (3/4) k Σ_i (w_i − w_i*)^{4/3} u(E(**w**) − E(**w***))        (5.1.14)

where **w***,
with component values w_i*, is a fixed weight
vector which can be a local minimum of E(**w**) or an
initial weight state **w**_0;
u(·) is the unit step function; *a* is a shifting
parameter (typically set to 2); and *k* is a small positive
constant. The first term on the right-hand side of Equation (5.1.14)
is a monotonic transformation of the original criterion function
(e.g., the SSE criterion may be used) which preserves all critical
points of E(**w**) and has the same relative ordering
of the local and global minima of E(**w**). It also
flattens the portion of E(**w**) above
E(**w***)
with minimal distortion elsewhere. On the other hand, the second
term is a "repeller term" which
gives rise to a convex surface with a unique minimum located at
**w** = **w***. The overall effect of this energy
transformation is schematically represented for a one-dimensional
criterion function in Figure 5.1.4.

Performing gradient descent on C(**w**, **w***) leads to the
"global
descent" update rule

dw_i/dt = −(∂E/∂w_i) [1 + e^{(E(**w**) − E(**w***)) − a}]^{−1} + k (w_i − w_i*)^{1/3} u(E(**w**) − E(**w***))        (5.1.15)

The first term on the right-hand side of Equation
(5.1.15) is a "subenergy gradient", while the second
term is a "non-Lipschitzian" terminal repeller (Zak,
1989). Upon replacing the gradient descent in Equations (5.1.2)
and (5.1.4) by Equation (5.1.15), where w_i
represents an arbitrary hidden unit or output unit weight, the
modified backprop procedure may escape local minima of the original
criterion function *E*(**w**) given in Equation (5.1.13).
Here, the batch training is required since Equation (5.1.15)
necessitates a unique error surface for all patterns.
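The terminal repeller behavior alone can be illustrated in one dimension (this sketch is not the full global descent system; the constant k, the step size, the starting perturbation, and the simple forward-Euler discretization are my own choices):

```python
def repeller_steps(w_star, w0, k=1.0, dt=0.01, steps=100):
    """Iterate dw/dt = k * (w - w*)^(1/3), a terminal repeller at w*.

    The cube-root (non-Lipschitzian) right-hand side keeps the repelling
    speed away from zero even arbitrarily close to w*, so a tiny
    perturbation from w* is driven away in finite time.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        diff = w - w_star
        # signed cube root of diff, scaled by k and the step size
        w += dt * k * (abs(diff) ** (1.0 / 3.0)) * (1.0 if diff >= 0 else -1.0)
        history.append(w)
    return history
```

Starting a minute distance from w*, the iterate moves strictly away at every step, which is the tunneling mechanism the text describes.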

Figure 5.1.4. A plot of a one-dimensional criterion
function E(w) with local minimum at w*.
The function E(w) − E(w*)
is plotted below, as well as the global descent criterion function
C(w, w*).

The update rule in Equation (5.1.15) automatically
switches between two phases: a tunneling phase and a gradient
descent phase. The tunneling phase is characterized by E(**w**) ≥ E(**w***).
Since for this condition the subenergy gradient term is nearly
zero in the vicinity of the local minimum **w***,
the terminal repeller term in Equation (5.1.15) dominates, leading
to the dynamical system

dw_i/dt = k (w_i − w_i*)^{1/3}        (5.1.16)

This system has an unstable repeller equilibrium
point at w_i = w_i*; i.e., at the local minimum
of E(**w**). The "power" of this repeller
is determined by the constant *k*. Thus, the dynamical system
given by Equation (5.1.15), when initialized with a small perturbation
from **w***, is repelled
from this local minimum until it reaches a lower energy region
E(**w**) < E(**w***); i.e., tunneling through portions of
E(**w**) where E(**w**) ≥ E(**w***) is accomplished.
The second phase is a gradient descent minimization phase,
characterized
by E(**w**) < E(**w***). Here, the repeller term is identically
zero. Thus, Equation (5.1.15) becomes

dw_i/dt = −ρ(**w**) ∂E/∂w_i        (5.1.17)

where ρ(**w**) is a dynamic learning rate (step
size) equal to [1 + e^{(E(**w**) − E(**w***)) − a}]^{−1}. Note that ρ(**w**)
is approximately equal to 1 when E(**w***)
is larger than E(**w**) + *a*.

Initially, **w***
is chosen as one corner of a domain in the form of a
hyperparallelepiped
of dimension J(n + 1) + L(J + 1), which is the dimension
of **w** in the architecture of Figure 5.1.1. A slightly
perturbed
version of **w***,
namely **w**_0 = **w*** + Δ**w**, is taken as the initial state
of the dynamical system in Equation (5.1.15). Here, Δ**w**
is a small perturbation which drives the system into the domain
of interest. If E(**w**_0) < E(**w***), the system immediately
enters a gradient descent phase which equilibrates at a local
minimum. Every time a new equilibrium is reached,
**w***
is set equal to this equilibrium and Equation (5.1.15) is
reinitialized
with **w** = **w*** + Δ**w**, which assures a necessary consistency
in the search flow direction. Since **w***
is now a local minimum, E(**w**) ≥ E(**w***) holds in the
neighborhood of **w***.
Thus, the system enters a repelling (tunneling) phase, and the
repeller at **w***
repels the system until it reaches a lower basin of attraction
where E(**w**) < E(**w***). As the dynamical system enters
the next basin, the system automatically switches to gradient
descent and equilibrates at the next lower local minimum. We then
set **w*** equal to
this new minimum and repeat the process. If, on the other hand,
E(**w**_0) ≥ E(**w***) at the onset of training, then the system
is initially in a tunneling phase. The tunneling will proceed
to a lower basin, at which point it enters the minimization phase
and follows the behavior discussed above. Training can be stopped
when a minimum **w***
corresponding to E(**w***) = 0 is reached, or when
E(**w***) becomes
smaller than a preset threshold.

The global descent method is guaranteed to find
the global minimum for functions of one variable, but not for
multivariate functions. However, in the multidimensional case,
the algorithm will always escape from one local minimum to another
with a lower or equal functional value. Figure 5.1.5 compares
the learning curve for the global descent-based backprop to that
of batch backprop for the four-bit parity problem in a feedforward
net with four hidden units and a single output unit. The same
initial random weights are used in both cases. The figure depicts
one tunneling phase for the global descent algorithm before
convergence
to a (perfect) global minimum solution. In performing this
simulation,
it is found that the direction of the perturbation vector
Δ**w**
is very critical in regard to successfully reaching a global minimum.
On the other hand, batch backprop converges to the first local
minimum it reaches. This local solution represents a partial
solution to the 4-bit parity problem (i.e., mapping error is present).
Simulations using incremental backprop with the same initial
weights as in the above simulations are also performed, but are
not shown in the figure. Incremental backprop was able to produce
both of the solutions shown in Figure 5.1.5; very small learning
rates ρ_o and ρ_h
often lead to imperfect local solutions, while relatively larger
learning rates may lead to a perfect solution.

Figure 5.1.5. Learning curves for global descent-
and gradient descent-based batch backprop for the 4-bit parity problem.