We started this chapter by deriving backprop, a
gradient-descent-based learning procedure for minimizing the
sum-of-squared-error criterion function in a feedforward layered network
of sigmoidal units. This result is a natural generalization of
the delta learning rule given in Chapter 3 for single-layer networks.
We also presented a global-descent-based error backpropagation
procedure that employs automatic tunneling through the error
function to escape local minima and converge towards a global
minimum.
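As a reminder of what the incremental update looks like in practice, here is a minimal sketch in plain Python. It assumes a single hidden layer, a single output unit, logistic sigmoid activations, and the instantaneous error 0.5(d - y)^2. The function names (`forward`, `backprop_step`, `init_net`), the learning rate, and the initial weight range are illustrative choices and are not taken from the chapter.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W, v, x):
    """Forward pass. W: hidden weight rows (last entry is the bias weight);
    v: output weights (last entry is the bias weight)."""
    xb = list(x) + [1.0]
    z = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W]
    zb = z + [1.0]
    y = sigmoid(sum(vj * zj for vj, zj in zip(v, zb)))
    return xb, zb, y

def pattern_error(W, v, x, d):
    """Instantaneous squared error 0.5 * (d - y)^2 for one pattern."""
    return 0.5 * (d - forward(W, v, x)[2]) ** 2

def backprop_step(W, v, x, d, lr=0.1):
    """One incremental gradient-descent update, applied in place."""
    xb, zb, y = forward(W, v, x)
    delta_out = (d - y) * y * (1.0 - y)
    # Hidden deltas use the output weights as they were before this update.
    delta_hid = [delta_out * v[j] * zb[j] * (1.0 - zb[j]) for j in range(len(W))]
    for j in range(len(v)):
        v[j] += lr * delta_out * zb[j]
    for j, row in enumerate(W):
        for i in range(len(row)):
            row[i] += lr * delta_hid[j] * xb[i]

def init_net(n_in, n_hidden, rng):
    """Small random weights; the range [-0.5, 0.5] is an arbitrary assumption."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return W, v
```

Calling `backprop_step` once per training pattern, in sequence, gives incremental backprop; accumulating the weight changes over all patterns before applying them would give the batch variant.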

Several variations of backprop are introduced in
order to improve convergence speed, avoid "poor" local
minima, and enhance generalization. These variations include
weight initialization methods, autonomous adjustment of learning
parameters, and the addition of regularization terms to the error
function being minimized. A theoretical basis for several of these
variations is presented.

A number of significant real-world applications
are presented where backprop is used to train feedforward networks
for realizing complex mappings between noisy sensory data and
the corresponding desired classifications/actions. These applications
include converting a human's hand movement to speech, hand-written
digit recognition, autonomous vehicle control, medical diagnosis,
and data compression.

Finally, backward error propagation learning is extended to recurrent neural networks, which allows for temporal association of time sequences. Time-delay neural networks, which may be viewed as nonlinear FIR or IIR filters, are shown to be capable of sequence recognition and association by employing standard backprop training. Backpropagation through time is introduced as a training method for fully recurrent networks. It unfolds the recurrent network in time into a feedforward, nonrecurrent version of the original network, which can then be trained using backprop with weight sharing. Direct training of fully recurrent networks is also possible. A recurrent backpropagation method for training fully recurrent nets on static (spatial) associations is presented. This method is also extended to temporal association of continuous-time sequences (time-dependent recurrent backpropagation). Lastly, a method for on-line temporal association of discrete-time sequences (real-time recurrent learning) is discussed.
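The view of a time-delay network as a nonlinear FIR filter can be made concrete with a short sketch: a tapped delay line turns the input sequence into fixed-length windows, and a static feedforward net maps each window to an output. The code below assumes one hidden layer of tanh units and a linear output unit with no biases. The function names and the weight values in the usage note are hypothetical.

```python
import math

def delay_line_windows(x, taps):
    """Return [x(t), x(t-1), ..., x(t-taps+1)] for every t with a full window."""
    return [[x[t - k] for k in range(taps)] for t in range(taps - 1, len(x))]

def tdnn_output(window, hidden_weights, output_weights):
    """Static net applied to one delay-line window (tanh hidden layer, linear output).

    Because the output depends only on the last `taps` inputs, the overall
    input-output map behaves like a nonlinear FIR filter.
    """
    hidden = [math.tanh(sum(w * u for w, u in zip(row, window)))
              for row in hidden_weights]
    return sum(v * h for v, h in zip(output_weights, hidden))
```

For example, `delay_line_windows([1, 2, 3, 4], 2)` yields `[[2, 1], [3, 2], [4, 3]]`. Feeding the net's own delayed outputs back into the window, rather than only past inputs, would give the IIR-like case.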

**5.1.1** Derive Equations
(5.1.11) and (5.1.12).

**5.1.4** Derive the batch
backprop rule for the network in Figure 5.1.1.

*† **5.1.6** Derive and
implement numerically the global descent backprop learning algorithm
for a single hidden layer feedforward network starting from Equation
(5.1.15). Generate a learning curve (as in Figure 5.1.5) for
the 4-bit parity problem using incremental backprop, batch backprop,
and global descent backprop. Assume a fully interconnected
feedforward net with four hidden units and unipolar sigmoid
activations, and use the same initial weights and learning rates for
all learning algorithms (use = 2 and *k* = 0.001
for global descent, and experiment with different directions of
the perturbation vector **w**).

**5.1.7** Consider the two-layer
feedforward net in Figure 5.1.1. Assume that we replace
the hidden layer weights $w_{ji}$
by nonlinear weights of the form $w_{ji} x_i^{r_{ji}}$, where
$r_{ji} \in \mathbb{R}$
is a parameter associated with hidden unit *j*, and $x_i$
is the *i*th component of the input vector **x**. It
has been shown empirically (Narayan, 1993) that this network is
capable of faster and more accurate training when the weights
and the $r_{ji}$
exponents are adapted, as compared to the same network with fixed
$r_{ji} = 0$.
Derive a learning rule for $r_{ji}$
based on incremental gradient descent minimization of the instantaneous
SSE criterion of Equation (5.1.1). Are there any restrictions
on the values of the inputs $x_i$?
What would be a reasonable initial value for the $r_{ji}$
exponents?

† **5.1.8** Consider
the Feigenbaum (1978) chaotic time series generated by the nonlinear
iterated (discrete-time) map

Plot the time series *x*(*t*) for *t* ∈ [0, 20],
starting from *x*(0) = 0.2. Construct (by inspection)
an optimal net of the type considered in Problem 5.1.7 which will
perfectly model this iterated map (assume zero biases and linear
activation functions with unity slope for all units in the network).
Now, vary all exponents and weights by +1 percent and -2
percent, respectively. Compare the time series predicted by this
varied network to *x*(*t*) over the range *t* ∈ [0, 20].
Assume *x*(0) = 0.2. Note that the output of
the net at time *t* + 1 must serve as the new input
to the net for predicting the time series at *t* + 2,
and so on.

**5.2.1** Given a unit with
*n* weights $w_i$
uniformly randomly distributed in the range .
Assume that the components $x_i$
of the input vector **x** are randomly and uniformly distributed
in the interval [0, 1]. Show that the random variable
has a zero mean and unity standard deviation.
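A quick Monte Carlo check can complement the analytical argument. The sketch below rests on two assumptions that are not recoverable from the problem statement as it appears here: that the weight range is $[-3/\sqrt{n}, +3/\sqrt{n}]$, and that the random variable in question is the weighted sum $net = \sum_i w_i x_i$. Under those assumptions each term has variance $(a^2/3)(1/3) = 1/n$ with $a = 3/\sqrt{n}$, so the sum of *n* independent terms has unit variance. The function names are illustrative.

```python
import math
import random

def sample_net(n, rng):
    """Draw one weighted sum of n terms w_i * x_i."""
    a = 3.0 / math.sqrt(n)  # ASSUMED half-width of the weight range
    return sum(rng.uniform(-a, a) * rng.uniform(0.0, 1.0) for _ in range(n))

def net_statistics(n, trials, seed=0):
    """Estimate the mean and standard deviation of the weighted sum."""
    rng = random.Random(seed)
    samples = [sample_net(n, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, math.sqrt(var)
```

With, say, `net_statistics(10, 20000)`, the estimated mean should come out close to 0 and the estimated standard deviation close to 1, up to sampling noise.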

**5.2.2** Explain qualitatively
the characteristics of the approximate Newton's rule of Equation
(5.2.6).

**5.2.3** Complete the missing steps in the derivation
of Equation (5.2.9).

**5.2.4** Derive the activation
function slope update rules of Equations (5.2.10) and (5.2.11).

**5.2.5** Derive the incremental
backprop learning rule starting from the entropy criterion function
in Equation (5.2.16).

**5.2.6** Derive the incremental
backprop learning rule starting from the Minkowski-*r* criterion
function in Equation (5.2.19).

**5.2.7** Comment on the qualitative
characteristics of the Minkowski-*r* criterion function for
negative *r*.

**5.2.8** Derive Equation
(5.2.22).

**5.2.9** Derive the partial
derivatives of *R* in Equations (5.2.25) through (5.2.28)
for the soft weight-sharing regularization term in Equation (5.2.24).
Use the appropriate partial derivatives to solve analytically
for the optimal mixture parameters and
, assuming fixed values for the "responsibilities"
$r_j(w_i)$.

**5.2.10** Give a qualitative
explanation for the effect of adapting the Gaussian mixture parameters
*j*, *j*,
and *j* on learning
in a feedforward neural net.

**5.2.11** Consider the criterion
function with entropy regularization (Kamimura, 1993):

where is a normalized output
of hidden unit *j*, and > 0. Assume the same
network architecture as in Figure 5.1.1 with logistic sigmoid
activations for all units and derive backprop based on this criterion/error
function. What are the effects of the entropy regularization
term on the hidden layer activity pattern of the trained net?

* **5.2.12** The optimum steepest
descent method employs a learning step
defined as the smallest positive root of the equation

Show that the optimal learning step is approximately
given by (Tsypkin, 1971)

† **5.2.13** Repeat the
simulation in Figure 5.2.3, but with a feedforward net having 40 hidden
units. During training, use the noise-free training
samples as indicated by the small circles in Figure 5.2.3; these
samples have the following *x* values {-5, -4, -3, -2, -1, , 0, , 1, 2, 3, 4, 6, 8, 10}.
By comparing the number of degrees of freedom of this net to
the size of the training set, what would your intuitive conclusions
be about the net's approximation behavior? Does the result of
your simulation agree with your intuitive conclusions? Explain.
How would these results be affected if a noisy data set were used?

† **5.2.14** Repeat the
simulations in Figure 5.2.5 using incremental backprop with cross-validation-based
stopping of training. Assume the net to be identical to the one
discussed in Section 5.2.6 in conjunction with Figure 5.2.5.
Also, use the same weight initialization and learning parameters.
Plot the validation and training RMS errors on a log-log scale
for the first 10,000 cycles, and compare them to Figure 5.2.6.
Discuss the differences. Test the resulting "optimally trained"
net on 200 points *x*, generated uniformly in [-8, 12].
Plot the output of this net versus *x* and compare it to
the actual function being approximated.
Also, compare the output of this net to the one in Figure 5.2.5
(dashed line), and give the reason(s) for the difference (if any)
in performance of the two nets. The following training and validation
sets are to be used in this problem. The training set is the
one plotted in Figure 5.2.5. The validation set has the same
noise statistics as for the training set.
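One common way to organize cross-validation-based stopping is to keep a copy of the network that achieved the lowest validation error seen so far. The sketch below is a generic skeleton under that assumption and does not reproduce the procedure of Section 5.2.6. `train_one_cycle` and `validation_error` are hypothetical callables that you would supply (for instance, one pass of incremental backprop and the validation RMS error).

```python
import copy

def train_with_early_stopping(model, train_one_cycle, validation_error,
                              max_cycles, check_every=1):
    """Train for max_cycles and return the snapshot with the lowest validation error.

    model            -- any mutable object holding the weights
    train_one_cycle  -- callable(model): one training cycle, updating model in place
    validation_error -- callable(model) -> float, error on held-out data
    """
    best_err = validation_error(model)
    best_model = copy.deepcopy(model)
    best_cycle = 0
    for cycle in range(1, max_cycles + 1):
        train_one_cycle(model)
        if cycle % check_every == 0:
            err = validation_error(model)
            if err < best_err:
                # Keep an independent snapshot; training continues on `model`.
                best_err = err
                best_model = copy.deepcopy(model)
                best_cycle = cycle
    return best_model, best_cycle, best_err
```

Training runs for the full `max_cycles` so that both error curves can still be plotted; the returned `best_cycle` is the stopping point that cross-validation would have selected.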

† **5.2.15** Repeat Problem
5.2.14 using, as your training set, all the available data (i.e.,
both training and validation data). Here, cross-validation cannot
be used to stop training, since we have no independent (non-training)
data to validate with. One way to help avoid overtraining in
this case would be to stop at the training cycle that led to
the optimal net in Problem 5.2.14. Does the resulting net generalize
better than the one in Problem 5.2.14? Explain.

**5.2.16** Consider the simple
neural net in Figure P5.2.16. Assume the hidden unit has an activation
function and that the output unit has
a linear activation with unit slope. Show that there exists a
set of real-valued weights $\{w_1, w_2, v_1, v_2\}$
which approximates the discontinuous function ,
for all *x*, *a*, *b*, and *c* ∈ ℝ, to
any degree of accuracy.

† **5.4.1** Consider
the time series generated by the Glass-Mackey (Mackey and Glass,
1977) discrete-time equation

Plot the time series *x*(*t*) for *t* ∈ [0, 1000]
and = 17. When solving the above nonlinear difference
delay equation, an initial condition specified by an initial function
defined over a strip of width is required. Experiment with several
different initial functions (e.g. , ).

† **5.4.2** Use incremental
backprop with sufficiently small learning rates to train the network
in Figure 5.4.1 to predict in the Glass-Mackey
time series of Problem 5.4.1 (assume = 17). Use a
collection of 500 training pairs corresponding to different values
of *t* generated randomly from the time series for .
Assume training pairs of the form

Also, assume 50 hidden units with hyperbolic tangent
activation function (set = 1) and use a linear activation
function for the output unit. Plot the training RMS error versus
the number of training cycles. Plot the signal
predicted (recursively) by the trained network and compare it
to for *t* = 0, 6, 12, 18, ..., 1200.
Repeat with a two hidden layer net having 30 units in its first
hidden layer and 15 units in its second hidden layer (use the
learning equation derived in Problem 5.1.2 to train the weights
of the first hidden layer). [For an interesting collection of
time series and their prediction, the reader is referred to the
edited volume by Weigend and Gershenfeld (1994)].

† **5.4.3** Employ the
series-parallel identification scheme of Section 5.4.1 (refer
to Figure 5.4.3) to identify the nonlinear discrete-time plant
(Narendra and Parthasarathy, 1990)

Use a feedforward neural network having 20 hyperbolic
tangent activation (set = 1) units in its hidden layer,
feeding into a linear output unit. Use incremental backprop,
with sufficiently small learning rates, to train the network.
Assume the outputs of the delay lines (inputs to neural network
in Figure 5.4.3) to be *x*(*t*) and *u*(*t*).
Also, assume uniform random inputs in the interval [-2,
+2] during training. Plot the output of the plant as well as
the recursively generated output of the identification model for
the input

**5.4.4** Derive Equations
(5.4.10) and (5.4.13).

**5.4.5** Show that if the
state $\mathbf{y}^*$ is a locally
asymptotically stable equilibrium of the dynamics in Equation
(5.4.6), then the state $\mathbf{z}^*$
satisfying Equation (5.4.17) is a locally asymptotically stable
equilibrium of the dynamics in Equation (5.4.18). (Hint: Start
by showing that linearizing the dynamical equations about their
respective equilibria gives

and

where and
are small perturbations added to and
, respectively.)

* **5.4.6** Derive
Equations (5.4.22) and (5.4.24). (See Pearlmutter (1988) for
help.)

† **5.4.7** Employ time-dependent
recurrent backpropagation learning to generate the trajectories
shown in Figures 5.4.11 (a) and 5.4.12 (a).

**5.4.8** Show that the RTRL
method applied to a fully recurrent network of *N* units
has $O(N^4)$
computational complexity for each learning iteration.
