We started this chapter by deriving backprop, a
gradient-descent-based learning procedure for minimizing the sum
of squared error (SSE) criterion function in a feedforward layered network
of sigmoidal units. This result is a natural generalization of
the delta learning rule given in Chapter 3 for single-layer networks.
We also presented a global descent-based error backpropagation
procedure which employs automatic tunneling through the error
function for escaping local minima and converging towards a global
minimum.
Several variations of backprop are introduced in
order to improve convergence speed, avoid "poor" local
minima, and enhance generalization. These variations include
weight initialization methods, autonomous adjustment of learning
parameters, and the addition of regularization terms to the error
function being minimized. A theoretical basis for several of these
variations is presented.
A number of significant real-world applications
are presented where backprop is used to train feedforward networks
for realizing complex mappings between noisy sensory data and
the corresponding desired classifications/actions. These applications
include converting a human's hand movement to speech, hand-written
digit recognition, autonomous vehicle control, medical diagnosis,
and data compression.
Finally, extensions of backward error propagation learning to recurrent neural networks are given, which allow for temporal association of time sequences. Time-delay neural networks, which may be viewed as nonlinear FIR or IIR filters, are shown to be capable of sequence recognition and association by employing standard backprop training. Backpropagation through time is introduced as a training method for fully recurrent networks. It employs a trick that allows backprop with weight sharing to be used to train an unfolded, feedforward (nonrecurrent) version of the original network. Direct training of fully recurrent networks is also possible. A recurrent backpropagation method for training fully recurrent nets on static (spatial) associations is presented. This method is also extended to temporal association of continuous-time sequences (time-dependent recurrent backpropagation). Lastly, a method for on-line temporal association of discrete-time sequences (real-time recurrent learning) is discussed.
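The unfolding idea behind backpropagation through time can be illustrated with a minimal sketch. The dynamics y(t+1) = tanh(W y(t) + x(t)), the error E = 0.5 ||y(T) - d||^2 measured only at the final step, and the name `bptt_gradient` are assumptions made for illustration; they are not the chapter's exact formulation. The recurrent net is unfolded into T feedforward layers that all share the same weight matrix W, and the gradient with respect to W is the sum of the per-layer backprop contributions.

```python
import math

def bptt_gradient(W, y0, inputs, target):
    """Return (E, dE/dW) for the assumed net y(t+1) = tanh(W y(t) + x(t)),
    with E = 0.5 * ||y(T) - target||^2 evaluated at the last time step."""
    n = len(y0)

    # Forward pass: unfold the recurrent net into len(inputs) layers.
    ys = [list(y0)]
    for x in inputs:
        prev = ys[-1]
        ys.append([math.tanh(sum(W[i][j] * prev[j] for j in range(n)) + x[i])
                   for i in range(n)])
    error = 0.5 * sum((ys[-1][i] - target[i]) ** 2 for i in range(n))

    # Backward pass: ordinary backprop through the unfolded layers.
    grad = [[0.0] * n for _ in range(n)]
    delta_y = [ys[-1][i] - target[i] for i in range(n)]      # dE/dy(T)
    for t in range(len(inputs), 0, -1):
        # Pass the error signal through the tanh nonlinearity of layer t.
        delta_net = [delta_y[i] * (1.0 - ys[t][i] ** 2) for i in range(n)]
        # Weight sharing: every layer adds its contribution to the same W.
        for i in range(n):
            for j in range(n):
                grad[i][j] += delta_net[i] * ys[t - 1][j]
        # Propagate the error signal back to y(t-1).
        delta_y = [sum(W[i][j] * delta_net[i] for i in range(n))
                   for j in range(n)]
    return error, grad
```

A gradient descent step would then subtract a small multiple of `grad` from W. Comparing `grad` against a finite-difference estimate of the error is a simple way to check such a routine.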
Problems
5.1.1 Derive Equations
(5.1.11) and (5.1.12).
5.1.2 Derive the backprop
learning rule for the first hidden layer (the layer directly connected
to the input signal x) in a three-layer (two-hidden-layer)
feedforward network. Assume that the first hidden layer has K
units with weights $w_{ki}$
and differentiable activations $f_{h1}(net_k)$,
the second hidden layer has J units with weights $w_{jk}$
and differentiable activations $f_{h2}(net_j)$,
and the output layer has L units with weights $w_{lj}$
and differentiable activations $f_o(net_l)$.
5.1.3 Consider the neural
network in Figure 5.1.1 with full additional connections between
the input vector x and the output layer units. Let the
weights of these additional connections be designated as $w_{li}$
(the connection weight between the lth
output unit and the ith input signal). Derive a learning
rule for these additional weights based on gradient descent minimization
of the instantaneous SSE criterion function.
5.1.4 Derive the batch
backprop rule for the network in Figure 5.1.1.
5.1.5 Use the incremental
backprop procedure described in Section 5.1.1 to train a two-layer
network with 12 hidden units and a single output unit to learn
to distinguish between the class regions in Figure P5.1.5. Follow
a similar training strategy to the one employed in Example 5.1.1.
Generate and plot the separating surfaces learned by the various
units in the network. Can you identify the function realized
by the output unit?

* 5.1.6 Derive and
implement numerically the global descent backprop learning algorithm
for a single hidden layer feedforward network starting from Equation
(5.1.15). Generate a learning curve (as in Figure 5.1.5) for
the 4-bit parity problem using incremental backprop, batch backprop,
and global descent backprop. Assume a fully interconnected
feedforward net with four hidden units and unipolar sigmoid
activations, and use the same initial weights and learning rates for
all learning algorithms (use = 2 and k = 0.001
for global descent, and experiment with different directions of
the perturbation vector w).
5.1.7 Consider the two
layer feedforward net in Figure 5.1.1. Assume that we replace
the hidden layer weights $w_{ji}$
by nonlinear weights of the form
where
$r_{ji} \in \mathbb{R}$
is a parameter associated with hidden unit j, and $x_i$
is the ith component of the input vector x. It
has been shown empirically (Narayan, 1993) that this network is
capable of faster and more accurate training when the weights
and the $r_{ji}$
exponents are adapted, as compared to the same network with fixed
$r_{ji} = 0$.
Derive a learning rule for $r_{ji}$
based on incremental gradient descent minimization of the instantaneous
SSE criterion of Equation (5.1.1). Are there any restrictions
on the values of the inputs $x_i$?
What would be a reasonable initial value for the $r_{ji}$
exponents?
5.1.8 Consider
the Feigenbaum (1978) chaotic time series generated by the nonlinear
iterated (discrete-time) map

Plot the time series x(t) for $t \in [0, 20]$,
starting from x(0) = 0.2. Construct (by inspection)
an optimal net of the type considered in Problem 5.1.7 which will
perfectly model this iterated map (assume zero biases and linear
activation functions with unity slope for all units in the network).
Now, vary all exponents and weights by +1 percent and -2
percent, respectively. Compare the time series predicted by this
varied network to x(t) over the range $t \in [0, 20]$.
Assume x(0) = 0.2. Note that the output of
the net at time t + 1 must serve as the new input
to the net for predicting the time series at t + 2,
and so on.
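The closed-loop (recursive) prediction described in the last sentence can be sketched as follows. The helper `closed_loop` is hypothetical, and the map `logistic` (the logistic map with parameter 4) is only an assumed stand-in chosen for illustration; substitute the iterated map given in the problem and the trained or hand-constructed net for the two `step` functions. Scaling the map parameter by 0.98 is likewise just a simple proxy for a perturbed model, not the exponent and weight variation the problem asks for.

```python
def closed_loop(step, x0, n_steps):
    """Iterate a one-step predictor, feeding each output back as the next
    input. Returns [x(0), x(1), ..., x(n_steps)]."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(xs[-1]))
    return xs

def logistic(x, lam=4.0):
    """Assumed stand-in map (not taken from the problem statement)."""
    return lam * x * (1.0 - x)

# Reference series and a slightly perturbed model, both run in closed loop.
reference = closed_loop(logistic, 0.2, 20)
varied = closed_loop(lambda x: logistic(x, lam=4.0 * 0.98), 0.2, 20)

# Pointwise gap between the two series for t = 0, ..., 20.
gaps = [abs(r - v) for r, v in zip(reference, varied)]
```

Because each prediction is fed back as the next input, any modeling error is re-injected at every step, which is why the comparison is made over the whole range of t rather than one step ahead.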
5.2.1 Consider a unit with
n weights $w_i$
uniformly randomly distributed in the range
$[-3/\sqrt{n}, +3/\sqrt{n}]$.
Assume that the components $x_i$
of the input vector x are randomly and uniformly distributed
in the interval [0, 1]. Show that the random variable
$net = \sum_{i=1}^{n} w_i x_i$
has a zero mean and unity standard deviation.
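A quick Monte Carlo experiment can serve as a sanity check before attempting the derivation. The sketch below assumes that the weights are drawn independently and uniformly from [-3/sqrt(n), +3/sqrt(n)], that the inputs are independent and uniform on [0, 1], and that the random variable of interest is the weighted sum net = sum_i w_i x_i. The function names are hypothetical.

```python
import math
import random

def sample_net(n, rng):
    """Draw one sample of net = sum_i w_i * x_i under the assumed
    distributions: w_i ~ U[-3/sqrt(n), 3/sqrt(n)], x_i ~ U[0, 1]."""
    a = 3.0 / math.sqrt(n)
    return sum(rng.uniform(-a, a) * rng.random() for _ in range(n))

def estimate_mean_std(n, trials, seed=0):
    """Estimate the mean and standard deviation of net from random samples."""
    rng = random.Random(seed)
    samples = [sample_net(n, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, math.sqrt(var)
```

Calling `estimate_mean_std(20, 20000)` should return a mean close to 0 and a standard deviation close to 1, up to sampling error; repeating the call for other values of n indicates whether the result depends on the fan-in.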
5.2.2 Explain qualitatively
the characteristics of the approximate Newton's rule of Equation
(5.2.6).
5.2.3 Complete the missing steps in the derivation
of Equation (5.2.9).
5.2.4 Derive the activation
function slope update rules of Equations (5.2.10) and (5.2.11).
5.2.5 Derive the incremental
backprop learning rule starting from the entropy criterion function
in Equation (5.2.16).
5.2.6 Derive the incremental
backprop learning rule starting from the Minkowski-r criterion
function in Equation (5.2.19).
5.2.7 Comment on the qualitative
characteristics of the Minkowski-r criterion function for
negative r.
5.2.8 Derive Equation
(5.2.22).
5.2.9 Derive the partial
derivatives of R in Equations (5.2.25) through (5.2.28)
for the soft weight-sharing regularization term in Equation (5.2.24).
Use the appropriate partial derivatives to solve analytically
for the optimal mixture parameters
and
, assuming fixed values for the "responsibilities"
$r_j(w_i)$.
5.2.10 Give a qualitative
explanation for the effect of adapting the Gaussian mixture parameters
$\mu_j$, $\sigma_j$, and $\pi_j$ on learning
in a feedforward neural net.
5.2.11 Consider the criterion
function with entropy regularization (Kamimura, 1993):

where
is a normalized output
of hidden unit j, and > 0. Assume the same
network architecture as in Figure 5.1.1 with logistic sigmoid
activations for all units and derive backprop based on this criterion/error
function. What are the effects of the entropy regularization
term on the hidden layer activity pattern of the trained net?
* 5.2.12 The optimum steepest
descent method employs a learning step
defined as the smallest positive root of the equation

Show that the optimal learning step is approximately
given by (Tsypkin, 1971)

5.2.13 Repeat the
simulation in Figure 5.2.3, but with a feedforward net having 40
hidden units. During training, use the noise-free training
samples as indicated by the small circles in Figure 5.2.3; these
samples have the following x values {-5, -4, -3, -2, -1,
, 0,
, 1, 2, 3, 4, 6, 8, 10}.
By comparing the number of degrees of freedom of this net to
the size of the training set, what would your intuitive conclusions
be about the net's approximation behavior? Does the result of
your simulation agree with your intuitive conclusions? Explain.
How would these results be affected if a noisy data set were used?
5.2.14 Repeat the
simulations in Figure 5.2.5 using incremental backprop with cross-validation-based
stopping of training. Assume the net to be identical to the one
discussed in Section 5.2.6 in conjunction with Figure 5.2.5.
Also, use the same weight initialization and learning parameters.
Plot the validation and training RMS errors on a log-log scale
for the first 10,000 cycles, and compare them to Figure 5.2.6.
Discuss the differences. Test the resulting "optimally trained"
net on 200 points x, generated uniformly in [-8, 12].
Plot the output of this net versus x and compare it to
the actual function
being approximated.
Also, compare the output of this net to the one in Figure 5.2.5
(dashed line), and give the reason(s) for the difference (if any)
in performance of the two nets. The following training and validation
sets are to be used in this problem. The training set is the
one plotted in Figure 5.2.5. The validation set has the same
noise statistics as the training set.
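The bookkeeping for cross-validation-based stopping can be sketched as below. The function `train_with_early_stopping` and its two callable arguments are hypothetical: `train_one_cycle()` is assumed to run one cycle of incremental backprop and return the training RMS error, and `validation_error()` is assumed to return the current RMS error on the validation set. The network itself is left out, so this is a sketch of the stopping logic only.

```python
def train_with_early_stopping(train_one_cycle, validation_error, max_cycles):
    """Train for max_cycles and record the cycle with the lowest validation
    error. Returns (best_cycle, best_val_error, history), where history is a
    list of (cycle, training_error, validation_error) tuples."""
    history = []
    best_cycle, best_val = 0, float("inf")
    for cycle in range(1, max_cycles + 1):
        train_err = train_one_cycle()     # one pass of incremental backprop
        val_err = validation_error()      # error on held-out data
        history.append((cycle, train_err, val_err))
        if val_err < best_val:
            # A full implementation would also save a copy of the weights
            # here, so that the "optimally trained" net can be restored.
            best_cycle, best_val = cycle, val_err
    return best_cycle, best_val, history
```

The `history` list holds what is needed to plot both error curves against the training cycle, and `best_cycle` marks the point at which the validation error is smallest.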
5.2.15 Repeat Problem
5.2.14 using, as your training set, all the available data (i.e.,
both training and validation data). Here, cross-validation cannot
be used to stop training, since we have no independent (non-training)
data to validate with. One way to help avoid overtraining in
this case would be to stop at the training cycle that led to
the optimal net in Problem 5.2.14. Does the resulting net generalize
better than the one in Problem 5.2.14? Explain.
5.2.16 Consider the simple
neural net in Figure P5.2.16. Assume the hidden unit has an activation
function
and that the output unit has
a linear activation with unit slope. Show that there exists a
set of real-valued weights {$w_1$, $w_2$, $v_1$, $v_2$}
which approximates the discontinuous function
,
for all x, a, b, and $c \in \mathbb{R}$, to
any degree of accuracy.

5.4.1 Consider
the time series generated by the Glass-Mackey (Mackey and Glass,
1977) discrete-time equation

Plot the time series x(t) for $t \in [0, 1000]$
and $\tau$ = 17. When solving the above nonlinear difference
delay equation, an initial condition specified by an initial function
defined over a strip of width $\tau$ is required. Experiment with several
different initial functions (e.g.
,
).
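Generating such a delayed series numerically mainly requires keeping a history buffer that covers the initial strip. The sketch below assumes a commonly used discrete-time form of this equation (often called Mackey-Glass), x(t+1) = (1 - b) x(t) + a x(t - tau) / (1 + x(t - tau)^10) with a = 0.2 and b = 0.1. This form, the parameter values, the constant default initial function, and the name `glass_mackey_series` are all assumptions; replace them with the equation and initial functions specified in the problem.

```python
def glass_mackey_series(n_steps, tau=17, initial=None, a=0.2, b=0.1, power=10):
    """Iterate the assumed map
        x(t+1) = (1 - b) * x(t) + a * x(t - tau) / (1 + x(t - tau)**power)
    and return [x(0), ..., x(n_steps)].

    `initial` lists the initial function values x(-tau), ..., x(0)."""
    if initial is None:
        initial = [0.9] * (tau + 1)       # constant initial function (assumed)
    if len(initial) != tau + 1:
        raise ValueError("initial must contain tau + 1 values")
    xs = list(initial)                    # history buffer: x(-tau), ..., x(0)
    for _ in range(n_steps):
        delayed = xs[-(tau + 1)]          # x(t - tau), with xs[-1] = x(t)
        xs.append((1.0 - b) * xs[-1] + a * delayed / (1.0 + delayed ** power))
    return xs[tau:]                       # drop the strip before t = 0
```

Passing different lists as `initial` is one way to experiment with several initial functions, and `glass_mackey_series(1000)` returns the 1001 values needed for a plot over t = 0, ..., 1000.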
5.4.2 Use incremental
backprop with sufficiently small learning rates to train the network
in Figure 5.4.1 to predict
in the Glass-Mackey
time series of Problem 5.4.1 (assume $\tau$ = 17). Use a
collection of 500 training pairs corresponding to different values
of t generated randomly from the time series for
.
Assume training pairs of the form

Also, assume 50 hidden units with hyperbolic tangent
activation function (set = 1) and use a linear activation
function for the output unit. Plot the training RMS error versus
the number of training cycles. Plot the signal
predicted (recursively) by the trained network and compare it
to
for t = 0, 6, 12, 18, ..., 1200.
Repeat with a two hidden layer net having 30 units in its first
hidden layer and 15 units in its second hidden layer (use the
learning equation derived in Problem 5.1.2 to train the weights
of the first hidden layer). [For an interesting collection of
time series and their prediction, the reader is referred to the
edited volume by Weigend and Gershenfeld (1994)].
5.4.3 Employ the
series-parallel identification scheme of Section 5.4.1 (refer
to Figure 5.4.3) to identify the nonlinear discrete-time plant
(Narendra and Parthasarathy, 1990)

Use a feedforward neural network having 20 hyperbolic
tangent activation (set = 1) units in its hidden layer,
feeding into a linear output unit. Use incremental backprop,
with sufficiently small learning rates, to train the network.
Assume the outputs of the delay lines (inputs to neural network
in Figure 5.4.3) to be x(t) and u(t).
Also, assume uniform random inputs in the interval [-2,
+2] during training. Plot the output of the plant as well as
the recursively generated output of the identification model for
the input

5.4.4 Derive Equations
(5.4.10) and (5.4.13).
5.4.5 Show that if the
state y* is a locally
asymptotically stable equilibrium of the dynamics in Equation
(5.4.6), then the state z*
satisfying Equation (5.4.17) is a locally asymptotically stable
equilibrium of the dynamics in Equation (5.4.18). (Hint: Start
by showing that linearizing the dynamical equations about their
respective equilibria gives

and

where
and
are small perturbations added to
and
, respectively.)
* 5.4.6 Derive
Equations (5.4.22) and (5.4.24). (See Pearlmutter (1988) for
help).
5.4.7 Employ time-dependent
recurrent backpropagation learning to generate the trajectories
shown in Figures 5.4.11 (a) and 5.4.12 (a).
5.4.8 Show that the RTRL
method applied to a fully recurrent network of N units
has $O(N^4)$
computational complexity for each learning iteration.