5.5 Summary

We started this chapter by deriving backprop, a gradient-descent-based learning procedure for minimizing the sum of squared error (SSE) criterion function in a feedforward layered network of sigmoidal units. This result is a natural generalization of the delta learning rule given in Chapter 3 for single-layer networks. We also presented a global-descent-based error backpropagation procedure that employs automatic tunneling through the error function to escape local minima and converge toward a global minimum.

Several variations of backprop were then introduced to improve convergence speed, avoid "poor" local minima, and enhance generalization. These variations include weight initialization methods, autonomous adjustment of learning parameters, and the addition of regularization terms to the error function being minimized. A theoretical basis for several of these variations was presented.

A number of significant real-world applications were presented in which backprop is used to train feedforward networks to realize complex mappings between noisy sensory data and the corresponding desired classifications or actions. These applications include converting human hand movements to speech, handwritten digit recognition, autonomous vehicle control, medical diagnosis, and data compression.

Finally, extensions of the idea of backward error propagation learning to recurrent neural networks were given which allow for temporal association of time sequences. Time-delay neural networks, which may be viewed as nonlinear FIR or IIR filters, were shown to be capable of sequence recognition and association using standard backprop training. Backpropagation through time was introduced as a training method for fully recurrent networks. It employs a trick that allows backprop with weight sharing to be used to train an unfolded, feedforward, nonrecurrent version of the original network. Direct training of fully recurrent networks is also possible. A recurrent backpropagation method for training fully recurrent nets on static (spatial) associations was presented. This method was then extended to temporal association of continuous-time sequences (time-dependent recurrent backpropagation). Finally, a method for on-line temporal association of discrete-time sequences (real-time recurrent learning) was discussed.
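
To make the unfolding trick concrete, the following is a minimal sketch of backpropagation through time for a fully recurrent net of N tanh units with an SSE target at every time step (Python/NumPy). The setup and names are illustrative only, not the book's notation:

    import numpy as np

    def bptt_gradient(W, x_seq, d_seq):
        """Unfold a recurrent net y(t+1) = tanh(W y(t) + x(t)) over the input
        sequence, then backpropagate through the unfolded feedforward copies,
        summing the gradient contributions of the shared weight matrix W."""
        T, N = len(x_seq), W.shape[0]
        ys = [np.zeros(N)]                      # initial state y(0) = 0
        for t in range(T):                      # forward pass (the unfolding)
            ys.append(np.tanh(W @ ys[t] + x_seq[t]))
        grad, delta = np.zeros_like(W), np.zeros(N)
        for t in reversed(range(T)):            # backward pass through time
            delta = (delta + (ys[t + 1] - d_seq[t])) * (1.0 - ys[t + 1] ** 2)
            grad += np.outer(delta, ys[t])      # same W in every copy: sum grads
            delta = W.T @ delta                 # propagate to the earlier copy
        return grad                             # use as: W -= eta * grad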

Problems

5.1.1 Derive Equations (5.1.11) and (5.1.12).

5.1.2 Derive the backprop learning rule for the first hidden layer (the layer directly connected to the input signal x) in a three-layer (two-hidden-layer) feedforward network. Assume that the first hidden layer has K units with weights wki and differentiable activations fh1(netk), the second hidden layer has J units with weights wjk and differentiable activations fh2(netj), and the output layer has L units with weights wlj and differentiable activations fo(netl).

5.1.3 Consider the neural network in Figure 5.1.1 with full additional connections between the input vector x and the output-layer units. Let the weights of these additional connections be designated wli (the connection weight between the lth output unit and the ith input signal). Derive a learning rule for these additional weights based on gradient descent minimization of the instantaneous SSE criterion function.

5.1.4 Derive the batch backprop rule for the network in Figure 5.1.1.

5.1.5 Use the incremental backprop procedure described in Section 5.1.1 to train a two-layer network with 12 hidden units and a single output unit to learn to distinguish between the class regions in Figure P5.1.5. Follow a similar training strategy to the one employed in Example 5.1.1. Generate and plot the separating surfaces learned by the various units in the network. Can you identify the function realized by the output unit?


Figure P5.1.5. Two-class classification problem.
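
For readers who want to experiment, a minimal sketch of the incremental backprop loop for this 12-hidden-unit, single-output network is given below (Python/NumPy). The arrays X and D are placeholders that must be filled with samples drawn from the class regions of Figure P5.1.5; the learning rate and initialization range are illustrative:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (200, 2))               # placeholder training inputs
    D = (X[:, 0] * X[:, 1] > 0).astype(float)      # placeholder class labels

    J, eta = 12, 0.1                               # hidden units, learning rate
    W1 = rng.uniform(-0.5, 0.5, (J, 3))            # hidden weights (+ bias column)
    W2 = rng.uniform(-0.5, 0.5, J + 1)             # output weights (+ bias entry)

    for cycle in range(2000):
        for x, d in zip(X, D):                     # incremental (per-sample) update
            xa = np.append(x, 1.0)                 # augment input with bias signal
            z = sigmoid(W1 @ xa)                   # hidden-layer activations
            za = np.append(z, 1.0)
            y = sigmoid(W2 @ za)                   # network output
            do = (d - y) * y * (1.0 - y)           # output delta (logistic f')
            dh = do * W2[:J] * z * (1.0 - z)       # backpropagated hidden deltas
            W2 += eta * do * za
            W1 += eta * np.outer(dh, xa)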

*† 5.1.6 Derive and implement numerically the global descent backprop learning algorithm for a single-hidden-layer feedforward network, starting from Equation (5.1.15). Generate a learning curve (as in Figure 5.1.5) for the 4-bit parity problem using incremental backprop, batch backprop, and global descent backprop. Assume a fully interconnected feedforward net with four unipolar sigmoid hidden units, and use the same initial weights and learning rates for all learning algorithms (for global descent, use k = 0.001, set the remaining parameter in Equation (5.1.15) to 2, and experiment with different directions of the weight perturbation vector).

5.1.7 Consider the two-layer feedforward net in Figure 5.1.1. Assume that we replace the hidden-layer weights wji by nonlinear weights of the form wji xi^rji, where rji ∈ R is a parameter associated with hidden unit j and xi is the ith component of the input vector x (hidden unit j thus receives the net input Σi wji xi^(1+rji), which reduces to the usual Σi wji xi when all rji = 0). It has been shown empirically (Narayan, 1993) that this network is capable of faster and more accurate training when both the weights wji and the exponents rji are adapted, as compared to the same network with fixed rji = 0. Derive a learning rule for rji based on incremental gradient descent minimization of the instantaneous SSE criterion of Equation (5.1.1). Are there any restrictions on the values of the inputs xi? What would be a reasonable initial value for the rji exponents?


5.1.8 Consider the Feigenbaum (1978) chaotic time series generated by the nonlinear iterated (discrete-time) logistic map

    x(t + 1) = 4x(t)[1 - x(t)]

Plot the time series x(t) for t ∈ [0, 20], starting from x(0) = 0.2. Construct (by inspection) an optimal net of the type considered in Problem 5.1.7 which will perfectly model this iterated map (assume zero biases and linear activation functions with unity slope for all units in the network). Now vary all exponents and weights by +1 percent and -2 percent, respectively. Compare the time series predicted by this varied network to x(t) over the range t ∈ [0, 20], again assuming x(0) = 0.2. Note that the output of the net at time t + 1 must serve as the new input to the net for predicting the time series at t + 2, and so on.
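
A short script along the following lines (Python/NumPy, assuming the logistic-map form reconstructed above) generates the exact and perturbed trajectories for comparison; the net is the two-term power-weight net of Problem 5.1.7 with weights (4, -4) and exponents (1, 2):

    import numpy as np

    def net(x, w=(4.0, -4.0), p=(1.0, 2.0)):
        """Power-weight net of the Problem 5.1.7 type: w1*x^p1 + w2*x^p2."""
        return w[0] * x ** p[0] + w[1] * x ** p[1]

    def trajectory(x0, steps, **kw):
        xs = [x0]
        for _ in range(steps):                 # output at t+1 feeds back as input
            xs.append(net(xs[-1], **kw))
        return np.array(xs)

    exact = trajectory(0.2, 20)                # the reference series x(t)
    varied = trajectory(0.2, 20,               # exponents +1%, weights -2%
                        w=(4.0 * 0.98, -4.0 * 0.98),
                        p=(1.0 * 1.01, 2.0 * 1.01))
    print(np.column_stack([exact, varied]))    # chaos amplifies the mismatch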

5.2.1 Consider a unit with n weights wi, each uniformly and randomly distributed in the range [-3/√n, +3/√n]. Assume that the components xi of the input vector x are randomly and uniformly distributed in the interval [0, 1]. Show that the random variable net = Σi wi xi has a zero mean and unity standard deviation.
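
A quick Monte Carlo check of this claim (Python/NumPy, with the weight range as reconstructed above) can be written as:

    import numpy as np

    n, trials = 20, 100_000
    rng = np.random.default_rng(1)
    a = 3.0 / np.sqrt(n)                       # half-width of the weight range
    w = rng.uniform(-a, a, (trials, n))        # w_i ~ U[-3/sqrt(n), 3/sqrt(n)]
    x = rng.uniform(0.0, 1.0, (trials, n))     # x_i ~ U[0, 1]
    net = (w * x).sum(axis=1)                  # net = sum_i w_i x_i
    print(net.mean(), net.std())               # approximately 0.0 and 1.0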

5.2.2 Explain qualitatively the characteristics of the approximate Newton's rule of Equation (5.2.6).

5.2.3 Complete the missing steps in the derivation of Equation (5.2.9).

5.2.4 Derive the activation function slope update rules of Equations (5.2.10) and (5.2.11).

5.2.5 Derive the incremental backprop learning rule starting from the entropy criterion function in Equation (5.2.16).

5.2.6 Derive the incremental backprop learning rule starting from the Minkowski-r criterion function in Equation (5.2.19).

5.2.7 Comment on the qualitative characteristics of the Minkowski-r criterion function for negative r.

5.2.8 Derive Equation (5.2.22).

5.2.9 Derive the partial derivatives of R in Equations (5.2.25) through (5.2.28) for the soft weight-sharing regularization term in Equation (5.2.24). Use the appropriate partial derivatives to solve analytically for the optimal mixture parameters μj and σj, assuming fixed values for the "responsibilities" rj(wi).

5.2.10 Give a qualitative explanation of the effect of adapting the Gaussian mixture parameters πj, μj, and σj on learning in a feedforward neural net.

5.2.11 Consider the criterion function with entropy regularization (Kamimura, 1993):

    E = (1/2) Σl (dl - yl)^2 - μ Σj pj ln pj

where pj = zj / Σk zk is a normalized output of hidden unit j, and μ > 0. Assume the same network architecture as in Figure 5.1.1, with logistic sigmoid activations for all units, and derive backprop based on this criterion/error function. What are the effects of the entropy regularization term on the hidden-layer activity pattern of the trained net?

* 5.2.12 The optimum steepest descent method employs a learning step ρ defined as the smallest positive root of the equation

    ∇E(w - ρ∇E(w))ᵀ ∇E(w) = 0

Show that the optimal learning step is approximately given by (Tsypkin, 1971)

    ρ* ≈ ‖∇E(w)‖^2 / [∇E(w)ᵀ H(w) ∇E(w)]

where H(w) is the Hessian matrix of E evaluated at w.
5.2.13 Repeat the exact same simulation in Figure 5.2.3, but with a 40-hidden-unit feedforward net. During training, use the noise-free training samples indicated by the small circles in Figure 5.2.3; these samples have the following x values: {-5, -4, -3, -2, -1, -0.5, 0, 0.3333, 1, 2, 3, 4, 6, 8, 10}. By comparing the number of degrees of freedom of this net to the size of the training set, what would your intuitive conclusions be about the net's approximation behavior? Does the result of your simulation agree with your intuitive conclusions? Explain. How would these results be impacted if a noisy data set were used?

5.2.14 Repeat the simulations in Figure 5.2.5 using incremental backprop with cross-validation-based stopping of training. Assume the net to be identical to the one discussed in Section 5.2.6 in conjunction with Figure 5.2.5. Also, use the same weight initialization and learning parameters. Plot the validation and training RMS errors on a log-log scale for the first 10,000 cycles, and compare them to Figure 5.2.6. Discuss the differences. Test the resulting "optimally trained" net on 200 points x generated uniformly in [-8, 12]. Plot the output of this net versus x and compare it to the actual function being approximated. Also, compare the output of this net to the one in Figure 5.2.5 (dashed line), and give the reason(s) for the difference (if any) in performance of the two nets. The following training and validation sets, tabulated below, are to be used in this problem; a sketch of the cross-validation stopping loop follows the table. The training set is the one plotted in Figure 5.2.5. The validation set has the same noise statistics as the training set.
    Training Set                  Validation Set
    Input        Output          Input        Output
    -5.0000      2.6017          -6.0000      2.1932
    -4.0000      3.2434          -5.5000      2.5411
    -3.0000      2.1778          -4.5000      1.4374
    -2.0000      2.1290          -2.5000      2.8382
    -1.0000      1.5725          -1.5000      1.7027
    -0.5000     -0.4124          -0.7500      0.3688
     0.0000     -2.2652          -0.2500     -1.1351
     0.3333     -2.6880           0.4000     -2.3758
     1.0000     -0.3856           0.8000     -2.5782
     2.0000     -0.6755           1.5000      0.2102
     3.0000      1.1409           2.5000     -0.3497
     4.0000      0.8026           3.5000      1.5792
     6.0000      0.9805           5.0000      1.1380
     8.0000      1.4563           7.0000      1.9612
    10.0000      1.2267           9.0000      0.9381
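
One possible skeleton for the cross-validation stopping loop is sketched below (Python/NumPy). The hooks train_cycle, rms_error, and snapshot are hypothetical caller-supplied functions wrapping the backprop net of Section 5.2.6:

    import numpy as np

    def early_stopping(train_cycle, rms_error, snapshot, cycles=10_000):
        """Run training cycles while tracking training/validation RMS errors;
        remember the weights at the lowest validation error seen so far.
        train_cycle(): one incremental-backprop pass over the training set
        rms_error(split): RMS error on "train" or "val"
        snapshot(): a copy of the current weights"""
        history, best = [], (np.inf, None, 0)
        for c in range(1, cycles + 1):
            train_cycle()
            e_tr, e_va = rms_error("train"), rms_error("val")
            history.append((c, e_tr, e_va))
            if e_va < best[0]:                  # new validation minimum
                best = (e_va, snapshot(), c)
        return np.array(history), best          # plot history on log-log axes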

5.2.15 Repeat Problem 5.2.14 using, as your training set, all the available data (i.e., both the training and validation data). Here, cross-validation cannot be used to stop training, since no independent (non-training) data remain for validation. One way to help avoid overtraining in this case would be to stop at the training cycle that led to the optimal net in Problem 5.2.14. Does the resulting net generalize better than the one in Problem 5.2.14? Explain.

5.2.16 Consider the simple neural net in Figure P5.2.16. Assume that the hidden unit has a sigmoid activation function and that the output unit has a linear activation with unit slope. Show that there exists a set of real-valued weights {w1, w2, v1, v2} for which the net approximates the given discontinuous function, for all x, a, b, and c ∈ R, to any degree of accuracy.


Figure P5.2.16. A neural network for approximating a discontinuous function.

5.4.1 Consider the time series generated by the Glass-Mackey (Mackey and Glass, 1977) discrete-time equation

    x(t + 1) = (1 - b) x(t) + a x(t - τ) / [1 + x^10(t - τ)],    with a = 0.2 and b = 0.1

Plot the time series x(t) for t ∈ [0, 1000] and τ = 17. When solving the above nonlinear difference-delay equation, an initial condition specified by an initial function defined over a strip of width τ is required. Experiment with several different initial functions.
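
A generator for the series, assuming the discrete-time form reconstructed above, might look as follows (Python/NumPy); the constant initial function used here is just one choice to experiment with:

    import numpy as np

    def glass_mackey(steps=1000, tau=17, a=0.2, b=0.1, init=lambda t: 0.8):
        """Iterate x(t+1) = (1-b)x(t) + a*x(t-tau)/(1 + x(t-tau)**10).
        'init' supplies the required initial function on the strip t in [-tau, 0]."""
        x = [init(t) for t in range(-tau, 1)]   # initial strip x(-tau), ..., x(0)
        for _ in range(steps):
            xd = x[-tau - 1]                    # the delayed value x(t - tau)
            x.append((1.0 - b) * x[-1] + a * xd / (1.0 + xd ** 10))
        return np.array(x[tau:])                # x(0), x(1), ..., x(steps)

    series = glass_mackey()                     # plot against t in [0, 1000]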

5.4.2 Use incremental backprop with sufficiently small learning rates to train the network in Figure 5.4.1 to predict x(t + 6) in the Glass-Mackey time series of Problem 5.4.1 (assume τ = 17). Use a collection of 500 training pairs corresponding to different values of t generated randomly from the time series. Assume training pairs of the form

    { [x(t), x(t - 6), x(t - 12), x(t - 18)], x(t + 6) }

Also, assume 50 hidden units with hyperbolic tangent activation functions (with unity slope) and use a linear activation function for the output unit. Plot the training RMS error versus the number of training cycles. Plot the signal predicted (recursively) by the trained network and compare it to x(t) for t = 0, 6, 12, 18, ..., 1200. Repeat with a two-hidden-layer net having 30 units in its first hidden layer and 15 units in its second hidden layer (use the learning equation derived in Problem 5.1.2 to train the weights of the first hidden layer). [For an interesting collection of time series and their prediction, the reader is referred to the edited volume by Weigend and Gershenfeld (1994).]
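
The training-pair construction and the recursive prediction loop might be sketched as follows (Python/NumPy); 'series' is the array from the Problem 5.4.1 sketch, and net_forward is a stand-in for the trained network's forward pass:

    import numpy as np

    def make_pairs(series, lags=(0, 6, 12, 18), horizon=6, n_pairs=500, seed=0):
        """Draw training pairs {[x(t), x(t-6), x(t-12), x(t-18)], x(t+6)}."""
        rng = np.random.default_rng(seed)
        ts = rng.choice(np.arange(max(lags), len(series) - horizon),
                        size=n_pairs, replace=False)
        X = np.stack([[series[t - l] for l in lags] for t in ts])
        d = series[ts + horizon]
        return X, d

    def predict_recursively(net_forward, history, n_steps):
        """Predict the subsampled series x(0), x(6), x(12), ... recursively:
        each prediction is fed back as an input for later predictions."""
        buf = list(history)                     # last 4 known subsampled values
        for _ in range(n_steps):
            u = np.array(buf[-4:][::-1])        # [x(t), x(t-6), x(t-12), x(t-18)]
            buf.append(net_forward(u))
        return np.array(buf[len(history):])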

5.4.3 Employ the series-parallel identification scheme of Section 5.4.1 (refer to Figure 5.4.3) to identify the nonlinear discrete-time plant (Narendra and Parthasarathy, 1990)

    x(t + 1) = x(t) / [1 + x^2(t)] + u^3(t)

Use a feedforward neural network having 20 hyperbolic tangent activation units (with unity slope) in its hidden layer, feeding into a linear output unit. Use incremental backprop, with sufficiently small learning rates, to train the network. Assume the outputs of the delay lines (the inputs to the neural network in Figure 5.4.3) to be x(t) and u(t). Also, assume uniform random inputs in the interval [-2, +2] during training. Plot the output of the plant as well as the recursively generated output of the identification model for the input

    u(t) = sin(2πt/25) + sin(2πt/10)
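
The data-generation side of the series-parallel scheme, using the plant form reconstructed above, can be sketched as follows (Python/NumPy; sequence length and seed are arbitrary):

    import numpy as np

    def plant(x, u):
        """Reconstructed plant: x(t+1) = x(t)/(1 + x(t)^2) + u(t)^3."""
        return x / (1.0 + x * x) + u ** 3

    rng = np.random.default_rng(0)
    u = rng.uniform(-2.0, 2.0, 5000)            # random training input sequence
    x = np.zeros(len(u) + 1)
    for t in range(len(u)):
        x[t + 1] = plant(x[t], u[t])            # plant state driven by u
    # series-parallel: the *plant's* past output (not the model's) feeds the
    # identifier during training
    inputs = np.stack([x[:-1], u], axis=1)      # network input [x(t), u(t)]
    targets = x[1:]                             # desired output x(t+1)
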
5.4.4 Derive Equations (5.4.10) and (5.4.13).

5.4.5 Show that if the state y* is a locally asymptotically stable equilibrium of the dynamics in Equation (5.4.6), then the state z* satisfying Equation (5.4.17) is a locally asymptotically stable equilibrium of the dynamics in Equation (5.4.18). (Hint: Start by showing that linearizing the dynamical equations about their respective equilibria gives

    d(Δy)/dt = L Δy

and

    d(Δz)/dt = Lᵀ Δz

where Δy and Δz are small perturbations added to y* and z*, respectively, and L is the Jacobian matrix of the dynamics in Equation (5.4.6) evaluated at y*. The two linear systems share the same eigenvalues, since a matrix and its transpose have identical spectra.)

* 5.4.6 Derive Equations (5.4.22) and (5.4.24). (See Pearlmutter (1988) for help).

5.4.7 Employ time-dependent recurrent backpropagation learning to generate the trajectories shown in Figures 5.4.11(a) and 5.4.12(a).

5.4.8 Show that the RTRL method applied to a fully recurrent network of N units has O(N4) computational complexity for each learning iteration.
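
The origin of the O(N^4) term is visible in the RTRL sensitivity update, sketched below (Python/NumPy, tanh activations, names illustrative). The sensitivity array p[k, i, j] = ∂y_k/∂w_ij has N^3 entries, and updating each entry requires a sum over N units:

    import numpy as np

    def rtrl_step(W, y, p):
        """One RTRL update of the state y and sensitivities p for a fully
        recurrent net of N tanh units (external inputs omitted for brevity)."""
        N = len(y)
        net = W @ y
        y_new = np.tanh(net)
        # p_new[k,i,j] = f'(net_k) * (sum_l W[k,l] p[l,i,j] + delta(k,i) y[j])
        p_new = np.einsum('kl,lij->kij', W, p)       # the O(N^4) contraction
        p_new[np.arange(N), np.arange(N), :] += y    # the delta(k,i) * y_j term
        p_new *= (1.0 - y_new ** 2)[:, None, None]   # multiply by f'(net_k)
        return y_new, p_new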

