**5.1.1** Derive Equations
(5.1.11) and (5.1.12).

**5.1.2** Derive the backprop
learning rule for the first hidden layer (the layer directly connected
to the input signal **x**) in a three-layer (two-hidden-layer)
feedforward network. Assume that the first hidden layer has *K*
units with weights $w_{ki}$
and differentiable activations $f_{h_1}(net_k)$,
the second hidden layer has *J* units with weights $w_{jk}$
and differentiable activations $f_{h_2}(net_j)$,
and the output layer has *L* units with weights $w_{lj}$
and differentiable activations $f_o(net_l)$.
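The chain of deltas asked for here can be checked numerically. The sketch below (illustrative layer sizes; biases omitted; logistic sigmoids assumed for all three activations) runs one forward/backward pass through a net with two hidden layers and compares the resulting first-hidden-layer gradient against a finite-difference estimate:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
# Illustrative sizes: I inputs, K and J hidden units, L outputs.
I, K, J, L = 3, 4, 5, 2
W1 = rng.normal(0, 0.5, (K, I))   # first hidden layer weights w_ki
W2 = rng.normal(0, 0.5, (J, K))   # second hidden layer weights w_jk
W3 = rng.normal(0, 0.5, (L, J))   # output layer weights w_lj
x = rng.normal(0, 1.0, I)
d = rng.normal(0, 1.0, L)

def forward(W1_, W2_, W3_, x_):
    z1 = sigmoid(W1_ @ x_)        # f_h1(net_k)
    z2 = sigmoid(W2_ @ z1)        # f_h2(net_j)
    y = sigmoid(W3_ @ z2)         # f_o(net_l)
    return z1, z2, y

z1, z2, y = forward(W1, W2, W3, x)
# Backward pass: chain rule on the instantaneous SSE E = (1/2) sum (d - y)^2.
delta_o = (d - y) * y * (1 - y)               # delta_l at the output layer
delta_h2 = (W3.T @ delta_o) * z2 * (1 - z2)   # delta_j, second hidden layer
delta_h1 = (W2.T @ delta_h2) * z1 * (1 - z1)  # delta_k, first hidden layer
grad_W1 = np.outer(delta_h1, x)               # equals -dE/dw_ki

# Check against a numerical derivative of E w.r.t. one first-layer weight.
def sse(W1_):
    _, _, y_ = forward(W1_, W2, W3, x)
    return 0.5 * np.sum((d - y_) ** 2)

eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (sse(W1p) - sse(W1)) / eps
print(abs(-grad_W1[0, 0] - num) < 1e-5)
```

The update for the first hidden layer is then $\Delta w_{ki} = \eta\,\delta_k x_i$, with each delta propagating the layer above's deltas through its weights.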

**5.1.3** Consider the neural
network in Figure 5.1.1 with full additional connections between
the input vector **x** and the output layer units. Let the
weights of these additional connections be designated $w_{li}$
(the connection weight between the *l*th
output unit and the *i*th input signal). Derive a learning
rule for these additional weights based on gradient descent minimization
of the instantaneous SSE criterion function.

**5.1.4** Derive the batch
backprop rule for the network in Figure 5.1.1.

† **5.1.5** Use the incremental
backprop procedure described in Section 5.1.1 to train a two-layer
network with 12 hidden units and a single output unit to learn
to distinguish between the class regions in Figure P5.1.5. Follow
a similar training strategy to the one employed in Example 5.1.1.
Generate and plot the separating surfaces learned by the various
units in the network. Can you identify the function realized
by the output unit?

*† **5.1.6** Derive and
implement numerically the global descent backprop learning algorithm
for a single-hidden-layer feedforward network, starting from Equation
(5.1.15). Generate a learning curve (as in Figure 5.1.5) for
the 4-bit parity problem using incremental backprop, batch backprop,
and global descent backprop. Assume a fully interconnected feedforward
net with four hidden units having unipolar sigmoid activations,
and use the same initial weights and learning rates for
all learning algorithms (use a value of 2 for the global descent
parameter and $k = 0.001$, and experiment with different directions of
the perturbation vector **w**).
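Before implementing global descent, it helps to have the two baseline variants side by side. A minimal sketch of the incremental and batch update schedules on the 4-bit parity data (illustrative initialization and learning rate; biases omitted for brevity), where the only difference is *when* the accumulated gradient is applied:

```python
import numpy as np

rng = np.random.default_rng(1)
# 4-bit parity patterns and targets (the training set named in the problem).
X = np.array([[(n >> i) & 1 for i in range(4)] for n in range(16)], float)
D = X.sum(axis=1) % 2

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One-hidden-layer net with four unipolar sigmoid hidden units.
W1 = rng.normal(0, 0.5, (4, 4))
w2 = rng.normal(0, 0.5, 4)

def grads(W1_, w2_, x, d):
    z = sigmoid(W1_ @ x)
    y = sigmoid(w2_ @ z)
    do = (d - y) * y * (1 - y)
    dh = (w2_ * do) * z * (1 - z)
    return np.outer(dh, x), do * z          # -dE/dW1, -dE/dw2

eta = 0.5
# Incremental backprop: apply each pattern's gradient immediately.
W1i, w2i = W1.copy(), w2.copy()
for x, d in zip(X, D):
    g1, g2 = grads(W1i, w2i, x, d)
    W1i += eta * g1
    w2i += eta * g2
# Batch backprop: accumulate over the whole epoch, then apply once.
W1b, w2b = W1.copy(), w2.copy()
G1, G2 = np.zeros_like(W1), np.zeros_like(w2)
for x, d in zip(X, D):
    g1, g2 = grads(W1b, w2b, x, d)
    G1 += g1
    G2 += g2
W1b += eta * G1
w2b += eta * G2
print(np.allclose(W1i, W1b))   # generally False: the two schedules differ
```

Global descent would modify this loop further by adding the perturbation-driven term from Equation (5.1.15); the comparison of the three learning curves is the point of the exercise.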

**5.1.7** Consider the two-layer
feedforward net in Figure 5.1.1. Assume that we replace
the hidden layer weights $w_{ji}$
by nonlinear weights governed by an exponent
$r_{ji} \in \mathbb{R}$,
a parameter associated with hidden unit *j*, where $x_i$
is the *i*th component of the input vector **x**. It
has been shown empirically (Narayan, 1993) that this network is
capable of faster and more accurate training when the weights
and the $r_{ji}$
exponents are adapted, as compared to the same network with fixed
$r_{ji} = 0$.
Derive a learning rule for $r_{ji}$
based on incremental gradient descent minimization of the instantaneous
SSE criterion of Equation (5.1.1). Are there any restrictions
on the values of the inputs $x_i$?
What would be a reasonable initial value for the $r_{ji}$
exponents?
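A quick way to see the restriction on the inputs is to differentiate a candidate nonlinear weight term with respect to its exponent. The exact form used in the text did not survive extraction; the sketch below assumes, purely for illustration, a term $w_{ji}\,x_i^{1+r_{ji}}$ (so that $r_{ji}=0$ recovers an ordinary linear weight):

```python
import numpy as np

# Hypothetical nonlinear weight term w * x^(1+r); w, x, r are illustrative.
w, x, r = 0.7, 2.5, 0.3
term = lambda r_: w * x ** (1.0 + r_)
# d(term)/dr = w * x^(1+r) * ln(x): the ln(x) factor appears in any
# power-form weight, so the rule is well defined only for x > 0.
analytic = w * x ** (1.0 + r) * np.log(x)
eps = 1e-7
numeric = (term(r + eps) - term(r)) / eps
print(abs(analytic - numeric) < 1e-4)
```

The $\ln x_i$ factor in the gradient is what forces the inputs to be positive (or to be shifted/rescaled into a positive range), and $r_{ji} = 0$ is the natural initialization since it reproduces the standard linear-weight network.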

† **5.1.8** Consider
the Feigenbaum (1978) chaotic time series generated by the nonlinear
iterated (discrete-time) map

Plot the time series *x*(*t*) for $t \in [0, 20]$,
starting from $x(0) = 0.2$. Construct (by inspection)
an optimal net of the type considered in Problem 5.1.7 which will
perfectly model this iterated map (assume zero biases and linear
activation functions with unity slope for all units in the network).
Now vary all exponents and weights by +1 percent and −2
percent, respectively. Compare the time series predicted by this
perturbed network to $x(t)$ over the range $t \in [0, 20]$.
Assume $x(0) = 0.2$. Note that the output of
the net at time $t + 1$ must serve as the new input
to the net for predicting the time series at $t + 2$,
and so on.
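The map's equation did not survive extraction; a common choice consistent with Feigenbaum's (1978) work is the logistic map $x(t+1) = a\,x(t)\,[1 - x(t)]$ with $a = 4$, which the sketch below iterates from $x(0) = 0.2$ (an assumption for illustration, not necessarily the book's exact map):

```python
def logistic_series(x0=0.2, steps=20, a=4.0):
    # Iterate x(t+1) = a * x(t) * (1 - x(t)); a = 4 gives chaotic behaviour
    # on [0, 1], so the trajectory never converges to a fixed cycle.
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1] * (1.0 - xs[-1]))
    return xs

xs = logistic_series()
print(xs[:3])
```

Because the right-hand side is a polynomial in $x(t)$, a net of the Problem 5.1.7 type with linear output and appropriate weights/exponents can represent it exactly, which is why small perturbations of those parameters make the chaotic prediction diverge quickly.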

**5.2.1** Consider a unit with
*n* weights $w_i$
uniformly and randomly distributed in a symmetric range about zero.
Assume that the components $x_i$
of the input vector **x** are randomly and uniformly distributed
in the interval [0, 1]. Show that the random variable
$net = \sum_{i=1}^{n} w_i x_i$
has zero mean and unity standard deviation.
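The claim is easy to check by Monte Carlo. The exact weight range did not survive extraction; the sketch below assumes $[-3/\sqrt{n},\, 3/\sqrt{n}]$, one choice that makes the variance work out, since each term then contributes $E[w_i^2]E[x_i^2] = (a^2/3)(1/3) = 1/n$ to the variance of the sum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 100_000
a = 3.0 / np.sqrt(n)                    # hypothesised weight range [-a, a]
w = rng.uniform(-a, a, (trials, n))     # weights, zero-mean uniform
x = rng.uniform(0.0, 1.0, (trials, n))  # inputs, uniform on [0, 1]
net = (w * x).sum(axis=1)               # net = sum_i w_i x_i
print(net.mean(), net.std())            # should be close to 0 and 1
```

The empirical mean and standard deviation land near 0 and 1, matching the analytic result the problem asks you to derive.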

**5.2.2** Explain qualitatively
the characteristics of the approximate Newton's rule of Equation
(5.2.6).

**5.2.3** Complete the missing steps in the derivation
of Equation (5.2.9).

**5.2.4** Derive the activation
function slope update rules of Equations (5.2.10) and (5.2.11).

**5.2.5** Derive the incremental
backprop learning rule starting from the entropy criterion function
in Equation (5.2.16).
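A useful checkpoint for this derivation: with a logistic output unit, the entropy (cross-entropy) criterion makes the output delta collapse to $d - y$, with no $y(1-y)$ slope factor. The sketch below verifies that cancellation numerically:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def xent(net, d):
    # Cross-entropy criterion for a single logistic output y = sigmoid(net).
    y = sigmoid(net)
    return -(d * np.log(y) + (1 - d) * np.log(1 - y))

net, d = 0.4, 1.0
analytic = sigmoid(net) - d            # dE/dnet = y - d: the slope cancels
eps = 1e-6
numeric = (xent(net + eps, d) - xent(net, d)) / eps
print(abs(analytic - numeric) < 1e-5)
```

This cancellation is the reason the entropy criterion avoids the flat-spot slowdown of SSE backprop when the output saturates at the wrong value.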

**5.2.6** Derive the incremental
backprop learning rule starting from the Minkowski-*r* criterion
function in Equation (5.2.19).
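The key quantity in this derivation is $\partial E_r/\partial y$. The sketch below, assuming the standard Minkowski-r form $E_r = \frac{1}{r}\sum_l |d_l - y_l|^r$ (with $r = 2$ recovering the usual SSE criterion), checks the analytic derivative $-|d-y|^{r-1}\,\mathrm{sgn}(d-y)$ against finite differences:

```python
import numpy as np

def minkowski_r(y, d, r):
    # E_r = (1/r) * sum(|d - y|^r); r = 2 is the ordinary SSE criterion.
    return (1.0 / r) * np.sum(np.abs(d - y) ** r)

y = np.array([0.3, 0.9])
d = np.array([1.0, 0.0])
r = 4.0
analytic = -np.abs(d - y) ** (r - 1) * np.sign(d - y)   # dE_r/dy
eps = 1e-6
numeric = np.array([
    (minkowski_r(y + eps * np.eye(2)[i], d, r) - minkowski_r(y, d, r)) / eps
    for i in range(2)
])
print(np.allclose(analytic, numeric, atol=1e-4))
```

The rest of the backprop derivation is unchanged; only this error-derivative factor in the output delta differs from the SSE case.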

**5.2.7** Comment on the qualitative
characteristics of the Minkowski-*r* criterion function for
negative *r*.

**5.2.8** Derive Equation
(5.2.22).

**5.2.9** Derive the partial
derivatives of *R* in Equations (5.2.25) through (5.2.28)
for the soft weight-sharing regularization term in Equation (5.2.24).
Use the appropriate partial derivatives to solve analytically
for the optimal mixture parameters, assuming fixed values for the
"responsibilities" $r_j(w_i)$.

**5.2.10** Give a qualitative
explanation for the effect of adapting the Gaussian mixture parameters
(the mixing proportions, means, and variances of the mixture components)
on learning in a feedforward neural net.

**5.2.11** Consider the criterion
function with entropy regularization (Kamimura, 1993):

where the entropy term involves a normalized output of hidden unit *j*
and a positive regularization coefficient. Assume the same
network architecture as in Figure 5.1.1 with logistic sigmoid
activations for all units, and derive backprop based on this criterion/error
function. What are the effects of the entropy regularization
term on the hidden layer activity pattern of the trained net?

* **5.2.12** The optimum steepest
descent method employs a learning step
defined as the smallest positive root of the equation

Show that the optimal learning step is approximately
given by (Tsypkin, 1971)

† **5.2.13** Repeat the
exact same simulation in Figure 5.2.3, but with a 40-hidden-unit
feedforward net. During training, use the noise-free training
samples indicated by the small circles in Figure 5.2.3; these
samples have the following *x* values: {−5, −4, −3, −2, −1, …, 0, …, 1, 2, 3, 4, 6, 8, 10}.
By comparing the number of degrees of freedom of this net to
the size of the training set, what would your intuitive conclusions
be about the net's approximation behavior? Does the result of
your simulation agree with your intuitive conclusions? Explain.
How would these results be affected if a noisy data set were used?
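For the intuition step, the parameter count is simple arithmetic. Assuming the usual one-bias-per-unit convention for a 1-input, 40-hidden-unit, 1-output net:

```python
# Degrees of freedom of a 1-input, 40-hidden-unit, 1-output feedforward net,
# counting one bias per unit (an assumed but conventional architecture).
n_in, n_hidden, n_out = 1, 40, 1
n_params = n_in * n_hidden + n_hidden + n_hidden * n_out + n_out
n_samples = 15                     # the problem lists 15 training x values
print(n_params, n_samples)
```

With roughly eight times as many free parameters as training points, the naive expectation is severe overfitting; whether the simulation actually shows it is the question the problem poses.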

† **5.2.14** Repeat the
simulations in Figure 5.2.5 using incremental backprop with cross-validation-based
stopping of training. Assume the net to be identical to the one
discussed in Section 5.2.6 in conjunction with Figure 5.2.5.
Also, use the same weight initialization and learning parameters.
Plot the validation and training RMS errors on a log-log scale
for the first 10,000 cycles, and compare them to Figure 5.2.6.
Discuss the differences. Test the resulting "optimally trained"
net on 200 points *x*, generated uniformly in [-8, 12].
Plot the output of this net versus *x* and compare it to
the actual function being approximated.
Also, compare the output of this net to the one in Figure 5.2.5
(dashed line), and give the reason(s) for the difference (if any)
in performance of the two nets. The following training and validation
sets are to be used in this problem. The training set is the
one plotted in Figure 5.2.5. The validation set has the same
noise statistics as the training set.
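The stopping logic itself is independent of the particular net. A minimal sketch of cross-validation-based stopping with a best-so-far snapshot and a patience window (the training and validation functions here are placeholders standing in for the net and data of the problem):

```python
import numpy as np

rng = np.random.default_rng(2)

def train_one_cycle(state):
    # Placeholder for one backprop training cycle over the training set.
    return state + 1

def val_rmse(cycle):
    # Placeholder validation error: falls, bottoms out near cycle 60,
    # then rises, mimicking typical over-training behaviour.
    return 1.0 + 0.0005 * (cycle - 60) ** 2 + 0.01 * rng.standard_normal()

state, best_err, best_cycle, patience, since_best = 0, np.inf, 0, 20, 0
for cycle in range(1, 10_001):
    state = train_one_cycle(state)
    err = val_rmse(cycle)
    if err < best_err:                 # snapshot the best weights so far
        best_err, best_cycle, since_best = err, cycle, 0
    else:
        since_best += 1
    if since_best >= patience:         # validation stopped improving: stop
        break
print(best_cycle)
```

The "optimally trained" net the problem refers to is the snapshot taken at `best_cycle`, not the weights at the final cycle.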

† **5.2.15** Repeat Problem
5.2.14 using, as your training set, all the available data (i.e.,
both training and validation data). Here, cross-validation cannot
be used to stop training, since we have no independent (non-training)
data to validate with. One way to help avoid overtraining in
this case would be to stop at the training cycle that led to
the optimal net in Problem 5.2.14. Does the resulting net generalize
better than the one in Problem 5.2.14? Explain.

**5.2.16** Consider the simple
neural net in Figure P5.2.16. Assume the hidden unit has a given
activation function and that the output unit has
a linear activation with unit slope. Show that there exists a
set of real-valued weights $\{w_1, w_2, v_1, v_2\}$
which approximates a given discontinuous function,
for all $x, a, b, c \in \mathbb{R}$, to
any degree of accuracy.

† **5.4.1** Consider
the time series generated by the Glass-Mackey (Mackey and Glass,
1977) discrete-time equation

Plot the time series *x*(*t*) for $t \in [0, 1000]$
and $\tau = 17$. When solving the above nonlinear difference-delay
equation, an initial condition specified by an initial function
defined over a strip of width $\tau$ is required. Experiment with
several different initial functions.
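A common discrete-time form of the Mackey-Glass equation is $x(t+1) = x(t) + a\,x(t-\tau)/[1 + x^n(t-\tau)] - b\,x(t)$ with $a = 0.2$, $b = 0.1$, $n = 10$; since the equation itself did not survive extraction, the sketch below assumes that form, with a constant initial function over the strip:

```python
import numpy as np

def mackey_glass(T=1000, tau=17, a=0.2, b=0.1, n=10, x0=1.2):
    # Discrete-time Mackey-Glass iteration (assumed form):
    #   x(t+1) = x(t) + a*x(t-tau)/(1 + x(t-tau)^n) - b*x(t)
    # The initial function on the strip [-tau, 0] is constant at x0.
    x = np.full(T + tau + 1, x0)
    for t in range(tau, T + tau):
        xd = x[t - tau]
        x[t + 1] = x[t] + a * xd / (1.0 + xd ** n) - b * x[t]
    return x[tau:]

xs = mackey_glass()
print(xs.shape, xs.min(), xs.max())
```

Changing `x0` (or replacing the constant strip with another initial function) is the experiment the problem asks for; after a transient, trajectories settle onto the same chaotic attractor.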

† **5.4.2** Use incremental
backprop with sufficiently small learning rates to train the network
in Figure 5.4.1 to predict $x(t+6)$ in the Glass-Mackey
time series of Problem 5.4.1 (assume $\tau = 17$). Use a
collection of 500 training pairs corresponding to different values
of *t* generated randomly from the time series.
Assume training pairs of the form

Also, assume 50 hidden units with hyperbolic tangent
activation functions (with unit slope), and use a linear activation
function for the output unit. Plot the training RMS error versus
the number of training cycles. Plot the signal
predicted (recursively) by the trained network and compare it
to the actual series for $t = 0, 6, 12, 18, \ldots, 1200$.
Repeat with a two-hidden-layer net having 30 units in its first
hidden layer and 15 units in its second hidden layer (use the
learning equation derived in Problem 5.1.2 to train the weights
of the first hidden layer). [For an interesting collection of
time series and their prediction, the reader is referred to the
edited volume by Weigend and Gershenfeld (1994).]
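The recursive-prediction bookkeeping (feeding each prediction back into the input window) can be sketched independently of the trained net. The model below is a stub persistence forecast standing in for the trained network, and the four-sample window spaced 6 steps apart is an assumption consistent with the $t = 0, 6, 12, \ldots$ evaluation grid:

```python
import numpy as np

def predict_recursively(model, history, steps):
    # history: latest known samples spaced 6 apart, newest first, e.g.
    # [x(t), x(t-6), x(t-12), x(t-18)]. Each model output (the x(t+6)
    # estimate) is pushed back into the window to predict the next point.
    window = list(history)
    preds = []
    for _ in range(steps):
        nxt = model(np.array(window))
        preds.append(nxt)
        window = [nxt] + window[:-1]
    return preds

# Stub model (persistence forecast) just to exercise the feedback loop;
# in the exercise this would be the trained 50-hidden-unit network.
stub = lambda v: float(v[0])
out = predict_recursively(stub, [0.9, 0.8, 0.7, 0.6], steps=5)
print(out)
```

Note that prediction errors compound through this loop, which is why recursive multi-step prediction of a chaotic series is a much harder test than one-step-ahead prediction on held-out points.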

† **5.4.3** Employ the
series-parallel identification scheme of Section 5.4.1 (refer
to Figure 5.4.3) to identify the nonlinear discrete-time plant
(Narendra and Parthasarathy, 1990)

Use a feedforward neural network having 20 hyperbolic
tangent activation units (with unit slope) in its hidden layer,
feeding into a linear output unit. Use incremental backprop,
with sufficiently small learning rates, to train the network.
Assume the outputs of the delay lines (the inputs to the neural network
in Figure 5.4.3) to be *x*(*t*) and *u*(*t*).
Also, assume uniform random inputs in the interval [−2,
+2] during training. Plot the output of the plant as well as
the recursively generated output of the identification model for
the input

**5.4.4** Derive Equations
(5.4.10) and (5.4.13).

**5.4.5** Show that if the
state $\mathbf{y}^*$ is a locally
asymptotically stable equilibrium of the dynamics in Equation
(5.4.6), then the state $\mathbf{z}^*$
satisfying Equation (5.4.17) is a locally asymptotically stable
equilibrium of the dynamics in Equation (5.4.18). (Hint: Start
by showing that linearizing the dynamical equations about their
respective equilibria gives

and

where the perturbation vectors are small perturbations added to
$\mathbf{y}^*$ and $\mathbf{z}^*$, respectively.)

* **5.4.6** Derive
Equations (5.4.22) and (5.4.24). (See Pearlmutter (1988) for
help.)

† **5.4.7** Employ time-dependent
recurrent backpropagation learning to generate the trajectories
shown in Figures 5.4.11(a) and 5.4.12(a).

**5.4.8** Show that the RTRL
method applied to a fully recurrent network of *N* units
has $O(N^4)$
computational complexity per learning iteration.
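The counting argument can be made concrete: the sensitivities $p^k_{ij} = \partial y_k/\partial w_{ij}$ form an $N^3$ array, and updating each entry requires an $O(N)$ sum over the units feeding it, giving $O(N^4)$ multiplications per iteration. A sketch that just counts those operations:

```python
def rtrl_mults(N):
    # Multiplications in one full RTRL sensitivity update for a fully
    # recurrent net of N units: N^3 sensitivities p_k,ij, each needing
    # an O(N) inner sum (sum over l of w_kl * p_l,ij).
    count = 0
    for k in range(N):          # output index of sensitivity p_k,ij
        for i in range(N):      # weight row index
            for j in range(N):  # weight column index
                count += N      # inner sum over the N units
    return count

print(rtrl_mults(4), rtrl_mults(8))
```

Doubling $N$ multiplies the count by 16, confirming the quartic growth that makes RTRL impractical for large fully recurrent networks.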