Up to this point we have been concerned with
"static"
mapping networks which are trained to produce a spatial output
pattern in response to a particular spatial input pattern. However,
in many engineering, scientific, and economic applications, the
need arises to model dynamical processes where a time sequence
is required in response to certain temporal input signal(s).
One such example is plant modeling in control applications. Here,
it is desired to capture the dynamics of an unknown plant (usually
nonlinear) with a flexibly structured network that imitates the plant by adaptively changing its parameters to track
the plant's observable output signals when driven by the same
input signals. The resulting model is referred to as a temporal
association network.
Temporal association networks must have a recurrent
(as opposed to static) architecture in order to handle the time-dependent nature of associations. Thus, it would be very useful
to extend the multilayer feedforward network and its associated
training algorithm(s) (e.g., backprop) into the temporal domain.
In general, this requires a recurrent architecture (nets with
feedback connections) and proper associated learning algorithms.
Two special cases of temporal association networks
are sequence reproduction and sequence recognition networks.
For sequence reproduction, a network must be able to generate
the rest of a sequence from a part of that sequence. This is
appropriate, for example, for predicting the price trend of a
given stock market from its past history or predicting the future
course of a time series from examples. In sequence recognition,
a network produces a spatial pattern or a fixed output in response
to a specific input sequence. This is appropriate, for example,
for speech recognition, where the output encodes the word
corresponding
to the speech signal. NETtalk and GloveTalk of Section 5.3 are
two other examples of sequence recognition networks.
In the following, we introduce neural net architectures with various degrees of recurrence, along with their associated learning methods, which are capable of processing time sequences.
5.4.1 Time-Delay Neural Networks
Consider the time-delay neural network architecture shown in Figure 5.4.1. It maps a finite time sequence x(t), x(t − 1), ..., x(t − m) into a single output y (this can also be generalized to the case where x and/or y are vectors). One may view this neural network as a discrete-time nonlinear filter (we may also borrow the terms finite-duration impulse response (FIR) filter or nonrecursive filter from the linear filtering literature).
The architecture in Figure 5.4.1 is equivalent to
a single hidden layer feedforward neural network receiving the
(m + 1)-dimensional "spatial" pattern x generated by a tapped delay line preprocessor from a
temporal sequence. Thus, if target values for the output unit
are specified for various times t, backprop may be used
to train the above network to act as a sequence recognizer.
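To make this view concrete, the short sketch below (an illustrative construction in Python, not taken from the text; the tanh hidden activation, network size, and array shapes are assumptions) builds the (m + 1)-dimensional "spatial" patterns produced by a tapped delay line and passes them through a single-hidden-layer net, after which any target values specified at various times t can be learned with standard backprop.

```python
# Illustrative sketch only: a tapped delay line turns a 1-D temporal sequence
# into (m + 1)-dimensional "spatial" patterns, and an ordinary single-hidden-
# layer feedforward net (trainable with standard backprop) operates on them.
import numpy as np

def tapped_delay_patterns(x, m):
    """Rows are [x(t), x(t-1), ..., x(t-m)] for t = m, ..., len(x)-1."""
    return np.array([x[t - m:t + 1][::-1] for t in range(m, len(x))])

def tdnn_forward(patterns, W_h, b_h, w_o, b_o):
    """Single hidden layer followed by one output unit y."""
    h = np.tanh(patterns @ W_h.T + b_h)   # hidden activations f_h(net)
    return h @ w_o + b_o                  # output y, one value per time step

rng = np.random.default_rng(0)
x = np.sin(0.2 * np.arange(50))                    # example temporal sequence
X = tapped_delay_patterns(x, m=3)                  # shape (47, 4)
W_h, b_h = rng.normal(size=(5, 4)), np.zeros(5)    # 5 hidden units
w_o, b_o = rng.normal(size=5), 0.0
y = tdnn_forward(X, W_h, b_h, w_o, b_o)
```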
Figure 5.4.1. A time-delay neural network for one-dimensional input/output signals.
The time-delay neural net has been successfully applied to the problem of speech recognition (e.g., Tank and Hopfield, 1987; Elman and Zipser, 1988; Waibel, 1989; Waibel et al., 1989; Lippmann, 1989) and time series prediction (Lapedes and Farber, 1988; Weigend et al., 1991). Here, we discuss time series prediction
since it captures the spirit of the type of processing done by
the time-delay neural net. Given observed values of the state x of a (nonlinear) dynamical system at discrete times less than t, the goal is to use these values to accurately predict x(t + p), where p is some prediction time step into the future (for simplicity, we assume a one-dimensional state x). Clearly, as p increases, the quality of
the predicted value will degrade for any predictive method. A
method is robust if it can maintain prediction accuracy for a
wide range of p values.
As is normally done in linear signal processing applications (e.g., Widrow and Stearns, 1985), one may use the tapped delay line nonlinear filter of Figure 5.4.1 as the basis for predicting x(t + p). Here, a training set is constructed of input/target pairs in which the input is the delayed window x(t), x(t − 1), ..., x(t − m) and the target is the corresponding future value x(t + p). Backprop may now be employed to learn such a training set. Reported simulation results of this prediction method show performance comparable to or better than that of other, non-neural-network-based techniques (Lapedes and Farber, 1988; Weigend et al., 1991; Weigend and Gershenfeld, 1993).
Theoretical justification for the above approach is available in the form of a very powerful theorem by Takens (1981), which states that there exists a functional relation of the form

x(t + p) = g[x(t), x(t − Δ), ..., x(t − mΔ)]    (5.4.1)

with m ≥ 2d + 1, as long as the trajectory x(t) evolves towards compact attracting manifolds of dimension d. This theorem, however, provides no information on the form of g or the value of the delay Δ. The time-delay neural network approach provides a robust approximation for g in Equation (5.4.1) in the form of the continuous, adaptive-parameter model

x̂(t + p) = Σj wj fh(Σi wji x(t − i) + θj),  i = 0, 1, ..., m    (5.4.2)

where a linear activation is assumed for the output unit, and fh is the nonlinear activation of the hidden units.
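A minimal sketch of this predictor and of the construction of the training pairs described above is given below; the window/target layout follows the discussion in the text, while the tanh hidden activation, the variable names, and the single incremental gradient step are illustrative assumptions.

```python
# Sketch of the model in Equation (5.4.2) with a linear output unit, plus the
# construction of p-step-ahead training pairs and one incremental backprop
# step. Names, activation choice, and learning rate are assumptions.
import numpy as np

def predict(window, W_h, b_h, w_o):
    """x_hat(t+p) = sum_j w_j * f_h(sum_i w_ji x(t-i) + theta_j), linear output."""
    return np.tanh(W_h @ window + b_h) @ w_o

def make_pairs(x, m, p):
    """Pairs ([x(t), ..., x(t-m)], x(t+p)) for all admissible t."""
    ts = range(m, len(x) - p)
    X = np.array([x[t - m:t + 1][::-1] for t in ts])
    d = np.array([x[t + p] for t in ts])
    return X, d

def incremental_step(window, target, W_h, b_h, w_o, eta=0.05):
    """One gradient-descent (backprop) update on a single training pair."""
    h = np.tanh(W_h @ window + b_h)
    err = target - h @ w_o                    # output error
    delta_h = err * w_o * (1.0 - h ** 2)      # backpropagated hidden error
    w_o = w_o + eta * err * h
    W_h = W_h + eta * np.outer(delta_h, window)
    b_h = b_h + eta * delta_h
    return W_h, b_h, w_o
```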
A simple modification to the time-delay net makes it suitable for sequence reproduction. The training procedure is identical to the one for the above prediction network. However, during retrieval, the output y [predicting x(t + 1)] is propagated through a single delay element, with the output of this delay element connected to the input of the time-delay net as shown in Figure 5.4.2. This sequence reproduction net will only work if the prediction is very accurate, since any error in the predicted signal has a multiplicative effect due to the iterated scheme employed.
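The retrieval loop of Figure 5.4.2 can be sketched as follows; predict_one_step stands for any trained one-step predictor of the kind just described (its name and interface are assumptions).

```python
# Sketch of closed-loop (iterated) sequence reproduction: the predicted value
# is passed through a delay element and fed back as the newest input.
def reproduce_sequence(seed_window, predict_one_step, n_steps):
    """seed_window = [x(t), x(t-1), ..., x(t-m)]; returns n_steps predictions."""
    window = list(seed_window)
    generated = []
    for _ in range(n_steps):
        x_next = predict_one_step(window)    # estimate of x(t+1)
        generated.append(x_next)
        window = [x_next] + window[:-1]      # shift: prediction becomes input
    return generated
```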
Further generalization of the above ideas results in a network for temporal association. We present such modifications in the context of nonlinear dynamical plant identification/modeling from control theory. Consider the following general nonlinear single-input, single-output plant described by the difference equation

x(t + 1) = g[x(t), x(t − 1), ..., x(t − n + 1); u(t), u(t − 1), ..., u(t − m + 1)]    (5.4.3)

where u(t) and x(t) are, respectively, the input and output signals of the plant at time t, g is a nonlinear function, and m ≤ n.
We are interested in training a suitable layered neural network to capture the dynamics of the plant in Equation (5.4.3), thus modeling the plant. Here, we assume that the order of the plant is known (m and n are known). The general form of Equation (5.4.3) suggests the use of the time-delay neural network shown inside the dashed rectangle in Figure 5.4.3. This may also be viewed as a (nonlinear) recursive filter, termed an infinite-duration impulse response (IIR) filter in the linear filtering literature.
During training, the neural network and the plant receive the same input u(t). The neural network also receives the plant's output x(t + 1) (switch S in the up position in Figure 5.4.3). Backprop can be used to update the weights of the neural network based on the "static" mapping pairs {[x(t), ..., x(t − n + 1), u(t), ..., u(t − m + 1)], x(t + 1)} for various values of t. This identification scheme is referred to as the series-parallel identification model (Narendra and Parthasarathy, 1990). After training, the neural network with the switch S in the down position (the network's own output x̂(t + 1) is fed back as the input to the top delay line in Figure 5.4.3) will generate (recursively) an output time sequence in response to an input time sequence. If the training was successful, one would expect the network output x̂(t) to approximate the actual output of the plant, x(t), for the same input signal u(t) and the same initial conditions. Theoretical justification for the effectiveness of this neural network identification method can be found in Levin and Narendra (1992).
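The sketch below contrasts the two modes of Figure 5.4.3: constructing the "static" training pairs with the switch S up (series-parallel mode) and recursive retrieval with the switch S down, where the model's own outputs are fed back. The function net stands for any trained one-step model; its name, signature, and the array handling are assumptions of this sketch.

```python
# Sketch of the series-parallel identification scheme and of parallel
# (recursive) retrieval. `net` maps a window of past outputs and inputs to an
# estimate of x(t+1); its interface is an assumption.
import numpy as np

def series_parallel_pairs(x, u, n, m):
    """Pairs ([x(t),...,x(t-n+1), u(t),...,u(t-m+1)], x(t+1)), switch S up."""
    X, d = [], []
    for t in range(max(n, m) - 1, len(x) - 1):
        window = np.concatenate([x[t - n + 1:t + 1][::-1],   # plant outputs
                                 u[t - m + 1:t + 1][::-1]])  # plant inputs
        X.append(window)
        d.append(x[t + 1])
    return np.array(X), np.array(d)

def parallel_retrieval(net, u, x_init, n, m):
    """Switch S down: the model's own predictions are fed back recursively."""
    x_hat = list(x_init)                    # at least max(n, m) initial samples
    for t in range(len(x_hat) - 1, len(u) - 1):
        window = np.concatenate([np.array(x_hat[t - n + 1:t + 1][::-1]),
                                 u[t - m + 1:t + 1][::-1]])
        x_hat.append(net(window))           # x_hat(t+1) reused at the next step
    return np.array(x_hat)
```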
Narendra and Parthasarathy (1990) reported successful identification of nonlinear plants by time-delay neural networks similar to the one in Figure 5.4.3. In one of their simulations, the feedforward part of the neural network consisted of a two-hidden-layer network with five inputs and a single linear output unit. The two hidden layers consisted of 20 and 10 units, respectively, with bipolar sigmoid activations. This network was used
to identify the unknown plant
(5.4.4)
The inputs to the neural network during training were the delayed plant outputs and inputs appearing on the right-hand side of Equation (5.4.4). Incremental backprop was used to train the network using a uniformly distributed random input signal whose amplitude was in the interval [−1, +1]. The training phase consisted of 100,000 training iterations, which amounts to one training cycle over the random input signal u(t), 0 < t ≤ 100,000. A learning rate of 0.25 was used. Figure 5.4.4(a) shows the
output of the plant (solid line) and the model (dotted line) for
the input signal
(5.4.5)
It should be noted that in the above simulation no attempt was made to optimize the network size or to tune the learning process. For example, Figure 5.4.4(b) shows simulation results with a single-hidden-layer net consisting of twenty hidden units with bipolar sigmoid activations. Here, incremental backprop with a learning rate of 0.25 was used. The training phase consisted of 5 × 10⁶ iterations. This amounts to 10,000 training cycles over a 500-sample input signal having the same characteristics as described above.
Other learning algorithms may be used for training
the timedelay neural network discussed above, some of which are
extensions of algorithms used in classical linear adaptive filtering
or adaptive control. Nerrand et
al. (1993) present examples of
such algorithms.
Figure 5.4.4. Identification results for the plant in Equation (5.4.4) by a time-delay neural network. Plant output x(t) (solid line) and neural network output x̂(t) (dotted line) in response to the input signal in Equation (5.4.5). (a) The network has two hidden layers and is trained with incremental backprop for one cycle over a 100,000-sample random input signal. (Adapted from K. S. Narendra and K. Parthasarathy, 1990, Identification and Control of Dynamical Systems Using Neural Networks, IEEE Trans. on Neural Networks, 1(1), 4-27, ©1990 IEEE.) (b) The network has a single hidden layer and is trained for 10,000 cycles over a 500-sample random input signal.
5.4.2 Backpropagation Through Time
In the previous section, a partially recurrent neural network capable of temporal association was presented. In general, however, a fully recurrent neural net is a more appropriate and more economical alternative. Here, individual units may be input units, output units, or both. The desired targets are defined on a set of arbitrary units at certain predetermined times. Also, arbitrary interconnection patterns between units can exist. An example of a simple two-unit fully interconnected network is shown in Figure 5.4.5(a). The network receives an input sequence x(t) at unit 1, and it is desired that the network generate the sequence d(t) as the output y2(t) of unit 2.
A network which behaves identically to the above
simple recurrent net over the time steps t = 1, 2, 3, and
4 is shown in Figure 5.4.5(b). This amounts to unfolding the
recurrent network in time (Minsky and Papert, 1969) to
arrive
at a feedforward
layered network. The number of resulting layers
is equal to the unfolding time interval T. This idea is practical only when T is small, and it limits the maximum length of sequences that can be generated. Here, all units in the recurrent
network are duplicated T times, so that a separate unit
in the unfolded network holds the state yi(t)
of the equivalent recurrent network at time t. Note that
the connections wij
from unit j to unit i in the unfolded network are
identical for all layers.
Figure 5.4.5. (a) A simple recurrent network. (b)
A feedforward network generated by unfolding in time the recurrent
net in (a). The two networks are equivalent over the four time
steps t = 1, 2, 3, 4.
The resulting unfolded network simplifies the training process of encoding the x(t) → d(t) sequence association, since backprop learning is now applicable. However, we should note a couple of things here. First, targets may be specified for hidden units. Thus, errors at the outputs of hidden units, and not just the output errors, must be propagated backward from the layer in which they originate. Second, it is important to respect the constraint that all copies of each weight wij must remain identical across duplicated layers (backprop normally produces different increments Δwij for each particular weight copy). A simple solution is to add together the individual weight changes for all copies of a particular weight wij and then change all such copies by the total amount. Once trained,
the weights from any layer of the unfolded net are copied into the recurrent network, which, in turn, is used for the temporal association task. Adapting backprop to the training of unfolded recurrent neural nets results in the so-called backpropagation through time learning method (Rumelhart et al., 1986b). There exist relatively
few applications of this technique in the literature (e.g., Rumelhart
et al., 1986b; Nowlan,
1988; Nguyen and Widrow,
1989). One reason
is its inefficiency in handling long sequences. Another reason
is that other learning methods are able to solve the problem without
the need for unfolding. These methods are treated next. But
first, we describe one interesting application of backpropagation
through time: the truck backer-upper problem.
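Before turning to that application, the sketch below illustrates the unfold-and-sum step just described: the recurrent net of Figure 5.4.5(a) is unrolled for T time steps, errors (including any specified at hidden-unit copies) are propagated backward, and the increments for all copies of a weight are added before the shared weight is changed. The logistic activation, the treatment of x(t) as an external input to unit 1 only, and all names are assumptions of this sketch.

```python
# Sketch of backpropagation through time for the two-unit net of Figure
# 5.4.5(a), unrolled over T steps. Targets may be given for any unit at any
# time (entries of D that are not NaN); weight increments from every copy of
# w_ij are summed before the shared weight is updated.
import numpy as np

def f(net):  return 1.0 / (1.0 + np.exp(-net))   # logistic activation

def bptt_step(W, x_seq, D, eta=0.1):
    """W: 2x2 shared weights; x_seq: inputs to unit 1; D: targets (NaN = none)."""
    T = len(x_seq)
    Y = np.zeros((T + 1, 2))                      # Y[t] = state at time t
    for t in range(T):                            # forward pass through the copies
        ext = np.array([x_seq[t], 0.0])           # x(t) enters unit 1 only
        Y[t + 1] = f(W @ Y[t] + ext)
    grad = np.zeros_like(W)
    delta = np.zeros(2)                           # error arriving from later copies
    for t in range(T, 0, -1):                     # backward pass
        err = np.where(np.isnan(D[t - 1]), 0.0, D[t - 1] - Y[t])
        delta = (delta + err) * Y[t] * (1.0 - Y[t])
        grad += np.outer(delta, Y[t - 1])         # sum increments over all copies
        delta = W.T @ delta                       # propagate to the previous copy
    return W + eta * grad, Y

# Example: teach unit 2 to output d(t) while unit 1 receives x(t).
rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(2, 2))
x_seq = np.array([1.0, 0.0, 1.0, 0.0])
D = np.array([[np.nan, 0.8], [np.nan, 0.2], [np.nan, 0.8], [np.nan, 0.2]])
for _ in range(500):
    W, Y = bptt_step(W, x_seq, D)
```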
Consider the trailer truck system shown in Figure 5.4.6. The goal is to design a controller which successfully backs up the truck so that the back of the trailer, designated by the coordinates (x, y), ends at (0, 0) with the trailer perpendicular to the dock (i.e., the trailer angle θt is zero), and where only backward movements of the cab are allowed. The controller receives the observed state x = [x, y, θt, θc]T (θc is the cab angle) and produces a steering signal (angle) θs. It is assumed that the truck backs up at a constant speed. The details of the trailer truck kinematics can be found in Miller et al. (1990a). The original application assumes six state variables, including the position of the back of the cab. However, these two additional variables may be eliminated if the length of the cab and that of the trailer are given.
Figure 5.4.6. A pictorial representation of the truck backer-upper problem. The objective here is to design a controller which generates a steering signal θs that successfully backs up the truck so that the back of the trailer ends at the (0, 0) reference point, with the trailer perpendicular to the dock.
Before the controller is designed, a feedforward single-hidden-layer neural network is trained, using backprop, to emulate the truck and trailer kinematics. This is accomplished by training the network on a large number of backup trajectories (corresponding to random initial trailer truck position configurations), each consisting of a set of association pairs {[x(k − 1)T, θs(k − 1)]T, x(k)}, where k = 1, 2, ..., T, and T represents the number of backup steps until the trailer hits the dock or leaves some predesignated borders of the parking lot (T depends on the initial state of the truck and the applied steering signal θs). The steering signal was selected randomly during this training process. The general idea for training the emulator is depicted in the block diagram of Figure 5.4.3 for the identification of nonlinear dynamical systems by a neural network. However, the tapped delay lines are not needed here because of the kinematic nature of the trailer truck system. Next, the trained emulator network is used to train the controller. Once trained, the controller is used to control the real system. The reason for training the controller with the emulator and not with the real system is explained below.
Figure 5.4.7 shows the controller/emulator system in a retrieval mode. The whole system is recurrent due to the external feedback loops (actually, the system exhibits partial recurrence, since the emulator is a feedforward network and since it will be assumed that the controller has a feedforward single-hidden-layer architecture). The controller and the emulator are labeled C and E, respectively, in the figure. The controller receives the input vector x(k) and responds with a single output θs(k), representing the control signal.
During training, the controller/emulator combination is unfolded in time and the error between the final trailer state and the desired docked state is computed. Once the output layer errors are computed, they are propagated back through the emulator network units and through the controller network units. Here, only the controller weights are adjusted (with equal increments for all copies of the same weight, as discussed earlier). The need to propagate the error through the plant block necessitates that a neural-network-based plant emulator be used to replace the plant during training.
The trained controller is capable of backing the truck from any
initial state, as long as it has sufficient clearance from the
loading dock. Thousands of backups are required to train the
controller. It is helpful (but not necessary) to start the learning
with "easy" initial cases and then proceed to train
with more difficult cases. Typical backup trajectories are shown
in Figure 5.4.9.
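A compact sketch of this controller-through-emulator training loop is given below: the controller and a frozen emulator are run forward for T backup steps, the final-state error is backpropagated through every emulator copy, and only the controller weights are changed (summing the increments from all copies). Everything here, including the single-hidden-layer tanh networks, the quadratic final-state cost, and the use of an untrained random emulator for brevity (in practice it would first be trained on plant trajectories as described above), is an illustrative assumption.

```python
# Sketch: training a controller by backpropagating the final-state error
# through a frozen neural emulator of the plant over T backup steps.
import numpy as np

rng = np.random.default_rng(1)

def init(n_in, n_hid, n_out):
    return {"W1": 0.1 * rng.normal(size=(n_hid, n_in)), "b1": np.zeros(n_hid),
            "W2": 0.1 * rng.normal(size=(n_out, n_hid)), "b2": np.zeros(n_out)}

def forward(p, v):
    h = np.tanh(p["W1"] @ v + p["b1"])
    return p["W2"] @ h + p["b2"], h

def backward(p, v, h, grad_out):
    """Gradients w.r.t. the input v and the parameters, given dE/d(output)."""
    gW2, gb2 = np.outer(grad_out, h), grad_out
    gh = (p["W2"].T @ grad_out) * (1.0 - h ** 2)
    gW1, gb1 = np.outer(gh, v), gh
    return p["W1"].T @ gh, {"W1": gW1, "b1": gb1, "W2": gW2, "b2": gb2}

n_state, T, eta = 4, 20, 0.01
controller = init(n_state, 8, 1)             # state -> steering angle
emulator = init(n_state + 1, 16, n_state)    # (state, steering) -> next state

x = rng.normal(size=n_state)                 # some initial truck/trailer state
tape = []
for k in range(T):                           # closed-loop forward pass
    s, hc = forward(controller, x)
    x_next, he = forward(emulator, np.concatenate([x, s]))
    tape.append((x, s, hc, he))
    x = x_next

grad_x = x.copy()                            # dE/dx_T for E = 0.5 * ||x_T||^2
totals = {key: np.zeros_like(val) for key, val in controller.items()}
for x_k, s, hc, he in reversed(tape):        # backprop through emulator copies
    g_in, _ = backward(emulator, np.concatenate([x_k, s]), he, grad_x)
    grad_xk, grad_s = g_in[:n_state], g_in[n_state:]
    g_ctrl_in, g_ctrl = backward(controller, x_k, hc, grad_s)
    for key in totals:                       # sum increments over all copies;
        totals[key] += g_ctrl[key]           # only the controller adapts
    grad_x = grad_xk + g_ctrl_in             # error flows back to earlier state

for key in controller:                       # gradient-descent step on the
    controller[key] -= eta * totals[key]     # shared controller weights
```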
Figure 5.4.9. Typical backup trajectories for the trailer truck obtained by employing a controller trained with backpropagation through time. (a) Initial state, (b) trajectory, and
(c) final state. (Courtesy of Lee Feldkamp and Gint Puskorius
of Ford Research Laboratory, Dearborn, Michigan.)
5.4.3 Recurrent Backpropagation
This section presents an extension of backprop to
fully recurrent networks where the units are assumed to have
continuously
evolving states. The new algorithm is used to encode
" spatial"
input/output associations as stable equilibria of the recurrent
network; i.e., after training on a set of {xk, dk}
pattern pairs, the presentation of xk
is supposed to drive the network's output y(t) towards
the fixed attractor state
dk.
Thus, the extension here is still restricted to learning
" static"
mappings as opposed to temporal association; however, it will
serve as the basis for other extensions of backprop to sequence
association which are discussed later in this section. The present
extension, usually called recurrent backpropagation, was proposed
independently by Pineda (1987,
1988) and Almeida (1987,
1988).
Consider a recurrent network of N units with outputs yi, connections wij, and activations f(neti). A simple example (N = 2) of such a network is shown in Figure 5.4.5(a). A unit is an input unit if it receives an element xi of the input pattern xk. By definition, non-input units will be assigned an input xi = 0. Output units are designated as units with prespecified desired outputs di. In general, a unit may belong to the set of input units and the set of output units simultaneously, or it may be "hidden" in the sense that it is neither an input nor an output unit. Henceforth, the pattern index k is dropped for convenience.
A biologically as well as electronically motivated choice for the state evolution of unit i is given by (refer to Equations (4.7.8) and (7.1.19), respectively)

dyi/dt = −yi + f(neti),    neti = Σj wij yj + xi    (5.4.6)

where neti represents the total input activity of unit i and the term −yi simulates natural signal decay. By setting dyi/dt = 0, one arrives at the equilibrium points y* of the above system, given by

yi* = f(neti*) = f(Σj wij yj* + xi)    (5.4.7)
The following is a derivation of a learning rule for the system/network in Equation (5.4.6), which assumes the existence and asymptotic stability of at least one equilibrium point y* in Equation (5.4.7). Such equilibrium points represent the steady-state response of the network. Suppose that the network has converged to an equilibrium state y* in response to an input x. Then, if neuron i is an output neuron, it will respond with yi*. This output is compared to the desired response di, resulting in an error signal Ei = di − yi*. The goal is to adjust the weights of the network in such a way that the state y* ultimately becomes equal to the desired response d associated with the input x. In other words, our goal is to minimize the error function

E = (1/2) Σi (Ei)²    (5.4.8)

with Ei = 0 if unit i is not an output unit. Note that an instantaneous error function is used so that the resulting weight update rule is incremental in nature. Using gradient-descent search to update the weight wpq gives

Δwpq = −η ∂E/∂wpq = η Σi Ei (∂yi*/∂wpq)    (5.4.9)

with ∂yi*/∂wpq given by differentiating Equation (5.4.7) to obtain

∂yi*/∂wpq = f'(neti*) [δip yq* + Σj wij (∂yj*/∂wpq)]    (5.4.10)

where δip is the Kronecker delta (δip = 1 if i = p and zero otherwise). Another way of writing Equation (5.4.10) is

Σj Lij (∂yj*/∂wpq) = δip f'(neti*) yq*    (5.4.11)

where

Lij = δij − f'(neti*) wij    (5.4.12)

Now, one may solve for ∂yi*/∂wpq by inverting the set of linear equations represented by Equation (5.4.11) and get

∂yi*/∂wpq = (L⁻¹)ip f'(netp*) yq*    (5.4.13)

where (L⁻¹)ip is the ipth element of the inverse matrix L⁻¹. Hence, substituting Equation (5.4.13) in Equation (5.4.9) gives the desired learning rule:

Δwpq = η f'(netp*) yq* Σi Ei (L⁻¹)ip    (5.4.14)
When the recurrent network is fully connected, the matrix L is N × N and its inversion requires O(N³) operations using standard matrix inversion methods. Pineda and Almeida independently showed that a more economical local implementation, utilizing a modified recurrent neural network of the same size as the original network, is possible. This implementation has O(N²) computational complexity and is usually called recurrent backpropagation. To see this, consider the summation term in Equation (5.4.14) and define it as zp*:

zp* = Σi Ei (L⁻¹)ip    (5.4.15)

Then, undoing the matrix inversion in Equation (5.4.15) leads to the set of linear equations for the zp*, as shown by

Σp Lpi zp* = Ei    (5.4.16)

or, substituting for L from Equation (5.4.12), renaming the index p as j, and rearranging terms,

zi* = Σj f'(netj*) wji zj* + Ei    (5.4.17)
This equation can be solved using an analog network of units zi with the dynamics

dzi/dt = −zi + Σj f'(netj*) wji zj + Ei    (5.4.18)

Note that Equation (5.4.17) is satisfied by the equilibria of Equation (5.4.18). Thus, a solution for z* is possible if it is an attractor of the dynamics in Equation (5.4.18). It can be shown (see Problem 5.4.5) that z* is an attractor of Equation (5.4.18) if y* is an attractor of Equation (5.4.6).
The similarity between Equations (5.4.18) and (5.4.6) suggests that a recurrent network realization for computing z* should be possible. In fact, such a network may be arrived at by starting with the original network and replacing the coupling weight wij from unit j to unit i by the weight f'(neti*) wij from unit i to unit j, assuming linear activations for all units, setting all inputs to zero, and feeding the error Ei as input to the ith output unit (of the original network). The resulting network is called the error-propagation network or the adjoint of the original net. Figure 5.4.10 shows the error-propagation network for the simple recurrent net given in Figure 5.4.5(a).
We may now give a brief outline of the recurrent backpropagation learning procedure. An input pattern xk is presented to the recurrent net and a steady-state solution y* is computed by iteratively solving Equation (5.4.6). The steady-state outputs of the net are compared with the target dk to find the output errors Ei. Then, the zi*'s are computed by iteratively solving Equation (5.4.18). The weights are finally adjusted using Equation (5.4.14) or its equivalent form

Δwpq = η f'(netp*) zp* yq*    (5.4.19)

where zp* is the steady-state solution of Equation (5.4.18).
Next, a new input pattern is presented to the network
and the above procedure is repeated, and so on. It should be noted that recurrent backpropagation reduces to incremental backprop in the special case of a net with no feedback.
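A sketch of one recurrent backpropagation update following this outline is given below; the logistic activation, the simple Euler relaxation of Equations (5.4.6) and (5.4.18), and all parameter values are assumptions of the sketch.

```python
# Sketch of one recurrent backpropagation update: relax Equation (5.4.6) to the
# fixed point y*, relax the adjoint dynamics of Equation (5.4.18) to z*, then
# apply the weight update of Equation (5.4.19).
import numpy as np

def f(net):  return 1.0 / (1.0 + np.exp(-net))     # logistic activation
def fp(net): return f(net) * (1.0 - f(net))        # its derivative

def rbp_step(W, x, d, output_mask, eta=0.2, dt=0.1, n_relax=500):
    N = len(x)
    y = np.zeros(N)
    for _ in range(n_relax):                       # Equation (5.4.6)
        y += dt * (-y + f(W @ y + x))
    net = W @ y + x                                # net* at the equilibrium y*
    E = np.where(output_mask, d - y, 0.0)          # E_i = d_i - y_i* (outputs only)

    z = np.zeros(N)
    for _ in range(n_relax):                       # Equation (5.4.18), adjoint net
        z += dt * (-z + (fp(net) * z) @ W + E)
    W = W + eta * np.outer(fp(net) * z, y)         # Equation (5.4.19)
    return W, y

# Example: units 0 and 1 are inputs, units 2 and 3 are outputs.
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))
x = np.array([0.5, -0.3, 0.0, 0.0])                # non-input units get x_i = 0
d = np.array([0.0, 0.0, 0.9, 0.2])
mask = np.array([False, False, True, True])
for _ in range(200):
    W, y = rbp_step(W, x, d, mask)
```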
The above analysis assumed that finite equilibria
y* exist and are
stable. However, it has been shown (Simard et al., 1988, 1989)
that for any recurrent neural network architecture, there always exist divergent trajectories for Equation (5.4.6). In practice,
though, if the initial weights are chosen to be small enough,
the network almost always converges to a finite stable equilibrium
y*.
One potential application of recurrent backpropagation networks is as associative memories (for the definition and details of associative memories, refer to Chapter 7). This is because these networks build attractors dk which correspond to the input/output association patterns {xk, dk}. That is, if a noisy and/or incomplete version of a trained pattern xk is presented as input, it potentially causes the network to eventually converge to dk.
These pattern completion/error correction features are superior
to those of feedforward networks (Almeida, 1987). Other
applications
of recurrent backpropagation nets can be found in Qian and Sejnowski
(1989) and Barhen et al. (1989).
5.4.4 Time-Dependent Recurrent Backpropagation
The recurrent backpropagation method just discussed can be extended to recurrent networks that produce time-dependent trajectories. One such extension is the time-dependent recurrent backpropagation method of Pearlmutter (1989a, b) [see also Werbos (1988) and Sato (1990)]. In Pearlmutter's method, learning is performed as gradient descent in the weights of a continuous recurrent network to minimize an error function E of the temporal trajectory of the states. It can be thought of as an extension of recurrent backpropagation to dynamic sequences. The following is a brief outline of this algorithm.
Here, we start with a recurrent net of units yi having the dynamics

τi dyi/dt = −yi + f(neti) + xi(t),    neti = Σj wij yj    (5.4.20)
Note that the inputs xi(t)
are continuous functions of time. Similarly, each output unit
yl has a
desired target signal dl(t)
that is also a continuous function of time.
Consider minimizing a criterion E(y), which is some function of the trajectory y(t) for t between 0 and t1. Since the objective here is to teach the lth output unit to produce the trajectory dl(t) upon the presentation of x(t), an appropriate criterion (error) functional is

E = (1/2) ∫₀^t1 Σl [yl(t) − dl(t)]² dt    (5.4.21)

which measures the deviation of yl from the function dl. Now, the partial derivatives of E with respect to the weights may be computed as

∂E/∂wij = −∫₀^t1 zi(t) f'(neti(t)) yj(t) dt    (5.4.22)
where yj(t) is the solution to Equation (5.4.20), and zi(t) is the solution of the dynamical system given by

τi dzi/dt = zi − Σj wji f'(netj) zj − Ei(t)    (5.4.23)

with the boundary condition zi(t1) = 0. Here, Ei(t) is given by di(t) − yi(t) if unit i is an output unit, and zero otherwise. One may also simultaneously minimize E in the time-constant space by gradient descent, utilizing

∂E/∂τi = ∫₀^t1 zi(t) (dyi/dt) dt    (5.4.24)
Equations (5.4.22) and (5.4.24) may be derived by
using a finite
difference approximation as in Pearlmutter (1988).
They may also be obtained using the calculus of variations and
Lagrange multipliers as in optimal control theory (Bryson and
Denham, 1962).
Using numerical integration (e.g., first-order finite difference approximations), one first solves Equation (5.4.20) for t ∈ [0, t1], then sets the boundary condition zi(t1) = 0 and integrates the system in Equation (5.4.23) backward from t1 to 0. Having determined yi(t) and zi(t), we may proceed with computing ∂E/∂wij and ∂E/∂τi from Equations (5.4.22) and (5.4.24), respectively. Next, the weight changes Δwij and the time-constant changes Δτi are computed from −η ∂E/∂wij and −η ∂E/∂τi, respectively.
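The sketch below follows this procedure with simple first-order (Euler) integration of the equations given above; the tanh activation, the step size, and the array layout are assumptions of the sketch.

```python
# Sketch of one gradient evaluation for time-dependent recurrent
# backpropagation: Euler-integrate Equation (5.4.20) forward, integrate the
# adjoint system of Equation (5.4.23) backward from z(t1) = 0, and accumulate
# the integrals of Equations (5.4.22) and (5.4.24).
import numpy as np

def tdrbp_gradients(W, tau, x_traj, d_traj, out_mask, dt=0.01):
    """x_traj, d_traj: arrays of shape (T, N); tau: time constants (length N)."""
    T, N = x_traj.shape
    y = np.zeros((T, N))
    ydot = np.zeros((T, N))
    for t in range(T - 1):                                   # Equation (5.4.20)
        ydot[t] = (-y[t] + np.tanh(W @ y[t]) + x_traj[t]) / tau
        y[t + 1] = y[t] + dt * ydot[t]
    ydot[T - 1] = (-y[T - 1] + np.tanh(W @ y[T - 1]) + x_traj[T - 1]) / tau

    E = np.where(out_mask, d_traj - y, 0.0)                  # E_i(t) = d_i - y_i
    z = np.zeros(N)                                          # boundary z_i(t1) = 0
    gW = np.zeros_like(W)
    gtau = np.zeros(N)
    for t in range(T - 1, 0, -1):                            # backward in time
        fp = 1.0 - np.tanh(W @ y[t]) ** 2
        gW += -dt * np.outer(fp * z, y[t])                   # Equation (5.4.22)
        gtau += dt * z * ydot[t]                             # Equation (5.4.24)
        zdot = (z - W.T @ (fp * z) - E[t]) / tau             # Equation (5.4.23)
        z = z - dt * zdot                                    # step from t to t - dt
    return gW, gtau   # then, e.g., W -= eta * gW and tau -= eta * gtau
```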
Due to its memory requirements and continuous-time nature, time-dependent recurrent backpropagation is more appropriate as an off-line training method. Some applications of this technique include learning limit cycles in two-dimensional space (Pearlmutter, 1989a), like the ones shown in Figures 5.4.11 and 5.4.12. The trajectories in Figures 5.4.11(b) and (c) are produced by a network of four hidden units, two output units, and no input units, after 1,500 and 12,000 learning cycles, respectively. The desired trajectory [d1(t) versus d2(t)] is the circle in Figure 5.4.11(a). The state space trajectories in Figures 5.4.12(b) and (c) are generated by a network with 10 hidden units and two output units, after 3,182 and 20,000 cycles, respectively. The desired trajectory is shown in Figure 5.4.12(a). This method has also been shown to work well in time series prediction (Logar et al., 1993). Fang and Sejnowski (1990) reported improved learning speed and convergence of the above algorithm as a result of allowing independent learning rates for individual weights in the network (for example, they report a better-formed figure "eight" compared to the one in Figure 5.4.12(c), after only 2,000 cycles).
Figure 5.4.11. Learning performance of time-dependent recurrent backpropagation: (a) desired trajectory d1(t)
versus d2(t),
(b) generated state space trajectory, y1(t)
versus y2(t)
after 1,500 cycles, and (c) y1(t)
versus y2(t)
after 12,000 cycles. (From B. A. Pearlmutter, 1989a, with permission
of the MIT Press.)
Finally, an important property of the continuous-time recurrent net described by Equation (5.4.20) should be noted. It has been shown (Funahashi and Nakamura, 1993) that the output of a sufficiently large continuous-time recurrent net with hidden units can approximate any continuous state space trajectory to any desired degree of accuracy. This means that recurrent neural nets are universal approximators of dynamical systems. Note, however, that this says nothing about the existence of a learning procedure which will guarantee that any continuous trajectory is learned successfully. What it does imply, though, is that the failure to learn a given continuous trajectory with a sufficiently large recurrent net should be attributed to the learning algorithm used.
Figure 5.4.12. Learning the figure "eight" by a time-dependent recurrent backpropagation net. (a) Desired
state space trajectory, (b) generated trajectory after 3,182 cycles,
(c) generated trajectory after 20,000 cycles. (From B. A. Pearlmutter,
1989a, with permission of the MIT Press.)
5.4.5 Real-Time Recurrent Learning
Another method that allows sequences to be associated is the real-time recurrent learning (RTRL) method proposed by Williams and Zipser (1989a, b). This method allows recurrent networks to learn tasks that require retention of information over time periods having either fixed or indefinite length. RTRL assumes recurrent nets with discrete-time states that evolve according to

yi(t + 1) = f(neti(t)),    neti(t) = Σj wij yj(t) + xi(t)    (5.4.25)

A desired target trajectory d(t) is associated with each input trajectory x(t). As before, the quadratic error measure is used:

Etotal = Σt E(t)    (5.4.26)

where E(t) = (1/2) Σl [El(t)]², with El(t) = dl(t) − yl(t) if unit l has a specified target at time t and El(t) = 0 otherwise. Thus, gradient descent on Etotal gives

Δwij = −η Σt ∂E(t)/∂wij = Σt Δwij(t)    (5.4.27)

with

Δwij(t) = η Σl El(t) ∂yl(t)/∂wij    (5.4.28)

The partial derivative ∂yl(t)/∂wij in Equation (5.4.28) can now be computed from Equation (5.4.25) as

∂yl(t)/∂wij = f'(netl(t − 1)) [δli yj(t − 1) + Σk wlk ∂yk(t − 1)/∂wij]    (5.4.29)
Since Equation (5.4.29) relates the derivatives at time t to those at time t − 1, we can iterate it forward (starting from some initial value for ∂yl(0)/∂wij; e.g., zero) and compute ∂yl(t)/∂wij at any desired time, while using Equation (5.4.25) to update the states at each iteration. Each cycle of this algorithm requires time proportional to N⁴, where N is the number of units in a fully interconnected net. Instead of using Equation (5.4.27) to update the weights, it was found (Williams and Zipser, 1989a) that updating the weights after each time step according to Equation (5.4.28) works well as long as the learning rate η is kept sufficiently small; hence the name real-time recurrent learning. This avoids the need to allocate memory proportional to the maximum sequence length and leads to simple online implementations.
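A sketch of the online update is given below: a sensitivity array P with entries p_l^{ij}(t) = ∂yl(t)/∂wij is carried forward using Equation (5.4.29), and the weights are adjusted at every step via Equation (5.4.28). The logistic activation, the dense storage of P, and the toy example are assumptions of this sketch.

```python
# Sketch of real-time recurrent learning (RTRL). P[l, i, j] stores the
# sensitivity dy_l(t)/dw_ij, updated forward in time (Equation (5.4.29));
# weights are adjusted at every step (Equation (5.4.28)).
import numpy as np

def f(net):  return 1.0 / (1.0 + np.exp(-net))
def fp(net): return f(net) * (1.0 - f(net))

def rtrl_step(W, y, P, x, d, out_mask, eta=0.05):
    N = len(y)
    net = W @ y + x                                   # Equation (5.4.25)
    y_new = f(net)
    # Equation (5.4.29): P_new[l,i,j] = f'(net_l)(delta_li y_j + sum_k w_lk P[k,i,j])
    P_new = np.einsum('lk,kij->lij', W, P)
    P_new[np.arange(N), np.arange(N), :] += y          # the delta_li * y_j term
    P_new *= fp(net)[:, None, None]
    E = np.where(out_mask, d - y_new, 0.0)             # errors at the new time step
    W = W + eta * np.einsum('l,lij->ij', E, P_new)     # Equation (5.4.28)
    return W, y_new, P_new

# Example usage over a sequence (teacher forcing not used here).
rng = np.random.default_rng(0)
N = 3
W = 0.1 * rng.normal(size=(N, N))
y, P = np.zeros(N), np.zeros((N, N, N))
out_mask = np.array([False, False, True])              # unit 2 carries the target
for t in range(100):
    x = np.array([np.sin(0.3 * t), 0.0, 0.0])          # input at unit 0 only
    d = np.array([0.0, 0.0, np.sin(0.3 * (t + 1))])    # desired output at unit 2
    W, y, P = rtrl_step(W, y, P, x, d, out_mask)
```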
The power of this method was demonstrated through a series of
simulations (Williams and
Zipser, 1989b). In one particular simulation, a 12-unit recurrent net learned to detect whether a string of arbitrary length composed of left and right parentheses consists entirely of sets of balanced parentheses, by observing only the
action of a Turing machine performing the same task. In some
of the simulations, it was found that learning speed (and sometimes
convergence) improved by setting the states of units
yl(t)
with known targets to their target values, but only after computing
El(t)
and the derivatives in Equation (5.4.29). This heuristic is known
as teacher forcing; it helps keep the network closer to the desired
trajectory. The reader may refer to Robinson and Fallside (1988),
Rohwer (1990), and Sun et al. (1992) for other
methods for learning
sequences in recurrent networks. (The reader is also referred
to the Special Issue of the IEEE Transactions on Neural Networks,
volume 5(2), 1994, for further exploration into recurrent
neural networks and their applications.)