**4.3 Characterization of Additional Learning Rules**

The above approximation of the average learning equation
is valid when the learning equation contains strongly mixing random
processes (processes for which the "past" and the "future"
are asymptotically independent) with the mixing rate high compared
to the rate of change of the solution process; i.e., it can be
assumed that the weights are uncorrelated with the patterns **x**
and with themselves. Taking the expected (average) value of a
stochastic equation one obtains a deterministic equation whose
solution approximates asymptotically the behavior of the original
system as described by the stochastic equation (here, a small
learning rate ρ = ρ(*t*) is normally assumed). Roughly, the higher the mixing rate the
better the approximation in Equation (4.3.1) (Kushner and Clark,
1978). We shall frequently employ Equation (4.3.1) in the remainder
of this chapter.

A more rigorous characterization of stochastic learning
requires the more advanced theory of stochastic differential equations
and will not be considered here [see Kushner (1977) and Ljung
(1978)]. Rather, we may proceed with a deterministic analysis
using the "average versions" of the stochastic equations.
It may be shown that a necessary condition for the stochastic
learning rule to converge (in the mean-square sense) is that the
average version of the learning rule must converge. In addition,
and under certain assumptions, the exact solution of a stochastic
equation is guaranteed to "stay close," in a probabilistic
sense, to the solution of the associated
average equation. It has been shown (Geman, 1979) that, under
strong mixing conditions (and some additional assumptions), the
mean-square difference between the solution of the stochastic
equation and the solution of its average equation vanishes as
the learning rate approaches zero.
This result implies that if sufficiently small learning rates
are used, the behavior of a stochastic learning equation may be
well approximated, in a mean-square sense, by the deterministic
dynamics of its corresponding average equation. Oja (1983) pointed
out that the convergence of constrained gradient descent- (or
ascent-) based stochastic learning equations (the type of equations
considered in this chapter) can be studied with averaging techniques;
i.e., the asymptotically stable equilibria of the average equation
are the possible limits of the stochastic equation. Several examples
of applying the averaging technique to the characterization of
learning rules can be found in Kohonen (1989).
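As a quick numerical illustration of the averaging step (a minimal sketch, not from the text; the data, dimensions, and seed are arbitrary choices), the expected value of the stochastic Hebbian increment *y***x**, with *y* = **w**T**x**, equals **Cw**, the drift term of the corresponding average equation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary data with unequal variances per component
X = rng.normal(size=(100000, 3)) @ np.diag([2.0, 1.0, 0.5])
C = X.T @ X / len(X)                  # sample autocorrelation matrix
w = np.array([0.3, -0.2, 0.9])        # a fixed weight vector
y = X @ w                             # unit output for each pattern
# Empirical average of the stochastic Hebbian increment y*x ...
avg_increment = (y[:, None] * X).mean(axis=0)
# ... coincides with C w, the drift of the average learning equation
assert np.allclose(avg_increment, C @ w)
```

With the weights held fixed (as the mixing argument assumes), the identity is exact for the sample autocorrelation matrix; the averaging approximation consists of letting **w** evolve slowly under this drift.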

Before proceeding with further analysis of learning
rules, we make the following important observations regarding
the nature of the learning parameter in the stochastic learning
equation (Heskes and Kappen, 1991). When a neural network interacts
with a fixed unchanging (stationary) environment, the aim of the
learning algorithm is to adjust the weights of the network in
order to produce an optimal response; i.e., an optimal representation
of the environment. To produce such an optimal and static representation,
we require the learning parameter, which controls the amount of
learning, to eventually approach zero. Otherwise, fluctuations
in the representation will persist, due to the stochastic nature
of the learning equation, and asymptotic convergence to optimal
representation is never achieved. For a large class of stochastic
algorithms, asymptotic convergence can be guaranteed (with high
probability) by using a decaying learning parameter ρ(*t*) → 0
satisfying Σ ρ(*t*) = ∞ and Σ ρ2(*t*) < ∞
(Ljung, 1977; Kushner and Clark, 1978).

On the other hand, consider biological neural nets.
Human beings, of course, are able to continually learn throughout
their entire lifetime. In fact, human learning is able to proceed
on two different time scales: humans learn with age (very large
time-scale adaptation) and are also capable of discovering
regularities and attending to details (short time-scale learning).
This constant tendency to learn accounts for the adaptability
of natural neural systems to a changing environment. Therefore,
it is clear that the learning processes in biological neural networks
do not allow for asymptotically vanishing learning parameters.

In order for artificial neural networks to be capable
of adapting to a changing (nonstationary) environment, the learning
parameter must take a constant nonzero value. The larger the
learning parameter, the faster the response of the network to
the changing environment. On the other hand, a large learning
parameter has a negative effect on the accuracy of the network's
representation of the environment at a given time; a large ρ gives
rise to large fluctuations around the desired optimal representation.
In practice, though, one might be willing to trade some degree
of fluctuation about the optimal representation (solution) for
adaptability to a nonstationary process. Similar ideas have been
proposed in connection with stochastic adaptive linear filtering.
Here, an adaptive algorithm with a constant step size is used
because it has the advantage of a limited memory, which enables
it to track time fluctuations in the incoming data. These ideas
date back to Wiener (1956) in connection with his work on linear
prediction theory.
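This trade-off can be sketched with a scalar toy problem (an illustration only; the drift rate, step sizes, and seed are arbitrary and not from the text): a running estimate of a slowly drifting mean keeps tracking the drift when updated with a constant step size, but stalls when the step size vanishes as 1/*k*.

```python
import numpy as np

rng = np.random.default_rng(5)
m_const, m_decay = 0.0, 0.0
rho = 0.1                              # constant learning parameter
for k in range(1, 5001):
    true_mean = 0.001 * k              # slowly drifting (nonstationary) environment
    x = true_mean + rng.normal()       # noisy observation
    m_const += rho * (x - m_const)     # constant step: limited memory, tracks drift
    m_decay += (1.0 / k) * (x - m_decay)  # 1/k step: running average of ALL samples

# The constant-step estimate stays near the current mean (5.0, with
# fluctuations); the 1/k estimate lags far behind the drift.
assert abs(m_const - 5.0) < 1.0
assert m_decay < 4.0
```

The constant-ρ estimate trades persistent fluctuations (of order √ρ) for the ability to track the nonstationary mean, exactly the compromise discussed above.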

from which we may determine the average instantaneous
criterion function

**4.3.2 Improved Hebbian Learning**

Consider the criterion function

*J*(**w**) = −(1/2) (**w**T**Cw**) / (**w**T**w**)    (4.3.6)

It is a well-established property of quadratic forms
that if **w** is constrained to the surface of the unit hypersphere,
then Equation (4.3.6) is minimized when **w** = **c**(1),
the eigenvector of **C** associated with its largest eigenvalue
λ1 (e.g., see Johnson and Wichern, 1988).
Also, for any real symmetric *n* × *n* matrix **A**,
the Rayleigh quotient *r*(**w**) = (**w**T**Aw**) / (**w**T**w**)
satisfies λ*n* ≤ *r*(**w**) ≤ λ1,
where λ1 and λ*n*
are the largest and smallest eigenvalues of **A**, respectively.
Let us start from the above criterion and derive an average learning
equation. Performing gradient descent on *J* and employing Equation (4.3.1), we get

d**w**/d*t* = −ρ ∇*J* = ρ [**Cw** / (**w**T**w**) − (**w**T**Cw**) **w** / (**w**T**w**)2]    (4.3.7)

which can be shown to be the average version of the
nonlinear stochastic learning rule

d**w**/d*t* = ρ [*y***x** / (**w**T**w**) − *y*2 **w** / (**w**T**w**)2],  *y* = **w**T**x**    (4.3.8)

If we heuristically set **w**T**w** = 1
for the two terms in the above equation,
Equation (4.3.8) reduces to the continuous version of Oja's rule
[refer to Equation (3.3.6)]. Let us continue with the characterization
of (4.3.8) and defer the analysis of Oja's rule to Section 4.3.3.
At equilibrium, Equation (4.3.7) gives

**Cw** = [(**w**T**Cw**) / (**w**T**w**)] **w**    (4.3.9)

Hence, the equilibria of Equation (4.3.7) are the
solutions of Equation (4.3.9) given by **w***
= **c**(*i*),
*i* = 1, 2, ..., *n*; i.e., the eigenvectors of **C**. Here,
*J* takes its smallest value of −λ1/2
at **w*** = **c**(1).
This can be easily verified by direct substitution in Equation
(4.3.6).
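These equilibrium and optimality claims are easy to verify numerically. The following sketch (an illustration with arbitrary test data and seed, not part of the text) checks that every eigenvector of **C** is an equilibrium of the average dynamics and that *J* attains its minimum −λ1/2 at **c**(1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
C = A @ A.T                        # symmetric positive semi-definite "C"
lam, V = np.linalg.eigh(C)         # eigenvalues ascending, columns = eigenvectors

def J(w):
    # Rayleigh-quotient criterion of Equation (4.3.6)
    return -0.5 * (w @ C @ w) / (w @ w)

def drift(w):
    # Right-hand side of the average equation (4.3.7), with rho = 1
    return C @ w / (w @ w) - (w @ C @ w) * w / (w @ w) ** 2

c1 = V[:, -1]                      # principal eigenvector c(1)
assert np.isclose(J(c1), -lam[-1] / 2)        # J(c(1)) = -lambda_1 / 2
for i in range(5):
    assert np.allclose(drift(V[:, i]), 0, atol=1e-10)  # eigenvectors are equilibria
for _ in range(100):
    w = rng.normal(size=5)         # random directions never beat c(1)
    assert J(w) >= J(c1) - 1e-12
```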

Next, consider the Hessian of *J* at **w***
= **c**(*i*)
(assuming, without loss of generality, ||**c**(*i*)|| = 1) and multiply it
by **c**(*j*),
namely, **H**(**c**(*i*)) **c**(*j*).
It can be shown that this quantity is given by (see Problem 4.3.1;
for a reference on matrix differential calculus, the reader is
referred to the book by Magnus and Neudecker, 1988):

**H**(**c**(*i*)) **c**(*j*) = (λ*i* − λ*j*) **c**(*j*)    (4.3.10)

This equation implies that **H**(**w***)
has the same eigenvectors as **C** but with different eigenvalues.
**H**(**w***) is
positive semi-definite only when **w***
= **c**(1). Thus,
by following the dynamics of Equation (4.3.7), **w** will eventually
point in the direction of **c**(1)
(since none of the other directions **c**(*i*)
is stable). Although the direction of **w** will eventually
stabilize, it is entirely possible for ||**w**|| to approach
infinity, and Equation (4.3.7) will appear never to converge.
We may artificially constrain ||**w**|| to finite values by
normalizing **w** after each update of Equation (4.3.8). Alternatively,
we may set the factor **w**T**w** appearing in the two terms of Equation
(4.3.8) equal to 1. This latter case is considered next; it yields
the continuous version of Oja's rule

d**w**/d*t* = ρ *y* (**x** − *y* **w**)    (4.3.11)

The corresponding average learning equation is thus
( Oja, 1982; 1989)

d**w**/d*t* = ρ [**Cw** − (**w**T**Cw**) **w**]    (4.3.12)

which has its equilibria at **w** satisfying

**Cw** = (**w**T**Cw**) **w**    (4.3.13)

The solutions of Equation (4.3.13) are **w***
= **c**(*i*),
*i* = 1, 2, ..., *n*. All
of these equilibria are unstable except for **w*** = ±**c**(1).
This can be seen by noting that the Hessian

**H**(**w**) = −**C** + (**w**T**Cw**) **I** + 2**ww**T**C**    (4.3.14)

is positive definite only at **w***
= **c**(1) (or −**c**(1)).
Note that Equation (4.3.14) is derived starting from ∇*J* = −(1/ρ) d**w**/d*t*,
with d**w**/d*t* given in Equation (4.3.12). Although
*J* is not known, the positive definiteness of **H** can
be seen from

**H**(±**c**(1)) **c**(*j*) = (λ1 − λ*j*) **c**(*j*) for *j* ≠ 1,  **H**(±**c**(1)) **c**(1) = 2λ1 **c**(1)    (4.3.15)

and by noting that the eigenvalues of **C** satisfy
λ1 > λ2 ≥ ... ≥ λ*n* ≥ 0 (we assume λ1 > λ2).
Therefore, Oja's rule is equivalent to a stable version of the
Hebbian rule given in Equation (4.3.8). A formal derivation of
Oja's rule is explored in Problem 4.3.7.
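The eigenstructure claim of Equation (4.3.10) can also be checked without any symbolic work. The sketch below (illustrative matrix and seed, not from the text) differentiates the criterion of Equation (4.3.6) numerically and verifies that, at each unit-norm eigenvector **c**(*i*), the Hessian scales **c**(*j*) by λ*i* − λ*j*, so it is positive semi-definite only at the principal eigenvector:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
C = A @ A.T                        # symmetric "autocorrelation" matrix
lam, V = np.linalg.eigh(C)         # eigenvalues in ascending order

def J(u):
    # Rayleigh-quotient criterion of Equation (4.3.6)
    return -0.5 * (u @ C @ u) / (u @ u)

def num_hessian(w, eps=1e-4):
    # Central-difference Hessian of J at w
    n = len(w)
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            da, db = np.eye(n)[a] * eps, np.eye(n)[b] * eps
            H[a, b] = (J(w + da + db) - J(w + da - db)
                       - J(w - da + db) + J(w - da - db)) / (4 * eps ** 2)
    return H

for i in range(4):
    H = num_hessian(V[:, i])       # Hessian at the unit-norm eigenvector c(i)
    for j in range(4):
        # H(c(i)) c(j) = (lambda_i - lambda_j) c(j), cf. Equation (4.3.10)
        assert np.allclose(H @ V[:, j], (lam[i] - lam[j]) * V[:, j], atol=1e-3)
```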

A single unit employing Oja's rule (Oja's unit)
is equivalent to a linear matched filter. To see this, assume
that for all **x**, **x** = **s** + **v**, where **s**
is a fixed vector (without loss of generality let ||**s**|| = 1)
and **v** is a vector of symmetrically distributed zero-mean
noise with uncorrelated components having variance σ2.
Then **C** = E[**xx**T] = **ss**T + σ2**I**. The largest eigenvalue of **C**
is 1 + σ2 and
the corresponding eigenvector is **s**. Oja's
unit then becomes a matched filter for the data, since **w*** = ±**s**
in Equation (4.3.12). Here, the unit responds maximally to the
data mean **s**. Further characterization of Oja's rule can be found
in Xu (1993).
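The matched-filter claim can be verified directly (a sketch with an arbitrary signal vector and noise variance, not from the text): the principal eigenpair of **ss**T + σ2**I** is (1 + σ2, ±**s**).

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.normal(size=4)
s /= np.linalg.norm(s)             # fixed unit-norm signal (data mean)
sigma2 = 0.1                       # noise variance
C = np.outer(s, s) + sigma2 * np.eye(4)   # C = s s^T + sigma^2 I
lam, V = np.linalg.eigh(C)
assert np.isclose(lam[-1], 1 + sigma2)    # largest eigenvalue is 1 + sigma^2
c1 = V[:, -1]
assert np.isclose(abs(c1 @ s), 1.0)       # principal eigenvector is +/- s
```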

Oja's rule is interesting because it results in
a local learning rule which is biologically plausible. The locality
property is seen by considering the component weight adaptation
rule of Equation (4.3.11), namely

d*wi*/d*t* = ρ *y* (*xi* − *y wi*)    (4.3.16)

and by noting that the change in the *i*th weight
is not an "explicit" function of any other weight except
the *i*th weight itself. Of course, d*wi*/d*t*
does depend on **w** via *y* = **w**T**x**.
However, this dependence does not violate the concept of locality.

It is also interesting to note that Oja's rule is
similar to Hebbian learning with weight decay as in Equation (4.2.18).
For Oja's rule, though, the growth in ||**w**|| is controlled
by a "forgetting" or weight decay term, -*y*2**w**,
which has nonlinear gain; the forgetting becomes stronger with
stronger response, thus preventing ||**w**|| from diverging.

**Example 4.3.1:** This example
shows typical simulation results comparing the evolution of the
weight vector **w** according to the stochastic Oja rule in
Equation (4.3.11) and its corresponding average rule in Equation
(4.3.12).

Consider a training set {**x**} of forty 15-dimensional
column vectors with independent random components generated by
a normal distribution N(0, 1). In the following simulations,
the autocorrelation matrix **C** of the training vectors
has the following set of eigenvalues: {2.561, 2.254, 2.081, 1.786,
1.358, 1.252, 1.121, 0.963, 0.745, 0.633, 0.500, 0.460, 0.357,
0.288, 0.238}.

During training, one of the forty vectors is selected
at random and is used in the learning rule to compute the next
weight vector. Discretized versions of Equations (4.3.11) and
(4.3.12) are used, where d**w**/d*t* is replaced by
**w***k*+1 − **w***k*.
A learning rate ρ = 0.005 is used. This is equivalent
to integrating these equations with Euler's method (e.g., see
Gerald, 1978) using a time step Δ*t* = 0.005 and
ρ = 1. The initial weight vector is set equal to one
of the training vectors. Figures 4.3.1 (a) and (b) show the evolution
of the cosine of the angle between **w***k*
and **c**(1) and the
evolution of the norm of **w***k*,
respectively. The solid line corresponds to the stochastic rule
and the dashed line corresponds to the average rule.

Figure 4.3.1. (a) Evolution of the cosine of the
angle between the weight vector **w***k*
and the principal eigenvector of the autocorrelation matrix **C**
for the stochastic Oja rule (solid line) and for the average Oja
rule (dashed line). (b) Evolution of the magnitude of the weight
vector. The training set consists of forty 15-dimensional real-valued
vectors whose components are independently generated according
to a normal distribution N(0, 1). The presentation order
of the training vectors is random during training.

The above simulation is repeated, but with a fixed
presentation order of the training set. Results are shown in
Figures 4.3.2 (a) and (b). Note that the results for the average
learning equation (dashed line) are identical in both simulations
since they are not affected by the order of presentation of input
vectors. These simulations agree with the theoretical results
on the appropriateness of using the average learning equation
to approximate the limiting behavior of its corresponding stochastic
learning equation. Note that a monotonically decreasing learning
rate (say, proportional to 1/*k* or 1/√*k*,
with *k* ≥ 1) can be used to force the convergence
of the direction of **w***k*
in the first simulation. It is also interesting to note that
better approximations are possible when the training vectors are
presented in a fixed deterministic order (or in a random order
but with each vector guaranteed to be selected once every training
cycle of *m* = 40 presentations). Here, a sufficiently
small, constant learning rate is sufficient for making the average
dynamics approximate, in a practical sense, the stochastic dynamics
for all time.

Figure 4.3.2. (a) Evolution of the cosine of the
angle between the weight vector **w***k*
and the principal eigenvector of the autocorrelation matrix **C**
for the stochastic Oja rule (solid line) and for the average Oja
rule (dashed line). (b) Evolution of the magnitude of the weight
vector. The training set consists of forty 15-dimensional real-valued
vectors whose components are independently generated according
to a normal distribution N(0, 1). The presentation order
of the training vectors is fixed.
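The simulations of Example 4.3.1 are straightforward to re-create. The sketch below uses a smaller dimension than the 15-dimensional setup of the text and an arbitrary seed, so its numbers differ from Figures 4.3.1 and 4.3.2, but the qualitative behavior is the same: both trajectories align with **c**(1), the average weights settle at unit norm, and the stochastic weights fluctuate around that solution.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))              # training set (forty 5-dim vectors)
C = X.T @ X / len(X)                      # autocorrelation matrix
lam, V = np.linalg.eigh(C)
c1 = V[:, -1]                             # principal eigenvector

rho = 0.005
w_sto = X[0].copy()                       # initialize at a training vector
w_avg = X[0].copy()
for k in range(20000):
    x = X[rng.integers(len(X))]           # random presentation order
    y = x @ w_sto
    w_sto += rho * (y * x - y * y * w_sto)            # stochastic Oja rule (4.3.11)
    w_avg += rho * (C @ w_avg - (w_avg @ C @ w_avg) * w_avg)  # average rule (4.3.12)

cos_avg = abs(w_avg @ c1) / np.linalg.norm(w_avg)
cos_sto = abs(w_sto @ c1) / np.linalg.norm(w_sto)
assert cos_avg > 0.99                     # average rule aligns with c(1)
assert cos_sto > 0.9                      # stochastic rule fluctuates around it
assert abs(np.linalg.norm(w_avg) - 1.0) < 1e-2   # norm settles at 1
assert abs(np.linalg.norm(w_sto) - 1.0) < 0.2
```

Plotting `cos` and the weight norms against *k* reproduces the qualitative shape of the curves in Figures 4.3.1 and 4.3.2.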

The continuous-time version of the Yuille et al.
(1989) learning rule is

d**w**/d*t* = ρ (*y***x** − ||**w**||2 **w**)    (4.3.17)

and the corresponding average learning equation is

d**w**/d*t* = ρ (**Cw** − ||**w**||2 **w**)    (4.3.18)

with equilibria at

**w*** = ±√λ*i* **c**(*i*),  *i* = 1, 2, ..., *n*    (4.3.19)

From Equation (4.3.18), the gradient of *J*
is

∇*J* = −**Cw** + ||**w**||2 **w**    (4.3.20)

which leads to the Hessian

**H** = −**C** + ||**w**||2 **I** + 2**ww**T    (4.3.21)

Note that one could have also computed **H** directly
from the known criterion function

*J*(**w**) = −(1/2) **w**T**Cw** + (1/4) ||**w**||4    (4.3.22)

Now, evaluating **H**(**w***i**) **w***j**,
where **w***i** = ±√λ*i* **c**(*i*), we get

(4.3.23)

which implies that the **w***j**
are eigenvectors of **H**(**w***i**)
with eigenvalues λ*i* − λ*j* for *i* ≠ *j*
and 2λ*i* for *i* = *j*. Therefore, **H**(**w***i**)
is positive definite if and only if λ*i* > λ*j*
for all *j* ≠ *i*, which holds only for *i* = 1.
In this case, **w***
= ±√λ1 **c**(1) are the only stable equilibria, and
the dynamics of the stochastic equation are expected to approach
**w***.
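The stability conclusion can be checked numerically (an illustrative sketch with an arbitrary matrix and seed, not from the text): integrating the average equation (4.3.18) with Euler's method drives **w** to ±√λ1 **c**(1), where the Hessian (4.3.21) is positive definite.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
C = A @ A.T                            # symmetric "autocorrelation" matrix
lam, V = np.linalg.eigh(C)
c1, lam1 = V[:, -1], lam[-1]           # principal eigenpair

w = rng.normal(size=4) * 0.1           # small random initial weights
dt = 0.01
for _ in range(20000):                 # Euler integration of (4.3.18), rho = 1
    w += dt * (C @ w - (w @ w) * w)

# Converges to the stable equilibrium +/- sqrt(lambda_1) c(1) of (4.3.19)
assert np.isclose(np.linalg.norm(w), np.sqrt(lam1), atol=1e-3)
assert np.isclose(abs(w @ c1) / np.linalg.norm(w), 1.0, atol=1e-3)

# Hessian (4.3.21) is positive definite at this equilibrium
H = -C + (w @ w) * np.eye(4) + 2 * np.outer(w, w)
assert np.all(np.linalg.eigvalsh(H) > 0)
```

Unlike Oja's rule, the equilibrium norm here carries information: ||**w***|| = √λ1 rather than 1.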

In the following, the unsupervised Hebbian-type
learning rule

Next, we investigate the stability of these equilibria.
Starting from the Hessian

Note that the weight vector in Equation (4.3.33) need not be
normalized to prevent divergence.
