**4.1 Learning as a Search Mechanism**

**4.2 Mathematical Theory of Learning in a Single Unit Setting**

**4.2.1 General Learning Equation**

and the continuous-time version

**4.2.2 Analysis of the Learning Equation**

**4.2.3 Analysis of Some Basic Learning Rules**

which leads to the average learning equation

Now, by setting d**w**/d*t* = **0**,
one arrives at the (only) equilibrium point

From Equation (4.2.4), the underlying instantaneous
criterion function *J* is given by

In this case, the average learning equation becomes

or

(4.2.15)

Let **C** denote the positive semidefinite autocorrelation
matrix defined in Equation (3.3.4),
and let **P** denote the corresponding cross-correlation vector.
If **C** is nonsingular,
then **w*** = **C**-1**P** is the equilibrium state. Note
that **w*** approaches
the minimum SSE solution in the limit of a large training set,
and that this analysis is identical to that of the μ-LMS
rule in Chapter 3. Let us now check the stability of **w***.
The Hessian matrix is

(4.2.16)

which is positive definite if **C** is nonsingular.
Therefore, **w*** = **C**-1**P**
is the only (asymptotically) stable solution for Equation (4.2.14),
and the stochastic dynamics in Equation (4.2.13) are expected
to approach this solution.
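As a numerical sanity check on this equilibrium analysis, the sketch below (our own Python illustration; the synthetic data and all variable names are assumptions, not the book's) estimates **C** and **P** from data, computes **w*** = **C**-1**P**, and verifies that the stochastic LMS iteration settles near it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: m samples of x in R^n with linear targets d = a.x + noise
n, m = 5, 500
X = rng.normal(size=(m, n))
a_true = rng.normal(size=n)
d = X @ a_true + 0.1 * rng.normal(size=m)

# Autocorrelation matrix C = <x x^T> and cross-correlation vector P = <d x>
C = X.T @ X / m
P = X.T @ d / m
w_star = np.linalg.solve(C, P)          # equilibrium w* = C^{-1} P

# Stochastic LMS iteration: w <- w + rho * (d_k - w.x_k) * x_k
w = np.zeros(n)
rho = 0.01
for epoch in range(200):
    for k in rng.permutation(m):
        w += rho * (d[k] - w @ X[k]) * X[k]

print(np.linalg.norm(w - w_star))       # small: the iterates hover near w*
```

With a small constant learning rate the iterates do not converge exactly but fluctuate in a neighborhood of **w***, as the averaging analysis predicts.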

Finally, note that with the decay term set to zero, Equation
(4.2.4) leads to

or

(4.2.17)

which is the instantaneous SSE (or MSE) criterion
function.

Hebbian Learning

Here, upon setting *r*(**w**, **x**, *z*)
= *y* = **w**T**x**,
Equation (4.2.2) gives the Hebbian rule with decay

(4.2.18)

whose average is

(4.2.19)

Setting d**w**/d*t* = **0** in Equation (4.2.19)
leads to the equilibria

(4.2.20)

So if **C** happens to have
the decay-to-gain ratio as one of its eigenvalues, then **w***
will be an eigenvector of **C** corresponding to that eigenvalue.
In general, though, this ratio will not be an eigenvalue
of **C**, so Equation (4.2.19) will have only one equilibrium
at **w*** = **0**.
This equilibrium solution is asymptotically stable if
the decay-to-gain ratio is greater than the largest eigenvalue of **C**, since this
makes the Hessian

(4.2.21)

positive definite. Now, employing Equation (4.2.4)
we get the instantaneous criterion function minimized by the Hebbian
rule in Equation (4.2.18):

(4.2.22)

The regularization (decay) term used here
is not adequate to stabilize the Hebbian rule at a solution
which maximizes *y*2.
However, other, more appropriate regularization terms can ensure
stability, as we will see in the next section.
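The two regimes of the Hebbian rule with decay can be illustrated with a small simulation (a hedged sketch; `rho` and `gamma` are generic stand-ins for the book's learning gain and decay coefficient): when the decay dominates the gain times the largest eigenvalue of **C**, the weights collapse to the equilibrium **w*** = **0**; when it is too weak, the weights diverge.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training vectors with one dominant direction
X = rng.normal(size=(200, 4)) * np.array([3.0, 1.0, 0.5, 0.2])
C = X.T @ X / len(X)
lam_max = np.linalg.eigvalsh(C)[-1]     # largest eigenvalue of C (about 9)

def hebb_decay(gamma, rho=0.01, steps=4000):
    """Discrete Hebbian rule with weight decay: w += rho*y*x - gamma*w."""
    w = rng.normal(size=4)
    for t in range(steps):
        x = X[t % len(X)]
        y = w @ x
        w += rho * y * x - gamma * w
    return w

w_strong = hebb_decay(gamma=0.2)              # decay > rho*lam_max: w -> 0
w_weak   = hebb_decay(gamma=0.01, steps=200)  # decay too weak: ||w|| blows up
print(np.linalg.norm(w_strong), np.linalg.norm(w_weak))
```

This matches the analysis: the only equilibrium is **w*** = **0**, and it is stable only when the decay exceeds the scaled largest eigenvalue of **C**.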

**4.3 Characterization of Additional Learning Rules**

from which we may determine the average instantaneous
criterion function

**4.3.2 Improved Hebbian Learning**

Consider the criterion function

which can be shown to be the average version of the
nonlinear stochastic learning rule

The corresponding average learning equation is thus
(Oja, 1982; 1989)

which has its equilibria at **w** satisfying

Figure 4.3.1. (a) Evolution of the cosine of the
angle between the weight vector **w***k*
and the principal eigenvector of the autocorrelation matrix **C**
for the stochastic Oja rule (solid line) and for the average Oja
rule (dashed line). (b) Evolution of the magnitude of the weight
vector. The training set consists of forty 15-dimensional real-valued
vectors whose components are independently generated according
to a normal distribution N(0, 1). The presentation order
of the training vectors is random during training.

The above simulation is repeated, but with a fixed
presentation order of the training set. Results are shown in
Figures 4.3.2 (a) and (b). Note that the results for the average
learning equation (dashed line) are identical in both simulations
since they are not affected by the order of presentation of input
vectors. These simulations agree with the theoretical results
on the appropriateness of using the average learning equation
to approximate the limiting behavior of its corresponding stochastic
learning equation. Note that a monotonically decreasing learning
rate (decaying, say, in proportion to the reciprocal of the iteration
number *k*) can be used to force the convergence
of the direction of **w***k*
in the first simulation. It is also interesting to note that
better approximations are possible when the training vectors are
presented in a fixed deterministic order (or in a random order,
but with each vector guaranteed to be selected once every training
cycle of *m* = 40 presentations). Here, a sufficiently
small constant learning rate makes the average
dynamics approximate, in a practical sense, the stochastic dynamics
for all time.
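A minimal version of the experiment just described can be reproduced as follows (our sketch, not the book's code; the learning rate and epoch count are arbitrary choices): the stochastic Oja rule converges in direction to the principal eigenvector of **C**, with the weight norm approaching 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Forty 15-dimensional N(0, 1) training vectors, as in the figure
X = rng.normal(size=(40, 15))
C = X.T @ X / 40
c1 = np.linalg.eigh(C)[1][:, -1]         # principal eigenvector of C

# Stochastic Oja rule: w <- w + rho * y * (x - y * w), with y = w.x
w = rng.normal(size=15)
w /= np.linalg.norm(w)
rho = 0.002
for epoch in range(3000):
    for k in rng.permutation(40):        # random presentation order
        x = X[k]
        y = w @ x
        w += rho * y * (x - y * w)

cos_angle = abs(w @ c1) / np.linalg.norm(w)
print(cos_angle, np.linalg.norm(w))      # both approach 1
```

Tracking `cos_angle` and `np.linalg.norm(w)` over the epochs reproduces the qualitative behavior of panels (a) and (b) of the figures.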

Figure 4.3.2. (a) Evolution of the cosine of the
angle between the weight vector **w***k*
and the principal eigenvector of the autocorrelation matrix **C**
for the stochastic Oja rule (solid line) and for the average Oja
rule (dashed line). (b) Evolution of the magnitude of the weight
vector. The training set consists of forty 15-dimensional real-valued
vectors whose components are independently generated according
to a normal distribution N(0, 1). The presentation order
of the training vectors is fixed.

The continuous-time version of the Yuille et al.
(1989) learning rule is

and the corresponding average learning equation is

From Equation (4.3.18), the gradient of *J*
is

Note that one could have also computed **H** directly
from the known criterion function

Now, evaluating **H**(**w***i*)**w***j*
we get

In the following, the unsupervised Hebbian-type
learning rule

Next, we investigate the stability of these equilibria.
Starting from the Hessian

Note that the weight vector
in Equation (4.3.33) need not be normalized to prevent divergence.

**4.4 Principal Component Analysis (PCA)**

With the sequential update assumption, Equation
(4.4.3) becomes

which has equilibria satisfying
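As a hedged illustration of multi-unit PCA learning, the sketch below implements Sanger's generalized Hebbian algorithm, one representative sequential-update PCA rule (not necessarily identical to the rule in Equation (4.4.3); all constants are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Data with a clear covariance structure (distinct principal directions)
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])
C = X.T @ X / len(X)
true_pcs = np.linalg.eigh(C)[1][:, ::-1]   # columns: eigenvectors, descending

# Sanger's rule: dW_i = rho * y_i * (x - sum_{j<=i} y_j W_j)
W = rng.normal(size=(3, 6)) * 0.1          # 3 units extract the top-3 PCs
rho = 0.002
for epoch in range(300):
    for x in X[rng.permutation(len(X))]:
        y = W @ x
        for i in range(3):
            W[i] += rho * y[i] * (x - y[:i + 1] @ W[:i + 1])

for i in range(3):
    # cosine between unit i's weight vector and the i-th principal component
    print(abs(W[i] @ true_pcs[:, i]) / np.linalg.norm(W[i]))
```

Each unit's weight vector aligns with one eigenvector of **C**, in descending eigenvalue order, which is the equilibrium structure this section's analysis characterizes.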

**4.5 Theory of Reinforcement Learning**

from which the average learning equation is given
as

and then evaluate its gradient with respect to **w**.
This gradient is given by

which follows from Equation (4.5.4). We also have

which can be used in Equation (4.5.7) to give

If we now take the gradient of Equation (4.5.6) and
use (4.5.10), we arrive at

Finally, by averaging Equation (4.5.12) over all
inputs **x***k*,
we get

**4.6 Theory of Simple Competitive Learning**

Also, consider the criterion function (Ritter and
Schulten, 1988a)

Now, performing gradient descent on *J* in Equation
(4.6.4) yields

which implies that at equilibrium

Therefore, the *j*th component of the vector **w***i*
is given as
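The equilibrium interpretation here, that each weight vector settles at the centroid of the inputs its unit wins, can be checked with a toy simulation (our sketch; the symmetric deterministic initialization is chosen to avoid dead units):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two well-separated Gaussian clusters in R^2
c_a, c_b = np.array([3.0, 0.0]), np.array([-3.0, 0.0])
X = np.vstack([c_a + 0.3 * rng.normal(size=(200, 2)),
               c_b + 0.3 * rng.normal(size=(200, 2))])

# Simple competitive rule: move only the winning weight vector toward x
W = np.array([[1.0, 0.0], [-1.0, 0.0]])   # one unit on each side
rho = 0.05
for epoch in range(100):
    for x in X[rng.permutation(len(X))]:
        i = np.argmin(np.linalg.norm(W - x, axis=1))   # winner
        W[i] += rho * (x - W[i])

print(W)   # each row sits near the centroid of the cluster it wins
```

At equilibrium each **w***i* hovers (up to learning-rate fluctuations) at the mean of the inputs in its Voronoi cell, consistent with the equilibrium condition above.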

**4.7.1 Characterization of Kohonen's Feature Map**

Consider the criterion function *J*(**w**)
defined by (Ritter and Schulten, 1988a)

where **r*** = **r***(**x**)
is the coordinate vector of the winning unit upon the presentation
of input **x**.

**4.7.2 Self-Organizing Neural Fields**

The dynamics of the neural field potential *u*(**r**,
*t*) are given by

(4.7.8)

where

(4.7.9)

and *h* is a constant bias field. In Equation
(4.7.8), it is assumed that the potential *u*(**r**, *t*)
decays, with a fixed time constant, to the resting potential *h* in
the absence of any stimulation. It is also assumed that this
potential increases in proportion to the total stimulus *s*(**r**, **x**),
which is the sum of the lateral stimuli
and the input stimulus **w**T**x**
due to the input signal **x** in R*n*.
A conceptual diagram for the above neural field is shown in Figure
4.7.2.

In Equation (4.7.8), the rates of change of the weights
**w** (and of any other slowly varying parameters) are assumed to be much slower than that of the
neural field potential. The input signal (pattern) **x** is
a random time sequence, and it is assumed that a pattern **x**
is chosen according to the probability density *p*(**x**).
Also, we assume that inputs are applied to the neural field for
a duration which is longer than the time constant of the
neural field potential. On the other hand, the duration of a stimulus
**x** is assumed to be much shorter than the time constant of the weights **w**. Thus, the potential distribution *u*(**r**,
*t*) can be considered to change in a quasi-equilibrium manner
denoted by *u*(**r**, **x**).

An initial excitation pattern applied to the neural
field changes according to the dynamics given in Equation (4.7.8),
and eventually converges to one of the equilibrium solutions.
Stable equilibrium solutions are the potential fields *u*(**r**)
which the neural field can retain persistently under a constant
input **x**. The equilibria of Equation (4.7.8) must satisfy
d*u*/d*t* = 0, or

(4.7.10)

where *s*(**r**, *u**)
is the total stimulus at equilibrium. When the lateral connection
distribution over (**r** - **r**') is strongly off-surround inhibitory,
then, for any given **x**, only a local excitation pattern is aroused
as a stable equilibrium which satisfies Equation (4.7.10) (Amari,
1990). Here, a local excitation is a pattern in which the excitation
is concentrated on units in a small local region; i.e., *u**(**r**)
is positive only in a small neighborhood centered at a maximally
excited unit **r**0.
Thus *u**(**r**, **x**) represents a mapping from the input space onto the neural
field.
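The formation of a stable local excitation can be illustrated with a discretized one-dimensional neural field (a rough sketch with assumed kernel shapes and constants, not Amari's exact parameterization): short-range excitation with long-range inhibition produces a single bump of activity centered on the most strongly stimulated unit.

```python
import numpy as np

# 1-D neural field discretized on N grid points
N = 100
r = np.linspace(-5, 5, N)
dr = r[1] - r[0]

# Mexican-hat lateral kernel: short-range excitation, long-range inhibition
d = r[:, None] - r[None, :]
omega = 2.0 * np.exp(-d**2 / 0.5) - 1.0 * np.exp(-d**2 / 8.0)

h = -0.5                                  # constant bias (resting) field
s = 1.5 * np.exp(-(r - 1.0)**2)           # input stimulus peaked at r = 1
f = lambda u: (u > 0).astype(float)       # step activation

# Relax du/dt = (-u + lateral input + stimulus + h) / tau by Euler steps
u = np.zeros(N)
tau, dt = 1.0, 0.05
for _ in range(2000):
    u += dt / tau * (-u + omega @ f(u) * dr + s + h)

active = r[u > 0]
print(active.min(), active.max())         # a local excitation around r = 1
```

The equilibrium field is positive only on a finite interval around the maximally stimulated unit, i.e., an *a*-solution rather than the 0- or all-excited solution.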

Let us now look at the dynamics of the self-organizing
process. We start by assuming a particular update rule for the
input weights of the neural field. One biologically plausible
update rule is the Hebbian rule:

(4.7.11)

where the driving activity is the neural field's
equilibrium output due to input **x**. In Equation
(4.7.11), we use the earlier assumption that the weights evolve
much more slowly than the field potential. Next, we assume
strong mixing in Equation (4.7.11), which allows us to write an
expression for the average learning equation (absorbing the weight
time constant into the learning rate) as

(4.7.12)

where the averaging is over all possible **x**.
The equilibria of Equation (4.7.12) are given by

(4.7.13)

If we now transpose Equation (4.7.12) and multiply
it by an arbitrary input vector **x** we arrive at an equation
for the change in input stimuli

(4.7.14)

The vector inner product **x**T
**x**' represents the similarity of two input signals **x**
and **x**' and hence the topology of the signal space (Takeuchi
and Amari, 1979). Note how Equation (4.7.14) relates the topology
of the input stimulus set {**x**} with that of the neural field.

On the other hand, if one assumes a learning rule
where a unit **r** updates its input weight vector in proportion
to the correlation of its equilibrium potential *u**(**r**,
**x**) and the difference **x** - **w**(**r**), one
arrives at the average differential equation

(4.7.15)

This learning rule is equivalent to the averaged
continuum version of Kohonen's self-organizing feature map in
Equation (4.7.2), if one views the potential distribution *u**(**r**,
**x**) as the weighting neighborhood function. Here, self-organization
will emerge if the dynamics of the potential field evolve such
that the quasi-stable equilibrium potential *u**
starts positive for all **r**, then monotonically and slowly
shrinks in diameter for positive time. This may be accomplished
through proper control of the bias field *h*, as described
below.

In general, it is difficult to solve Equations (4.7.8)
and (4.7.12) [or (4.7.15)]. However, some properties of the formation
of feature maps are revealed from these equations for the special,
but revealing, case of a one-dimensional neural field (Takeuchi
and Amari, 1979; Amari, 1980 and 1983). The dynamics of the potential
field in Equation (4.7.8) for a one-dimensional neural field were
analyzed in detail by Amari (1977b) and by Kishimoto and Amari (1979),
for a step activation function and for a continuous monotonically
non-decreasing activation function, respectively. It was shown
that, with the lateral connection distribution of Figure 4.7.1 and
*f*(*u*) = step(*u*), there exist stable equilibrium
solutions *u**(*r*)
for the **x** = 0 case. The 0-solution potential
field, with *u**(*r*) nonpositive everywhere,
and the ∞-solution field, with *u**(*r*) positive everywhere,
are among these stable solutions. The 0-solution is stable
if and only if *h* < 0. On the other hand,
the ∞-solution is stable if and only if *h* > -2*A*(∞),
with *A*(*a*) as defined
below. Local excitations (also known as *a*-solutions), where
*u**(*r*) is
positive only over a finite interval [*a*1,
*a*2] of the neural
field, are also possible. An *a*-solution exists if and only
if *h* + *A*(*a*) = 0. Here,
*A*(*a*) is the definite integral defined by

(4.7.16)

and plotted in Figure 4.7.3. Amari (see also Krekelberg
and Kok, 1993) also showed that a single *a*-solution can
exist for the case of a non-zero input stimulus, and that the
corresponding active region of the neural field is centered at
the unit *r* receiving the maximum input. Furthermore, the
width, *a*, of this active region is a monotonically decreasing
function of the bias field *h*. Thus, one may exploit the
fact that the field potential/neighborhood function *u**(*r*, **x**)
is controlled by the bias field *h* in order to control the
convergence of the self-organizing process. Here, the uniform
bias field *h* is started at a positive value *h* > -2*A*()
and is slowly decreased towards negative values. This, in turn,
causes *u** to start
at the -solution, then gradually move through *a*-solutions
with decreasing width *a* until, ultimately, *u**
becomes the 0-solution. For further analysis of the self-organizing
process in a neural field, the reader is referred to Zhang (1991).

Kohonen (1993a and 1993b) proposed a self-organizing
map model for which he gives physiological justification. The
model is similar to Amari's self-organizing neural field except
that it uses a discrete two-dimensional array of units. The model
assumes sharp self-on, off-surround lateral interconnections, so
that the neural activity of the map is stabilized such that the unit
receiving the maximum excitation becomes active and all other
units are inactive. Kohonen's model employs unit potential dynamics
similar to those of Equation (4.7.8). On the other hand, Kohonen
uses a learning equation more complex than those in Equations
(4.7.12) and (4.7.15). This equation is given for the *i*th
unit weight vector by the pseudo-Hebbian learning rule

(4.7.17)

where the gain is a positive constant. The neighborhood term
in Equation (4.7.17) models a natural "transient" neighborhood
function. It represents a weighted sum of the output activities
*y**l* of nearby
units, which describes the strength of the diffuse chemical effect
of cell *l* on cell *i*; *h**il*
is a function of the distance between these units. This weighted sum
of output activities replaces the output activity of the same
cell, *y**i*,
in Hebb's learning rule. On the other hand, the second
term acts as a stabilizing term which models "forgetting"
effects, or disturbances due to adjacent synapses. Typically,
forgetting effects in *w**i*
are proportional to the weight *w**i*
itself. In addition, if the disturbance caused by synaptic site
*r* is mediated through the postsynaptic potential, the
forgetting effect must further be proportional to *w**ir* *x**r*.
This phenomenon is modeled by the weighted sum in the
forgetting term. Here, the summation is taken over a subset of
the synapses of unit *i* that are located near the *j*th
synapse *w**ij*
and that approximately act as one collectively interacting set. The
major difference between the learning rules in Equation (4.7.17)
and (4.7.12) is that, in the former, the "neighborhood"
function is determined by a "transient" activity due
to a diffusive chemical effect of nearby cell potentials, whereas
it is determined by a stable region of neural field potential
in the latter.

Under the assumption that the index *r* ranges
over all components of the input signal **x**, and regarding
the forgetting factor as a positive scalar independent of **w**
and **x**, the vector form of Equation (4.7.17) takes the Riccati
differential equation form

(4.7.18)

where the effective gain is a positive constant. Now, multiplying
both sides of Equation (4.7.18) by 2**w**T
leads to the differential equation

(4.7.19)

Thus, for arbitrary **x** with **w**T**x** > 0,
Equation (4.7.19) drives the squared norm of **w** to a constant value. On the other hand, the solution
for the direction of **w***
cannot be determined in closed form from the deterministic differential
Equation (4.7.18). However, a solution for the expected value
of **w** may be found if Equation (4.7.18) is treated as a
stochastic differential equation with strong mixing, in accordance
with the discussion of Section 4.3. Taking the expected value of
both sides of Equation (4.7.18) and solving for its equilibrium
points (by setting the averaged derivative to zero) gives

(4.7.20)

Furthermore, this equilibrium point can be shown
to be stable (Kohonen, 1989).

From the above analysis, it can be concluded that
the synaptic weight vector **w** is automatically normalized
to a fixed length, independent of the input signal **x**, and
that **w** rotates such that its average direction is aligned
with the mean of **x**. This is the expected result of a self-organizing
map when a uniform nondecreasing neighborhood function is used.
In general, though, the neighborhood term is nonuniform and
time varying, which makes the analysis of Equation (4.7.17) much
more difficult.
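The norm-normalization and mean-alignment conclusions can be checked by integrating a discretized version of the Riccati equation (a sketch with assumed constants; here `alpha` and `beta` stand in for the gain and forgetting coefficients, so the predicted length is sqrt(alpha/beta)):

```python
import numpy as np

rng = np.random.default_rng(7)

alpha, beta = 2.0, 0.5                     # predicted ||w*|| = sqrt(4) = 2
x_mean = np.array([1.0, 2.0, 2.0]) / 3.0   # unit-norm mean input direction

# Euler integration of dw/dt = alpha*x - beta*(w.x)*w with noisy inputs
w = np.array([0.5, 0.1, 0.1])
dt = 0.01
for _ in range(20000):
    x = x_mean + 0.1 * rng.normal(size=3)
    w += dt * (alpha * x - beta * (w @ x) * w)

print(np.linalg.norm(w))                   # close to sqrt(alpha/beta) = 2
print(w @ x_mean / np.linalg.norm(w))      # direction aligned with mean of x
```

The weight length settles at the value fixed by the gain-to-forgetting ratio regardless of the inputs, while the direction tracks the mean input, as stated above.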

**4.8.1 Generalization Capabilities of Deterministic
Networks**

Each time an example {**x***k*, *f**d*(**x***k*)}
of a desired function *f**d*
is presented and is successfully learned (supervised learning
is assumed), the weight vector **w** is modified so that it
enters the region of weight space that is compatible with the
presented example. If *m* examples are learned, then the
volume of this region is given by

(4.8.3)

where

The region *V**m*
represents the total volume of weight space which realizes the
desired function *f**d*,
as well as all other functions *f*
that agree with *f**d*
on the training set. Thus, if
a new input is presented to the trained network, it will be ambiguous
with respect to a number of functions represented by *V**m*
(recall the discussion on ambiguity for the simple case of a single
threshold gate in Section 1.5). As the number of learned examples
*m* is increased, the expected ambiguity decreases.

Next, the volume of weight space consistent with
both the training examples and a "particular" function
*f* is given by

(4.8.4)

where Equation (4.8.2) was used. Note that *f***w**
in *I* has been replaced by *f* and that the product
term factors outside of the integral. Now, assuming independent
input vectors **x***k*
generated randomly from distribution *p*(**x**), the factors
*I*(*f*, **x***k*)
in Equation (4.8.4) are independent. Thus, averaging *V**m*(*f *)
over all **x***k*
gives

(4.8.5)

The quantity *g*(*f *) takes on values
between 0 and 1. It is referred to as the generalization ability
of *f *; i.e., *g*(*f *) may be viewed
as the probability that *f* agrees with *f**d*
on an input **x** randomly chosen
from *p*(**x**). As an example, for a completely specified
*n*-input Boolean function *f**d*,
*g*(*f *) is given by (assuming that all 2*n*
inputs are equally likely)

*g*(*f *) = 1 - *d**H*(*f*, *f**d*) / 2*n*

where *d**H*(*f*,
*f**d*) is the
number of bits by which *f* and *f**d*
differ.

Let us define the probability *P**m*(*f *)
that a particular function *f* can be implemented after training
on *m* examples of *f**d*.
This probability is equal to the average fraction of the remaining
weight space that *f* occupies:

(4.8.6)

The approximation in Equation (4.8.6) is based on
the assumption that *V**m*
does not vary much with the particular training sequence; i.e.,
*V**m* is approximately the same
for each probable sequence. This assumption is expected to be
valid as long as *m* is small compared to the total number
of possible input combinations.

Good generalization requires that *P**m*(*f *)
be small. Let us use Equation (4.8.6) to compute the distribution
of generalization ability *g*(*f *) across all
possible *f *'s after successful training with *m*
examples:

(4.8.7)

Note that an exact distribution ρ*m*(*g*)
can be derived by dividing the right-hand side of Equation (4.8.7)
by its normalizing factor, the *m*th moment of ρ0(*g*). The above result is interesting since it allows us to compute,
before learning, the distribution of generalization ability after
training with *m* examples. The form of Equation (4.8.7)
shows that the distribution ρ*m*(*g*)
tends to get concentrated at higher and higher values of *g*
as more and more examples are learned. Thus, during learning,
although the allowed volume of weight (or function) space shrinks,
the remaining regions tend to have large generalization ability.
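This concentration effect is easy to see numerically: reweighting samples drawn from a prior distribution of generalization abilities by *g**m* (a sketch with an assumed uniform prior; the names are ours) drives the mean generalization ability toward 1 as *m* grows.

```python
import numpy as np

rng = np.random.default_rng(8)

# Samples from an assumed prior rho_0(g) of generalization abilities
g = rng.uniform(0.0, 1.0, size=200000)

def mean_g_after(m):
    """Mean of g under rho_m(g) ~ g^m * rho_0(g), i.e. <g^{m+1}> / <g^m>."""
    w = g ** m
    return np.sum(w * g) / np.sum(w)

for m in [0, 1, 10, 100]:
    print(m, mean_g_after(m))   # climbs toward 1 as m grows
```

For this uniform prior the mean works out to (*m* + 1)/(*m* + 2), so the surviving functions become overwhelmingly the well-generalizing ones.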

Another useful measure of generalization is the
average generalization ability *G*(*m*) given by

(4.8.8)

which is the ratio between the (*m* + 1)th
and the *m*th moments of ρ0(*g*),
and can be computed if ρ0(*g*)
is given or estimated. *G*(*m*) gives the entire "learning
curve"; i.e., it gives the average expected success rate
as a function of *m*. Equation (4.8.8) allows us to predict
the number of examples, *m*, necessary to train the network
to a desired average generalization performance. We may also
define the average prediction error as 1 - *G*(*m*).
The asymptotic behavior (*m* → ∞) of the average
prediction error is determined by the form of the initial distribution
ρ0(*g*) near *g* = 1.
If there is a finite gap between *g* = 1 and the next
highest *g* for which ρ0(*g*)
is nonzero, then the prediction error decays to zero exponentially
in *m*. If, on the other hand, there is no such gap in ρ0(*g*),
then the prediction error decays as 1/*m*. These two behaviors of
the learning curve have also been verified through numerical experiments.
The nature of the gap in the distribution of generalizations
near the region of perfect generalization (*g* = 1)
is not completely understood. These gaps have been detected in
experiments involving the learning of binary mappings (Cohn and
Tesauro, 1991 and 1992). It is speculated that such a gap could
be due to the dynamic effects of the learning process where the
learning algorithm may, for some reason, avoid the observed near-perfect
solutions. Another possibility is that the gap is inherent in
the nature of the binary mappings themselves.

The above approach, though theoretically interesting,
is of little practical use for estimating *m*, since it requires
the knowledge of the distribution 0(*g*),
whose estimation is computationally expensive. It also gives
results that are only valid in an average sense, and does not
necessarily represent the typical situation encountered in a specific
training scheme.

Next, we summarize a result that tells us about
the generalization ability of a deterministic feedforward neural
network in the "worst" case. Here, also, the case of
learning a binary-valued output function *f* : R*n* → {0, 1}
is treated. Consider a set of *m* labeled training example
pairs (**x**, *y*) selected randomly from some arbitrary
probability distribution *p*(**x**, *y*), with **x**
in R*n* and *y* = *f*(**x**)
in {0, 1}. Also, consider a single-hidden-layer feedforward neural
net with *k* LTG's and *d* weights that has been trained
on the *m* examples so that at least a fraction 1 - ε/2
of the examples are correctly classified, where ε is a small positive
error parameter. Then, with a probability
approaching 1, this network will correctly classify a fraction
1 - ε of future random test examples drawn from *p*(**x**,
*y*), as long as (Baum and Haussler, 1989)

(4.8.9)

Ignoring the log term, we may write Equation (4.8.9),
to a first-order approximation, as

(4.8.10)

which requires *m* >> *d* for good generalization.
It is interesting to note that this is the same condition for
"good" generalization (low ambiguity) for a single LTG
derived by Cover (1965) (refer to Section 1.5) and obtained empirically
by Widrow (1987). Therefore, one may note that in the limit of
large *m* the architecture of the network is not important
in determining the worst case generalization behavior; what matters
is the ratio of the number of degrees of freedom (weights) to
the training set size. On the other hand, none of the above theories
may hold for the case of a small training set. In this latter
case, the size and architecture of the network and the learning
scheme all play a role in determining generalization quality (see
the next chapter for more details). It should also be noted that
the architecture of the net can play an important role in determining
the speed of convergence of a given class of learning methods,
as discussed later in Section 4.9.
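Reading Equation (4.8.10) as *m* on the order of *d*/ε (a hedged first-order rule of thumb consistent with the Baum-Haussler result; constants and the log factor are ignored, and the function name is ours) gives a quick way to size a training set:

```python
# Rule-of-thumb reading of Equation (4.8.10): roughly m ~ d / eps training
# examples for test error eps with d adjustable weights (log factors ignored).
def required_examples(d, eps):
    """Approximate number of examples for error rate eps with d weights."""
    return int(d / eps)

# Example: a net with 1000 weights trained to 90% test accuracy (eps = 0.1)
print(required_examples(1000, 0.1))
```

So a network with a thousand weights would need on the order of ten thousand examples for 90% worst-case test accuracy, regardless of its architecture.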

Similar results for worst case generalization are
reported in Blumer et al. (1989). A more general learning curve
based on statistical physics and VC dimension theories (Vapnik
and Chervonenkis, 1971) which applies to a general class of networks
can be found in Haussler et al. (1992). For generalization results
with noisy target signals the reader is referred to Amari et al.
(1992).

**4.8.2 Generalization in Stochastic Networks **

where

(4.8.12)

and *g*(**x**, **w**) may be considered
as a smooth deterministic function (e.g., superposition of multivariate
sigmoid functions typically employed in layered neural nets).
Thus, in this example, the stochastic nature of the machine is
determined by its stochastic output unit.

Assume that there exists a true machine that can
be faithfully represented by one of the above family of stochastic
machines with parameter **w**0.
The true machine receives inputs **x***k*,
*k* = 1, 2, ..., *m*, which are randomly generated according
to a fixed but unknown probability distribution *p*(**x**),
and emits *y**k*.
The machine characterized by the maximum likelihood estimate
of the parameter vector (refer to Section 3.1.5 for a definition)
will be our first candidate
machine. This machine predicts *y* for a given **x**
with the probability *P*(*y* | **x**, **w**) evaluated at
the estimated parameters. An entropic
loss function is used to evaluate the generalization of a trained
machine for a new example (**x***m*+1, *y**m*+1).

Let *H**gen*
be the average predictive entropy (also known as the average entropic
loss) of the trained machine, evaluated at the estimated parameters,
for a new example
(**x***m*+1, *y**m*+1):

(4.8.13)

Similarly, we define *H**train*
as the average entropic loss over the training examples used to
obtain the parameter estimate

(4.8.14)

Finally, let *H*0
be the average entropic error of the true machine

(4.8.15)

Amari and Murata proved the following theorem for
training and generalization error.

**Theorem 4.8.1 (Amari and Murata, 1993)**:
The asymptotic learning curve for the entropic training error
is given by

(4.8.16)

*and for the entropic generalization error by*

(4.8.17)

The proof of Theorem 4.8.1 uses standard techniques
of asymptotic statistics and is omitted here. (The reader is
referred to the original paper by Amari and Murata (1993) for
the proof.)

In general, *H*0
is unknown and it can be eliminated from Equation (4.8.17) by
substituting its value from Equation (4.8.16). This gives

(4.8.18)

which shows that, for a faithful stochastic machine
and in the limit *m* >> *d*, the generalization error
approaches the error of the trained machine on the *m* examples, which
from Equation (4.8.16) is the classification error *H*0
of the true machine.
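The displays referenced above are not reproduced here; a hedged reconstruction of the standard Amari-Murata asymptotics, consistent with the surrounding *d*/*m* discussion (angle brackets denote averaging over training sets), is:

```latex
\langle H_{\mathrm{train}}(m)\rangle \simeq H_0 - \frac{d}{2m},
\qquad
\langle H_{\mathrm{gen}}(m)\rangle \simeq H_0 + \frac{d}{2m},
\qquad
\langle H_{\mathrm{gen}}(m)\rangle \simeq \langle H_{\mathrm{train}}(m)\rangle + \frac{d}{m}.
```

Eliminating *H*0 between the first two expressions yields the third, which is the relation used in the argument above.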

Again, the particular network architecture is of
no importance here as long as it allows for a faithful realization
of the true machine and *m* >> *d*.
It is interesting to note that this result is similar to the
worst case learning curve for deterministic machines [Equation
(4.8.10)], when the training error is zero. The result is also
in agreement with Cover's result on classifier ambiguity in Equation
(1.5.3), where the *d*/*m* term in Equation (4.8.18) may be viewed as
the probability of an ambiguous response on the (*m* + 1)th
input. In fact, Amari (1993) also proved that the average predictive
entropy *H**gen*(*m*)
for a general deterministic dichotomy machine (e.g., a feedforward
neural net classifier) converges to 0 in proportion to 1/*m* in the limit of large *m*.

Figure 4.9.1. A layered architecture consisting of
a fixed preprocessing layer *D* followed by an adaptive LTG.

The efficiency of learning linearly separable classification
tasks in a single threshold gate should not be surprising. We
may recall from Chapter One that the average amount of necessary
and sufficient information for the characterization of the set
of separating surfaces for a random, separable dichotomy of *m*
points grows slowly with *m* and asymptotically approaches
2*d* (twice the number of degrees of freedom of the class
of separating surfaces). This implies that for a random set of
linear inequalities in *d* unknowns, the expected number
of extreme inequalities which are necessary and sufficient to
cover the whole set, tends to 2*d* as the number of consistent
inequalities tends to infinity, thus bounding the (expected) necessary
number of training examples for learning algorithms in separable
problems. Moreover, this limit of 2*d* consistent inequalities
is within the learning capacity of a single *d*-input LTG.

Another intuitive reason that the network in Figure
4.9.1 is easier to train than a fully adaptive two-layer feedforward
net is that we are giving it predefined nonlinearities. The former
net does not have to start from scratch, but instead is given
more powerful building blocks to work with. However, there is
a trade-off. By using the network in Figure 4.9.1, we gain in
a worst-case computational sense, but lose in that the number
of weights increases from *n* to O(*n*2)
or higher. This increase in the number of weights implies that
the number of training examples must increase so that the network
can meaningfully generalize on new examples (recall the results
of the previous section).

The problem of NP-complete learning in multilayer
neural networks may be attributed to the use of fixed network
resources (Baum, 1989). Learning an arbitrary mapping can be
achieved in polynomial time for a network that allocates new computational
units as more patterns are learned. Mukhopadhyay et al. (1993)
gave a polynomial-time training algorithm for the general class
of classification problems (defined by mappings of the form )
based on clustering and linear programming models. This algorithm
simultaneously designs and trains an appropriate network for a
given classification task. The basic idea of this method is to
cover class regions with a minimal number of dynamically allocated
hyperquadratic volumes (e.g., hyperspheres) of varying size.
The resulting network has a layered structure consisting of a
simple fixed preprocessing layer, a hidden layer of LTG's, and
an output layer of logical OR gates. This and other efficiently
trainable nets are considered in detail in Section 6.3.

for the equilibrium point **x*** = [0 1]T.
Is this an asymptotically stable point? Why?

**4.2.3** Study the stability
of the lossless pendulum system with the nonlinear dynamics

about the equilibrium points **x*** = [0 0]T
and **x*** = [π 0]T.
Here, the state variable is the angle of the pendulum with respect to its
vertical rest position, *g* is gravitational acceleration,
and *L* is the length of the pendulum.

**4.2.4** Liapunov's first
method for studying the stability of nonlinear dynamical systems
(see footnote #4) is equivalent to studying the asymptotic stability
of a linearized version of these systems about an equilibrium
point. Linearize the second order nonlinear system in Problem
4.2.2 about the equilibrium point **x*** = [0 1]T
and write the system equations in the form d**x**/d*t* = **Ax**. Show that the system
matrix **A** is identical to the Jacobian matrix **f** '(**x**)
at **x** = [0 1]T
of the original nonlinear system and thus both matrices have the
same eigenvalues. (Note that the asymptotic stability of a linear
system requires the eigenvalues of its system matrix **A**
to have strictly negative real parts).

**4.2.5** The linearization
method for studying the stability of a nonlinear system at a given
equilibrium point may fail when the linearized system is stable,
but not asymptotically; that is, if all eigenvalues of the system
matrix **A** have nonpositive real parts and one or more
of them have zero real part. Demonstrate
this fact for the nonlinear system

at the equilibrium point **x*** = [0, 0]T.
(Hint: Simulate the above dynamical system for the three cases:
*a* < 0, *a* = 0, and *a* > 0,
for an initial condition **x**(0) of your choice).

* **4.3.1** Show that the
Hessian of *J* in Equation (4.3.6) is given by

Also, show that

* **4.3.2** Study the stability
of the equilibrium point(s) of the dynamical system , where *J*(**w**)
is given by

Note that the above system corresponds to the average learning equation of Linsker's learning rule (see Section 3.3.4) without weight clipping.

**4.3.3** Verify the Hessian
matrices given in Section 4.3 for Oja's, Yuille et al., and Hassoun's
learning rules, given in Equations (4.3.14), (4.3.21), and (4.3.30),
respectively.

**4.3.4** Show that the average
learning equation for Hassoun's rule is given by Equation (4.3.26).

**4.3.5** Study the stability
of the equilibrium points of the stochastic differential equation/learning
rule (Riedel and Schild, 1992)

where the exponent appearing in the rule is a positive integer. (Note: this rule
is equivalent to the Yuille et al. rule when this exponent equals 2.)

† **4.3.6** Study, via
numerical simulations, the stability of the learning rule (Riedel
and Schild, 1992)

Assume a training set {**x**}
of twenty vectors in R^10
whose components are generated randomly and independently according
to the normal distribution N(0, 1). Is there a relation between
the stable point(s) **w***
(if such point(s) exist) and the eigenvectors of the input data
autocorrelation matrix? Is this learning rule local? Why?
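A data-generation and eigen-analysis scaffold for this experiment might look as follows (the Riedel-Schild rule itself is not reproduced here; `alignment` is a hypothetical helper for comparing a converged weight vector against the principal eigenvector):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))        # 20 training vectors in R^10, N(0, 1)

# Sample autocorrelation matrix C = <x x^T> over the training set
C = (X.T @ X) / X.shape[0]

# Eigenvectors of C: candidate stable points of Hebbian-type rules
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
c1 = eigvecs[:, -1]                      # principal eigenvector

def alignment(w):
    """|cos angle| between a weight vector w and c^(1) (hypothetical helper)."""
    return abs(w @ c1) / np.linalg.norm(w)

# After running the learning rule to a fixed point w_star, a value
# alignment(w_star) close to 1 indicates principal-component convergence.
print("eigenvalues of C:", np.round(eigvals, 3))
```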

**4.3.7** Show that the discrete-time
version of Oja's rule is a good approximation of the normalized
Hebbian rule in Equation (4.3.33) for small values of the learning rate. Hint: Start
by showing that

* **4.3.8** Consider the general
learning rule described by the following discrete-time gradient
system:

**w***k*+1 = **w***k* - μ∇*J*(**w***k*)        (1)

with μ > 0. Assume that **w***
is an equilibrium point for this dynamical system.

a. Show that in the neighborhood of **w***,
the gradient ∇*J*(**w**) can be approximated as

∇*J*(**w**) ≈ **H**(**w***)(**w** - **w***)        (2)

where **H**(**w***)
is the Hessian of *J* evaluated at **w***.

b. Show that the gradient in Equation (2) is exact
when *J*(**w**) is quadratic; i.e., of the form *J*(**w**) = ½**w**T**Qw** + **b**T**w**, where **Q** is
a symmetric matrix and **b** is a vector of constants.

c. Show that the linearized gradient system at **w***
is given by

**w***k*+1 - **w*** = (**I** - μ**H**(**w***))(**w***k* - **w***)        (3)

where **I** is the identity matrix.

d. What are the conditions on **H**(**w***)
and μ for local asymptotic stability of **w***
in Equation (3)?

e. Use the above results to show that, in an average
sense, the μ-LMS rule in Equation (3.1.35) converges asymptotically
to the equilibrium solution in Equation (3.1.45) if 0 < μ < 2/λmax,
where λmax is the largest
eigenvalue of the autocorrelation matrix **C** in Equation
(3.3.4). (Hint: Start with the gradient system in Equation (1)
and use Equation (3.1.44) for *J*). Now, show that 0 < μ < 2/tr(**C**) is a
sufficient condition for convergence of the μ-LMS rule. (Hint:
The trace of a matrix (the sum of all diagonal elements) is equal
to the sum of its eigenvalues).
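Part e can be checked numerically. The sketch below uses a synthetic linear-regression data set (so that the criterion is the quadratic SSE and the averaged μ-LMS step reduces to w ← w − μ(Cw − P)), and tests convergence just below and just above the eigenvalue bound, as well as at the trace-based sufficient bound:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))        # synthetic inputs
d = rng.standard_normal(200)             # synthetic desired outputs

C = (X.T @ X) / len(X)                   # autocorrelation matrix
P = (X.T @ d) / len(X)                   # cross-correlation vector
w_star = np.linalg.solve(C, P)           # equilibrium solution C^{-1} P
lam_max = np.linalg.eigvalsh(C).max()

def run(mu, steps=2000):
    """Averaged gradient iteration w <- w - mu*(C w - P); returns final error."""
    w = np.zeros(5)
    for _ in range(steps):
        w = w - mu * (C @ w - P)
    return np.linalg.norm(w - w_star)

print("mu just below 2/lam_max:", run(0.99 * 2 / lam_max))   # converges
print("mu below 2/trace(C)    :", run(1.9 / np.trace(C)))    # converges
print("mu just above 2/lam_max:", run(1.01 * 2 / lam_max))   # diverges
```

Since tr(**C**) ≥ λmax, the trace bound is the more conservative (sufficient) one, and it is cheaper to compute than the largest eigenvalue.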

* **4.3.9** Use the results
from the previous problem and Equation (4.3.14) to find the range
of values of μ for which the discrete-time Oja's rule is stable
(in an average sense). Repeat for Hassoun's rule, which has the
Hessian matrix given by Equation (4.3.30), and give a justification
for the choice of 1 for the rule's parameter, which led to Equation (4.3.32).

† **4.3.10** Consider
a training set of 40 15-dimensional vectors whose components are
independently generated according to a normal distribution N(0, 1).
Employ the stochastic discrete-time versions of Oja's, Yuille
et al., and Hassoun's rules (replace d**w**/d*t* by **w***k*+1 - **w***k*
in Equations (4.3.11), (4.3.17), and (4.3.24), respectively) to
extract the principal eigenvector of this training set. Use a
fixed presentation order of the training vectors. Compare the
convergence behavior of the three learning rules by generating
plots similar to those in Figures 4.3.1 (a) and (b). Use a learning rate of 0.005,
100 training cycles, and a random initial **w**. Repeat using
the corresponding discrete-time average learning equations with
the same learning parameters and initial weight vector as before
and compare the two sets of simulations.
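A sketch of the Oja's-rule leg of this experiment is given below. The update Δ**w** = μ·y·(**x** − y**w**) is the stochastic Oja step; the parameters are read here as μ = 0.005 and 100 presentation cycles, which is an assumption about the values quoted above:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 15))        # 40 vectors in R^15, components N(0, 1)
C = (X.T @ X) / len(X)
c1 = np.linalg.eigh(C)[1][:, -1]         # principal eigenvector of C (reference)

mu = 0.005                               # assumed learning rate
w = 0.1 * rng.standard_normal(15)        # random initial weight vector

for cycle in range(100):                 # fixed presentation order, 100 cycles
    for x in X:
        y = w @ x
        w = w + mu * y * (x - y * w)     # stochastic Oja step

cos = abs(w @ c1) / np.linalg.norm(w)
print(f"|cos(w, c1)| after training: {cos:.4f}  (near 1 => principal component)")
print(f"||w|| after training: {np.linalg.norm(w):.4f}  (Oja normalizes toward 1)")
```

The same loop with the Yuille et al. and Hassoun updates swapped in, plus a parallel run of the discrete-time average equations with identical μ and **w**(0), yields the comparison plots the problem asks for.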

**4.3.11** This problem illustrates
an alternative approach to the one of Section 4.3 for proving
the stability of equilibrium points of an average learning equation
(Kohonen, 1989). Consider a stochastic first order differential
equation of the form d**w**/d*t* = **f**(**w**, **x**(*t*)), where **x**(*t*) is governed by
a stationary stochastic process. Furthermore, assume that the
vectors **x** are statistically independent from each other
and that strong mixing exists. Let **z** be an arbitrary constant
vector having the same dimension as **x** and **w**. Now,
the "averaged" trajectories of **w**(*t*) are
obtained by taking the expected value of d**w**/d*t*.

a. Show that

where θ is the angle between vectors **z**
and **w**.

b. Let **z** = **c**(*i*),
the *i*th unity-norm eigenvector of the autocorrelation matrix
**C** = <**xx**T>.
Show that the average rate of change of the cosine of the angle
between **w** and **c**(*i*)
for Oja's rule [Equation (4.3.12)] is given by

where λ*i*
is the eigenvalue associated with **c**(*i*).
Note that the **c**(*i*)'s
are the equilibria of Oja's rule.

c. Use the result in part b to show that if **w**(0)T**c**(1) ≠ 0, then
**w**(*t*) will converge to the solution **w*** = **c**(1),
where **c**(1) is the
eigenvector associated with the largest eigenvalue λ1
(Hint: Recall the bounds on the Rayleigh quotient given in Section
4.3.2).
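The averaging argument of parts b and c can be visualized by integrating the average Oja equation d**w**/d*t* = **Cw** − (**w**T**Cw**)**w** directly and tracking the cosine of the angle between **w**(*t*) and **c**(1). A forward-Euler sketch on synthetic data (step size and horizon are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 6))
C = (X.T @ X) / len(X)                       # autocorrelation matrix
c1 = np.linalg.eigh(C)[1][:, -1]             # principal eigenvector c^(1)

w = rng.standard_normal(6)                   # w(0), generically not orthogonal to c1
dt = 0.01
cos_history = []
for _ in range(4000):
    w = w + dt * (C @ w - (w @ C @ w) * w)   # average Oja equation, Euler step
    cos_history.append((w @ c1) / np.linalg.norm(w))

# When w(0)^T c1 != 0, the cosine is driven toward +-1, i.e. w(t) aligns
# with the principal eigenvector, as part c asserts.
print(f"cos at start {cos_history[0]:+.3f}, at end {cos_history[-1]:+.3f}")
```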

* **4.3.12**
Use the technique outlined in Problem 4.3.11 to study the convergence
properties (in the average) of the following stochastic learning
rules which employ a generalized forgetting law:

a. .

b. .

Assume that the rules' rate parameter is positive and *y* = **w**T**x**,
and that *g*(*y*) is an arbitrary scalar function of
*y* whose average over the input distribution exists. Note that Equation (4.7.18) and Oja's
rule are special cases of the learning rules in a and b, respectively.

**4.4.1** Show that Equation
(4.4.7) is the Hessian for the criterion function implied by Equation
(4.4.5).

**4.6.1** Study (qualitatively)
the competitive learning behavior which minimizes the criterion
function

where the notation is as defined in Equation (4.6.3). Can you
think of a physical system (for some integer value of *N*)
which is governed by this "energy" function *J*?

**4.6.2** Derive a stochastic
competitive learning rule whose corresponding average learning
equation maximizes the criterion function in Equation (4.6.16).

* **4.7.1** Show that for
the one-dimensional feature map, Equation (4.7.6) can be derived
from Equation (4.7.4). See Hertz et al. (1991) for hints.

**4.7.2** Show that Equation
(4.7.5) satisfies Equation (4.7.6).

**4.7.3** Solve Equation (4.7.5)
for *p*(*x*) ∝ *x*, *x* ∈ R.
For which input distribution *p*(*x*) do we have a zero-distortion
feature map?

**4.7.4** Prove the stability
of the equilibrium point in Equation (4.7.20). (Hint: Employ the
technique outlined in Problem 4.3.11).