4.0 Introduction
This chapter deals with theoretical aspects of learning
in artificial neural networks. It investigates mathematically,
the nature and stability of the asymptotic solutions obtained
using the basic supervised, Hebbian and reinforcement learning
rules, which were introduced in the previous chapter. Formal
analysis is also given for simple competitive learning and self-organizing
feature map learning.
A unifying framework for the characterization of
various learning rules is presented. This framework is based
on the notion that learning in general neural networks can be
viewed as search, in a multidimensional space, for a solution
which optimizes a prespecified criterion function, with or without
constraints. Under this framework, a continuous-time learning
rule is viewed as a first-order, stochastic differential equation/dynamical
system, whereby the state of the system evolves so as to minimize
an associated instantaneous criterion function. Approximation
techniques are employed to determine, in an average sense, the
nature of the asymptotic solutions of the stochastic system.
This approximation leads to an "average learning equation"
which, in most cases, can be cast as a globally, asymptotically
stable gradient system whose stable equilibria are minimizers
of a well-defined average criterion function. Finally, and subject
to certain assumptions, these stable equilibria can be taken as
the possible limits (attractor states) of the stochastic learning
equation.
The chapter also treats two important issues associated
with learning in a general feedforward neural network. These
are learning generalization and learning complexity. The section
on generalization presents a theoretical method for calculating
the asymptotic probability of correct generalization of a neural
network as a function of the training set size and the number
of free parameters in the network. Here, generalization in deterministic
and stochastic nets are investigated. The chapter concludes by
reviewing some significant results on the complexity of learning
in neural networks.
4.1 Learning as a Search Mechanism
Learning in an artificial neural network, whether
it is supervised, reinforcement, or unsupervised, can be viewed
as a process of searching a multi-dimensional parameter space
for a state which optimizes a predefined criterion function J
(J is also commonly referred to as an error function, a
cost function, or an objective function). In fact, all of the
learning rules considered in Chapter 3, except Oja's rule (refer
to Table 3.1), have well-defined analytical criterion functions.
These learning rules implement local search mechanisms (i.e.,
gradient search) to obtain weight vector solutions which (locally)
optimize the associated criterion function. Therefore, it is
the criterion function that determines the nature of a learning
rule. For example, supervised learning rules are designed so
as to minimize an error measure between the network's output and
the corresponding desired output, unsupervised Hebbian learning
is designed to maximize the variance of the output of a given
unit, and reinforcement learning is designed to maximize the average
reinforcement signal.
Supervised learning can be related to classical
approximation theory (Poggio and Girosi, 1990a). Here, the idea
is to approximate or interpolate a continuous multivariate function
g(x), from samples {x, g(x)},
by an approximation function (or class of functions) G(w,
x), where w Rd
is a parameter vector with d degrees of freedom, and x
belongs to a compact set Rn.
In this case, the set of samples {x, g(x)},
x , is referred to as a training set. The approximation
problem is to find an optimal parameter vector that provides the
"best" approximation of g on the set for
a given class of functions G. Formally stated, we
desire a solution w* Rd
such that
where the tolerance is a positive real number and
whose global minimum represents the minimum sum of
square error (SSE) solution. The choice of approximation function
G, the criterion function J, and the search mechanism
for w* all play
critical roles in determining the quality and properties of the
resulting solution/approximation.
Usually we know our objective and this knowledge
is translated into an appropriate criterion function J.
In terms of search mechanisms, gradient-based search is appropriate
(and is simple to implement) for cases where it is known that
J is differentiable and bounded. Of course, if gradient
search is used, we must be willing to accept locally-optimal solutions.
In general, only global search mechanisms, which are computationally
very expensive, may lead to global optimal solutions (global search-based
learning is the subject of Chapter 8). Among the above three
factors, the most critical in terms of affecting the quality of
the approximation in Equation (4.1.1) is the choice of approximation
function G.
In classical analysis, polynomials and rational
functions are typically used for function approximation. On the
other hand, for artificial neural networks, the approximation
functions are usually chosen from the class of smooth sigmoidal-type
functions, and the approximation is constructed as a superposition
of such sigmoidal functions. In general, there are two important
issues in selecting an appropriate class of basis functions: universality
and generalization. By universality, we mean the ability of G
to represent, to any desired degree of accuracy, the class of
functions g being approximated; in Chapter 2, we have established
the universality of feedforward layered neural nets. On the other
hand, generalization means the ability of G to correctly
map new points x, drawn from the same underlying input
distribution p(x), which were not seen during the
learning phase. Thus, an interesting question here is how well
does a neural network compare to other universal approximation
functions in terms of generalization (some insight into answering
this question is given in Section 5.2.5). Later in this chapter,
we will see that for feedforward neural networks, the number of
degrees of freedom (weights) in G plays a critical role
in determining the degree of data overfitting, which directly
affects generalization quality.
Another way to control generalization is through
criterion function conditioning. A "regularization"
term (Poggio and Girosi, 1990a) may be added to an initial criterion
function
The ||P(w)||2
term in Equation (4.1.3) is used to imbed a priori information
about the function g, such as smoothness, invariance, etc.
In this case,
4.2 Mathematical Theory of Learning in a Single Unit Setting
In this section, instead of dealing separately with
the various learning rules proposed in the previous chapter, we
seek to study a single learning rule, called the general learning
equation (Amari, 1977a and 1990), which captures the salient features
of several of the different single-unit learning rules. Two forms
of the general learning equation will be presented: a discrete-time
version, in which the weight vector evolves according to a discrete
dynamical system of the form w(k+1) = g(w(k)),
and a continuous-time version, in which the weight vector evolves
according to a smooth dynamical system of the form
4.2.1 General Learning Equation
Consider a single unit which is characterized by
a weight vector w Rn,
an input vector x Rn,
and, in some cases, a scalar teacher signal z. In a supervised
learning setting, the teacher signal is taken as the desired target
associated with a particular input vector. The input vector (signal)
is assumed to be generated by an environment or an information
source according to the probability density p(x,
z), or p(x) if z is missing (as in unsupervised
learning). Now, consider the following discrete-time dynamical
process which governs the evolution of the unit's weight vector
w
and the continuous-time version
where and are positive real, and
From the point of view of analysis, it is useful
to think of Equation (4.2.2) as implementing a fixed increment
steepest gradient descent search of an instantaneous criterion
function J, or formally,
For the case r(w, x, z)
= r(wTx,
z) = r(u, z), the right-hand side of Equation
(4.2.2) can be integrated to yield
where =
The criterion function J in Equation (4.2.4)
fits the general form of the constrained criterion function in
Equation (4.1.3). Therefore, we may view the task of minimizing
J as an optimization problem with the objective of maximizing
4.2.2 Analysis of the Learning Equation
In a stochastic environment where the information
source is ergodic, the sequence of inputs x(t) is
an independent stochastic process governed by p(x).
The general learning equation in Equation (4.2.2) then becomes
a stochastic differential equation or a stochastic approximation
algorithm. The weight vector w is changed in random directions
depending on the random variable x. From Equation (4.2.3),
the average value of
where
It is interesting to note that finding w*
does not require the knowledge of J explicitly if the gradient
of J is known. Equation (4.2.6) is useful from a theoretical
point of view in determining the equilibrium state(s) and in characterizing
the stochastic learning equation [Equation (4.2.2)] in an "average"
sense. In practice, the stochastic learning equation is implemented
and its average convergence behavior is characterized by the "average
learning equation" given as
The gradient system in Equation (4.2.6) has special
properties that makes its dynamics rather simple to analyze.
First, note that the equilibria w*
are solutions to <J> = 0. This
means that the equilibria w*
are local minima, local maxima, and/or saddle points of <J>.
Furthermore, it is a well established result that, for any > 0,
these local minima are asymptotically stable points (attractors)
and that the local maxima are unstable points (Hirsch and Smale,
1974). Thus, one would expect the stochastic dynamics of the
system in Equation (4.2.3), with sufficiently small , to approach
a local minimum of <J>.
In practice, discrete-time versions of the stochastic
dynamical system in Equation (4.2.3) are used for weight adaptation.
Here, the stability of the corresponding discrete-time average
learning equation (discrete-time gradient system) is ensured if
0 < <
4.2.3 Analysis of Some Basic Learning Rules
By utilizing Equation (4.2.7), we are now ready
to analyze some basic learning rules. These are the correlation,
LMS, and Hebbian learning rules.
Correlation Learning
Here, r(w, x, z) = z,
which represents the desired target associated with the input
x. From Equation (4.2.2) we have the stochastic equation
which leads to the average learning equation
Now, by setting
The stability of w*
may now be systematically studied through the "expected"
Hessian matrix < H(w*)
> which is computed, by first employing Equations (4.2.5) and
(4.2.9) to identify
This equation shows that the Hessian of <J>
is positive definite; i.e., its eigenvalues are strictly positive
or, equivalently, the eigenvalues of
From Equation (4.2.4), the underlying instantaneous
criterion function J is given by
which may be minimized by maximizing the correlation
zy subject to the regularization term
LMS Learning
For r(w, x, z)
= z - wTx
(the output error due to input x) and = 0, Equation (4.2.2)
leads to the stochastic equation
In this case, the average learning equation becomes
with equilibria satisfying
or
Let C denote the positive semidefinite autocorrelation
matrix
which is positive definite if
Finally, note that with = 0 Equation
(4.2.4) leads to
or
which is the instantaneous SSE (or MSE) criterion
function.
Hebbian Learning
Here, upon setting r(w, x, z)
= y = wTx,
Equation (4.2.2) gives the Hebbian rule with decay
whose average is
Setting
So if C happens to have
positive definite. Now, employing Equation (4.2.4)
we get the instantaneous criterion function minimized by the Hebbian
rule in Equation (4.2.18):
The regularization term
4.3 Characterization of Additional Learning Rules
Equation (4.2.5) [or (4.2.6)] is a powerful tool
which can be used in the characterization of the average behavior
of stochastic learning equations. We will employ it in this section
in order to characterize some unsupervised learning rules which
were considered in Chapter 3. The following analysis is made
easier if one assumes that w and x are uncorrelated,
and that
The above approximation of the average learning equation
is valid when the learning equation contains strongly mixing random
processes (processes for which the "past" and the "future"
are asymptotically independent) with the mixing rate high compared
to the rate of change of the solution process; i.e., it can be
assumed that the weights are uncorrelated with the patterns x
and with themselves. Taking the expected (average) value of a
stochastic equation one obtains a deterministic equation whose
solution approximates asymptotically the behavior of the original
system as described by the stochastic equation (here, = (t) =
A more rigorous characterization of stochastic learning
requires the more advanced theory of stochastic differential equations
and will not be considered here [see Kushner (1977) and Ljung
(1978)]. Rather, we may proceed with a deterministic analysis
using the "average versions" of the stochastic equations.
It may be shown that a necessary condition for the stochastic
learning rule to converge (in the mean-square sense) is that the
average version of the learning rule must converge. In addition,
and under certain assumptions, the exact solution of a stochastic
equation is guaranteed to "stay close," in a probabilistic
sense, to the solution
Before proceeding with further analysis of learning
rules, we make the following important observations regarding
the nature of the learning parameter in the stochastic learning
equation (Heskes and Kappen, 1991). When a neural network interacts
with a fixed unchanging (stationary) environment, the aim of the
learning algorithm is to adjust the weights of the network in
order to produce an optimal response; i.e., an optimal representation
of the environment. To produce such an optimal and static representation,
we require the learning parameter, which controls the amount of
learning, to eventually approach zero. Otherwise, fluctuations
in the representation will persist, due to the stochastic nature
of the learning equation, and asymptotic convergence to optimal
representation is never achieved. For a large class of stochastic
algorithms, asymptotic convergence can be guaranteed (with high
probability) by using the learning parameter
On the other hand, consider biological neural nets.
Human beings, of course, are able to continually learn throughout
their entire lifetime. In fact, human learning is able to proceed
on two different time scales; humans learn with age (very large
time scale adaptation/learning) and are also capable of discovering
regularities and are attentive for details (short time scale learning).
This constant tendency to learn accounts for the adaptability
of natural neural systems to a changing environment. Therefore,
it is clear that the learning processes in biological neural networks
does not allow for asymptotically vanishing learning parameters.
In order for artificial neural networks to be capable
of adapting to a changing (nonstationary) environment, the learning
parameter must take a constant nonzero value. The larger the
learning parameter, the faster the response of the network to
the changing environment. On the other hand, a large learning
parameter has a negative effect on the accuracy of the network's
representation of the environment at a given time; a large gives
rise to large fluctuations around the desired optimal representation.
In practice, though, one might be willing to trade some degree
of fluctuation about the optimal representation (solution) for
adaptability to a nonstationary process. Similar ideas have been
proposed in connection with stochastic adaptive linear filtering.
Here, an adaptive algorithm with a constant step size is used
because it has the advantage of a limited memory, which enables
it to track time fluctuations in the incoming data. These ideas
date back to Wiener (1956) in connection with his work on linear
prediction theory.
4.3.1 Simple Hebbian Learning
We have already analyzed one version of the Hebbian
learning rule in the previous section. However, here we consider
the most simple form of Hebbian learning which is given by Equation
(4.2.18) with = 0; namely,
The above equation is a continuous-time version of
the unsupervised Hebbian learning rule introduced in Chapter 3.
Employing Equation (4.3.1), we get the approximate average learning
equation
In Equation (4.3.3) and in the remainder of this
chapter, the subscript x in
from which we may determine the average instantaneous
criterion function
Note that
In the following, we will characterize other versions
of the Hebbian learning rule some of which were introduced in
Chapter 3. These rules are well-behaved, and hence solve the
divergence problem encountered with simple Hebbian learning.
For simplifying mathematical notation and terminology, the following
sections will use J, J, and H to designate
4.3.2 Improved Hebbian Learning
Consider the criterion function
It is a well established property of quadratic forms
that if w is constrained to the surface of the unit hypersphere,
then Equation (4.3.6) is minimized when w = c(1)
with
which can be shown to be the average version of the
nonlinear stochastic learning rule
If we heuristically set
Hence, the equilibria of Equation (4.3.7) are the
solutions of Equation (4.3.9) given by w*
= c(i),
i = 1, 2, ..., n. Here,
J takes its smallest value of
Next, consider the Hessian of J at w*
= c(i)
(assuming, without loss of generality, = 1) and multiply it
by c(j) ,
namely, H(c(i)) c(j).
It can be shown that this quantity is given by (see Problem 4.3.1.
For a reference on matrix differential calculus, the reader is
referred to the book by Magnus and Neudecker, 1988):
This equation implies that H(w*)
has the same eigenvectors as C but with different eigenvalues.
H(w*) is
positive semi-definite only when w*
= c(1). Thus,
by following the dynamics of Equation (4.3.7), w will eventually
point in the direction of c(1)
(since none of the other directions c(i)
is stable). Although the direction of w will eventually
stabilize, it is entirely possible for ||w|| to approach
infinity, and Equation (4.3.7) will appear never to converge.
We may artificially constrain ||w|| to finite values by
normalizing w after each update of Equation (4.3.8). Alternatively,
we may set the two
4.3.3 Oja's Rule
Oja's rule was defined by Equation (3.3.6). Its
continuous-time version is given by the nonlinear stochastic equation
The corresponding average learning equation is thus
(Oja, 1982; 1989)
which has its equilibria at w satisfying
The solutions of Equation (4.3.13) are w*
= c(i),
i = 1, 2, ..., n. All
of these equilibria are unstable except for
is positive definite only at w*
= c(1) (or -c(1)).
Note that Equation (4.3.14) is derived starting from
and by noting that the eigenvalues of C satisfy
A single unit employing Oja's rule (Oja's unit)
is equivalent to a linear matched filter. To see this, assume
that for all x,
Oja's rule is interesting because it results in
a local learning rule which is biologically plausible. The locality
property is seen by considering the component weight adaptation
rule of Equation (4.3.11), namely
and by noting that the change in the ith weight
is not an "explicit" function of any other weight except
the ith weight itself. Of course,
It is also interesting to note that Oja's rule is
similar to Hebbian learning with weight decay as in Equation (4.2.18).
For Oja's rule, though, the growth in ||w|| is controlled
by a "forgetting" or weight decay term, -y2w,
which has nonlinear gain; the forgetting becomes stronger with
stronger response, thus preventing ||w|| from diverging.
Example 4.3.1: This example
shows typical simulation results comparing the evolution of the
weight vector w according to the stochastic Oja rule in
Equation (4.3.11) and its corresponding average rule in Equation
(4.3.12).
Consider a training set {x} of forty 15-dimensional
column vectors with independent random components generated by
a normal distribution N(0, 1). In the following simulations,
the training vectors autocorrelation matrix
During training, one of the forty vectors is selected
at random and is used in the learning rule to compute the next
weight vector. Discretized versions of Equations (4.3.11) and
(4.3.12) are used where
Figure 4.3.1. (a) Evolution of the cosine of the
angle between the weight vector wk
and the principal eigenvector of the autocorrelation matrix C
for the stochastic Oja rule (solid line) and for the average Oja
rule (dashed line). (b) Evolution of the magnitude of the weight
vector. The training set consists of forty 15-dimensional real-valued
vectors whose components are independently generated according
to a normal distribution N(0, 1). The presentation order
of the training vectors is random during training.
The above simulation is repeated, but with a fixed
presentation order of the training set. Results are shown in
Figures 4.3.2 (a) and (b). Note that the results for the average
learning equation (dashed line) are identical in both simulations
since they are not affected by the order of presentation of input
vectors. These simulations agree with the theoretical results
on the appropriateness of using the average learning equation
to approximate the limiting behavior of its corresponding stochastic
learning equation. Note that a monotonically decreasing learning
rate (say proportional to
Figure 4.3.2. (a) Evolution of the cosine of the
angle between the weight vector wk
and the principal eigenvector of the autocorrelation matrix C
for the stochastic Oja rule (solid line) and for the average Oja
rule (dashed line). (b) Evolution of the magnitude of the weight
vector. The training set consists of forty 15-dimensional real-valued
vectors whose components are independently generated according
to a normal distribution N(0, 1). The presentation order
of the training vectors is fixed.
4.3.4 Yuille et al. Rule
The continuous-time version of the Yuille et al.
(1989) learning rule is
and the corresponding average learning equation is
with equilibria at
From Equation (4.3.18), the gradient of J
is
which leads to the Hessian
Note that one could have also computed H directly
from the known criterion function
Now, evaluating H(wi*)wj*
we get
which implies that the wj*
are eigenvectors of H(wi*)
with eigenvalues i
- j for i j
and 2i for i
= j. Therefore, H(wi*)
is positive definite if and only if i >
j, for i j.
In this case, w*
=
4.3.5 Hassoun's Rule
In the following, the unsupervised Hebbian-type
learning rule
with ||w|| 0 is analyzed. Another
way for stabilizing Equation (4.3.2) is to start from a criterion
function that explicitly penalizes the divergence of w.
For example, if we desire
with > 0. It can be easily shown that steepest
gradient descent on the above criterion function leads to the
learning equation
Equation (4.3.26) is the average learning equation
for the stochastic rule of Equation (4.3.24). Its equilibria
are solutions of the equation
Thus, it can be seen that the solution vectors of
Equation (4.3.27), denoted by
where i
is the ith eigenvalue of C. From Equation (4.3.28)
it can be seen that the norm of the ith equilibrium state
is given by
which requires > i
for all i. Note that if >> i,
then
Next, we investigate the stability of these equilibria.
Starting from the Hessian
we have
which implies that
For the discrete-time version, it is interesting
to note that for the case >> 1,
and = 1, Equation (4.3.24) reduces to
which is very similar to the discrete-time simple
Hebbian learning rule with weight vector normalization of Equation
(3.3.5) and expressed here as
Note that the weight vector in
In practice, discrete-time versions of the stochastic
learning rules in Equations (4.3.11), (4.3.17), and (4.3.24) are
used where
In concluding this section, another interpretation
of the regularization effects on the stabilization of Hebbian-type
rules is presented. In the stochastic learning Equations (4.2.18),
(4.3.11), (4.3.17), and (4.3.24), regularization appears as weight
decay terms -w, -y2w,
-||w||2w,
and -
4.4 Principal Component Analysis (PCA)
The PCA network of Section 3.3.5, employing Sanger's
rule, is analyzed in this section. Recalling Sanger's rule [from
Equation (3.3.19)] and writing it in vector form for the continuous-time
case, we get
with i = 1, 2, ..., m, where
m is the number of units in the PCA network. We will assume,
without any loss of generality, that m = 2. This leads
to the following set of coupled learning equations for the two
units:
and
Equation (4.4.2) is Oja's rule. It is independent
of unit 2 and thus converges to
With the sequential update assumption, Equation
(4.4.3) becomes
For clarity, we will drop the subscript on w.
Now, the average learning equation for unit 2 is given by
which has equilibria satisfying
Hence,
Since
which is positive definite only at
The unit-by-unit description presented here helps
simplify the explanation of the PCA net behavior. In fact, the
weight vectors wi
approach their final values simultaneously, not one at a time.
Though, the above analysis still applies, asymptotically, to
the end points. Note that the simultaneous evolution of the wi
is advantageous since it leads to faster learning than if the
units are trained one at a time.
4.5 Theory of Reinforcement Learning
Recall the simplified stochastic reinforcement learning
rule of Equation (3.2.5). The continuous-time version of this
rule is given by
from which the average learning equation is given
as
Now, employing Equation (4.2.6), we may think of
Equation (4.5.2) as implementing a gradient search on an average
instantaneous criterion function,
In Equations (4.5.1) through (4.5.3), we have
with the expected output
as in Section 3.1.6. Next, it is shown that Equation
(4.5.3) is proportional to the gradient of the expected reinforcement
signal
First, we express the expected (average) reinforcement
signal,
and then evaluate its gradient with respect to w.
The gradient of
which follows from Equation (4.5.4). We also have
and
which can be used in Equation (4.5.7) to give
If we now take the gradient of Equation (4.5.6) and
use (4.5.10), we arrive at
which can also be written as
Finally, by averaging Equation (4.5.12) over all
inputs xk,
we get
where now the averages are across all patterns and
all outputs. Note that
which implies that the average weight vector converges
to a local maximum of
Extensions of these results to a wider class of
reinforcement algorithms can be found in Williams (1987). The
characterization of the associative reward-penalty algorithm in
Equation (3.2.1) is more difficult since it does not necessarily
maximize
4.6 Theory of Simple Competitive Learning
In this section we attempt to characterize simple
competitive learning. Two approaches are described: one deterministic
and the other statistical.
4.6.1 Deterministic Analysis
Consider a single layer of linear units, where each
unit uses the simple continuous-time competitive rule [based on
Equations (3.4.3) and (3.4.5)]
Also, consider the criterion function (Ritter and
Schulten, 1988a)
where
Here,
Now, performing gradient descent on J in Equation
(4.6.4) yields
which is the batch mode version of the learning rule
in Equation (4.6.1). It was noted by Hertz et al. (1991) that
the above batch mode competitive learning rule corresponds to
the k-means clustering algorithm when a finite training
set is used. The local rule of Equation (4.6.1) may have an advantage
over the batch mode rule in Equation (4.6.5) since stochastic
noise due to the random presentation order of the input patterns
may kick the solution out of "poor" minima towards minima
which are more optimal. However, only in the case of "sufficiently
sparse" input data points can one prove stability and convergence
theorems for the stochastic (incremental) competitive learning
rule (Grossberg, 1976). The data points are sparse enough if
there exists a set of clusters so that the minimum overlap
Criterion functions, other than the one in Equation
(4.6.4), may be employed which incorporate some interesting heuristics
into the competitive rule for enhancing convergence speed or for
altering the underlying "similarity measure" implemented
by the learning rule. For an example, we may replace
4.6.2 Stochastic Analysis
The following is an analysis of simple competitive
learning based on the stochastic approximation technique introduced
in Section 4.3. Consider the following normalized discrete-time
competitive rule (von der Malsberg, 1973; Rumelhart and Zipser,
1985):
where, again, the setting is a single layer network
of linear units. Here,
Let P(xk)
be the probability that input xk
is presented on any trial. Then, the average learning equation
may be expressed as
where
which implies that at equilibrium
Therefore, the jth component of vector wi
is given as
We now make the following observations. First, note
that the denominator of Equation (4.6.10) is the probability that
unit i wins averaged over all stimulus patterns. Note
further that
which states that at equilibrium, wij
is expected to be proportional to the conditional probability
that the jth bit of input xk
is active given unit i is a winner.
Next, upon the presentation of a new pattern
or
where
Because of the dependency of
Equation (4.6.6) leads the search to one of many
stable equilibrium states satisfying Equations (4.6.9) and (4.6.14).
In such a state, the ith unit activations
where the averaging is taken over all xk,
and i* is the index
of winning units. Note that Equation (4.6.15) can be written
as
The larger the value of J, the more stable
the system is expected to be. Maximizing J can also be
viewed as maximizing the overlap among patterns within a group
(cluster) while minimizing the overlap among patterns between
groups; this is exactly what is required for the clustering of
unlabeled data. In geometric terms, J is maximized when
the weight vectors point toward maximally compact stimulus (input)
regions that are as distant as possible from other such regions.
4.7 Theory of Feature Mapping
The characterization of topological feature preserving
maps has received special attention in the literature (Kohonen,
1982b; Cottrell and Fort, 1986; Ritter and Schulten, 1986 and
1988b; Tolat, 1990; Heskes and Kappen, 1993a; Lo et al., 1993;
Kohonen, 1993b). In particular, Takeuchi and Amari (1979) and
Amari (1980 and 1983) have extensively studied a continuous-time
dynamical version of this map to investigate the topological relation
between the self-organized map and the input space governed by
the density p(x), the resolution and stability of
the map, and convergence speed. The characterization of a general
feature map is difficult, and much of the analysis has been done
under simplifying assumptions.
In the following, a one-dimensional version of the
self-organizing feature map of Kohonen is characterized following
the approach of Ritter and Schulten (1986). A continuous dynamical
version of Kohonen's map is also described and analyzed.
4.7.1 Characterization of Kohonen's Feature Map
Consider the criterion function J(w)
defined by (Ritter and Schulten, 1988a)
where i*
is the label of the winner unit upon the presentation of stimulus
(input) xk
and
which is just the batch mode version of Kohonen's
self-organizing rule in Equation (3.5.1). Thus, Kohonen's rule
is a stochastic gradient descent search and leads, on average
and for small , to a local minimum of J in Equation (4.7.1).
These minima are given as solutions to
This equation is not easy to solve; it depends on
the choice of and the distribution p(x). Actually,
we desire the global minimum of the criterion function J.
Local minima of J are topological defects like kinks in
one-dimensional maps and twists in two-dimensional maps [Kohonen,
1989; Geszti, 1990].
The analysis of feature maps becomes more tractable
if one replaces Equation (4.7.3) with a continuous version that
assumes a continuum of units and where the distribution p(x)
appears explicitly, namely
where r* = r*(x)
is the coordinate vector of the winning unit upon the presentation
of input x.
An implicit partial differential equation for w
can be derived from Equation (4.7.4) [Ritter and Schulten, 1986].
However, for the case of two- or higher-dimensional maps, no
explicit solutions exist for w(r) given an arbitrary
p(x). On the other hand, solutions of Equation
(4.7.4) are relatively easy to find for the one-dimensional map
with scalar r and a given input distribution p(x).
Here, the equilibrium w*
satisfies (Ritter and Schulten, 1986)
which, in turn, satisfies the implicit differential
equation corresponding to Equation (4.7.4), given by (assuming
a sharply peaked symmetric ):
In Equations (4.7.5) and (4.7.6), the term p(w)
is given by p(x)|x=w.
Equation (4.7.5) shows that the density of the units in w
space is proportional to
Finally, one may obtain the equilibria w*
by solving Equation (4.7.6). The local stability of some of these
equilibria is ensured (with a probability approaching 1) if the
learning coefficient = (t) is sufficiently small,
positive, and decays according to the following necessary and
sufficient conditions (Ritter and Schulten, 1988b)
and
In particular, the decay law
4.7.2 Self-Organizing Neural Fields
Consider a continuum of units arranged as an infinite
two-dimensional array (neural field). Each point (unit) on this
array may be represented by a position vector r and has
an associated potential u(r). The output of the
unit at r is assumed to be a nonlinear function of its
potential y(r) = f [u(r)],
where f is either a monotonically non-decreasing positive
saturating activation function or a step function. Associated
with each unit r is a set of input weights w(r)
and another set of lateral weights (r, r') = (r - r').
This lateral weight distribution is assumed to be of the on-center
off-surround type as is shown in Figure 4.7.1 for the one-dimensional
case. Here, a unit at position r makes excitatory connections
with all of its neighbors located within a distance
The dynamics of the neural field potential u(r,
t) are given by
where
and h is a constant bias field. In Equation
(4.7.8), it is assumed that the potential u(r, t)
decays with time constant to the resting potential h in
the absence of any stimulation. Also, it is assumed that this
potential increases in proportion to the total stimuli s(r,
x) which is the sum of the lateral stimuli
In Equation (4.7.8), the rates of change of and
w, if any, are assumed to be much slower than that of the
neural field potential. The input signal (pattern) x is
a random time sequence, and it is assumed that a pattern x
is chosen according to probability density p(x).
Also, we assume that inputs are applied to the neural field for
a time duration which is longer than the time constant of the
neural field potential. On the other hand, the duration of stimulus
x is assumed to be much shorter than the time constant
' of the weight w. Thus, the potential distribution u(r,
t) can be considered to change in a quasi-equilibrium manner
denoted by u(r, x).
An initial excitation pattern applied to the neural
field changes according to the dynamics given in Equation (4.7.8),
and eventually converges to one of the equilibrium solutions.
Stable equilibrium solutions are the potential fields u(r)
which the neural field can retain persistently under a constant
input x. The equilibria of Equation (4.7.8) must satisfy
where s(r, u*)
is the total stimuli at equilibrium. When the lateral connections
distribution (r - r') is strongly off-surround inhibitory,
given any x, only a local excitation pattern is aroused
as a stable equilibrium which satisfies Equation (4.7.10) (Amari,
1990). Here, a local excitation is a pattern where the excitation
is concentrated on units in a small local region; i.e., u*(r)
is positive only for a small neighborhood centered at a maximally
excited unit r0.
Thus u*(r,
x) represents a mapping from the input space onto the neural
field.
Let us now look at the dynamics of the self-organizing
process. We start by assuming a particular update rule for the
input weights of the neural field. One biologically plausible
update rule is the Hebbian rule:
where
where the averaging is over all possible x.
The equilibria of Equation (4.7.12) are given by
If we now transpose Equation (4.7.12) and multiply
it by an arbitrary input vector x we arrive at an equation
for the change in input stimuli
The vector inner product xT
x' represents the similarity of two input signals x
and x' and hence the topology of the signal space (Takeuchi
and Amari, 1979). Note how Equation (4.7.14) relates the topology
of the input stimulus set {x} with that of the neural field.
On the other hand, if one assumes a learning rule
where a unit r updates its input weight vector in proportion
to the correlation of its equilibrium potential u*(r,
x) and the difference x - w(r), one
arrives at the average differential equation
This learning rule is equivalent to the averaged
continuum version of Kohonen's self-organizing feature map in
Equation (4.7.2), if one views the potential distribution u*(r,
x) as the weighting neighborhood function . Here, self-organization
will emerge if the dynamics of the potential field evolve such
that the quasi-stable equilibrium potential u*
starts positive for all r, then monotonically and slowly
shrinks in diameter for positive time. This may be accomplished
through proper control of the bias field h, as described
below.
In general, it is difficult to solve Equation (4.7.8)
and (4.7.12) [or (4.7.15)]. However, some properties of the formation
of feature maps are revealed from these equations for the special,
but revealing, case of a one-dimensional neural field (Takeuchi
and Amari, 1979; Amari, 1980 and 1983). The dynamics of the potential
field in Equation (4.7.8) for a one-dimensional neural field was
analyzed in detail by Amari (1977b) and Kishimoto and Amari (1979)
for a step activation function and a continuous monotonically
non-decreasing activation function, respectively. It was shown
that with (r - r') as shown in Figure 4.7.1 and
f(u) = step(u), there exist stable equilibrium
solutions u*(r)
for the x = 0 case. The 0-solution potential
field, u*(r)
0, and the -solution field, u*(r)
0, are among these stable solutions. The 0-solution is stable
if and only if h < 0. On the other hand,
the -solution is stable if and only if h > -2 A(),
where
and plotted in Figure 4.7.3. Amari (see also Krekelberg
and Kok, 1993) also showed that a single a-solution can
exist for the case of a non-zero input stimulus, and that the
corresponding active region of the neural field is centered at
the unit r receiving the maximum input. Furthermore, the
width, a, of this active region is a monotonically decreasing
function of the bias field h. Thus, one may exploit the
fact that the field potential/neighborhood function u*(r, x)
is controlled by the bias field h in order to control the
convergence of the self-organizing process. Here, the uniform
bias field h is started at a positive value h > -2A()
and is slowly decreased towards negative values. This, in turn,
causes u* to start
at the -solution, then gradually move through a-solutions
with decreasing width a until, ultimately, u*
becomes the 0-solution. For further analysis of the self-organizing
process in a neural field, the reader is referred to Zhang (1991).
Kohonen (1993a and 1993b) proposed a self-organizing
map model for which he gives physiological justification. The
model is similar to Amari's self-organizing neural field except
that it uses a discrete two-dimensional array of units. The model
assumes sharp self-on off-surround lateral interconnections so
that the neural activity of the map is stabilized where the unit
receiving the maximum excitation becomes active and all other
units are inactive. Kohonen's model employs unit potential dynamics
similar to those of Equation (4.7.8). On the other hand, Kohonen
uses a learning equation more complex than those in Equations
(4.7.12) and (4.7.15). This equation is given for the ith
unit weight vector by the pseudo-Hebbian learning rule
where is a positive constant. The term
Under the assumption that the index r ranges
over all components of the input signal x and regarding
where = > 0. Now, multiplying
both sides of Equation (4.7.18) by 2wT
leads to the differential equation
Thus, for arbitrary x with wTx > 0,
Equation (4.7.19) converges to . On the other hand, the solution
for the direction of w*
cannot be determined in closed form from the deterministic differential
Equation (4.7.18). However, a solution for the expected value
of w may be found if Equation (4.7.18) is treated as a
stochastic differential equation with strong mixing in accordance
to the discussion of Section 4.3. Taking the expected value of
both sides of Equation (4.7.18) and solving for its equilibrium
points (by setting ) gives
(4.7.20)
Furthermore, this equilibrium point can be shown
to be stable (Kohonen, 1989).
From the above analysis, it can be concluded that
the synaptic weight vector w is automatically normalized
to the length , independent of the input signal x, and
that w rotates such that its average direction is aligned
with the mean of x. This is the expected result of a self-organizing
map when a uniform nondecreasing neighborhood function is used.
In general, though, the neighborhood term is nonuniform and
time varying, which makes the analysis of Equation (4.7.17) much
more difficult.
4.8 Generalization
In Chapter 2, we analyzed the capabilities of some
neural network architectures for realizing arbitrary mappings.
We have found that a feedforward neural net with a single hidden
layer having an arbitrary number of sigmoidal activation units
is capable of approximating any mapping (or continuous multivariate
function) to within any desired degree of accuracy. The results
of Chapter 2 on the computational capabilities of layered neural
networks say nothing about the synthesis/learning procedure needed
to set the interconnection weights of these networks. What remains
to be seen is whether such networks are capable of finding the
necessary weight configuration for a given mapping by employing
a suitable learning algorithm.
Later chapters in this book address the question
of learning in specific neural network architectures by extending
appropriate learning rules covered in Chapter 3. In the remainder
of this chapter, we address two important issues related to learning
in neural networks: generalization and complexity. The following
discussion is general in nature, and thus it holds for a wide
range of neural network paradigms.
Generalization is measured by the ability of a trained
network to generate the correct output for a new randomly chosen
input belonging to the same probability density p(x)
governing the training set. In this section, two cases are considered:
Average generalization and worst case generalization. This section
also considers the generalization capabilities of stochastic neural
networks.
4.8.1 Generalization Capabilities of Deterministic
Networks
One important performance measure of trainable neural
networks is the size of the training set needed to bound their
generalization error below some specified number. Schwartz et
al. (1990) gave a theoretical framework for calculating the average
probability of correct generalization for a neural net trained
with a training set of size m. Here, the averaging is
over all possible networks (of fixed architecture) consistent
with the training set, and the only assumptions about the network's
architecture is that it is deterministic (employs deterministic
units) and that it is a universal architecture (or faithful model)
of the class of functions/mappings being learned. It is assumed
that these functions are of the form f : Rn
{0, 1}, but the ideas can be extended to multiple and/or continuous-valued
outputs as well.
The following analysis is based on the theoretical
framework of Schwartz et al. (1990), and it also draws on some
clarifications given by Hertz et al. (1991). The main result
of this analysis is rather surprising; one can calculate the average
probability of correct generalization for any training set of
size m if one knows a certain function that can (in theory)
be calculated before training begins. However, it should be kept
in mind that this result is only meaningful when interpreted in
an average sense, and does not necessarily represent the typical
situation encountered in a specific training scheme.
Consider a class of networks with a certain fixed
architecture specified by the number of layers, the number of
units within each layer, and the interconnectivity pattern between
layers. Let us define the quantity V0
as
(4.8.1)
which stands for the total "volume" of
the weight space, where w represents the weights of an
arbitrary network, and is some a priori weight probability
density function. Thus, each network is represented as a point
w in weight space which implements a function fw(x).
We may now partition the weight space into a set of disjoint
regions, one for each function fw
that this class of networks can implement. The volume of the
region of weight space that implements a particular function f(x)
is given by
(4.8.2)
where
Each time an example {xk, fd(xk)}
of a desired function fd
is presented and is successfully learned (supervised learning
is assumed), the weight vector w is modified so that it
enters the region of weight space that is compatible with the
presented example. If m examples are learned, then the
volume of this region is given by
(4.8.3)
where
The region Vm
represents the total volume of weight space which realizes the
desired function fd
as well as all other functions f
that agree with fd
on the desired training set. Thus, if
a new input is presented to the trained network it will be ambiguous
with respect to a number of functions represented by Vm
(recall the discussion on ambiguity for the simple case of a single
threshold gate in Section 1.5). As the number of learned examples
m is increased, the expected ambiguity decreases.
Next, the volume of weight space consistent with
both the training examples and a "particular" function
f is given by
(4.8.4)
where Equation (4.8.2) was used. Note that fw
in I has been replaced by f and that the product
term factors outside of the integral. Now, assuming independent
input vectors xk
generated randomly from distribution p(x), the factors
I(f, xk)
in Equation (4.8.4) are independent. Thus, averaging Vm(f )
over all xk
gives
(4.8.5)
The quantity g(f ) takes on values
between 0 and 1. It is referred to as the generalization ability
of f ; i.e., g(f ) may be viewed
as the probability that for an input x randomly chosen
from p(x). As an example, for a completely specified
n-input Boolean function fd,
g(f ) is given by (assuming that all 2n
inputs are equally likely)
where dH(f,
fd) is the
number of bits by which f and fd
differ.
Let us define the probability Pm(f )
that a particular function f can be implemented after training
on m examples of fd.
This probability is equal to the average fraction of the remaining
weight space that f occupies:
(4.8.6)
The approximation in Equation (4.8.6) is based on
the assumption that Vm
does not vary much with the particular training sequence; i.e.,
Vm
for each probable sequence. This assumption is expected to be
valid as long as m is small compared to the total number
of possible input combinations.
Good generalization requires that Pm(f )
be small. Let us use Equation (4.8.6) to compute the distribution
of generalization ability g(f ) across all
possible f 's after successful training with m
examples:
(4.8.7)
Note that an exact m(g)
can be derived by dividing the right-hand side of Equation (4.8.7)
by . The above result is interesting since it allows us to compute,
before learning, the distribution of generalization ability after
training with m examples. The form of Equation (4.8.7)
shows that the distribution m(g)
tends to get concentrated at higher and higher values of g
as more and more examples are learned. Thus, during learning,
although the allowed volume of weight (or function) space shrinks,
the remaining regions tend to have large generalization ability.
Another useful measure of generalization is the
average generalization ability G(m) given by
(4.8.8)
which is the ratio between the m + 1
and the mth moments of 0(g)
and can be computed if 0(g)
is given or estimated. G(m) gives the entire "learning
curve"; i.e., it gives the average expected success rate
as a function of m. Equation (4.8.8) allows us to predict
the number of examples, m, necessary to train the network
to a desired average generalization performance. We may also
define the average prediction error as 1 - G(m).
The asymptotic behavior (m ) of the average
prediction error is determined by the form of the initial distribution
0(g) near g = 1.
If a finite gap between g = 1 and the next
highest g for which 0(g)
is nonzero exists, then the prediction error decays to 1 exponentially
as . If, on the other hand, there is no such gap in 0(g),
then the prediction error decays as . These two behaviors of
the learning curve have also been verified through numerical experiments.
The nature of the gap in the distribution of generalizations
near the region of perfect generalization (g = 1)
is not completely understood. These gaps have been detected in
experiments involving the learning of binary mappings (Cohn and
Tesauro, 1991 and 1992). It is speculated that such a gap could
be due to the dynamic effects of the learning process where the
learning algorithm may, for some reason, avoid the observed near-perfect
solutions. Another possibility is that the gap is inherent in
the nature of the binary mappings themselves.
The above approach, though theoretically interesting,
is of little practical use for estimating m, since it requires
the knowledge of the distribution 0(g),
whose estimation is computationally expensive. It also gives
results that are only valid in an average sense, and does not
necessarily represent the typical situation encountered in a specific
training scheme.
Next, we summarize a result that tells us about
the generalization ability of a deterministic feedforward neural
network in the "worst" case. Here, also, the case of
learning a binary-valued output function f :Rn {0, 1}
is treated. Consider a set of m labeled training example
pairs (x, y) selected randomly from some arbitrary
probability distribution p(x, y), with x
Rn and y = f(x)
{0, 1}. Also, consider a single hidden layer feedforward neural
net with k LTG's and d weights that has been trained
on the m examples so that at least a fraction 1 - , where
, of the examples are correctly classified. Then, with a probability
approaching 1, this network will correctly classify the fraction
1 - of future random test examples drawn from p(x,
y), as long as (Baum and Haussler, 1989)
(4.8.9)
Ignoring the log term, we may write Equation (4.8.9),
to a first order approximation, as
(4.8.10)
which requires m d for good generalization.
It is interesting to note that this is the same condition for
"good" generalization (low ambiguity) for a single LTG
derived by Cover (1965) (refer to Section 1.5) and obtained empirically
by Widrow (1987). Therefore, one may note that in the limit of
large m the architecture of the network is not important
in determining the worst case generalization behavior; what matters
is the ratio of the number of degrees of freedom (weights) to
the training set size. On the other hand, none of the above theories
may hold for the case of a small training set. In this later
case, the size and architecture of the network and the learning
scheme all play a role in determining generalization quality (see
the next chapter for more details). It should also be noted that
the architecture of the net can play an important role in determining
the speed of convergence of a given class of learning methods,
as discussed later in Section 4.9.
Similar results for worst case generalization are
reported in Blumer et al. (1989). A more general learning curve
based on statistical physics and VC dimension theories (Vapnik
and Chervonenkis, 1971) which applies to a general class of networks
can be found in Haussler et al. (1992). For generalization results
with noisy target signals the reader is referred to Amari et al.
(1992).
4.8.2 Generalization in Stochastic Networks
This section deals with the asymptotic learning
behavior of a general stochastic learning dichotomy machine (classifier).
We desire a relation between the generalization error and the
training error in terms of the number of free parameters of the
machine (machine complexity) and the size of the training set.
The results in this section are based on the work of Amari and
Murata (1993). Consider a parametric family of stochastic machines
where a machine is specified by a d-dimensional parameter
vector w such that the probability of output y,
given an input x, is specified by P(y | x,
w). As an example, one may assume the machine to be a
stochastic multilayer neural network parameterized by a weight
vector w Rd,
which, for a given input x Rn,
emits a binary output with probability
(4.8.11)
and
where
(4.8.12)
and g(x, w) may be considered
as a smooth deterministic function (e.g., superposition of multivariate
sigmoid functions typically employed in layered neural nets).
Thus, in this example, the stochastic nature of the machine is
determined by its stochastic output unit.
Assume that there exists a true machine that can
be faithfully represented by one of the above family of stochastic
machines with parameter w0.
The true machine receives inputs xk,
k = 1, 2, ..., m, which are randomly generated according
to a fixed but unknown probability distribution p(x),
and emits yk.
The maximum likelihood estimator (refer to Section 3.1.5 for
definition) characterized by the machine will be our first candidate
machine. This machine predicts y for a given x
with probability P(y | x, ). An entropic
loss function is used to evaluate the generalization of a trained
machine for a new example (xm+1, ym+1).
Let gen
be the average predictive entropy (also known as the average entropic
loss) of a trained machine parameterized by for a new example
(xm+1, ym+1):
(4.8.13)
Similarly, we define train
as the average entropic loss over the training examples used to
obtain
(4.8.14)
Finally, let H0
be the average entropic error of the true machine
(4.8.15)
Amari and Murata proved the following theorem for
training and generalization error.
Theorem 4.8.1 (Amari and Murata, 1993):
The asymptotic learning curve for the entropic training error
is given by
(4.8.16)
and for the entropic generalization error by
(4.8.17)
The proof of Theorem 4.8.1 uses standard techniques
of asymptotic statistics and is omitted here. (The reader is
referred to the original paper by Amari and Murata (1993) for
such proof).
In general, H0
is unknown and it can be eliminated from Equation (4.8.17) by
substituting its value from Equation (4.8.16). This gives
(4.8.18)
which shows that for a faithful stochastic machine
and in the limit of m d, the generalization error
approaches that of the trained machine on m examples, which
from Equation (4.8.16) is the classification error H0
of the true machine.
Again, the particular network architecture is of
no importance here as long as it allows for a faithful realization
of the true machine and m >> d.
It is interesting to note that this result is similar to the
worst case learning curve for deterministic machines [Equation
(4.8.10)], when the training error is zero. The result is also
in agreement with Cover's result on classifier ambiguity in Equation
(1.5.3), where the term in Equation (4.8.18) may be viewed as
the probability of ambiguous response on the m + 1
input. In fact, Amari (1993) also proved that the average predictive
entropy gen(m)
for a general deterministic dichotomy machine (e.g., feedforward
neural net classifier) converges to 0 as in the limit m .
4.9 Complexity of Learning
This section deals with the computational complexity
of learning: How much computation is required to learn (exactly
or approximately to some "acceptable" degree) an arbitrary
mapping in a multilayer neural network? In other words, is there
an algorithm that is computationally "efficient" for
training layered neural networks? Here, we will assume that the
desired learning algorithm is a supervised one, which implies
that the training set is labeled. Also, we will assume that the
neural network has an arbitrary architecture but with no feedback
connections.
Learning in artificial neural networks is hard.
More precisely, loading of an arbitrary mapping onto a "faithful"
neural network architecture requires exponential time irrespective
of the learning algorithm used (batch or adaptive). Judd (1987,
1990) showed that the learning problem in neural networks is NP-complete,
even for approximate learning; i.e., in the worst case, we will
not be able to do much better than just randomly exhausting all
combinations of weight settings to see if one happens to work.
Therefore, as the problem size increases (that is, as the input
pattern dimension or the number of input patterns increases) the
training time scales up exponentially in the size of the problem.
Moreover, it has been shown (Blum and Rivest, 1989) that training
a simple n-input, three-unit two layer net of LTG's can
be NP-complete in the worst-case when learning a given set of
examples. Consider the class of functions defined on a collection
of m arbitrary points in Rn.
It has been shown that the problem of whether there exist two
hyperplanes that separate them is NP-complete (Megiddo, 1986);
i.e., the training of a net with two-hidden n-input LTG's
and a single output LTG on examples of such functions is exponential
in time, in the worst case, even if a solution exists. Blum and
Rivest (1992) extend this result to the case of Boolean functions.
They also showed that learning Boolean functions with a two-layer
feedforward network of k-hidden units (k bounded
by some polynomial in n) and one output unit (which computes
the AND function) is NP-complete.
However, these theoretical results do not rule out
the possibility of finding a polynomial-time algorithm for the
training of certain classes of problems onto certain carefully
selected architectures. Blum and Rivest (1992) gave an example
of two networks trained on the same task, such that training the
first is NP-complete but the second can be trained in polynomial
time. Also, the class of linearly separable mappings can be trained
in polynomial time if single layer LTG nets are employed (only
a single unit is needed if the mapping has a single output).
This is easy to prove since one can use linear programming (Karmarkar,
1984) to compute the weights and thresholds of such nets in polynomial
time. One can also use this fact and construct layered networks
that have polynomial learning time complexity for certain classes
of non-linearly separable mappings. This is illustrated next.
Consider a set F of non-linearly separable
functions or which has the following two properties: (1) There
exists at least one layered neural net architecture for which
loading m training pairs {x, yd}
of f(x) is NP-complete; (2) there exists a fixed
dimensionality expansion process D that maps points x in
Rn to points z
in Rd, such that
d is bounded by some polynomial in n [e.g., ], and
that the m training examples {z, yd}
representing f (x) in the expanded space Rd
are linearly separable. This set F is not empty; Blum
and Rivest (1992) gave examples of functions in F. Figure
4.9.1 depicts a layered architecture which can realize any function
in F. Here, a fixed preprocessing layer, labeled D in
Figure 4.9.1, implements the above dimensionality expansion process.
The output node is a d-input LTG. It can be easily shown
that the learning complexity of this network for functions in
F is polynomial. This can be seen by noting that the training
of the trainable part of this network (the output LTG) has polynomial
complexity for m linearly separable examples in Rd
and that as n increases, d remains polynomial in
n.
Figure 4.9.1. A layered architecture consisting of
a fixed preprocessing layer D followed by an adaptive LTG.
The efficiency of learning linearly separable classification
tasks in a single threshold gate should not be surprising. We
may recall from Chapter One that the average amount of necessary
and sufficient information for the characterization of the set
of separating surfaces for a random, separable dichotomy of m
points grows slowly with m and asymptotically approaches
2d (twice the number of degrees of freedom of the class
of separating surfaces). This implies that for a random set of
linear inequalities in d unknowns, the expected number
of extreme inequalities which are necessary and sufficient to
cover the whole set, tends to 2d as the number of consistent
inequalities tends to infinity, thus bounding the (expected) necessary
number of training examples for learning algorithms in separable
problems. Moreover, this limit of 2d consistent inequalities
is within the learning capacity of a single d-input LTG.
Another intuitive reason that the network in Figure
4.9.1 is easier to train than a fully adaptive two layer feedforward
net is that we are giving it predefined nonlinearities. The former
net does not have to start from scratch, but instead is given
more powerful building blocks to work with. However, there is
a trade off. By using the network in Figure 4.9.1, we gain in
a worst-case computational sense, but lose in that the number
of weights increases from n to O(n2)
or higher. This increase in the number of weights implies that
the number of training examples must increase so that the network
can meaningfully generalize on new examples (recall the results
of the previous section).
The problem of NP-complete learning in multilayer
neural networks may be attributed to the use of fixed network
resources (Baum, 1989). Learning an arbitrary mapping can be
achieved in polynomial time for a network that allocates new computational
units as more patterns are learned. Mukhopadhyay et al. (1993)
gave a polynomial-time training algorithm for the general class
of classification problems (defined by mappings of the form )
based on clustering and linear programming models. This algorithm
simultaneously designs and trains an appropriate network for a
given classification task. The basic idea of this method is to
cover class regions with a minimal number of dynamically allocated
hyperquadratic volumes (e.g., hyperspheres) of varying size.
The resulting network has a layered structure consisting of a
simple fixed preprocessing layer, a hidden layer of LTG's, and
an output layer of logical OR gates. This and other efficiently
trainable nets are considered in detail in Section 6.3.
(4.1.1)
is any appropriate norm. For the case
where
is the Euclidean norm, it is well
known that an appropriate criterion function J is
(4.1.2)
according to
(4.1.3)
is the quantity to be
minimized subject to the regularization constraint that
is kept "small" with the Lagrange multiplier determining
the degree of regularization. These ideas can also be extended
to unsupervised learning where the
term
may be thought of as a constraint satisfaction term; such a term
may help condition the criterion
so that
the search process is stabilized. Examples of this regularization
strategy have already been encountered in the unsupervised Hebbian-type
learning rules presented in Chapter 3 [e.g., refer to Equation
(3.3.8)].
.
Statistical analysis of the continuous-time version of the general
learning equation is then performed for selected learning rules,
including correlation, LMS, and Hebbian learning rules.
(4.2.1)
(4.2.2)
.
Here, r(w, x, z) is referred
to as a "learning signal." One can easily verify that
the above two equations lead to discrete-time and continuous-time
versions, respectively, of the perceptron learning rule in Equation
(3.1.2) if = 0, y = sgn(wTx),
and r(w, x, z) = z -
y (here z is taken as bipolar binary). The -LMS (or
Widrow-Hoff) rule of Equation (3.1.35) can be obtained by setting
= 0 and r(w, x, z) = z - wTx
in Equation (4.2.1). Similarly, substituting
,
= 1, and r(w, x, z)
= z in Equation (4.2.1) lead to the simple correlation rule
in Equation (3.1.50), and = 0 and r(w, x, z)
= y leads to the Hebbian rule in Equation (3.3.1). In the
remainder of this section, Equation (4.2.2) is adopted and is
referred to as the "general learning equation." Note
that in Equation (4.2.2) the state w* = 0
is an asymptotically stable equilibrium point if either r(w, x, z)
and/or x are identically zero. Thus, the term -w
in Equation (4.2.2) plays the role of a "forgetting term"
which tends to "erase" those weights not receiving sufficient
reinforcement during learning.
(4.2.3)
(4.2.4)
. This type of
criterion function (which has the classic form of a potential
function) is appropriate for learning rules such as perceptron,
LMS, Hebbian, and correlation rules. In the most general case,
however, r(w, x, z) r(wTx,
z), and a suitable criterion function J satisfying
Equation (4.2.3) may not readily be determined (or may not even
exist). It is interesting to note that finding the equilibrium
points w* of the
general learning equation does not require the knowledge of J
explicitly if the gradient of J is known.
subject to regularization which penalizes
solution vectors w*
with large norm. It is interesting to note that by maximizing
, one is actually maximizing the amount
of information learned from a given example pair {x, z}.
In other words, the general learning equation is designed so
that it extracts the maximum amount of "knowledge" present
in the learning signal r(w, x, z).
becomes proportional
to the average gradient of the instantaneous criterion function
J. Formally, we write
(4.2.5)
implies averaging
over all possible inputs x with respect to the probability
distribution p(x). We will refer to this equation
as the "average learning equation." Equation (4.2.5)
may be viewed as a steepest gradient descent search for w*
which (locally) minimizes the expected criterion function,
,
because the linear nature of the averaging operation allows us
to express Equation (4.2.5) as
(4.2.6)
=
(4.2.7)
, where
max is the largest eigenvalue
of the Hessian matrix H = J, evaluated
at the current point in the search space (the proof of this statement
is outlined in Problem 4.3.8). These discrete-time "learning
rules" and their associated average learning equations have
been extensively studied in more general context than that of
neural networks. The book by Tsypkin (1971) gives an excellent
treatment of these iterative learning rules and their stability.
(4.2.8)
(4.2.9)
= 0,
one arrives at the (only) equilibrium point
(4.2.10)
, as
(4.2.11)
are strictly negative. This makes the system
locally asymptotically stable at the equilibrium solution w*
by virtue of Liapunov's first method (see Gill et al., 1981; Dickinson,
1991). Thus, w*
is a stable equilibrium of Equation (4.2.9). In fact, the positive
definite Hessian implies that w*
is a minimum of
, and therefore the gradient
system
converges globally and asymptotically
to w*, its only
minimum from any initial state. Thus, the trajectory w(t)
of the stochastic system in Equation (4.2.8) is expected to approach
and then fluctuate about the state
.
(4.2.12)
.
Here, the regularization term is needed in order to keep the
solution bounded.
(4.2.13)
(4.2.14)

(4.2.15)
, defined in Equation (3.3.4),
and
. If we have
,
then
is the equilibrium state. Note
that w* approaches
the minimum SSE solution in the limit of a large training set,
and that this analysis is identical to the analysis of the -LMS
rule in Chapter 3. Let us now check the stability of w*.
The Hessian matrix is
(4.2.16)
0.
Therefore, w* = C-1P
is the only (asymptotically) stable solution for Equation (4.2.14),
and the stochastic dynamics in Equation (4.2.13) are expected
to approach this solution.

(4.2.17)
(4.2.18)
(4.2.19)
in Equation (4.2.19)
leads to the equilibria
(4.2.20)
as an eigenvalue then w*
will be the eigenvector of C corresponding to
.
In general, though,
will not be an eigenvalue
of C, so Equation (4.2.19) will have only one equilibrium
at w* = 0.
This equilibrium solution is asymptotically stable if
is greater than the largest eigenvalue of C since this
makes the Hessian
(4.2.21)
(4.2.22)
is not adequate here to stabilize the Hebbian rule at a solution
which maximizes y2.
However, other more appropriate regularization terms can insure
stability, as we will see in the next section.
is averaged with respect to
x (denoted by
), with w
replaced by its mean
. This assumption
leads to the "approximate" average learning equation
(4.3.1)
is normally assumed). Roughly, the higher the mixing rate the
better the approximation in Equation (4.3.1) (Kushner and Clark,
1978). We shall frequently employ Equation (4.3.1) in the remainder
of this chapter.
of the associated
average equation. It has been shown (Geman, 1979) that under
strong mixing conditions (and some additional assumptions),
.
This result implies that if sufficiently small learning rates
are used, the behavior of a stochastic learning equation may be
well approximated, in a mean-square sense, by the deterministic
dynamics of its corresponding average equation. Oja (1983) pointed
out that the convergence of constrained gradient descent- (or
ascent-) based stochastic learning equations (the type of equations
considered in this chapter) can be studied with averaging techniques;
i.e., the asymptotically stable equilibria of the average equation
are the possible limits of the stochastic equation. Several examples
of applying the averaging technique to the characterization of
learning rules can be found in Kohonen (1989).
(Ljung, 1977; Kushner and Clark, 1978).
(4.3.2)
(4.3.3)
is
dropped in order to simplify notation. Now, the average gradient
of J in Equation (4.3.3) may be written as
(4.3.4)
(4.3.5)
is minimized by
maximizing the unit's output variance. Again, C =
is the autocorrelation matrix, which is positive semidefinite
having orthonormal eigenvectors c(i)
with corresponding eigenvalues
. That
is, Cc(i) = ic(i)
for i = 1, 2, ..., n.
The dynamics of Equation (4.3.3) are unstable. To see this, we
first find the equilibrium points by setting
= 0, giving C
= 0 or w* =
0. w* is
unstable because the Hessian (in an average sense)
=
= -C is non-positive for all
. Therefore, Equation (4.3.3) is unstable
and results in
. Note, however, that
the direction of
is not random;
it will tend to point in the direction of c(1),
since if one assumes a fixed weight vector magnitude,
is minimized when
is parallel to the
eigenvector with the largest corresponding eigenvalue.
,
, and
,
respectively. We will refer to
as simply
the criterion function, to
as the gradient
of J, and to
as the Hessian of
J. Also, the quantity w in the following average
equations should be interpreted as the state of the average learning
equation.
(4.3.6)
(e.g. see Johnson and Wichern, 1988).
Also, for any real symmetric n n matrix A,
the Rayleigh quotient
satisfies
where 1 and n
are the largest and smallest eigenvalues of A, respectively.
Let us start from the above criterion and derive an average learning
equation. Employing Equation (4.3.1), we get
(4.3.7)
(4.3.8)
for the two
terms in the above equation,
Equation (4.3.8) reduces to the continuous version of Oja's rule
[refer to Equation (3.3.6)]. Let us continue with the characterization
of (4.3.8) and defer the analysis of Oja's rule to Section 4.3.3.
At equilibrium, Equation (4.3.7) gives
(4.3.9)
at w* = c(1).
This can be easily verified by direct substitution in Equation
(4.3.6).
(4.3.10)
terms in Equation
(4.3.8) equal to 1. This latter case is considered next.
(4.3.11)
(4.3.12)
(4.3.13)
.
This can be seen by noting that the Hessian
(4.3.14)
,
with
given in Equation (4.3.12). Although
J is not known, the positive definiteness of H can
be seen from
(4.3.15)
(we assume 1 2).
Therefore, Oja's rule is equivalent to a stable version of the
Hebbian rule given in Equation (4.3.8). A formal derivation of
Oja's rule is explored in Problem 4.3.7.
, where
is a fixed vector (without loss of generality let
)
and v is a vector of symmetrically distributed zero-mean
noise with uncorrelated components having variance 2.
Then
. The largest eigenvalue of C
is 1 + 2 and
the corresponding eigenvector is
. Oja's
unit then becomes a matched filter for the data, since
in Equation (4.3.12). Here, the unit responds maximally to the
data mean. Further characterization of Oja's rule can be found
in Xu (1993).
(4.3.16)
does depend on w via y = wTx.
However, this dependence does not violate the concept of locality.
has the following set of eigenvalues: {2.561, 2.254, 2.081, 1.786,
1.358, 1.252, 1.121, 0.963, 0.745, 0.633, 0.500, 0.460, 0.357,
0.288, 0.238}.
is replaced by
wk+1 - wk.
A learning rate = 0.005 is used. This is equivalent
to integrating these equations with Euler's method (e.g., see
Gerald, 1978) using a time step t = 0.005 and
= 1. The initial weight vector is set equal to one
of the training vectors. Figures 4.3.1 (a) and (b) show the evolution
of the cosine of the angle between wk
and c(1) and the
evolution of the norm of wk,
respectively. The solid line corresponds to the stochastic rule
and the dashed line corresponds to the average rule.
or
with k 1) can be used to force the convergence
of the direction of wk
in the first simulation. It is also interesting to note that
better approximations are possible when the training vectors are
presented in a fixed deterministic order (or in a random order
but with each vector guaranteed to be selected once every training
cycle of m = 40 presentations). Here, a sufficiently
small, constant learning rate is sufficient for making the average
dynamics approximate, in a practical sense, the stochastic dynamics
for all time.
(4.3.17)
(4.3.18)
(4.3.19)
(4.3.20)
(4.3.21)
(4.3.22)
(4.3.23)
are the only stable equilibria, and
the dynamics of the stochastic equation are expected to approach
w*.
(4.3.24)
to be satisfied,
we may utilize J given by
(4.3.25)
(4.3.26)
(4.3.27)
, must be
parallel to one of the eigenvectors of C, say c(i),
(4.3.28)
(4.3.29)
approaches one for all equilibrium
points. Thus, the equilibria of Equation (4.3.26) approach unity
norm eigenvectors of the correlation matrix C when is
large.
(4.3.30)
(4.3.31)
is positive
definite if and only if
is parallel to
c(1) and > 1.
Therefore, the only stable equilibria of Equation (4.3.26) are
which approach c(1)
for >> 1.
Like the Yuille et al. rule, this rule preserves the information
about the size of 1 [1
can be computed from Equation (4.3.29)].
(4.3.32)
(4.3.33)
in Equation (4.3.33) need not be normalized to prevent divergence.
is replaced by wk+1 - wk
and w(t) by wk.
Here, the stability of these discrete-time stochastic dynamical
systems critically depends on the value of the learning rate .
Although the stability analysis is difficult, one can resort
to the discrete-time versions of the average learning equations
corresponding to these rules and derive conditions on for asymptotic
convergence (in the mean) in the neighborhood of equilibrium states
w*. Such analysis
is explored in Problems 4.3.8 and 4.3.9.
w, respectively. Therefore,
one may think of weight decay as a way of stabilizing unstable
learning rules. However, one should carefully design the gain
coefficient in the weight decay term for proper performance.
For example, it has been shown earlier that a simple positive
constant gain in Equation (4.2.18) does not stabilize Hebb's
rule. On the other hand, the nonlinear dynamic gains y2,
||w||2, and
lead to stability. Note that the weight decay gain in Oja's rule
utilizes more information, in the form of y2,
than the Yuille et al. rule or Hassoun's rule. The regularization
in these latter rules is only a function of the current weight
vector magnitude.
(4.4.1)
(4.4.2)
(4.4.3)
, the
principal eigenvector of the autocorrelation matrix of the input
data (assuming zero mean input vectors). Equation (4.4.3) is
Oja's rule with the added inhibitory term
.
Next, we will assume a sequential operation of the two-unit net
where unit 1 is allowed to fully converge before evolving unit
2. This mode of operation is permissible since unit 1 is independent
of unit 2.
(4.4.4)
(4.4.5)
(4.4.6)
and
with i = 2, 3, ..., n are solutions. Note that
the point
is not an equilibrium. The
Hessian is given by
(4.4.7)
is not positive definite,
the equilibrium w* = 0
is not stable. For the remaining equilibria we have the Hessian
matrix
(4.4.8)
,
assuming 2 3.
Thus, Equation (4.4.5) converges asymptotically to the unique
stable vector
which is the eigenvector
of C with the second largest eigenvalue 2.
Similarly, for a network with m interacting units according
to Equation (4.4.1), the ith unit (i = 1, 2, ..., m)
will extract the ith eigenvector of C.
(4.5.1)
(4.5.2)
, whose
gradient is given by
(4.5.3)
and the output y is generated by a stochastic unit according
to the probability function P(y| w,
x), given by
(4.5.4)
(4.5.5)
(Williams, 1987; Hertz et al.,
1991).
, for the kth input vector
with respect to all possible outputs y as
(4.5.6)
is given by
(4.5.7)
(4.5.8)
(4.5.9)
(4.5.10)
(4.5.11)
(4.5.12)
(4.5.13)
is proportional
to
in Equation (4.5.3) and has an opposite
sign. Thus, Equation (4.5.2) can be written in terms of
as
(4.5.14)
; i.e., Equation
(4.5.1) converges, on average, to a solution that locally maximizes
the average reinforcement signal.
. However, the above analysis
should give some insight into the behavior of simple reinforcement
learning.
(4.6.1)
(4.6.2)
is the weight vector
of the winner unit upon the presentation of the input vector xk.
Here, all vectors xk
are assumed to be equally probable. In general, a probability
of occurrence of xk,
P(xk),
should be inserted inside the summation in Equation (4.6.2).
An alternative way of expressing J in Equation (4.6.2)
is through the use of a "cluster membership matrix"
M defined for each unit i = 1, 2, ..., n
by
(4.6.3)
is a dynamically evolving
function of k and i which specifies whether or not
unit i is the winning unit upon the presentation of input
xk. The
cluster membership matrix allows us to express the criterion function
J as
(4.6.4)
(4.6.5)
within a cluster exceeds the maximum overlap between that cluster
and any other cluster. In practice, a damped learning rate k
is used (e.g.,
, where
and 0 is a positive constant)
in order to stop weight evolution at one of the local solutions.
Here, the relatively large initial learning rate allows for wide
exploration during the initial phase of learning.
by
in Equation (4.6.4). This causes
the winning weight vector to be repelled by input vectors in other
clusters, while being attracted by its own cluster, which enhances
convergence. Another example is to employ a different similarity
measure (norm) in J such as the Minkowski-r norm
of Equation (3.1.68), which has the ability to reduce the effects
of outlier data points by proper choice of the exponent r.
Other criterion functions may also be employed; the reader is
referred to Bachmann et al. (1987) for yet another suitable criterion
function.
(4.6.6)
and typically,
. Also, the weight normalization
is assumed for all units. It can be easily verified that Equation
(4.6.6) preserves this weight normalization at any iteration (this
was explored in Problem 3.4.1).
(4.6.7)
is the conditional
probability that unit i wins when input xk
is presented. Now, using Equation (4.6.6) in Equation (4.6.7)
we get
(4.6.8)
(4.6.9)
(4.6.10)
is the probability that
(active) and unit i is a winner.
Thus, assuming that all patterns have the same number of active
bits (i.e.,
for all k), we may
employ Bayes' rule
and write Equation
(4.6.10) as
(4.6.11)
[assuming the equilibrium weight values given by Equation (4.6.9)],
unit i will have a weighted sum (activity) of
(4.6.12)
(4.6.13)
represents the overlap
between stimulus
and the kth training
pattern xk.
Thus, at equilibrium, a unit responds most strongly to patterns
that overlap other patterns to which the unit responds and most
weakly to patterns that are far from patterns to which it responds.
Note that we may express the conditional probability
according to the winner-take-all mechanism
(4.6.14)
on wi, there
are many solutions which satisfy the equilibrium relation given
in Equation (4.6.9).
become stable (fluctuate minimally), and therefore,
becomes stable. A sequence of stimuli might, however, be presented
in such a way as to introduce relatively large fluctuations in
the wi's.
In this case, the system might move to a new equilibrium state
which is, generally, more stable in the sense that
becomes unlikely to change values for a very long period of time.
Rumelhart and Zipser (1985) gave a measure of the stability of
an equilibrium state as the average amount by which the output
of the winning units is greater than the response of all of the
other units averaged over all patterns and all clusters. This
stability measure is given by
(4.6.15)
(4.6.16)
(4.7.1)
is the neighborhood function that
was introduced in Section 3.5. It can be seen that Equation (4.7.1)
is an extension of the competitive learning criterion function
of Equation (4.6.2). Performing gradient descent on (4.7.1) yields
(4.7.2)
(4.7.3)
(4.7.4)
(4.7.5)
(4.7.6)
around point
r. This verifies the density preserving feature of the
map. Ideally, however, we would have
for zero distortion. Therefore, a self-organizing feature map
tends to under sample high probability regions and over sample
low probability ones.
(4.7.7a)
(4.7.7b)
with 0 < 1 ensures convergence. For laws with > 1
or exponential decay laws, Equation (4.7.7a) is not fulfilled,
and some residual error remains even in the limit t .
It can also be shown that during convergence, the map first becomes
untangled and fairly even, and then moves into a refinement phase
where it adapts to the details of p(x). Occasionally,
the "untangling" phase can slow convergence, because
some types of tangle (e.g., kinks and twists) can take a long
time to untangle. Geszti (1990) suggested the use of a strongly
asymmetric neighborhood function in order to speed up learning
by breaking the symmetry effects responsible for slow untangling
of kinks and twists.
from r, and makes inhibitory connections with all other
units.

(4.7.8)
(4.7.9)
and the input stimuli wTx
due to the input signal x Rn.
A conceptual diagram for the above neural field is shown in Figure
4.7.2.

or
(4.7.10)
(4.7.11)
is the neural field's
equilibrium output activity due to input x. In Equation
(4.7.11), we use the earlier assumption ' . Next, we assume
strong mixing in Equation (4.7.11) which allows us to write an
expression for the average learning equation (absorbing ' in
and in ) as
(4.7.12)
(4.7.13)
(4.7.14)
(4.7.15)
with A(a) as defined
below. Local excitations (also known as a-solutions) where
u*(r) is
positive only over a finite interval [a1,
a2] of the neural
field are also possible. An a-solution exists if and only
if h + A(a) = 0. Here,
A(a) is the definite integral defined by
(4.7.16)

(4.7.17)
in Equation (4.7.17) models a natural "transient" neighborhood
function. It represents a weighted-sum of the output activities
yl of nearby
units, which describes the strength of the diffuse chemical effect
of cell l on cell i; hil
is a function of the distance of these units. This weighted-sum
of output activities replaces the output activity of the same
cell, yi,
in Hebb's learning rule. On the other hand, the
term acts as a stabilizing term which models "forgetting"
effects, or disturbance due to adjacent synapses. Typically,
forgetting effects in wi
are proportional to the weight wi
itself. In addition, if the disturbance caused by synaptic site
r is mediated through the post synaptic potential, the
forgetting effect must further be proportional to wirxr.
This phenomenon is modeled by
in the
forgetting term. Here, the summation is taken over a subset of
the synapses of unit i that are located near the jth
synapse wij,
and approximately act as one collectively interacting set. The
major difference between the learning rules in Equation (4.7.17)
and (4.7.12) is that, in the former, the "neighborhood"
function is determined by a "transient" activity due
to a diffusive chemical effect of nearby cell potentials, whereas
it is determined by a stable region of neural field potential
in the latter.
as a positive scalar independent of w
and x, the vector form of Equation (4.7.18) takes the Riccati
differential equation form
(4.7.18)
(4.7.19)