4. MATHEMATICAL THEORY OF

NEURAL LEARNING

This chapter deals with theoretical aspects of learning in artificial neural networks. It investigates mathematically, the nature and stability of the asymptotic solutions obtained using the basic supervised, Hebbian and reinforcement learning rules, which were introduced in the previous chapter. Formal analysis is also given for simple competitive learning and self-organizing feature map learning.

A unifying framework for the characterization of various learning rules is presented. This framework is based on the notion that learning in general neural networks can be viewed as search, in a multidimensional space, for a solution which optimizes a prespecified criterion function, with or without constraints. Under this framework, a continuous-time learning rule is viewed as a first-order, stochastic differential equation/dynamical system, whereby the state of the system evolves so as to minimize an associated instantaneous criterion function. Approximation techniques are employed to determine, in an average sense, the nature of the asymptotic solutions of the stochastic system. This approximation leads to an "average learning equation" which, in most cases, can be cast as a globally, asymptotically stable gradient system whose stable equilibria are minimizers of a well-defined average criterion function. Finally, and subject to certain assumptions, these stable equilibria can be taken as the possible limits (attractor states) of the stochastic learning equation.

The chapter also treats two important issues associated with learning in a general feedforward neural network. These are learning generalization and learning complexity. The section on generalization presents a theoretical method for calculating the asymptotic probability of correct generalization of a neural network as a function of the training set size and the number of free parameters in the network. Here, generalization in deterministic and stochastic nets are investigated. The chapter concludes by reviewing some significant results on the complexity of learning in neural networks.

Learning in an artificial neural network, whether it is supervised, reinforcement, or unsupervised, can be viewed as a process of searching a multi-dimensional parameter space for a state which optimizes a predefined criterion function J (J is also commonly referred to as an error function, a cost function, or an objective function). In fact, all of the learning rules considered in Chapter 3, except Oja's rule (refer to Table 3.1), have well-defined analytical criterion functions. These learning rules implement local search mechanisms (i.e., gradient search) to obtain weight vector solutions which (locally) optimize the associated criterion function. Therefore, it is the criterion function that determines the nature of a learning rule. For example, supervised learning rules are designed so as to minimize an error measure between the network's output and the corresponding desired output, unsupervised Hebbian learning is designed to maximize the variance of the output of a given unit, and reinforcement learning is designed to maximize the average reinforcement signal.

Supervised learning can be related to classical approximation theory (Poggio and Girosi, 1990a). Here, the idea is to approximate or interpolate a continuous multivariate function g(x), from samples {x, g(x)}, by an approximation function (or class of functions) G(w, x), where w  Rd is a parameter vector with d degrees of freedom, and x belongs to a compact set   Rn. In this case, the set of samples {xg(x)}, x  , is referred to as a training set. The approximation problem is to find an optimal parameter vector that provides the "best" approximation of g on the set for a given class of functions G. Formally stated, we desire a solution w*  Rd such that

(4.1.1)

where the tolerance is a positive real number and is any appropriate norm. For the case where is the Euclidean norm, it is well known that an appropriate criterion function J is

(4.1.2)

whose global minimum represents the minimum sum of square error (SSE) solution. The choice of approximation function G, the criterion function J, and the search mechanism for w* all play critical roles in determining the quality and properties of the resulting solution/approximation.

Usually we know our objective and this knowledge is translated into an appropriate criterion function J. In terms of search mechanisms, gradient-based search is appropriate (and is simple to implement) for cases where it is known that J is differentiable and bounded. Of course, if gradient search is used, we must be willing to accept locally-optimal solutions. In general, only global search mechanisms, which are computationally very expensive, may lead to global optimal solutions (global search-based learning is the subject of Chapter 8). Among the above three factors, the most critical in terms of affecting the quality of the approximation in Equation (4.1.1) is the choice of approximation function G.

In classical analysis, polynomials and rational functions are typically used for function approximation. On the other hand, for artificial neural networks, the approximation functions are usually chosen from the class of smooth sigmoidal-type functions, and the approximation is constructed as a superposition of such sigmoidal functions. In general, there are two important issues in selecting an appropriate class of basis functions: universality and generalization. By universality, we mean the ability of G to represent, to any desired degree of accuracy, the class of functions g being approximated; in Chapter 2, we have established the universality of feedforward layered neural nets. On the other hand, generalization means the ability of G to correctly map new points x, drawn from the same underlying input distribution p(x), which were not seen during the learning phase. Thus, an interesting question here is how well does a neural network compare to other universal approximation functions in terms of generalization (some insight into answering this question is given in Section 5.2.5). Later in this chapter, we will see that for feedforward neural networks, the number of degrees of freedom (weights) in G plays a critical role in determining the degree of data overfitting, which directly affects generalization quality.

Another way to control generalization is through criterion function conditioning. A "regularization" term (Poggio and Girosi, 1990a) may be added to an initial criterion function according to

(4.1.3)

The ||P(w)||2 term in Equation (4.1.3) is used to imbed a priori information about the function g, such as smoothness, invariance, etc. In this case, is the quantity to be minimized subject to the regularization constraint that is kept "small" with the Lagrange multiplier determining the degree of regularization. These ideas can also be extended to unsupervised learning where the term may be thought of as a constraint satisfaction term; such a term may help condition the criterion so that the search process is stabilized. Examples of this regularization strategy have already been encountered in the unsupervised Hebbian-type learning rules presented in Chapter 3 [e.g., refer to Equation (3.3.8)].

In this section, instead of dealing separately with the various learning rules proposed in the previous chapter, we seek to study a single learning rule, called the general learning equation (Amari, 1977a and 1990), which captures the salient features of several of the different single-unit learning rules. Two forms of the general learning equation will be presented: a discrete-time version, in which the weight vector evolves according to a discrete dynamical system of the form w(k+1) = g(w(k)), and a continuous-time version, in which the weight vector evolves according to a smooth dynamical system of the form . Statistical analysis of the continuous-time version of the general learning equation is then performed for selected learning rules, including correlation, LMS, and Hebbian learning rules.

Consider a single unit which is characterized by a weight vector w Rn, an input vector x Rn, and, in some cases, a scalar teacher signal z. In a supervised learning setting, the teacher signal is taken as the desired target associated with a particular input vector. The input vector (signal) is assumed to be generated by an environment or an information source according to the probability density p(x, z), or p(x) if z is missing (as in unsupervised learning). Now, consider the following discrete-time dynamical process which governs the evolution of the unit's weight vector w

(4.2.1)

and the continuous-time version

(4.2.2)

where and are positive real, and . Here, r(wxz) is referred to as a "learning signal." One can easily verify that the above two equations lead to discrete-time and continuous-time versions, respectively, of the perceptron learning rule in Equation (3.1.2) if = 0, y = sgn(wTx), and r(wxz) = z - y (here z is taken as bipolar binary). The -LMS (or Widrow-Hoff) rule of Equation (3.1.35) can be obtained by setting  = 0 and r(wxz) = wTx in Equation (4.2.1). Similarly, substituting ,  = 1, and r(wxz) = z in Equation (4.2.1) lead to the simple correlation rule in Equation (3.1.50), and  = 0 and r(wxz) = y leads to the Hebbian rule in Equation (3.3.1). In the remainder of this section, Equation (4.2.2) is adopted and is referred to as the "general learning equation." Note that in Equation (4.2.2) the state w* = 0 is an asymptotically stable equilibrium point if either r(wxz) and/or x are identically zero. Thus, the term -w in Equation (4.2.2) plays the role of a "forgetting term" which tends to "erase" those weights not receiving sufficient reinforcement during learning.

From the point of view of analysis, it is useful to think of Equation (4.2.2) as implementing a fixed increment steepest gradient descent search of an instantaneous criterion function J, or formally,

(4.2.3)

For the case r(wxz) = r(wTx, z) = r(u, z), the right-hand side of Equation (4.2.2) can be integrated to yield

(4.2.4)

where = . This type of criterion function (which has the classic form of a potential function) is appropriate for learning rules such as perceptron, LMS, Hebbian, and correlation rules. In the most general case, however, r(wxz)  r(wTx, z), and a suitable criterion function J satisfying Equation (4.2.3) may not readily be determined (or may not even exist). It is interesting to note that finding the equilibrium points w* of the general learning equation does not require the knowledge of J explicitly if the gradient of J is known.

The criterion function J in Equation (4.2.4) fits the general form of the constrained criterion function in Equation (4.1.3). Therefore, we may view the task of minimizing J as an optimization problem with the objective of maximizing subject to regularization which penalizes solution vectors w* with large norm. It is interesting to note that by maximizing , one is actually maximizing the amount of information learned from a given example pair {xz}. In other words, the general learning equation is designed so that it extracts the maximum amount of "knowledge" present in the learning signal r(wxz).

In a stochastic environment where the information source is ergodic, the sequence of inputs x(t) is an independent stochastic process governed by p(x). The general learning equation in Equation (4.2.2) then becomes a stochastic differential equation or a stochastic approximation algorithm. The weight vector w is changed in random directions depending on the random variable x. From Equation (4.2.3), the average value of becomes proportional to the average gradient of the instantaneous criterion function J. Formally, we write

(4.2.5)

where implies averaging over all possible inputs x with respect to the probability distribution p(x). We will refer to this equation as the "average learning equation." Equation (4.2.5) may be viewed as a steepest gradient descent search for w* which (locally) minimizes the expected criterion function, , because the linear nature of the averaging operation allows us to express Equation (4.2.5) as

(4.2.6)

It is interesting to note that finding w* does not require the knowledge of J explicitly if the gradient of J is known. Equation (4.2.6) is useful from a theoretical point of view in determining the equilibrium state(s) and in characterizing the stochastic learning equation [Equation (4.2.2)] in an "average" sense. In practice, the stochastic learning equation is implemented and its average convergence behavior is characterized by the "average learning equation" given as

= (4.2.7)

The gradient system in Equation (4.2.6) has special properties that makes its dynamics rather simple to analyze. First, note that the equilibria w* are solutions to <J> = 0. This means that the equilibria w* are local minima, local maxima, and/or saddle points of <J>. Furthermore, it is a well established result that, for any  > 0, these local minima are asymptotically stable points (attractors) and that the local maxima are unstable points (Hirsch and Smale, 1974). Thus, one would expect the stochastic dynamics of the system in Equation (4.2.3), with sufficiently small , to approach a local minimum of <J>.

In practice, discrete-time versions of the stochastic dynamical system in Equation (4.2.3) are used for weight adaptation. Here, the stability of the corresponding discrete-time average learning equation (discrete-time gradient system) is ensured if 0 <  < , where max is the largest eigenvalue of the Hessian matrix H = J, evaluated at the current point in the search space (the proof of this statement is outlined in Problem 4.3.8). These discrete-time "learning rules" and their associated average learning equations have been extensively studied in more general context than that of neural networks. The book by Tsypkin (1971) gives an excellent treatment of these iterative learning rules and their stability.

By utilizing Equation (4.2.7), we are now ready to analyze some basic learning rules. These are the correlation, LMS, and Hebbian learning rules.

Correlation Learning

Here, r(w, x, z) = z, which represents the desired target associated with the input x. From Equation (4.2.2) we have the stochastic equation

(4.2.8)

which leads to the average learning equation

(4.2.9)

Now, by setting = 0, one arrives at the (only) equilibrium point

(4.2.10)

The stability of w* may now be systematically studied through the "expected" Hessian matrix < H(w*) > which is computed, by first employing Equations (4.2.5) and (4.2.9) to identify , as

(4.2.11)

This equation shows that the Hessian of <J> is positive definite; i.e., its eigenvalues are strictly positive or, equivalently, the eigenvalues of are strictly negative. This makes the system locally asymptotically stable at the equilibrium solution w* by virtue of Liapunov's first method (see Gill et al., 1981; Dickinson, 1991). Thus, w* is a stable equilibrium of Equation (4.2.9). In fact, the positive definite Hessian implies that w* is a minimum of , and therefore the gradient system converges globally and asymptotically to w*, its only minimum from any initial state. Thus, the trajectory w(t) of the stochastic system in Equation (4.2.8) is expected to approach and then fluctuate about the state .

From Equation (4.2.4), the underlying instantaneous criterion function J is given by

(4.2.12)

which may be minimized by maximizing the correlation zy subject to the regularization term . Here, the regularization term is needed in order to keep the solution bounded.

LMS Learning

For r(wxz) = z - wTx (the output error due to input x) and = 0, Equation (4.2.2) leads to the stochastic equation

(4.2.13)

In this case, the average learning equation becomes

(4.2.14)

with equilibria satisfying

or

(4.2.15)

Let C denote the positive semidefinite autocorrelation matrix , defined in Equation (3.3.4), and . If we have , then is the equilibrium state. Note that w* approaches the minimum SSE solution in the limit of a large training set, and that this analysis is identical to the analysis of the -LMS rule in Chapter 3. Let us now check the stability of w*. The Hessian matrix is

(4.2.16)

which is positive definite if 0. Therefore, w* = C-1P is the only (asymptotically) stable solution for Equation (4.2.14), and the stochastic dynamics in Equation (4.2.13) are expected to approach this solution.

Finally, note that with  = 0 Equation (4.2.4) leads to

or

(4.2.17)

which is the instantaneous SSE (or MSE) criterion function.

Hebbian Learning

Here, upon setting r(wxz) = y = wTx, Equation (4.2.2) gives the Hebbian rule with decay

(4.2.18)

whose average is

(4.2.19)

Setting in Equation (4.2.19) leads to the equilibria

(4.2.20)

So if C happens to have as an eigenvalue then w* will be the eigenvector of C corresponding to . In general, though, will not be an eigenvalue of C, so Equation (4.2.19) will have only one equilibrium at w* = 0. This equilibrium solution is asymptotically stable if is greater than the largest eigenvalue of C since this makes the Hessian

(4.2.21)

positive definite. Now, employing Equation (4.2.4) we get the instantaneous criterion function minimized by the Hebbian rule in Equation (4.2.18):

(4.2.22)

The regularization term is not adequate here to stabilize the Hebbian rule at a solution which maximizes y2. However, other more appropriate regularization terms can insure stability, as we will see in the next section.

Equation (4.2.5) [or (4.2.6)] is a powerful tool which can be used in the characterization of the average behavior of stochastic learning equations. We will employ it in this section in order to characterize some unsupervised learning rules which were considered in Chapter 3. The following analysis is made easier if one assumes that w and x are uncorrelated, and that is averaged with respect to x (denoted by ), with w replaced by its mean . This assumption leads to the "approximate" average learning equation

(4.3.1)

The above approximation of the average learning equation is valid when the learning equation contains strongly mixing random processes (processes for which the "past" and the "future" are asymptotically independent) with the mixing rate high compared to the rate of change of the solution process; i.e., it can be assumed that the weights are uncorrelated with the patterns x and with themselves. Taking the expected (average) value of a stochastic equation one obtains a deterministic equation whose solution approximates asymptotically the behavior of the original system as described by the stochastic equation (here,  = (t) =  is normally assumed). Roughly, the higher the mixing rate the better the approximation in Equation (4.3.1) (Kushner and Clark, 1978). We shall frequently employ Equation (4.3.1) in the remainder of this chapter.

A more rigorous characterization of stochastic learning requires the more advanced theory of stochastic differential equations and will not be considered here [see Kushner (1977) and Ljung (1978)]. Rather, we may proceed with a deterministic analysis using the "average versions" of the stochastic equations. It may be shown that a necessary condition for the stochastic learning rule to converge (in the mean-square sense) is that the average version of the learning rule must converge. In addition, and under certain assumptions, the exact solution of a stochastic equation is guaranteed to "stay close," in a probabilistic sense, to the solution of the associated average equation. It has been shown (Geman, 1979) that under strong mixing conditions (and some additional assumptions), . This result implies that if sufficiently small learning rates are used, the behavior of a stochastic learning equation may be well approximated, in a mean-square sense, by the deterministic dynamics of its corresponding average equation. Oja (1983) pointed out that the convergence of constrained gradient descent- (or ascent-) based stochastic learning equations (the type of equations considered in this chapter) can be studied with averaging techniques; i.e., the asymptotically stable equilibria of the average equation are the possible limits of the stochastic equation. Several examples of applying the averaging technique to the characterization of learning rules can be found in Kohonen (1989).

Before proceeding with further analysis of learning rules, we make the following important observations regarding the nature of the learning parameter in the stochastic learning equation (Heskes and Kappen, 1991). When a neural network interacts with a fixed unchanging (stationary) environment, the aim of the learning algorithm is to adjust the weights of the network in order to produce an optimal response; i.e., an optimal representation of the environment. To produce such an optimal and static representation, we require the learning parameter, which controls the amount of learning, to eventually approach zero. Otherwise, fluctuations in the representation will persist, due to the stochastic nature of the learning equation, and asymptotic convergence to optimal representation is never achieved. For a large class of stochastic algorithms, asymptotic convergence can be guaranteed (with high probability) by using the learning parameter (Ljung, 1977; Kushner and Clark, 1978).

On the other hand, consider biological neural nets. Human beings, of course, are able to continually learn throughout their entire lifetime. In fact, human learning is able to proceed on two different time scales; humans learn with age (very large time scale adaptation/learning) and are also capable of discovering regularities and are attentive for details (short time scale learning). This constant tendency to learn accounts for the adaptability of natural neural systems to a changing environment. Therefore, it is clear that the learning processes in biological neural networks does not allow for asymptotically vanishing learning parameters.

In order for artificial neural networks to be capable of adapting to a changing (nonstationary) environment, the learning parameter must take a constant nonzero value. The larger the learning parameter, the faster the response of the network to the changing environment. On the other hand, a large learning parameter has a negative effect on the accuracy of the network's representation of the environment at a given time; a large gives rise to large fluctuations around the desired optimal representation. In practice, though, one might be willing to trade some degree of fluctuation about the optimal representation (solution) for adaptability to a nonstationary process. Similar ideas have been proposed in connection with stochastic adaptive linear filtering. Here, an adaptive algorithm with a constant step size is used because it has the advantage of a limited memory, which enables it to track time fluctuations in the incoming data. These ideas date back to Wiener (1956) in connection with his work on linear prediction theory.

We have already analyzed one version of the Hebbian learning rule in the previous section. However, here we consider the most simple form of Hebbian learning which is given by Equation (4.2.18) with = 0; namely,

(4.3.2)

The above equation is a continuous-time version of the unsupervised Hebbian learning rule introduced in Chapter 3. Employing Equation (4.3.1), we get the approximate average learning equation

(4.3.3)

In Equation (4.3.3) and in the remainder of this chapter, the subscript x in is dropped in order to simplify notation. Now, the average gradient of J in Equation (4.3.3) may be written as

(4.3.4)

from which we may determine the average instantaneous criterion function

(4.3.5)

Note that is minimized by maximizing the unit's output variance. Again, C = is the autocorrelation matrix, which is positive semidefinite having orthonormal eigenvectors c(i) with corresponding eigenvalues . That is, Cc(i) = ic(i) for i = 1, 2, ..., n. The dynamics of Equation (4.3.3) are unstable. To see this, we first find the equilibrium points by setting = 0, giving C = 0 or w* = 0. w* is unstable because the Hessian (in an average sense) = = -C is non-positive for all . Therefore, Equation (4.3.3) is unstable and results in . Note, however, that the direction of is not random; it will tend to point in the direction of c(1), since if one assumes a fixed weight vector magnitude, is minimized when is parallel to the eigenvector with the largest corresponding eigenvalue.

In the following, we will characterize other versions of the Hebbian learning rule some of which were introduced in Chapter 3. These rules are well-behaved, and hence solve the divergence problem encountered with simple Hebbian learning. For simplifying mathematical notation and terminology, the following sections will use J, J, and H to designate , , and , respectively. We will refer to as simply the criterion function, to as the gradient of J, and to as the Hessian of J. Also, the quantity w in the following average equations should be interpreted as the state of the average learning equation.

Consider the criterion function

(4.3.6)

It is a well established property of quadratic forms that if w is constrained to the surface of the unit hypersphere, then Equation (4.3.6) is minimized when w = c(1) with (e.g. see Johnson and Wichern, 1988). Also, for any real symmetric n  n matrix A, the Rayleigh quotient satisfies where 1 and n are the largest and smallest eigenvalues of A, respectively. Let us start from the above criterion and derive an average learning equation. Employing Equation (4.3.1), we get

(4.3.7)

which can be shown to be the average version of the nonlinear stochastic learning rule

(4.3.8)

If we heuristically set for the two terms in the above equation, Equation (4.3.8) reduces to the continuous version of Oja's rule [refer to Equation (3.3.6)]. Let us continue with the characterization of (4.3.8) and defer the analysis of Oja's rule to Section 4.3.3. At equilibrium, Equation (4.3.7) gives

(4.3.9)

Hence, the equilibria of Equation (4.3.7) are the solutions of Equation (4.3.9) given by w* = c(i), i = 1, 2, ..., n. Here, J takes its smallest value of at w* = c(1). This can be easily verified by direct substitution in Equation (4.3.6).

Next, consider the Hessian of J at w* = c(i) (assuming, without loss of generality, = 1) and multiply it by c(j) , namely, H(c(i)c(j). It can be shown that this quantity is given by (see Problem 4.3.1. For a reference on matrix differential calculus, the reader is referred to the book by Magnus and Neudecker, 1988):

(4.3.10)

This equation implies that H(w*) has the same eigenvectors as C but with different eigenvalues. H(w*) is positive semi-definite only when w* = c(1). Thus, by following the dynamics of Equation (4.3.7), w will eventually point in the direction of c(1) (since none of the other directions c(i) is stable). Although the direction of w will eventually stabilize, it is entirely possible for ||w|| to approach infinity, and Equation (4.3.7) will appear never to converge. We may artificially constrain ||w|| to finite values by normalizing w after each update of Equation (4.3.8). Alternatively, we may set the two terms in Equation (4.3.8) equal to 1. This latter case is considered next.

Oja's rule was defined by Equation (3.3.6). Its continuous-time version is given by the nonlinear stochastic equation

(4.3.11)

The corresponding average learning equation is thus (Oja, 1982; 1989)

(4.3.12)

which has its equilibria at w satisfying

(4.3.13)

The solutions of Equation (4.3.13) are w* = c(i), i = 1, 2, ..., n. All of these equilibria are unstable except for . This can be seen by noting that the Hessian

(4.3.14)

is positive definite only at w* = c(1) (or -c(1)). Note that Equation (4.3.14) is derived starting from , with given in Equation (4.3.12). Although J is not known, the positive definiteness of H can be seen from

(4.3.15)

and by noting that the eigenvalues of C satisfy (we assume 1  2). Therefore, Oja's rule is equivalent to a stable version of the Hebbian rule given in Equation (4.3.8). A formal derivation of Oja's rule is explored in Problem 4.3.7.

A single unit employing Oja's rule (Oja's unit) is equivalent to a linear matched filter. To see this, assume that for all x, , where is a fixed vector (without loss of generality let ) and v is a vector of symmetrically distributed zero-mean noise with uncorrelated components having variance 2. Then . The largest eigenvalue of C is 1 + 2 and the corresponding eigenvector is . Oja's unit then becomes a matched filter for the data, since in Equation (4.3.12). Here, the unit responds maximally to the data mean. Further characterization of Oja's rule can be found in Xu (1993).

Oja's rule is interesting because it results in a local learning rule which is biologically plausible. The locality property is seen by considering the component weight adaptation rule of Equation (4.3.11), namely

(4.3.16)

and by noting that the change in the ith weight is not an "explicit" function of any other weight except the ith weight itself. Of course, does depend on w via y = wTx. However, this dependence does not violate the concept of locality.

It is also interesting to note that Oja's rule is similar to Hebbian learning with weight decay as in Equation (4.2.18). For Oja's rule, though, the growth in ||w|| is controlled by a "forgetting" or weight decay term, -y2w, which has nonlinear gain; the forgetting becomes stronger with stronger response, thus preventing ||w|| from diverging.

Example 4.3.1: This example shows typical simulation results comparing the evolution of the weight vector w according to the stochastic Oja rule in Equation (4.3.11) and its corresponding average rule in Equation (4.3.12).

Consider a training set {x} of forty 15-dimensional column vectors with independent random components generated by a normal distribution N(0, 1). In the following simulations, the training vectors autocorrelation matrix has the following set of eigenvalues: {2.561, 2.254, 2.081, 1.786, 1.358, 1.252, 1.121, 0.963, 0.745, 0.633, 0.500, 0.460, 0.357, 0.288, 0.238}.

During training, one of the forty vectors is selected at random and is used in the learning rule to compute the next weight vector. Discretized versions of Equations (4.3.11) and (4.3.12) are used where is replaced by wk+1 - wk. A learning rate  = 0.005 is used. This is equivalent to integrating these equations with Euler's method (e.g., see Gerald, 1978) using a time step t = 0.005 and  = 1. The initial weight vector is set equal to one of the training vectors. Figures 4.3.1 (a) and (b) show the evolution of the cosine of the angle between wk and c(1) and the evolution of the norm of wk, respectively. The solid line corresponds to the stochastic rule and the dashed line corresponds to the average rule.

(a)

(b)

Figure 4.3.1. (a) Evolution of the cosine of the angle between the weight vector wk and the principal eigenvector of the autocorrelation matrix C for the stochastic Oja rule (solid line) and for the average Oja rule (dashed line). (b) Evolution of the magnitude of the weight vector. The training set consists of forty 15-dimensional real-valued vectors whose components are independently generated according to a normal distribution N(0, 1). The presentation order of the training vectors is random during training.

The above simulation is repeated, but with a fixed presentation order of the training set. Results are shown in Figures 4.3.2 (a) and (b). Note that the results for the average learning equation (dashed line) are identical in both simulations since they are not affected by the order of presentation of input vectors. These simulations agree with the theoretical results on the appropriateness of using the average learning equation to approximate the limiting behavior of its corresponding stochastic learning equation. Note that a monotonically decreasing learning rate (say proportional to or with k  1) can be used to force the convergence of the direction of wk in the first simulation. It is also interesting to note that better approximations are possible when the training vectors are presented in a fixed deterministic order (or in a random order but with each vector guaranteed to be selected once every training cycle of m = 40 presentations). Here, a sufficiently small, constant learning rate is sufficient for making the average dynamics approximate, in a practical sense, the stochastic dynamics for all time.

(a)

(b)

Figure 4.3.2. (a) Evolution of the cosine of the angle between the weight vector wk and the principal eigenvector of the autocorrelation matrix C for the stochastic Oja rule (solid line) and for the average Oja rule (dashed line). (b) Evolution of the magnitude of the weight vector. The training set consists of forty 15-dimensional real-valued vectors whose components are independently generated according to a normal distribution N(0, 1). The presentation order of the training vectors is fixed.

The continuous-time version of the Yuille et al. (1989) learning rule is

(4.3.17)

and the corresponding average learning equation is

(4.3.18)

with equilibria at

(4.3.19)

From Equation (4.3.18), the gradient of J is

(4.3.20)

(4.3.21)

Note that one could have also computed H directly from the known criterion function

(4.3.22)

Now, evaluating H(wi*)wj* we get

(4.3.23)

which implies that the wj* are eigenvectors of H(wi*) with eigenvalues i - j for i j and 2i for i = j. Therefore, H(wi*) is positive definite if and only if i > j, for i j. In this case, w* = are the only stable equilibria, and the dynamics of the stochastic equation are expected to approach w*.

In the following, the unsupervised Hebbian-type learning rule

(4.3.24)

with ||w||  0 is analyzed. Another way for stabilizing Equation (4.3.2) is to start from a criterion function that explicitly penalizes the divergence of w. For example, if we desire to be satisfied, we may utilize J given by

(4.3.25)

with > 0. It can be easily shown that steepest gradient descent on the above criterion function leads to the learning equation

(4.3.26)

Equation (4.3.26) is the average learning equation for the stochastic rule of Equation (4.3.24). Its equilibria are solutions of the equation

(4.3.27)

Thus, it can be seen that the solution vectors of Equation (4.3.27), denoted by , must be parallel to one of the eigenvectors of C, say c(i),

(4.3.28)

where i is the ith eigenvalue of C. From Equation (4.3.28) it can be seen that the norm of the ith equilibrium state is given by

(4.3.29)

which requires  > i for all i. Note that if >> i, then approaches one for all equilibrium points. Thus, the equilibria of Equation (4.3.26) approach unity norm eigenvectors of the correlation matrix C when is large.

Next, we investigate the stability of these equilibria. Starting from the Hessian

(4.3.30)

we have

(4.3.31)

which implies that is positive definite if and only if is parallel to c(1) and > 1. Therefore, the only stable equilibria of Equation (4.3.26) are which approach c(1) for >> 1. Like the Yuille et al. rule, this rule preserves the information about the size of 1 [1 can be computed from Equation (4.3.29)].

For the discrete-time version, it is interesting to note that for the case >> 1, and = 1, Equation (4.3.24) reduces to

(4.3.32)

which is very similar to the discrete-time simple Hebbian learning rule with weight vector normalization of Equation (3.3.5) and expressed here as

(4.3.33)

Note that the weight vector in in Equation (4.3.33) need not be normalized to prevent divergence.

In practice, discrete-time versions of the stochastic learning rules in Equations (4.3.11), (4.3.17), and (4.3.24) are used where is replaced by wk+1 - wk and w(t) by wk. Here, the stability of these discrete-time stochastic dynamical systems critically depends on the value of the learning rate . Although the stability analysis is difficult, one can resort to the discrete-time versions of the average learning equations corresponding to these rules and derive conditions on for asymptotic convergence (in the mean) in the neighborhood of equilibrium states w*. Such analysis is explored in Problems 4.3.8 and 4.3.9.

In concluding this section, another interpretation of the regularization effects on the stabilization of Hebbian-type rules is presented. In the stochastic learning Equations (4.2.18), (4.3.11), (4.3.17), and (4.3.24), regularization appears as weight decay terms -w, -y2w, -||w||2w, and -w, respectively. Therefore, one may think of weight decay as a way of stabilizing unstable learning rules. However, one should carefully design the gain coefficient in the weight decay term for proper performance. For example, it has been shown earlier that a simple positive constant gain in Equation (4.2.18) does not stabilize Hebb's rule. On the other hand, the nonlinear dynamic gains y2, ||w||2, and lead to stability. Note that the weight decay gain in Oja's rule utilizes more information, in the form of y2, than the Yuille et al. rule or Hassoun's rule. The regularization in these latter rules is only a function of the current weight vector magnitude.

The PCA network of Section 3.3.5, employing Sanger's rule, is analyzed in this section. Recalling Sanger's rule [from Equation (3.3.19)] and writing it in vector form for the continuous-time case, we get

(4.4.1)

with = 1, 2, ..., m, where m is the number of units in the PCA network. We will assume, without any loss of generality, that m = 2. This leads to the following set of coupled learning equations for the two units:

(4.4.2)

and

(4.4.3)

Equation (4.4.2) is Oja's rule. It is independent of unit 2 and thus converges to , the principal eigenvector of the autocorrelation matrix of the input data (assuming zero mean input vectors). Equation (4.4.3) is Oja's rule with the added inhibitory term . Next, we will assume a sequential operation of the two-unit net where unit 1 is allowed to fully converge before evolving unit 2. This mode of operation is permissible since unit 1 is independent of unit 2.

With the sequential update assumption, Equation (4.4.3) becomes

(4.4.4)

For clarity, we will drop the subscript on w. Now, the average learning equation for unit 2 is given by

(4.4.5)

which has equilibria satisfying

(4.4.6)

Hence, and with i = 2, 3, ..., n are solutions. Note that the point is not an equilibrium. The Hessian is given by

(4.4.7)

Since is not positive definite, the equilibrium w* = 0 is not stable. For the remaining equilibria we have the Hessian matrix

(4.4.8)

which is positive definite only at , assuming 2  3. Thus, Equation (4.4.5) converges asymptotically to the unique stable vector which is the eigenvector of C with the second largest eigenvalue 2. Similarly, for a network with m interacting units according to Equation (4.4.1), the ith unit (i = 1, 2, ..., m) will extract the ith eigenvector of C.

The unit-by-unit description presented here helps simplify the explanation of the PCA net behavior. In fact, the weight vectors wi approach their final values simultaneously, not one at a time. Though, the above analysis still applies, asymptotically, to the end points. Note that the simultaneous evolution of the wi is advantageous since it leads to faster learning than if the units are trained one at a time.

Recall the simplified stochastic reinforcement learning rule of Equation (3.2.5). The continuous-time version of this rule is given by

(4.5.1)

from which the average learning equation is given as

(4.5.2)

Now, employing Equation (4.2.6), we may think of Equation (4.5.2) as implementing a gradient search on an average instantaneous criterion function, , whose gradient is given by

(4.5.3)

In Equations (4.5.1) through (4.5.3), we have and the output y is generated by a stochastic unit according to the probability function P(yw, x), given by

(4.5.4)

with the expected output

(4.5.5)

as in Section 3.1.6. Next, it is shown that Equation (4.5.3) is proportional to the gradient of the expected reinforcement signal (Williams, 1987; Hertz et al., 1991).

First, we express the expected (average) reinforcement signal, , for the kth input vector with respect to all possible outputs y as

(4.5.6)

and then evaluate its gradient with respect to w. The gradient of is given by

(4.5.7)

which follows from Equation (4.5.4). We also have

(4.5.8)

and (4.5.9)

which can be used in Equation (4.5.7) to give

(4.5.10)

If we now take the gradient of Equation (4.5.6) and use (4.5.10), we arrive at

(4.5.11)

which can also be written as

(4.5.12)

Finally, by averaging Equation (4.5.12) over all inputs xk, we get

(4.5.13)

where now the averages are across all patterns and all outputs. Note that is proportional to in Equation (4.5.3) and has an opposite sign. Thus, Equation (4.5.2) can be written in terms of as

(4.5.14)

which implies that the average weight vector converges to a local maximum of ; i.e., Equation (4.5.1) converges, on average, to a solution that locally maximizes the average reinforcement signal.

Extensions of these results to a wider class of reinforcement algorithms can be found in Williams (1987). The characterization of the associative reward-penalty algorithm in Equation (3.2.1) is more difficult since it does not necessarily maximize . However, the above analysis should give some insight into the behavior of simple reinforcement learning.

In this section we attempt to characterize simple competitive learning. Two approaches are described: one deterministic and the other statistical.

Consider a single layer of linear units, where each unit uses the simple continuous-time competitive rule [based on Equations (3.4.3) and (3.4.5)]

(4.6.1)

Also, consider the criterion function (Ritter and Schulten, 1988a)

(4.6.2)

where is the weight vector of the winner unit upon the presentation of the input vector xk. Here, all vectors xk are assumed to be equally probable. In general, a probability of occurrence of xk, P(xk), should be inserted inside the summation in Equation (4.6.2). An alternative way of expressing J in Equation (4.6.2) is through the use of a "cluster membership matrix" M defined for each unit i = 1, 2, ..., n by

(4.6.3)

Here, is a dynamically evolving function of k and i which specifies whether or not unit i is the winning unit upon the presentation of input xk. The cluster membership matrix allows us to express the criterion function J as

(4.6.4)

Now, performing gradient descent on J in Equation (4.6.4) yields

(4.6.5)

which is the batch mode version of the learning rule in Equation (4.6.1). It was noted by Hertz et al. (1991) that the above batch mode competitive learning rule corresponds to the k-means clustering algorithm when a finite training set is used. The local rule of Equation (4.6.1) may have an advantage over the batch mode rule in Equation (4.6.5) since stochastic noise due to the random presentation order of the input patterns may kick the solution out of "poor" minima towards minima which are more optimal. However, only in the case of "sufficiently sparse" input data points can one prove stability and convergence theorems for the stochastic (incremental) competitive learning rule (Grossberg, 1976). The data points are sparse enough if there exists a set of clusters so that the minimum overlap within a cluster exceeds the maximum overlap between that cluster and any other cluster. In practice, a damped learning rate k is used (e.g., , where and 0 is a positive constant) in order to stop weight evolution at one of the local solutions. Here, the relatively large initial learning rate allows for wide exploration during the initial phase of learning.

Criterion functions, other than the one in Equation (4.6.4), may be employed which incorporate some interesting heuristics into the competitive rule for enhancing convergence speed or for altering the underlying "similarity measure" implemented by the learning rule. For an example, we may replace by in Equation (4.6.4). This causes the winning weight vector to be repelled by input vectors in other clusters, while being attracted by its own cluster, which enhances convergence. Another example is to employ a different similarity measure (norm) in J such as the Minkowski-r norm of Equation (3.1.68), which has the ability to reduce the effects of outlier data points by proper choice of the exponent r. Other criterion functions may also be employed; the reader is referred to Bachmann et al. (1987) for yet another suitable criterion function.

The following is an analysis of simple competitive learning based on the stochastic approximation technique introduced in Section 4.3. Consider the following normalized discrete-time competitive rule (von der Malsberg, 1973; Rumelhart and Zipser, 1985):

(4.6.6)

where, again, the setting is a single layer network of linear units. Here, and typically, . Also, the weight normalization is assumed for all units. It can be easily verified that Equation (4.6.6) preserves this weight normalization at any iteration (this was explored in Problem 3.4.1).

Let P(xk) be the probability that input xk is presented on any trial. Then, the average learning equation may be expressed as

(4.6.7)

where is the conditional probability that unit i wins when input xk is presented. Now, using Equation (4.6.6) in Equation (4.6.7) we get

(4.6.8)

which implies that at equilibrium

(4.6.9)

Therefore, the jth component of vector wi is given as

(4.6.10)

We now make the following observations. First, note that the denominator of Equation (4.6.10) is the probability that unit i wins averaged over all stimulus patterns. Note further that is the probability that (active) and unit i is a winner. Thus, assuming that all patterns have the same number of active bits (i.e., for all k), we may employ Bayes' rule and write Equation (4.6.10) as

(4.6.11)

which states that at equilibrium, wij is expected to be proportional to the conditional probability that the jth bit of input xk is active given unit i is a winner.

Next, upon the presentation of a new pattern [assuming the equilibrium weight values given by Equation (4.6.9)], unit i will have a weighted sum (activity) of

(4.6.12)

or

(4.6.13)

where represents the overlap between stimulus and the kth training pattern xk. Thus, at equilibrium, a unit responds most strongly to patterns that overlap other patterns to which the unit responds and most weakly to patterns that are far from patterns to which it responds. Note that we may express the conditional probability according to the winner-take-all mechanism

(4.6.14)

Because of the dependency of on wi, there are many solutions which satisfy the equilibrium relation given in Equation (4.6.9).

Equation (4.6.6) leads the search to one of many stable equilibrium states satisfying Equations (4.6.9) and (4.6.14). In such a state, the ith unit activations become stable (fluctuate minimally), and therefore, becomes stable. A sequence of stimuli might, however, be presented in such a way as to introduce relatively large fluctuations in the wi's. In this case, the system might move to a new equilibrium state which is, generally, more stable in the sense that becomes unlikely to change values for a very long period of time. Rumelhart and Zipser (1985) gave a measure of the stability of an equilibrium state as the average amount by which the output of the winning units is greater than the response of all of the other units averaged over all patterns and all clusters. This stability measure is given by

(4.6.15)

where the averaging is taken over all xk, and i* is the index of winning units. Note that Equation (4.6.15) can be written as

(4.6.16)

The larger the value of J, the more stable the system is expected to be. Maximizing J can also be viewed as maximizing the overlap among patterns within a group (cluster) while minimizing the overlap among patterns between groups; this is exactly what is required for the clustering of unlabeled data. In geometric terms, J is maximized when the weight vectors point toward maximally compact stimulus (input) regions that are as distant as possible from other such regions.

The characterization of topological feature preserving maps has received special attention in the literature (Kohonen, 1982b; Cottrell and Fort, 1986; Ritter and Schulten, 1986 and 1988b; Tolat, 1990; Heskes and Kappen, 1993a; Lo et al., 1993; Kohonen, 1993b). In particular, Takeuchi and Amari (1979) and Amari (1980 and 1983) have extensively studied a continuous-time dynamical version of this map to investigate the topological relation between the self-organized map and the input space governed by the density p(x), the resolution and stability of the map, and convergence speed. The characterization of a general feature map is difficult, and much of the analysis has been done under simplifying assumptions.

In the following, a one-dimensional version of the self-organizing feature map of Kohonen is characterized following the approach of Ritter and Schulten (1986). A continuous dynamical version of Kohonen's map is also described and analyzed.

Consider the criterion function J(w) defined by (Ritter and Schulten, 1988a)

(4.7.1)

where i* is the label of the winner unit upon the presentation of stimulus (input) xk and is the neighborhood function that was introduced in Section 3.5. It can be seen that Equation (4.7.1) is an extension of the competitive learning criterion function of Equation (4.6.2). Performing gradient descent on (4.7.1) yields

(4.7.2)

which is just the batch mode version of Kohonen's self-organizing rule in Equation (3.5.1). Thus, Kohonen's rule is a stochastic gradient descent search and leads, on average and for small , to a local minimum of J in Equation (4.7.1). These minima are given as solutions to

(4.7.3)

This equation is not easy to solve; it depends on the choice of and the distribution p(x). Actually, we desire the global minimum of the criterion function J. Local minima of J are topological defects like kinks in one-dimensional maps and twists in two-dimensional maps [Kohonen, 1989; Geszti, 1990].

The analysis of feature maps becomes more tractable if one replaces Equation (4.7.3) with a continuous version that assumes a continuum of units and where the distribution p(x) appears explicitly, namely

(4.7.4)

where rr*(x) is the coordinate vector of the winning unit upon the presentation of input x.

An implicit partial differential equation for w can be derived from Equation (4.7.4) [Ritter and Schulten, 1986]. However, for the case of two- or higher-dimensional maps, no explicit solutions exist for w(r) given an arbitrary p(x). On the other hand, solutions of Equation (4.7.4) are relatively easy to find for the one-dimensional map with scalar r and a given input distribution p(x). Here, the equilibrium w* satisfies (Ritter and Schulten, 1986)

(4.7.5)

which, in turn, satisfies the implicit differential equation corresponding to Equation (4.7.4), given by (assuming a sharply peaked symmetric ):

(4.7.6)

In Equations (4.7.5) and (4.7.6), the term p(w) is given by p(x)|x=w. Equation (4.7.5) shows that the density of the units in w space is proportional to around point r. This verifies the density preserving feature of the map. Ideally, however, we would have for zero distortion. Therefore, a self-organizing feature map tends to under sample high probability regions and over sample low probability ones.

Finally, one may obtain the equilibria w* by solving Equation (4.7.6). The local stability of some of these equilibria is ensured (with a probability approaching 1) if the learning coefficient  = (t) is sufficiently small, positive, and decays according to the following necessary and sufficient conditions (Ritter and Schulten, 1988b)

(4.7.7a)

and

(4.7.7b)

In particular, the decay law with 0 < 1 ensures convergence. For laws with  > 1 or exponential decay laws, Equation (4.7.7a) is not fulfilled, and some residual error remains even in the limit t  . It can also be shown that during convergence, the map first becomes untangled and fairly even, and then moves into a refinement phase where it adapts to the details of p(x). Occasionally, the "untangling" phase can slow convergence, because some types of tangle (e.g., kinks and twists) can take a long time to untangle. Geszti (1990) suggested the use of a strongly asymmetric neighborhood function in order to speed up learning by breaking the symmetry effects responsible for slow untangling of kinks and twists.

Consider a continuum of units arranged as an infinite two-dimensional array (neural field). Each point (unit) on this array may be represented by a position vector r and has an associated potential u(r). The output of the unit at r is assumed to be a nonlinear function of its potential y(r) = [u(r)], where f is either a monotonically non-decreasing positive saturating activation function or a step function. Associated with each unit r is a set of input weights w(r) and another set of lateral weights (rr') = (r'). This lateral weight distribution is assumed to be of the on-center off-surround type as is shown in Figure 4.7.1 for the one-dimensional case. Here, a unit at position r makes excitatory connections with all of its neighbors located within a distance from r, and makes inhibitory connections with all other units.

Figure 4.7.1. One-dimensional plot of a neural field's lateral weight distribution.

The dynamics of the neural field potential u(r, t) are given by

(4.7.8)

where

(4.7.9)

and h is a constant bias field. In Equation (4.7.8), it is assumed that the potential u(r, t) decays with time constant to the resting potential h in the absence of any stimulation. Also, it is assumed that this potential increases in proportion to the total stimuli s(r, x) which is the sum of the lateral stimuli and the input stimuli wTx due to the input signal x Rn. A conceptual diagram for the above neural field is shown in Figure 4.7.2.

Figure 4.7.2. Self-organizing neural field.

In Equation (4.7.8), the rates of change of and w, if any, are assumed to be much slower than that of the neural field potential. The input signal (pattern) x is a random time sequence, and it is assumed that a pattern x is chosen according to probability density p(x). Also, we assume that inputs are applied to the neural field for a time duration which is longer than the time constant of the neural field potential. On the other hand, the duration of stimulus x is assumed to be much shorter than the time constant ' of the weight w. Thus, the potential distribution u(r, t) can be considered to change in a quasi-equilibrium manner denoted by u(r, x).

An initial excitation pattern applied to the neural field changes according to the dynamics given in Equation (4.7.8), and eventually converges to one of the equilibrium solutions. Stable equilibrium solutions are the potential fields u(r) which the neural field can retain persistently under a constant input x. The equilibria of Equation (4.7.8) must satisfy or

(4.7.10)

where s(ru*) is the total stimuli at equilibrium. When the lateral connections distribution (r - r') is strongly off-surround inhibitory, given any x, only a local excitation pattern is aroused as a stable equilibrium which satisfies Equation (4.7.10) (Amari, 1990). Here, a local excitation is a pattern where the excitation is concentrated on units in a small local region; i.e., u*(r) is positive only for a small neighborhood centered at a maximally excited unit r0. Thus u*(r, x) represents a mapping from the input space onto the neural field.

Let us now look at the dynamics of the self-organizing process. We start by assuming a particular update rule for the input weights of the neural field. One biologically plausible update rule is the Hebbian rule:

(4.7.11)

where is the neural field's equilibrium output activity due to input x. In Equation (4.7.11), we use the earlier assumption ' . Next, we assume strong mixing in Equation (4.7.11) which allows us to write an expression for the average learning equation (absorbing ' in and in ) as

(4.7.12)

where the averaging is over all possible x. The equilibria of Equation (4.7.12) are given by

(4.7.13)

If we now transpose Equation (4.7.12) and multiply it by an arbitrary input vector x we arrive at an equation for the change in input stimuli

(4.7.14)

The vector inner product xT x' represents the similarity of two input signals x and x' and hence the topology of the signal space (Takeuchi and Amari, 1979). Note how Equation (4.7.14) relates the topology of the input stimulus set {x} with that of the neural field.

On the other hand, if one assumes a learning rule where a unit r updates its input weight vector in proportion to the correlation of its equilibrium potential u*(r, x) and the difference x - w(r), one arrives at the average differential equation

(4.7.15)

This learning rule is equivalent to the averaged continuum version of Kohonen's self-organizing feature map in Equation (4.7.2), if one views the potential distribution u*(r, x) as the weighting neighborhood function . Here, self-organization will emerge if the dynamics of the potential field evolve such that the quasi-stable equilibrium potential u* starts positive for all r, then monotonically and slowly shrinks in diameter for positive time. This may be accomplished through proper control of the bias field h, as described below.

In general, it is difficult to solve Equation (4.7.8) and (4.7.12) [or (4.7.15)]. However, some properties of the formation of feature maps are revealed from these equations for the special, but revealing, case of a one-dimensional neural field (Takeuchi and Amari, 1979; Amari, 1980 and 1983). The dynamics of the potential field in Equation (4.7.8) for a one-dimensional neural field was analyzed in detail by Amari (1977b) and Kishimoto and Amari (1979) for a step activation function and a continuous monotonically non-decreasing activation function, respectively. It was shown that with (r - r') as shown in Figure 4.7.1 and f(u) = step(u), there exist stable equilibrium solutions u*(r) for the x = 0 case. The 0-solution potential field, u*(r) 0, and the -solution field, u*(r) 0, are among these stable solutions. The 0-solution is stable if and only if h < 0. On the other hand, the -solution is stable if and only if h > -2 A(), where with A(a) as defined below. Local excitations (also known as a-solutions) where u*(r) is positive only over a finite interval [a1, a2] of the neural field are also possible. An a-solution exists if and only if h + A(a) = 0. Here, A(a) is the definite integral defined by

(4.7.16)

and plotted in Figure 4.7.3. Amari (see also Krekelberg and Kok, 1993) also showed that a single a-solution can exist for the case of a non-zero input stimulus, and that the corresponding active region of the neural field is centered at the unit r receiving the maximum input. Furthermore, the width, a, of this active region is a monotonically decreasing function of the bias field h. Thus, one may exploit the fact that the field potential/neighborhood function u*(rx) is controlled by the bias field h in order to control the convergence of the self-organizing process. Here, the uniform bias field h is started at a positive value h > -2A() and is slowly decreased towards negative values. This, in turn, causes u* to start at the -solution, then gradually move through a-solutions with decreasing width a until, ultimately, u* becomes the 0-solution. For further analysis of the self-organizing process in a neural field, the reader is referred to Zhang (1991).

Figure 4.7.3. A plot of A(a) of Equation (4.7.16).

Kohonen (1993a and 1993b) proposed a self-organizing map model for which he gives physiological justification. The model is similar to Amari's self-organizing neural field except that it uses a discrete two-dimensional array of units. The model assumes sharp self-on off-surround lateral interconnections so that the neural activity of the map is stabilized where the unit receiving the maximum excitation becomes active and all other units are inactive. Kohonen's model employs unit potential dynamics similar to those of Equation (4.7.8). On the other hand, Kohonen uses a learning equation more complex than those in Equations (4.7.12) and (4.7.15). This equation is given for the ith unit weight vector by the pseudo-Hebbian learning rule

(4.7.17)

where is a positive constant. The term in Equation (4.7.17) models a natural "transient" neighborhood function. It represents a weighted-sum of the output activities yl of nearby units, which describes the strength of the diffuse chemical effect of cell l on cell i; hil is a function of the distance of these units. This weighted-sum of output activities replaces the output activity of the same cell, yi, in Hebb's learning rule. On the other hand, the term acts as a stabilizing term which models "forgetting" effects, or disturbance due to adjacent synapses. Typically, forgetting effects in wi are proportional to the weight wi itself. In addition, if the disturbance caused by synaptic site r is mediated through the post synaptic potential, the forgetting effect must further be proportional to wirxr. This phenomenon is modeled by in the forgetting term. Here, the summation is taken over a subset of the synapses of unit i that are located near the jth synapse wij, and approximately act as one collectively interacting set. The major difference between the learning rules in Equation (4.7.17) and (4.7.12) is that, in the former, the "neighborhood" function is determined by a "transient" activity due to a diffusive chemical effect of nearby cell potentials, whereas it is determined by a stable region of neural field potential in the latter.

Under the assumption that the index r ranges over all components of the input signal x and regarding as a positive scalar independent of w and x, the vector form of Equation (4.7.18) takes the Riccati differential equation form

(4.7.18)

where  =  > 0. Now, multiplying both sides of Equation (4.7.18) by 2wT leads to the differential equation

(4.7.19)

Thus, for arbitrary x with wTx > 0, Equation (4.7.19) converges to . On the other hand, the solution for the direction of w* cannot be determined in closed form from the deterministic differential Equation (4.7.18). However, a solution for the expected value of w may be found if Equation (4.7.18) is treated as a stochastic differential equation with strong mixing in accordance to the discussion of Section 4.3. Taking the expected value of both sides of Equation (4.7.18) and solving for its equilibrium points (by setting ) gives

(4.7.20)

Furthermore, this equilibrium point can be shown to be stable (Kohonen, 1989).

From the above analysis, it can be concluded that the synaptic weight vector w is automatically normalized to the length , independent of the input signal x, and that w rotates such that its average direction is aligned with the mean of x. This is the expected result of a self-organizing map when a uniform nondecreasing neighborhood function is used. In general, though, the neighborhood term is nonuniform and time varying, which makes the analysis of Equation (4.7.17) much more difficult.

In Chapter 2, we analyzed the capabilities of some neural network architectures for realizing arbitrary mappings. We have found that a feedforward neural net with a single hidden layer having an arbitrary number of sigmoidal activation units is capable of approximating any mapping (or continuous multivariate function) to within any desired degree of accuracy. The results of Chapter 2 on the computational capabilities of layered neural networks say nothing about the synthesis/learning procedure needed to set the interconnection weights of these networks. What remains to be seen is whether such networks are capable of finding the necessary weight configuration for a given mapping by employing a suitable learning algorithm.

Later chapters in this book address the question of learning in specific neural network architectures by extending appropriate learning rules covered in Chapter 3. In the remainder of this chapter, we address two important issues related to learning in neural networks: generalization and complexity. The following discussion is general in nature, and thus it holds for a wide range of neural network paradigms.

Generalization is measured by the ability of a trained network to generate the correct output for a new randomly chosen input belonging to the same probability density p(x) governing the training set. In this section, two cases are considered: Average generalization and worst case generalization. This section also considers the generalization capabilities of stochastic neural networks.

One important performance measure of trainable neural networks is the size of the training set needed to bound their generalization error below some specified number. Schwartz et al. (1990) gave a theoretical framework for calculating the average probability of correct generalization for a neural net trained with a training set of size m. Here, the averaging is over all possible networks (of fixed architecture) consistent with the training set, and the only assumptions about the network's architecture is that it is deterministic (employs deterministic units) and that it is a universal architecture (or faithful model) of the class of functions/mappings being learned. It is assumed that these functions are of the form : Rn {0, 1}, but the ideas can be extended to multiple and/or continuous-valued outputs as well.

The following analysis is based on the theoretical framework of Schwartz et al. (1990), and it also draws on some clarifications given by Hertz et al. (1991). The main result of this analysis is rather surprising; one can calculate the average probability of correct generalization for any training set of size m if one knows a certain function that can (in theory) be calculated before training begins. However, it should be kept in mind that this result is only meaningful when interpreted in an average sense, and does not necessarily represent the typical situation encountered in a specific training scheme.

Consider a class of networks with a certain fixed architecture specified by the number of layers, the number of units within each layer, and the interconnectivity pattern between layers. Let us define the quantity V0 as

(4.8.1)

which stands for the total "volume" of the weight space, where w represents the weights of an arbitrary network, and is some a priori weight probability density function. Thus, each network is represented as a point w in weight space which implements a function fw(x). We may now partition the weight space into a set of disjoint regions, one for each function fw that this class of networks can implement. The volume of the region of weight space that implements a particular function f(x) is given by

(4.8.2)

where

Each time an example {xkfd(xk)} of a desired function fd is presented and is successfully learned (supervised learning is assumed), the weight vector w is modified so that it enters the region of weight space that is compatible with the presented example. If m examples are learned, then the volume of this region is given by

(4.8.3)

where

The region Vm represents the total volume of weight space which realizes the desired function fd as well as all other functions f that agree with fd on the desired training set. Thus, if a new input is presented to the trained network it will be ambiguous with respect to a number of functions represented by Vm (recall the discussion on ambiguity for the simple case of a single threshold gate in Section 1.5). As the number of learned examples m is increased, the expected ambiguity decreases.

Next, the volume of weight space consistent with both the training examples and a "particular" function f is given by

(4.8.4)

where Equation (4.8.2) was used. Note that fw in I has been replaced by f and that the product term factors outside of the integral. Now, assuming independent input vectors xk generated randomly from distribution p(x), the factors I(f, xk) in Equation (4.8.4) are independent. Thus, averaging Vm() over all xk gives

(4.8.5)

The quantity g() takes on values between 0 and 1. It is referred to as the generalization ability of ; i.e., g(f ) may be viewed as the probability that for an input x randomly chosen from p(x). As an example, for a completely specified n-input Boolean function fd, g() is given by (assuming that all 2n inputs are equally likely)

where dH(f, fd) is the number of bits by which f and fd differ.

Let us define the probability Pm() that a particular function f can be implemented after training on m examples of fd. This probability is equal to the average fraction of the remaining weight space that f occupies:

(4.8.6)

The approximation in Equation (4.8.6) is based on the assumption that Vm does not vary much with the particular training sequence; i.e., Vm   for each probable sequence. This assumption is expected to be valid as long as m is small compared to the total number of possible input combinations.

Good generalization requires that Pm() be small. Let us use Equation (4.8.6) to compute the distribution of generalization ability g() across all possible 's after successful training with m examples:

(4.8.7)

Note that an exact m(g) can be derived by dividing the right-hand side of Equation (4.8.7) by . The above result is interesting since it allows us to compute, before learning, the distribution of generalization ability after training with m examples. The form of Equation (4.8.7) shows that the distribution m(g) tends to get concentrated at higher and higher values of g as more and more examples are learned. Thus, during learning, although the allowed volume of weight (or function) space shrinks, the remaining regions tend to have large generalization ability.

Another useful measure of generalization is the average generalization ability G(m) given by

(4.8.8)

which is the ratio between the + 1 and the mth moments of 0(g) and can be computed if 0(g) is given or estimated. G(m) gives the entire "learning curve"; i.e., it gives the average expected success rate as a function of m. Equation (4.8.8) allows us to predict the number of examples, m, necessary to train the network to a desired average generalization performance. We may also define the average prediction error as 1 - G(m). The asymptotic behavior (m  ) of the average prediction error is determined by the form of the initial distribution 0(g) near g = 1. If a finite gap between g = 1 and the next highest g for which 0(g) is nonzero exists, then the prediction error decays to 1 exponentially as . If, on the other hand, there is no such gap in 0(g), then the prediction error decays as . These two behaviors of the learning curve have also been verified through numerical experiments. The nature of the gap in the distribution of generalizations near the region of perfect generalization (g = 1) is not completely understood. These gaps have been detected in experiments involving the learning of binary mappings (Cohn and Tesauro, 1991 and 1992). It is speculated that such a gap could be due to the dynamic effects of the learning process where the learning algorithm may, for some reason, avoid the observed near-perfect solutions. Another possibility is that the gap is inherent in the nature of the binary mappings themselves.

The above approach, though theoretically interesting, is of little practical use for estimating m, since it requires the knowledge of the distribution 0(g), whose estimation is computationally expensive. It also gives results that are only valid in an average sense, and does not necessarily represent the typical situation encountered in a specific training scheme.

Next, we summarize a result that tells us about the generalization ability of a deterministic feedforward neural network in the "worst" case. Here, also, the case of learning a binary-valued output function :Rn  {0, 1} is treated. Consider a set of m labeled training example pairs (x, y) selected randomly from some arbitrary probability distribution p(x, y), with x Rn and y = f(x) {0, 1}. Also, consider a single hidden layer feedforward neural net with k LTG's and d weights that has been trained on the m examples so that at least a fraction 1 - , where , of the examples are correctly classified. Then, with a probability approaching 1, this network will correctly classify the fraction 1 - of future random test examples drawn from p(x, y), as long as (Baum and Haussler, 1989)

(4.8.9)

Ignoring the log term, we may write Equation (4.8.9), to a first order approximation, as

(4.8.10)

which requires m d for good generalization. It is interesting to note that this is the same condition for "good" generalization (low ambiguity) for a single LTG derived by Cover (1965) (refer to Section 1.5) and obtained empirically by Widrow (1987). Therefore, one may note that in the limit of large m the architecture of the network is not important in determining the worst case generalization behavior; what matters is the ratio of the number of degrees of freedom (weights) to the training set size. On the other hand, none of the above theories may hold for the case of a small training set. In this later case, the size and architecture of the network and the learning scheme all play a role in determining generalization quality (see the next chapter for more details). It should also be noted that the architecture of the net can play an important role in determining the speed of convergence of a given class of learning methods, as discussed later in Section 4.9.

Similar results for worst case generalization are reported in Blumer et al. (1989). A more general learning curve based on statistical physics and VC dimension theories (Vapnik and Chervonenkis, 1971) which applies to a general class of networks can be found in Haussler et al. (1992). For generalization results with noisy target signals the reader is referred to Amari et al. (1992).

This section deals with the asymptotic learning behavior of a general stochastic learning dichotomy machine (classifier). We desire a relation between the generalization error and the training error in terms of the number of free parameters of the machine (machine complexity) and the size of the training set. The results in this section are based on the work of Amari and Murata (1993). Consider a parametric family of stochastic machines where a machine is specified by a d-dimensional parameter vector w such that the probability of output y, given an input x, is specified by P(y | x, w). As an example, one may assume the machine to be a stochastic multilayer neural network parameterized by a weight vector w Rd, which, for a given input x Rn, emits a binary output with probability

(4.8.11)

and

where

(4.8.12)

and g(x, w) may be considered as a smooth deterministic function (e.g., superposition of multivariate sigmoid functions typically employed in layered neural nets). Thus, in this example, the stochastic nature of the machine is determined by its stochastic output unit.

Assume that there exists a true machine that can be faithfully represented by one of the above family of stochastic machines with parameter w0. The true machine receives inputs xk, k = 1, 2, ..., m, which are randomly generated according to a fixed but unknown probability distribution p(x), and emits yk. The maximum likelihood estimator (refer to Section 3.1.5 for definition) characterized by the machine will be our first candidate machine. This machine predicts y for a given x with probability P(y | x, ). An entropic loss function is used to evaluate the generalization of a trained machine for a new example (xm+1ym+1).

Let gen be the average predictive entropy (also known as the average entropic loss) of a trained machine parameterized by for a new example (xm+1ym+1):

(4.8.13)

Similarly, we define train as the average entropic loss over the training examples used to obtain

(4.8.14)

Finally, let H0 be the average entropic error of the true machine

(4.8.15)

Amari and Murata proved the following theorem for training and generalization error.

Theorem 4.8.1 (Amari and Murata, 1993): The asymptotic learning curve for the entropic training error is given by

(4.8.16)

and for the entropic generalization error by

(4.8.17)

The proof of Theorem 4.8.1 uses standard techniques of asymptotic statistics and is omitted here. (The reader is referred to the original paper by Amari and Murata (1993) for such proof).

In general, H0 is unknown and it can be eliminated from Equation (4.8.17) by substituting its value from Equation (4.8.16). This gives

(4.8.18)

which shows that for a faithful stochastic machine and in the limit of m d, the generalization error approaches that of the trained machine on m examples, which from Equation (4.8.16) is the classification error H0 of the true machine.

Again, the particular network architecture is of no importance here as long as it allows for a faithful realization of the true machine and m >> d. It is interesting to note that this result is similar to the worst case learning curve for deterministic machines [Equation (4.8.10)], when the training error is zero. The result is also in agreement with Cover's result on classifier ambiguity in Equation (1.5.3), where the term in Equation (4.8.18) may be viewed as the probability of ambiguous response on the + 1 input. In fact, Amari (1993) also proved that the average predictive entropy gen(m) for a general deterministic dichotomy machine (e.g., feedforward neural net classifier) converges to 0 as in the limit m  .

This section deals with the computational complexity of learning: How much computation is required to learn (exactly or approximately to some "acceptable" degree) an arbitrary mapping in a multilayer neural network? In other words, is there an algorithm that is computationally "efficient" for training layered neural networks? Here, we will assume that the desired learning algorithm is a supervised one, which implies that the training set is labeled. Also, we will assume that the neural network has an arbitrary architecture but with no feedback connections.

Learning in artificial neural networks is hard. More precisely, loading of an arbitrary mapping onto a "faithful" neural network architecture requires exponential time irrespective of the learning algorithm used (batch or adaptive). Judd (1987, 1990) showed that the learning problem in neural networks is NP-complete, even for approximate learning; i.e., in the worst case, we will not be able to do much better than just randomly exhausting all combinations of weight settings to see if one happens to work. Therefore, as the problem size increases (that is, as the input pattern dimension or the number of input patterns increases) the training time scales up exponentially in the size of the problem. Moreover, it has been shown (Blum and Rivest, 1989) that training a simple n-input, three-unit two layer net of LTG's can be NP-complete in the worst-case when learning a given set of examples. Consider the class of functions defined on a collection of m arbitrary points in Rn. It has been shown that the problem of whether there exist two hyperplanes that separate them is NP-complete (Megiddo, 1986); i.e., the training of a net with two-hidden n-input LTG's and a single output LTG on examples of such functions is exponential in time, in the worst case, even if a solution exists. Blum and Rivest (1992) extend this result to the case of Boolean functions. They also showed that learning Boolean functions with a two-layer feedforward network of k-hidden units (k bounded by some polynomial in n) and one output unit (which computes the AND function) is NP-complete.

However, these theoretical results do not rule out the possibility of finding a polynomial-time algorithm for the training of certain classes of problems onto certain carefully selected architectures. Blum and Rivest (1992) gave an example of two networks trained on the same task, such that training the first is NP-complete but the second can be trained in polynomial time. Also, the class of linearly separable mappings can be trained in polynomial time if single layer LTG nets are employed (only a single unit is needed if the mapping has a single output). This is easy to prove since one can use linear programming (Karmarkar, 1984) to compute the weights and thresholds of such nets in polynomial time. One can also use this fact and construct layered networks that have polynomial learning time complexity for certain classes of non-linearly separable mappings. This is illustrated next.

Consider a set F of non-linearly separable functions or which has the following two properties: (1) There exists at least one layered neural net architecture for which loading m training pairs {x, yd} of f(x) is NP-complete; (2) there exists a fixed dimensionality expansion process D that maps points x in Rn to points z in Rd, such that d is bounded by some polynomial in n [e.g., ], and that the m training examples {zyd} representing (x) in the expanded space Rd are linearly separable. This set F is not empty; Blum and Rivest (1992) gave examples of functions in F. Figure 4.9.1 depicts a layered architecture which can realize any function in F. Here, a fixed preprocessing layer, labeled D in Figure 4.9.1, implements the above dimensionality expansion process. The output node is a d-input LTG. It can be easily shown that the learning complexity of this network for functions in F is polynomial. This can be seen by noting that the training of the trainable part of this network (the output LTG) has polynomial complexity for m linearly separable examples in Rd and that as n increases, d remains polynomial in n.

Figure 4.9.1. A layered architecture consisting of a fixed preprocessing layer D followed by an adaptive LTG.

The efficiency of learning linearly separable classification tasks in a single threshold gate should not be surprising. We may recall from Chapter One that the average amount of necessary and sufficient information for the characterization of the set of separating surfaces for a random, separable dichotomy of m points grows slowly with m and asymptotically approaches 2d (twice the number of degrees of freedom of the class of separating surfaces). This implies that for a random set of linear inequalities in d unknowns, the expected number of extreme inequalities which are necessary and sufficient to cover the whole set, tends to 2d as the number of consistent inequalities tends to infinity, thus bounding the (expected) necessary number of training examples for learning algorithms in separable problems. Moreover, this limit of 2d consistent inequalities is within the learning capacity of a single d-input LTG.

Another intuitive reason that the network in Figure 4.9.1 is easier to train than a fully adaptive two layer feedforward net is that we are giving it predefined nonlinearities. The former net does not have to start from scratch, but instead is given more powerful building blocks to work with. However, there is a trade off. By using the network in Figure 4.9.1, we gain in a worst-case computational sense, but lose in that the number of weights increases from n to O(n2) or higher. This increase in the number of weights implies that the number of training examples must increase so that the network can meaningfully generalize on new examples (recall the results of the previous section).

The problem of NP-complete learning in multilayer neural networks may be attributed to the use of fixed network resources (Baum, 1989). Learning an arbitrary mapping can be achieved in polynomial time for a network that allocates new computational units as more patterns are learned. Mukhopadhyay et al. (1993) gave a polynomial-time training algorithm for the general class of classification problems (defined by mappings of the form ) based on clustering and linear programming models. This algorithm simultaneously designs and trains an appropriate network for a given classification task. The basic idea of this method is to cover class regions with a minimal number of dynamically allocated hyperquadratic volumes (e.g., hyperspheres) of varying size. The resulting network has a layered structure consisting of a simple fixed preprocessing layer, a hidden layer of LTG's, and an output layer of logical OR gates. This and other efficiently trainable nets are considered in detail in Section 6.3.

Learning in artificial neural networks is viewed as a search for parameters (weights) which optimize a predefined criterion function. A general learning equation is presented which implements a stochastic steepest gradient descent search on a general criterion function with (or without) a regularization term. This learning equation serves to unify a wide variety of learning rules, regardless of whether they are supervised, unsupervised, or reinforcement rules.

The learning equation is a first order stochastic differential equation. This allows us to employ an averaging technique to study its equilibria and its convergence characteristics. The use of averaging, under reasonable assumptions, allows us to approximate the stochastic learning equation by a deterministic first-order dynamical system. In most cases, a well-defined criterion function exists which allows us to treat the deterministic systems as a gradient system. This enables us to exploit the global stability property of gradient systems, and determine the nature of the solutions evolved by the average learning equation. These stable solutions are then taken to be the possible solutions sought by the associated stochastic learning equation. The averaging technique is employed in characterizing several basic rules for supervised, unsupervised, and reinforcement learning. For unsupervised learning, we present analysis and insights into the theory of Hebbian, competitive, and self-organizing learning. In particular, self-organizing neural fields are introduced and analyzed.

The chapter also looks at some important results on generalization of learning in general feedforward neural architectures. The asymptotic behavior of generalization error is derived for deterministic and stochastic networks. Generalization in the average and in the worst-case are considered. The main result here is that the number of training examples necessary for "good" generalization on test samples must far exceed the number of adjustable parameters of the network used.

Finally, the issue of complexity of learning in neural networks is addressed. It is found that learning an arbitrary mapping in a layered neural network is NP-complete in the worst-case. However, it is also found that efficient (polynomial-time) learning is possible if appropriate network architectures and corresponding learning algorithms are found for certain classes of mappings/learning tasks.

4.1.1 Identify the "regularization" term, if any, in the learning rules listed in Table 3.1 of Chapter 3.

4.2.1 Characterize the LMS learning rule with weight decay by analyzing its corresponding average differential equation as in Section 4.2. Find the underlying instantaneous criterion J and its expected value .

4.2.2 Employ Liapunov's first method (see footnote #4) to study the stability of the nonlinear system

for the equilibrium point x* = [0 1]T. Is this an asymptotically stable point? Why?

4.2.3 Study the stability of the lossless pendulum system with the nonlinear dynamics

about the equilibrium points x* = [0 0]T and x* = [ 0]T. Here, measures the angle of the pendulum with respect to its vertical rest position, g is gravitational acceleration, and L is the length of the pendulum.

4.2.4 Liapunov's first method for studying the stability of nonlinear dynamical systems (see footnote #4) is equivalent to studying the asymptotic stability of a linearized version of these systems about an equilibrium point. Linearize the second order nonlinear system in Problem 4.2.2 about the equilibrium point x* = [0 1]T and write the system equations in the form . Show that the system matrix A is identical to the Jacobian matrix f '(x) at x = [0 1]T of the original nonlinear system and thus both matrices have the same eigenvalues. (Note that the asymptotic stability of a linear system requires the eigenvalues of its system matrix A to have strictly negative real parts).

4.2.5 The linearization method for studying the stability of a nonlinear system at a given equilibrium point may fail when the linearized system is stable, but not asymptotically; that is, if all eigenvalues of the system matrix A have nonpositive real parts, and if the real part of one or more eigenvalues of A has zero real part. Demonstrate this fact for the nonlinear system

at the equilibrium point x* = [0, 0]T. (Hint: Simulate the above dynamical system for the three cases: a < 0, a = 0, and a > 0, for an initial condition x(0) of your choice).

* 4.3.1 Show that the Hessian of J in Equation (4.3.6) is given by

Also, show that

* 4.3.2 Study the stability of the equilibrium point(s) of the dynamical system , where J(w) is given by

Note that the above system corresponds to the average learning equation of Linsker's learning rule (see Section 3.3.4) without weight clipping.

4.3.3 Verify the Hessian matrices given in Section 4.3 for Oja's, Yuille et al., and Hassoun's learning rules, given in Equations (4.3.14), (4.3.21), and (4.3.30), respectively.

4.3.4 Show that the average learning equation for Hassoun's rule is given by Equation (4.3.26).

4.3.5 Study the stability of the equilibrium points of the stochastic differential equation/learning rule (Riedel and Schild, 1992)

where is a positive integer. ( Note: this rule is equivalent to Yuille et al. rule for = 2 ).

4.3.6 Study, via numerical simulations, the stability of the learning rule ( Riedel and Schild, 1992 )

where . Assume a training set {x} of twenty vectors in R10 whose components are generated randomly and independently according to the normal distribution N(0, 1). Is there a relation between the stable point(s) w* (if such point(s) exist) and the eigenvectors of the input data autocorrelation matrix? Is this learning rule local? Why?

4.3.7 Show that the discrete-time version of Oja's rule is a good approximation of the normalized Hebbian rule in Equation (4.3.33) for small values. Hint: Start by showing that

* 4.3.8 Consider the general learning rule described by the following discrete-time gradient system:

(1)

with  > 0. Assume that w* is an equilibrium point for this dynamical system.

a. Show that in the neighborhood of w*, the gradient J(w) can be approximated as

(2)

where H(w*) is the Hessian of J evaluated at w*.

b. Show that the gradient in Equation (2) is exact when J(w) is quadratic; i.e., where Q is a symmetric matrix, and b is a vector of constants.

c. Show that the linearized gradient system at w* is given by

(3)

where I is the identity matrix.

d. What are the conditions on H(w*) and for local asymptotic stability of w* in Equation (3)?

e. Use the above results to show that, in an average sense, the -LMS rule in Equation (3.1.35) converges asymptotically to the equilibrium solution in Equation (3.1.45) if 0 <  < , where max is the largest eigenvalue of the autocorrelation matrix C in Equation (3.3.4). (Hint: Start with the gradient system in Equation (1) and use Equation (3.1.44) for J). Now, show that is a sufficient condition for convergence of the -LMS rule. (Hint: The trace of a matrix (the sum of all diagonal elements) is equal to the sum of its eigenvalues).

* 4.3.9 Use the results from the previous problem and Equation (4.3.14) to find the range of values for for which the discrete-time Oja's rule is stable (in an average sense). Repeat for Hassoun's rule which has the Hessian matrix given by Equation (4.3.30), and give a justification for the choice  = 1 which has led to Equation (4.3.32).

4.3.10 Consider a training set of 40 15-dimensional vectors whose components are independently generated according to a normal distribution N(0, 1). Employ the stochastic discrete-time version of Oja's, Yuille et al., and Hassoun's rules (replace by wk+1 - wk in Equations (4.3.11), (4.3.17), and (4.3.24), respectively) to extract the principal eigenvector of this training set. Use a fixed presentation order of the training vectors. Compare the convergence behavior of the three learning rules by generating plots similar to those in Figures 4.3.1 (a) and (b). Use  = 0.005,  = 100, and a random initial w. Repeat using the corresponding discrete-time average learning equations with the same learning parameters and initial weight vector as before and compare the two sets of simulations.

4.3.11 This problem illustrates an alternative approach to the one of Section 4.3 for proving the stability of equilibrium points of an average learning equation (Kohonen, 1989). Consider a stochastic first order differential equation of the form where x(t) is governed by a stationary stochastic process. Furthermore, assume that the vectors x are statistically independent from each other and that strong mixing exists. Let z be an arbitrary constant vector having the same dimension as x and w. Now, the "averaged" trajectories of w(t) are obtained by taking the expected value of ,

a. Show that

where  is the angle between vectors z and w.

b. Let z = c(i), the ith unity-norm eigenvector of the autocorrelation matrix C = <xxT>. Show that, the average rate of change of the cosine of the angle between w and c(i) for Oja's rule [Equation (4.3.12)] is given by

where i is the eigenvalue associated with c(i). Note that the c(i)'s are the equilibria of Oja's rule.

c. Use the result in part b to show that if then w(t) will converge to the solution w* = c(1), where c(1) is the eigenvector with the largest eigenvalue 1 (Hint: Recall the bounds on the Rayleigh quotient given in Section 4.3.2).

* 4.3.12 Use the technique outlined in Problem 4.3.11 to study the convergence properties (in the average) of the following stochastic learning rules which employ a generalized forgetting law:

a. .

b. .

Assume that  > 0 and y = wTx, and that g(y) is an arbitrary scalar function of y such that exists. Note that Equation (4.7.18) and Oja's rule are special cases of the learning rules in a and b, respectively.

4.4.1 Show that Equation (4.4.7) is the Hessian for the criterion function implied by Equation (4.4.5).

4.6.1 Study (qualitatively) the competitive learning behavior which minimizes the criterion function

where is as defined in Equation (4.6.3). Can you think of a physical system (for some integer value of N) which is governed by this "energy" function J?

4.6.2 Derive a stochastic competitive learning rule whose corresponding average learning equation maximizes the criterion function in Equation (4.6.16).

* 4.7.1 Show that for the one-dimensional feature map, Equation (4.7.6) can be derived from Equation (4.7.4). See Hertz et al. (1991) for hints.

4.7.2 Show that Equation (4.7.5) satisfies Equation (4.7.6).

4.7.3 Solve Equation (4.7.5) for p(x)  x, where   R. For which input distribution p(x) do we have a zero-distortion feature map?

4.7.4 Prove the stability of the equilibrium point in Equation (4.7.20). (Hint: Employ the technique outlined in Problem 4.3.11).