Equation (4.2.5) [or (4.2.6)] is a powerful tool which can be used in the characterization of the average behavior of stochastic learning equations. We will employ it in this section in order to characterize some unsupervised learning rules which were considered in Chapter 3. The following analysis is made easier if one assumes that w and x are uncorrelated, and that is averaged with respect to x (denoted by ), with w replaced by its mean . This assumption leads to the "approximate" average learning equation

(4.3.1)

A more rigorous characterization of stochastic learning requires the more advanced theory of stochastic differential equations and will not be considered here [see Kushner (1977) and Ljung (1978)]. Rather, we may proceed with a deterministic analysis using the "average versions" of the stochastic equations. It may be shown that a necessary condition for the stochastic learning rule to converge (in the mean-square sense) is that the average version of the learning rule must converge. In addition, and under certain assumptions, the exact solution of a stochastic equation is guaranteed to "stay close," in a probabilistic sense, to the solution of the associated average equation. It has been shown (Geman, 1979) that under strong mixing conditions (and some additional assumptions), . This result implies that if sufficiently small learning rates are used, the behavior of a stochastic learning equation may be well approximated, in a mean-square sense, by the deterministic dynamics of its corresponding average equation. Oja (1983) pointed out that the convergence of constrained gradient descent- (or ascent-) based stochastic learning equations (the type of equations considered in this chapter) can be studied with averaging techniques; i.e., the asymptotically stable equilibria of the average equation are the possible limits of the stochastic equation. Several examples of applying the averaging technique to the characterization of learning rules can be found in Kohonen (1989).

Before proceeding with further analysis of learning rules, we make the following important observations regarding the nature of the learning parameter in the stochastic learning equation (Heskes and Kappen, 1991). When a neural network interacts with a fixed unchanging (stationary) environment, the aim of the learning algorithm is to adjust the weights of the network in order to produce an optimal response; i.e., an optimal representation of the environment. To produce such an optimal and static representation, we require the learning parameter, which controls the amount of learning, to eventually approach zero. Otherwise, fluctuations in the representation will persist, due to the stochastic nature of the learning equation, and asymptotic convergence to optimal representation is never achieved. For a large class of stochastic algorithms, asymptotic convergence can be guaranteed (with high probability) by using the learning parameter ( Ljung, 1977; Kushner and Clark, 1978).

On the other hand, consider biological neural nets. Human beings, of course, are able to continually learn throughout their entire lifetime. In fact, human learning is able to proceed on two different time scales; humans learn with age (very large time scale adaptation/learning) and are also capable of discovering regularities and are attentive for details (short time scale learning). This constant tendency to learn accounts for the adaptability of natural neural systems to a changing environment. Therefore, it is clear that the learning processes in biological neural networks does not allow for asymptotically vanishing learning parameters.

In order for artificial neural networks to be capable of adapting to a changing (nonstationary) environment, the learning parameter must take a constant nonzero value. The larger the learning parameter, the faster the response of the network to the changing environment. On the other hand, a large learning parameter has a negative effect on the accuracy of the network's representation of the environment at a given time; a large gives rise to large fluctuations around the desired optimal representation. In practice, though, one might be willing to trade some degree of fluctuation about the optimal representation (solution) for adaptability to a nonstationary process. Similar ideas have been proposed in connection with stochastic adaptive linear filtering. Here, an adaptive algorithm with a constant step size is used because it has the advantage of a limited memory, which enables it to track time fluctuations in the incoming data. These ideas date back to Wiener (1956) in connection with his work on linear prediction theory.

We have already analyzed one version of the Hebbian learning rule in the previous section. However, here we consider the most simple form of Hebbian learning which is given by Equation (4.2.18) with = 0; namely,

(4.3.2)

The above equation is a continuous-time version of the unsupervised Hebbian learning rule introduced in Chapter 3. Employing Equation (4.3.1), we get the approximate average learning equation

(4.3.3)

In Equation (4.3.3) and in the remainder of this chapter, the subscript x in is dropped in order to simplify notation. Now, the average gradient of J in Equation (4.3.3) may be written as

(4.3.4)

from which we may determine the average instantaneous criterion function

(4.3.5)

Note that is minimized by maximizing the unit's output variance. Again, C = is the autocorrelation matrix, which is positive semidefinite having orthonormal eigenvectors c(i) with corresponding eigenvalues . That is, Cc(i) = ic(i) for i = 1, 2, ..., n. The dynamics of Equation (4.3.3) are unstable. To see this, we first find the equilibrium points by setting = 0, giving C = 0 or w* = 0. w* is unstable because the Hessian (in an average sense) = = -C is non-positive for all . Therefore, Equation (4.3.3) is unstable and results in . Note, however, that the direction of is not random; it will tend to point in the direction of c(1), since if one assumes a fixed weight vector magnitude, is minimized when is parallel to the eigenvector with the largest corresponding eigenvalue.

In the following, we will characterize other versions of the Hebbian learning rule some of which were introduced in Chapter 3. These rules are well-behaved, and hence solve the divergence problem encountered with simple Hebbian learning. For simplifying mathematical notation and terminology, the following sections will use J, J, and H to designate , , and , respectively. We will refer to as simply the criterion function, to as the gradient of J, and to as the Hessian of J. Also, the quantity w in the following average equations should be interpreted as the state of the average learning equation.

Consider the criterion function

(4.3.6)

It is a well established property of quadratic forms that if w is constrained to the surface of the unit hypersphere, then Equation (4.3.6) is minimized when w = c(1) with (e.g. see Johnson and Wichern, 1988). Also, for any real symmetric n  n matrix A, the Rayleigh quotient satisfies where 1 and n are the largest and smallest eigenvalues of A, respectively. Let us start from the above criterion and derive an average learning equation. Employing Equation (4.3.1), we get

(4.3.7)

which can be shown to be the average version of the nonlinear stochastic learning rule

(4.3.8)

If we heuristically set for the two terms in the above equation, Equation (4.3.8) reduces to the continuous version of Oja's rule [refer to Equation (3.3.6)]. Let us continue with the characterization of (4.3.8) and defer the analysis of Oja's rule to Section 4.3.3. At equilibrium, Equation (4.3.7) gives

(4.3.9)

Hence, the equilibria of Equation (4.3.7) are the solutions of Equation (4.3.9) given by w* = c(i), i = 1, 2, ..., n. Here, J takes its smallest value of at w* = c(1). This can be easily verified by direct substitution in Equation (4.3.6).

Next, consider the Hessian of J at w* = c(i) (assuming, without loss of generality, = 1) and multiply it by c(j) , namely, H(c(i)c(j). It can be shown that this quantity is given by (see Problem 4.3.1. For a reference on matrix differential calculus, the reader is referred to the book by Magnus and Neudecker, 1988):

(4.3.10)

This equation implies that H(w*) has the same eigenvectors as C but with different eigenvalues. H(w*) is positive semi-definite only when w* = c(1). Thus, by following the dynamics of Equation (4.3.7), w will eventually point in the direction of c(1) (since none of the other directions c(i) is stable). Although the direction of w will eventually stabilize, it is entirely possible for ||w|| to approach infinity, and Equation (4.3.7) will appear never to converge. We may artificially constrain ||w|| to finite values by normalizing w after each update of Equation (4.3.8). Alternatively, we may set the two terms in Equation (4.3.8) equal to 1. This latter case is considered next.

Oja's rule was defined by Equation (3.3.6). Its continuous-time version is given by the nonlinear stochastic equation

(4.3.11)

(4.3.12)

which has its equilibria at w satisfying

(4.3.13)

The solutions of Equation (4.3.13) are w* = c(i), i = 1, 2, ..., n. All of these equilibria are unstable except for . This can be seen by noting that the Hessian

(4.3.14)

is positive definite only at w* = c(1) (or -c(1)). Note that Equation (4.3.14) is derived starting from , with given in Equation (4.3.12). Although J is not known, the positive definiteness of H can be seen from

(4.3.15)

and by noting that the eigenvalues of C satisfy (we assume 1  2). Therefore, Oja's rule is equivalent to a stable version of the Hebbian rule given in Equation (4.3.8). A formal derivation of Oja's rule is explored in Problem 4.3.7.

A single unit employing Oja's rule (Oja's unit) is equivalent to a linear matched filter. To see this, assume that for all x, , where is a fixed vector (without loss of generality let ) and v is a vector of symmetrically distributed zero-mean noise with uncorrelated components having variance 2. Then . The largest eigenvalue of C is 1 + 2 and the corresponding eigenvector is . Oja's unit then becomes a matched filter for the data, since in Equation (4.3.12). Here, the unit responds maximally to the data mean. Further characterization of Oja's rule can be found in Xu (1993).

Oja's rule is interesting because it results in a local learning rule which is biologically plausible. The locality property is seen by considering the component weight adaptation rule of Equation (4.3.11), namely

(4.3.16)

and by noting that the change in the ith weight is not an "explicit" function of any other weight except the ith weight itself. Of course, does depend on w via y = wTx. However, this dependence does not violate the concept of locality.

It is also interesting to note that Oja's rule is similar to Hebbian learning with weight decay as in Equation (4.2.18). For Oja's rule, though, the growth in ||w|| is controlled by a "forgetting" or weight decay term, -y2w, which has nonlinear gain; the forgetting becomes stronger with stronger response, thus preventing ||w|| from diverging.

Example 4.3.1: This example shows typical simulation results comparing the evolution of the weight vector w according to the stochastic Oja rule in Equation (4.3.11) and its corresponding average rule in Equation (4.3.12).

Consider a training set {x} of forty 15-dimensional column vectors with independent random components generated by a normal distribution N(0, 1). In the following simulations, the training vectors autocorrelation matrix has the following set of eigenvalues: {2.561, 2.254, 2.081, 1.786, 1.358, 1.252, 1.121, 0.963, 0.745, 0.633, 0.500, 0.460, 0.357, 0.288, 0.238}.

During training, one of the forty vectors is selected at random and is used in the learning rule to compute the next weight vector. Discretized versions of Equations (4.3.11) and (4.3.12) are used where is replaced by wk+1 - wk. A learning rate  = 0.005 is used. This is equivalent to integrating these equations with Euler's method (e.g., see Gerald, 1978) using a time step t = 0.005 and  = 1. The initial weight vector is set equal to one of the training vectors. Figures 4.3.1 (a) and (b) show the evolution of the cosine of the angle between wk and c(1) and the evolution of the norm of wk, respectively. The solid line corresponds to the stochastic rule and the dashed line corresponds to the average rule.

(a)

(b)

Figure 4.3.1. (a) Evolution of the cosine of the angle between the weight vector wk and the principal eigenvector of the autocorrelation matrix C for the stochastic Oja rule (solid line) and for the average Oja rule (dashed line). (b) Evolution of the magnitude of the weight vector. The training set consists of forty 15-dimensional real-valued vectors whose components are independently generated according to a normal distribution N(0, 1). The presentation order of the training vectors is random during training.

The above simulation is repeated, but with a fixed presentation order of the training set. Results are shown in Figures 4.3.2 (a) and (b). Note that the results for the average learning equation (dashed line) are identical in both simulations since they are not affected by the order of presentation of input vectors. These simulations agree with the theoretical results on the appropriateness of using the average learning equation to approximate the limiting behavior of its corresponding stochastic learning equation. Note that a monotonically decreasing learning rate (say proportional to or with k  1) can be used to force the convergence of the direction of wk in the first simulation. It is also interesting to note that better approximations are possible when the training vectors are presented in a fixed deterministic order (or in a random order but with each vector guaranteed to be selected once every training cycle of m = 40 presentations). Here, a sufficiently small, constant learning rate is sufficient for making the average dynamics approximate, in a practical sense, the stochastic dynamics for all time.

(a)

(b)

Figure 4.3.2. (a) Evolution of the cosine of the angle between the weight vector wk and the principal eigenvector of the autocorrelation matrix C for the stochastic Oja rule (solid line) and for the average Oja rule (dashed line). (b) Evolution of the magnitude of the weight vector. The training set consists of forty 15-dimensional real-valued vectors whose components are independently generated according to a normal distribution N(0, 1). The presentation order of the training vectors is fixed.

The continuous-time version of the Yuille et al. (1989) learning rule is

(4.3.17)

and the corresponding average learning equation is

(4.3.18)

with equilibria at

(4.3.19)

From Equation (4.3.18), the gradient of J is

(4.3.20)

which leads to the Hessian

(4.3.21)

Note that one could have also computed H directly from the known criterion function

(4.3.22)

Now, evaluating H(wi*)wj* we get

(4.3.23)

which implies that the wj* are eigenvectors of H(wi*) with eigenvalues i - j for i j and 2i for i = j. Therefore, H(wi*) is positive definite if and only if i > j, for i j. In this case, w* = are the only stable equilibria, and the dynamics of the stochastic equation are expected to approach w*.

In the following, the unsupervised Hebbian-type learning rule

(4.3.24)

with ||w||  0 is analyzed. Another way for stabilizing Equation (4.3.2) is to start from a criterion function that explicitly penalizes the divergence of w. For example, if we desire to be satisfied, we may utilize J given by

(4.3.25)

with > 0. It can be easily shown that steepest gradient descent on the above criterion function leads to the learning equation

(4.3.26)

Equation (4.3.26) is the average learning equation for the stochastic rule of Equation (4.3.24). Its equilibria are solutions of the equation

(4.3.27)

Thus, it can be seen that the solution vectors of Equation (4.3.27), denoted by , must be parallel to one of the eigenvectors of C, say c(i),

(4.3.28)

where i is the ith eigenvalue of C. From Equation (4.3.28) it can be seen that the norm of the ith equilibrium state is given by

(4.3.29)

which requires  > i for all i. Note that if >> i, then approaches one for all equilibrium points. Thus, the equilibria of Equation (4.3.26) approach unity norm eigenvectors of the correlation matrix C when is large.

Next, we investigate the stability of these equilibria. Starting from the Hessian

(4.3.30)

we have

(4.3.31)

which implies that is positive definite if and only if is parallel to c(1) and > 1. Therefore, the only stable equilibria of Equation (4.3.26) are which approach c(1) for >> 1. Like the Yuille et al. rule, this rule preserves the information about the size of 1 [1 can be computed from Equation (4.3.29)].

For the discrete-time version, it is interesting to note that for the case >> 1, and = 1, Equation (4.3.24) reduces to

(4.3.32)

which is very similar to the discrete-time simple Hebbian learning rule with weight vector normalization of Equation (3.3.5) and expressed here as

(4.3.33)

Note that the weight vector in in Equation (4.3.33) need not be normalized to prevent divergence.

In practice, discrete-time versions of the stochastic learning rules in Equations (4.3.11), (4.3.17), and (4.3.24) are used where is replaced by wk+1 - wk and w(t) by wk. Here, the stability of these discrete-time stochastic dynamical systems critically depends on the value of the learning rate . Although the stability analysis is difficult, one can resort to the discrete-time versions of the average learning equations corresponding to these rules and derive conditions on for asymptotic convergence (in the mean) in the neighborhood of equilibrium states w*. Such analysis is explored in Problems 4.3.8 and 4.3.9.

In concluding this section, another interpretation of the regularization effects on the stabilization of Hebbian-type rules is presented. In the stochastic learning Equations (4.2.18), (4.3.11), (4.3.17), and (4.3.24), regularization appears as weight decay terms -w, -y2w, -||w||2w, and -w, respectively. Therefore, one may think of weight decay as a way of stabilizing unstable learning rules. However, one should carefully design the gain coefficient in the weight decay term for proper performance. For example, it has been shown earlier that a simple positive constant gain in Equation (4.2.18) does not stabilize Hebb's rule. On the other hand, the nonlinear dynamic gains y2, ||w||2, and lead to stability. Note that the weight decay gain in Oja's rule utilizes more information, in the form of y2, than the Yuille et al. rule or Hassoun's rule. The regularization in these latter rules is only a function of the current weight vector magnitude.

Goto [4.0][4.1] [4.2] [4.4] [4.5] [4.6] [4.7] [4.8] [4.9] [4.10]