4.2 Mathematical Theory of Learning in a Single Unit Setting

In this section, instead of dealing separately with the various learning rules proposed in the previous chapter, we seek to study a single learning rule, called the general learning equation (Amari, 1977a, 1990), which captures the salient features of several of the different single-unit learning rules. Two forms of the general learning equation will be presented: a discrete-time version, in which the weight vector evolves according to a discrete dynamical system of the form w(k+1) = g(w(k)), and a continuous-time version, in which the weight vector evolves according to a smooth dynamical system of the form dw/dt = g(w). Statistical analysis of the continuous-time version of the general learning equation is then performed for selected learning rules, including the correlation, LMS, and Hebbian learning rules.

4.2.1 General Learning Equation

Consider a single unit which is characterized by a weight vector w ∈ ℝⁿ, an input vector x ∈ ℝⁿ, and, in some cases, a scalar teacher signal z. In a supervised learning setting, the teacher signal is taken as the desired target associated with a particular input vector. The input vector (signal) is assumed to be generated by an environment or an information source according to the probability density p(x, z), or p(x) if z is missing (as in unsupervised learning). Now, consider the following discrete-time dynamical process which governs the evolution of the unit's weight vector w

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \rho\left[-\alpha\,\mathbf{w}(k) + r\big(\mathbf{w}(k), \mathbf{x}(k), z(k)\big)\,\mathbf{x}(k)\right] \qquad (4.2.1)$$

and the continuous-time version

$$\frac{d\mathbf{w}}{dt} = \rho\left[-\alpha\,\mathbf{w} + r(\mathbf{w}, \mathbf{x}, z)\,\mathbf{x}\right] \qquad (4.2.2)$$

where ρ and α are positive real constants. Here, r(w, x, z) is referred to as a "learning signal." One can easily verify that the above two equations lead to discrete-time and continuous-time versions, respectively, of the perceptron learning rule in Equation (3.1.2) if α = 0, y = sgn(wᵀx), and r(w, x, z) = z − y (here z is taken as bipolar binary). The μ-LMS (or Widrow-Hoff) rule of Equation (3.1.35) can be obtained by setting α = 0 and r(w, x, z) = z − wᵀx in Equation (4.2.1). Similarly, substituting α = 0, ρ = 1, and r(w, x, z) = z in Equation (4.2.1) leads to the simple correlation rule in Equation (3.1.50), and α = 0 with r(w, x, z) = y leads to the Hebbian rule in Equation (3.3.1). In the remainder of this section, Equation (4.2.2) is adopted and is referred to as the "general learning equation." Note that in Equation (4.2.2) the state w* = 0 is an asymptotically stable equilibrium point if r(w, x, z) or x is identically zero. Thus, the term −αw in Equation (4.2.2) plays the role of a "forgetting term" which tends to "erase" those weights not receiving sufficient reinforcement during learning. A minimal sketch of the discrete-time form, with the above special cases, follows.
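The following Python sketch (the function and variable names are ours, not the text's; NumPy is assumed) implements one step of Equation (4.2.1) and lists the learning-signal choices discussed above as special cases:

```python
import numpy as np

def general_learning_step(w, x, z, r, rho=0.1, alpha=0.0):
    """One step of the discrete-time general learning equation (4.2.1):
    w(k+1) = w(k) + rho * [-alpha * w(k) + r(w, x, z) * x]."""
    return w + rho * (-alpha * w + r(w, x, z) * x)

# Special cases discussed above (z is bipolar binary for the perceptron):
perceptron_r  = lambda w, x, z: z - np.sign(w @ x)  # r = z - y,    alpha = 0
lms_r         = lambda w, x, z: z - w @ x           # r = z - w'x,  alpha = 0
correlation_r = lambda w, x, z: z                   # r = z, rho = 1, alpha = 0
hebbian_r     = lambda w, x, z: w @ x               # r = y = w'x,  alpha = 0
```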

From the point of view of analysis, it is useful to think of Equation (4.2.2) as implementing a fixed increment steepest gradient descent search of an instantaneous criterion function J, or formally,

$$\frac{d\mathbf{w}}{dt} = -\rho\,\nabla_{\mathbf{w}} J(\mathbf{w}, \mathbf{x}, z) \qquad (4.2.3)$$

For the case r(w, x, z) = r(wᵀx, z) = r(u, z), the right-hand side of Equation (4.2.2) can be integrated to yield

$$J(\mathbf{w}) = \frac{\alpha}{2}\,\|\mathbf{w}\|^{2} - \int_{0}^{u} r(u', z)\,du' \qquad (4.2.4)$$

where u = wᵀx. This type of criterion function (which has the classic form of a potential function) is appropriate for learning rules such as the perceptron, LMS, Hebbian, and correlation rules. In the most general case, however, r(w, x, z) ≠ r(wᵀx, z), and a suitable criterion function J satisfying Equation (4.2.3) may not be readily determined (or may not even exist).

The criterion function J in Equation (4.2.4) fits the general form of the constrained criterion function in Equation (4.1.3). Therefore, we may view the task of minimizing J as an optimization problem with the objective of maximizing ∫₀ᵘ r(u′, z) du′ subject to a regularization term (α/2)‖w‖² which penalizes solution vectors w* with large norm. It is interesting to note that by maximizing ∫₀ᵘ r(u′, z) du′, one is actually maximizing the amount of information learned from a given example pair {x, z}. In other words, the general learning equation is designed so that it extracts the maximum amount of "knowledge" present in the learning signal r(w, x, z).
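As a sanity check on Equations (4.2.3) and (4.2.4), the following sketch (ours, not from the text) verifies numerically, for the Hebbian learning signal r(u, z) = u, that the right-hand side of Equation (4.2.2) with ρ = 1 equals −∇w J:

```python
import numpy as np

# Finite-difference check that, for r(u, z) = u, the right-hand side of
# Equation (4.2.2) with rho = 1 equals -grad_w J, where
# J(w) = (alpha/2)||w||^2 - u^2/2 with u = w'x, as in Equation (4.2.4).
rng = np.random.default_rng(0)
w, x = rng.normal(size=3), rng.normal(size=3)
alpha, eps = 0.5, 1e-6

def J(w):
    u = w @ x
    return 0.5 * alpha * (w @ w) - 0.5 * u**2

num_grad = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
rhs = -alpha * w + (w @ x) * x       # -alpha*w + r(u, z)*x with r = u
print(np.allclose(-num_grad, rhs, atol=1e-5))   # expected: True
```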

4.2.2 Analysis of the Learning Equation

In a stochastic environment where the information source is ergodic, the sequence of input vectors x is an independent stochastic process governed by p(x). The general learning equation in Equation (4.2.2) then becomes a stochastic differential equation (and its discrete-time counterpart becomes a stochastic approximation algorithm). The weight vector w is changed in random directions depending on the random variable x. From Equation (4.2.3), the average value of dw/dt becomes proportional to the average gradient of the instantaneous criterion function J. Formally, we write

$$\left\langle \frac{d\mathbf{w}}{dt} \right\rangle = -\rho \left\langle \nabla_{\mathbf{w}} J \right\rangle \qquad (4.2.5)$$

where ⟨·⟩ implies averaging over all possible inputs x with respect to the probability distribution p(x). We will refer to this equation as the "average learning equation." Equation (4.2.5) may be viewed as a steepest gradient descent search for w* which (locally) minimizes the expected criterion function ⟨J⟩, because the linear nature of the averaging operation allows us to express Equation (4.2.5) as

$$\left\langle \frac{d\mathbf{w}}{dt} \right\rangle = -\rho\,\nabla_{\mathbf{w}} \langle J \rangle \qquad (4.2.6)$$

It is interesting to note that finding w* does not require the knowledge of J explicitly if the gradient of J is known. Equation (4.2.6) is useful from a theoretical point of view in determining the equilibrium state(s) and in characterizing the stochastic learning equation [Equation (4.2.2)] in an "average" sense. In practice, the stochastic learning equation is implemented and its average convergence behavior is characterized by the "average learning equation" given as

$$\frac{d\mathbf{w}}{dt} = -\rho \left\langle \nabla_{\mathbf{w}} J \right\rangle = -\rho\,\nabla_{\mathbf{w}} \langle J \rangle \qquad (4.2.7)$$

The gradient system in Equation (4.2.6) has special properties that make its dynamics rather simple to analyze. First, note that the equilibria w* are solutions to ∇w⟨J⟩ = 0. This means that the equilibria w* are the local minima, local maxima, and/or saddle points of ⟨J⟩. Furthermore, it is a well-established result that, for any ρ > 0, the local minima are asymptotically stable points (attractors) and the local maxima are unstable points (Hirsch and Smale, 1974). Thus, one would expect the stochastic dynamics of the system in Equation (4.2.3), with sufficiently small ρ, to approach a local minimum of ⟨J⟩.

In practice, discrete-time versions of the stochastic dynamical system in Equation (4.2.3) are used for weight adaptation. Here, the stability of the corresponding discrete-time average learning equation (a discrete-time gradient system) is ensured if 0 < ρ < 2/λmax, where λmax is the largest eigenvalue of the Hessian matrix H = ∇²⟨J⟩, evaluated at the current point in the search space (the proof of this statement is outlined in Problem 4.3.8). These discrete-time "learning rules" and their associated average learning equations have been extensively studied in a more general context than that of neural networks. The book by Tsypkin (1971) gives an excellent treatment of these iterative learning rules and their stability.
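The step-size bound can be illustrated numerically. The sketch below (an illustration of ours, with synthetic data) iterates the discrete-time average learning equation for the quadratic criterion ⟨J⟩ = ½wᵀCw − Pᵀw, whose Hessian is C, with ρ chosen just below and just above 2/λmax:

```python
import numpy as np

# Illustrating the bound 0 < rho < 2/lambda_max for discrete-time gradient
# descent on a quadratic averaged criterion <J> = 0.5 w'Cw - P'w (Hessian C).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
z = X @ np.array([1.0, -2.0])
C = X.T @ X / 500                 # sample autocorrelation matrix <xx'>
P = X.T @ z / 500                 # sample cross-correlation <zx>
w_star = np.linalg.solve(C, P)
lam_max = np.linalg.eigvalsh(C).max()

for rho in (1.9 / lam_max, 2.1 / lam_max):   # just below / above the bound
    w = np.zeros(2)
    for _ in range(200):
        w = w - rho * (C @ w - P)            # w(k+1) = w(k) - rho * grad<J>
    print(f"rho*lam_max = {rho * lam_max:.2f}, "
          f"||w - w*|| = {np.linalg.norm(w - w_star):.3g}")
```

With ρλmax = 1.9 the error shrinks toward zero; with ρλmax = 2.1 it grows without bound, as the analysis predicts.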

4.2.3 Analysis of Some Basic Learning Rules

By utilizing Equation (4.2.7), we are now ready to analyze some basic learning rules. These are the correlation, LMS, and Hebbian learning rules.

Correlation Learning

Here, r(w, x, z) = z, which represents the desired target associated with the input x. From Equation (4.2.2) we have the stochastic equation

$$\frac{d\mathbf{w}}{dt} = \rho\,(-\alpha\,\mathbf{w} + z\,\mathbf{x}) \qquad (4.2.8)$$

which leads to the average learning equation

$$\left\langle \frac{d\mathbf{w}}{dt} \right\rangle = \rho\,\big(-\alpha\,\mathbf{w} + \langle z\,\mathbf{x} \rangle\big) \qquad (4.2.9)$$

Now, by setting ⟨dw/dt⟩ = 0, one arrives at the (only) equilibrium point

$$\mathbf{w}^{*} = \frac{1}{\alpha}\,\langle z\,\mathbf{x} \rangle \qquad (4.2.10)$$

The stability of w* may now be systematically studied through the "expected" Hessian matrix ⟨H(w*)⟩, which is computed, by first employing Equations (4.2.5) and (4.2.9) to identify ∇w⟨J⟩, as

$$\langle \mathbf{H}(\mathbf{w}^{*}) \rangle = \nabla_{\mathbf{w}}^{2} \langle J \rangle = \alpha\,\mathbf{I} \qquad (4.2.11)$$

This equation shows that the Hessian of ⟨J⟩ is positive definite; i.e., its eigenvalues are strictly positive or, equivalently, the eigenvalues of −⟨H(w*)⟩ are strictly negative. This makes the system locally asymptotically stable at the equilibrium solution w* by virtue of Liapunov's first method (see Gill et al., 1981; Dickinson, 1991). Thus, w* is a stable equilibrium of Equation (4.2.9). In fact, the positive definite Hessian implies that w* is the only minimum of ⟨J⟩, and therefore the gradient system converges globally and asymptotically to w* from any initial state. Thus, the trajectory w(t) of the stochastic system in Equation (4.2.8) is expected to approach and then fluctuate about the state w* = (1/α)⟨zx⟩.

From Equation (4.2.4), the underlying instantaneous criterion function J is given by

$$J(\mathbf{w}) = \frac{\alpha}{2}\,\|\mathbf{w}\|^{2} - z\,y \qquad (4.2.12)$$

which may be minimized by maximizing the correlation zy subject to the regularization term (α/2)‖w‖². Here, the regularization term is needed in order to keep the solution bounded.
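A simple Euler simulation (ours; the teacher signal below is an arbitrary choice for illustration) shows the stochastic trajectory of Equation (4.2.8) approaching and then fluctuating about w* = (1/α)⟨zx⟩:

```python
import numpy as np

# Euler simulation of the stochastic correlation rule (4.2.8),
# dw/dt = rho * (-alpha*w + z*x). The trajectory should fluctuate about
# w* = <zx>/alpha, per Equation (4.2.10).
rng = np.random.default_rng(2)
n, alpha, rho, dt, N = 3, 1.0, 1.0, 0.01, 100000
w_teacher = np.array([1.0, -1.0, 0.5])   # hypothetical teacher, illustration only
w = np.zeros(n)
zx_sum = np.zeros(n)
for _ in range(N):
    x = rng.normal(size=n)
    z = float(np.sign(w_teacher @ x))    # bipolar binary teacher signal
    w += dt * rho * (-alpha * w + z * x)
    zx_sum += z * x
print(w)                    # final weights, fluctuating about w*
print(zx_sum / N / alpha)   # sample estimate of w* = <zx>/alpha
```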

LMS Learning

For r(w, x, z) = z − wᵀx (the output error due to input x) and α = 0, Equation (4.2.2) leads to the stochastic equation

$$\frac{d\mathbf{w}}{dt} = \rho\,\big(z - \mathbf{w}^{\mathrm{T}}\mathbf{x}\big)\,\mathbf{x} \qquad (4.2.13)$$

In this case, the average learning equation becomes

$$\left\langle \frac{d\mathbf{w}}{dt} \right\rangle = \rho\,\big(\langle z\,\mathbf{x} \rangle - \langle \mathbf{x}\mathbf{x}^{\mathrm{T}} \rangle\,\mathbf{w}\big) \qquad (4.2.14)$$

with equilibria satisfying

$$\left\langle \big(z - \mathbf{w}^{\mathrm{T}}\mathbf{x}\big)\,\mathbf{x} \right\rangle = \mathbf{0}$$

or

$$\langle \mathbf{x}\mathbf{x}^{\mathrm{T}} \rangle\,\mathbf{w}^{*} = \langle z\,\mathbf{x} \rangle \qquad (4.2.15)$$

Let C denote the positive-semidefinite autocorrelation matrix ⟨xxᵀ⟩ defined in Equation (3.3.4), and let P = ⟨zx⟩. If C⁻¹ exists, then w* = C⁻¹P is the equilibrium state. Note that w* approaches the minimum SSE solution in the limit of a large training set, and that this analysis is identical to the analysis of the μ-LMS rule in Chapter 3. Let us now check the stability of w*. The Hessian matrix is

$$\langle \mathbf{H} \rangle = \nabla_{\mathbf{w}}^{2} \langle J \rangle = \langle \mathbf{x}\mathbf{x}^{\mathrm{T}} \rangle = \mathbf{C} \qquad (4.2.16)$$

which is positive definite if C is nonsingular (det C ≠ 0). Therefore, w* = C⁻¹P is the only (asymptotically) stable solution of Equation (4.2.14), and the stochastic dynamics in Equation (4.2.13) are expected to approach this solution.

Finally, note that with α = 0, Equation (4.2.4) leads to

$$J(\mathbf{w}) = -\int_{0}^{u} (z - u')\,du' = \frac{1}{2}\big(z - \mathbf{w}^{\mathrm{T}}\mathbf{x}\big)^{2} - \frac{1}{2}z^{2}$$

or, dropping the term that is independent of w,

$$J(\mathbf{w}) = \frac{1}{2}\big(z - \mathbf{w}^{\mathrm{T}}\mathbf{x}\big)^{2} \qquad (4.2.17)$$

which is the instantaneous SSE (or MSE) criterion function.
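The following sketch (ours; the mixing matrix and noisy linear teacher are assumptions made for illustration) simulates Equation (4.2.13) and compares the resulting weights with a sample estimate of w* = C⁻¹P:

```python
import numpy as np

# Simulation of the stochastic LMS equation (4.2.13),
# dw/dt = rho * (z - w'x) * x, which should approach w* = C^{-1}P
# with C = <xx'> and P = <zx>.
rng = np.random.default_rng(3)
n, rho, dt, N = 3, 1.0, 0.005, 100000
A = rng.normal(size=(n, n))                # mixing matrix -> correlated inputs
w_teacher = np.array([2.0, 0.0, -1.0])     # hypothetical linear teacher
w = np.zeros(n)
C_hat, P_hat = np.zeros((n, n)), np.zeros(n)
for _ in range(N):
    x = A @ rng.normal(size=n)
    z = w_teacher @ x + 0.1 * rng.normal() # noisy target
    w += dt * rho * (z - w @ x) * x
    C_hat += np.outer(x, x)
    P_hat += z * x
print(w)                                      # stochastic LMS weights
print(np.linalg.solve(C_hat / N, P_hat / N))  # sample estimate of w* = C^{-1}P
```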

Hebbian Learning

Here, upon setting r(w, x, z) = y = wᵀx, Equation (4.2.2) gives the Hebbian rule with decay

$$\frac{d\mathbf{w}}{dt} = \rho\left[-\alpha\,\mathbf{w} + \big(\mathbf{w}^{\mathrm{T}}\mathbf{x}\big)\,\mathbf{x}\right] \qquad (4.2.18)$$

whose average is

$$\left\langle \frac{d\mathbf{w}}{dt} \right\rangle = \rho\,(\mathbf{C} - \alpha\,\mathbf{I})\,\mathbf{w} \qquad (4.2.19)$$

Setting ⟨dw/dt⟩ = 0 in Equation (4.2.19) leads to the equilibria

$$\mathbf{C}\,\mathbf{w}^{*} = \alpha\,\mathbf{w}^{*} \qquad (4.2.20)$$

So if C happens to have α as an eigenvalue, then w* will be the eigenvector of C corresponding to α. In general, though, α will not be an eigenvalue of C, so Equation (4.2.19) will have only one equilibrium at w* = 0. This equilibrium solution is asymptotically stable if α is greater than the largest eigenvalue of C, since this makes the Hessian

$$\langle \mathbf{H} \rangle = \alpha\,\mathbf{I} - \mathbf{C} \qquad (4.2.21)$$

positive definite. Now, employing Equation (4.2.4) we get the instantaneous criterion function minimized by the Hebbian rule in Equation (4.2.18):

$$J(\mathbf{w}) = \frac{\alpha}{2}\,\|\mathbf{w}\|^{2} - \frac{1}{2}\big(\mathbf{w}^{\mathrm{T}}\mathbf{x}\big)^{2} = \frac{\alpha}{2}\,\|\mathbf{w}\|^{2} - \frac{1}{2}\,y^{2} \qquad (4.2.22)$$

The regularization term (α/2)‖w‖² is not adequate here to stabilize the Hebbian rule at a solution which maximizes y². However, other, more appropriate regularization terms can ensure stability, as we will see in the next section.
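To illustrate the stability condition, the sketch below (ours, with synthetic data) integrates the average dynamics of Equation (4.2.19) for α above and below the largest eigenvalue of C:

```python
import numpy as np

# Average Hebbian dynamics with decay, Equation (4.2.19):
# d<w>/dt = rho * (C - alpha*I) w. The origin is asymptotically stable when
# alpha exceeds the largest eigenvalue of C, and unstable otherwise.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3)) @ np.diag([2.0, 1.0, 0.5])
C = X.T @ X / 1000
lam_max = np.linalg.eigvalsh(C).max()

for alpha in (1.5 * lam_max, 0.5 * lam_max):
    w = rng.normal(size=3)
    for _ in range(5000):
        w += 0.01 * (C @ w - alpha * w)    # Euler step, rho = 1, dt = 0.01
    print(f"alpha > lam_max: {alpha > lam_max}, ||w|| = {np.linalg.norm(w):.3g}")
```

When α > λmax the norm of w decays to (essentially) zero; when α < λmax it grows without bound, confirming that the simple decay term cannot by itself stabilize the Hebbian rule at a useful solution.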
