4.10 Summary

Learning in artificial neural networks is viewed as a search for parameters (weights) which optimize a predefined criterion function. A general learning equation is presented which implements a stochastic steepest gradient descent search on a general criterion function with (or without) a regularization term. This learning equation serves to unify a wide variety of learning rules, regardless of whether they are supervised, unsupervised, or reinforcement rules.

The learning equation is a first order stochastic differential equation. This allows us to employ an averaging technique to study its equilibria and its convergence characteristics. The use of averaging, under reasonable assumptions, allows us to approximate the stochastic learning equation by a deterministic first-order dynamical system. In most cases, a well-defined criterion function exists which allows us to treat the deterministic systems as a gradient system. This enables us to exploit the global stability property of gradient systems, and determine the nature of the solutions evolved by the average learning equation. These stable solutions are then taken to be the possible solutions sought by the associated stochastic learning equation. The averaging technique is employed in characterizing several basic rules for supervised, unsupervised, and reinforcement learning. For unsupervised learning, we present analysis and insights into the theory of Hebbian, competitive, and self-organizing learning. In particular, self-organizing neural fields are introduced and analyzed.

The chapter also looks at some important results on generalization of learning in general feedforward neural architectures. The asymptotic behavior of generalization error is derived for deterministic and stochastic networks. Generalization in the average and in the worst-case are considered. The main result here is that the number of training examples necessary for "good" generalization on test samples must far exceed the number of adjustable parameters of the network used.

Finally, the issue of complexity of learning in neural networks is addressed. It is found that learning an arbitrary mapping in a layered neural network is NP-complete in the worst-case. However, it is also found that efficient (polynomial-time) learning is possible if appropriate network architectures and corresponding learning algorithms are found for certain classes of mappings/learning tasks.


4.1.1 Identify the "regularization" term, if any, in the learning rules listed in Table 3.1 of Chapter 3.

4.2.1 Characterize the LMS learning rule with weight decay by analyzing its corresponding average differential equation as in Section 4.2. Find the underlying instantaneous criterion J and its expected value .

4.2.2 Employ Liapunov's first method (see footnote #4) to study the stability of the nonlinear system

for the equilibrium point x* = [0 1]T. Is this an asymptotically stable point? Why?

4.2.3 Study the stability of the lossless pendulum system with the nonlinear dynamics

about the equilibrium points x* = [0 0]T and x* = [ 0]T. Here, measures the angle of the pendulum with respect to its vertical rest position, g is gravitational acceleration, and L is the length of the pendulum.

4.2.4 Liapunov's first method for studying the stability of nonlinear dynamical systems (see footnote #4) is equivalent to studying the asymptotic stability of a linearized version of these systems about an equilibrium point. Linearize the second order nonlinear system in Problem 4.2.2 about the equilibrium point x* = [0 1]T and write the system equations in the form . Show that the system matrix A is identical to the Jacobian matrix f '(x) at x = [0 1]T of the original nonlinear system and thus both matrices have the same eigenvalues. (Note that the asymptotic stability of a linear system requires the eigenvalues of its system matrix A to have strictly negative real parts).

4.2.5 The linearization method for studying the stability of a nonlinear system at a given equilibrium point may fail when the linearized system is stable, but not asymptotically; that is, if all eigenvalues of the system matrix A have nonpositive real parts, and if the real part of one or more eigenvalues of A has zero real part. Demonstrate this fact for the nonlinear system

at the equilibrium point x* = [0, 0]T. (Hint: Simulate the above dynamical system for the three cases: a < 0, a = 0, and a > 0, for an initial condition x(0) of your choice).

* 4.3.1 Show that the Hessian of J in Equation (4.3.6) is given by

Also, show that

* 4.3.2 Study the stability of the equilibrium point(s) of the dynamical system , where J(w) is given by

Note that the above system corresponds to the average learning equation of Linsker's learning rule (see Section 3.3.4) without weight clipping.

4.3.3 Verify the Hessian matrices given in Section 4.3 for Oja's, Yuille et al., and Hassoun's learning rules, given in Equations (4.3.14), (4.3.21), and (4.3.30), respectively.

4.3.4 Show that the average learning equation for Hassoun's rule is given by Equation (4.3.26).

4.3.5 Study the stability of the equilibrium points of the stochastic differential equation/learning rule (Riedel and Schild, 1992)

where is a positive integer. ( Note: this rule is equivalent to Yuille et al. rule for = 2 ).

4.3.6 Study, via numerical simulations, the stability of the learning rule ( Riedel and Schild, 1992 )

where . Assume a training set {x} of twenty vectors in R10 whose components are generated randomly and independently according to the normal distribution N(0, 1). Is there a relation between the stable point(s) w* (if such point(s) exist) and the eigenvectors of the input data autocorrelation matrix? Is this learning rule local? Why?

4.3.7 Show that the discrete-time version of Oja's rule is a good approximation of the normalized Hebbian rule in Equation (4.3.33) for small values. Hint: Start by showing that

* 4.3.8 Consider the general learning rule described by the following discrete-time gradient system:


with  > 0. Assume that w* is an equilibrium point for this dynamical system.

a. Show that in the neighborhood of w*, the gradient J(w) can be approximated as


where H(w*) is the Hessian of J evaluated at w*.

b. Show that the gradient in Equation (2) is exact when J(w) is quadratic; i.e., where Q is a symmetric matrix, and b is a vector of constants.

c. Show that the linearized gradient system at w* is given by


where I is the identity matrix.

d. What are the conditions on H(w*) and for local asymptotic stability of w* in Equation (3)?

e. Use the above results to show that, in an average sense, the -LMS rule in Equation (3.1.35) converges asymptotically to the equilibrium solution in Equation (3.1.45) if 0 <  < , where max is the largest eigenvalue of the autocorrelation matrix C in Equation (3.3.4). (Hint: Start with the gradient system in Equation (1) and use Equation (3.1.44) for J). Now, show that is a sufficient condition for convergence of the -LMS rule. (Hint: The trace of a matrix (the sum of all diagonal elements) is equal to the sum of its eigenvalues).

* 4.3.9 Use the results from the previous problem and Equation (4.3.14) to find the range of values for for which the discrete-time Oja's rule is stable (in an average sense). Repeat for Hassoun's rule which has the Hessian matrix given by Equation (4.3.30), and give a justification for the choice  = 1 which has led to Equation (4.3.32).

4.3.10 Consider a training set of 40 15-dimensional vectors whose components are independently generated according to a normal distribution N(0, 1). Employ the stochastic discrete-time version of Oja's, Yuille et al., and Hassoun's rules (replace by wk+1 - wk in Equations (4.3.11), (4.3.17), and (4.3.24), respectively) to extract the principal eigenvector of this training set. Use a fixed presentation order of the training vectors. Compare the convergence behavior of the three learning rules by generating plots similar to those in Figures 4.3.1 (a) and (b). Use  = 0.005,  = 100, and a random initial w. Repeat using the corresponding discrete-time average learning equations with the same learning parameters and initial weight vector as before and compare the two sets of simulations.

4.3.11 This problem illustrates an alternative approach to the one of Section 4.3 for proving the stability of equilibrium points of an average learning equation (Kohonen, 1989). Consider a stochastic first order differential equation of the form where x(t) is governed by a stationary stochastic process. Furthermore, assume that the vectors x are statistically independent from each other and that strong mixing exists. Let z be an arbitrary constant vector having the same dimension as x and w. Now, the "averaged" trajectories of w(t) are obtained by taking the expected value of ,

a. Show that

where  is the angle between vectors z and w.

b. Let z = c(i), the ith unity-norm eigenvector of the autocorrelation matrix C = <xxT>. Show that, the average rate of change of the cosine of the angle between w and c(i) for Oja's rule [Equation (4.3.12)] is given by

where i is the eigenvalue associated with c(i). Note that the c(i)'s are the equilibria of Oja's rule.

c. Use the result in part b to show that if then w(t) will converge to the solution w* = c(1), where c(1) is the eigenvector with the largest eigenvalue 1 (Hint: Recall the bounds on the Rayleigh quotient given in Section 4.3.2).

* 4.3.12 Use the technique outlined in Problem 4.3.11 to study the convergence properties (in the average) of the following stochastic learning rules which employ a generalized forgetting law:

a. .

b. .

Assume that  > 0 and y = wTx, and that g(y) is an arbitrary scalar function of y such that exists. Note that Equation (4.7.18) and Oja's rule are special cases of the learning rules in a and b, respectively.

4.4.1 Show that Equation (4.4.7) is the Hessian for the criterion function implied by Equation (4.4.5).

4.6.1 Study (qualitatively) the competitive learning behavior which minimizes the criterion function

where is as defined in Equation (4.6.3). Can you think of a physical system (for some integer value of N) which is governed by this "energy" function J?

4.6.2 Derive a stochastic competitive learning rule whose corresponding average learning equation maximizes the criterion function in Equation (4.6.16).

* 4.7.1 Show that for the one-dimensional feature map, Equation (4.7.6) can be derived from Equation (4.7.4). See Hertz et al. (1991) for hints.

4.7.2 Show that Equation (4.7.5) satisfies Equation (4.7.6).

4.7.3 Solve Equation (4.7.5) for p(x)  x, where   R. For which input distribution p(x) do we have a zero-distortion feature map?

4.7.4 Prove the stability of the equilibrium point in Equation (4.7.20). (Hint: Employ the technique outlined in Problem 4.3.11).

Goto [4.0][4.1] [4.2] [4.3] [4.4] [4.5] [4.6] [4.7] [4.8] [4.9]

Back to the Table of Contents

Back to Main Menu