4.10 Summary
Learning in artificial neural networks is viewed
as a search for parameters (weights) which optimize a predefined
criterion function. A general learning equation is presented
which implements a stochastic steepest gradient descent search
on a general criterion function with (or without) a regularization
term. This learning equation serves to unify a wide variety of
learning rules, regardless of whether they are supervised, unsupervised,
or reinforcement rules.
The learning equation is a first order stochastic
differential equation. This allows us to employ an averaging
technique to study its equilibria and its convergence characteristics.
The use of averaging, under reasonable assumptions, allows us
to approximate the stochastic learning equation by a deterministic
first-order dynamical system. In most cases, a well-defined criterion
function exists which allows us to treat the deterministic systems
as a gradient system. This enables us to exploit the global stability
property of gradient systems, and determine the nature of the
solutions evolved by the average learning equation. These stable
solutions are then taken to be the possible solutions sought by
the associated stochastic learning equation. The averaging technique
is employed in characterizing several basic rules for supervised,
unsupervised, and reinforcement learning. For unsupervised learning,
we present analysis and insights into the theory of Hebbian, competitive,
and self-organizing learning. In particular, self-organizing
neural fields are introduced and analyzed.
The chapter also looks at some important results
on generalization of learning in general feedforward neural architectures.
The asymptotic behavior of generalization error is derived for
deterministic and stochastic networks. Generalization in the
average and in the worst-case are considered. The main result
here is that the number of training examples necessary for "good"
generalization on test samples must far exceed the number of adjustable
parameters of the network used.
Finally, the issue of complexity of learning in
neural networks is addressed. It is found that learning an arbitrary
mapping in a layered neural network is NP-complete in the worst-case.
However, it is also found that efficient (polynomial-time) learning
is possible if appropriate network architectures and corresponding
learning algorithms are found for certain classes of mappings/learning
tasks.
Problems
4.1.1 Identify the "regularization"
term, if any, in the learning rules listed in Table 3.1 of Chapter
3.
4.2.1 Characterize the
LMS learning rule with weight decay by analyzing its corresponding
average differential equation as in Section 4.2. Find the underlying
instantaneous criterion J and its expected value .
4.2.2 Employ Liapunov's
first method (see footnote #4) to study the stability of the nonlinear
system
for the equilibrium point x* = [0 1]T.
Is this an asymptotically stable point? Why?
4.2.3 Study the stability
of the lossless pendulum system with the nonlinear dynamics
about the equilibrium points x* = [0 0]T
and x* = [ 0]T.
Here, measures the angle of the pendulum with respect to its
vertical rest position, g is gravitational acceleration,
and L is the length of the pendulum.
4.2.4 Liapunov's first
method for studying the stability of nonlinear dynamical systems
(see footnote #4) is equivalent to studying the asymptotic stability
of a linearized version of these systems about an equilibrium
point. Linearize the second order nonlinear system in Problem
4.2.2 about the equilibrium point x* = [0 1]T
and write the system equations in the form . Show that the system
matrix A is identical to the Jacobian matrix f '(x)
at x = [0 1]T
of the original nonlinear system and thus both matrices have the
same eigenvalues. (Note that the asymptotic stability of a linear
system requires the eigenvalues of its system matrix A
to have strictly negative real parts).
4.2.5 The linearization
method for studying the stability of a nonlinear system at a given
equilibrium point may fail when the linearized system is stable,
but not asymptotically; that is, if all eigenvalues of the system
matrix A have nonpositive real parts, and if the real part
of one or more eigenvalues of A has zero real part. Demonstrate
this fact for the nonlinear system
at the equilibrium point x* = [0, 0]T.
(Hint: Simulate the above dynamical system for the three cases:
a < 0, a = 0, and a > 0,
for an initial condition x(0) of your choice).
* 4.3.1 Show that the
Hessian of J in Equation (4.3.6) is given by
Also, show that
* 4.3.2 Study the stability
of the equilibrium point(s) of the dynamical system , where J(w)
is given by
Note that the above system corresponds to the average
learning equation of Linsker's learning rule (see Section 3.3.4)
without weight clipping.
4.3.3 Verify the Hessian
matrices given in Section 4.3 for Oja's, Yuille et al., and Hassoun's
learning rules, given in Equations (4.3.14), (4.3.21), and (4.3.30),
respectively.
4.3.4 Show that the average
learning equation for Hassoun's rule is given by Equation (4.3.26).
4.3.5 Study the stability
of the equilibrium points of the stochastic differential equation/learning
rule (Riedel and Schild, 1992)
where is a positive integer. ( Note: this rule
is equivalent to Yuille et al. rule for = 2 ).
4.3.6 Study, via
numerical simulations, the stability of the learning rule ( Riedel
and Schild, 1992 )
where . Assume a training set {x}
of twenty vectors in R10
whose components are generated randomly and independently according
to the normal distribution N(0, 1). Is there a relation between
the stable point(s) w*
(if such point(s) exist) and the eigenvectors of the input data
autocorrelation matrix? Is this learning rule local? Why?
4.3.7 Show that the discrete-time
version of Oja's rule is a good approximation of the normalized
Hebbian rule in Equation (4.3.33) for small values. Hint: Start
by showing that
* 4.3.8 Consider the general
learning rule described by the following discrete-time gradient
system:
(1)
with > 0. Assume that w*
is an equilibrium point for this dynamical system.
a. Show that in the neighborhood of w*,
the gradient J(w) can be approximated as
(2)
where H(w*)
is the Hessian of J evaluated at w*.
b. Show that the gradient in Equation (2) is exact
when J(w) is quadratic; i.e., where Q is
a symmetric matrix, and b is a vector of constants.
c. Show that the linearized gradient system at w*
is given by
(3)
where I is the identity matrix.
d. What are the conditions on H(w*)
and for local asymptotic stability of w*
in Equation (3)?
e. Use the above results to show that, in an average
sense, the -LMS rule in Equation (3.1.35) converges asymptotically
to the equilibrium solution in Equation (3.1.45) if 0 < < ,
where max is the largest
eigenvalue of the autocorrelation matrix C in Equation
(3.3.4). (Hint: Start with the gradient system in Equation (1)
and use Equation (3.1.44) for J). Now, show that is a
sufficient condition for convergence of the -LMS rule. (Hint:
The trace of a matrix (the sum of all diagonal elements) is equal
to the sum of its eigenvalues).
* 4.3.9 Use the results
from the previous problem and Equation (4.3.14) to find the range
of values for for which the discrete-time Oja's rule is stable
(in an average sense). Repeat for Hassoun's rule which has the
Hessian matrix given by Equation (4.3.30), and give a justification
for the choice = 1 which has led to Equation (4.3.32).
4.3.10 Consider
a training set of 40 15-dimensional vectors whose components are
independently generated according to a normal distribution N(0, 1).
Employ the stochastic discrete-time version of Oja's, Yuille
et al., and Hassoun's rules (replace by wk+1 - wk
in Equations (4.3.11), (4.3.17), and (4.3.24), respectively) to
extract the principal eigenvector of this training set. Use a
fixed presentation order of the training vectors. Compare the
convergence behavior of the three learning rules by generating
plots similar to those in Figures 4.3.1 (a) and (b). Use = 0.005,
= 100, and a random initial w. Repeat using
the corresponding discrete-time average learning equations with
the same learning parameters and initial weight vector as before
and compare the two sets of simulations.
4.3.11 This problem illustrates
an alternative approach to the one of Section 4.3 for proving
the stability of equilibrium points of an average learning equation
(Kohonen, 1989). Consider a stochastic first order differential
equation of the form where x(t) is governed by
a stationary stochastic process. Furthermore, assume that the
vectors x are statistically independent from each other
and that strong mixing exists. Let z be an arbitrary constant
vector having the same dimension as x and w. Now,
the "averaged" trajectories of w(t) are
obtained by taking the expected value of ,
a. Show that
where is the angle between vectors z
and w.
b. Let z = c(i),
the ith unity-norm eigenvector of the autocorrelation matrix
C = <xxT>.
Show that, the average rate of change of the cosine of the angle
between w and c(i)
for Oja's rule [Equation (4.3.12)] is given by
where i
is the eigenvalue associated with c(i).
Note that the c(i)'s
are the equilibria of Oja's rule.
c. Use the result in part b to show that if then
w(t) will converge to the solution w* = c(1),
where c(1) is the
eigenvector with the largest eigenvalue 1
(Hint: Recall the bounds on the Rayleigh quotient given in Section
4.3.2).
* 4.3.12
Use the technique outlined in Problem 4.3.11 to study the convergence
properties (in the average) of the following stochastic learning
rules which employ a generalized forgetting law:
a. .
b. .
Assume that > 0 and y = wTx,
and that g(y) is an arbitrary scalar function of
y such that exists. Note that Equation (4.7.18) and Oja's
rule are special cases of the learning rules in a and b, respectively.
4.4.1 Show that Equation
(4.4.7) is the Hessian for the criterion function implied by Equation
(4.4.5).
4.6.1 Study (qualitatively)
the competitive learning behavior which minimizes the criterion
function
where is as defined in Equation (4.6.3). Can you
think of a physical system (for some integer value of N)
which is governed by this "energy" function J?
4.6.2 Derive a stochastic
competitive learning rule whose corresponding average learning
equation maximizes the criterion function in Equation (4.6.16).
* 4.7.1 Show that for
the one-dimensional feature map, Equation (4.7.6) can be derived
from Equation (4.7.4). See Hertz et al. (1991) for hints.
4.7.2 Show that Equation
(4.7.5) satisfies Equation (4.7.6).
4.7.3 Solve Equation (4.7.5)
for p(x) x, where R.
For which input distribution p(x) do we have a zero-distortion
feature map?
4.7.4 Prove the stability
of the equilibrium point in Equation (4.7.20). (Hint: Employ the
technique outlined in Problem 4.3.11).
Goto [4.0][4.1] [4.2] [4.3] [4.4] [4.5] [4.6] [4.7] [4.8] [4.9]