This chapter describes a number of basic learning
rules for supervised, reinforcement, and unsupervised learning.
It presents a unifying view of these learning rules for the single-unit
setting: the learning process is viewed as a steepest
gradient-based search for a set of weights that optimizes an associated
criterion function. This holds for all learning rules, regardless
of the usual taxonomy of these rules as supervised, reinforcement,
or unsupervised. Preliminary characterizations of some learning
rules are given, with the in-depth mathematical analysis deferred
to the next chapter.
Various forms of criterion functions are discussed,
including the perceptron, SSE, MSE, Minkowski-r, and relative
entropy criterion functions. The question of which learning rules
are capable of finding linearly separable solutions leads us to
identify the class of well-formed criterion functions. When such a
function is optimized by an iterative search process, the search
is guaranteed to enter the region of the weight space that
corresponds to linearly separable solutions.
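The flavor of such a rule can be sketched as follows: a minimal perceptron-rule implementation performing incremental gradient descent on the perceptron criterion J(w) = Σ(−wᵀz) over misclassified patterns z = d·x. The bipolar targets, the learning rate ρ = 0.1, and the one-dimensional toy data are illustrative assumptions, not the chapter's own example.

```python
import numpy as np

def perceptron_train(X, d, rho=0.1, max_epochs=100):
    """Incremental gradient descent on the perceptron criterion
    J(w) = sum over misclassified z of (-w^T z), with z = d * x."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, d):
            if t * np.dot(w, x) <= 0:      # misclassified (or on boundary)
                w += rho * t * x           # move w toward the pattern
                errors += 1
        if errors == 0:                    # finite convergence for
            break                          # linearly separable data
    return w

# Linearly separable toy set (last component is the +1 bias input).
X = np.array([[-4.0, 1.0], [-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0]])
d = np.array([-1.0, -1.0, 1.0, 1.0])
w = perceptron_train(X, d)
```

Because the data are linearly separable, the loop terminates after a finite number of corrections, which is exactly the finite-convergence property discussed in the text.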
Steepest gradient-based learning is then extended to
the stochastic unit, which serves as the foundation for reinforcement
learning. Reinforcement learning is viewed as a stochastic process
that attempts to maximize the average reinforcement signal.
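One common form of such a stochastic-unit update is sketched below, in the spirit of associative reward-penalty learning. The sigmoid form of P(y = +1), the baseline ⟨y⟩ = tanh(w·x), and the rates ρ⁺, ρ⁻ are assumptions for illustration, not the book's exact equations.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_output(w, x):
    """Stochastic bipolar unit: emits +1 with probability
    P(y = +1) = 1 / (1 + exp(-2 w.x)), else emits -1."""
    p = 1.0 / (1.0 + np.exp(-2.0 * np.dot(w, x)))
    return 1.0 if rng.random() < p else -1.0

def arp_step(w, x, y, r, rho_plus=0.1, rho_minus=0.01):
    """One reward-penalty update: on reward (r > 0) push the expected
    output <y> = tanh(w.x) toward the emitted action y; on penalty,
    push it toward -y, with a smaller rate rho_minus."""
    y_mean = np.tanh(np.dot(w, x))
    if r > 0:
        return w + rho_plus * (y - y_mean) * x
    return w + rho_minus * (-y - y_mean) * x
```

Averaged over the unit's stochastic choices, rewarded actions become more probable and penalized ones less so, which is how the rule climbs the average reinforcement.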
Unsupervised learning is treated in the second half
of the chapter. Here, Hebbian learning is discussed, and examples
of stable Hebbian-type learning rules are presented along with
illustrative simulations. It is pointed out that Hebbian learning
applied to a single linear unit tends to maximize the unit's output
variance. Finally, simple, single-layer networks of multiple
interconnected units are considered in the context of competitive
learning, learning vector quantization, principal component analysis,
and self-organizing feature maps. Simulations are also included
that illustrate the powerful emerging computational
properties of these simple networks and their applications. It
is demonstrated that local interactions in a competitive net can
lead to global order. A case in point is the SOFM, where simple
incremental interactions among locally neighboring units lead
to a global map that preserves the topology and density of the
input data.
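For instance, Oja's stabilized Hebbian rule, one of the stable Hebbian-type rules discussed in the chapter, can be sketched as follows. The data distribution, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean 2-D data whose variance is largest along the first axis,
# so the principal eigenvector of C = <x x^T> is close to [1, 0].
X = rng.normal(0.0, [1.0, 0.3], size=(5000, 2))

# Oja's rule: dw = rho * y * (x - y * w).  The -y^2 w term keeps |w|
# bounded; in the mean, w converges to the unit-length principal
# eigenvector of C, so the unit's output variance <y^2> is maximized.
w = np.array([0.5, 0.5])
rho = 0.01
for x in X:
    y = np.dot(w, x)
    w += rho * y * (x - y * w)
```

A plain Hebbian update (Δw = ρyx) on the same data would grow ‖w‖ without bound; Oja's correction term is what stabilizes the rule at unit length.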
Table 3.1 summarizes all learning rules considered
in this chapter. It gives a quick reference to the learning equations
and their associated criterion functions, appropriate parameter
initializations, type of unit activation function employed, and
some remarks on convergence behavior and the nature of the obtained
solution.

 
[Table 3.1 is not reproduced here; its remarks column includes the following observations. The perceptron-type rules converge in finite time when the training set is linearly separable, with some variants keeping w bounded for arbitrary training sets and others placing w in a region that tends to minimize the probability of error in nonlinearly separable cases. The LMS-family rules converge in the mean square to the minimum-SSE (LMS) solution; the α-LMS rule converges to the minimum-SSE solution when the xk's are mutually orthonormal. The delta rule extends LMS to units with differentiable nonlinear activations, and the relative entropy criterion eliminates the flat spot suffered by the delta rule while still converging to a linearly separable solution if one exists. For the Minkowski-r criterion, 1 < r < 2 suits a pseudo-Gaussian distribution p(x) with pronounced tails, r = 2 gives the delta rule, and r = 1 arises when p(x) is a Laplace distribution. The AHK I rule (whose margins bi can only increase from their initial values) and the AHK II rule (whose margins bi can take any positive value) converge to robust solutions for linearly separable problems; the AHK III rule converges for linearly separable as well as nonlinearly separable cases, automatically identifying and discarding the critical points affecting the nonlinear separability and yielding a solution that tends to minimize misclassifications. The stochastic unit's average performance is equivalent to that of the delta rule applied to a unit with deterministic activation, and under the reinforcement rule wk evolves so as to maximize the average reinforcement signal <r>. The Hebbian-type rules tend to maximize <y2>: the plain Hebbian rule has no exact criterion J(w) but tends to maximize <y2> subject to the constraint ||w|| = 1, while Oja's rule, the Yuille et al. rule, and related rules converge in the mean to a weight vector w* in the direction of c(1), the normalized principal eigenvector of the autocorrelation matrix C. Competitive learning converges to a local minimum of J representing some clustering configuration. Finally, the SOFM weight vectors evolve to a solution that tends to preserve the topology of the input space, with a local point density of the form g(p(x)), where g is a continuous, monotonically increasing function and p(x) is the stationary probability density governing xk. The criterion functions associated with several of these rules are discussed in Chapter 4.]
a. Show that if xk is misclassified by a perceptron with weight vector wk, then
where w* is a solution which separates all patterns xk correctly.
c. Show that if xk is misclassified, then the distance between wk+1 and w* is minimized by setting the learning rate ρ to its optimal value ρopt.
d. Show that the choice ρ = ρopt guarantees the convergence of the perceptron learning rule in a finite number of steps. In other words, show that if the choice ρ = ρopt is made, then wk+1 in Equation (1) stops changing after a finite number of corrections.
e. Show that the use of ρ = ρopt in Equation (3.1.2) leads to the learning rule
(Note that this learning rule is impractical, since it requires a solution, w*, to be known!)
3.1.3 Show that the perceptron
criterion function J in Equation (3.1.20) is piecewise
linear. Next, consider the four training pairs {−4, −1},
{−2, −1}, {−1, +1}, and {+1, +1}. Note that these pairs take the
form {x, d}, where x, d ∈ R. The following is
a guided exploration of the properties of the function J
for this linearly separable training set. This exploration should
give some additional insight into the convergence process of
the perceptron learning rule.
a. Plot the criterion function J for a two-input perceptron with weights w1 and w2, over a suitable range of w1 and w2 values. Here, w2 is the weight associated with the bias input (assume a bias of +1).
b. Identify the solution region in the weight space containing all linearly separable solutions w*. What is the value of J in this region?
c. Based on the above results, describe the evolution process of the weight vector in Equation (3.1.23), starting from an arbitrary initial weight vector.
d. For comparative purposes, plot the quadratic criterion
function in Equation (3.1.25) with b = 0. Comment
on the differentiability of this function.
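To see parts a and b concretely, one can tabulate J on a grid of weights. The sketch below uses the perceptron criterion in the standard form J(w) = Σ(−wᵀz) over misclassified z = dᵏ·[x, 1]ᵀ; the grid range is an arbitrary choice made here for illustration.

```python
import numpy as np

# Training pairs {x, d}; each input is augmented with a +1 bias input.
pairs = [(-4.0, -1.0), (-2.0, -1.0), (-1.0, +1.0), (+1.0, +1.0)]
Z = np.array([[d * x, d * 1.0] for x, d in pairs])  # z = d * [x, 1]

def perceptron_J(w):
    """Perceptron criterion: sum of -w^T z over misclassified patterns
    (those with w^T z <= 0).  Piecewise linear, and identically zero
    over the solution region."""
    margins = Z @ w
    return float(np.sum(np.where(margins <= 0.0, -margins, 0.0)))

# Tabulate J over a (w1, w2) grid; w2 is the bias weight.
w1s = np.linspace(-1.0, 1.0, 101)
w2s = np.linspace(-1.0, 1.0, 101)
J = np.array([[perceptron_J(np.array([a, b])) for b in w2s] for a in w1s])
```

The region where J = 0 is exactly the set of separating weight vectors; gradient descent drives w into that region, which is the behavior part c asks about.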
3.1.4 Show that the α-LMS rule in Equation (3.1.30) can be obtained from an incremental gradient descent on the criterion function
† 3.1.5
For the simple two-dimensional, linearly separable two-class problem
in Figure P3.1.5, compute and plot the dynamic margins (bi)
vs. learning cycles using the AHK I learning rule with the following
initial values and parameters: w1 = 0, ρ1 = 0.1, ρ2 = 0.05, and
initial margins of 0.1. Repeat using the AHK II rule.
Compare the convergence speed and the quality of the solution of
these two rules.
3.1.6 Draw the two decision
surfaces for Problem 3.1.5 and compare them to the decision surfaces
generated by the perceptron rule, the μ-LMS
learning rule (use μ = 0.05), and Butz's rule (use ρ = 0.1 and
a suitable reinforcement factor). Assume w1 = 0
for these rules.
† 3.1.7 Repeat Problem
3.1.5 for the two-class problem in Figure P3.1.7, using the AHK
III rule.
3.1.8 Draw the decision
surface for Problem 3.1.7 and compare it to the decision surfaces
generated by the perceptron rule, the μ-LMS
learning rule (use μ = 0.05), and Butz's rule (use ρ = 0.1 and
a suitable reinforcement factor). Assume w1 = 0
for these rules.
3.1.9 Check the "well-formedness" of the criterion functions of the following learning rules.
a. Mays' rule.
b. μ-LMS rule.
c. AHK I rule.
3.1.10 a. Is there a value of r for which the Minkowski-r criterion function is well-formed?
b. Is the relative entropy criterion function well-formed?
3.1.11 Consider the training
set {xk, dk},
k = 1, 2, ..., m, with
. Define a stability measure for pattern
k as
where w is a perceptron weight vector which
is constrained to have a constant length .
Employing statistical mechanics arguments, it is possible to show (Gordon et al., 1993) that the mean number of errors made by the perceptron on the training set at "temperature" T (T is a monotonically decreasing positive function of training time) is:
where b is an imposed
stability factor specified by the user.
Employing gradient descent on J(w), show that an appropriate update rule for the ith weight is given by:
Give a qualitative analysis of the behavior of this
learning rule and compare it to the correlation and AHK III rules.
Explain the effects of b and T on the placement
of the separating hyperplane. How would this rule behave for
nonlinearly separable training sets?
3.1.12 Consider the learning procedure (Polyak, 1990):
where and .
Discuss qualitatively the difference between this learning procedure
and the LMS learning rule (Equation 3.1.35).
3.1.13 Consider the Minkowski-r
criterion function in Equation (3.1.68). If no prior knowledge
is available about the distribution of the training data, then
extensive experimentation with various r values must be
done in order to estimate an appropriate value for r.
Alternatively, an automatic method for estimating r is
possible by adaptively updating r in the direction of decreasing
E. Employ steepest gradient descent on E(r)
in order to derive an update rule for r.
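One way to set up this derivative, assuming the Minkowski-r criterion takes the common form E(r) = (1/r)Σᵢ|eᵢ|ʳ with error eᵢ = dᵢ − yᵢ (check this against Equation (3.1.68) before using it):

```latex
E(r) = \frac{1}{r}\sum_i |e_i|^r ,
\qquad
\frac{\partial E}{\partial r}
  = \sum_i \left( \frac{1}{r}\,|e_i|^r \ln|e_i|
                - \frac{1}{r^2}\,|e_i|^r \right),
\qquad
r^{k+1} = r^k - \eta \left.\frac{\partial E}{\partial r}\right|_{r = r^k}
```

with η a small positive rate; each term follows from differentiating (1/r)|eᵢ|ʳ with respect to r by the product rule.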
† 3.2.1 Use the AR-P
rule in Equation (3.2.1), with λ = 1, to find the stochastic
unit separating surface wTx = 0
arrived at after 10, 50, 100, and 200 cycles through the training
set in Figure P3.2.1. Start with the given initial weight vector, and
assume x3 = +1.
Use reinforcement rk
equal to +1 if xk
is correctly classified, and −1
otherwise. Assume ρ+ = 0.1
and ρ− = 0.01. Repeat by subjectively assigning a +1
or a −1
to rk based
on whether the movement of the separating surface after each presentation
is good or bad, respectively. Use ρ+ = 0.6
and ρ− = 0.06, and plot the generated separating surface
after the first twenty presentations.
† 3.3.1 Consider
a data set of two-dimensional vectors x generated as follows:
x1 and x2
are distributed randomly and independently according to the normal
distributions N(0, 0.1) and N(0, 0.01), respectively.
Generate and plot 1000 data points x in the input plane.
Compute and plot (in the input plane) the solution trajectories
wk generated
by the normalized Hebbian rule, Oja's rule, and the Yuille et al.
rule upon training a single linear unit with weight vector w
on samples x drawn randomly and independently from the
above distribution. In your simulations, assume ρ = 0.01
and stop training after 1500 iterations. Also, assume w1 = [1.5, 1]T
in all simulations. Verify graphically that, for large k,
the average direction of wk
is that of the maximum data variance. Study, via simulation,
the effects of larger learning rates (e.g., ρ = 0.05,
0.1) and various initial weight vectors on the wk
trajectories.
3.3.2 Show that the average
Oja rule has as its equilibria the eigenvectors of C.
Start by showing that the given expression is the average
of Oja's rule in Equation (3.3.6).
* 3.3.3 Starting
from , show that
for Oja's rule. Evaluate the Hessian matrix
at the equilibria found in Problem 3.3.2. Study the stability
of these equilibria states in terms of the eigenvalues of the
Hessian matrix.
3.3.4 Show that the average
weight change, , in the Yuille et al. rule
is given by steepest gradient descent on the criterion function
3.3.5 Consider Sanger's
rule (Equation 3.3.19) with m = 2 units. Assume
that the first unit has already converged to c(1),
the principal eigenvector of C. Show that the second unit's
update equation for the average weight vector change is given
by:
and that c(1)
is not an equilibrium point.
* 3.3.6 Show that the
average weight change of the ith
unit for Sanger's rule is given by (Hertz et al., 1991)
Now assume that the weight vectors for the first
i − 1
units have already converged to their appropriate eigenvectors,
so wk = c(k)
for k < i. Show that the first two
terms in give the projection of
onto the space orthogonal to the first i − 1
eigenvectors of C. Employ these results to show that wi
converges to c(i).
3.3.7 Approximate the
output of a sigmoid unit having the transfer characteristic y = tanh(wTx)
by the first four terms in a Taylor series expansion. Show that
this unit approximates a third-order unit if it is operated near
wTx = 0.
Comment on the effects of the saturation region on the quality
of the higher-order principal component analysis realized when
the unit is trained using Hebbian learning.
3.4.1 Let xk ∈ {0, 1}n and the initial weights , for all i, in a single-layer competitive network. Assume the following learning rule for the ith unit
Show
that this rule preserves for all k.
† 3.4.2
Consider the 200-sample two-dimensional data set {xk}
generated randomly and independently as follows: 50 samples generated
according to the normal distribution N([−5, +5]T,
1), 75 samples generated according to N([+5, +5]T,
2), and 75 samples generated according to a uniform distribution
over a region with −7.5 ≤ x2 ≤ −2.5.
Use a five-unit competitive net in order to discover the underlying
clusters in this data set, as in Example 3.4.1 [use Equations
(3.4.5) and (3.4.2)]. Assume a learning rate ρ = 0.01,
and start with random weight vectors distributed uniformly in
the region −10 ≤ w1 ≤ 10
and −10 ≤ w2 ≤ 10.
Stop training after 2000 iterations (steps); at each iteration,
the vector xk
is to be chosen from the data set at random. Depict the evolution
of the weight vectors of all five units as in Figure 3.4.5. Finally,
plot the clusters discovered by the net (as in Figure 3.4.6) and
compare this solution to the real solution.
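A minimal winner-take-all step of the kind this problem calls for can be sketched as follows. Euclidean distance is used for the similarity match, and the function names, fixed rate ρ, and toy data are illustrative choices, not the book's Equations (3.4.5) and (3.4.2).

```python
import numpy as np

def competitive_step(W, x, rho):
    """Hard competitive learning: the unit whose weight vector is
    nearest to x wins, and only the winner moves toward x."""
    i = int(np.argmin(np.sum((W - x) ** 2, axis=1)))
    W[i] += rho * (x - W[i])
    return i

# Toy run: two Gaussian clusters, three units started at random positions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-5.0, 5.0], 1.0, size=(50, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(75, 2))])
W = rng.uniform(-10.0, 10.0, size=(3, 2))
for _ in range(2000):
    competitive_step(W, X[rng.integers(len(X))], rho=0.05)
```

Each winning weight vector maintains an exponentially weighted average of the inputs it wins, so stable weight vectors settle near cluster centroids.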
† 3.4.3 Repeat Problem
3.4.2 using the similarity measure in Equation (3.4.3), with no
weight vector normalization, for determining the winning unit.
Make sure that you use the same initial weight vectors as in
the previous problem. Discuss any differences in the number and/or
shape of the generated clusters, compared to the solution in Problem
3.4.2.
† 3.4.4 Consider a Voronoi quantizer with the following 10 reconstruction vectors: [0, 0]T, [1, 0]T, [−1, 0]T, [0, 1]T, [0, −1]T, [3, 3]T, [4, 3]T, [2, 3]T, [3, 2]T, and [3, 4]T.
a. Draw the input space partitions (Voronoi tessellation) realized by the above quantizer.
b. Starting with the Voronoi quantizer of part a, use the LVQ method described in Section 3.4.2 in order to design a two-class classifier for data generated randomly according to the probability density functions p1(x) = N([0, 0]T, 1) and p2(x) = N([3, 3]T, 2) for classes 1 and 2, respectively. Assume equal a priori probabilities for the two classes. During training, use the adaptive learning rates in Equation (3.4.8), suitably initialized. Stop training after 1000 iterations.
c. Draw the Voronoi tessellations realized by the weight vectors (reconstruction vectors) which resulted from the LVQ training in part b. Compare these tessellations to the ones drawn in part a.
d. Draw the decision boundary for the classifier
in part b.
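The supervised refinement step used in part b can be sketched as an LVQ1-style update, shown here with a fixed rate ρ in place of the adaptive learning rates of Equation (3.4.8); the function and variable names are illustrative.

```python
import numpy as np

def lvq1_step(W, labels, x, c, rho):
    """LVQ1: find the reconstruction vector nearest to x; move it
    toward x if its class label matches the input's class c, and
    away from x otherwise."""
    i = int(np.argmin(np.sum((W - x) ** 2, axis=1)))
    direction = 1.0 if labels[i] == c else -1.0
    W[i] += direction * rho * (x - W[i])
    return i
```

Repeated over labeled samples, the reconstruction vectors shift so that the induced Voronoi tessellation approximates the boundary between the two classes.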
* 3.5.1 Consider the following
criterion function, suitable for solving the TSP in an elastic
net architecture (Durbin and Willshaw, 1987):
where m is the number of units in the net
and > 0. Here, xk
represents the position of city k and wi
represents the ith stop.
a. Show that gradient descent on J leads to
the update rule:
where
b. Give qualitative interpretations for the various
terms in J and wi.
c. Show that J is bounded below, and that
in the limit as the scale parameter tends to 0 and m → ∞,
J is minimized by the shortest possible tour.
† 3.5.2 Consider
the L-shaped region shown in Figure P3.5.2. Train a two-dimensional
Kohonen feature map on points generated randomly and uniformly
from this region. Assume a 15 × 15 array of units and the
following learning parameters: kmax = 80,000,
ρ0 = 0.8, ρf = 0.01, and σ0 and σf as given. Choose all
initial weights randomly inside the region. Display the trained
net as in Figure 3.5.1 at times 0, 1000, 20,000, 40,000, and 80,000.
Figure P3.5.2. L-shaped region representing a uniform
distribution of inputs for the feature maps of Problems 3.5.2 and
3.5.3.
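A compact sketch of the training loop for such a map is given below, here on a uniform unit square rather than the L-shaped region, with exponentially decaying ρ and Gaussian neighborhood width σ. The specific σ0, σf values and the decay schedule are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sofm_train(X, grid=15, k_max=20000, rho0=0.8, rhof=0.01,
               sigma0=7.0, sigmaf=1.0):
    """Kohonen SOFM on a grid x grid array of units.  The learning
    rate rho and neighborhood width sigma decay exponentially from
    their initial to final values over k_max training steps."""
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)],
                      dtype=float)
    W = rng.uniform(X.min(), X.max(), size=(grid * grid, X.shape[1]))
    for k in range(k_max):
        t = k / k_max
        rho = rho0 * (rhof / rho0) ** t
        sigma = sigma0 * (sigmaf / sigma0) ** t
        x = X[rng.integers(len(X))]
        win = np.argmin(np.sum((W - x) ** 2, axis=1))   # best-matching unit
        d2 = np.sum((coords - coords[win]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian neighborhood
        W += rho * h[:, None] * (x - W)                 # all units move a bit
    return W

X = rng.uniform(0.0, 1.0, size=(2000, 2))
W = sofm_train(X)
```

Because neighboring units (in grid coordinates) are pulled toward the same inputs, the trained weight lattice unfolds into a topology-preserving cover of the input region.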
† 3.5.3 Repeat Problem
3.5.2 (but with σ0 = 20)
for a one-dimensional chain consisting of 60 units, initialized
randomly inside the L-shaped region. Display the trained map
as in Figure 3.5.5 at various training times.
† 3.5.4 The self-organizing
map simulation in Figure 3.5.5 employs ρ0 = 0.8,
ρf = 0.01, σ0 and σf as given, and kmax = 70,000.
Repeat the simulation and plot the map (chain) at iterations
50,000 and 70,000. (Note: the chain configuration in Figure 3.5.5(d)
is not a stable configuration for the above parameters.)
Repeat the simulation with the indicated parameter change and compare
the resulting chain configuration at 70,000 iterations to that
in the previous simulation. Discuss your results.
† 3.5.5 Repeat the
SOFM simulation in Figure 3.5.1 (see Section 3.5.3 for details)
assuming p(x) = N([0, 0]T, 0.1),
−1 ≤ x1 ≤ 1,
and −1 ≤ x2 ≤ 1.
Use the learning parameters given in the caption of Figure 3.5.1.
What is the general shape of the point distribution p(w)
of the weight vectors? Estimate the variance of p(w).
Is p(w) proportional to p(x)?