3.6 Summary

This chapter describes a number of basic learning rules for supervised, reinforcement, and unsupervised learning, and presents a unifying view of these rules in the single-unit setting: the learning process is viewed as a steepest gradient-based search for a set of weights that optimizes an associated criterion function. This view holds for all of the learning rules, regardless of their usual classification as supervised, reinforcement, or unsupervised. Preliminary characterizations of some learning rules are given, with the in-depth mathematical analysis deferred to the next chapter.
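To make this gradient-based view concrete, the following minimal Python sketch (an illustration, not a rule taken verbatim from the chapter) casts one learning step of a single linear unit as steepest descent on a criterion function; the SSE-style criterion, the toy data, and the learning rate are assumptions chosen for demonstration only.

import numpy as np

def grad_J(w, x, d):
    # Gradient of the illustrative per-pattern criterion J = 0.5*(d - y)**2
    # for a single linear unit with output y = w @ x.
    y = w @ x
    return -(d - y) * x

rho = 0.1                                  # learning rate
w = np.zeros(3)                            # two inputs plus a constant bias input
training_set = [(np.array([1.0, 2.0, 1.0]), 1.0),
                (np.array([-1.0, 0.5, 1.0]), -1.0)]
for _ in range(50):                        # cycles through the training set
    for x, d in training_set:
        w = w - rho * grad_J(w, x, d)      # steepest descent: w <- w - rho * dJ/dw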

Various forms of criterion functions are discussed, including the perceptron, SSE, MSE, Minkowski-r, and relative entropy criterion functions. The question of which learning rules are capable of finding linearly separable solutions led us to identify the class of well-formed criterion functions. When such a function is optimized by an iterative search process, the search is guaranteed to enter the region of weight space corresponding to linearly separable solutions.
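The perceptron criterion gives a convenient sketch of this connection: incremental gradient descent on the per-pattern criterion J(w) = max(0, −d w^T x) reproduces the perceptron rule. The code below assumes a small linearly separable toy set and is illustrative only.

import numpy as np

def perceptron_step(w, x, d, rho=0.1):
    # One incremental gradient step on J(w) = max(0, -d * (w @ x)).
    # The gradient is -d*x for a misclassified pattern and 0 otherwise,
    # so only misclassified patterns move the weight vector.
    if d * (w @ x) <= 0:
        w = w + rho * d * x
    return w

# Toy linearly separable set; the last component is a constant +1 bias input.
data = [(np.array([2.0, 1.0, 1.0]), 1.0),
        (np.array([-1.0, -2.0, 1.0]), -1.0)]
w = np.zeros(3)
for _ in range(20):
    for x, d in data:
        w = perceptron_step(w, x, d)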

Steepest gradient-based learning is then extended to the stochastic unit, which serves as the foundation for reinforcement learning. Reinforcement learning is viewed as a stochastic search process that attempts to maximize the average reinforcement signal.
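A sketch of this idea is given below: a stochastic unit emits y = ±1 with a probability that depends on its activation, and a reward-modulated update nudges the weights so that rewarded outputs become more likely. The output probability P(net) = 1/(1 + e^(−2β net)), the update form, and the two learning rates are illustrative assumptions in the spirit of the associative reward-penalty rule, not Equation (3.2.1) itself.

import numpy as np

rng = np.random.default_rng(0)

def stochastic_output(w, x, beta=1.0):
    # Stochastic unit: y = +1 with probability P(net), y = -1 otherwise.
    p = 1.0 / (1.0 + np.exp(-2.0 * beta * (w @ x)))  # assumed form of P(net)
    y = 1.0 if rng.random() < p else -1.0
    return y, 2.0 * p - 1.0                          # emitted output and its mean <y>

def reward_penalty_step(w, x, y, mean_y, r, rho_plus=0.1, rho_minus=0.01):
    # Reward (r = +1): move the mean output toward the emitted output.
    # Penalty (r = -1): move it away, using a much smaller learning rate.
    rho = rho_plus if r > 0 else rho_minus
    return w + rho * r * (y - mean_y) * x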

Unsupervised learning is treated in the second half of the chapter. Here, Hebbian learning is discussed, and examples of stable Hebbian-type learning rules are presented along with illustrative simulations. It is pointed out that Hebbian learning applied to a single linear unit tends to maximize the unit's output variance. Finally, simple single-layer networks of multiple interconnected units are considered in the context of competitive learning, learning vector quantization, principal component analysis, and self-organizing feature maps. Simulations are included to illustrate the powerful emergent computational properties of these simple networks and their applications. It is demonstrated that local interactions in a competitive net can lead to global order. A case in point is the SOFM, where simple incremental interactions among locally neighboring units lead to a global map that preserves the topology and density of the input data.
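As a small illustration of the variance-maximizing behavior of stable Hebbian-type rules, the sketch below trains a single linear unit with Oja's rule on synthetic anisotropic data; the data distribution, learning rate, and iteration count are assumptions chosen for demonstration.

import numpy as np

rng = np.random.default_rng(1)
# Zero-mean data whose variance is largest along the first coordinate axis.
X = rng.normal(0.0, [1.0, 0.2], size=(5000, 2))

w = rng.normal(size=2)
rho = 0.01
for x in X:
    y = w @ x                        # linear unit output
    w += rho * y * (x - y * w)       # Oja's rule: Hebbian term plus a decay term

# w tends to a unit-norm vector along the direction of maximum data variance,
# i.e., the principal eigenvector of the data autocorrelation matrix.
print(w, np.linalg.norm(w))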

Table 3.1 summarizes all learning rules considered in this chapter. It gives a quick reference to the learning equations and their associated criterion functions, appropriate parameter initializations, type of unit activation function employed, and some remarks on convergence behavior and the nature of the obtained solution.

Table 3.1. Summary of Basic Learning Rules

Notation: the general form of a learning equation is w^(k+1) = w^k + ρ_k s^k, where ρ_k is the learning rate and s^k is the learning vector; z^k denotes the kth training pattern multiplied by its desired target, so that correct classification corresponds to (w^k)^T z^k > 0; <.> denotes the mean operator; c^(i) is the ith normalized eigenvector of the autocorrelation matrix C, with corresponding eigenvalue λ_i (the eigenvalues are ordered so that λ_1 is the largest). For the rules marked with an asterisk (*), the associated criterion functions are discussed in Chapter 4.

Perceptron rule (supervised). w^(k+1) = w^k + ρ z^k if (w^k)^T z^k ≤ 0, with ρ > 0; no update otherwise. Remarks: finite convergence time if the training set is linearly separable; ||w|| stays bounded for arbitrary training sets.

Perceptron rule with variable learning rate and fixed margin (supervised). w^(k+1) = w^k + ρ_k z^k if (w^k)^T z^k ≤ b, with margin b > 0 and ρ_k satisfying: (1) ρ_k ≥ 0; (2) the sum of the ρ_k diverges; (3) lim (Σ_{k=1..m} ρ_k^2) / (Σ_{k=1..m} ρ_k)^2 = 0 as m → ∞. Remarks: converges to z^T w > b if the training set is linearly separable; finite convergence if ρ_k = ρ > 0, where ρ is a finite positive constant.

Mays' rule (supervised). w^(k+1) = w^k + ρ [(b − (w^k)^T z^k) / ||z^k||^2] z^k if (w^k)^T z^k ≤ b, with b > 0 and 0 < ρ < 2. Remarks: finite convergence to the solution z^T w > b if the training set is linearly separable.

Butz's rule (supervised). As the perceptron rule, but with a small reinforcement ρ γ z^k (0 ≤ γ < 1) also applied when z^k is correctly classified; ρ > 0. Remarks: finite convergence if the training set is linearly separable; places w in a region that tends to minimize the probability of error for nonlinearly separable cases.

Widrow-Hoff rule (α-LMS) (supervised). w^(k+1) = w^k + α (d^k − y^k) x^k / ||x^k||^2 for a linear unit y = w^T x, with 0 < α < 2. Remarks: converges in the mean square to the minimum SSE (LMS) solution if ||x_i|| = ||x_j|| for all i, j.

μ-LMS rule (supervised). w^(k+1) = w^k + μ (d^k − y^k) x^k for a linear unit y = w^T x, with μ > 0 sufficiently small. Remarks: converges in the mean square to the minimum SSE (LMS) solution.

Stochastic μ-LMS rule (supervised). w^(k+1) = w^k + ρ_k (d^k − y^k) x^k, where at each learning step the training vector x^k is drawn at random and ρ_k satisfies: (1) ρ_k > 0; (2) Σ ρ_k = ∞; (3) Σ ρ_k^2 < ∞. Remarks: converges in the mean square to the minimum SSE (LMS) solution.

Correlation rule (supervised). w^(k+1) = w^k + d^k x^k. Remarks: converges to the minimum SSE solution if the x^k are mutually orthonormal.

Delta rule (supervised). w^(k+1) = w^k + ρ (d^k − y^k) f'(net^k) x^k, where y = f(net) and f is a sigmoid function. Remarks: extends the μ-LMS rule to cases with differentiable nonlinear activations.

Minkowski-r delta rule (supervised). w^(k+1) = w^k + ρ |d^k − y^k|^(r−1) sgn(d^k − y^k) f'(net^k) x^k, where y = f(net) and f is a sigmoid function. Remarks: 1 < r < 2 is appropriate for a pseudo-Gaussian distribution p(x) with pronounced tails; r = 2 gives the delta rule; r = 1 arises when p(x) is a Laplace distribution.

Relative entropy delta rule (supervised). w^(k+1) = w^k + ρ (d^k − y^k) x^k, where y = tanh(β net). Remarks: eliminates the flat spot suffered by the delta rule; converges to a linearly separable solution if one exists.

AHK I (supervised). SSE criterion with margin vector: adapts both the weight vector w and a vector of per-pattern margins b. Conditions: the margins b_i can only increase from their initial values. Remarks: converges to a robust solution for linearly separable problems.

AHK II (supervised). SSE criterion with margin vector: adapts both w and the margin vector b, where the b_i can take any positive value. Remarks: converges to a robust solution for linearly separable problems.

AHK III (supervised). SSE criterion with margin vector: adapts both w and the margin vector b. Remarks: converges for linearly separable as well as nonlinearly separable cases; automatically identifies and discards the critical points affecting nonlinear separability, and results in a solution that tends to minimize misclassifications.

Delta rule with stochastic neuron (supervised). Stochastic activation: y = +1 with probability P(net) = 1 / (1 + e^(−2β net)), and y = −1 otherwise. Remarks: its performance in the average is equivalent to the delta rule applied to a unit with the deterministic activation y = tanh(β net).

Simple associative reward/penalty rule (reinforcement). Stochastic activation (as above), with separate learning rates ρ+ (reward) and ρ− (penalty), ρ+ > ρ− > 0. Remarks: w^k evolves so as to maximize the average reinforcement signal <r>.

Hebbian rule (unsupervised). w^(k+1) = w^k + ρ y^k x^k for a linear unit y = w^T x. Remarks: ||w^k|| grows without bound as k → ∞, with w^k aligning with the direction of c^(1).

Oja's rule (unsupervised). w^(k+1) = w^k + ρ y^k (x^k − y^k w^k). Does not have an exact criterion function J(w); however, the rule tends to maximize <y^2> subject to the constraint ||w|| = 1. Remarks: converges in the mean to w* = ±c^(1), which maximizes <y^2>.

Yuille et al. rule (unsupervised). w^(k+1) = w^k + ρ (y^k x^k − ||w^k||^2 w^k). Remarks: converges in the mean to w* = ±sqrt(λ_1) c^(1); w* points in the direction of c^(1) and maximizes <y^2>.

Linsker's rule (unsupervised). A Hebbian-type rule with the weights confined to the range w− ≤ w_i ≤ w+; ρ > 0. Remarks: maximizes <y^2> subject to the bounds on the weights; converges to a w* whose components are clamped at w− or w+.

Hassoun's rule (unsupervised).* A Hebbian-type rule employing weight vector normalization. Remarks: converges in the mean to a w* parallel to c^(1); in the appropriate parameter limit, w* approaches c^(1) (see Section 4.3.5 for details).

Standard competitive learning rule (unsupervised).* Δw_{i*}^k = ρ_k (x^k − w_{i*}^k), where i* is the index of the winning unit (the unit whose weight vector is most similar to x^k); only the winner is updated. ρ_k satisfies: (1) ρ_k > 0; (2) Σ ρ_k = ∞; (3) Σ ρ_k^2 < ∞. Remarks: converges to a local minimum of J representing some clustering configuration.

Kohonen's feature map rule (unsupervised).* Δw_i^k = ρ_k Λ_k(i, i*) (x^k − w_i^k), where i* is the index of the winning unit and both the learning rate ρ_k and the width of the neighborhood function Λ_k decay as k increases. Remarks: the weight vectors evolve to a solution which tends to preserve the topology of the input space; the local point density of this solution is of the form g(p(x)), where g is a continuous, monotonically increasing function and p(x) is the stationary probability density function governing x^k.
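To connect Table 3.1 to working code, the sketch below implements the standard competitive learning entry: the winning unit, selected by minimum Euclidean distance, is the only one moved toward the input. The two-cluster data, the number of units, and the constant learning rate (standing in for the table's decaying ρ_k schedule) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
# Two illustrative Gaussian clusters in the plane.
X = np.vstack([rng.normal([-5.0, 5.0], 1.0, size=(100, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(100, 2))])

W = rng.uniform(-10.0, 10.0, size=(3, 2))   # one weight vector per unit
rho = 0.05                                  # constant rate, for simplicity
for _ in range(20):                         # passes through the data
    for x in rng.permutation(X):
        i_star = np.argmin(np.linalg.norm(W - x, axis=1))  # winning unit
        W[i_star] += rho * (x - W[i_star])  # only the winner moves toward x

print(W)  # winners settle near the cluster centers (some units may stay unused)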

Problems:

3.1.1 Show that the choice of ρ made in the convergence proof for the perceptron learning rule minimizes the maximum number of corrections k_0 if w^1 = 0.

* 3.1.2 This problem explores an alternative convergence proof (Kung, 1993) for the perceptron rule in Equation (3.1.2). Here, we follow the notation in Section 3.1.1.

a. Show that if x^k is misclassified by a perceptron with weight vector w^k, then

(1)

where w* is a solution which separates all patterns x^k correctly.

b. Show that if we restrict ρ to sufficiently small values, then w^(k+1) converges in a finite number of steps (recall that no weight update is needed if all x^k are classified correctly). Does w^(k+1) have to converge to the particular solution w*? Explain.

c. Show that if x^k is misclassified, then the right-hand side of Equation (1) is minimized by setting ρ to its optimal value ρ_opt


d. Show that the choice ρ = ρ_opt guarantees the convergence of the perceptron learning rule in a finite number of steps. In other words, show that if the choice ρ = ρ_opt is made, then w^(k+1) in Equation (1) stops changing after a finite number of corrections.

e. Show that the use of ρ = ρ_opt in Equation (3.1.2) leads to the learning rule


(Note that this learning rule is impractical, since it requires a solution, w*, to be known!)

3.1.3 Show that the perceptron criterion function J in Equation (3.1.20) is piecewise linear. Next, consider the four training pairs {−4, −1}, {−2, −1}, {−1, +1}, and {+1, +1}. Note that these pairs take the form {x, d}, where x ∈ R. The following is a guided exploration into the properties of the function J for this linearly separable training set. This exploration should give some additional insight into the convergence process of the perceptron learning rule.

a. Plot the criterion function J for a two-input perceptron with weights w1 and w2, over a suitable range of w1 and w2. Here, w2 is the weight associated with the bias input (assume a bias of +1).

b. Identify the solution region in the weight space containing all linearly separable solutions w*. What is the value of J in this region?

c. Based on the above results, describe the evolution process of the weight vector in Equation (3.1.23), starting from an arbitrary initial weight vector.

d. For comparative purposes, plot the quadratic criterion function in Equation (3.1.25) with b = 0. Comment on the differentiability of this function.

3.1.4 Show that the α-LMS rule in Equation (3.1.30) can be obtained from an incremental gradient descent on the criterion function


3.1.5 For the simple two-dimensional linearly separable two-class problem in Figure P3.1.5, compute and plot the dynamic margins (b_i) vs. learning cycles using the AHK I learning rule with the following initial values and parameters: w^1 = 0, ρ1 = 0.1, ρ2 = 0.05, and initial margins of 0.1. Repeat using the AHK II rule. Compare the convergence speed and quality of the solution of these two rules.


Figure P3.1.5. A simple two-class linearly separable pattern for Problem 3.1.5.

3.1.6 Draw the two decision surfaces for Problem 3.1.5 and compare them to the decision surfaces generated by the perceptron rule, the μ-LMS learning rule (use μ = 0.05), and Butz's rule (use ρ = 0.1 and a reinforcement factor γ). Assume w^1 = 0 for these rules.

3.1.7 Repeat Problem 3.1.5 for the two-class problem in Figure P3.1.7, using the AHK III rule.


Figure P3.1.7. A simple two-class nonlinearly separable training set for Problem 3.1.7.

3.1.8 Draw the decision surface for Problem 3.1.7 and compare it to the decision surfaces generated by the perceptron rule, the μ-LMS learning rule (use μ = 0.05), and Butz's rule (use ρ = 0.1 and a reinforcement factor γ). Assume w^1 = 0 for these rules.

3.1.9 Check the "well-formedness" of the criterion functions of the following learning rules.

a. Mays' rule.

b. μ-LMS rule.

c. AHK I rule.

3.1.10 a. Is there a value for r for which the Minkowski-r criterion function is well-formed?

b. Is the relative entropy criterion function well-formed?

3.1.11 Consider the training set {x^k, d^k}, k = 1, 2, ..., m. Define a stability measure for pattern k as


where w is a perceptron weight vector which is constrained to have a constant length.

Employing statistical mechanics arguments, it is possible to show (Gordon et al., 1993) that the mean number of errors made by the perceptron on the training set at "temperature" T (T is a monotonically decreasing positive function of training time) is:


where b is an imposed stability factor specified by the user.

Employing gradient descent on J(w), show that an appropriate update rule for the ith weight is given by:


Give a qualitative analysis of the behavior of this learning rule and compare it to the correlation and AHK III rules. Explain the effects of b and T on the placement of the separating hyperplane. How would this rule behave for nonlinearly separable training sets?

3.1.12 Consider the learning procedure (Polyak, 1990):

,

,

where the ρ_k are positive learning rates. Discuss qualitatively the difference between this learning procedure and the μ-LMS learning rule (Equation 3.1.35).

3.1.13 Consider the Minkowski-r criterion function in Equation (3.1.68). If no prior knowledge is available about the distribution of the training data, then extensive experimentation with various r values must be done in order to estimate an appropriate value for r. Alternatively, an automatic method for estimating r is possible by adaptively updating r in the direction of decreasing E. Employ steepest gradient descent on E(r) in order to derive an update rule for r.

3.2.1 Use the A_rp rule in Equation (3.2.1) to find the stochastic unit separating surface w^T x = 0 arrived at after 10, 50, 100, and 200 cycles through the training set in Figure P3.2.1. Start with an arbitrary initial weight vector w^1, and assume x3 = +1. Use a reinforcement signal r^k equal to +1 if x^k is correctly classified, and −1 otherwise. Assume ρ+ = 0.1 and ρ− = 0.01. Repeat by subjectively assigning a +1 or a −1 to r^k based on whether the movement of the separating surface after each presentation is good or bad, respectively. Use ρ+ = 0.6 and ρ− = 0.06, and plot the generated separating surface after the first twenty presentations.


Figure P3.2.1. Two-class linearly separable training set for Problem 3.2.1.

3.3.1 Consider a data set of two-dimensional vectors x generated as follows: x1 and x2 are distributed randomly and independently according to the normal distributions N(0, 0.1) and N(0, 0.01), respectively. Generate and plot 1000 data points x in the input plane. Compute and plot (in the input plane) the solution trajectories w^k generated by the normalized Hebbian rule, Oja's rule, and the Yuille et al. rule upon training a single linear unit with weight vector w on samples x drawn randomly and independently from the above distribution. In your simulations, assume ρ = 0.01 and stop training after 1500 iterations. Also, assume w^1 = [1.5, 1]^T in all simulations. Verify graphically that, for large k, the average direction of w^k is that of the maximum data variance. Study, via simulation, the effects of larger learning rates (e.g., ρ = 0.05, 0.1) and various initial weight vectors on the w^k trajectories.
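A possible simulation scaffold for this problem is sketched below; the three updates follow the commonly cited forms of the normalized Hebbian, Oja, and Yuille et al. rules (for the latter, Δw = ρ(y x − ||w||^2 w)), so verify them against the equations in Section 3.3, and add the plotting the problem asks for.

import numpy as np

rng = np.random.default_rng(3)

def sample_x():
    # x1 ~ N(0, 0.1) and x2 ~ N(0, 0.01); the second parameter of N(., .) is
    # read as a variance here, hence the square roots for the std deviations.
    return np.array([rng.normal(0.0, np.sqrt(0.1)),
                     rng.normal(0.0, np.sqrt(0.01))])

def hebb_normalized(w, x, rho):
    w = w + rho * (w @ x) * x
    return w / np.linalg.norm(w)

def oja(w, x, rho):
    y = w @ x
    return w + rho * y * (x - y * w)

def yuille(w, x, rho):
    y = w @ x
    return w + rho * (y * x - (w @ w) * w)

def train(update, rho=0.01, steps=1500):
    w = np.array([1.5, 1.0])               # w^1 as specified in the problem
    traj = [w.copy()]
    for _ in range(steps):
        w = update(w, sample_x(), rho)
        traj.append(w.copy())
    return np.array(traj)                  # trajectory of w^k, ready for plotting

for rule in (hebb_normalized, oja, yuille):
    print(rule.__name__, train(rule)[-1])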

3.3.2 Show that the average Oja rule has as its equilibria the eigenvectors of C. Start by showing that <Δw> = ρ[C w − (w^T C w) w] is the average of Oja's rule in Equation (3.3.6).

* 3.3.3 Starting from the average learning equation of Problem 3.3.2, derive the corresponding Hessian matrix for Oja's rule. Evaluate the Hessian matrix at the equilibria found in Problem 3.3.2, and study the stability of these equilibrium states in terms of the eigenvalues of the Hessian matrix.

3.3.4 Show that the average weight change, <Δw>, in the Yuille et al. rule is given by steepest gradient descent on the criterion function

J(w) = −(1/2) w^T C w + (1/4) ||w||^4
3.3.5 Consider Sanger's rule (Equation 3.3.19) with two units. Assume that the first unit has already converged to c^(1), the principal eigenvector of C. Show that the second unit's update equation for the average weight vector change is given by:

<Δw2> = ρ[C w2 − λ_1 (c^(1)T w2) c^(1) − (w2^T C w2) w2]

and that c^(1) is not an equilibrium point.

* 3.3.6 Show that the average weight change of the ith unit for Sanger's rule is given by (Hertz et al., 1991)

<Δw_i> = ρ[C w_i − Σ_{j<i} (w_j^T C w_i) w_j − (w_i^T C w_i) w_i]

Now assume that the weight vectors for the first i − 1 units have already converged to their appropriate eigenvectors, so w_j = c^(j) for j < i. Show that the first two terms in <Δw_i> give the projection of C w_i onto the space orthogonal to the first i − 1 eigenvectors of C. Employ these results to show that w_i converges to c^(i).

3.3.7 Approximate the output of a sigmoid unit having the transfer characteristic y = tanh(w^T x) by the first four terms in a Taylor series expansion. Show that this unit approximates a third-order unit if it is operated near w^T x = 0. Comment on the effects of the saturation region on the quality of the higher-order principal component analysis realized when the unit is trained using Hebbian learning.

3.4.1 Let x^k ∈ {0, 1}^n and the initial weights satisfy Σ_j w_ij^1 = 1, for all i, in a single-layer competitive network. Assume the following learning rule for the ith unit

Δw_ij^k = ρ [x_j^k / (Σ_l x_l^k) − w_ij^k]

Show that this rule preserves Σ_j w_ij^k = 1 for all k.

3.4.2 Consider the 200-sample two-dimensional data set {x^k} generated randomly and independently as follows: 50 samples generated according to the normal distribution N([−5, +5]^T, 1), 75 samples generated according to N([+5, +5]^T, 2), and 75 samples generated according to a uniform distribution over a rectangular region with −7.5 ≤ x2 ≤ −2.5. Use a five-unit competitive net in order to discover the underlying clusters in this data set, as in Example 3.4.1 [use Equations (3.4.5) and (3.4.2)]. Assume a learning rate ρ = 0.01, and start with random weight vectors distributed uniformly in the region |w1| ≤ 10 and |w2| ≤ 10. Stop training after 2000 iterations (steps); at each iteration, the vector x^k is to be chosen from the data set at random. Depict the evolution of the weight vectors of all five units as in Figure 3.4.5. Finally, plot the clusters discovered by the net (as in Figure 3.4.6) and compare this solution to the real solution.

3.4.3 Repeat Problem 3.4.2 using the similarity measure in Equation (3.4.3), with no weight vector normalization, for determining the winning unit. Make sure that you use the same initial weight vectors as in the previous problem. Discuss any differences in the number and/or shape of the generated clusters, compared to the solution in Problem 3.4.2.

3.4.4 Consider a Voronoi quantizer with the following 10 reconstruction vectors: [0, 0]T, [1, 0]T, [-1, 0]T, [0, -1]T, [0, 1]T, [3, 3]T, [4, 3]T, [2, 3]T, [3, 2]T, and [3, 4]T.

a. Draw the input space partitions (Voronoi tessellation) realized by the above quantizer.

b. Starting with the Voronoi quantizer of part a, use the LVQ method described in Section 3.4.2 in order to design a two-class classifier for data generated randomly according to the probability density functions p1(x) = N([0, 0]^T, 1) and p2(x) = N([3, 3]^T, 2) for classes 1 and 2, respectively. Assume equal a priori probability of the two classes. During training, use the adaptive learning rates in Equation (3.4.8), with appropriately chosen initial values. Stop training after 1000 iterations.

c. Draw the Voronoi tessellations realized by the weight vectors (reconstruction vectors) which resulted from the LVQ training in part b. Compare these tessellations to the ones drawn in part a.

d. Draw the decision boundary for the classifier in part b.

* 3.5.1 Consider the following criterion function, suitable for solving the TSP in an elastic net architecture (Durbin and Willshaw, 1987):


where m is the number of units in the net and K > 0. Here, x^k represents the position of city k and w_i represents the ith stop.

a. Show that gradient descent on J leads to the update rule:

where


b. Give qualitative interpretations for the various terms in J and Δw_i.

c. Show that J is bounded below, and that as K → 0 and m → ∞, J is minimized by the shortest possible tour.

3.5.2 Consider the L-shaped region shown in Figure P3.5.2. Train a two-dimensional Kohonen feature map on points generated randomly and uniformly from this region. Assume a 15 × 15 array of units and the following learning parameters: k_max = 80,000, ρ0 = 0.8, and ρf = 0.01, with appropriately chosen initial and final neighborhood widths σ0 and σf. Choose all initial weights randomly inside the region. Display the trained net as in Figure 3.5.1 at times 0, 1000, 20,000, 40,000, and 80,000. (A simulation scaffold is sketched below, following Figure P3.5.2.)


Figure P3.5.2. L-shaped region representing a uniform distribution of inputs for the feature map of Problems 3.5.2 and 3.5.3.
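A minimal scaffold for the feature-map training in Problem 3.5.2 is sketched below. Since the neighborhood-width schedule was not reproduced above, the exponential decay schedules and the endpoints σ0 = 8 and σf = 0.5 are assumptions, as is the concrete realization of the L-shaped region; adapt them to the parameters given in the text.

import numpy as np

rng = np.random.default_rng(4)

def sample_L():
    # Uniform sample from an L-shaped region: the unit square minus its
    # upper-right quadrant (an assumed stand-in for Figure P3.5.2).
    while True:
        x = rng.uniform(0.0, 1.0, size=2)
        if not (x[0] > 0.5 and x[1] > 0.5):
            return x

n = 15                                       # 15 x 15 array of units
W = np.array([sample_L() for _ in range(n * n)]).reshape(n, n, 2)
grid = np.stack(np.meshgrid(np.arange(n), np.arange(n), indexing="ij"), axis=-1)

k_max, rho0, rho_f = 80_000, 0.8, 0.01       # as given in the problem
sigma0, sigma_f = 8.0, 0.5                   # assumed neighborhood schedule
for k in range(k_max):
    t = k / k_max
    rho = rho0 * (rho_f / rho0) ** t         # exponentially decaying rate
    sigma = sigma0 * (sigma_f / sigma0) ** t # shrinking neighborhood width
    x = sample_L()
    d2 = np.sum((W - x) ** 2, axis=-1)
    i_star = np.unravel_index(np.argmin(d2), d2.shape)       # winning unit
    h = np.exp(-np.sum((grid - np.array(i_star)) ** 2, axis=-1)
               / (2.0 * sigma ** 2))         # neighborhood function
    W += rho * h[..., None] * (x - W)        # update all units, weighted by h

# Snapshot W at k = 0, 1000, 20,000, 40,000, and 80,000 and plot the net
# as in Figure 3.5.1.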

3.5.3 Repeat Problem 3.5.2 (but with σ0 = 20) for a one-dimensional chain consisting of 60 units initialized randomly inside the L-shaped region. Display the trained map as in Figure 3.5.5 at various training times.

3.5.4 The self-organizing map simulation in Figure 3.5.5 employs ρ0 = 0.8, ρf = 0.01, and k_max = 70,000. Repeat the simulation and plot the map (chain) at iterations 50,000 and 70,000. (Note: the chain configuration in Figure 3.5.5 (d) is not a stable configuration for the above parameters.) Repeat the simulation with a different choice of learning parameters and compare the resulting chain configuration at 70,000 iterations to that in the previous simulation. Discuss your results.

3.5.5 Repeat the SOFM simulation in Figure 3.5.1 (see Section 3.5.3 for details) assuming p(x) = N([0, 0]^T, 0.1), |x1| ≤ 1, and |x2| ≤ 1. Use the learning parameters given in the caption of Figure 3.5.1. What is the general shape of the point distribution p(w) of the weight vectors? Estimate the variance of p(w). Is p(w) proportional to p(x)?



