3.1 Supervised Learning in a Single Unit Setting

Supervised learning is treated first. Here, two groups of rules are discussed: Error correction rules and gradient descent-based rules. By the end of this section it will be established that all of these learning rules can be systematically derived as minimizers of an appropriate criterion function.

Error correction rules were initially proposed as ad hoc rules for single unit training. These rules essentially drive the output error of a given unit to zero. We start with the classical perceptron learning rule and give a proof for its convergence. Then, other error correction rules such as Mays' rule and the α-LMS rule are covered. Throughout this section, an attempt is made to point out criterion functions that are minimized by using each rule. We will also cast these learning rules as relaxation rules, thus unifying them with the other gradient-based search rules, such as the ones presented in Section 3.1.2.

Perceptron Learning Rule

Consider the following version of a linear threshold gate shown in Figure 3.1.1. We will refer to it as the perceptron. The perceptron maps an input vector x = [x1 x2 ... xn+1]^T to a bipolar binary output y, and thus it may be viewed as a simple two-class classifier. The input signal xn+1 is usually set to 1 and plays the role of a bias to the perceptron. We will denote by w the vector w = [w1 w2 ... wn+1]^T ∈ R^(n+1) consisting of the free parameters (weights) of the perceptron. The input/output relation for the perceptron is given by y = sgn(x^T w), where sgn is the "sign" function which returns +1 or −1 depending on whether the sign of its scalar argument is positive or negative, respectively.

Figure 3.1.1. The perceptron computational unit.

Assume we are training the above perceptron to load (learn) the training pairs {x^1, d^1}, {x^2, d^2}, ..., {x^m, d^m}, where x^k ∈ R^(n+1) is the kth input vector and d^k ∈ {−1, +1}, k = 1, 2, ..., m, is the desired target for the kth input vector (usually, the order of these training pairs is random). The entire collection of these pairs is called the training set. The goal, then, is to design a perceptron such that for each input vector x^k of the training set, the perceptron output y^k matches the desired target d^k; that is, we require y^k = sgn((x^k)^T w) = d^k, for each k = 1, 2, ..., m. In this case, we say that the perceptron correctly classifies the training set. Of course, "designing" an appropriate perceptron to correctly classify the training set amounts to determining a weight vector w* such that the following relations are satisfied:

sgn[(x^k)^T w*] = d^k    for k = 1, 2, ..., m        (3.1.1)

Recall that the set of all x which satisfy x^T w* = 0 defines a hyperplane in R^(n+1). Thus, in the context of the above discussion, finding a solution vector w* to Equation (3.1.1) is equivalent to finding a separating hyperplane which correctly classifies all vectors x^k, k = 1, 2, ..., m. In other words, we desire a hyperplane x^T w* = 0 which partitions the input space into two distinct regions, one containing all points x^k with d^k = +1 and the other region containing all points x^k with d^k = −1.

The perceptron learning rule gives a simple incremental procedure for finding such a weight vector: after the presentation of each input vector x^k, the weights are updated according to

w^(k+1) = w^k + ρ [d^k − sgn((x^k)^T w^k)] x^k    k = 1, 2, ...        (3.1.2)

where ρ is a positive constant, called the learning rate. The incremental learning process given in Equation (3.1.2) proceeds as follows. First, an initial weight vector w^1 is selected (usually at random) to begin the process. Then the m pairs {x^k, d^k} of the training set are used to successively update the weight vector until (hopefully) a solution w* is found which correctly classifies the training set. This process of sequentially presenting the training patterns is usually referred to as "cycling" through the training set, and a complete presentation of the m training pairs is referred to as a cycle (or pass) through the training set. In general, more than one cycle through the training set is required to determine an appropriate solution vector. Hence, in Equation (3.1.2), the superscript k in w^k refers to the iteration number. On the other hand, the superscript k in x^k (and d^k) is the label of the training pair presented at the kth iteration. To be more precise, if the number of training pairs, m, is finite, then the superscripts in x^k and d^k should be replaced by [(k − 1) mod m] + 1. Here, a mod b returns the remainder of the division of a by b (e.g., 5 mod 8 = 5, 8 mod 8 = 0, and 19 mod 8 = 3). This observation is valid for all incremental learning rules presented in this chapter.
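The cycling procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the text's own code: the AND-like toy training set, the helper names, and the choice of learning rate are all assumptions made for the example.

```python
# Minimal sketch of the perceptron learning rule of Equation (3.1.2).
# The training set (an AND-like problem) and all names here are hypothetical;
# eta plays the role of the learning rate rho, and each input carries a
# trailing +1 bias component.

def sgn(v):
    # sign function: +1 for positive arguments, -1 otherwise
    return 1 if v > 0 else -1

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron_train(pairs, eta=0.5, max_cycles=100):
    w = [0.0] * len(pairs[0][0])            # initial weight vector w^1
    for _ in range(max_cycles):             # each pass is one "cycle"
        errors = 0
        for x, d in pairs:                  # present the m pairs in order
            y = sgn(dot(x, w))
            if y != d:                      # correct only on a misclassification
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                     # training set correctly classified
            break
    return w

# Linearly separable toy set: output +1 only when both inputs are 1.
train = [([0, 0, 1], -1), ([0, 1, 1], -1), ([1, 0, 1], -1), ([1, 1, 1], +1)]
w = perceptron_train(train)
```

For a separable set such as this one, cycling stops once a full pass produces no corrections; here a solution is typically found within a handful of cycles.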

Notice that for ρ = 0.5, the perceptron learning rule can be written as:

w^(k+1) = w^k + z^k    if (z^k)^T w^k ≤ 0
w^(k+1) = w^k    otherwise        (3.1.3)

where

z^k = d^k x^k        (3.1.4)

That is, a correction is done if and only if a misclassification, indicated by

(z^k)^T w^k ≤ 0        (3.1.5)

occurs. The addition of the vector z^k to w^k in Equation (3.1.3) moves the weight vector directly toward, and perhaps across, the hyperplane (z^k)^T w = 0. The new inner product (z^k)^T w^(k+1) is larger than (z^k)^T w^k by the amount ||z^k||², and the correction Δw^k = w^(k+1) − w^k is clearly moving w^k in a good direction, the direction of increasing (z^k)^T w, as can be seen from Figure 3.1.2. Thus, the perceptron learning rule attempts to find a solution w* for the following system of inequalities

(z^k)^T w* > 0    for k = 1, 2, ..., m        (3.1.6)

Figure 3.1.2. Geometric representation of the perceptron learning rule with ρ = 0.5.

In an analysis of any learning algorithm, and in particular the perceptron learning algorithm of Equation (3.1.2), there are two main issues to consider: (1) the existence of solutions, and (2) the convergence of the algorithm to the desired solutions (if they exist). In the case of the perceptron, it is clear that a solution vector (i.e., a vector w* which correctly classifies the training set) exists if and only if the given training set is linearly separable. Assuming, then, that the training set is linearly separable, we may proceed to show that the perceptron learning rule converges to a solution (Novikoff, 1962; Ridgway, 1962; Nilsson, 1965) as follows. Let w* be any solution vector, so that

(z^k)^T w* > 0    for k = 1, 2, ..., m        (3.1.7)

Then, from Equation (3.1.3), if the kth pattern is misclassified we may write

w^(k+1) − αw* = w^k − αw* + z^k        (3.1.8)

where α is a positive scale factor, and hence

||w^(k+1) − αw*||² = ||w^k − αw*||² + 2 (w^k − αw*)^T z^k + ||z^k||²        (3.1.9)

Since z^k is misclassified, we have (z^k)^T w^k ≤ 0, and thus

||w^(k+1) − αw*||² ≤ ||w^k − αw*||² − 2α (z^k)^T w* + ||z^k||²        (3.1.10)

Now, let β² = max_i ||z^i||² and γ = min_i (z^i)^T w* (γ is positive since (z^i)^T w* > 0 for all i) and substitute into Equation (3.1.10) to get

||w^(k+1) − αw*||² ≤ ||w^k − αw*||² − 2αγ + β²        (3.1.11)

If we choose α sufficiently large, in particular α = β²/γ, we obtain

||w^(k+1) − αw*||² ≤ ||w^k − αw*||² − β²        (3.1.12)

Thus, the squared distance between w^k and αw* is reduced by at least β² at each correction, and after k corrections we may write Equation (3.1.12) as

||w^(k+1) − αw*||² ≤ ||w^1 − αw*||² − kβ²        (3.1.13)

It follows that the sequence of corrections must terminate after no more than k0 corrections, where

k0 = ||w^1 − αw*||² / β²        (3.1.14)

Therefore, if a solution exists, it is achieved in a finite number of iterations. When corrections cease, the resulting weight vector must classify all the samples correctly, since a correction occurs whenever a sample is misclassified and since each sample appears infinitely often in the sequence. In general, a linearly separable problem admits an infinite number of solutions. The perceptron learning rule in Equation (3.1.2) converges to one of these solutions. This solution, though, is sensitive to the value of the learning rate ρ used and to the order of presentation of the training pairs. This sensitivity is responsible for the varying quality of the perceptron-generated separating surface observed in simulations.

The bound on the number of corrections, k0, given by Equation (3.1.14) depends on the choice of the initial weight vector w1. If w1 = 0, we get

k0 = α² ||w*||² / β²

or

k0 = β² ||w*||² / γ²        (3.1.15)

Here, k0 is a function of the initially unknown solution weight vector w*. Therefore, Equation (3.1.15) is of no help for predicting the maximum number of corrections. However, the denominator of Equation (3.1.15) implies that the difficulty of the problem is essentially determined by the samples most nearly orthogonal to the solution vector.
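Although Equation (3.1.15) cannot be evaluated before a solution is known, it is easy to check numerically after the fact. The sketch below uses a hypothetical AND-like training set written in z = d·x form and one hand-picked solution w*; both are assumptions made purely for illustration.

```python
# Numerical check of the perceptron convergence bound of Equation (3.1.15):
# with w^1 = 0, the number of corrections cannot exceed beta^2 ||w*||^2 / gamma^2.
# The data set and the chosen solution w* are hypothetical.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# z^k = d^k x^k for an AND-like problem (bias input appended as +1)
zs = [[0, 0, -1], [0, -1, -1], [-1, 0, -1], [1, 1, 1]]
w_star = [2, 2, -3]                        # one valid solution: z^T w* > 0 for all z

beta2 = max(dot(z, z) for z in zs)         # beta^2 = max_i ||z^i||^2
gamma = min(dot(z, w_star) for z in zs)    # gamma  = min_i (z^i)^T w*
k0 = beta2 * dot(w_star, w_star) / gamma ** 2

# Run the rho = 0.5 perceptron rule (add z^k on each misclassification)
w, corrections = [0, 0, 0], 0
for _ in range(100):                       # cycle until a clean pass
    changed = False
    for z in zs:
        if dot(z, w) <= 0:                 # misclassification test of (3.1.5)
            w = [wi + zi for wi, zi in zip(w, z)]
            corrections += 1
            changed = True
    if not changed:
        break
```

On this toy set the actual number of corrections is well below the bound, consistent with k0 being a worst-case estimate.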

Generalizations of the Perceptron Learning Rule

The perceptron learning rule may be generalized to include a variable increment ρ^k and a fixed, positive margin b. This generalized learning rule updates the weight vector whenever (z^k)^T w^k fails to exceed the margin b. Here, the algorithm for the weight vector update is given by

w^(k+1) = w^k + ρ^k z^k    if (z^k)^T w^k ≤ b
w^(k+1) = w^k    otherwise        (3.1.16)

The margin b is useful because it gives a dead-zone robustness to the decision boundary. That is, the perceptron's decision hyperplane is constrained to lie in a region between the two classes such that sufficient clearance is realized between this hyperplane and the extreme points (boundary patterns) of the training set. This makes the perceptron robust with respect to noisy inputs. It can be shown (Duda and Hart, 1973) that if the training set is linearly separable and if the following three conditions are satisfied:

ρ^k ≥ 0        (3.1.17a)

Σ_{k=1}^∞ ρ^k = ∞        (3.1.17b)

lim_{m→∞} [Σ_{k=1}^m (ρ^k)²] / [Σ_{k=1}^m ρ^k]² = 0        (3.1.17c)

(e.g., ρ^k = 1/k or even ρ^k = ρk), then w^k converges to a solution w* which satisfies (z^i)^T w* > b for i = 1, 2, ..., m. Furthermore, when ρ^k is fixed at a positive constant ρ, this learning rule converges in finite time.
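The fixed-increment case of this margin rule is easy to sketch. The code below is an illustrative implementation under assumed choices (toy z = d·x data, b = 0.5, constant ρ = 1), not a prescription from the text.

```python
# Sketch of the fixed-increment perceptron rule with margin, Equation (3.1.16):
# a correction is made whenever (z^k)^T w^k fails to exceed the margin b.
# The z = d*x training set below is a hypothetical AND-like problem.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def margin_perceptron(zs, b, rho=1.0, max_cycles=500):
    w = [0.0] * len(zs[0])
    for _ in range(max_cycles):
        changed = False
        for z in zs:
            if dot(z, w) <= b:             # pattern inside the dead zone
                w = [wi + rho * zi for wi, zi in zip(w, z)]
                changed = True
        if not changed:                    # every pattern clears the margin
            break
    return w

zs = [[0, 0, -1], [0, -1, -1], [-1, 0, -1], [1, 1, 1]]
w = margin_perceptron(zs, b=0.5)
```

After convergence, every pattern satisfies (z^k)^T w > b, i.e., the decision hyperplane has the desired clearance from the boundary patterns.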

Another variant of the perceptron learning rule is given by the "batch" update procedure

w^(k+1) = w^k + ρ Σ_{z ∈ Z(w^k)} z        (3.1.18)

where Z(wk) is the set of patterns z misclassified by wk. Here, the weight vector change is along the direction of the resultant vector of all misclassified patterns. In general, this update procedure converges faster than the perceptron rule, but it requires more storage.

In the nonlinearly separable case, the above algorithms do not converge. Few theoretical results are available on the behavior of these algorithms for nonlinearly separable problems [see Minsky and Papert (1969) and Block and Levin (1970) for some preliminary results]. For example, it is known that the length of w in the perceptron rule is bounded; i.e., ||w^k|| tends to fluctuate near some limiting value ||w*|| (Efron, 1964). This information may be used to terminate the search for w*. Another approach is to average the weight vectors near the fluctuation point w*. Butz (1967) proposed the use of a reinforcement factor γ, 0 ≤ γ ≤ 1, in the perceptron learning rule. This reinforcement places w in a region that tends to minimize the probability of error for nonlinearly separable cases. Butz's rule is as follows

w^(k+1) = w^k + ρ z^k    if (z^k)^T w^k ≤ 0
w^(k+1) = w^k + γρ z^k    if (z^k)^T w^k > 0        (3.1.19)

The Perceptron Criterion Function

It is interesting to see how the above error correction rules can be derived by a gradient descent on an appropriate criterion (objective) function. For the perceptron, we may define the following criterion function (Duda and Hart, 1973):

J(w) = Σ_{z ∈ Z(w)} (−z^T w)        (3.1.20)

where Z(w) is the set of samples misclassified by w (i.e., z^T w ≤ 0). Note that if Z(w) is empty then J(w) = 0; otherwise, J(w) > 0. Geometrically, J(w) is proportional to the sum of the distances from the misclassified samples to the decision boundary. The smaller J is, the better the weight vector w will be.

Given this objective function J(w), we can incrementally improve the search point w^k at each iteration by sliding downhill on the surface defined by J(w) in w space. Specifically, we may use J to perform a discrete gradient descent search, which updates w^k so that a step is taken downhill in the "steepest" direction along the search surface J(w) at w^k. This can be achieved by making the weight change Δw^k proportional to the negative of the gradient of J at the present location w^k; formally, we may write

w^(k+1) = w^k − ρ ∇J(w)|_(w = w^k)        (3.1.21)

Here, the initial search point w^1 and the learning rate (step size) ρ are to be specified by the user. We will refer to Equation (3.1.21) as the steepest gradient descent search rule or, simply, gradient descent. Next, substituting the gradient

∇J(w) = −Σ_{z ∈ Z(w)} z        (3.1.22)

in Equation (3.1.21), leads to the weight update rule

w^(k+1) = w^k + ρ Σ_{z ∈ Z(w^k)} z        (3.1.23)

The learning rule given in Equation (3.1.23) is identical to the multiple-sample (batch) perceptron rule of Equation (3.1.18). The original perceptron learning rule of Equation (3.1.3) can be thought of as an "incremental" gradient descent search rule for minimizing the perceptron criterion function in Equation (3.1.20). Following a similar procedure as in Equations (3.1.21) through (3.1.23), it can be shown that

J(w) = Σ_{z ∈ Z(w)} (b − z^T w),  with Z(w) = {z : z^T w ≤ b}        (3.1.24)

is the appropriate criterion function for the modified perceptron rule in Equation (3.1.16).

Before moving on, we should note that the gradient of J in Equation (3.1.22) is not mathematically precise. Due to the piecewise linear nature of J, sudden changes in the gradient of J occur every time the perceptron output y goes through a transition at (z^k)^T w = 0. Therefore, the gradient of J is not defined at "transition" points w satisfying (z^k)^T w = 0, k = 1, 2, ..., m. However, because of the discrete nature of Equation (3.1.21), the likelihood of w^k overlapping with one of these transition points is negligible, and thus we may still express ∇J as in Equation (3.1.22). The reader is referred to Problem 3.1.3 for further exploration into gradient descent on the perceptron criterion function.
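The equivalence between gradient descent on the perceptron criterion and the batch rule of Equation (3.1.18) can be made concrete in a short sketch. The toy data and names below are assumptions for illustration, not the text's own example.

```python
# Gradient descent on the perceptron criterion function of Equation (3.1.20).
# Each step adds rho times the resultant of the misclassified z's, i.e., the
# batch rule of Equations (3.1.18)/(3.1.23).  Data and names are hypothetical.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def J(w, zs):
    # perceptron criterion: sum of -z^T w over misclassified z (z^T w <= 0)
    return sum(-dot(z, w) for z in zs if dot(z, w) <= 0)

zs = [[0, 0, -1], [0, -1, -1], [-1, 0, -1], [1, 1, 1]]
w, rho = [0.0, 0.0, 0.0], 0.5
for _ in range(100):
    mis = [z for z in zs if dot(z, w) <= 0]
    if not mis:                            # Z(w) empty: J(w) = 0, stop
        break
    # w <- w - rho * grad J = w + rho * (sum of misclassified z's)
    w = [wi + rho * sum(z[i] for z in mis) for i, wi in enumerate(w)]
```

Once the misclassified set Z(w) becomes empty, J(w) reaches its minimum value of zero and the iteration halts.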

Mays Learning Rule

The criterion functions in Equations (3.1.20) and (3.1.24) are by no means the only functions we can construct that are minimized when w is a solution vector. For example, an alternative function is the quadratic function

J(w) = (1/2) Σ_{z ∈ Z(w)} (z^T w − b)²,  with Z(w) = {z : z^T w ≤ b}        (3.1.25)

where b is a positive constant margin. Like the previous criterion functions, the function J(w) in Equation (3.1.25) focuses attention on the misclassified samples. Its major difference is that its gradient is continuous, whereas the gradient of the perceptron criterion function, with or without the use of a margin, is not. Unfortunately, the present function can be dominated by the input vectors with the largest magnitudes. We may eliminate this undesirable effect by dividing by ||z||²:

J(w) = (1/2) Σ_{z ∈ Z(w)} (z^T w − b)² / ||z||²        (3.1.26)

The gradient of J(w) in Equation (3.1.26) is given by

∇J(w) = Σ_{z ∈ Z(w)} [(z^T w − b) / ||z||²] z        (3.1.27)

which, upon substituting in Equation (3.1.21), leads to the following learning rule

w^(k+1) = w^k + ρ Σ_{z ∈ Z(w^k)} [(b − z^T w^k) / ||z||²] z        (3.1.28)

If we consider the incremental update version of Equation (3.1.28), we arrive at Mays' rule (Mays, 1964):

w^(k+1) = w^k + ρ [(b − (z^k)^T w^k) / ||z^k||²] z^k    if (z^k)^T w^k ≤ b
w^(k+1) = w^k    otherwise        (3.1.29)

If the training set is linearly separable, Mays' rule converges in a finite number of iterations, for 0 < ρ < 2 (Duda and Hart, 1973). In the case of a nonlinearly separable training set, the training procedure in Equation (3.1.29) will never converge. To fix this problem, a decreasing learning rate such as ρ^k = ρ/k may be used to force convergence to some approximate separating surface (Duda and Singleton, 1964).
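A minimal sketch of Mays' rule follows. All particulars are assumptions for the example: the toy z = d·x data, the margin b = 0.5, and the choice ρ = 1.5 (inside the 0 < ρ < 2 range; an over-relaxed step pushes each corrected pattern strictly past the margin, which keeps the toy run from stalling on the boundary).

```python
# Sketch of Mays' rule, Equation (3.1.29): an incremental update, normalized
# by ||z||^2, applied only when (z^k)^T w^k does not clear the margin b.
# Data, margin, and rho are hypothetical choices for this illustration.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def mays_train(zs, b=0.5, rho=1.5, max_cycles=500):
    w = [0.0] * len(zs[0])
    for _ in range(max_cycles):
        changed = False
        for z in zs:
            if dot(z, w) < b:                       # correction needed
                scale = rho * (b - dot(z, w)) / dot(z, z)
                w = [wi + scale * zi for wi, zi in zip(w, z)]
                changed = True
        if not changed:
            break
    return w

zs = [[0, 0, -1], [0, -1, -1], [-1, 0, -1], [1, 1, 1]]
w = mays_train(zs)
```

On this separable toy set the learned vector classifies every pattern correctly, with each (z^k)^T w at or beyond the margin.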

Widrow-Hoff (α-LMS) Learning Rule

Another example of an error correcting rule with a quadratic criterion function is the Widrow-Hoff rule (Widrow and Hoff, 1960). This rule was originally used to train the linear unit, also known as the adaptive linear combiner element (ADALINE), shown in Figure 3.1.3. In this case, the output of the linear unit in response to the input x^k is simply y^k = (x^k)^T w. The Widrow-Hoff rule was originally proposed as an ad hoc rule, which embodies the so-called minimal disturbance principle. Later, it was discovered (Widrow and Stearns, 1985) that this rule converges in the mean square to the solution w* which corresponds to the least-mean-square (LMS) output error, if all input patterns are of the same length (i.e., ||x^k|| is the same for all k). Therefore, this rule is sometimes referred to as the α-LMS rule (the "α" is used here to distinguish this rule from another very similar rule which is discussed in Section 3.1.2).

The α-LMS rule is given by

w^(k+1) = w^k + α [d^k − (x^k)^T w^k] x^k / ||x^k||²        (3.1.30)

where d^k ∈ R is the desired response, and α > 0. Equation (3.1.30) is similar to the perceptron rule if one sets ρ, in Equation (3.1.2), as

ρ = α / ||x^k||²        (3.1.31)

However, the error in Equation (3.1.30) is measured at the linear output, not after the nonlinearity as in the perceptron. The constant α controls the stability and speed of convergence (Widrow and Stearns, 1985; Widrow and Lehr, 1990). If the input vectors are independent over time, stability is ensured for most practical purposes if 0 < α < 2.

As with Mays' rule, this rule is self-normalizing in the sense that the choice of α does not depend on the magnitude of the input vectors. Since the α-LMS rule selects Δw^k to be collinear with x^k, the desired error correction is achieved with a weight change of the smallest possible magnitude. Thus, when adapting to learn a new training sample, the responses to previous training samples are minimally disturbed, on the average. This is the basic idea behind the minimal disturbance principle on which the α-LMS rule is founded. Alternatively, one can show that the α-LMS learning rule is a gradient descent minimizer of an appropriate quadratic criterion function (see Problem 3.1.4).
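The normalized update of Equation (3.1.30) can be sketched directly. The targets below are hypothetical but consistent (generated from an assumed "true" weight vector [2, −1, −1]), so the output error can be driven essentially to zero.

```python
# Sketch of the alpha-LMS rule, Equation (3.1.30): the weight change is
# collinear with x^k and normalized by ||x^k||^2.  The training pairs are
# hypothetical and consistent, so repeated cycling drives the error to ~0.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def alpha_lms(pairs, alpha=0.5, cycles=500):
    w = [0.0] * len(pairs[0][0])
    for _ in range(cycles):
        for x, d in pairs:
            err = d - dot(x, w)                 # error at the *linear* output
            scale = alpha * err / dot(x, x)     # normalization by ||x||^2
            w = [wi + scale * xi for wi, xi in zip(w, x)]
    return w

# Targets generated from the assumed weights [2, -1, -1] (bias input last).
pairs = [([1, 0, 1], 1.0), ([0, 1, 1], -2.0), ([1, 1, 1], 0.0), ([2, 1, 1], 2.0)]
w = alpha_lms(pairs)
```

Because each step projects the error along x^k with 0 < α < 2, the residual errors shrink geometrically on this consistent system.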

In the following, additional learning rules for single unit training are presented. These rules are systematically derived by first defining an appropriate criterion function and then optimizing such a function by an iterative gradient search procedure.

μ-LMS Learning Rule

The μ-LMS learning rule (Widrow and Hoff, 1960) represents the most analyzed and most applied simple learning rule. It is also of special importance due to its possible extension to learning in multiple unit neural nets. Therefore, special attention is given to this rule in this chapter. In the following, the μ-LMS rule is described in the context of the linear unit in Figure 3.1.3.

Let

J(w) = (1/2) Σ_{i=1}^m (d^i − y^i)²        (3.1.32)

be the sum of squared error (SSE) criterion function, where

y^i = (x^i)^T w        (3.1.33)

Now, using steepest gradient descent search to minimize J(w) in Equation (3.1.32) gives

w^(k+1) = w^k + μ Σ_{i=1}^m [d^i − (x^i)^T w^k] x^i        (3.1.34)

The criterion function J(w) in Equation (3.1.32) is quadratic in the weights because of the linear relation between y^i and w. In fact, J(w) defines a convex hyperparaboloidal surface with a single minimum w* (the global minimum). Therefore, if the positive constant μ is chosen sufficiently small, the gradient descent search implemented by Equation (3.1.34) will asymptotically converge toward the solution w*, regardless of the setting of the initial search point, w^1. The learning rule in Equation (3.1.34) is sometimes referred to as the "batch" LMS rule.

The incremental version of Equation (3.1.34), known as the μ-LMS or LMS rule, is given by

w^(k+1) = w^k + μ [d^k − (x^k)^T w^k] x^k        (3.1.35)

Note that this rule becomes identical to the α-LMS learning rule in Equation (3.1.30) upon setting μ as

μ = α / ||x^k||²        (3.1.36)

Also, when the input vectors have the same length, as would be the case when x ∈ {−1, +1}^(n+1), the μ-LMS rule becomes identical to the α-LMS rule. Since the α-LMS learning algorithm converges when 0 < α < 2, we can start from Equation (3.1.36) and calculate the required range on μ for ensuring the convergence of the μ-LMS rule:

0 < μ < 2 / ||x^k||²        (3.1.37)

For input patterns independent over time and generated by a stationary process, convergence of the mean of the weight vector, <w^k>, is ensured if the fixed learning rate μ is chosen to be smaller than 2/λ_max, where λ_max is the largest eigenvalue of the autocorrelation matrix <x x^T> of the inputs (Widrow and Stearns, 1985. Also see Problem 4.3.8 in the next chapter for further exploration.) In this case, <w^k> approaches a solution w* as k → ∞. Here, "<·>" signifies the mean or expected value. Note that this bound is less restrictive than the one in Equation (3.1.37). Horowitz and Senne (1981) showed that a more restrictive bound on μ guarantees the convergence of w in the mean square (i.e., <||w^k − w*||²> → 0 as k → ∞), for input patterns generated by a zero-mean Gaussian process independent over time. It should be noted that convergence in the mean square implies convergence in the mean; however, the converse is not necessarily true. The assumptions of decorrelated patterns and stationarity are not necessary conditions for the convergence of μ-LMS (Widrow et al., 1976; Farden, 1981). For example, Macchi and Eweda (1983) have a much stronger result regarding convergence of the μ-LMS rule, which is valid even when a finite number of successive training patterns are strongly correlated.

In practical problems, m > n + 1; hence, it becomes impossible to satisfy all of the requirements (x^k)^T w = d^k, k = 1, 2, ..., m, and Equation (3.1.35) never converges with a fixed μ. Thus, for convergence, μ is set to a decreasing sequence such as μ^k = μ0/k, where μ0 is a small positive constant. In applications such as linear filtering, though, the decreasing step size is not very valuable, because it cannot accommodate nonstationarity in the input signal. Indeed, w^k will essentially stop changing for large k, which precludes the tracking of time variations. Thus, the fixed-increment (constant μ) LMS learning rule has the advantage of limited memory, which enables it to track time fluctuations in the input data.
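The behavior of incremental LMS with a fixed small μ on an overdetermined problem can be sketched as follows. The noisy one-dimensional regression data (with a bias input) and the step size are assumptions chosen for the example.

```python
# Sketch of the incremental mu-LMS rule, Equation (3.1.35), on an
# overdetermined problem (m > n + 1) with hypothetical noisy targets: no w
# satisfies every equation, so the SSE settles near (not at) its minimum.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sse(pairs, w):
    return 0.5 * sum((d - dot(x, w)) ** 2 for x, d in pairs)

pairs = [([0.0, 1.0], 0.1), ([1.0, 1.0], 0.9), ([2.0, 1.0], 2.2), ([3.0, 1.0], 2.8)]
mu = 0.05                                  # well below 2 / max ||x||^2 = 0.2
w = [0.0, 0.0]
sse_before = sse(pairs, w)
for _ in range(500):                       # 500 cycles through the training set
    for x, d in pairs:
        err = d - dot(x, w)
        w = [wi + mu * err * xi for wi, xi in zip(w, x)]
sse_after = sse(pairs, w)
```

With a fixed μ the weights keep fluctuating in a small neighborhood of the minimum SSE solution, but the SSE itself drops by orders of magnitude from its initial value.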

When the learning rate μ is sufficiently small, the μ-LMS rule becomes a "good" approximation to the gradient descent rule in Equation (3.1.34). This means that the weight vector w^k will tend to move towards the global minimum w* of the convex SSE criterion function. Next, we show that w* is given by

w* = X† d        (3.1.38)

where X = [x^1 x^2 ... x^m], d = [d^1 d^2 ... d^m]^T, and X† = (XX^T)^{−1} X is the generalized inverse or pseudo-inverse (Penrose, 1955) of X^T, for m > n + 1.

The extreme points (minima and maxima) of the function J(w) are solutions to the equation

∇J(w) = 0        (3.1.39)

Therefore, any minimum of the SSE criterion function in Equation (3.1.32) must satisfy

∇J(w) = −Σ_{i=1}^m [d^i − (x^i)^T w] x^i = 0        (3.1.40)

Equation (3.1.40) can be rewritten as

X X^T w = X d        (3.1.41)

which for a nonsingular matrix X XT gives the solution in Equation (3.1.38), or explicitly

w* = (X X^T)^{−1} X d        (3.1.42)

Recall that just because w* in Equation (3.1.42) satisfies the condition ∇J(w*) = 0, this does not guarantee that w* is a local minimum of the criterion function J. It does, however, considerably narrow the choices in that such w* represents (in a local sense) either a minimum, a maximum, or a saddle point of J. To verify that w* is actually a minimum of J(w), we may evaluate the second derivative or Hessian matrix of J at w* and show that it is positive definite. This can be readily achieved after noting that ∇²J is equal to the positive-definite matrix XX^T. Thus, w* is a minimum of J.
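For a data set this small, the minimum SSE solution of Equation (3.1.42) can be computed directly from the normal equations. The sketch below uses hypothetical data with n + 1 = 2, solving the 2×2 system by Cramer's rule, and then checks that the gradient of Equation (3.1.40) vanishes at w*.

```python
# Direct minimum SSE solution, w* = (X X^T)^{-1} X d (Equation (3.1.42)),
# for a hypothetical data set, plus a check that the gradient of the SSE
# criterion (Equation (3.1.40)) vanishes at w*.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

xs = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]   # the columns of X
ds = [0.1, 0.9, 2.2, 2.8]                               # the entries of d

A = [[sum(x[i] * x[j] for x in xs) for j in range(2)] for i in range(2)]  # X X^T
p = [sum(d * x[i] for x, d in zip(xs, ds)) for i in range(2)]             # X d

# Solve the 2x2 system A w = p by Cramer's rule
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w_star = [(p[0] * A[1][1] - p[1] * A[0][1]) / det,
          (A[0][0] * p[1] - A[1][0] * p[0]) / det]

# Gradient at w*: -sum_i (d^i - (x^i)^T w) x^i, which should be ~0
grad = [-sum((d - dot(x, w_star)) * x[i] for x, d in zip(xs, ds)) for i in range(2)]
```

A vanishing gradient together with the positive definiteness of XX^T (here a 2×2 matrix with positive determinant and trace) confirms that w* is the unique minimum.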

The LMS rule may also be applied to synthesize the weight vector, w, of a perceptron for solving two-class classification problems. Here, one starts by training the linear unit in Figure 3.1.3 with the given training pairs {x^k, d^k}, k = 1, 2, ..., m, using the LMS rule. During training, the desired target d^k is set to +1 for one class and to −1 for the other class. (In fact, any positive constant can be used as the target for one class, and any negative constant can be used as the target for the other class.) After convergence of the learning process, the solution vector obtained may now be used in the perceptron for classification. Due to the thresholding nonlinearity in the perceptron, the output of the classifier will now be properly restricted to the set {−1, +1}.

When used as a perceptron weight vector, the minimum SSE solution in Equation (3.1.42) does not generally minimize the perceptron classification error rate. This should not be surprising, since the SSE criterion function is not designed to constrain its minimum inside the linearly separable solution region. Therefore, this solution does not necessarily represent a linearly separable solution, even when the training set is linearly separable (this is further explored in Section 3.1.5). However, when the training set is nonlinearly separable, the solution arrived at may still be a useful approximation. Therefore, by employing the LMS rule for perceptron training, linear separability is sacrificed for good compromise performance on both separable and nonseparable problems.

The μ-LMS as a Stochastic Process

Stochastic approximation theory may be employed as an alternative to the deterministic gradient descent analysis presented thus far. It has the advantage of naturally arriving at a learning rate schedule ρ^k for asymptotic convergence in the mean square. Here, one starts with the mean-square-error (MSE) criterion function:

J(w) = (1/2) <(d − x^T w)²>        (3.1.43)

where <·> denotes the mean (expectation) over all training vectors. Now, one may compute the gradient of J as

∇J(w) = −<(d − x^T w) x> = <x x^T> w − <d x>        (3.1.44)

which upon setting to zero, allows us to find the minimum w* of J in Equation (3.1.43) as the solution of

C w = P

which gives

w* = C^{−1} P        (3.1.45)

where C = <x x^T> and P = <d x>. Note that the expected value of a vector or a matrix is found by taking the expected values of its components. We refer to C as the autocorrelation matrix of the input vectors and to P as the cross-correlation vector between the input vector x and its associated desired target d (more to follow on the properties of C in Section 3.3.1). In Equation (3.1.45), the determinant of C, |C|, is assumed different from zero. The solution w* in Equation (3.1.45) is sometimes called the Wiener weight vector (Widrow and Stearns, 1985). It represents the minimum MSE solution, also known as the least-mean-square (LMS) solution.

It is interesting to note here the close relation between the minimum SSE solution in Equation (3.1.42) and the LMS or minimum MSE solution in Equation (3.1.45). In fact, one can show that when the size of the training set, m, is large, the minimum SSE solution converges to the minimum MSE solution.

First, let us express XX^T as the sum of vector outer products, XX^T = Σ_{k=1}^m x^k (x^k)^T. We can also rewrite Xd as Σ_{k=1}^m d^k x^k. This representation allows us to express Equation (3.1.42) as

w* = [Σ_{k=1}^m x^k (x^k)^T]^{−1} [Σ_{k=1}^m d^k x^k]

Now, multiplying the right-hand side of the above equation by m(1/m) = 1 allows us to express it as

w* = [(1/m) Σ_{k=1}^m x^k (x^k)^T]^{−1} [(1/m) Σ_{k=1}^m d^k x^k]

Finally, if m is large, the averages (1/m) Σ_{k=1}^m x^k (x^k)^T and (1/m) Σ_{k=1}^m d^k x^k become very good approximations of the expectations C = <x x^T> and P = <d x>, respectively. Thus, we have established the equivalence of the minimum SSE and minimum MSE solutions for a large training set.

Next, in order to minimize the MSE criterion, one may employ a gradient descent procedure where, instead of the expected gradient in Equation (3.1.44), the instantaneous gradient [(x^k)^T w^k − d^k] x^k is used. Here, at each learning step, the input vector x^k is drawn at random. This leads to the stochastic process:

w^(k+1) = w^k + ρ^k [d^k − (x^k)^T w^k] x^k        (3.1.46)

which is the same as the μ-LMS rule in Equation (3.1.35), except for a variable learning rate ρ^k. It can be shown that if |C| ≠ 0 and ρ^k satisfies the three conditions:

1. ρ^k > 0        (3.1.47a)

2. Σ_{k=1}^∞ ρ^k = ∞        (3.1.47b)

3. Σ_{k=1}^∞ (ρ^k)² < ∞        (3.1.47c)

then, wk converges to w* in Equation (3.1.45) asymptotically in the mean square; i.e.,

lim_{k→∞} <||w^k − w*||²> = 0        (3.1.48)

The criterion function in Equation (3.1.43) is of the form J(w) = <g(w, x)> and is known as a regression function. The iterative algorithm in Equation (3.1.46) is also known as a stochastic approximation procedure (or Kiefer-Wolfowitz or Robbins-Monro procedure). For a thorough discussion of stochastic approximation theory, the reader is referred to Wasan (1969).
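The stochastic process of Equation (3.1.46) can be sketched with the classic schedule ρ^k = ρ0/k, which satisfies conditions (3.1.47a)-(3.1.47c). The data set, ρ0, the random seed, and the iteration count are all assumptions made for this illustration.

```python
# Sketch of the stochastic-approximation form of LMS, Equation (3.1.46),
# with rho^k = rho0 / k: the sum of rho^k diverges while the sum of (rho^k)^2
# converges, as conditions (3.1.47a)-(3.1.47c) require.  Data is hypothetical.

import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

pairs = [([0.0, 1.0], 0.1), ([1.0, 1.0], 0.9), ([2.0, 1.0], 2.2), ([3.0, 1.0], 2.8)]

def mse(w):
    return sum((d - dot(x, w)) ** 2 for x, d in pairs) / len(pairs)

random.seed(0)
w, rho0 = [0.0, 0.0], 0.1
mse_start = mse(w)
for k in range(1, 20001):
    x, d = random.choice(pairs)            # pattern drawn at random each step
    rho_k = rho0 / k                       # decaying learning rate schedule
    err = d - dot(x, w)
    w = [wi + rho_k * err * xi for wi, xi in zip(w, x)]
mse_end = mse(w)
```

The decaying rate quiets the fluctuations over time; on this toy problem the empirical MSE drops far below its initial value, consistent with convergence toward the Wiener solution.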

Example 3.1.1: In this example, we present the results of a set of simulations which should help give some insight into the dynamics of the batch and incremental LMS learning rules. Specifically, we are interested in comparing the convergence behavior of the discrete-time dynamical systems in Equations (3.1.34) and (3.1.35). Consider the training set depicted in Figure 3.1.4 for a simple mapping problem. The ten squares and ten filled circles in this figure are positioned at the points whose coordinates (x1, x2) specify the two components of the input vectors. The squares and circles are to be mapped to the targets +1 and −1, respectively. For example, the leftmost square in the figure represents the training pair {[0, 1]^T, +1}. Similarly, the rightmost circle represents the training pair {[2, 2]^T, −1}. Figure 3.1.5 shows plots of the evolution of the square of the distance between the vector w^k and the (computed) minimum SSE solution w*, for batch LMS (dashed line) and incremental LMS (solid line). In both simulations, the learning rate (step size) was set to 0.005. The initial search point w^1 was set to [0, 0]^T. For the incremental LMS rule, the training examples are selected randomly from the training set. The batch LMS rule converges to the optimal solution w* in less than 100 steps. Incremental LMS requires more learning steps, on the order of 2,000 steps, to converge to a small neighborhood of w*. The fluctuations in ||w^k − w*||² in this neighborhood are less than 0.02, as can be seen from Figure 3.1.5. The effect of a deterministic order of presentation of the training examples on the incremental LMS rule is shown by the solid line in Figure 3.1.6. Here, the training examples are presented in a predefined order, which did not change during training. The same initialization and step size are used as before.
In order to allow for a more meaningful comparison between the two LMS rule versions, one learning step of incremental LMS is taken to mean a full cycle through the 20 samples. For comparison, the simulation result with batch LMS learning is plotted in the figure (see dashed line). These results indicate a very similar behavior in the convergence characteristics of incremental and batch LMS learning. This is so because of the small step size used. Both cases show asymptotic convergence toward the optimal solution w*, but with a relatively faster convergence of the batch LMS rule near w*; this is attributed to the use of more accurate gradient information.

Figure 3.1.4. A 20-sample training set used in the simulations associated with Example 3.1.1. Points signified by a square and a filled circle should map into +1 and −1, respectively.

Figure 3.1.5. Plots (learning curves) of the squared distance between the search point w^k and the minimum SSE solution w*, generated using two versions of the LMS learning rule. The dashed line corresponds to the batch LMS rule in Equation (3.1.34). The solid line corresponds to the incremental LMS rule in Equation (3.1.35) with a random order of presentation of the training patterns. In both cases, w^1 = 0 and μ = 0.005 are used. Note the logarithmic scale for the iteration number k.

Figure 3.1.6. Learning curves for the batch LMS (dashed line) and incremental LMS (solid line) learning rules for the data in Figure 3.1.4. The result for the batch LMS rule shown here is identical to the one shown in Figure 3.1.5 (this result looks different only because of the present use of a linear scale for the horizontal axis). The incremental LMS rule results shown assume a deterministic, fixed order of presentation of the training patterns. Also, for the incremental LMS case, wk represents the weight vector after the completion of the kth learning "cycle". Here, one cycle corresponds to 20 consecutive learning iterations.

Correlation Learning Rule

The Correlation rule is derived by starting from the criterion function

J(w) = −Σ_{i=1}^m d^i y^i        (3.1.49)

where y^i = (x^i)^T w, and performing gradient descent to minimize J. Note that minimizing J(w) is equivalent to maximizing the correlation between the desired target and the corresponding linear unit's output for all x^i, i = 1, 2, ..., m. Now, employing steepest gradient descent to minimize J(w) leads to the learning rule:

w^(k+1) = w^k + ρ d^k x^k        (3.1.50)

By setting ρ to 1, starting from w^1 = 0, and completing one learning cycle using Equation (3.1.50), we arrive at the weight vector w* given by:

w* = Σ_{k=1}^m d^k x^k = X d        (3.1.51)

where X and d are as defined earlier. Note that Equation (3.1.51) leads to the minimum SSE solution in Equation (3.1.38) if X = X†. This is only possible if the training vectors x^k are encoded such that XX^T is the identity matrix (i.e., the x^k's are orthonormal). Correlation learning is further explored in Section 7.1.1 in Chapter 7.
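The one-cycle construction of Equation (3.1.51) is easy to sketch. The orthonormal inputs below (standard basis vectors) and the targets are hypothetical choices that satisfy XX^T = I, so the resulting weight vector reproduces every target exactly.

```python
# The correlation rule of Equation (3.1.50) builds w = X d in one cycle
# (with rho = 1 and w^1 = 0).  When the training vectors are orthonormal
# (X X^T = I), this coincides with the minimum SSE solution, as noted in
# the text.  Inputs and targets here are hypothetical.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # orthonormal inputs
ds = [0.5, -1.0, 2.0]

w = [0.0, 0.0, 0.0]
for x, d in zip(xs, ds):          # one learning cycle of Equation (3.1.50)
    w = [wi + d * xi for wi, xi in zip(w, x)]

# With orthonormal inputs the linear unit reproduces every target exactly:
outputs = [dot(x, w) for x in xs]
```

If the inputs were merely linearly independent rather than orthonormal, the one-cycle weights would generally differ from the minimum SSE solution.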

Another version of this type of learning is the covariance learning rule. This rule is obtained by steepest gradient descent on a criterion function that replaces the correlation d^i y^i in Equation (3.1.49) by the covariance (d^i − <d>)(y^i − <y>). Here, <y> and <d> are computed averages, over all training pairs, of the unit's output and the desired target, respectively. Covariance learning provides the basis of the cascade-correlation net presented in Section 6.3.2.

Delta Learning Rule

The following rule is similar to the α-LMS rule, except that it allows for units with a differentiable nonlinear activation function f. Figure 3.1.7 illustrates a unit with a sigmoidal activation function. Here, the unit's output is y = f(net), with net defined as the vector inner product x^T w.

Figure 3.1.7. A computational unit with a differentiable sigmoidal activation function.

Again, consider the training pairs {x^i, d^i}, i = 1, 2, ..., m, with x^i ∈ R^(n+1) (x^i_(n+1) = 1 for all i) and d^i ∈ [−1, +1]. Performing gradient descent on the instantaneous SSE criterion function J(w) = (1/2)(d^k − y^k)², whose gradient is given by

∇J(w) = −(d^k − y^k) f′(net^k) x^k        (3.1.52)

leads to the delta learning rule

w^(k+1) = w^k + ρ (d^k − y^k) f′(net^k) x^k        (3.1.53)

where y^k = f(net^k) and net^k = (x^k)^T w^k. If f is defined by f(net) = tanh(β net), then its derivative is f′(net) = β[1 − f²(net)]. For the "logistic" function, f(net) = 1/(1 + e^(−β net)), the derivative is f′(net) = β f(net)[1 − f(net)]. Figure 3.1.8 plots f and f′ for the hyperbolic tangent activation function with β = 1. Note how f asymptotically approaches +1 and −1 in the limit as net approaches +∞ and −∞, respectively.

Figure 3.1.8. Hyperbolic tangent activation function f and its derivative f′, plotted for b = 1.

One disadvantage of the delta learning rule is immediately apparent upon inspection of the graph of f′(net) in Figure 3.1.8. In particular, notice how f′(net) ≈ 0 when net has large magnitude (i.e., |net| > 3); these regions are called "flat spots" of f′. In these flat spots, we expect the delta learning rule to progress very slowly (i.e., very small weight changes even when the error (d − y) is large), because the magnitude of the weight change in Equation (3.1.53) directly depends on the magnitude of f′(net). Since slow convergence results in excessive computation time, it would be advantageous to try to eliminate the flat spot phenomenon when using the delta learning rule. One common flat spot elimination technique involves replacing f′ by f′ plus a small positive bias ε. In this case, the weight update equation reads:

w^(k+1) = w^k + ρ (d^k − y^k) [f′(net^k) + ε] x^k        (3.1.54)
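A minimal sketch of the flat-spot fix for a tanh unit follows; the learning rate ρ, gain b, and bias ε values here are our own choices, not the text's.

```python
import numpy as np

def delta_rule_step(w, x, d, rho=0.1, b=1.0, eps=0.05):
    """One delta-rule update with flat-spot elimination:
    f'(net) is replaced by f'(net) + eps."""
    net = x @ w
    y = np.tanh(b * net)
    fprime = b * (1.0 - y ** 2)     # derivative of tanh(b*net)
    return w + rho * (d - y) * (fprime + eps) * x

# Deep in a flat spot (net = 20, so f'(net) is essentially zero),
# the eps term keeps the weights moving despite the large error d - y.
w = np.array([10.0, 10.0])
x = np.array([1.0, 1.0])
w_new = delta_rule_step(w, x, d=-1.0)
assert not np.allclose(w_new, w)
```

With eps set to 0 the same step leaves the weights essentially unchanged, which is exactly the stalling behavior described above.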

One of the primary advantages of the delta rule is that it has a natural extension which may be used to train multilayered neural nets. This extension, known as back error propagation, will be discussed in Chapter 5.

Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning rules, AHK I, AHK II, and AHK III, are derived based on gradient descent strategies on an appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well suited for generating robust decision surfaces for linearly separable problems. The third training rule, AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable problems. The three AHK learning rules preserve the simple incremental nature found in the LMS and perceptron learning rules. The AHK rules also possess additional processing capabilities, such as the ability to automatically identify critical cluster boundaries and place a linear decision surface in such a way that it leads to enhanced classification robustness.

Consider a two-class {c1, c2} classification problem with m labeled feature vectors (training vectors) {x^i, d^i}, i = 1, 2, ..., m. Assume that x^i belongs to R^(n+1) (with the last component of x^i being a constant bias of value 1) and that d^i = +1 (−1) if x^i ∈ c1 (c2). Then, a single perceptron can be trained to correctly classify the above training pairs if an (n + 1)-dimensional weight vector w is computed which satisfies the following set of m inequalities (the sgn function is assumed to be the perceptron's activation function):

(x^i)^T w > 0 if x^i ∈ c1 (d^i = +1), and (x^i)^T w < 0 if x^i ∈ c2 (d^i = −1), i = 1, 2, ..., m        (3.1.55)

Next, if we define a set of m new vectors, zi, according to

z^i = d^i x^i,  i = 1, 2, ..., m        (3.1.56)

and we let

Z = [ z1  z2  ... zm] (3.1.57)

then Equation (3.1.55) may be rewritten in the single matrix form

Z^T w > 0        (3.1.58)

Now, defining an m-dimensional positive-valued margin vector b (b > 0) and using it in Equation (3.1.58), we arrive at the following equivalent form of Equation (3.1.55):

Z^T w = b,  b > 0        (3.1.59)

Thus, the training of the perceptron is now equivalent to solving Equation (3.1.59) for w, subject to the constraint b > 0. Ho and Kashyap (1965) proposed an iterative algorithm for solving Equation (3.1.59). In the Ho-Kashyap algorithm, the components of the margin vector are first initialized to small positive values, and the pseudo-inverse of Z is used to generate a solution for w (based on the initial guess of b) which minimizes the SSE criterion function J(w, b) = ||Z^T w − b||^2:

w = Z† b        (3.1.60)

where Z† = (Z Z^T)^(−1) Z, for m > n + 1. Next, a new estimate for the margin vector is computed by performing the constrained (b > 0) gradient descent

b^(k+1) = b^k + ρ (e^k + |e^k|),  with  e^k = Z^T w^k − b^k        (3.1.61)

where |·| denotes the absolute value of the components of the argument vector and b^k is the "current" margin vector. A new estimate of w can now be computed using Equation (3.1.60) and employing the updated margin vector from Equation (3.1.61). This process continues until all the components of e are zero (or are sufficiently small and positive), which is an indication of linear separability of the training set, or until all components of e are negative (e < 0), which is an indication of nonlinear separability of the training set (no solution is found). It can be shown (Ho and Kashyap, 1965; Slansky and Wassel, 1981) that the Ho-Kashyap procedure converges in a finite number of steps if the training set is linearly separable. For simulations comparing the above training algorithm to the LMS and perceptron training procedures, the reader is referred to Hassoun and Clark (1988), Hassoun and Youssef (1989), and Hassoun (1989a). We will refer to the above algorithm as the direct Ho-Kashyap (DHK) algorithm.
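The DHK iteration is compact enough to sketch directly. The following is our own illustrative implementation (the learning rate, tolerance, toy data, and the early exit once Z^T w > 0 are our choices, not the text's):

```python
import numpy as np

def direct_ho_kashyap(Z, rho=0.5, iters=1000, tol=1e-8):
    """Direct Ho-Kashyap: Z is (n+1) x m with columns z^i = d^i x^i.
    Returns (w, b, separable_flag)."""
    m = Z.shape[1]
    b = np.full(m, 0.1)                # small positive initial margins
    Zpinv = np.linalg.pinv(Z.T)        # equals (Z Z^T)^-1 Z when m > n+1
    for _ in range(iters):
        w = Zpinv @ b                  # minimizes ||Z^T w - b||^2
        e = Z.T @ w - b                # margin error vector
        if np.all(np.abs(e) < tol) or np.all(Z.T @ w > 0):
            return w, b, True          # a separating w has been found
        if np.all(e < 0):
            return w, b, False         # indication of nonlinear separability
        b = b + rho * (e + np.abs(e))  # constrained (b > 0) margin update
    return w, b, bool(np.all(Z.T @ w > 0))

# Toy separable problem: three points in c1, two in c2 (bias appended).
c1 = [(1, 1), (2, 0), (0, 2)]
c2 = [(-1, -1), (-2, 0)]
Z = np.array([[*p, 1.0] for p in c1] + [[-x, -y, -1.0] for x, y in c2]).T
w, b, sep = direct_ho_kashyap(Z)
assert sep and np.all(Z.T @ w > 0)
```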

The direct synthesis of the w estimate in Equation (3.1.60) involves a one-time computation of the pseudo-inverse of Z. However, this computation can be expensive and requires special treatment when Z Z^T is ill-conditioned (i.e., when the determinant |Z Z^T| is close to zero). An alternative algorithm that is based on gradient descent principles and does not require the direct computation of Z† can be derived. This derivation is presented next.

Starting with the criterion function J(w, b) = (1/2)||Z^T w − b||^2, gradient descent may be performed (Slansky and Wassel, 1981) with respect to b and w so that J is minimized subject to the constraint b > 0. The gradient of J with respect to b and w is given by

∇b J = −(Z^T w^k − b^k) = −e^k        (3.1.62a)

∇w J = Z (Z^T w^k − b^(k+1))        (3.1.62b)

where the superscripts k and k + 1 represent current and updated values, respectively. One analytic method for imposing the constraint b > 0 is to replace the gradient in Equation (3.1.62a) by -0.5(e + |e|) with e as defined in Equation (3.1.61). This leads to the following gradient descent formulation of the Ho-Kashyap procedure:

b^(k+1) = b^k + (1/2) ρ₁ (e^k + |e^k|)        (3.1.63a)

and

w^(k+1) = w^k − ρ₂ Z (Z^T w^k − b^(k+1))        (3.1.63b)

where ρ₁ and ρ₂ are strictly positive constant learning rates. Because of the requirement that all training vectors z^k (or x^k) be present and included in Z, we will refer to the above procedure as the batch-mode adaptive Ho-Kashyap (AHK) procedure. It can be easily shown that if ρ₁ = 0 and b^1 = 1, Equation (3.1.63) reduces to the μ-LMS learning rule. Furthermore, convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ₁ < 2 and 0 < ρ₂ < 2/λmax, where λmax is the largest eigenvalue of the positive definite matrix Z Z^T.

A completely adaptive Ho-Kashyap procedure for solving Equation (3.1.59) is arrived at by starting from the instantaneous criterion function J(w, b_i) = (1/2)[(z^i)^T w − b_i]^2, which leads to the following incremental update rules:

b_i(new) = b_i + (1/2) ρ₁ (e^i + |e^i|),  with  e^i = (z^i)^T w − b_i        (3.1.64a)

and

w(new) = w + ρ₂ [b_i(new) − (z^i)^T w] z^i        (3.1.64b)

Here b_i represents a scalar margin associated with the input x^i. In all of the above Ho-Kashyap learning procedures, the margin values are initialized to small positive values, and the perceptron weights are initialized to zero (or small random) values. If full margin error correction is assumed in Equation (3.1.64a), i.e., ρ₁ = 1, the incremental learning procedure in Equation (3.1.64) reduces to the heuristically derived procedure reported in Hassoun and Clark (1988). An alternative way of writing Equation (3.1.64) is

Δb_i = ρ₁ e^i  and  Δw = −ρ₂ (1 − ρ₁) e^i z^i,  if e^i > 0        (3.1.65a)

Δb_i = 0  and  Δw = −ρ₂ e^i z^i,  if e^i ≤ 0        (3.1.65b)

where Δb and Δw signify the difference between the updated and current values of b and w, respectively. We will refer to this procedure as the AHK I learning rule. For comparison purposes, it may be noted that the μ-LMS rule in Equation (3.1.35) can be written as Δw = ρ₂ [b_i − (z^i)^T w] z^i, with b_i held fixed at +1.
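An illustrative implementation of the incremental AHK I updates follows, under our reading of Equation (3.1.65); the learning rates, margin initialization, and toy data are our own choices.

```python
import numpy as np

def ahk1_epoch(w, b, Z, rho1=0.5, rho2=0.1):
    """One pass of AHK I over all columns z^i of Z. The margin b[i]
    only ever increases (Delta b >= 0); w follows an LMS-style
    correction toward the adaptive margin."""
    for i in range(Z.shape[1]):
        z = Z[:, i]
        e = z @ w - b[i]                    # instantaneous margin error
        if e > 0:
            b[i] += rho1 * e                # grow the margin
            w -= rho2 * (1 - rho1) * e * z  # relax toward the new margin
        else:
            w -= rho2 * e * z               # push z.w up toward b[i]
    return w, b

# Toy linearly separable set: columns are z^i = d^i x^i (bias included).
Z = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [1.0, -1.0]])
w, b = np.zeros(3), np.full(2, 0.1)
for _ in range(500):
    w, b = ahk1_epoch(w, b, Z)

assert np.all(Z.T @ w > 0)   # both patterns classified with positive margin
assert np.all(b > 0)
```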

The implied constraint b_i > 0 in Equations (3.1.64) and (3.1.65) was realized by starting with a positive initial margin and restricting the margin change Δb_i to positive real values. An alternative, more flexible way to realize this constraint is to allow both positive and negative changes in Δb_i, except where a decrease in b_i would result in a negative margin. This modification results in the following alternative AHK II learning rule:

Δb_i = ρ₁ e^i  and  Δw = −ρ₂ (1 − ρ₁) e^i z^i,  if b_i + ρ₁ e^i > 0        (3.1.66a)

Δb_i = 0  and  Δw = −ρ₂ e^i z^i,  otherwise        (3.1.66b)

In the general case of an adaptive margin, as in Equation (3.1.66), Hassoun and Song (1992) showed that a sufficient condition for the convergence of the AHK rules is given by

(3.1.67a)

(3.1.67b)

Another variation results in the AHK III rule, which is appropriate for both linearly separable and nonlinearly separable problems. Here, Δw is set to 0 in Equation (3.1.66b). The advantages of the AHK III rule are that (1) it is capable of adaptively identifying difficult-to-separate class boundaries and (2) it uses such information to discard nonseparable training vectors and speed up convergence (Hassoun and Song, 1992). The reader is invited to apply the AHK III rule in Problem 3.1.7 to gain insight into the dynamics and separation behavior of this learning rule.

Example 3.1.2: In this example, the perceptron, LMS, and AHK learning rules are compared in terms of the quality of the solutions they generate. Consider the simple two-class linearly separable problem shown earlier in Figure 3.1.4. The μ-LMS rule of Equation (3.1.35) is used to obtain the solution shown as a dashed line in Figure 3.1.9. Here, the initial weight vector was set to 0 and a learning rate of 0.005 was used. This solution is not one of the linearly separable solutions for this problem. Four examples of linearly separable solutions are shown as solid lines in the figure. These solutions were generated using the perceptron learning rule of Equation (3.1.2), with varying order of input vector presentations and with a learning rate of 0.1. It should be noted that the most robust solution, in the sense of tolerance to noisy input, is the one shown as a dotted line in Figure 3.1.9. This robust solution was in fact automatically generated by the AHK I learning rule of Equation (3.1.65).

Figure 3.1.9. LMS generated decision boundary (dashed line) for a two-class linearly separable problem. For comparison, four solutions generated using the perceptron learning rule are shown (solid lines). The dotted line is the solution generated by the AHK I rule.

The SSE criterion function can be generalized to the Minkowski-r criterion function

J(w) = (1/r) Σi |d^i − y^i|^r        (3.1.68)

or its instantaneous version

J(w) = (1/r) |d − y|^r        (3.1.69)

Figure 3.1.10 shows a plot of |d - y|r for r = 1, 2, and 20. The general form of the gradient of this criterion function is given by

∇J(w) = −|d − y|^(r−1) sgn(d − y) f′(net) x        (3.1.70)

Note that for r = 2 this reduces to the gradient of the SSE criterion function given by Equation (3.1.52). If r = 1, then J(w) = |d − y|, with the gradient (note that the gradient of J(w) does not exist at the solution points d = y)

∇J(w) = −sgn(d − y) f′(net) x        (3.1.71)

In this case, the criterion function in Equation (3.1.68) is known as the Manhattan norm. As r → ∞, a supremum error measure is approached.

Figure 3.1.10. A family of instantaneous Minkowski-r criterion functions.

A small r gives less weight for large deviations and tends to reduce the influence of the outermost points in the input space during learning. It can be shown, for a linear unit with normally distributed inputs, that r = 2 is an appropriate choice in the sense of both minimum SSE and minimum probability of prediction error (maximum likelihood). The proof is as follows.

Consider the training pairs {x^i, d^i}, i = 1, 2, ..., m. Assume that the vectors x^i are drawn randomly and independently from a normal distribution. Then a linear unit with a fixed but unknown weight vector w outputs the estimate y^i = (x^i)^T w when presented with input x^i. Since a weighted sum of independent normally distributed random variables is itself normally distributed [e.g., see Mosteller et al. (1970)], y^i is normally distributed. Thus the prediction error d^i − y^i is normally distributed with mean zero and some variance, σ^2. This allows us to express the conditional probability density for observing target d^i, given w, upon the presentation of x^i as

P(d^i | w) = [1/(σ √(2π))] exp[−(d^i − y^i)^2 / (2σ^2)]        (3.1.72)

This function is also known as the likelihood of w with respect to observation di. The maximum likelihood estimate of w is that value of w which maximizes the probability of occurrence of observation di for input xi. The likelihood of w with respect to the whole training set is the joint distribution

P(d^1, d^2, ..., d^m | w) = Πi P(d^i | w)        (3.1.73)

Maximizing the above likelihood is equivalent to maximizing the log-likelihood function:

L(w) = ln [Πi P(d^i | w)] = −m ln(σ √(2π)) − [1/(2σ^2)] Σi (d^i − y^i)^2        (3.1.74)

Since the term −m ln(σ √(2π)) is a constant, maximizing the log-likelihood function in Equation (3.1.74) is equivalent to minimizing the SSE criterion

J(w) = (1/2) Σi (d^i − y^i)^2        (3.1.75)

Therefore, with the assumption of a linear unit (ADALINE) with normally distributed inputs, the SSE criterion is optimal in the sense of minimizing prediction error. However, if the input distribution is non-Gaussian, then the SSE criterion will not possess maximum likelihood properties. See Mosteller and Tukey (1980) for a more thorough discussion on the maximum likelihood estimation technique.

If the distribution of the training patterns has a heavy tail such as the Laplace-type distribution, r = 1 will be a better criterion function choice. This criterion function is known as robust regression since it is more robust to an outlier training sample than r = 2. Finally, 1 < r < 2 is appropriate to use for pseudo-Gaussian distributions where the distribution tails are more pronounced than in the Gaussian.
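The robustness argument can be illustrated numerically. The sketch below is our own example for a linear unit (so f′(net) = 1): incremental gradient descent on the instantaneous Minkowski-r criterion is run with r = 2 and r = 1 on data containing one gross outlier; the data, rates, and epoch counts are our choices.

```python
import numpy as np

def minkowski_fit(X, d, r, rho=0.01, epochs=200):
    """Incremental gradient descent on J = (1/r)|d - y|^r for a
    linear unit y = x.w (so f'(net) = 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, d_i in zip(X, d):
            err = d_i - x_i @ w
            # gradient weighting |err|^(r-1) sgn(err)
            w += rho * np.abs(err) ** (r - 1) * np.sign(err) * x_i
    return w

X = np.column_stack([np.linspace(0.0, 1.0, 20), np.ones(20)])
d = 2.0 * X[:, 0] + 0.5      # targets on the line d = 2x + 0.5
d[0] = 25.0                  # one gross outlier

w2 = minkowski_fit(X, d, r=2)
w1 = minkowski_fit(X, d, r=1)
# The r = 1 (robust regression) fit tracks the true slope of 2.0
# far more closely than the outlier-dominated r = 2 fit.
assert abs(w1[0] - 2.0) < abs(w2[0] - 2.0)
```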

Another criterion function that can be used (Baum and Wilczek, 1988; Hopfield, 1987; Solla et al., 1988) is the instantaneous relative entropy error measure (Kullback, 1959) defined by

J(w) = (1/2) [ (1 + d) ln((1 + d)/(1 + y)) + (1 − d) ln((1 − d)/(1 − y)) ]        (3.1.76)

where d belongs to the open interval (−1, +1). As before, J(w) ≥ 0, and if y = d then J(w) = 0. If y = f(net) = tanh(b net), the gradient of Equation (3.1.76) is

∇J(w) = −b (d − y) x        (3.1.77)

The factor f′(net) in Equations (3.1.53) and (3.1.70) is missing from Equation (3.1.77). This eliminates the flat spot encountered in the delta rule and makes training here behave more like μ-LMS (note, however, that y here is given by y = tanh(b net)). This entropy criterion is "well formed" in the sense that gradient descent over such a function will result in a linearly separable solution, if one exists (Wittner and Denker, 1988; Hertz et al., 1991). On the other hand, gradient descent on the SSE criterion function does not share this property, since it may fail to find a linearly separable solution, as demonstrated in Example 3.1.2.
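The claim that the f′(net) factor drops out of the entropy criterion's gradient is easy to verify numerically. In this sketch (our own check, with arbitrary values for w, x, d, and b), a central-difference gradient of the entropy measure is compared against −b(d − y)x:

```python
import numpy as np

def entropy_loss(w, x, d, b=1.0):
    """Instantaneous relative entropy error for y = tanh(b * x.w)."""
    y = np.tanh(b * (x @ w))
    return 0.5 * ((1 + d) * np.log((1 + d) / (1 + y))
                  + (1 - d) * np.log((1 - d) / (1 - y)))

w = np.array([0.3, -0.2])
x = np.array([1.0, 2.0])
d, b = 0.5, 1.0

y = np.tanh(b * (x @ w))
analytic = -b * (d - y) * x          # gradient with no f'(net) factor

eps = 1e-6
numeric = np.array([
    (entropy_loss(w + eps * np.eye(2)[i], x, d, b)
     - entropy_loss(w - eps * np.eye(2)[i], x, d, b)) / (2 * eps)
    for i in range(2)])
assert np.allclose(numeric, analytic, atol=1e-5)
```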

In order for gradient descent search to find a solution w* in the desired linearly separable region, we need to use a well-formed criterion function. Consider the following general criterion function

J(w) = Σi g(s^i)        (3.1.78)

where s^i = d^i net^i = d^i (x^i)^T w.

Let g′(s) = dg(s)/ds. Now, we say the criterion function is "well-formed" if g(s) is differentiable and satisfies the following conditions (Wittner and Denker, 1988):

1. For all s, −g′(s) ≥ 0; i.e., g does not push in the wrong direction.

2. There exists ε > 0, such that −g′(s) ≥ ε for all s ≤ 0; i.e., g keeps pushing if there is a misclassification.

3. g(s) is bounded below.

For a single unit with weight vector w, it can be shown (Wittner and Denker, 1988) that if the criterion function is well-formed, then gradient descent is guaranteed to enter the region of linearly separable solutions w*, provided that such a region exists.

Example 3.1.3: The perceptron criterion function in Equation (3.1.20) is a well-formed criterion function since it satisfies the above conditions:

1. Here g(s) = −s, thus −g′(s) = 1 > 0 for all s.

2. −g′(s) = 1 ≥ ε > 0 for all s ≤ 0 (any ε ≤ 1 will do).

3. g(s) is bounded below, since only misclassified samples, for which s ≤ 0, contribute to J; hence g(s) = −s ≥ 0.

The linear threshold gate, perceptron, and ADALINE are examples of deterministic units; for a given input, the unit always responds with the same output. On the other hand, a stochastic unit has a binary-valued output which is a probabilistic function of the input activity net, as depicted in Figure 3.1.11.

Figure 3.1.11. A stochastic unit.

Formally, y is given by

y = +1 with probability P(net), and y = −1 with probability 1 − P(net)        (3.1.79)

One possible probability function is P(net) = 1/(1 + e^(−2b net)). Hence, P(net) lies in (0, 1), approaching 1 (0) as net approaches +∞ (−∞). Also, note that the expected value of y, <y>, is given by

<y> = (+1) P(net) + (−1)[1 − P(net)] = 2P(net) − 1 = tanh(b net)        (3.1.80)
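A quick Monte-Carlo check of the mean-output relation, assuming the firing probability P(net) = 1/(1 + e^(−2b net)) (for which <y> = tanh(b net)); the values of b and net are our own choices:

```python
import numpy as np

rng = np.random.default_rng(42)
b, net = 1.0, 0.7
p = 1.0 / (1.0 + np.exp(-2.0 * b * net))   # firing probability P(net)

# Draw many bipolar outputs y in {+1, -1} and compare the sample
# mean with tanh(b * net) = 2 P(net) - 1.
samples = np.where(rng.random(200_000) < p, 1.0, -1.0)
assert abs(samples.mean() - np.tanh(b * net)) < 0.01
```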

Stochastic units are the basis for reinforcement learning networks, as is shown in the next section. Also, these units allow for a natural mapping of optimal stochastic learning and retrieval methods onto neural networks, as discussed in Section 8.3.

Let us now define a SSE criterion function in terms of the mean output of the stochastic unit:

J(w) = (1/2) Σi (d^i − <y^i>)^2        (3.1.81)

Employing gradient descent, we arrive at the update rule

w^(k+1) = w^k + ρ b Σi (d^i − <y^i>) [1 − <y^i>^2] x^i        (3.1.82)

In the incremental update mode, we have the following update rule:

w^(k+1) = w^k + ρ (d^k − <y^k>) b [1 − <y^k>^2] x^k        (3.1.83)

This learning rule is identical in form to the delta learning rule given in Equation (3.1.53) for a deterministic unit with activation function f(net) = tanh(b net). Therefore, in an average sense, the stochastic unit learning rule in Equation (3.1.83) leads to a weight vector which is equal to that obtained using the delta rule for a deterministic unit with a hyperbolic tangent activation function.
