Supervised learning is treated first. Here, two
groups of rules are discussed: Error correction rules and gradient
descent-based rules. By the end of this section it will be established
that all of these learning rules can be systematically derived
as minimizers of an appropriate criterion function.

Assume we are training the above perceptron to load
(learn) the training pairs {**x**^1, d^1}, {**x**^2, d^2}, ..., {**x**^m, d^m},
where **x**^k is the *k*th input vector
and d^k ∈ {−1, +1}, *k* = 1, 2, ..., *m*,
is the desired target for the *k*th input vector (usually,
the order of these training pairs is random). The entire collection
of these pairs is called the training set. The goal, then, is
to design a perceptron such that for each input vector **x***k*
of the training set, the perceptron output *y**k*
matches the desired target *d**k*;
that is, we require $y^k = \mathrm{sgn}\left(\mathbf{w}^T\mathbf{x}^k\right) = d^k$,
for each *k* = 1, 2, ..., *m*. In this case, we say
that the perceptron correctly classifies the training set. Of
course, "designing" an appropriate perceptron to correctly
classify the training set amounts to determining a weight vector
**w*** such that the
following relations are satisfied:

$(\mathbf{x}^k)^T\mathbf{w}^* > 0 \ \text{ if } d^k = +1 \quad\text{and}\quad (\mathbf{x}^k)^T\mathbf{w}^* < 0 \ \text{ if } d^k = -1, \qquad k = 1, 2, ..., m$   (3.1.1)

Recall that the set of all **x** which satisfy
**x**T**w*** = 0
defines a hyperplane in R*n*.
Thus, in the context of the above discussion, finding a solution
vector **w*** to Equation
(3.1.1) is equivalent to finding a separating hyperplane which
correctly classifies all vectors **x***k*,
*k* = 1, 2, ..., *m*. In other words, we desire a hyperplane
**x**T**w***
= 0 which partitions the input space into two distinct regions,
one containing all points **x***k*
with *d**k*
= +1 and the other region containing all points **x***k*
with *d**k*
= -1.

One possible incremental method for arriving at
a solution **w*** is
to invoke the perceptron learning rule (Rosenblatt, 1962):

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho\left[d^k - \mathrm{sgn}\!\left((\mathbf{x}^k)^T\mathbf{w}^k\right)\right]\mathbf{x}^k, \qquad k = 1, 2, ...$   (3.1.2)

where ρ is a positive constant, called the learning
rate. The incremental learning process given in Equation (3.1.2)
proceeds as follows. First, an initial weight vector **w**1
is selected (usually at random) to begin the process. Then the
*m* pairs {**x***k*,
*d**k*} of the
training set are used to successively update the weight vector
until (hopefully) a solution **w***
is found which correctly classifies the training set. This process
of sequentially presenting the training patterns is usually referred
to as "cycling" through the training set, and a complete
presentation of the *m* training pairs is referred to as
a cycle (or pass) through the training set. In general, more
than one cycle through the training set is required to determine
an appropriate solution vector. Hence, in Equation (3.1.2), the
superscript *k* in **w***k*
refers to the iteration number. On the other hand, the superscript
*k* in **x***k*
(and *d**k*)
is the label of the training pair presented at the *k*th
iteration. To be more precise, if the number of training pairs,
*m*, is finite, then the superscripts in **x***k*
and *d**k* should
be replaced by [(*k* - 1) mod *m*] + 1.
Here, *a *mod *b* returns the remainder of
the division of *a* by *b* (e.g., 5 mod 8 = 5,
8 mod 8 = 0, and 19 mod 8 = 3).
This observation is valid for all incremental learning rules
presented in this chapter.
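To make the cycling procedure concrete, here is a minimal sketch in Python (the function name and the small AND-like training set are illustrative assumptions, not from the text) of the perceptron rule of Equation (3.1.2) with ρ = 0.5:

```python
import numpy as np

def train_perceptron(X, d, rho=0.5, max_cycles=100):
    """Cycle through the training set, applying Equation (3.1.2):
    w <- w + rho * (d_k - sgn(w.x_k)) * x_k."""
    w = np.zeros(X.shape[1])                  # initial weight vector w^1
    for _ in range(max_cycles):               # one pass = one "cycle"
        corrections = 0
        for k in range(len(X)):
            y = 1.0 if X[k] @ w > 0 else -1.0 # perceptron output sgn(w^T x^k)
            if y != d[k]:                     # correct only on misclassification
                w += rho * (d[k] - y) * X[k]
                corrections += 1
        if corrections == 0:                  # training set correctly classified
            break
    return w

# Hypothetical 2-D training set; the third component is the constant bias input.
X = np.array([[0., 0., 1.], [1., 0., 1.], [0., 1., 1.], [1., 1., 1.]])
d = np.array([-1., -1., -1., 1.])
w = train_perceptron(X, d)
preds = np.where(X @ w > 0, 1., -1.)
```

Note that at ρ = 0.5 each correction simply adds or subtracts the full input vector, which is the form the text derives next.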

Notice that for ρ = 0.5, the perceptron
learning rule can be written as:

$\mathbf{w}^{k+1} = \begin{cases} \mathbf{w}^k + \mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le 0 \\ \mathbf{w}^k & \text{otherwise} \end{cases}$   (3.1.3)

where

$\mathbf{z}^k = d^k\mathbf{x}^k$   (3.1.4)

That is, a correction is done if and only if a misclassification,
indicated by

$(\mathbf{z}^k)^T\mathbf{w}^k \le 0$   (3.1.5)

occurs. The addition of vector **z**^k
to **w**^k in
Equation (3.1.3) moves the weight vector directly toward and perhaps
across the hyperplane $(\mathbf{z}^k)^T\mathbf{w} = 0$. The new inner
product $(\mathbf{z}^k)^T\mathbf{w}^{k+1}$ is larger than $(\mathbf{z}^k)^T\mathbf{w}^k$
by the amount of $\|\mathbf{z}^k\|^2$, and the correction Δ**w**^k = **w**^{k+1} − **w**^k
is clearly moving **w**^k
in a good direction, the direction of increasing $(\mathbf{z}^k)^T\mathbf{w}$,
as can be seen from Figure 3.1.2. Thus, the perceptron learning
rule attempts to find a solution **w***
for the following system of inequalities

$(\mathbf{z}^k)^T\mathbf{w} > 0 \qquad \text{for } k = 1, 2, ..., m$   (3.1.6)

In an analysis of any learning algorithm, and in
particular the perceptron learning algorithm of Equation (3.1.2),
there are two main issues to consider: (1) The existence of solutions
and (2) convergence of the algorithm to the desired solutions
(if they exist). In the case of the perceptron, it is clear that
a solution vector (i.e., a vector **w***
which correctly classifies the training set) exists if and only
if the given training set is linearly separable. Assuming, then,
that the training set is linearly separable, we may proceed to
show that the perceptron learning rule converges to a solution
(Novikoff, 1962; Ridgway, 1962; Nilsson, 1965) as follows. Let
**w*** be any solution
vector, so that

for *k* = 1, 2, ..., *m
*(3.1.7)

Then, from Equation (3.1.3), if the *k*th pattern
is misclassified we may write

$\mathbf{w}^{k+1} - \alpha\mathbf{w}^* = \mathbf{w}^k - \alpha\mathbf{w}^* + \mathbf{z}^k$   (3.1.8)

where α is a positive scale factor, and hence

$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 = \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 + 2\,(\mathbf{z}^k)^T(\mathbf{w}^k - \alpha\mathbf{w}^*) + \left\|\mathbf{z}^k\right\|^2$   (3.1.9)

Since **z**^k
is misclassified, we have $(\mathbf{z}^k)^T\mathbf{w}^k \le 0$, and thus

$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 - 2\alpha\,(\mathbf{z}^k)^T\mathbf{w}^* + \left\|\mathbf{z}^k\right\|^2$   (3.1.10)

Now, let $\beta^2 = \max_k \left\|\mathbf{z}^k\right\|^2$
and $\gamma = \min_k\,(\mathbf{z}^k)^T\mathbf{w}^*$
(γ is positive since $(\mathbf{z}^k)^T\mathbf{w}^* > 0$)
and substitute into Equation (3.1.10) to get

$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 - 2\alpha\gamma + \beta^2$   (3.1.11)

If we choose α sufficiently large, in particular $\alpha = \beta^2/\gamma$,
we obtain

$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^k - \alpha\mathbf{w}^*\right\|^2 - \beta^2$   (3.1.12)

Thus, the square distance between **w**^k
and α**w***
is reduced by at least β²
at each correction, and after *k* corrections we may write
Equation (3.1.12) as

$\left\|\mathbf{w}^{k+1} - \alpha\mathbf{w}^*\right\|^2 \le \left\|\mathbf{w}^1 - \alpha\mathbf{w}^*\right\|^2 - k\,\beta^2$   (3.1.13)

It follows that the sequence of corrections must
terminate after no more than k₀
corrections, where

$k_0 = \frac{\left\|\mathbf{w}^1 - \alpha\mathbf{w}^*\right\|^2}{\beta^2}$   (3.1.14)

Therefore, if a solution exists, it is achieved in
a finite number of iterations. When corrections cease, the resulting
weight vector must classify all the samples correctly since a
correction occurs whenever a sample is misclassified and since
each sample appears infinitely often in the sequence. In general,
a linearly separable problem admits an infinite number of solutions.
The perceptron learning rule in Equation (3.1.2) converges to
one of these solutions. This solution, though, is sensitive to
the value of the learning rate ρ used and to the order of presentation
of the training pairs. This sensitivity is responsible for the
varying quality of the perceptron generated separating surface
observed in simulations.

The bound on the number of corrections, *k*0,
given by Equation (3.1.14) depends on the choice of the initial
weight vector **w**1.
If **w**^1 = **0**,
we get $k_0 = \frac{\alpha^2\left\|\mathbf{w}^*\right\|^2}{\beta^2}$, or, upon substituting $\alpha = \beta^2/\gamma$,

$k_0 = \frac{\beta^2\left\|\mathbf{w}^*\right\|^2}{\gamma^2}$   (3.1.15)

Here, *k*0
is a function of the initially unknown solution weight vector
**w***. Therefore,
Equation (3.1.15) is of no help for predicting the maximum number
of corrections. However, the denominator of Equation (3.1.15)
implies that the difficulty of the problem is essentially determined
by the samples most nearly orthogonal to the solution vector.

Generalizations of the Perceptron Learning Rule

The perceptron learning rule may be generalized
to include a variable increment ρ^k
and a fixed, positive margin *b*. This generalized learning
rule updates the weight vector whenever $(\mathbf{z}^k)^T\mathbf{w}^k$
fails to exceed the margin *b*. Here, the algorithm for
weight vector update is given by

$\mathbf{w}^{k+1} = \begin{cases} \mathbf{w}^k + \rho^k\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le b \\ \mathbf{w}^k & \text{otherwise} \end{cases}$   (3.1.16)

The margin *b* is useful because it gives a
dead-zone robustness to the decision boundary. That is, the perceptron's
decision hyperplane is constrained to lie in a region between
the two classes such that sufficient clearance is realized between
this hyperplane and the extreme points (boundary patterns) of
the training set. This makes the perceptron robust with respect
to noisy inputs. It can be shown (Duda and Hart, 1973) that if
the training set is linearly separable and if the following three
conditions are satisfied:

$\rho^k \ge 0$   (3.1.17a)

$\lim_{K\to\infty}\sum_{k=1}^{K}\rho^k = \infty$   (3.1.17b)

$\lim_{K\to\infty}\frac{\sum_{k=1}^{K}\left(\rho^k\right)^2}{\left(\sum_{k=1}^{K}\rho^k\right)^2} = 0$   (3.1.17c)

(e.g., ρ^k = 1/k or even ρ^k = ρk)
then **w**^k
converges to a solution **w***
which satisfies $(\mathbf{z}^i)^T\mathbf{w}^* > b$ for *i* = 1, 2, ..., *m*.
Furthermore, when ρ^k
is fixed at a positive constant ρ, this learning rule converges
in finite time.
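The margin-based update of Equation (3.1.16) can be sketched as follows; for simplicity this illustration uses the fixed-increment case ρ^k = ρ (which, as noted above, converges in finite time), and a hypothetical training set:

```python
import numpy as np

def margin_perceptron(X, d, b=0.1, rho=0.5, cycles=100):
    """Perceptron with margin, Equation (3.1.16), fixed increment rho:
    update whenever z^T w fails to exceed the margin b."""
    Z = d[:, None] * X                 # z^k = d^k x^k, one pattern per row
    w = np.zeros(X.shape[1])
    for _ in range(cycles):
        for z in Z:
            if z @ w <= b:             # inside the dead zone: correct
                w += rho * z
    return w

# Hypothetical AND-like set; third component is the constant bias input.
X = np.array([[0., 0., 1.], [1., 0., 1.], [0., 1., 1.], [1., 1., 1.]])
d = np.array([-1., -1., -1., 1.])
w = margin_perceptron(X, d)
margins = (d[:, None] * X) @ w         # z^T w for every training pattern
```

After convergence every pattern clears the margin, i.e., all entries of `margins` exceed *b*, which is the "dead-zone robustness" described above.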

Another variant of the perceptron learning rule
is given by the "batch" update procedure

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho \sum_{\mathbf{z}\in Z(\mathbf{w}^k)} \mathbf{z}$   (3.1.18)

where Z(**w**^k)
is the set of patterns **z** misclassified by **w**^k.
Here, the weight vector change is along
the direction of the resultant vector of all misclassified patterns.
In general, this update procedure converges faster than the perceptron
rule, but it requires more storage.
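A sketch of this batch procedure (the data and helper name are illustrative assumptions):

```python
import numpy as np

def batch_perceptron(X, d, rho=0.1, steps=200):
    """Batch rule (3.1.18): step along the resultant of all misclassified z's."""
    Z = d[:, None] * X                  # z^k = d^k x^k
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        mis = (Z @ w) <= 0              # patterns misclassified by current w
        if not mis.any():
            break                       # Z(w) empty: training set classified
        w += rho * Z[mis].sum(axis=0)   # resultant vector of misclassified z's
    return w

# Hypothetical AND-like set; third component is the constant bias input.
X = np.array([[0., 0., 1.], [1., 0., 1.], [0., 1., 1.], [1., 1., 1.]])
d = np.array([-1., -1., -1., 1.])
w = batch_perceptron(X, d)
preds = np.where(X @ w > 0, 1., -1.)
```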

In the nonlinearly separable case, the above algorithms
do not converge. Few theoretical results are available on the
behavior of these algorithms for nonlinearly separable problems
[see Minsky and Papert (1969) and Block and Levin (1970) for some
preliminary results]. For example, it is known that the length
of **w** in the perceptron rule is bounded, i.e., tends to
fluctuate near some limiting value ||**w***||
(Efron, 1964). This information may be used to terminate the
search for **w***.
Another approach is to average the weight vectors near the fluctuation
point **w***. Butz
(1967) proposed the use of a reinforcement factor γ, 0 ≤ γ ≤ 1, in
the perceptron learning rule. This reinforcement places **w**
in a region that tends to minimize the probability of error for
nonlinearly separable cases. Butz's rule is as follows

$\mathbf{w}^{k+1} = \begin{cases} \mathbf{w}^k + \rho\,\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le 0 \\ \mathbf{w}^k + \rho\gamma\,\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k > 0 \end{cases}$   (3.1.19)

The Perceptron Criterion Function

It is interesting to see how the above error correction
rules can be derived by a gradient descent on an appropriate criterion
(objective) function. For the perceptron, we may define the following
criterion function (Duda and Hart, 1973):

$J(\mathbf{w}) = \sum_{\mathbf{z}\in Z(\mathbf{w})} \left(-\mathbf{z}^T\mathbf{w}\right)$   (3.1.20)

where Z(**w**) is the set of samples misclassified
by **w** (i.e., $\mathbf{z}^T\mathbf{w} \le 0$).
Note that if *Z*(**w**) is empty then *J*(**w**)
= 0, otherwise *J*(**w**) > 0. Geometrically, *J*(**w**)
is proportional to the sum of the distances from the misclassified
samples to the decision boundary. The smaller *J* is, the
better the weight vector **w** will be.

Given this objective function *J*(**w**),
we can incrementally improve the search point **w***k*
at each iteration by sliding downhill on the surface defined by
*J*(**w**) in **w** space. Specifically, we may use
*J* to perform a discrete gradient descent search, which
updates **w***k*
so that a step is taken downhill in the "steepest" direction
along the search surface *J*(**w**) at **w***k*.
This can be achieved by making Δ**w**^k
proportional to the negative of the gradient of *J* at the present location
**w**^k; formally,
we may write

$\mathbf{w}^{k+1} = \mathbf{w}^k - \rho\,\nabla J(\mathbf{w}^k)$   (3.1.21)

Here, the initial search point, **w**^1,
and the learning rate (step size) ρ are to be specified by the
user. We will refer to Equation (3.1.21) as the steepest gradient
descent search rule or, simply, gradient descent. Next, substituting
the gradient

$\nabla J(\mathbf{w}) = -\sum_{\mathbf{z}\in Z(\mathbf{w})} \mathbf{z}$   (3.1.22)

in Equation (3.1.21), leads to the weight update
rule

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho \sum_{\mathbf{z}\in Z(\mathbf{w}^k)} \mathbf{z}$   (3.1.23)

The learning rule given in Equation (3.1.23) is identical
to the multiple-sample (batch) perceptron rule of Equation (3.1.18).
The original perceptron learning rule of Equation (3.1.3) can
be thought of as an "incremental" gradient descent search
rule for minimizing the perceptron criterion function in Equation
(3.1.20). Following a similar procedure as in Equations (3.1.21)
through (3.1.23), it can be shown that

$J(\mathbf{w}) = \sum_{\mathbf{z}\in Z(\mathbf{w})} \left(b - \mathbf{z}^T\mathbf{w}\right), \qquad Z(\mathbf{w}) = \left\{\mathbf{z} : \mathbf{z}^T\mathbf{w} \le b\right\}$   (3.1.24)

is the appropriate criterion function for the modified
perceptron rule in Equation (3.1.16).

Before moving on, we should note that the gradient
of *J* in Equation (3.1.22) is not mathematically precise.
Due to the piecewise linear nature of *J*, sudden changes
in the gradient of *J* occur every time the perceptron output
*y* goes through a transition at $(\mathbf{z}^k)^T\mathbf{w} = 0$.
Therefore, the gradient of *J* is not defined at "transition"
points **w** satisfying $(\mathbf{z}^k)^T\mathbf{w} = 0$, *k* = 1, 2, ..., *m*.
However, because of the discrete nature of Equation (3.1.21),
the likelihood of **w**^k
overlapping with one of these transition points is negligible,
and thus we may still express ∇*J* as in Equation (3.1.22).
The reader is referred to Problem 3.1.3 for further exploration
into gradient descent on the perceptron criterion function.

Mays Learning Rule

The criterion functions in Equations (3.1.20) and
(3.1.24) are by no means the only functions we can construct that
are minimized when **w** is a solution vector. For example,
an alternative function is the quadratic function

$J(\mathbf{w}) = \frac{1}{2}\sum_{\mathbf{z}\in Z(\mathbf{w})} \left(\mathbf{z}^T\mathbf{w} - b\right)^2$   (3.1.25)

where *b* is a positive constant margin. Like
the previous criterion functions, the function *J*(**w**)
in Equation (3.1.25) focuses attention on the misclassified samples.
Its major difference is that its gradient is continuous, whereas
the gradient of the perceptron criterion function, with or without
the use of margin, is not. Unfortunately, the present function
can be dominated by the input vectors with the largest magnitudes.
We may eliminate this undesirable effect by dividing by $\|\mathbf{z}\|^2$:

$J(\mathbf{w}) = \frac{1}{2}\sum_{\mathbf{z}\in Z(\mathbf{w})} \frac{\left(\mathbf{z}^T\mathbf{w} - b\right)^2}{\left\|\mathbf{z}\right\|^2}$   (3.1.26)

The gradient of *J*(**w**) in Equation (3.1.26)
is given by

$\nabla J(\mathbf{w}) = \sum_{\mathbf{z}\in Z(\mathbf{w})} \frac{\mathbf{z}^T\mathbf{w} - b}{\left\|\mathbf{z}\right\|^2}\,\mathbf{z}$   (3.1.27)

which, upon substituting in Equation (3.1.21), leads
to the following learning rule

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho \sum_{\mathbf{z}\in Z(\mathbf{w}^k)} \frac{b - \mathbf{z}^T\mathbf{w}^k}{\left\|\mathbf{z}\right\|^2}\,\mathbf{z}$   (3.1.28)

If we consider the incremental update version of Equation (3.1.28), we arrive at Mays' rule (Mays, 1964):

$\mathbf{w}^{k+1} = \begin{cases} \mathbf{w}^k + \rho\,\dfrac{b - (\mathbf{z}^k)^T\mathbf{w}^k}{\left\|\mathbf{z}^k\right\|^2}\,\mathbf{z}^k & \text{if } (\mathbf{z}^k)^T\mathbf{w}^k \le b \\ \mathbf{w}^k & \text{otherwise} \end{cases}$   (3.1.29)

If the training set is linearly separable, Mays'
rule converges in a finite number of iterations, for 0 < ρ
< 2 (Duda and Hart, 1973). In the case of a nonlinearly separable
training set, the training procedure in Equation (3.1.29) will
never converge. To fix this problem, a decreasing learning rate
such as ρ^k = 1/k may be used to force convergence to
some approximate separating surface (Duda and Singleton, 1964).
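Mays' rule of Equation (3.1.29) can be sketched as below. One property worth observing: with ρ = 1, a single correction places **w** exactly on the hyperplane z^T w = b. The function names and values are hypothetical illustrations:

```python
import numpy as np

def mays_step(w, z, b=0.5, rho=1.0):
    """One increment of Mays' rule (3.1.29): correct only when z^T w <= b,
    with the correction normalized by ||z||^2."""
    zw = z @ w
    if zw <= b:
        w = w + rho * (b - zw) / (z @ z) * z
    return w

def mays_train(X, d, b=0.5, rho=1.0, cycles=100):
    """Cycle through the training set applying mays_step to each z^k = d^k x^k."""
    Z = d[:, None] * X
    w = np.zeros(X.shape[1])
    for _ in range(cycles):
        for z in Z:
            w = mays_step(w, z, b, rho)
    return w

# With rho = 1, a correction moves w exactly onto the hyperplane z^T w = b:
z = np.array([1., 2., 1.])
w = mays_step(np.zeros(3), z, b=0.5, rho=1.0)
```

The normalization by ||z||² is what makes the step size self-scaling with respect to the input magnitude, the property contrasted with Equation (3.1.25) above.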

Widrow-Hoff (α-LMS) Learning Rule

Another example of an error correcting rule with
a quadratic criterion function is the Widrow-Hoff rule (Widrow
and Hoff, 1960). This rule was originally used to train the linear
unit, also known as the adaptive linear combiner element (ADALINE),
shown in Figure 3.1.3. In this case, the output of the linear
unit in response to the input **x***k*
is simply *y**k* = (**x***k*)T**w**.
The Widrow-Hoff rule was originally proposed as an ad hoc rule,
which embodies the so-called minimal disturbance principle. Later,
it was discovered (Widrow and Stearns, 1985) that this rule converges
in the mean square to the solution **w***
which corresponds to the least-mean-square (LMS) output error,
if all input patterns are of the same length (i.e., ||**x***k*||
is the same for all *k*). Therefore, this rule is sometimes
referred to as the α-LMS rule (the "α" is used here to
distinguish this rule from another very similar rule which is
discussed in Section 3.1.2).

The α-LMS rule is given by

$\mathbf{w}^{k+1} = \mathbf{w}^k + \alpha\,\frac{d^k - (\mathbf{x}^k)^T\mathbf{w}^k}{\left\|\mathbf{x}^k\right\|^2}\,\mathbf{x}^k$   (3.1.30)

where d^k ∈ R is the desired response, and α > 0. Equation (3.1.30)
is similar to the perceptron rule if one sets ρ, in Equation (3.1.2),
as

$\rho = \frac{\alpha}{\left\|\mathbf{x}^k\right\|^2}$   (3.1.31)

However, the error in Equation (3.1.30) is measured
at the linear output, not after the nonlinearity as in the perceptron.
The constant α
controls the stability and speed of convergence (Widrow and Stearns,
1985; Widrow and Lehr, 1990). If the input vectors are independent
over time, stability is ensured for most practical purposes if
0 < α < 2.

As with Mays' rule, this rule is self-normalizing
in the sense that the choice of α does not depend on the magnitude
of the input vectors. Since the α-LMS rule selects Δ**w**^k
to be collinear with **x**^k,
the desired error correction is achieved with a weight change
of the smallest possible magnitude. Thus, when adapting to learn
a new training sample, the responses to previous training samples
are minimally disturbed, on the average. This is the basic idea
behind the minimal disturbance principle on which the α-LMS rule is
founded. Alternatively, one can show that the α-LMS learning rule
is a gradient descent minimizer of an appropriate quadratic criterion
function (see Problem 3.1.4).
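A sketch of the α-LMS update of Equation (3.1.30), illustrating both the minimal disturbance property (the correction is collinear with **x**) and the fact that the linear error shrinks by exactly the factor (1 − α) after each step (the vector values are hypothetical):

```python
import numpy as np

def alpha_lms_step(w, x, d, alpha=0.5):
    """alpha-LMS (3.1.30): error measured at the linear output, with the
    correction normalized by ||x||^2 so the response moves a fixed
    fraction alpha of the way toward the target d."""
    err = d - x @ w                     # linear error, d is a real number
    return w + alpha * err / (x @ x) * x

x = np.array([1., 2., 2.])
w0 = np.zeros(3)
w1 = alpha_lms_step(w0, x, d=1.0, alpha=0.5)
# The weight change (w1 - w0) is collinear with x (minimal disturbance),
# and the remaining linear error is (1 - alpha) times the old error.
```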

**3.1.2 Other Gradient Descent-Based Learning Rules**

The μ-LMS learning rule (Widrow and Hoff, 1960) represents
the most analyzed and most applied simple learning rule. It is
also of special importance due to its possible extension to learning
in multiple unit neural nets. Therefore, special attention is
given to this rule in this chapter. In the following, the μ-LMS
rule is described in the context of the linear unit in Figure
3.1.3.

Let

$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{m} \left(d^i - y^i\right)^2$   (3.1.32)

be the sum of squared error (SSE) criterion function,
where

$y^i = (\mathbf{x}^i)^T\mathbf{w}$   (3.1.33)

Now, using steepest gradient descent search to minimize
*J*(**w**) in Equation (3.1.32) gives

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho \sum_{i=1}^{m} \left(d^i - y^i\right)\mathbf{x}^i$   (3.1.34)

The criterion function *J*(**w**) in Equation
(3.1.32) is quadratic in the weights because of the linear relation
between *y**i*
and **w**. In fact, *J*(**w**) defines a convex hyperparaboloidal
surface with a single minimum **w***
(the global minimum). Therefore, if the positive constant ρ is
chosen sufficiently small, the gradient descent search implemented
by Equation (3.1.34) will asymptotically converge towards the
solution **w***, regardless
of the setting of the initial search point, **w**1.
The learning rule in Equation (3.1.34) is sometimes referred
to as the "batch" LMS rule.

The incremental version of Equation (3.1.34), known
as the μ-LMS
or LMS rule, is given by

$\mathbf{w}^{k+1} = \mathbf{w}^k + \mu\left(d^k - (\mathbf{x}^k)^T\mathbf{w}^k\right)\mathbf{x}^k$   (3.1.35)

Note that this rule becomes identical to the α-LMS
learning rule in Equation (3.1.30) upon setting μ as

$\mu = \frac{\alpha}{\left\|\mathbf{x}^k\right\|^2}$   (3.1.36)

Also, when the input vectors have the same length,
as would be the case when **x** ∈ {−1, +1}^n,
then the μ-LMS rule becomes identical to the α-LMS rule. Since
the α-LMS
learning algorithm converges when 0 < α
< 2, we can start from Equation (3.1.36) and calculate the
required range on μ for ensuring the convergence of the μ-LMS rule:

$0 < \mu < \frac{2}{\left\|\mathbf{x}^k\right\|^2}$   (3.1.37)

For input patterns independent over time and generated
by a stationary process, convergence of the mean of the weight
vector, <**w**^k>,
is ensured if the fixed learning rate μ is chosen to be smaller
than $1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue
of the autocorrelation matrix **C** = <**xx**^T> defined later in this
section. (Widrow and Stearns, 1985. Also
see Problem 4.3.8 in the next chapter for further exploration.)
In this case, <**w**^k>
approaches a solution **w***
as k → ∞. Here, "<·>"
signifies the "mean" or expected value. Note that the
bound $\mu < 1/\lambda_{\max}$ is less restrictive than the one
in Equation (3.1.37). Horowitz and Senne (1981) showed that this
bound on μ guarantees the convergence
of **w** in the mean square (i.e., $\langle\|\mathbf{w}^k - \mathbf{w}^*\|^2\rangle \to 0$
as k → ∞), for input patterns generated by a zero-mean
Gaussian process independent over time. It should be noted that
convergence in the mean square implies convergence in the mean;
however, the converse is not necessarily true. The assumptions
of decorrelated patterns and stationarity are not necessary conditions
for the convergence of μ-LMS (Widrow et al., 1976; Farden, 1981).
For example, Macchi and Eweda (1983) have a much stronger result
regarding convergence of the μ-LMS rule which is even valid when
a finite number of successive training patterns are strongly correlated.

In practical problems, *m* > *n* + 1;
hence, it becomes impossible to satisfy all requirements (**x**^k)^T**w**
= d^k, *k* = 1, 2, ..., *m*.
Therefore, Equation (3.1.35) never converges. Thus, for convergence,
μ is set to a decreasing sequence such as μ^k = μ₀/k > 0, where μ₀
is a small positive constant. In applications such as linear
filtering, though, the decreasing step size is not very valuable,
because it cannot accommodate non-stationarity in the input signal.
Indeed, **w**^k
will essentially stop changing for large *k*, which precludes
the tracking of time variations. Thus, the fixed increment (constant
μ) LMS learning rule has the advantage of limited memory, which
enables it to track time fluctuations in the input data.

When the learning rate is sufficiently small, the
μ-LMS rule becomes a "good" approximation to the gradient
descent rule in Equation (3.1.34). This means that the weight
vector **w**^k
will tend to move towards the global minimum **w***
of the convex SSE criterion function. Next, we show that **w***
is given by

$\mathbf{w}^* = \mathbf{X}^{\dagger}\mathbf{d}$   (3.1.38)

where **X** = [**x**1 **x**2 ... **x***m*],
**d** = [*d*1 *d*2 ... *d**m*]T,
and **X**† = (**XX**T)-1**X**
is the generalized inverse or pseudo-inverse (Penrose, 1955) of
**X** for *m* > *n *+ 1.

The extreme points (minima and maxima) of the function
*J*(**w**) are solutions to the equation

$\nabla J(\mathbf{w}) = \mathbf{0}$   (3.1.39)

Therefore, any minimum of the SSE criterion function
in Equation (3.1.32) must satisfy

$\nabla J(\mathbf{w}) = -\sum_{i=1}^{m}\left(d^i - (\mathbf{x}^i)^T\mathbf{w}\right)\mathbf{x}^i = \mathbf{0}$   (3.1.40)

Equation (3.1.40) can be rewritten as

**X X**T **w** = **X** **d**
(3.1.41)

which for a nonsingular matrix **X X**T
gives the solution in Equation (3.1.38), or explicitly

$\mathbf{w}^* = \left(\mathbf{X}\mathbf{X}^T\right)^{-1}\mathbf{X}\,\mathbf{d}$   (3.1.42)

Recall that just because **w***
in Equation (3.1.42) satisfies the condition ∇J(**w***) = **0**,
this does not guarantee that **w***
is a local minimum of the criterion function *J*. It does,
however, considerably narrow the choices in that such **w***
represents (in a local sense) either a point of minimum, maximum,
or saddle point of *J*. To verify that **w***
is actually a minimum of *J*(**w**), we may evaluate the
second derivative or Hessian matrix ∇²J of
*J* at **w***
and show that it is positive definite. This can be readily achieved
after noting that ∇²J is equal to the positive-definite matrix
**XX**^T. Thus, **w***
is a minimum of *J*.
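The minimum SSE solution of Equation (3.1.42) can be checked numerically. In the sketch below (hypothetical data), the columns of X are the training vectors x^k, following the text's convention, and the result is cross-checked against NumPy's pseudo-inverse:

```python
import numpy as np

# Hypothetical data: each column of X is a training vector x^k, with a
# constant bias component of 1 in the last row; m = 4 > n + 1 = 3.
X = np.array([[0., 1., 0., 2.],
              [0., 0., 1., 2.],
              [1., 1., 1., 1.]])
d = np.array([-1., -1., -1., 1.])

# Minimum SSE solution of Equation (3.1.42): w* = (X X^T)^{-1} X d,
# obtained by solving the normal equations (3.1.41): X X^T w = X d.
w_sse = np.linalg.solve(X @ X.T, X @ d)

# Cross-check with the pseudo-inverse computed by NumPy (X.T has
# full column rank here, so pinv(X.T) equals (X X^T)^{-1} X).
w_pinv = np.linalg.pinv(X.T) @ d
```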

The LMS rule may also be applied to synthesize the
weight vector, **w**, of a perceptron for solving two-class
classification problems. Here, one starts by training the linear
unit in Figure 3.1.3 with the given training pairs {**x***k*, *d**k*},
*k* = 1, 2, ..., *m*, using
the LMS rule. During training the desired target d^k
is set to +1 for one class and to −1 for
the other class. (In fact, any positive constant can be used
as the target for one class, and any negative constant can be
used as the target for the other class). After convergence of
the learning process, the solution vector obtained may now be
used in the perceptron for classification. Due to the thresholding
nonlinearity in the perceptron, the output of the classifier will
now be properly restricted to the set {-1, +1}.

When used as a perceptron weight vector, the minimum
SSE solution in Equation (3.1.42) does not generally minimize
the perceptron classification error rate. This should not be
surprising, since the SSE criterion function is not designed to
constrain its minimum inside the linearly separable solution region.
Therefore, this solution does not necessarily represent a linear
separable solution, even when the training set is linearly separable
(this is further explored in Section 3.1.5). However, when the
training set is nonlinearly separable, the solution arrived at
may still be a useful approximation. Therefore, by employing
the LMS rule for perceptron training, linear separability is sacrificed
for good compromise performance on both separable and nonseparable
problems.

The μ-LMS Rule as a Stochastic Process

Stochastic approximation theory may be employed
as an alternative to the deterministic gradient descent analysis
presented thus far. It has the advantage of naturally arriving
at a learning rate schedule ρ^k
for asymptotic convergence in the mean square. Here, one starts
with the mean-square error (MSE) criterion function:

$J(\mathbf{w}) = \frac{1}{2}\left\langle \left(d - \mathbf{x}^T\mathbf{w}\right)^2 \right\rangle$   (3.1.43)

where <·> denotes the mean (expectation)
over all training vectors. Now, one may compute the gradient
of *J* as

$\nabla J(\mathbf{w}) = -\left\langle \left(d - \mathbf{x}^T\mathbf{w}\right)\mathbf{x} \right\rangle = \mathbf{C}\mathbf{w} - \mathbf{P}$   (3.1.44)

which upon setting to zero, allows us to find the
minimum **w*** of *J*
in Equation (3.1.43) as the solution of

$\mathbf{C}\mathbf{w} = \mathbf{P}$

which gives

$\mathbf{w}^* = \mathbf{C}^{-1}\mathbf{P}$   (3.1.45)

where $\mathbf{C} = \left\langle \mathbf{x}\mathbf{x}^T \right\rangle$ and $\mathbf{P} = \left\langle d\,\mathbf{x} \right\rangle$.
Note that the expected value of a vector or a matrix is found
by taking the expected values of its components. We refer to
**C** as the autocorrelation matrix of the input vectors and
to **P** as the cross-correlation vector between the input
vector **x** and its associated desired target *d* (more
to follow on the properties of **C** in Section 3.3.1). In
Equation (3.1.45), the determinant of **C**, |**C**|, is
assumed different from zero. The solution **w***
in Equation (3.1.45) is sometimes called the Wiener weight vector
(Widrow and Stearns, 1985). It represents the minimum MSE solution,
also known as the least mean square (LMS) solution.

It is interesting to note here, the close relation
between the minimum SSE solution in Equation (3.1.42) and the
LMS or minimum MSE solution in Equation (3.1.45). In fact, one
can show that when the size of the training set *m* is large,
then the minimum SSE solution converges to the minimum MSE solution.

First, let us express **XX**^T
as the sum of vector outer products, $\mathbf{X}\mathbf{X}^T = \sum_{i=1}^{m}\mathbf{x}^i(\mathbf{x}^i)^T$.
We can also rewrite **Xd** as $\mathbf{X}\mathbf{d} = \sum_{i=1}^{m} d^i\mathbf{x}^i$. This
representation allows us to express Equation (3.1.42) as

$\mathbf{w}^* = \left(\sum_{i=1}^{m}\mathbf{x}^i(\mathbf{x}^i)^T\right)^{-1}\sum_{i=1}^{m} d^i\mathbf{x}^i$

Now, multiplying the right hand side of the above
equation by $\frac{1/m}{1/m}$ allows us to express it as

$\mathbf{w}^* = \left(\frac{1}{m}\sum_{i=1}^{m}\mathbf{x}^i(\mathbf{x}^i)^T\right)^{-1}\frac{1}{m}\sum_{i=1}^{m} d^i\mathbf{x}^i$

Finally, if *m* is large, the averages $\frac{1}{m}\sum_{i=1}^{m}\mathbf{x}^i(\mathbf{x}^i)^T$
and $\frac{1}{m}\sum_{i=1}^{m} d^i\mathbf{x}^i$ become very good approximations of
the expectations **C** = <**xx**^T>
and **P** = <d **x**>, respectively.
Thus, we have established the equivalence of the minimum SSE
and minimum MSE solutions for a large training set.

Next, in order to minimize the MSE criterion, one
may employ a gradient descent procedure, where, instead of the
expected gradient in Equation (3.1.44), the instantaneous gradient
[(**x***k*)T
**w***k* -
*d**k*] **x***k*
is used. Here, at each learning step the input vector **x**
is drawn at random. This leads to the stochastic process:

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho^k\left(d^k - (\mathbf{x}^k)^T\mathbf{w}^k\right)\mathbf{x}^k$   (3.1.46)

which is the same as the μ-LMS
rule in Equation (3.1.35) except for a variable learning rate
ρ^k. It can be shown
that if |**C**| ≠ 0 and ρ^k
satisfies the three conditions:

1. $\rho^k > 0$   (3.1.47a)

2. $\sum_{k=1}^{\infty}\rho^k = \infty$   (3.1.47b)

3. $\sum_{k=1}^{\infty}\left(\rho^k\right)^2 < \infty$   (3.1.47c)

then, **w**^k
converges to **w***
in Equation (3.1.45) asymptotically in the mean square; i.e.,

$\lim_{k\to\infty}\left\langle \left\|\mathbf{w}^k - \mathbf{w}^*\right\|^2 \right\rangle = 0$   (3.1.48)

The criterion function in Equation (3.1.43) is of
the form $J(\mathbf{w}) = \left\langle g(\mathbf{w}, \mathbf{x}, d)\right\rangle$ and is known as a regression
function. The iterative algorithm in Equation (3.1.46) is also
known as a stochastic approximation procedure (or Kiefer-Wolfowitz
or Robbins-Monro procedure). For a thorough discussion of stochastic
approximation theory, the reader is referred to Wasan (1969).

**Example 3.1.1: **In this
example, we present the results of a set of simulations which
should help give some insight into the dynamics of the batch and
incremental LMS learning rules. Specifically, we are interested
in comparing the convergence behavior of the discrete-time dynamical
systems in Equations (3.1.34) and (3.1.35). Consider the training
set depicted in Figure 3.1.4 for a simple mapping problem. The
ten squares and ten filled circles in this figure are positioned
at the points whose coordinates (*x*1, *x*2)
specify the two components of the input vectors. The squares
and circles are to be mapped to the targets +1 and -1,
respectively. For example, the left most square in the figure
represents the training pair {[0, 1]T, 1}.
Similarly, the right most circle represents the training pair
{[2, 2]T, -1}.
Figure 3.1.5 shows plots for the evolution of the square of the
distance between the vector **w***k*
and the (computed) minimum SSE solution, **w***
for batch LMS (dashed line) and incremental LMS (solid line).
In both simulations, the learning rate (step size) was set to
0.005. The initial search point **w**1
was set to [0, 0]T.
For the incremental LMS rule, the training examples are selected
randomly from the training set. The batch LMS rule converges
to the optimal solution **w***
in less than 100 steps. Incremental LMS requires more learning
steps, on the order of 2,000 steps, to converge to a small neighborhood
of **w***. The fluctuations
in ||**w***k* - **w***||2
in this neighborhood are less than 0.02 as can be seen from Figure
3.1.5. The effect of a deterministic order of presentation of
the training examples on the incremental LMS rule is shown by
the solid line in Figure 3.1.6. Here, the training examples are
presented in a predefined order, which did not change during training.
The same initialization and step size are used as before. In
order to allow for a more meaningful comparison between the two
LMS rule versions, one learning step of incremental LMS is taken
to mean a full cycle through the 20 samples. For comparison,
the simulation result with batch LMS learning is plotted in the
figure (see dashed line). These results indicate a very similar
behavior in the convergence characteristics of incremental and
batch LMS learning. This is so because of the small step size
used. Both cases show asymptotic convergence toward the optimal
solution **w***, but
with a relatively faster convergence of the batch LMS rule near
**w***, this is attributed
to the use of more accurate gradient information.
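The spirit of this experiment can be reproduced with a short script. The data below are hypothetical (randomly generated, not the set of Figure 3.1.4), and the step counts are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for Figure 3.1.4 (rows are samples x^k,
# with a constant bias component); labels from the sign of x1 + x2.
X = rng.normal(size=(20, 3))
X[:, 2] = 1.0
d = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

w_star = np.linalg.solve(X.T @ X, X.T @ d)   # minimum SSE solution

mu = 0.005
w_batch = np.zeros(3)
w_inc = np.zeros(3)
for step in range(2000):
    # Batch LMS (3.1.34): exact gradient over all m samples each step.
    w_batch += mu * X.T @ (d - X @ w_batch)
    # Incremental LMS (3.1.35): one randomly drawn sample per step.
    k = rng.integers(len(X))
    w_inc += mu * (d[k] - X[k] @ w_inc) * X[k]

dist_batch = np.sum((w_batch - w_star) ** 2)
dist_inc = np.sum((w_inc - w_star) ** 2)
```

As in the figures, the batch rule settles essentially onto **w*** while the incremental rule fluctuates in a small neighborhood of it, the size of which is controlled by the step size.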

Figure 3.1.4. A 20 sample training set used in the
simulations associated with Example 3.1.1. Points signified by
a square and a filled circle should map into +1 and −1,
respectively.

Figure 3.1.5. Plots (learning curves) for the distance
square between the search point **w***k*
and the minimum SSE solution **w***
generated using two versions of the LMS learning rule. The dashed
line corresponds to the batch LMS rule in Equation (3.1.34).
The solid line corresponds to the incremental LMS rule in Equation
(3.1.35) with a random order of presentation of the training patterns.
In both cases **w**^1 = **0**
and μ = 0.005 are used. Note the logarithmic scale for
the iteration number *k*.

Figure 3.1.6. Learning curves for the batch LMS (dashed
line) and incremental LMS (solid line) learning rules for the
data in Figure 3.1.4. The result for the batch LMS rule shown
here is identical to the one shown in Figure 3.1.5 (this result
looks different only because of the present use of a linear scale
for the horizontal axis). The incremental LMS rule results shown
assume a deterministic, fixed order of presentation of the training
patterns. Also, for the incremental LMS case, **w***k*
represents the weight vector after the completion of the *k*th
learning "cycle". Here, one cycle corresponds to 20
consecutive learning iterations.

Correlation Learning Rule

The Correlation rule is derived by starting from
the criterion function

$J(\mathbf{w}) = -\sum_{i=1}^{m} d^i y^i$   (3.1.49)

where $y^i = (\mathbf{x}^i)^T\mathbf{w}$,
and performing gradient descent to minimize *J*. Note that
minimizing *J*(**w**) is equivalent to maximizing the
correlation between the desired target and the corresponding linear
unit's output for all **x**^i,
*i* = 1, 2, ..., *m*. Now,
employing steepest gradient descent to minimize *J*(**w**)
leads to the learning rule:

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho\, d^k\mathbf{x}^k$   (3.1.50)

By setting ρ to 1, starting from **w**^1 = **0**, and completing one learning cycle
using Equation (3.1.50), we arrive at the weight vector **w***
given by:

$\mathbf{w}^* = \sum_{k=1}^{m} d^k\mathbf{x}^k = \mathbf{X}\mathbf{d}$   (3.1.51)
where **X** and **d** are as defined earlier.
Note that Equation (3.1.51) leads to the minimum SSE solution
in Equation (3.1.38) if **X**† = **X**.
This is only possible if the training vectors **x***k*
are encoded such that **XX**T
is the identity matrix (i.e., the **x***k*'s
are orthonormal). Correlation learning is further explored in
Section 7.1.1 in Chapter 7.

Another version of this type of learning is the
covariance learning rule. This rule is obtained by steepest gradient
descent on the criterion function $J(\mathbf{w}) = -\sum_{i=1}^{m}\left(d^i - \langle d\rangle\right)\left(y^i - \langle y\rangle\right)$. Here,
<*y*> and <*d*> are computed averages, over
all training pairs, for the unit's output and the desired target,
respectively. Covariance learning provides the basis of the cascade-correlation
net presented in Section 6.3.2.

**3.1.3 Extension of μ-LMS
Rule to Units with Differentiable Activation Functions: Delta
Rule**

Again, consider the training pairs {**x**^i,
d^i}, *i* = 1, 2, ..., *m*, with **x**^i
∈ R^{n+1}
(x^i_{n+1} = 1 for all *i*) and
d^i ∈ [−1,
+1]. Performing gradient descent on the instantaneous SSE criterion
function $J(\mathbf{w}) = \frac{1}{2}\left(d^i - y^i\right)^2$, whose gradient is given by

$\nabla J(\mathbf{w}) = -\left(d^i - y^i\right)f'(net^i)\,\mathbf{x}^i$   (3.1.52)

leads to the delta rule

$\mathbf{w}^{k+1} = \mathbf{w}^k + \rho\left(d^k - y^k\right)f'(net^k)\,\mathbf{x}^k$   (3.1.53)

where $net^k = (\mathbf{x}^k)^T\mathbf{w}^k$ and $y^k = f(net^k)$.
If *f* is defined by *f*(*net*) = tanh(β *net*),
then its derivative is $f'(net) = \beta\left(1 - f^2(net)\right)$. For the "logistic"
function, $f(net) = \frac{1}{1 + e^{-\beta\, net}}$, the derivative is $f'(net) = \beta f(net)\left(1 - f(net)\right)$.
Figure 3.1.8 plots *f* and *f*' for
the hyperbolic tangent activation function with β = 1.
Note how *f* asymptotically approaches +1 and −1
in the limit as *net* approaches +∞ and −∞, respectively.

Figure 3.1.8. Hyperbolic tangent activation function
*f* and its derivative *f*', plotted
for β = 1.

One disadvantage of the delta learning rule is immediately
apparent upon inspection of the graph of *f *'(*net*)
in Figure 3.1.8. In particular, notice how *f*'(*net*) ≈ 0
when *net* has large magnitude (i.e., |*net*| > 3);
these regions are called "flat spots" of *f*'.
In these flat spots, we expect the delta learning rule to progress
very slowly (i.e., very small weight changes even when the error
(*d* - *y*)
is large), because the magnitude of the weight change in Equation
(3.1.53) directly depends on the magnitude of *f *'(*net*).
Since slow convergence results in excessive computation time,
it would be advantageous to try to eliminate the flat spot phenomenon
when using the delta learning rule. One common flat spot elimination
technique involves replacing *f *' by *f *'
plus a small positive bias ε. In this case, the weight update
equation reads as:

**w***k*+1 = **w***k* + ρ (*d**k* - *y**k*) [ *f* '(*net**k*) + ε ] **x***k* (3.1.54)

One of the primary advantages of the delta rule
is that it has a natural extension which may be used to train
multilayered neural nets. This extension, known as back error
propagation, will be discussed in Chapter 5.

**3.1.4 Adaptive Ho-Kashyap (AHK) Learning Rules**

Hassoun and Song (1992) proposed a set of adaptive learning rules for classification problems as enhanced alternatives to the LMS and perceptron learning rules. In the following, three learning rules, AHK I, AHK II, and AHK III, are derived based on gradient-descent strategies on an appropriate criterion function. Two of the proposed learning rules, AHK I and AHK II, are well suited for generating robust decision surfaces for linearly separable problems. The third rule, AHK III, extends these capabilities to find "good" approximate solutions for nonlinearly separable problems. The three AHK learning rules preserve the simple incremental nature of the LMS and perceptron learning rules. They also possess additional processing capabilities, such as the ability to automatically identify critical cluster boundaries and to place a linear decision surface in such a way that it leads to enhanced classification robustness.

Consider a two-class {*c*1,
*c*2} classification
problem with *m* labeled feature vectors (training vectors)
{**x***i*, *d**i*},
*i* = 1, 2, ..., *m*. Assume that **x***i*
belongs to R*n*+1
(with the last component of **x***i*
being a constant bias of value 1) and that *d**i* = +1 (-1)
if **x***i* ∈ *c*1 (*c*2).
Then, a single perceptron can be trained to correctly classify
the above training pairs if an (*n *+ 1)-dimensional weight
vector **w** is computed which satisfies the following set
of *m* inequalities (the sgn function is assumed to be the
perceptron's activation function):

**w**T**x***i* > 0 if *d**i* = +1, and **w**T**x***i* < 0 if *d**i* = -1, for *i* = 1, 2, ..., *m* (3.1.55)

Next, if we define a set of *m* new vectors,
**z***i*, according
to

**z***i* = *d**i* **x***i* (3.1.56)

and we let

**Z** = [ **z**1
**z**2 ... **z***m*]
(3.1.57)

then Equation (3.1.55) may be rewritten as the single
matrix equation

**Z**T**w** > **0** (3.1.58)

Now, defining an *m*-dimensional positive-valued
margin vector **b **(**b** > **0**) and using
it in Equation (3.1.58), we arrive at the following equivalent
form of Equation (3.1.55):

**Z**T**w** = **b**, with **b** > **0** (3.1.59)

Thus, the training of the perceptron is now equivalent
to solving Equation (3.1.59) for **w**, subject to the constraint
**b** > **0**. Ho and Kashyap (1965) proposed
an iterative algorithm for solving Equation (3.1.59). In the
Ho-Kashyap algorithm, the components of the margin vector are
first initialized to small positive values, and the pseudo-inverse
is used to generate a solution for **w** (based on the initial
guess of **b**) which minimizes the SSE criterion function,
*J*(**w**,** b**) = || **Z**T**w**
- **b**
||2:

** w** = **Z**†**b**
(3.1.60)

where **Z**†
= (**Z Z**T)-1
**Z**, for *m* > *n *+ 1. Next, a new
estimate for the margin vector is computed by performing the constrained
(**b** > **0**) gradient descent

**b***k*+1 = **b***k* + ρ ( **e***k* + | **e***k* | ),  **e***k* = **Z**T**w***k* - **b***k* (3.1.61)

where | · | denotes the absolute
value of the components of the argument vector and **b***k*
is the "current" margin vector. A new estimate of **w**
can now be computed using Equation (3.1.60) and employing the
updated margin vector from Equation (3.1.61). This process continues
until all the components of **e**
are zero (or are sufficiently small and positive), which is an
indication of linear separability of the training set, or until
**e**
< **0**, which is an indication of nonlinear separability
of the training set (no solution is found). It can be shown (Ho
and Kashyap, 1965; Sklansky and Wassel, 1981) that the Ho-Kashyap
procedure converges in a finite number of steps if the training
set is linearly separable. For simulations comparing the above
training algorithm to the LMS and perceptron training procedures,
the reader is referred to Hassoun and Clark (1988), Hassoun and
Youssef (1989), and Hassoun (1989a). We will refer to the above
algorithm as the direct Ho-Kashyap (DHK) algorithm.
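The DHK iteration above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the sample data, learning rate, initial margins, and stopping test are illustrative, and `np.linalg.pinv` stands in for the pseudo-inverse **Z**†.

```python
import numpy as np

def direct_ho_kashyap(X, d, rho=0.9, tol=1e-6, max_iter=1000):
    """Direct Ho-Kashyap: seek w and b > 0 satisfying Z^T w = b.

    X: (m, n+1) augmented inputs; d: (m,) labels in {-1, +1}.
    Returns (w, b, separable), where separable flags success.
    """
    Z = (d[:, None] * X).T             # columns z_i = d_i * x_i
    b = np.full(len(d), 0.1)           # small positive initial margins
    Zt_pinv = np.linalg.pinv(Z.T)      # one-time pseudo-inverse computation
    w = Zt_pinv @ b
    for _ in range(max_iter):
        w = Zt_pinv @ b                # min-SSE solution for current b (3.1.60)
        e = Z.T @ w - b                # margin error vector
        if np.all(Z.T @ w > 0):        # every pattern on the correct side
            return w, b, True
        if np.all(e < -tol):           # e < 0 signals nonlinear separability
            return w, b, False
        b = b + rho * (e + np.abs(e))  # margin update keeping b > 0 (3.1.61)
    return w, b, False

X = np.array([[2.0, 1, 1], [1, 2, 1], [-1, -1, 1], [-2, 0, 1]])
d = np.array([1.0, 1, -1, -1])
w, b, separable = direct_ho_kashyap(X, d)
print(separable, np.sign(X @ w))
```

Because the margin update adds only the nonnegative quantity e + |e|, the components of **b** can never decrease below their positive initial values, which is how the **b** > **0** constraint is maintained.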

The direct synthesis of the **w** estimate in
Equation (3.1.60) involves a one-time computation of the pseudo-inverse
of **Z**. However, such computation can be computationally
expensive and requires special treatment when **Z Z**T
is ill-conditioned (i.e., |**ZZ**T|
close to zero). An alternative algorithm that is based on gradient
descent principles and which does not require the direct computation
of **Z**† can
be derived. This derivation is presented next.

Starting with the criterion function *J*(**w**, **b**) = || **Z**T**w** - **b** ||2,
gradient descent may be performed (Sklansky and Wassel, 1981) with
respect to **b** and **w** so that *J* is minimized
subject to the constraint **b** > **0**. The gradient
of *J* with respect to **w** and **b** is given by

(3.1.62a)

(3.1.62b)

where the superscripts *k* and *k* + 1
represent current and updated values, respectively. One analytic
method for imposing the constraint **b** > **0** is to
replace the gradient in Equation (3.1.62a) by -0.5(**e**
+ |**e**|)
with **e**
as defined in Equation (3.1.61). This leads to the following
gradient descent formulation of the Ho-Kashyap procedure:

(3.1.63a)

and

(3.1.63b)

where ρ1 and
ρ2 are strictly positive
constant learning rates. Because of the requirement that all
training vectors **z***k*
(or **x***k*)
be present and included in **Z**, we will refer to the above
procedure as the batch mode adaptive Ho-Kashyap (AHK) procedure.
It can be easily shown that if ρ1
= 0 and **b**1 = **1**,
Equation (3.1.63) reduces to the μ-LMS learning rule. Furthermore,
convergence can be guaranteed (Duda and Hart, 1973) if 0 < ρ1 < 2
and 0 < ρ2 < 2/λmax,
where λmax
is the largest eigenvalue of the positive definite matrix **Z Z**T.

A completely adaptive Ho-Kashyap procedure for solving
Equation (3.1.59) is arrived at by starting from the instantaneous
criterion function *J*(**w**, *b**i*) = ( **z***i*T**w** - *b**i* )2, which leads to the
following incremental update rules:

(3.1.64a)

and

(3.1.64b)

Here *b**i*
represents a scalar margin associated with the **x***i*
input. In all of the above Ho-Kashyap learning procedures, the
margin values are initialized to small positive values, and the
perceptron weights are initialized to zero (or small random) values.
If full margin error correction is assumed in Equation (3.1.64a),
i.e., ρ1
= 1, the incremental learning procedure in Equation (3.1.64) reduces
to the heuristically derived procedure reported in Hassoun and
Clark (1988). An alternative way of writing Equation (3.1.64)
is

and (3.1.65a)

and
(3.1.65b)

where Δ*b*
and Δ**w**
signify the difference between the updated and current values
of *b* and **w**, respectively. We will refer to this
procedure as the AHK I learning rule. For comparison purposes,
it may be noted that the μ-LMS rule in Equation (3.1.35) can be
written as Δ**w** = μ( *b**i* - **z***i*T**w** ) **z***i*, with *b**i*
held fixed at +1.

The implied constraint *b**i*
> 0 in Equations (3.1.64) and (3.1.65) was realized
by starting with a positive initial margin and restricting the
changes Δ*b*
to positive real values. An alternative, more flexible way to
realize this constraint is to allow both positive and negative
changes Δ*b*,
except in cases where a decrease in *b**i*
would result in a negative margin. This modification results in the
following alternative AHK II learning rule:

and (3.1.66a)

and
(3.1.66b)

In the general case of an adaptive margin, as in
Equation (3.1.66), Hassoun and Song (1992) showed that a sufficient
condition for the convergence of the AHK rules is given by

(3.1.67a)

(3.1.67b)

Another variation results in the AHK III rule, which
is appropriate for both linearly separable and nonlinearly separable
problems. Here, Δ**w**
is set to **0** in Equation (3.1.66b). The advantages of the
AHK III rule are that (1) it is capable of adaptively identifying
difficult-to-separate class boundaries and (2) it uses such information
to discard nonseparable training vectors and speed up convergence
(Hassoun and Song, 1992). The reader is invited to apply the
AHK III rule in Problem 3.1.7 to gain insight into the dynamics
and separation behavior of this learning rule.

**Example 3.1.2:** In this
example, the perceptron, LMS, and AHK learning rules are compared
in terms of the quality of the solutions they generate. Consider
the simple two-class linearly separable problem shown earlier
in Figure 3.1.4. The μ-LMS rule of Equation (3.1.35) is used to
obtain the solution shown as a dashed line in Figure 3.1.9. Here,
the initial weight vector was set to **0** and a learning rate
of 0.005 was used. This solution is not one of the linearly
separable solutions for this problem. Four examples of linearly
separable solutions are shown as solid lines in the figure. These
solutions are generated using the perceptron learning rule of
Equation (3.1.2), with varying order of input vector presentations
and with a learning rate of 0.1. Here, it should
be noted that the most robust solution, in the sense of tolerance
to noisy input, is the one shown as a dotted line in Figure 3.1.9.
This robust solution was in fact automatically generated by the
AHK I learning rule of Equation (3.1.65).

Figure 3.1.9. LMS generated decision boundary (dashed line) for a two-class linearly separable problem. For comparison, four solutions generated using the perceptron learning rule are shown (solid lines). The dotted line is the solution generated by the AHK I rule.

**3.1.5 Other Criterion Functions**

The SSE criterion function in Equation (3.1.32)
is not the only possible choice. We have already seen other alternative
functions such as the ones in Equations (3.1.20), (3.1.24), and
(3.1.25). In general, any differentiable function that is minimized
upon setting *y**i*
= *d**i*, for
*i* = 1, 2, ..., *m*, could be used. One possible generalization
of SSE is the Minkowski-*r* criterion function (Hanson and
Burr, 1988) given by

*J*(**w**) = (1/*r*) Σ*i*=1..*m* | *d**i* - *y**i* |*r* (3.1.68)

or its instantaneous version

*J*(**w**) = (1/*r*) | *d* - *y* |*r* (3.1.69)

Figure 3.1.10 shows a plot of |*d* - *y*|*r*
for *r* = 1, 2, and 20. The general form of the gradient
of this criterion function is given by

∇*J*(**w**) = -| *d* - *y* |*r*-1 sgn(*d* - *y*) *f* '(*net*) **x** (3.1.70)

Note that for *r* = 2 this reduces to the gradient
of the SSE criterion function given by Equation (3.1.52). If
*r* = 1, then *J*(**w**) = |*d* - *y*|
with the gradient (note that the gradient of *J*(**w**)
does not exist at the solution points *d* = *y*)

∇*J*(**w**) = -sgn(*d* - *y*) *f* '(*net*) **x** (3.1.71)

In this case, the criterion function in Equation
(3.1.68) is known as the Manhattan norm. For *r* → ∞,
a supremum error measure is approached.
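As a sketch, the instantaneous Minkowski-*r* gradient for a tanh unit can be computed as follows, assuming *J* = (1/*r*)|*d* - *y*|*r* and *y* = tanh(β *net*) as above; the weight and input vectors are illustrative values, not data from the text.

```python
import numpy as np

def minkowski_grad(w, x, d, r, beta=1.0):
    """Gradient of J(w) = (1/r)|d - y|^r for a unit y = tanh(beta * w^T x)."""
    y = np.tanh(beta * (w @ x))
    f_prime = beta * (1.0 - y ** 2)
    err = d - y
    # -|d - y|^(r-1) * sgn(d - y) * f'(net) * x, as in Eq. (3.1.70)
    return -np.abs(err) ** (r - 1) * np.sign(err) * f_prime * x

w = np.array([0.2, -0.1, 0.05])
x = np.array([1.0, 2.0, 1.0])
g_manhattan = minkowski_grad(w, x, d=1.0, r=1)  # push independent of |error|
g_sse = minkowski_grad(w, x, d=1.0, r=2)        # reduces to the SSE gradient
print(g_manhattan, g_sse)
```

For *r* = 2 the error factor is simply (*d* - *y*), recovering the gradient of the SSE criterion; for *r* = 1 the magnitude of the error no longer scales the step, which is what makes the Manhattan criterion less sensitive to outliers.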

A small *r* gives less weight to large deviations
and tends to reduce the influence of the outermost points in the
input space during learning. It can be shown, for a linear unit
with normally distributed inputs, that *r* = 2 is an appropriate
choice in the sense of both minimum SSE and minimum probability
of prediction error (maximum likelihood). The proof is as follows.

Consider the training pairs {**x***i*, *d**i*},
*i* = 1, 2, ..., *m*. Assume
that the vectors **x***i*
are drawn randomly and independently from a normal distribution.
Then a linear unit with a fixed but unknown weight vector **w**
outputs the estimate *y**i* = **w**T**x***i* when presented with
input **x***i*.
Since a weighted sum of independent normally distributed random
variables is itself normally distributed [e.g., see Mosteller
et al. (1970)], then *y**i* is normally distributed.
Thus the prediction error *d**i* - *y**i* is normally
distributed with mean zero and some variance σ2.
This allows us to express the conditional probability density
for observing target *d**i*,
given **w**, upon the presentation of **x***i*
as

*p*(*d**i* | **w**) = (2πσ2)-1/2 exp[ -( *d**i* - **w**T**x***i* )2 / (2σ2) ] (3.1.72)

This function is also known as the likelihood of
**w** with respect to observation *d**i*.
The maximum likelihood estimate of **w** is that value of
**w** which maximizes the probability of occurrence of observation
*d**i* for input
**x***i*. The
likelihood of **w** with respect to the whole training set
is the joint distribution

*p*(*d*1, *d*2, ..., *d**m* | **w**) = Π*i*=1..*m* *p*(*d**i* | **w**) (3.1.73)

Maximizing the above likelihood is equivalent to
maximizing the log-likelihood function:

ln *L*(**w**) = -(*m*/2) ln(2πσ2) - (1/2σ2) Σ*i*=1..*m* ( *d**i* - **w**T**x***i* )2 (3.1.74)

Since the term -(*m*/2) ln(2πσ2) is a constant,
maximizing the log-likelihood function in Equation (3.1.74) is
equivalent to minimizing the SSE criterion

*J*(**w**) = (1/2) Σ*i*=1..*m* ( *d**i* - **w**T**x***i* )2 (3.1.75)

Therefore, with the assumption of a linear unit (ADALINE)
with normally distributed inputs, the SSE criterion is optimal
in the sense of minimizing prediction error. However, if the
input distribution is non-Gaussian, then the SSE criterion will
not possess maximum likelihood properties. See Mosteller and
Tukey (1980) for a more thorough discussion on the maximum likelihood
estimation technique.
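This equivalence is easy to check numerically: for a linear unit with Gaussian noise, the least-squares (minimum-SSE) weights also maximize the Gaussian log-likelihood. The sketch below uses synthetic data; the true weights, sample size, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 3
X = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + rng.normal(scale=0.3, size=m)  # linear unit + Gaussian noise

w_sse, *_ = np.linalg.lstsq(X, d, rcond=None)   # minimum-SSE weight vector

def log_likelihood(w, sigma2=0.09):
    """Gaussian log-likelihood of w given the observed targets d."""
    r = d - X @ w
    return -0.5 * m * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

# Perturbing the SSE solution can only decrease the likelihood
w_alt = w_sse + 0.05
print(log_likelihood(w_sse) >= log_likelihood(w_alt))   # True
```

Since the log-likelihood is a constant minus a positive multiple of the SSE, any weight vector other than the least-squares solution necessarily has a lower likelihood, regardless of the assumed σ2.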

If the distribution of the training patterns has
a heavy tail such as the Laplace-type distribution, *r* =
1 will be a better criterion function choice. This criterion
function is known as robust regression since it is more robust
to an outlier training sample than *r* = 2. Finally, 1 <
*r* < 2 is appropriate to use for pseudo-Gaussian distributions
where the distribution tails are more pronounced than in the Gaussian.

Another criterion function that can be used (Baum
and Wilczek, 1988; Hopfield, 1987; Solla et al., 1988) is the
instantaneous relative entropy error measure (Kullback, 1959)
defined by

*J*(**w**) = (1/2) [ (1 + *d*) ln( (1 + *d*)/(1 + *y*) ) + (1 - *d*) ln( (1 - *d*)/(1 - *y*) ) ] (3.1.76)

where *d* belongs to the open interval (-1, +1).
As before, *J*(**w**) ≥ 0, and if *y* = *d*
then *J*(**w**) = 0. If *y* = *f*(*net*) = tanh(β *net*),
the gradient of Equation (3.1.76)
is

∇*J*(**w**) = -β ( *d* - *y* ) **x** (3.1.77)

The factor *f* '(*net*) in Equations (3.1.53)
and (3.1.70) is missing from Equation (3.1.77). This eliminates
the flat spot encountered in the delta rule and makes the training
here more like that of μ-LMS
(note, however, that *y* here is given by *y* = tanh(β *net*)).
This entropy criterion is "well formed" in the sense
that gradient descent over such a function will result in a linearly
separable solution, if one exists (Wittner and Denker, 1988; Hertz
et al., 1991). On the other hand, gradient descent on the SSE
criterion function does not share this property, since it may
fail to find a linearly separable solution as demonstrated in
Example 3.1.2.
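The practical consequence is easy to demonstrate: in a flat spot of *f*, the delta-rule step nearly vanishes while the entropy-criterion step does not. The sketch below assumes the entropy gradient -β(*d* - *y*)**x** of Equation (3.1.77); the weight and input values are illustrative.

```python
import numpy as np

beta = 1.0
w = np.array([3.0, 2.0, 1.0])
x = np.array([1.0, 1.0, 1.0])   # net = w^T x = 6: deep in a flat spot of f'
d = -1.0                        # target opposes the output: a large error

y = np.tanh(beta * (w @ x))
sse_step = (d - y) * beta * (1 - y ** 2) * x  # delta rule: scaled by f'(net)
entropy_step = beta * (d - y) * x             # entropy criterion: no f'(net)

# Despite the large error, the delta-rule step is tiny; the entropy step is not
print(np.abs(sse_step).max() < 1e-4, np.abs(entropy_step).max() > 1.0)
```

With *net* = 6, the factor 1 - *y*2 is on the order of 10-5, so the delta rule barely moves even though the unit misclassifies the input badly; the entropy-based step retains its full magnitude.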

In order for gradient descent search to find a solution
**w*** in the desired
linearly separable region, we need to use a well-formed criterion
function. Consider the following general criterion function

*J*(**w**) = Σ*i*=1..*m* *g*(*s**i*) (3.1.78)

where

*s**i* = **w**T**z***i* = *d**i* (**w**T**x***i*)

Let *g* '(*s*) = *dg*(*s*)/*ds*. Now, we say *J*
is "well-formed" if *g*(*s*) is differentiable
and satisfies the following conditions (Wittner and Denker, 1988):

1. For all *s*, -*g* '(*s*) ≥ 0; i.e.,
*g* does not push in the wrong direction.

2. There exists ε
> 0, such that -*g* '(*s*) ≥ ε for all *s* ≤ 0;
i.e., *g* keeps pushing if there is a misclassification.

3. *g*(*s*) is bounded below.

For a single unit with weight vector **w**, it
can be shown (Wittner and Denker, 1988) that if the criterion
function is well-formed, then gradient descent is guaranteed to
enter the region of linearly separable solutions **w***,
provided that such a region exists.

**Example 3.1.3:** The perceptron
criterion function in Equation (3.1.20) is a well-formed criterion
function since it satisfies the above conditions:

1. Here *g*(*s*)
= -*s* for misclassified samples,
and -*g* '(*s*) = 1 > 0 for all *s*.

2. -*g* '(*s*) = 1 ≥ ε
> 0 for all *s* ≤ 0.

3. *g*(*s*) is bounded below, since *g*(*s*) = -*s* ≥ 0 for misclassified samples (*s* ≤ 0).

**3.1.6 Extension of Gradient-Descent-Based Learning
to Stochastic Units**

The gradient-descent-based rules above can also be extended to stochastic
units, whose binary output *y* ∈ {-1, +1} is a random variable
governed by the activation *net*. Formally, *y* is given by

*y* = +1 with probability *P*(*net*), and *y* = -1 with probability 1 - *P*(*net*) (3.1.79)

One possible probability function is *P*(*net*) = 1 / [1 + exp(-2β *net*)].
Hence, 1 - *P*(*net*) = 1 / [1 + exp(2β *net*)]. Also, note that the expected
value of *y*, <*y*>, is given by

<*y*> = (+1)*P*(*net*) + (-1)[1 - *P*(*net*)] = 2*P*(*net*) - 1 = tanh(β *net*) (3.1.80)

Stochastic units are the basis for reinforcement
learning networks, as is shown in the next section. Also, these
units allow for a natural mapping of optimal stochastic learning
and retrieval methods onto neural networks, as discussed in Section
8.3.

Let us now define a SSE criterion function in terms
of the mean output of the stochastic unit:

*J*(**w**) = (1/2) Σ*i*=1..*m* ( *d**i* - <*y**i*> )2 (3.1.81)

Employing gradient descent, we arrive at the update
rule

**w***k*+1 = **w***k* + ρ Σ*i*=1..*m* ( *d**i* - <*y**i*> ) β (1 - <*y**i*>2) **x***i* (3.1.82)

In the incremental update mode, we have the following
update rule:

**w***k*+1 = **w***k* + ρ ( *d**k* - <*y**k*> ) β (1 - <*y**k*>2) **x***k* (3.1.83)
This learning rule is identical in form to the delta
learning rule given in Equation (3.1.53) for a deterministic
unit with an activation function *f*(*net*) = tanh(β *net*).
Therefore, in an average sense, the stochastic unit learning
rule in Equation (3.1.83) leads to a weight vector which is equal
to that obtained using the delta rule for a deterministic unit
with a hyperbolic tangent activation function.
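The claimed equivalence of the mean output to a tanh unit is easy to verify by simulation, assuming the logistic choice *P*(*net*) = 1 / [1 + exp(-2β *net*)], for which <*y*> = tanh(β *net*); the sampling parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, net = 1.0, 0.8

# Stochastic unit: y = +1 with probability P(net), else y = -1
p = 1.0 / (1.0 + np.exp(-2.0 * beta * net))
y = np.where(rng.random(200_000) < p, 1.0, -1.0)

# The empirical mean output approaches tanh(beta * net)
print(abs(y.mean() - np.tanh(beta * net)) < 0.01)   # True
```

With 200,000 samples the Monte Carlo standard error of the mean is well below 0.01, so the empirical average of the ±1 outputs lands on tanh(β *net*) to within the stated tolerance.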