In the previous section, we considered simple Hebbian-based
networks of linear units which employed some degree of competition
through lateral inhibition in order for each unit to capture a
principal component of the training set. In this section, we
extend this notion of competition among units and specialization
of units to tackle a different class of problems involving clustering
of unlabeled data or vector quantization. Here, a network with
binary-valued outputs, only one of which is "on" at a time,
can indicate which of several categories an input belongs to. These
categories are to be discovered by the network on the basis of
correlations in the input data. The network would then classify
each cluster of "similar" input data as a single output
class.

**3.4.1 Simple Competitive Learning**

Consider a network of *m* units, where unit *i* computes the weighted sum
$\text{net}_i = \mathbf{w}_i^T \mathbf{x}^k$; here, $\mathbf{w}_i$ is the
weight vector of unit *i* and $\mathbf{x}^k$ is the current input. Thus,
unit *i* is the winning unit if

$\text{net}_i = \max_j \left\{ \text{net}_j \right\}$   (3.4.2)

which, when the weight vectors are normalized, is equivalent to the
minimum-distance condition

$\left\| \mathbf{w}_i - \mathbf{x}^k \right\| = \min_j \left\| \mathbf{w}_j - \mathbf{x}^k \right\|$   (3.4.3)

In order to assure winner-take-all operation, a
proper choice of lateral weights and unit activation functions
must be made (e.g., see Grossberg, 1976, and Lippmann, 1987).
One possible choice for the lateral weights is

$w_{il} = \begin{cases} 1 & \text{if } l = i \\ -\epsilon & \text{if } l \neq i \end{cases}$   (3.4.4)

where $0 < \epsilon < \frac{1}{m}$ and *m*
is the number of units in the network. An appropriate activation
function for this type of network is shown in Figure 3.4.2, where
*T* is chosen such that the outputs $y_i$
do not saturate at 1 before convergence of the winner-take-all
competition; after convergence, only the winning unit will saturate
at 1, with all other units having zero outputs. Note, however,
that if one is training the net as part of a computer simulation,
there is no need for the winner-take-all net to be implemented
explicitly; it is more efficient, from a computational point of view,
to perform the winner selection by a direct search for the maximum
$\text{net}_i$. Thus
far, we have only described the competition mechanism of the competitive
learning technique. Next, we give a learning equation for weight
updating.

For a given input $\mathbf{x}^k$
drawn from a random distribution $p(\mathbf{x})$, the weights
of the winning unit *i* are updated (the weights of all other units
are left unchanged) according to (Grossberg, 1969; von der Malsburg,
1973; Rumelhart and Zipser, 1985):

$\Delta \mathbf{w}_i = \eta \left( \mathbf{x}^k - \mathbf{w}_i \right)$   (3.4.5)

where $\eta$ is a small positive learning rate.

If the magnitudes of the input vectors contain no
useful information, a more appropriate rule to use is

$\Delta \mathbf{w}_i = \eta \left( \frac{\mathbf{x}^k}{\left\| \mathbf{x}^k \right\|} - \mathbf{w}_i \right)$   (3.4.6)

The above rules tend to tilt the weight vector of
the current winning unit in the direction of the current input.
The cumulative effect of the repetitive application of the above
rules can be described as follows. Let us view the input and
weight vectors as points scattered on the surface of a hypersphere
(or a circle as in Figure 3.4.3). The effect of the application
of the competitive learning rule is to sensitize certain units
towards neighboring clusters of input data. Ultimately, some
units (frequent winner units) will evolve so that their weight
vector points towards the "center of mass" of the nearest
significant dense cluster of data points. This effect is illustrated
pictorially for a simple example in Figure 3.4.3.


Figure 3.4.3. Simple competitive learning. (a) Initial
weight vectors; (b) weight vector configuration after performing
simple competitive learning.

At the onset of learning, we do not know the number
of clusters in the data set. So we normally overestimate this
number by including excess units in the network. This means that
after convergence, some units will be redundant in the sense that
they do not evolve significantly and thus do not capture any data
clusters. This can be seen, for example, in Figure 3.4.3, where
the weight vector $\mathbf{w}_1$
does not evolve significantly throughout the computation. These
are typically the units that are initialized to points in the
weight space that have relatively low overlap with the training
data points. Such units may still be desirable, since they may
capture new clusters if the underlying distribution $p(\mathbf{x})$
changes in time. Alternatively, their probability of occurrence
may be reduced by using some of the following heuristics: (1)
Initialize the weight vectors to randomly selected sample input
vectors; (2) use "leaky learning" (Rumelhart and Zipser,
1985), in which all losing units are also updated in response to
an input vector, but with a learning rate much smaller than that
of the winning unit (see the sketch following this list); (3) update the neighboring
units of the winner unit; and/or (4) smear the input vectors with
added noise using a long-tailed distribution (Szu, 1986).

After learning has converged, it is necessary to
"calibrate" the above clustering network in order to
determine which units represent the various learned clusters,
as well as each cluster's membership. Here, the weights of the net
are held fixed and the training set is used to interrogate the
network for sensitized units. The following example demonstrates
the above ideas for a simple data clustering problem.

**Example 3.4.1:** Consider
the set of unlabeled two-dimensional data points plotted in Figure
3.4.4. Here, we demonstrate the typical solution (cluster formation)
generated by a four-unit competitive net, which employs Equations
(3.4.2) and (3.4.5). No normalization of either the weight vectors
or the input vectors is used. During training, the input samples
(data points) are selected uniformly at random from the data set.
The four weight vectors are initialized to random values close
to the center of the plot in Figure 3.4.4. Training with $\eta = 0.01$
is performed for 1000 iterations. The resulting weight vector
trajectories are plotted in Figure 3.4.5(a) with the initial weights
marked by "*". Note how one of the units never evolved
beyond its initial weight vector. This is because this unit never
became a winner. Each of the remaining three units evolved its
weight vector to a distinct region in the input space. Since $\eta$
is small, the weight vectors enter and stay inside
"tiny" neighborhoods, fluctuating within
their respective terminal neighborhoods. These fluctuations are
amplified if a relatively larger value ($\eta = 0.05$) is
used, as demonstrated in Figure 3.4.5(b). Finally, the trained
network is calibrated by assigning a unique cluster label to
any unit that becomes a winner for one or more training samples.
Here, Equation (3.4.2) is used to determine the winning unit.
During calibration, only three units are found to ever become
winners. Figure 3.4.6 depicts the result of the calibration process.
Here, the symbols "+", "o", and "×"
are used to tag the three winner units, respectively. The figure
shows a "+", "o", or "×" printed
at the exact position of the training sample $\mathbf{x}$ if $\mathbf{x}$
causes the unit with label "+", "o", or "×",
respectively, to be a winner. We should note that exactly the same
clusters in Figure 3.4.6 are generated by the net with $\eta = 0.05$.
The cluster labeled "×" is interesting because of
its obvious bimodal structure. Intuitively speaking, one would
have expected this cluster to be divided by the competitive net
into two distinct clusters. In fact, if one carefully looks at
Figure 3.4.5(b), this is what the net attempted to do, but failed.
However, only a few simulations, out of the large number attempted
(not shown here), did in fact result
in this "intuitive" solution. In Chapter 4, the competitive
learning rule in Equation (3.4.5) is shown to correspond to stochastic
gradient descent on the criterion
$J(\mathbf{w}) = \frac{1}{2} \sum_k \left\| \mathbf{x}^k - \mathbf{w}_{i^*} \right\|^2$,
where $\mathbf{w}_{i^*}$ is the weight vector of the winning unit.
Therefore, one may explain the three-cluster solution in this
example as representing a suboptimal solution which corresponds
to a local minimum of $J(\mathbf{w})$.

Figure 3.4.4. Unlabeled two-dimensional data used
in training the simple competitive net of Example 3.4.1.

Figure 3.4.5. Weight vector evolution trajectories
for a four-unit competitive net employing Equations (3.4.2) and
(3.4.5). These trajectories are shown superimposed on the plane
containing the training data. A "*" is used to indicate
the initial setting of the weights for each of the four units.
(a) Learning rate equals 0.01, and (b) learning rate equals 0.05.

Figure 3.4.6. A three-cluster solution for the data
shown in Figure 3.4.4. This solution represents a typical cluster
formation generated by the simple four-unit competitive net of
Example 3.4.1.

Vector quantization is a technique whereby the input
space is divided into a number of distinct regions, and for each
region a "template" (reconstruction vector) is defined
(Linde et al., 1980; Gray, 1984). When presented with a new input
vector $\mathbf{x}$, a vector quantizer first determines the region
in which the vector lies. Then, the quantizer outputs an encoded
version of the reconstruction vector $\mathbf{w}_i$
representing the particular region containing $\mathbf{x}$. The
set of all possible reconstruction vectors $\mathbf{w}_i$
is usually called the "codebook" of the quantizer.
When the Euclidean distance similarity measure is used to decide
on the region to which the input $\mathbf{x}$ belongs, the quantizer
is called a Voronoi quantizer. The Voronoi quantizer partitions
its input space into Voronoi cells (Gray, 1984), each of which is
represented by one of the reconstruction vectors $\mathbf{w}_i$.
The *i*th Voronoi cell contains those points of the input
space that are closer (in a Euclidean sense) to the vector $\mathbf{w}_i$
than to any other vector $\mathbf{w}_j$,
$j \neq i$. Figure 3.4.7 shows an example of
the input space partitions of a Voronoi quantizer with four reconstruction
vectors.

Figure 3.4.7. Input space partitions realized by
a Voronoi quantizer with four reconstruction vectors. The four
reconstruction vectors are shown as filled circles in the figure.

The competitive learning rule in Equation (3.4.5)
with a winning unit determination based on the Euclidean distance
as in Equation (3.4.3) may now be used in order to allocate a
set of *m* reconstruction vectors $\mathbf{w}_i$,
$i = 1, 2, ..., m$, to the
input space of *n*-dimensional vectors $\mathbf{x}$. Let $\mathbf{x}$
be distributed according to the probability density function $p(\mathbf{x})$.
Initially, we set the starting values of the vectors $\mathbf{w}_i$
to the first *m* randomly generated samples of $\mathbf{x}$.
Additional samples $\mathbf{x}$ are then used for training. Here,
the learning rate $\eta$ in Equation (3.4.5) is selected as a monotonically
decreasing function of the number of iterations *k*. Based
on empirical results, Kohonen (1989) conjectured that, in an average
sense, the asymptotic local point density of the $\mathbf{w}_i$'s
(i.e., the number of $\mathbf{w}_i$
falling in a small volume of $\mathbf{R}^n$
centered at $\mathbf{x}$) obtained by the above competitive learning
process takes the form of a continuous, monotonically increasing
function of $p(\mathbf{x})$. Thus, this competitive learning
algorithm may be viewed as an "approximate" method for
computing the reconstruction vectors $\mathbf{w}_i$
in an unsupervised manner.

Kohonen (1989) designed supervised versions of vector
quantization (called learning vector quantization, LVQ) for adaptive
pattern classification. Here, class information is used to fine-tune
the reconstruction vectors in a Voronoi quantizer, so as
to improve the quality of the classifier's decision regions. In
pattern classification problems, it is the decision surface between
pattern classes, and not the interior of the class distributions,
that should be described most accurately. The above quantization
process can be easily adapted to optimize the placement
of the decision surface between different classes. Here, one
would start with a trained Voronoi quantizer and calibrate it
using a set of labeled input samples (vectors). Each calibration
sample is assigned to the closest $\mathbf{w}_i$. Each $\mathbf{w}_i$
is then labeled according to the majority class represented
among the samples assigned to it.
Here, the distribution of the calibration samples over the various
classes, as well as the relative numbers of the $\mathbf{w}_i$
assigned to these classes, must comply with the a priori probabilities
of the classes, if such probabilities are known. Next, the tuning
of the decision surfaces is accomplished by rewarding correct
classifications and punishing incorrect ones. Let the training
vector $\mathbf{x}^k$ belong to class $c_j$.
Assume that the reconstruction vector $\mathbf{w}_i$ closest
to $\mathbf{x}^k$ carries the label of class $c_l$.
Then, only vector $\mathbf{w}_i$
is updated according to the following supervised rule (LVQ rule):

$\mathbf{w}_i \leftarrow \begin{cases} \mathbf{w}_i + \eta_k \left( \mathbf{x}^k - \mathbf{w}_i \right) & \text{if } c_l = c_j \\ \mathbf{w}_i - \eta_k \left( \mathbf{x}^k - \mathbf{w}_i \right) & \text{if } c_l \neq c_j \end{cases}$   (3.4.7)

where $\eta_k$
is assumed to be a monotonically decreasing function of the number
of iterations *k*. After convergence, the input space $\mathbf{R}^n$
is again partitioned by a Voronoi tessellation corresponding to
the tuned $\mathbf{w}_i$
vectors. The primary effect of the reward/punish rule in Equation
(3.4.7) is to minimize the number of misclassifications. At the
same time, however, the vectors $\mathbf{w}_i$
are pulled away from the zones of class overlap, where misclassifications
persist.

The convergence speed of LVQ can be improved if
each vector $\mathbf{w}_i$
has its own adaptive learning rate $\eta_i^k$, given
by (Kohonen, 1990):

$\eta_i^k = \dfrac{\eta_i^{k-1}}{1 + s(k)\, \eta_i^{k-1}}$   (3.4.8)

where $s(k) = +1$ if $\mathbf{x}^k$ is classified correctly, and
$s(k) = -1$ otherwise.

This recursive rule causes $\eta_i$
to decrease if $\mathbf{w}_i$
classifies $\mathbf{x}^k$
correctly; otherwise, $\eta_i$
increases. Equations (3.4.7) and (3.4.8) define what is known
as an "optimized learning rate" LVQ (OLVQ). Another
improved algorithm named LVQ2 has also been suggested by Kohonen
et al. (1988) which approaches the performance predicted by Bayes
decision theory (Duda and Hart, 1973).

Some theoretical aspects of competitive learning
are considered in the next chapter. More general competitive
networks with stable categorization behavior have been proposed
by Carpenter and Grossberg (1987a, b). One of these
networks, called ART1, is described in Chapter 6.