**4.6 Theory of Simple Competitive Learning**

Also, consider the criterion function (Ritter and
Schulten, 1988a)

$$J = \frac{1}{2}\sum_{k}\left\|\mathbf{x}^{k} - \mathbf{w}_{i(\mathbf{x}^{k})}\right\|^{2} \qquad (4.6.2)$$

where $\mathbf{w}_{i(\mathbf{x}^{k})}$ is the weight vector
of the winner unit upon the presentation of the input vector $\mathbf{x}^{k}$.
Here, all vectors $\mathbf{x}^{k}$
are assumed to be equally probable. In general, a probability
of occurrence of $\mathbf{x}^{k}$,
$P(\mathbf{x}^{k})$,
should be inserted inside the summation in Equation (4.6.2).
An alternative way of expressing *J* in Equation (4.6.2)
is through the use of a "cluster membership matrix"
**M**, with elements $m_{ik}$ defined for each unit *i* = 1, 2, ..., *n*
by

$$m_{ik} = \begin{cases} 1 & \text{if } \left\|\mathbf{x}^{k} - \mathbf{w}_{i}\right\| \le \left\|\mathbf{x}^{k} - \mathbf{w}_{l}\right\| \text{ for all } l \\ 0 & \text{otherwise} \end{cases} \qquad (4.6.3)$$

Here, $m_{ik}$ is a dynamically evolving
function of *k* and *i* which specifies whether or not
unit *i* is the winning unit upon the presentation of input
$\mathbf{x}^{k}$. The
cluster membership matrix allows us to express the criterion function
*J* as

$$J = \frac{1}{2}\sum_{k}\sum_{i} m_{ik}\left\|\mathbf{x}^{k} - \mathbf{w}_{i}\right\|^{2} \qquad (4.6.4)$$

Now, performing gradient descent on *J* in Equation
(4.6.4) yields

$$\Delta\mathbf{w}_{i} = -\rho\,\frac{\partial J}{\partial \mathbf{w}_{i}} = \rho \sum_{k} m_{ik}\left(\mathbf{x}^{k} - \mathbf{w}_{i}\right) \qquad (4.6.5)$$

which is the batch mode version of the learning rule
in Equation (4.6.1). It was noted by Hertz et al. (1991) that
the above batch mode competitive learning rule corresponds to
the *k*-means clustering algorithm when a finite training
set is used. The local rule of Equation (4.6.1) may have an advantage
over the batch mode rule in Equation (4.6.5) since stochastic
noise due to the random presentation order of the input patterns
may kick the solution out of "poor" local minima toward
deeper ones. However, only in the case of "sufficiently
sparse" input data points can one prove stability and convergence
theorems for the stochastic (incremental) competitive learning
rule (Grossberg, 1976). The data points are sparse enough if
there exists a set of clusters so that the minimum overlap
within a cluster exceeds the maximum overlap between that cluster
and any other cluster. In practice, a damped learning rate $\rho^{k}$
is used (e.g., $\rho^{k} = \rho_{0}/k$, where $k = 1, 2, 3, \ldots$
and $\rho_{0}$ is a positive constant)
in order to stop weight evolution at one of the local solutions.
Here, the relatively large initial learning rate allows for wide
exploration during the initial phase of learning.
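The correspondence with *k*-means and the effect of the damped learning rate can be made concrete with a short sketch (a minimal NumPy illustration; the function names and the two-cluster test data are assumptions for the example, not from the text):

```python
import numpy as np

def batch_competitive_step(X, W):
    """One batch-mode step in the spirit of Eq. (4.6.5), taken with a full
    step size: each weight vector moves to the mean of the patterns it
    currently wins, which is exactly one k-means iteration."""
    # membership m_ik = 1 iff unit i is nearest to pattern x^k
    dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    winners = dists.argmin(axis=1)
    W_new = W.copy()
    for i in range(W.shape[0]):
        members = X[winners == i]
        if len(members) > 0:
            W_new[i] = members.mean(axis=0)
    return W_new

def incremental_competitive(X, W, epochs=50, rho0=0.5, seed=0):
    """Stochastic (local) rule: the winner moves toward each pattern in
    turn, with the damped learning rate rho_k = rho0 / k."""
    rng = np.random.default_rng(seed)
    W = W.astype(float).copy()
    k = 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:   # random presentation order
            k += 1
            i = ((x - W) ** 2).sum(axis=1).argmin()  # winner-take-all
            W[i] += (rho0 / k) * (x - W[i])          # damped step toward x
    return W
```

On well-separated clusters both versions place the weight vectors near the cluster means; the incremental version visits the patterns in random order, which is the source of the useful stochastic noise discussed above.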

Criterion functions, other than the one in Equation
(4.6.4), may be employed which incorporate some interesting heuristics
into the competitive rule for enhancing convergence speed or for
altering the underlying "similarity measure" implemented
by the learning rule. As an example, we may replace $m_{ik}$
by $\left(2m_{ik} - 1\right)$ in Equation (4.6.4). This causes
the winning weight vector to be repelled by input vectors in other
clusters, while being attracted by its own cluster, which enhances
convergence. Another example is to employ a different similarity
measure (norm) in *J* such as the Minkowski-*r* norm
of Equation (3.1.68), which has the ability to reduce the effects
of outlier data points by proper choice of the exponent *r*.
Other criterion functions may also be employed; the reader is
referred to Bachmann et al. (1987) for yet another suitable criterion
function.
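The outlier-damping effect of the Minkowski-*r* norm can be illustrated numerically (a sketch; the one-dimensional data and helper names are assumptions for the example): for *r* = 2 the cost-minimizing "center" of a cluster is its mean, while for *r* = 1 it is the component-wise median, which is far less sensitive to a gross outlier.

```python
import numpy as np

def minkowski_r_cost(w, members, r):
    """Contribution of one cluster to J when the Minkowski-r norm
    replaces the squared Euclidean distance."""
    return sum((abs(x - w) ** r).sum() for x in members)

def cluster_center(members, r):
    """Per-component minimizer of the Minkowski-r cluster cost:
    the mean for r = 2, the median for r = 1."""
    members = np.asarray(members, dtype=float)
    if r == 2:
        return members.mean(axis=0)
    if r == 1:
        return np.median(members, axis=0)
    raise NotImplementedError("closed forms shown only for r = 1, 2")

# one cluster with a gross outlier at 10.0
cluster = np.array([[0.9], [1.0], [1.1], [10.0]])
```

Here `cluster_center(cluster, 2)` is pulled to 3.25 by the outlier, whereas `cluster_center(cluster, 1)` stays at 1.05, near the bulk of the data.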

The following is an analysis of simple competitive
learning based on the stochastic approximation technique introduced
in Section 4.3. Consider the following normalized discrete-time
competitive rule (von der Malsburg, 1973; Rumelhart and Zipser,
1985):

$$\Delta w_{ij} = \begin{cases} \rho\left(\dfrac{x_{j}^{k}}{\sum_{l} x_{l}^{k}} - w_{ij}\right) & \text{if unit } i \text{ wins} \\ 0 & \text{otherwise} \end{cases} \qquad (4.6.6)$$

where, again, the setting is a single layer network
of linear units. Here, $x_{j}^{k} \in \{0, 1\}$ and, typically,
$\rho \ll 1$. Also, the weight normalization $\sum_{j} w_{ij} = 1$
is assumed for all units. It can be easily verified that Equation
(4.6.6) preserves this weight normalization at any iteration (this
was explored in Problem 3.4.1).
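The normalization-preserving property is immediate to check numerically (a minimal sketch; the particular weight and pattern values are illustrative): since the normalized input $x_{j}^{k}/\sum_{l} x_{l}^{k}$ itself sums to one, the updated weights satisfy $\sum_{j} w'_{ij} = \sum_{j} w_{ij} + \rho\,(1 - \sum_{j} w_{ij}) = 1$ whenever $\sum_{j} w_{ij} = 1$.

```python
import numpy as np

def rz_update(w_i, x, rho=0.1):
    """Update (4.6.6) for the winning unit i on a binary pattern x."""
    return w_i + rho * (x / x.sum() - w_i)

w = np.array([0.2, 0.3, 0.1, 0.4])   # normalized: components sum to 1
x = np.array([1.0, 0.0, 1.0, 1.0])   # binary stimulus with 3 active bits
for _ in range(100):
    w = rz_update(w, x)
```

After any number of updates `w.sum()` remains 1, and repeated presentation of the same pattern drives `w` toward the normalized pattern `x / x.sum()`.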

Let $P(\mathbf{x}^{k})$
be the probability that input $\mathbf{x}^{k}$
is presented on any trial. Then, the average learning equation
may be expressed as

$$\left\langle \Delta w_{ij} \right\rangle = \sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})\, \Delta w_{ij}\big|_{\mathbf{x}^{k}} \qquad (4.6.7)$$

where $P(i \mid \mathbf{x}^{k})$ is the conditional
probability that unit *i* wins when input $\mathbf{x}^{k}$
is presented. Now, using Equation (4.6.6) in Equation (4.6.7)
we get

$$\left\langle \Delta w_{ij} \right\rangle = \rho \sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k}) \left( \frac{x_{j}^{k}}{\sum_{l} x_{l}^{k}} - w_{ij} \right) \qquad (4.6.8)$$

which implies that at equilibrium ($\left\langle \Delta w_{ij} \right\rangle = 0$)

$$\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})\, \frac{x_{j}^{k}}{\sum_{l} x_{l}^{k}} = w_{ij} \sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k}) \qquad (4.6.9)$$

Therefore, the *j*th component of vector $\mathbf{w}_{i}$
is given as

$$w_{ij} = \frac{\displaystyle\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})\, \frac{x_{j}^{k}}{\sum_{l} x_{l}^{k}}}{\displaystyle\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})} \qquad (4.6.10)$$

We now make the following observations. First, note
that the denominator of Equation (4.6.10) is the probability that
unit *i* wins averaged over all stimulus patterns. Note
further that $\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})\, x_{j}^{k}$ is the probability that
$x_{j} = 1$ (active) and unit *i* is a winner.
Thus, assuming that all patterns have the same number of active
bits (i.e., $\sum_{l} x_{l}^{k} = N$ for all *k*), we may
employ Bayes' rule and write Equation
(4.6.10) as

$$w_{ij} = \frac{1}{N}\, P\!\left(x_{j} = 1 \mid i \text{ wins}\right) \qquad (4.6.11)$$

which states that at equilibrium, $w_{ij}$
is expected to be proportional to the conditional probability
that the *j*th bit of input $\mathbf{x}^{k}$
is active given unit *i* is a winner.
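The prediction of Equation (4.6.11) can be checked by simulation (a sketch under simplifying assumptions not made in the text: a single unit that always wins, and equiprobable random binary patterns each with exactly *N* active bits):

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, N, n_patterns = 10, 4, 5   # N = number of active bits per pattern

# equiprobable binary stimulus set
X = np.zeros((n_patterns, n_bits))
for row in X:
    row[rng.choice(n_bits, size=N, replace=False)] = 1.0

# one unit that is always the winner, trained with rule (4.6.6)
w = np.full(n_bits, 1.0 / n_bits)   # normalized initial weights
rho = 0.005
for _ in range(50000):
    x = X[rng.integers(n_patterns)]  # equiprobable presentation
    w += rho * (x / N - w)

# equilibrium prediction of Eq. (4.6.11): w_j = (1/N) P(x_j = 1 | unit wins)
predicted = X.mean(axis=0) / N
```

The trained weights stay normalized and fluctuate tightly around the predicted conditional probabilities; the residual spread scales with the learning rate, as expected for a stochastic approximation.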

Next, upon the presentation of a new pattern $\tilde{\mathbf{x}}$
[assuming the equilibrium weight values given by Equation (4.6.9)],
unit *i* will have a weighted sum (activity) of

$$\mathrm{net}_{i} = \sum_{j} w_{ij}\, \tilde{x}_{j} = \frac{\displaystyle\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})\, \frac{1}{N} \sum_{j} x_{j}^{k}\, \tilde{x}_{j}}{\displaystyle\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})} \qquad (4.6.12)$$

or

$$\mathrm{net}_{i} = \frac{\displaystyle\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})\, r_{k}}{\displaystyle\sum_{k} P(\mathbf{x}^{k})\, P(i \mid \mathbf{x}^{k})} \qquad (4.6.13)$$

where $r_{k} = \frac{1}{N}\, \tilde{\mathbf{x}}^{\mathrm{T}} \mathbf{x}^{k}$ represents the overlap
between stimulus $\tilde{\mathbf{x}}$ and the *k*th training
pattern $\mathbf{x}^{k}$.
Thus, at equilibrium, a unit responds most strongly to patterns
that overlap other patterns to which the unit responds and most
weakly to patterns that are far from patterns to which it responds.
Note that we may express the conditional probability
$P(i \mid \mathbf{x}^{k})$ according to the winner-take-all mechanism

$$P(i \mid \mathbf{x}^{k}) = \begin{cases} 1 & \text{if } \mathrm{net}_{i} > \mathrm{net}_{l} \text{ for all } l \neq i \\ 0 & \text{otherwise} \end{cases} \qquad (4.6.14)$$

Because of the dependency of $P(i \mid \mathbf{x}^{k})$
on $\mathbf{w}_{i}$, there
are many solutions which satisfy the equilibrium relation given
in Equation (4.6.9).

Equation (4.6.6) leads the search to one of many
stable equilibrium states satisfying Equations (4.6.9) and (4.6.14).
In such a state, the *i*th unit activations $\mathrm{net}_{i}$
become stable (fluctuate minimally), and therefore, $P(i \mid \mathbf{x}^{k})$
becomes stable. A sequence of stimuli might, however, be presented
in such a way as to introduce relatively large fluctuations in
the $\mathbf{w}_{i}$'s.
In this case, the system might move to a new equilibrium state
which is, generally, more stable in the sense that $P(i \mid \mathbf{x}^{k})$
becomes unlikely to change values for a very long period of time.
Rumelhart and Zipser (1985) gave a measure of the stability of
an equilibrium state as the average amount by which the output
of the winning units is greater than the response of all of the
other units averaged over all patterns and all clusters. This
stability measure is given by

$$J = \left\langle \mathrm{net}_{i^{*}} - \mathrm{net}_{i} \right\rangle \qquad (4.6.15)$$

where the averaging is taken over all $\mathbf{x}^{k}$ and all units $i \neq i^{*}$,
and $i^{*}$ is the index
of winning units. Note that Equation (4.6.15) can be written
as

$$J = \sum_{k} P(\mathbf{x}^{k}) \sum_{i \neq i^{*}} \left[ \mathrm{net}_{i^{*}}(\mathbf{x}^{k}) - \mathrm{net}_{i}(\mathbf{x}^{k}) \right] \qquad (4.6.16)$$

The larger the value of *J*, the more stable
the system is expected to be. Maximizing *J* can also be
viewed as maximizing the overlap among patterns within a group
(cluster) while minimizing the overlap among patterns between
groups; this is exactly what is required for the clustering of
unlabeled data. In geometric terms, *J* is maximized when
the weight vectors point toward maximally compact stimulus (input)
regions that are as distant as possible from other such regions.
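This stability measure, the winner's activation margin over the remaining units averaged over all patterns, can be sketched as follows (a minimal illustration; the function name and the two orthogonal test clusters are assumptions for the example):

```python
import numpy as np

def stability_measure(X, W, P=None):
    """Margin by which the winning unit's activation exceeds each other
    unit's activation, averaged over all patterns (equiprobable if P is
    None), in the spirit of the Rumelhart-Zipser stability measure."""
    K = len(X)
    P = np.full(K, 1.0 / K) if P is None else P
    J = 0.0
    for p_k, x in zip(P, X):
        net = W @ x                    # linear unit activations
        i_star = net.argmax()          # winner-take-all
        J += p_k * (net[i_star] - np.delete(net, i_star)).sum()
    return J
```

Weight vectors tuned to compact, well-separated pattern groups give a large measure, while undifferentiated weights that sit between the groups give a measure near zero, matching the geometric interpretation above.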
