In this section we attempt to characterize simple competitive learning. Two approaches are described: one deterministic and the other statistical.

Consider a single layer of linear units, where each unit uses the simple continuous-time competitive rule [based on Equations (3.4.3) and (3.4.5)] (4.6.1) (4.6.2)

where is the weight vector of the winner unit upon the presentation of the input vector xk. Here, all vectors xk are assumed to be equally probable. In general, a probability of occurrence of xk, P(xk), should be inserted inside the summation in Equation (4.6.2). An alternative way of expressing J in Equation (4.6.2) is through the use of a "cluster membership matrix" M defined for each unit i = 1, 2, ..., n by (4.6.3)

Here, is a dynamically evolving function of k and i which specifies whether or not unit i is the winning unit upon the presentation of input xk. The cluster membership matrix allows us to express the criterion function J as (4.6.4)

Now, performing gradient descent on J in Equation (4.6.4) yields (4.6.5)

which is the batch mode version of the learning rule in Equation (4.6.1). It was noted by Hertz et al. (1991) that the above batch mode competitive learning rule corresponds to the k-means clustering algorithm when a finite training set is used. The local rule of Equation (4.6.1) may have an advantage over the batch mode rule in Equation (4.6.5) since stochastic noise due to the random presentation order of the input patterns may kick the solution out of "poor" minima towards minima which are more optimal. However, only in the case of "sufficiently sparse" input data points can one prove stability and convergence theorems for the stochastic (incremental) competitive learning rule (Grossberg, 1976). The data points are sparse enough if there exists a set of clusters so that the minimum overlap within a cluster exceeds the maximum overlap between that cluster and any other cluster. In practice, a damped learning rate k is used (e.g., , where and 0 is a positive constant) in order to stop weight evolution at one of the local solutions. Here, the relatively large initial learning rate allows for wide exploration during the initial phase of learning.

Criterion functions, other than the one in Equation (4.6.4), may be employed which incorporate some interesting heuristics into the competitive rule for enhancing convergence speed or for altering the underlying "similarity measure" implemented by the learning rule. For an example, we may replace by in Equation (4.6.4). This causes the winning weight vector to be repelled by input vectors in other clusters, while being attracted by its own cluster, which enhances convergence. Another example is to employ a different similarity measure (norm) in J such as the Minkowski-r norm of Equation (3.1.68), which has the ability to reduce the effects of outlier data points by proper choice of the exponent r. Other criterion functions may also be employed; the reader is referred to Bachmann et al. (1987) for yet another suitable criterion function. (4.6.6)

where, again, the setting is a single layer network of linear units. Here, and typically, . Also, the weight normalization is assumed for all units. It can be easily verified that Equation (4.6.6) preserves this weight normalization at any iteration (this was explored in Problem 3.4.1).

Let P(xk) be the probability that input xk is presented on any trial. Then, the average learning equation may be expressed as (4.6.7)

where is the conditional probability that unit i wins when input xk is presented. Now, using Equation (4.6.6) in Equation (4.6.7) we get (4.6.8)

which implies that at equilibrium (4.6.9)

Therefore, the jth component of vector wi is given as (4.6.10)

We now make the following observations. First, note that the denominator of Equation (4.6.10) is the probability that unit i wins averaged over all stimulus patterns. Note further that is the probability that (active) and unit i is a winner. Thus, assuming that all patterns have the same number of active bits (i.e., for all k), we may employ Bayes' rule and write Equation (4.6.10) as (4.6.11)

which states that at equilibrium, wij is expected to be proportional to the conditional probability that the jth bit of input xk is active given unit i is a winner.

Next, upon the presentation of a new pattern [assuming the equilibrium weight values given by Equation (4.6.9)], unit i will have a weighted sum (activity) of (4.6.12)

or (4.6.13)

where represents the overlap between stimulus and the kth training pattern xk. Thus, at equilibrium, a unit responds most strongly to patterns that overlap other patterns to which the unit responds and most weakly to patterns that are far from patterns to which it responds. Note that we may express the conditional probability according to the winner-take-all mechanism (4.6.14)

Because of the dependency of on wi, there are many solutions which satisfy the equilibrium relation given in Equation (4.6.9).

Equation (4.6.6) leads the search to one of many stable equilibrium states satisfying Equations (4.6.9) and (4.6.14). In such a state, the ith unit activations become stable (fluctuate minimally), and therefore, becomes stable. A sequence of stimuli might, however, be presented in such a way as to introduce relatively large fluctuations in the wi's. In this case, the system might move to a new equilibrium state which is, generally, more stable in the sense that becomes unlikely to change values for a very long period of time. Rumelhart and Zipser (1985) gave a measure of the stability of an equilibrium state as the average amount by which the output of the winning units is greater than the response of all of the other units averaged over all patterns and all clusters. This stability measure is given by (4.6.15)

where the averaging is taken over all xk, and i* is the index of winning units. Note that Equation (4.6.15) can be written as (4.6.16)

The larger the value of J, the more stable the system is expected to be. Maximizing J can also be viewed as maximizing the overlap among patterns within a group (cluster) while minimizing the overlap among patterns between groups; this is exactly what is required for the clustering of unlabeled data. In geometric terms, J is maximized when the weight vectors point toward maximally compact stimulus (input) regions that are as distant as possible from other such regions.

Goto [4.0][4.1] [4.2] [4.3] [4.4] [4.5] [4.7] [4.8] [4.9] [4.10] Back to the Table of Contents Back to Main Menu