3.4 Competitive Learning

In the previous section, we considered simple Hebbian-based networks of linear units which employed some degree of competition through lateral inhibition in order for each unit to capture a principal component of the training set. In this section, we extend this notion of competition among units and specialization of units to tackle a different class of problems involving clustering of unlabeled data or vector quantization. Here, a network of binary-valued outputs, with only one "on" at a time, could tell which of several categories an input belongs to. These categories are to be discovered by the network on the basis of correlations in the input data. The network would then classify each cluster of "similar" input data as a single output class.

Because we are now dealing with competition, it only makes sense to consider a group of interacting units. We assume the simplest architecture where we have a single layer of units, each receiving the same input x Rn and producing an output yi. We also assume that only one unit is active at a given time. This active unit is called the "winner" and is determined as the unit with the largest weighted-sum , where

(3.4.1)

and xk is the current input. Thus, unit i is the winning unit if

(3.4.2)

which may be written as

(3.4.3)

if for all i = 1, 2, ..., m. Thus, the winner is the node with the weight vector closest (in a Euclidean distance sense) to the input vector. It is interesting to note that lateral inhibition may be employed here in order to implement the "winner-take-all" operation in Equation (3.4.2) or (3.4.3). This is similar to what we have described in the previous section with a slight variation: Each unit inhibits all other units and self-excites itself, as shown in Figure 3.4.1.

Figure 3.4.1. Single layer competitive network.

In order to assure winner-take-all operation, a proper choice of lateral weights and unit activation functions must be made (e.g., see Grossberg, 1976, and Lippmann, 1987). One possible choice for the lateral weights is

(3.4.4)

where 0 < < and m is the number of units in the network. An appropriate activation function for this type of network is shown in Figure 3.4.2 where T is chosen such that the outputs yi do not saturate at 1 before convergence of the winner-take-all competition; after convergence, only the winning unit will saturate at 1 with all other units having zero outputs. Note, however, that if one is training the net as part of a computer simulation, there is no need for the winner-take-all net to be implemented explicitly; it is more efficient from a computation point of view to perform the winner selection by direct search for the maximum neti. Thus far, we have only described the competition mechanism of the competitive learning technique. Next, we give a learning equation for weight updating.

Figure 3.4.2. Activation function for units in the competitive network of Figure 3.4.1.

For a given input xk drawn from a random distribution p(x), the weights of the winning unit are updated (the weights of all other units are left unchanged) according to (Grossberg, 1969; von der Malsburg, 1973; Rumelhart and Zipser, 1985):

(3.4.5)

If the magnitudes of the input vectors contain no useful information, a more appropriate rule to use is

(3.4.6)

The above rules tend to tilt the weight vector of the current winning unit in the direction of the current input. The cumulative effect of the repetitive application of the above rules can be described as follows. Let us view the input and weight vectors as points scattered on the surface of a hypersphere (or a circle as in Figure 3.4.3). The effect of the application of the competitive learning rule is to sensitize certain units towards neighboring clusters of input data. Ultimately, some units (frequent winner units) will evolve so that their weight vector points towards the "center of mass" of the nearest significant dense cluster of data points. This effect is illustrated pictorially for a simple example in Figure 3.4.3.

(a) (b)

Figure 3.4.3. Simple competitive learning. (a) Initial weight vectors; (b) weight vector configuration after performing simple competitive learning.

At the onset of learning, we do not know the number of clusters in the data set. So we normally overestimate this number by including excess units in the network. This means that after convergence, some units will be redundant in the sense that they do not evolve significantly and thus do not capture any data clusters. This can be seen, for example, in Figure 3.4.3 where the weight vector w1 does not significantly evolve throughout the computation. These are typically the units that are initialized to points in the weight space that have relatively low overlap with the training data points; such units may still be desirable since they may capture new clusters if the underlying distribution p(x) changes in time, or, if desired, their probability of occurrence may be reduced by using some of the following heuristics: (1) Initialize the weight vectors to randomly selected sample input vectors; (2) use "leaky learning" (Rumelhart and Zipser, 1985). Here, all loser units are also updated in response to an input vector, but with a learning rate much smaller than the one used for the winner unit; (3) update the neighboring units of the winner unit; and/or (4) smear the input vectors with added noise using a long-tailed distribution (Szu, 1986).

After learning has converged, it is necessary to "calibrate" the above clustering network in order to determine the number of units representing the various learned clusters, and cluster membership. Here, the weights of the net are held fixed and the training set is used to interrogate the network for sensitized units. The following example demonstrates the above ideas for a simple data clustering problem.

Example 3.4.1: Consider the set of unlabeled two-dimensional data points plotted in Figure 3.4.4. Here, we demonstrate the typical solution (cluster formation) generated by a four-unit competitive net, which employs Equations (3.4.2) and (3.4.5). No normalization of either the weight vectors or the input vectors is used. During training, the input samples (data points) are selected uniformly at random from the data set. The four weight vectors are initialized to random values close to the center of the plot in Figure 3.4.4. Training with  = 0.01 is performed for 1000 iterations. The resulting weight vector trajectories are plotted in Figure 3.4.5(a) with the initial weights marked by "*". Note how one of the units never evolved beyond its initial weight vector. This is because this unit never became a winner. Each of the remaining three units evolved its weight vector to a distinct region in the input space. Since is small, the weight vectors are shown to enter and stay inside "tiny" neighborhoods; the weight vectors fluctuate inside their respective terminal neighborhoods. These fluctuations are amplified if a relatively larger ( = 0.05) value is used, as demonstrated in Figure 3.4.5(b). Finally, the trained network is calibrated by assigning a unique cluster label for any unit which becomes a winner for one or more training samples. Here, Equation (3.4.2) is used to determine the winning unit. During calibration, only three units are found to ever become winners. Figure 3.4.6 depicts the result of the calibration process. Here, the symbols "+", "o", and "" are used to tag the three winner units, respectively. The figure shows a "+", "o", or "" printed at the exact position of the training sample x, if x causes the unit with label "+", "o", or "" to be a winner, respectively. We should note that the exact same clusters in Figure 3.4.6 are generated by the net with  = 0.05. The cluster labeled "" looks interesting because of its obvious bimodal structure. Intuitively speaking, one would have expected this cluster to be divided by the competitive net into two distinct clusters. In fact, if one carefully looks at Figure 3.4.5(b), this is what the net attempted to do, but failed. However, very few simulations out of a large number of simulations which were attempted, but are not shown here, did in fact result in this "intuitive" solution. In Chapter 4, the competitive learning rule is Equation (3.4.5) is shown to correspond to stochastic gradient descent on the criterion , where is the weight vector of the winner unit. Therefore, one may explain the three-cluster solution in this example as representing a suboptimal solution which corresponds to a local minimum of J(w).

Figure 3.4.4. Unlabeled two-dimensional data used in training the simple competitive net of Example 3.4.1.

(a)

(b)

Figure 3.4.5. Weight vector evolution trajectories for a four-unit competitive net employing Equations (3.4.2) and (3.4.5). These trajectories are shown superimposed on the plane containing the training data. A "*" is used to indicate the initial setting of the weights for each of the four units. (a) Learning rate equals 0.01, and (b) learning rate equals 0.05.

Figure 3.4.6. A three-cluster solution for the data shown in Figure 3.4.4. This solution represents a typical cluster formation generated by the simple four-unit competitive net of Example 3.4.1.

One of the common applications of competitive learning is adaptive vector quantization for data compression (e.g., speech and image data). Here, we need to categorize a given set of xk data points (vectors) into m "templates" so that later one may use an encoded version of the corresponding template of any input vector to represent the vector, as opposed to using the vector itself. This leads to efficient quantization (compression) for storage and for transmission purposes (albeit at the expense of some distortion).

Vector quantization is a technique whereby the input space is divided into a number of distinct regions, and for each region a "template" (reconstruction vector) is defined (Linde et al., 1980; Gray, 1984). When presented with a new input vector x, a vector quantizer first determines the region in which the vector lies. Then, the quantizer outputs an encoded version of the reconstruction vector wi representing that particular region containing x. The set of all possible reconstruction vectors wi is usually called the "codebook" of the quantizer. When the Euclidean distance similarity measure is used to decide on the region to which the input x belongs, the quantizer is called Voronoi quantizer. The Voronoi quantizer partitions its input space into Voronoi cells (Gray, 1984), each cell is represented by one of the reconstruction vectors, wi. The ith Voronoi cell contains those points of the input space that are closest (in a Euclidean sense) to the vector wi than to any other vector wj, j  i. Figure 3.4.7 shows an example of the input space partitions of a Voronoi quantizer with four reconstruction vectors.

Figure 3.4.7. Input space partitions realized by a Voronoi quantizer with four reconstruction vectors. The four reconstruction vectors are shown as filled circles in the figure.

The competitive learning rule in Equation (3.4.5) with a winning unit determination based on the Euclidean distance as in Equation (3.4.3) may now be used in order to allocate a set of m reconstruction vectors , i = 1, 2, ..., m, to the input space of n-dimensional vectors x. Let x be distributed according to the probability density function p(x). Initially, we set the starting values of the vectors wi to the first m randomly generated samples of x. Additional samples x are then used for training. Here, the learning rate in Equation (3.4.5) is selected as a monotonically decreasing function of the number of iterations k. Based on empirical results, Kohonen (1989) conjectured that, in an average sense, the asymptotic local point density of the wi's (i.e., the number of wi falling in a small volume of Rn centered at x) obtained by the above competitive learning process takes the form of a continuous, monotonically increasing function of p(x). Thus, this competitive learning algorithm may be viewed as an "approximate" method for computing the reconstruction vectors wi in an unsupervised manner.

Kohonen (1989) designed supervised versions of vector quantization (called learning vector quantization, LVQ) for adaptive pattern classification. Here, class information is used to fine tune the reconstruction vectors in a Voronoi quantizer, so as to improve the quality of the classifier decision regions. In pattern classification problems, it is the decision surface between pattern classes and not the inside of the class distribution, which should be described most accurately. The above quantizer process can be easily adapted in order to optimize the placement of the decision surface between different classes. Here, one would start with a trained Voronoi quantizer and calibrate it using a set of labeled input samples (vectors). Each calibration sample is assigned to that wi which is closest. Each wi is then labeled according to the majority of classes represented among those samples which have been assigned to wi. Here, the distribution of the calibration samples to the various classes, as well as the relative numbers of the wi assigned to these classes must comply with the apriori probabilities of the classes, if such probabilities are known. Next, the tuning of the decision surfaces is accomplished by rewarding correct classifications and punishing incorrect ones. Let the training vector xk belong to the class cj. Assume that the closest reconstruction vector wi to xk carries the label of class cl. Then, only vector wi is updated according to the following supervised rule (LVQ rule):

(3.4.7)

where k is assumed to be a monotonically decreasing function of the number of iterations k. After convergence, the input space Rn is again partitioned by a Voronoi tessellation corresponding to the tuned wi vectors. The primary effect of the reward/punish rule in Equation (3.4.7) is to minimize the number of misclassifications. At the same time, however, the vectors wi are pulled away from the zones of class overlap where misclassifications persist.

The convergence speed of LVQ can be improved if each vector wi has its own adaptive learning rate given by (Kohonen, 1990):

(3.4.8)

This recursive rule causes i to decrease if wi classifies xk correctly. Otherwise, i increases. Equations (3.4.7) and (3.4.8) define what is known as an "optimized learning rate" LVQ (OLVQ). Another improved algorithm named LVQ2 has also been suggested by Kohonen et al. (1988) which approaches the performance predicted by Bayes decision theory (Duda and Hart, 1973).

Some theoretical aspects of competitive learning are considered in the next chapter. More general competitive networks with stable categorization behavior have been proposed by Carpenter and Grossberg (1987 a, b). One of these networks, called ART1, is described in Chapter 6.

Goto [3.0] [3.1] [3.2] [3.3] [3.5] [3.6]