6.4 Clustering Networks

The task of pattern clustering is to automatically group unlabeled input vectors into several categories (clusters), so that each input is assigned a label corresponding to a unique cluster. The clustering process is normally driven by a similarity measure: vectors in the same cluster are similar, which usually means that they are "close" to each other in the input space. A simple clustering net which employs competitive learning has already been covered in Section 3.4. In this section, two additional clustering neural networks are described which have more interesting features than the simple competitive net of Chapter Three. Both networks are capable of automatically discovering (estimating) the underlying number of clusters.

There are various ways of representing clusters. The first network we describe uses a simple representation in which each cluster is represented by the weight vector of a prototype unit (this is also the prototype representation scheme adopted by the simple competitive net). The first network is also characterized by its ability to allocate clusters incrementally. Networks with incremental clustering capability can handle an infinite stream of input data. They do not require large memory for storing training data, because their cluster prototype units contain an implicit representation of all the inputs previously encountered.

The second clustering network described in this section has a more complex architecture than the first net. It employs a distributed cluster representation, as opposed to representing each cluster by a single prototype unit. This network is non-incremental in terms of cluster formation. However, the highly nonlinear multiple-layer architecture of this clustering net enables it to perform well on very difficult clustering tasks. Another interesting property of this net is that it does not require an explicit user-defined similarity measure! The network develops its own internal measure of similarity as part of the training phase.

6.4.1 Adaptive Resonance Theory (ART) Networks

Adaptive resonance architectures are artificial neural networks that are capable of stable categorization of an arbitrary sequence of unlabeled input patterns in real time. These architectures are capable of continuous training with non-stationary inputs. They also solve the "stability-plasticity dilemma"; namely, they let the network adapt while preventing current inputs from destroying past training. The basic principles of the underlying theory of these networks, known as adaptive resonance theory (ART), were introduced by Grossberg (1976). ART networks are biologically motivated and were developed as possible models of cognitive phenomena in humans and animals.

A class of ART architectures, called ART1, is characterized by a system of ordinary differential equations (Carpenter and Grossberg, 1987a), with associated theorems proving its self-stabilization property and the convergence of its adaptive weights. ART1 embodies certain simplifying assumptions which allow its behavior to be described in terms of a discrete-time clustering algorithm. A number of interpretations/simplifications of the ART1 net have been reported in the literature (e.g., Lippmann, 1987; Pao, 1989; Moore, 1989). In the following, we adopt Moore's abstraction of the clustering algorithm from the ART1 architecture and discuss this algorithm in conjunction with a simplified architecture.

The basic architecture of the ART1 net consists of a layer of linear units representing prototype vectors whose outputs are acted upon by a winner-take-all network (described in Section 3.4.1). This architecture is identical to that of the simple competitive net in Figure 3.4.1 with one major difference: The linear prototype units are allocated dynamically, as needed, in response to novel input vectors. Once a prototype unit is allocated, appropriate lateral-inhibitory and self-excitatory connections are introduced so that the allocated unit may compete with pre-existing prototype units. Alternatively, one may assume a pre-wired architecture as in Figure 3.4.1 with a large number of inactive (zero weight) units. Here, a unit becomes active if the training algorithm decides to assign it as a cluster prototype unit, and its weights are adapted accordingly.

The general idea behind ART1 training is as follows. Every training iteration consists of taking a training example xk and searching the existing prototypes (weight vectors wj) for one that is sufficiently similar to xk. If a prototype wi is found to "match" xk (according to a "similarity" test based on a preset matching threshold), example xk is added to wi's cluster and wi is modified to make it better match xk. If no prototype matches xk, then xk becomes the prototype for a new cluster. The details of the ART1 clustering procedure are considered next.

The input vector (pattern) x in ART1 is restricted to binary values, x ∈ {0, 1}^n. Each learned cluster, say cluster j, is represented by the weight vector wj ∈ {0, 1}^n of the jth prototype unit. Every time an input vector x is presented to the ART1 net, each existing prototype unit computes a normalized output (the motivation behind this normalization is discussed later)

yj = (wjᵀx) / ||wj||²        (6.4.1)

and feeds it to the winner-take-all net, which determines the winning unit i. Note that yj is the ratio of the overlap between prototype wj and x to the size of wj. Subject to further verification, the weight vector wi of the winning unit now represents a potential prototype for the input vector. The verification takes the form of two tests.

In order to pass the first test, the input x must be "close enough" to the winner prototype wi, i.e.,

(wiᵀx) / ||wi||²  ≥  ||x||² / n        (6.4.2)

Here, ||x||² and ||wi||² are equal to the number of 1's in x and wi, respectively. Passing this test guarantees that a sufficient fraction of the wi and x bits are matched.

The second test is a match verification test between wi and x. This test is passed if

(wiᵀx) / ||x||²  ≥  ρ        (6.4.3)

where 0 < ρ < 1 is a user-defined "vigilance" parameter. Here, wi is declared to "match" x if a significant fraction (determined by ρ) of the 1's in x appear in wi. Note that Equation (6.4.3) causes more differentiation among input vectors of smaller magnitude. This feature of ART1 is referred to as "automatic scaling," "self-scaling," or "noise-insensitivity."
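
To make the two tests concrete, they can be written as simple predicates on 0/1 NumPy vectors. This is only a minimal sketch of Equations (6.4.2) and (6.4.3); the function names and the vector representation are our own choices, not part of the ART1 formulation.

```python
import numpy as np

def closeness_test(w, x, n):
    # First test, Eq. (6.4.2): the winner's normalized output must be at
    # least as large as that of an unused (all-ones) prototype of length n.
    return (w @ x) / w.sum() >= x.sum() / n

def vigilance_test(w, x, rho):
    # Second test, Eq. (6.4.3): at least a fraction rho of the 1's in x
    # must also appear in w.
    return (w @ x) / x.sum() >= rho
```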

If the above two tests are passed by the winner unit i for a given input x (here, the network is said to be in resonance), then x joins cluster i and this unit's weight vector wi is updated according to

wi^new = wi ∧ x        (6.4.4)

where "∧" stands for the logical AND operation applied component-wise to the corresponding components of the vectors wi and x. According to Equation (6.4.4), a prototype wi can only have fewer and fewer 1's as training progresses. Note that it is possible for a training example to join a cluster but eventually to leave that cluster, because other training examples that joined it later have modified its prototype.

The second scenario corresponds to unit i passing the first test but not the second one. Here, the ith unit is deactivated (its output is clamped to zero until a new input arrives) and the tests are repeated with the unit with the next highest normalized output. If this scenario persists even after all existing prototype units are exhausted, then a new unit representing a new cluster j is allocated and its weight vector wj is initialized according to

wj = x (6.4.5)

In a third scenario, unit i does not pass the first test. Here, wi is declared "too far" from the input x, and a new unit representing a new cluster, j, is allocated with its weight vector wj initialized as in Equation (6.4.5). Hence, wj is initially a binary vector, and since it is updated according to Equation (6.4.4), it conserves its binary nature. This is true for any unit in the ART1 net, since all units undergo the initialization in Equation (6.4.5) upon being allocated.
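
Putting the three scenarios together, a single pass of the simplified ART1 procedure can be sketched as follows, building on the two test functions above. Variable names, the single-pass organization, and the handling of an empty prototype list are our own assumptions; in practice the training set is re-presented until the clusters stabilize.

```python
def art1_cluster(patterns, rho, n):
    """One pass of the simplified ART1 procedure over binary patterns.

    patterns : iterable of 0/1 vectors of length n
    rho      : vigilance parameter, 0 < rho < 1
    Returns the list of prototype vectors and a cluster index per input.
    """
    prototypes, labels = [], []
    for x in patterns:
        x = np.asarray(x)
        # Rank existing prototype units by the normalized output of Eq. (6.4.1).
        order = sorted(range(len(prototypes)),
                       key=lambda j: -(prototypes[j] @ x) / prototypes[j].sum())
        winner = None
        for i in order:
            w = prototypes[i]
            if not closeness_test(w, x, n):
                break                        # third scenario: allocate a new cluster
            if vigilance_test(w, x, rho):    # resonance: update the winning prototype
                prototypes[i] = w * x        # Eq. (6.4.4): component-wise AND
                winner = i
                break
            # second scenario: deactivate unit i and try the next-best unit
        if winner is None:
            prototypes.append(x.copy())      # Eq. (6.4.5): new prototype unit
            winner = len(prototypes) - 1
        labels.append(winner)
    return prototypes, labels
```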

Note that the learning dynamics in the second scenario described above constitute a search through the prototype vectors, examining the closest, the next closest, and so on, as ranked by the maximum-output criterion of Equation (6.4.1). This search continues until a prototype vector is found that satisfies the matching criterion in Equation (6.4.3). These criteria are different: the first measures the fraction of the bits in wj that are also in x, whereas the second measures the fraction of the bits in x that are also in wj. So moving further away by the first measure may actually bring us closer by the second. It should also be noted that this search only occurs before stability is reached for a given training set. After that, each input is matched by a prototype vector on the first attempt and no search is needed.

The normalization factor ||wi||² in Equation (6.4.1) is used as a "tie-breaker." It favors smaller-magnitude prototype vectors over vectors which are supersets of them (i.e., vectors that have 1's wherever the smaller vector does, plus additional 1's) when an input overlaps them equally well. This mechanism of favoring small prototype vectors helps keep the prototype vectors apart. It also helps compensate for the fact that, in the updating step of Equation (6.4.4), the prototype vectors always move toward vectors with fewer 1's.

The vigilance parameter ρ in Equation (6.4.3) controls the granularity of the clusters generated by the ART1 net. Small values of ρ allow large deviations from cluster centers and hence lead to a small set of clusters. On the other hand, a higher vigilance leads to a larger number of tight clusters. Regardless of the setting of ρ, the ART1 network is stable for a finite training set; i.e., the final clusters will not change if additional training is performed with one or more patterns drawn from the original training set. A key feature of the ART1 network is its continuous learning ability. This feature, coupled with the above stability result, allows the ART1 net to follow nonstationary input distributions.

The clustering behavior of the ART1 network is illustrated for a set of random binary vectors. Here, the task of ART1 is to cluster twenty-four uniformly distributed random vectors x ∈ {0, 1}^16. Simulation results are shown in Figure 6.4.1 (a) and (b) for ρ = 0.5 and ρ = 0.7, respectively (here, the vectors are shown as 4 × 4 patterns of "on" and "off" pixels for ease of visualization). The resulting prototype vectors are also shown. Note the effect of the vigilance parameter setting on the granularity of the generated clusters.
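
An experiment of this kind is easy to reproduce in spirit with the sketch above; the random seed is arbitrary and the exact cluster counts will differ from those of Figure 6.4.1, but the trend with increasing vigilance should be the same.

```python
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(24, 16))        # twenty-four random vectors in {0, 1}^16
for rho in (0.5, 0.7):
    protos, labels = art1_cluster(data, rho=rho, n=16)
    print(f"rho = {rho}: {len(protos)} clusters")   # higher vigilance -> more clusters
```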

The family of ART networks also includes more complex models such as ART2 (Carpenter and Grossberg, 1987b) and ART3 (Carpenter and Grossberg, 1990). These models are capable of clustering both binary and analog input patterns. However, they are inefficient from a computational point of view. A simplified model of ART2, ART2-A, has been proposed which is two to three orders of magnitude faster than ART2 (Carpenter et al., 1991b). A supervised real-time learning ART model, called ARTMAP, has also been proposed (Carpenter et al., 1991a).


Figure 6.4.1. Random binary pattern clustering employing the ART1 net. Different vigilance values cause different numbers of categories (clusters) to form: (a) ρ = 0.5 and (b) ρ = 0.7. For each case, the top row shows the prototype vectors extracted by the ART1 network.

An example of ART2 clustering is shown in Figure 6.4.2 (Carpenter and Grossberg, 1987b). Here, the problem is to cluster a set of fifty 25-dimensional real-valued input signals (patterns). The results are shown for two different vigilance levels. It is left to the reader to check (subjectively) the consistency of the formed clusters. A characterization of the clustering behavior of ART2 was given by Burke (1991), who draws an analogy between ART-based clustering and k-means clustering.

ART networks are sensitive to the presentation order of the input patterns; they may yield different clusterings of the same data when the presentation order of the patterns is varied (with all other parameters kept fixed). Similar effects are also present in incremental versions of classical clustering methods such as k-means clustering (e.g., k-means is also sensitive to the initial choice of cluster centers).











Figure 6.4.2. ART2 clustering of analog signals for two vigilance levels: The vigilance value in (a) is smaller than that in (b). (From G. Carpenter and S. Grossberg, 1987b, Applied Optics, 26(23), pp. 4919-4930, ©1987 Optical Society of America.)

6.4.2 Autoassociative Clustering Network

Other data clustering networks can be derived from "concept forming" cognitive models (Anderson, 1983; Knapp and Anderson, 1984; Anderson and Murphy, 1986). A "concept" describes the situation where a number of different objects are categorized together by some rule or similarity relationship. For example, a person is able to recognize that physically different objects are really "the same" (e.g., a person's concept of "tree").

A simple concept forming model consists of two basic interrelated components. The first is a prototype forming component, which is responsible for generating category prototypes via an autoassociative learning mechanism. The second is a retrieval mechanism in which a prototype becomes an attractor in a dynamical system. Here, the prototype and its surrounding basin of attraction represent an individual concept.

Artificial neural networks based on the above concept forming model have been proposed for data clustering of noisy, superimposed patterns (Anderson et al., 1990; Spitzer et al., 1990; Hassoun et al., 1992, 1994a, 1994b). Here, a feedforward (single or multiple layer) net is trained in an autoassociative mode (recall the discussion on autoassociative nets in Section 5.3.6) with the noisy patterns. The training is supervised in the sense that each input pattern to the net serves as its own target. However, these targets are not the "correct answers" in the general sense of supervised training. The strategy here is to force the network to develop internal representations during training so as to better reconstruct the noisy inputs. In the prototype extraction phase, the trained feedforward net is transformed into a dynamical system by feeding the output layer's outputs back as inputs to the net, thus forming an external feedback loop. Now, with proper stabilization, when initialized with one of the input patterns the dynamical system will evolve and eventually converge to the "closest" attractor state. Ideally, these attractors can be identified with the prototypes derived from the training phase.

Next, we describe the above ideas in the context of a simple single layer network (Hassoun and Spitzer, 1988; Anderson et al., 1990) and then present a more general autoassociative clustering architecture. Consider an unlabeled training set of vectors x̃ ∈ R^n representing distorted and/or noisy versions of a set {xk; k = 1, 2, ..., m} of m unknown prototype vectors. Also, assume a single layer net of n linear units, each having an associated weight vector wi, i = 1, 2, ..., n. Let us denote by W the n × n matrix whose ith row is the weight vector wiᵀ. This simple network outputs the vector y ∈ R^n upon the presentation of an input x̃, where y = Wx̃. Now, we update the connection matrix W in response to the presentation of a sample x̃ according to the matrix form of the Widrow-Hoff (LMS) rule

W^new = W + η (x̃ − W x̃) x̃ᵀ        (6.4.6)

which realizes a gradient descent minimization of the criterion function

J(W) = (1/2) ||x̃ − W x̃||²        (6.4.7)

Therefore, the trained net will be represented by a connection matrix W which approximates the mapping

Wx = x (6.4.8)

for all m prototype vectors x. This approximation results from minimizing J in Equation (6.4.7), and it requires that the clusters of noisy samples associated with the different prototype vectors be sufficiently uncorrelated. In addition, it requires that the network be incapable of memorizing the individual training autoassociations and that the number of underlying prototypes m to be learned be much smaller than n.
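
A minimal sketch of this training phase in NumPy is given below. The learning rate η of Equation (6.4.6) and the number of epochs are illustrative values chosen by us, not prescribed by the method.

```python
import numpy as np

def train_linear_autoassociator(samples, eta=0.01, epochs=50):
    """Widrow-Hoff (LMS) training of the single-layer autoassociative net.

    samples : array of shape (N, n) holding the noisy training vectors.
    Returns the n x n connection matrix W, which should approximate the
    mapping W x = x of Eq. (6.4.8) for the underlying prototypes.
    """
    n = samples.shape[1]
    W = np.zeros((n, n))
    for _ in range(epochs):
        for x in samples:
            error = x - W @ x                # reconstruction error for this sample
            W += eta * np.outer(error, x)    # incremental update, Eq. (6.4.6)
    return W
```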

According to Equation (6.4.8), the autoassociative training phase attempts to estimate the unknown prototypes and encode them as eigenvectors of W with eigenvalues equal to 1. A simple method for extracting these eigenvectors (prototypes) is to use the "power" method of eigenvector extraction. This is an iterative method which can be used to pick out the eigenvector with the largest-magnitude eigenvalue of a matrix A by repeatedly passing an initially random vector c^0 through the matrix, according to (also, refer to footnote 6 in Chapter 3)

c^(k+1) = A c^k,    k = 0, 1, 2, ...        (6.4.9)

After a number of iterations, the eigenvector with the largest magnitude eigenvalue will dominate. The above eigenvector extraction method can be readily implemented by adding external feedback to our simple net, with A represented by W and c by y or x.

Once initialized with one of the noisy vectors x̃, the resulting dynamical system evolves its state (the n-dimensional output vector y) in such a way that this state moves towards the prototype vector x "most similar" to x̃. This is possible because the prototype vectors x approximate the dominant eigenvectors of W (with eigenvalues close to 1). Note that the remaining eigenvectors of W arise from learning uncorrelated noise and tend to have small eigenvalues compared to 1. The ability of the autoassociative dynamical net to selectively extract a learned prototype/eigenvector from a "similar" input vector, as opposed to always extracting the most dominant eigenvector, is due to the fact that all learned prototypes have comparable eigenvalues close to unity. Thus, the extracted prototype is the one that is "closest" to the initial state (input vector) of the net.
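
The recall step amounts to the power iteration of Equation (6.4.9) with A replaced by the trained W. The bare-bones sketch below ignores the stabilization issue discussed next, and the fixed iteration count is our own simplification.

```python
def extract_prototype(W, x_noisy, iterations=20):
    """Recall by external feedback: iterate y <- W y as in Eq. (6.4.9).

    Because the learned prototypes behave as eigenvectors of W with
    eigenvalues near 1, while noise directions have much smaller
    eigenvalues, the state drifts toward the learned prototype that is
    closest to the initial (noisy) vector.
    """
    y = np.asarray(x_noisy, dtype=float)
    for _ in range(iterations):
        y = W @ y
    return y
```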

The stability of the dynamic autoassociative net is an important design issue. Stability is determined by the network weights and network architecture. Care must be taken in matching the learning algorithm for prototype encoding in the feedforward net to the dynamic architecture which is ultimately used for prototype extraction. One would like to design an autoassociative clustering net which minimizes an associated energy or Liapunov function; i.e., starting from any initial state, the system's state always evolves along a trajectory for which the energy function is monotonically decreasing (the reader may refer to Section 7.1.2 for further details on Liapunov functions and stability). Anderson et al. (1990) reported a stable autoassociative clustering net based on the brain-state-in-a-box (BSB) concept forming model (Anderson et al., 1977; also, see Section 7.4.1).

A serious limitation of the linear autoassociative net we have just described is that it does not allow the user to control the granularity of the formed clusters, i.e., the number of learned prototypes. This network will tend to merge different clusters that are "close" to each other in the input space, due to the lack of a cluster competition mechanism (recall the similarity and vigilance tests employed in the ART1 net for controlling cluster granularity). When two different clusters are merged by the linear net, they become represented by a distorted prototype which is a linear combination of the two correct (but unknown) prototypes. Introducing nonlinearity into the autoassociative net architecture can help the net escape this limitation by allowing control of the granularity of the clusters. This feature is discussed in connection with the dynamic nonlinear multilayer autoassociative network (Spitzer et al., 1990; Wang, 1991; Hassoun et al., 1992) considered next.

Consider a feedforward net with two hidden layers and an output layer of linear units. All hidden layer units employ the hyperbolic tangent activation function

f(net) = tanh(λ net)        (6.4.10)

where λ controls the slope of the activation. The activation slopes of the second hidden layer units (the layer between the first hidden layer and the output layer) are fixed (typically set to 1). On the other hand, the activation slopes of the units in the first hidden layer are made to increase monotonically during training, as explained below. Each hidden unit in the first layer receives inputs from all components of an n-dimensional input vector and an additional bias input (held fixed at 1). Similarly, each unit in the second hidden layer receives inputs from all units in the first hidden layer plus a bias input of 1. Finally, each linear output unit receives inputs from all second hidden layer units plus a bias.

This layered network functions as an autoassociative net, as described above. Therefore, the n output layer units serve to reconstruct (decode) the n-dimensional vector presented at the input. We wish the network to discover a limited number of representations (prototypes) of a set of noisy input vectors that describe the training data. An essential feature of the network's architecture is therefore a restrictive "bottleneck" in the hidden layers; the number of units in each hidden layer (especially the first hidden layer) is small compared to n. The effect of this bottleneck is to restrict the degrees of freedom of the network and constrain it to discovering a limited set of unique prototypes which describes (clusters) the training set. The network does not have sufficient capacity to memorize the training set. This clustering is further enhanced by aspects of the learning algorithm, described below. Autoassociative multilayer nets with a hidden layer bottleneck have been studied by Bourlard and Kamp (1988), Baldi and Hornik (1989), Funahashi (1990), Hoshino et al. (1990), Kramer (1991), Oja (1991), and Usui et al. (1991), among others. In these studies, such nets have been found to implement principal component analysis (PCA) when one hidden layer is used and nonlinear PCA when three or more hidden layers are used. The reader is referred to Section 5.3.6 for details.
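
For concreteness, the forward pass of such a bottleneck autoassociator can be sketched as follows. The dictionary-of-arrays parameterization, the layer-size names h1 and h2, and the omission of the backprop training step are our own simplifications.

```python
import numpy as np

def forward(x, params, lam):
    """Forward pass of the two-hidden-layer autoassociative net (n -> h1 -> h2 -> n).

    params holds weight matrices and biases: W1 (h1 x n), b1 (h1,), W2 (h2 x h1),
    b2 (h2,), W3 (n x h2), b3 (n,), with h1, h2 much smaller than n (the bottleneck).
    lam is the current slope of the first hidden layer's tanh activations.
    """
    h1 = np.tanh(lam * (params["W1"] @ x + params["b1"]))  # Eq. (6.4.10), dynamic slope
    h2 = np.tanh(params["W2"] @ h1 + params["b2"])         # slope fixed at 1
    y = params["W3"] @ h2 + params["b3"]                   # linear reconstruction layer
    return h1, h2, y
```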

The learning algorithm employed is essentially the incremental backprop algorithm of Section 5.1.1 with simple heuristic modifications. These heuristics significantly enhance the network's tendency to discover the best prototypical representations of the training data. These modifications include a dynamic slope for the first hidden layer activations that saturates during learning, and damped learning rate coefficients. As a result of learning, the first hidden layer in this net discovers a set of bipolar binary distributed representations that characterize the various input vectors. The second hidden and the output layers perform a nonlinear mapping which decodes these representations into reduced-noise versions of the input vectors.

In order to enhance the separation of clusters and promote the grouping of similar input vectors, the slope of the activation functions of the first hidden layer units is made dynamic; it increases monotonically during learning according to

λ(k) = β^k        (6.4.11)

where β is greater than, but close to, 1 and k is the learning step index. As a result, the nonlinearity gradually (over a period of many cycles through the whole training set) becomes the sgn (sign) function, and the outputs of the first hidden layer become functionally restricted to bipolar binary values. As these activations saturate, only a limited number of representations for "features" of the input vectors remain available. This gradually forces "similar" inputs to activate a unique distributed representation at this layer. The larger the value of β, the faster the activations saturate, which in turn increases the sensitivity of the first hidden layer representations to differences among the input vectors. This increased sensitivity increases the number of unique representations at this layer, thus leading the rest of the network to reconstruct an equal number of prototypes. Hence, the slope saturation parameter β controls cluster granularity and may be viewed as a vigilance parameter similar to that of the ART nets.
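
The effect of this schedule is easy to see in a one-line helper (the β value is taken from the simulations reported below; the function name is ours):

```python
def first_hidden_activation(net, k, beta=1.0003):
    # Eqs. (6.4.10)-(6.4.11): tanh with slope beta**k. For small k the unit
    # responds smoothly; as k grows the activation approaches the sign
    # function, so the layer's outputs become effectively bipolar binary.
    return np.tanh((beta ** k) * net)
```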

The mapping characteristics of this highly nonlinear first hidden layer may also be thought to emerge from a kind of nonlinear principal component analysis (PCA) mechanism where unit activities are influenced by high order statistics of the training set. The other modification to backprop is the use of exponentially damped learning coefficients. The tendency is for the network to best remember the most recently presented training data. A decaying learning coefficient helps counteract this tendency and balance the sensitivity of learning for all patterns. The learning rate coefficient used is therefore dynamically adjusted according to

η(k) = α^k        (6.4.12)

where α is a predefined constant less than, but close to, unity. As a result of this exponentially decaying learning rate, learning initially proceeds rapidly; the subsequently declining rate produces a de-emphasis of the most recently learned input vectors, which reduces "forgetting" effects and allows the repeating patterns to be learned evenly.
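
As a small illustration, with the typical setting α = 0.9999 quoted below, the coefficient of Equation (6.4.12) decays gently over thousands of presentations (the helper name and the quoted figure of about 37% are our own arithmetic):

```python
def learning_rate(k, alpha=0.9999):
    # Eq. (6.4.12): geometrically decaying learning coefficient. With
    # alpha = 0.9999 the rate falls to roughly 37% of its initial value
    # after 10,000 pattern presentations (0.9999**10000 ~ 0.37).
    return alpha ** k
```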

In the prototype extraction phase, a dynamic net is generated by feeding the trained net's output back to the input. A pass is now made over the training set, but this time no learning occurs. The primary objective of this pass is to classify the vectors in the training data and to extract the prototype discovered by the network for each cluster. As each vector is presented, an output is generated and fed back to the input of the network. This process is repeated iteratively until convergence. When the network settles into a final state, the outputs of the first hidden layer converge to a bipolar binary state. This binary state gives an intermediate distributed representation (activity pattern) of the particular cluster containing the present input. The intermediate activity pattern is mapped by the rest of the net into a real-valued activity pattern at the output layer. This output can be taken as a "nominal" representation of the underlying cluster "center"; i.e., a prototype of the cluster containing the current input. Therefore, the network generates two sets of prototypes (concepts): abstract binary-valued concepts and explicit real-valued concepts. The network also supplies the user with parallel implementations of the two mappings from one concept representation to the other; the first hidden layer maps the input vectors into their corresponding abstract concepts, while the second hidden layer and the output layer implement the inverse of this mapping.
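
This extraction pass can be sketched by reusing the forward() function above; the fixed iteration count stands in for a convergence test and is our own simplification.

```python
def extract_prototypes(samples, params, lam_final, iterations=30):
    """Recall phase of the multilayer net: feed its output back to its input.

    For each sample, returns the saturated bipolar code of the first hidden
    layer (the abstract concept) and the converged output vector (the
    explicit, real-valued cluster prototype).
    """
    codes, prototypes = [], []
    for x in samples:
        y = np.asarray(x, dtype=float)
        for _ in range(iterations):
            h1, _, y = forward(y, params, lam_final)   # forward() as sketched earlier
        codes.append(np.sign(h1))
        prototypes.append(y)
    return codes, prototypes
```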

Proving the stability of this dynamic multilayer autoassociative clustering net is currently an open problem. The highly nonlinear nature of this dynamical system makes the analysis difficult. More specifically, it would be difficult to find an appropriate Liapunov function, if one exists, to prove stability. However, empirical evidence suggests a high degree of stability when the system is initialized with one of the training vectors and/or a new vector which is "similar" to any one of the training vectors.

We conclude this section by presenting two simulation results which illustrate the capabilities of the above autoassociative clustering net. In the first simulation, the net is used to cluster the 50 analog patterns of Figure 6.4.2. These patterns are repeatedly presented to the net until the slopes in the first hidden layer rise to 200, according to Equation (6.4.11). Two experiments were performed, with β = 1.0003 and β = 1.0005. The network used had eight units in each of the first and second hidden layers. Figures 6.4.3 (a) and (b) show the learned prototypes (top row) and their associated cluster members (in the corresponding columns) for β = 1.0003 and β = 1.0005, respectively (β = 1.0003 required about 350 training cycles to saturate at 200, while β = 1.0005 required about 210 cycles). A typical setting for α in Equation (6.4.12) is 0.9999.
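
These cycle counts can be checked directly from Equation (6.4.11), assuming the step index k advances once per pattern presentation and one training cycle covers the 50 patterns:

```python
import math

for beta in (1.0003, 1.0005):
    presentations = math.log(200) / math.log(beta)   # solve beta**k = 200 for k
    print(f"beta = {beta}: about {presentations / 50:.0f} cycles")
# Prints roughly 353 and 212 cycles, in line with the ~350 and ~210 quoted above.
```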












Figure 6.4.3. Clusters of patterns formed by the multilayer autoassociative clustering net: (a) β = 1.0003 and (b) β = 1.0005. The top row shows the learned prototype for each cluster.

It is interesting to note that the network size and the parameters β and α were not optimized for this particular clustering problem. In fact, these settings were also found to be appropriate for clustering motor unit potentials in the electromyogram (EMG) signal (Wang, 1991). It is left to the reader to compare the clustering results in Figure 6.4.3 to those of the ART2 net shown in Figure 6.4.2. Note how the value of β controls cluster granularity.

In the second simulation, the autoassociative clustering net (with size and parameters comparable to those of the first simulation) is used to decompose a 10-second recording of an EMG signal obtained from a patient with a diagnosis of myopathy (a form of muscle disease). A segment of this signal (about 0.8 sec) is shown in the top window in Figure 6.4.4. The objective here is to extract the prototype motor unit potentials (MUPs) comprising this signal and to associate each noisy MUP in the signal with its correct prototype. A preprocessing algorithm detected 784 candidate MUPs in the 10-second EMG recording (each MUP pattern is uniformly sampled and represented by a 50-dimensional vector centered at the highest peak of the MUP waveform). These detected noisy MUPs were used to train the autoassociative clustering net. The clustering results are shown in Figure 6.4.4. A total of 24 unique prototype MUPs were identified by the network. The figure shows, in small windows, the prototypes associated with the 11 most significant clusters; i.e., the clusters to which a significant number of noisy input MUPs converged. Examples of MUPs from each cluster are shown to the right of the discovered cluster prototypes. This result is superior to those of existing automated EMG decomposition techniques. For a detailed description and analysis of this problem, the reader may consult Wang (1991), Hassoun et al. (1994a), and Hassoun et al. (1994b).












Figure 6.4.4. Decomposition of an EMG signal by the autoassociative clustering net. One segment of an EMG (about 0.8 sec.) is shown in the top window. The extracted MUP waveform prototypes are shown in small windows. Examples of classified waveforms are shown to the right.
