3.5 Self-Organizing Feature Maps: Topology Preserving Competitive Learning

Self-organization is a process of unsupervised learning whereby significant patterns or features in the input data are discovered. In the context of a neural network, self-organized learning consists of adaptively modifying the synaptic weights of a network of locally interacting units in response to input excitations and in accordance with a learning rule, until a final useful configuration develops. The local interaction of units means that changes in the behavior of a unit only (directly) affect the behavior of its immediate neighborhood. The key question here is how a useful configuration could evolve from self-organization. The answer lies essentially in a naturally observed phenomenon whereby global order can arise from local interactions (Turing, 1952). This phenomenon applies to neural networks (biological and artificial), where many originally random local interactions between neighboring units (neurons) of a network couple and coalesce into states of global order. This global order leads to coherent behavior, which is the essence of self-organization.

In the following, a modified version of the simple competitive learning network discussed in Section 3.4.1 is presented, which exhibits self-organization features. This network attempts to map a set of input vectors xk in Rn onto an array of units (normally one- or two-dimensional) such that any topological relationships among the xk patterns are preserved and are represented by the network in terms of a spatial distribution of unit activities. The more related two patterns are in the input space, the closer we expect the positions in the array of the two units representing these patterns to be. In other words, if x1 and x2 are "similar" or are topological neighbors in Rn, and if r1 and r2 are the locations of the corresponding winner units in the net/array, then the Euclidean distance ||r1 - r2|| is expected to be small. Also, ||r1 - r2|| approaches zero as x1 approaches x2. The idea is to develop a topographic map of the input vectors, so that similar input vectors would trigger nearby units. Thus, a global organization of the units is expected to emerge.

An example of such topology preserving self-organizing mappings which exist in animals is the somatosensory map from the skin onto the somatosensory cortex, where there exists an image of the body surface. The retinotopic map from the retina to the visual cortex is another example. It is believed that such biological topology preserving maps are not entirely preprogrammed by the genes and that some sort of (unsupervised) self-organizing learning phenomenon exists that tunes such maps during development. Two early models of topology preserving competitive learning were proposed by von der Malsburg (1973) and Willshaw and von der Malsburg (1976) for the retinotopic map problem. In this section, we present a detailed topology preserving model due to Kohonen (1982a) which is commonly referred to as the self-organizing feature map (SOFM).

3.5.1 Kohonen's SOFM

The purpose of Kohonen's self-organizing feature map is to capture the topology and the probability distribution of input data (Kohonen, 1982a and 1989). This model generally involves an architecture consisting of a two-dimensional structure (array) of linear units, where each unit receives the same input xk, xk ∈ Rn. Each unit in the array is characterized by an n-dimensional weight vector. The ith unit weight vector wi is sometimes viewed as a position vector which defines a "virtual position" for unit i in Rn. This, in turn, will allow us to interpret changes in wi as movements of unit i. However, one should keep in mind that no physical movements of units take place.

The learning rule is similar to that of simple competitive learning in Equation (3.4.5) and is defined by:

wi(k + 1) = wi(k) + η(k) Λ(i, i*) [xk − wi(k)]     (3.5.1)

where i* is the index of the winner unit. The winner unit is determined according to the Euclidean distance as in Equation (3.4.3), with no weight vector normalization. The major difference between this update rule and that of simple competitive learning is the presence of the neighborhood function Λ(i, i*) in the former. This function is very critical for the successful preservation of topological properties. It is normally symmetric [i.e., Λ(i, i*) = Λ(i*, i)] with values close to one for units i close to i* and decreases monotonically with the Euclidean distance between units i and i* in the array.
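The winner search and weighted update of Equation (3.5.1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tensor layout, function name, and the Gaussian form of Λ (introduced later in this section) are assumptions.

```python
import numpy as np

def sofm_step(weights, x, eta, sigma):
    """One SOFM update (Equation 3.5.1) on a 2-D array of units.

    weights: (rows, cols, n) tensor of unit weight vectors.
    x:       one input vector in R^n.
    eta:     learning rate at this step.
    sigma:   neighborhood width at this step (Gaussian assumed).
    """
    rows, cols, n = weights.shape
    # Winner unit i*: closest weight vector to x in Euclidean distance,
    # with no weight vector normalization (as in Equation 3.4.3).
    dists = np.linalg.norm(weights - x, axis=2)
    i_star = np.unravel_index(np.argmin(dists), (rows, cols))
    # Neighborhood function: Gaussian in the array coordinates, equal to
    # one at the winner and decaying with distance from it in the array.
    r = np.dstack(np.mgrid[0:rows, 0:cols]).astype(float)
    d2 = np.sum((r - np.array(i_star)) ** 2, axis=2)
    lam = np.exp(-d2 / (2.0 * sigma ** 2))
    # Move every unit toward x, scaled by eta and its neighborhood value.
    weights += eta * lam[:, :, None] * (x - weights)
    return i_star
```

Note that every unit moves toward xk; the neighborhood value merely scales the step, so units far from the winner are essentially frozen once σ becomes small.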

At the onset of learning, Λ(i, i*) defines a relatively large neighborhood whereby all units in the net are updated for any input xk. As learning progresses, the neighborhood is shrunk down until it ultimately goes to zero, where only the winner unit is updated. The learning rate η must also follow a monotonically decreasing schedule in order to achieve convergence. One may think of the initial large neighborhood as effecting an exploratory global search which is then continuously refined to a local search as the variance of Λ approaches zero. The following is one possible choice for Λ(i, i*):


Λ(i, i*) = exp(−||ri − ri*||² / 2σ²(k))     (3.5.2)

where the variance σ²(k) controls the width of the neighborhood, and ri and ri* denote the positions of units i and i* in the array. Here, η and σ may be set proportional to 1/k, with k representing the learning step. Ritter and Schulten (1988a) proposed the following update schedules for η:

η(k) = η0 (ηf / η0)^(k/kmax)     (3.5.3)

and for σ:

σ(k) = σ0 (σf / σ0)^(k/kmax)     (3.5.4)

with η0 (σ0) and ηf (σf) controlling the initial and final values of the learning rate (neighborhood width), respectively, and kmax is the maximum number of learning steps anticipated.

There is no theory to specify the parameters of these learning schedules for arbitrary training data. However, practical values are 0 < η0 ≤ 1 (typically η0 = 0.8), ηf << 1, σ0 ≈ md/2 (md is the number of units along the largest diagonal of the array), or an equivalent value that will permit Λ(i, i*; k = 0) to reach all units when i* is set close to the center of the array, and σf = 0.5. Finally, kmax is usually set to two or more orders of magnitude larger than the total number of units in the net.
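The exponential schedules of Equations (3.5.3) and (3.5.4) share one functional form, which can be written compactly; the function name below is illustrative:

```python
def schedule(v0, vf, k, k_max):
    """Ritter-Schulten exponential schedule: interpolates from v0 at
    step k = 0 down to vf at step k = k_max. Used for both the learning
    rate eta and the neighborhood width sigma."""
    return v0 * (vf / v0) ** (k / k_max)

# Example parameter choices following the practical values in the text:
eta0, etaf = 0.8, 0.01       # learning rate: initial / final
sigma0, sigmaf = 5.0, 0.5    # neighborhood width for a 10 x 10 array
k_max = 50_000

eta_start = schedule(eta0, etaf, 0, k_max)        # 0.8 at step 0
eta_end = schedule(eta0, etaf, k_max, k_max)      # 0.01 at step k_max
```

The same call serves both parameters, which keeps the decay rates of η and σ synchronized over the kmax learning steps.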

Let the input vector x be a random variable with a stationary probability density function p(x). Then, the basis for Equation (3.5.1) and various other self-organizing algorithms is captured by the following two step process: (1) Locate the best-matching unit for the input vector x; (2) Increase matching at this unit and its topological neighbors. The computation achieved by a repetitive application of this process is rather surprising and is captured by the following proposition due to Kohonen (1989): "The wi vectors tend to be ordered according to their mutual similarity, and the asymptotic local point density of the wi, in an average sense, is of the form g(p(x)), where g is some continuous, monotonically increasing function." This proposition is tested by the following experiments.

3.5.2 Examples of SOFMs

Figure 3.5.1 shows an example of mapping uniform random points inside the positive unit square onto a 10 × 10 planar array of units with rectangular topology. Here, each unit i has four immediate neighboring units, which are located symmetrically at a distance of 1 from unit i. In general, however, other array topologies such as hexagonal topology may be assumed. In the figure, the weight vectors of all units are shown as points superimposed on the input space. Connections are shown between points corresponding to topologically neighboring units in the array, for improved visualization. Thus, a line connecting two weight vectors wi and wj is only used to indicate that the two corresponding units i and j are adjacent (immediate neighbors) in the array. The weights are initialized randomly near (0.5, 0.5) as shown in Figure 3.5.1 (a). During training, the inputs to the units in the array are selected randomly and independently from a uniform distribution p(x) over the positive unit square. Figures 3.5.1 (b)-(d) show snapshots of the time evolution of the feature map. Initially, the map untangles and orders its units as in parts (b) and (c) of the figure. Ultimately, as shown in Figure 3.5.1(d), the map spreads to fill all of the input space except for the border region, which shows a slight contraction of the map. Figures 3.5.2 through 3.5.4 show additional examples of mapping uniformly distributed points from a disk, a triangle, and a hollow disk, respectively, onto a 15 × 15 array of units. The initial weight distribution is shown in part (a) of each of these figures. Learning rule parameters are provided in the figure captions.
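The experiment of Figure 3.5.1 can be reproduced qualitatively with a short script. This is a sketch, not the original code: it assumes the Gaussian neighborhood of Equation (3.5.2) and the Ritter-Schulten schedules, with the parameter values taken from the figure caption; the random seed and the spread of the initial weights are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
rows = cols = 10
# Weights initialized randomly near (0.5, 0.5), as in Figure 3.5.1 (a).
W = 0.5 + 0.01 * rng.standard_normal((rows, cols, 2))
# Fixed positions r_i of the units in the array.
grid = np.dstack(np.mgrid[0:rows, 0:cols]).astype(float)

eta0, etaf = 0.8, 0.01
sigma0, sigmaf = 5.0, 1.0
k_max = 50_000

for k in range(k_max):
    x = rng.random(2)                  # uniform sample from the unit square
    eta = eta0 * (etaf / eta0) ** (k / k_max)
    sigma = sigma0 * (sigmaf / sigma0) ** (k / k_max)
    # Winner unit: closest weight vector in Euclidean distance.
    i_star = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)),
                              (rows, cols))
    # Gaussian neighborhood in array coordinates (Equation 3.5.2).
    d2 = np.sum((grid - np.array(i_star)) ** 2, axis=2)
    lam = np.exp(-d2 / (2.0 * sigma ** 2))
    W += eta * lam[:, :, None] * (x - W)
```

After training, plotting the rows and columns of W as connected line segments should show the map spread over the unit square, with the slight border contraction noted below.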

The feature maps in Figures 3.5.1 through 3.5.4 all share the following properties. First, it is useful to observe that there are two phases in the formation of the map: an ordering phase and a convergence phase. The ordering phase involves the initial formation of the correct topological ordering of the weight vectors. This is roughly accomplished during the first several hundred iterations of the learning algorithm. The fine tuning of the map is accomplished during the convergence phase, where the map converges asymptotically to a solution which approximates p(x). During this phase, the neighborhood width σ² and the learning rate η take on very small values, thus contributing to slow convergence. For good results, the convergence phase may take 10 to 1000 times as many steps as the ordering phase. Another common property in the above SOFMs is a border aberration effect which causes a slight contraction of the map, and a higher density of weight vectors at the borders. This aberration is due to the "pulling" by the units inside the map. The important thing, though, is to observe how in all cases the wi vectors are ordered according to their mutual similarity, which preserves the topology of the Euclidean input space. Another important result is that the density of the weight vectors in the weight space follows the uniform probability distribution of the input vectors. If the distribution p(x) had been non-uniform, we would have found more grid points of the map where p(x) was high (this is explored in Problem 3.5.5).

(a) Iteration 0 (b) Iteration 100

(c) Iteration 10,000 (d) Iteration 50,000

Figure 3.5.1. Mapping uniform random points from the positive unit square using a 10 × 10 planar array of units (η0 = 0.8, ηf = 0.01, σ0 = 5, σf = 1, and kmax = 50,000).

(a) Iteration 0 (b) Iteration 10,000

(c) Iteration 30,000 (d) Iteration 70,000

Figure 3.5.2. Mapping uniform random points from a disk onto a 15 × 15 array of units (η0 = 0.8, ηf = 0.01, σ0 = 8, σf = 1, and kmax = 70,000).

(a) Iteration 0 (b) Iteration 1,000

(c) Iteration 10,000 (d) Iteration 50,000

Figure 3.5.3. Mapping uniform random points from a triangle onto a 15 × 15 array of units (η0 = 0.8, ηf = 0.01, σ0 = 8, σf = 1, and kmax = 50,000).

(a) Iteration 0 (b) Iteration 100

(c) Iteration 20,000 (d) Iteration 50,000

Figure 3.5.4. Mapping uniform random points from a hollow disk onto a 15 × 15 array of units (η0 = 0.8, ηf = 0.01, σ0 = 8, σf = 1, and kmax = 50,000).

(a) Iteration 0 (b) Iteration 5000

(c) Iteration 20,000 (d) Iteration 40,000

Figure 3.5.5. Mapping from a hollow disk region onto a linear array (chain) of sixty units using a SOFM.

Mappings from higher to lower dimensions are also possible with a SOFM, and are in general useful for dimensionality reduction of input data. An illustrative example is shown in Figure 3.5.5 where a mapping from a hollow disk region to a linear array (chain) of sixty units is performed using Kohonen's feature map learning in Equation (3.5.1). Here, the units self-organize such that they cover the largest region possible (a space-filling curve).

Two additional interesting simulations of SOFMs, due to Ritter and Schulten (1988a), are considered next. The first simulation involves the formation of a somatosensory map between the tactile receptors of a hand surface and an "artificial cortex" formed by a 30 × 30 planar square array of linear units. The training set consisted of the activity patterns of the set of tactile receptors covering the hand surface. Figure 3.5.6 depicts the evolution of the feature map from an initial random map. Here, random points are selected according to the probability distributions defined by regions D, L, M, R, and T shown in Figure 3.5.6 (a). It is interesting to note the boundaries developed [shown as dotted regions in the maps of Figure 3.5.6 (c) and (d)] between the various sensory regions. Also note the correlation between the sizes of the sensory regions of the input data and the sizes of their associated regions in the converged map.

(a) (b)

(c) (d)

Figure 3.5.6. An example of somatosensory map formation on a planar array of 30 x 30 units. (From H. Ritter and K. Schulten, 1988, Kohonen's Self-Organizing Maps: Exploring Their Computational Capabilities, Proceedings of the IEEE International Conference on Neural Networks, vol. I, pp. 109-116, © 1988 IEEE.)

The second simulation is inspired by the work of Durbin and Willshaw (1987) (see also Angeniol et al., 1988, and Hueter, 1988) related to solving the traveling salesman problem (TSP) by an elastic net. Here, 30 random city locations are chosen in the unit square with the objective of finding the path with minimal length that visits all cities, where each city is visited only once. A linear array (chain) of 100 units is used for the feature map. The initial neighborhood size used is 20 and the initial learning rate is 1. Figure 3.5.7 (a) shows the 30 randomly chosen city locations (filled circles) and the initial location and shape of the chain (open squares). The generated solution path is shown in Figures 3.5.7 (b), (c), and (d) after 5,000, 7,500, and 10,000 steps, respectively.
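This experiment can be sketched in a few lines. The sketch below makes several illustrative assumptions beyond what the text states: the chain is treated as a closed ring (as in the elastic net formulation), the final schedule values and the initial ring placement are arbitrary, and the final tour is read off by sorting cities by the ring position of their winner units.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cities, n_units = 30, 100
cities = rng.random((n_cities, 2))       # 30 random city locations

# Chain of 100 units initialized as a small ring at the square's center.
angles = 2 * np.pi * np.arange(n_units) / n_units
W = 0.5 + 0.1 * np.column_stack((np.cos(angles), np.sin(angles)))

k_max = 10_000
for k in range(k_max):
    x = cities[rng.integers(n_cities)]           # pick a random city
    eta = 1.0 * (0.01 / 1.0) ** (k / k_max)      # initial learning rate 1
    sigma = 20.0 * (0.5 / 20.0) ** (k / k_max)   # initial neighborhood 20
    i_star = np.argmin(np.linalg.norm(W - x, axis=1))
    # Distance measured along the ring, with wrap-around.
    d = np.abs(np.arange(n_units) - i_star)
    d = np.minimum(d, n_units - d).astype(float)
    lam = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    W += eta * lam[:, None] * (x - W)

# Tour: visit cities in the order of their winner units along the ring.
order = np.argsort([np.argmin(np.linalg.norm(W - c, axis=1))
                    for c in cities])
```

Because the ring can only stretch, not cross itself easily, the units settle into a short closed path through the cities, which is what Figures 3.5.7 (b)-(d) display.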

(a) (b)

(c) (d)

Figure 3.5.7. Solving 30-city TSP using a SOFM consisting of a "chain" of one hundred units. Filled circles correspond to fixed city locations. Open squares correspond to weight vector coordinates of units forming the chain.

Applications of SOFMs can be found in many areas, including trajectory planning for a robot arm (Kohonen, 1989; Ritter and Schulten, 1988a) and combinatorial optimization (Angeniol et al., 1988; Hueter, 1988; Ritter and Schulten, 1988a). In practical applications of self-organizing feature maps, input vectors are often high-dimensional. Such is the case in speech processing. We conclude this section by highlighting an interesting practical application of feature maps as phonotopic maps for continuous speech processing (Kohonen, 1988).

In speech processing (transcription, recognition, etc.) a microphone signal is first digitally preprocessed and converted into a 15-channel spectral representation which covers the range of frequencies from 200 Hz to 5 kHz. Let these channels together constitute the 15-dimensional input vector x(t) to a feature map (normally, each vector is preprocessed by subtracting the average from all components and normalizing its length). Here, an 8 × 12 hexagonal array of units is assumed as depicted in Figure 3.5.8; note how each unit in this array has six immediate neighbors. The self-organizing process of Equation (3.5.1) is used to create a "topographic," two-dimensional map of speech elements onto the hexagonal array.

Figure 3.5.8. Phonotopic feature map for Finnish speech. Units, shown as circles, are labeled with the symbols of the phonemes to which they adapted to give the best responses. Some units are shown to respond to two phonemes. (From T. Kohonen, 1988, The "Neural" Phonetic Typewriter, IEEE Computer Magazine, 21(3), pp. 11-22, © 1988 IEEE.)

The input vectors x(t), representing short-time spectra of the speech waveform, are computed every 9.83 milliseconds. These inputs are presented in their natural order to the units in the array. After training on sufficiently large segments of continuous speech (Finnish speech in this case) and subsequent calibration of the resulting map with standard reference phoneme spectra, the phonotopic map of Figure 3.5.8 emerges. Here, the units are labeled with the symbols of the phonemes to which they "learned" to give the best responses. A striking result is that the various units in the array become sensitized to spectra of different phonemes and their variations in a two-dimensional order, although teaching was not done using the phonemes. One can attribute this to the fact that the input spectra are clustered around phonemes, and the self-organizing process finds these clusters. Some of the units in the array have double labels, indicating units that respond to two phonemes. For example, the distinction of 'k', 'p', and 't' from this map is not reliable [a solution for such a problem is described in Kohonen (1988)]. This phonotopic map can be used as the basis for isolated-word recognition. Here, the signal x(t) corresponding to an uttered word induces an ordered sequence of responses in the units of the array. Figure 3.5.9 shows the sequence of responses obtained from the phonotopic map when the Finnish word "humpilla" (the name of a place) was uttered. This sequence of responses defines a phonemic transcription of the uttered word. This transcription can then be recognized by comparing it with reference transcriptions collected from a great many words. Also, these phonemic trajectories provide a means for the visualization of the phonemes of speech, which may be useful for speech training therapy. Further extensions of the use of SOFMs in automatic speech recognition can be found in Tattersal et al. (1990).

Figure 3.5.9. Sequence of the responses of units obtained from the phonotopic map when the Finnish word "humpilla" was uttered. The arrows correspond to intervals of 9.83 milliseconds. (From T. Kohonen, 1988, The "Neural" Phonetic Typewriter, IEEE Computer Magazine, 21(3), pp. 11-22, © 1988 IEEE.)

Kohonen (1988) employed the phonotopic map in a speech transcription system implemented in hardware in a PC environment. This system is used as a "phonetic typewriter" which can produce text from arbitrary dictation. When trained on speech from half a dozen speakers, using office text, names, and the most frequent words of the language (this amounts to several thousand words), the phonetic typewriter had a transcription accuracy between 92 and 97 percent, depending on the speaker and difficulty of text. This system can be adapted on-line to new speakers by fine tuning the phonotopic map on a dictation of about 100 words.
