Self-organization is a process of unsupervised learning
whereby significant patterns or features in the input data are
discovered. In the context of a neural network, self-organization
learning consists of adaptively modifying the synaptic weights
of a network of locally interacting units in response to input
excitations and in accordance with a learning rule, until a final
useful configuration develops. The local interaction of units
means that changes in the behavior of a unit only (directly)
affect the behavior of its immediate neighborhood. The key question
here is how a useful configuration could evolve from self-organization.
The answer lies essentially in a naturally observed phenomenon
whereby global order can arise from local interactions (Turing,
1952). This phenomenon applies to neural networks (biological
and artificial) where many originally random local interactions
between neighboring units (neurons) of a network couple and coalesce
into states of global order. This global order leads to coherent
behavior, which is the essence of self-organization.

In the following, a modified version of the simple
competitive learning network discussed in Section 3.4.1 is presented,
which exhibits self-organization features. This network attempts
to map a set of input vectors **x***k*
in R*n* onto an array
of units (normally one- or two-dimensional) such that any topological
relationships among the **x***k*
patterns are preserved and are represented by the network in terms
of a spatial distribution of unit activities. The more related
two patterns are in the input space, the closer we expect the
positions of the two units representing them to be in the array.
In other words, if **x**1
and **x**2 are "similar"
or are topological neighbors in R*n*,
and if **r**1 and **r**2
are the locations of the corresponding winner units in the net/array,
then the Euclidean distance ||**r**1 - **r**2||
is expected to be small. Also, ||**r**1 - **r**2||
approaches zero as **x**1
approaches **x**2.
The idea is to develop a topographic map of the input vectors,
so that similar input vectors would trigger nearby units. Thus,
a global organization of the units is expected to emerge.

An example of such topology preserving self-organizing
mappings which exists in animals is the somatosensory map from
the skin onto the somatosensory cortex, where there exists an
image of the body surface. The retinotopic map from the retina
to the visual cortex is another example. It is believed that
such biological topology preserving maps are not entirely preprogrammed
by the genes and that some sort of (unsupervised) self-organizing
learning phenomenon exists that tunes such maps during development.
Two early models of topology preserving competitive learning
were proposed by von der Malsburg (1973) and Willshaw and von
der Malsburg (1976) for the retinotopic map problem. In this
section, we present a detailed topology preserving model due to
Kohonen (1982a) which is commonly referred to as the self-organizing
feature map (SOFM).

The purpose of Kohonen's self-organizing feature
map is to capture the topology and the probability distribution
of input data (Kohonen, 1982a and 1989). This model generally
involves an architecture consisting of a two-dimensional structure
(array) of linear units, where each unit receives the same input
**x***k*, with **x***k* ∈
R*n*. Each unit
in the array is characterized by an *n*-dimensional weight
vector. The *i*th unit weight vector **w***i*
is sometimes viewed as a position vector which defines a "virtual
position" for unit *i* in R*n*.
This, in turn, will allow us to interpret changes in **w***i*
as movements of unit *i*. However, one should keep in mind
that no physical movements of units take place.

The learning rule is similar to that of simple competitive
learning in Equation (3.4.5) and is defined by:

Δ**w***i* = ρ Λ(*i*, *i**) (**x***k* − **w***i*)    (3.5.1)

where *i**
is the index of the winner unit. The winner unit is determined
according to the Euclidean distance as in Equation (3.4.3), with
no weight vector normalization. The major difference between
this update rule and that of simple competitive learning is the
presence of the neighborhood function Λ(*i*, *i**)
in the former. This function is very critical for the successful
preservation of topological properties. It is normally symmetric
[i.e., Λ(*i*, *i**) = Λ(*i**, *i*)], with values close to one for
units *i* close to *i**,
and it decreases monotonically with the Euclidean distance ||**r***i* − **r***i**||
between units *i* and *i**
in the array.
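A single application of this learning rule, winner selection followed by a neighborhood-weighted pull toward the input, can be sketched in NumPy. This is a minimal sketch: the Gaussian neighborhood anticipates the choice given in Equation (3.5.2), and the function and variable names are illustrative, not from the text.

```python
import numpy as np

def sofm_step(W, positions, x, rho, sigma):
    """One SOFM update. W is (units, n) weights; positions is (units, d)
    array coordinates of the units; rho is the learning rate; sigma is
    the neighborhood width."""
    # Winner i*: smallest Euclidean distance ||x - w_i||, as in Eq. (3.4.3),
    # with no weight vector normalization
    i_star = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    # Gaussian neighborhood centered on the winner's array position
    d2 = np.sum((positions - positions[i_star]) ** 2, axis=1)
    lam = np.exp(-d2 / (2.0 * sigma ** 2))
    # Eq. (3.5.1): every weight moves toward x, scaled by the neighborhood
    W += rho * lam[:, None] * (x - W)
    return i_star
```

Note that every unit is updated, not just the winner; units far from the winner in the array receive only a vanishingly small correction.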

At the onset of learning, Λ(*i*, *i**)
defines a relatively large neighborhood whereby all units in the
net are updated for any input **x***k*.
As learning progresses, the neighborhood is shrunk down until
it ultimately goes to zero, where only the winner unit is updated.
The learning rate ρ also must follow a monotonically decreasing
schedule in order to achieve convergence. One may think of the
initial large neighborhood as effecting an exploratory global
search which is then continuously refined to a local search as
the variance of Λ approaches zero. The
following is one possible choice for Λ(*i*, *i**):

Λ(*i*, *i**) = exp(−||**r***i* − **r***i**||² / 2σ²)    (3.5.2)

where the variance σ²
controls the width of the neighborhood. Here, ρ and σ may be set
as monotonically decreasing functions of the learning step *k*. Ritter and Schulten
(1988a) proposed the following update schedule for ρ:

ρ(*k*) = ρ0 (ρ*f* / ρ0)^(*k*/*k*max)    (3.5.3)

and for σ:

σ(*k*) = σ0 (σ*f* / σ0)^(*k*/*k*max)    (3.5.4)

with ρ0 (σ0)
and ρ*f* (σ*f*)
controlling the initial and final values of the learning rate
(neighborhood width), respectively, and *k*max
being the maximum number of learning steps anticipated.
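Under this reading of the Ritter-Schulten schedules, exponential interpolation between an initial and a final value, the two schedules can be written directly. A sketch; the parameter names are illustrative:

```python
def rho_schedule(k, k_max, rho0, rhof):
    """Eq. (3.5.3): learning rate decays exponentially from rho0 to rhof."""
    return rho0 * (rhof / rho0) ** (k / k_max)

def sigma_schedule(k, k_max, sigma0, sigmaf):
    """Eq. (3.5.4): neighborhood width decays exponentially from sigma0 to sigmaf."""
    return sigma0 * (sigmaf / sigma0) ** (k / k_max)
```

Both schedules take their initial value at *k* = 0 and reach their final value exactly at *k* = *k*max, decreasing monotonically in between.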

There is no theory to specify the parameters of
these learning schedules for arbitrary training data. However,
practical values are 0 < ρ0
≤ 1 (typically 0.8), ρ*f*
<< 1, σ0 = *m**d*/2
(*m**d* is
the number of units along the largest diagonal of the array),
or an equivalent value that will permit Λ(*i*, *i**, *k* = 0)
to reach all units when *i**
is set close to the center of the array, and σ*f*
= 0.5. Finally, *k*max
is usually set to 2 or more orders of magnitude larger than the
total number of units in the net.
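These rules of thumb are easy to encode. In the sketch below, *m**d* is interpreted as the Euclidean length of the array's largest diagonal, and the constants follow the practical values quoted above, which are heuristics, not theory:

```python
import math

def default_sofm_params(rows, cols):
    """Heuristic schedule parameters for a rows x cols array of units."""
    # m_d: interpreted here as the length of the array's largest diagonal
    m_d = math.hypot(rows - 1, cols - 1)
    return {
        "rho0": 0.8,                 # 0 < rho0 <= 1, typically 0.8
        "rhof": 0.01,                # rhof << 1
        "sigma0": m_d / 2,           # initial neighborhood spans the whole array
        "sigmaf": 0.5,
        "k_max": 100 * rows * cols,  # ~2 orders of magnitude above the unit count
    }
```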

Let the input vector **x** be a random variable
with a stationary probability density function *p*(**x**).
Then, the basis for Equation (3.5.1) and various other self-organizing
algorithms is captured by the following two-step process: (1)
Locate the best-matching unit for the input vector **x**; (2)
Increase matching at this unit and its topological neighbors.
The computation achieved by a repetitive application of this
process is rather surprising and is captured by the following
proposition due to Kohonen (1989): "The **w***i*
vectors tend to be ordered according to their mutual similarity,
and the asymptotic local point density of the **w***i*,
in an average sense, is of the form *g*(*p*(**x**)),
where *g* is some continuous, monotonically increasing function."
This proposition is tested by the following experiments.
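A tiny one-dimensional version of such an experiment can be run in a few lines. This is an illustrative sketch, not one of the book's experiments: a chain of 10 units with scalar weights is trained on inputs drawn uniformly from [0, 1], using the update rule and exponential schedules above, after which the weights tend to come out (nearly) monotonically ordered along the chain.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k_max = 10, 3000
w = rng.random(m)                    # one scalar weight per unit in a 1-D chain
idx = np.arange(m, dtype=float)      # unit positions along the chain
for k in range(k_max):
    x = rng.random()                 # sample from uniform p(x) on [0, 1]
    rho = 0.8 * (0.01 / 0.8) ** (k / k_max)     # decaying learning rate
    sigma = 5.0 * (0.5 / 5.0) ** (k / k_max)    # shrinking neighborhood width
    i_star = int(np.argmin(np.abs(w - x)))      # step (1): best-matching unit
    lam = np.exp(-(idx - i_star) ** 2 / (2 * sigma ** 2))
    w += rho * lam * (x - w)                    # step (2): increase matching nearby
```

After training, adjacent units in the chain hold similar weights, and the weight sequence is (close to) monotone, which is the ordering behavior the proposition describes.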

[Figures 3.5.1 through 3.5.5 (images and captions not reproduced here): each figure shows four snapshots of an evolving feature map, from (a) iteration 0 through (d) a final iteration of between 40,000 and 70,000.]

Two additional interesting simulations of SOFMs,
due to Ritter and Schulten (1988a), are considered next. The
first simulation involves the formation of a somatosensory map
between the tactile receptors of a hand surface and an "artificial
cortex" formed by a 30 × 30 planar square array of linear
units. The training set consisted of the activity patterns of
the set of tactile receptors covering the hand surface. Figure
3.5.6 depicts the evolution of the feature map from an initial
random map. Here, random points are selected according to the
probability distributions defined by regions D, L, M, R, and T
shown in Figure 3.5.6 (a). It is interesting to note the boundaries
developed [shown as dotted regions in the maps of Figure 3.5.6
(c) and (d)] between the various sensory regions. Also note the
correlations between the sizes of the sensory regions in the input data
and their associated regions in the converged map.


Figure 3.5.6. An example of somatosensory map formation
on a planar array of 30 x 30 units. (From H. Ritter and K. Schulten,
1988, Kohonen's Self-Organizing Maps: Exploring Their Computational
Capabilities, *Proceedings of the IEEE International Conference
on Neural Networks*, vol. I, pp. 109-116, © 1988 IEEE.)

The second simulation is inspired by the work of
Durbin and Willshaw (1987) (see also Angeniol et al., 1988, and
Hueter, 1988) related to solving the traveling salesman problem
(TSP) by an elastic net. Here, 30 random city locations are chosen
in the unit square, with the objective of finding the minimal-length
path that visits each city exactly once. A linear
array (chain) of 100 units is used for the
feature map. The initial neighborhood size used is 20 and the
initial learning rate is 1. Figure 3.5.7 (a) shows the 30 randomly
chosen city locations (filled circles) and the initial location
and shape of the chain (open squares). The generated solution
path is shown in Figures 3.5.7 (b), (c), and (d) after 5,000,
7,500, and 10,000 steps, respectively.
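A toy version of this simulation can be sketched as follows. The tour read-off at the end (visiting cities in the order of their nearest chain unit) is an illustrative choice, and the chain is closed into a ring here, a common variant, so that the resulting path forms a tour; the schedule constants besides the initial neighborhood size of 20 and initial learning rate of 1 are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((30, 2))             # 30 random city locations in the unit square
m, k_max = 100, 10000                    # chain of 100 units
idx = np.arange(m, dtype=float)
W = 0.5 + 0.01 * rng.standard_normal((m, 2))   # chain starts bunched at the center
for k in range(k_max):
    x = cities[rng.integers(len(cities))]
    rho = 1.0 * (0.02 / 1.0) ** (k / k_max)    # initial learning rate 1
    sigma = 20.0 * (0.5 / 20.0) ** (k / k_max) # initial neighborhood size 20
    i_star = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    d = np.abs(idx - i_star)
    d = np.minimum(d, m - d)                   # distance measured along the closed ring
    lam = np.exp(-d ** 2 / (2 * sigma ** 2))
    W += rho * lam[:, None] * (x - W)
# read a tour off the chain: visit cities in order of their nearest unit
tour = np.argsort([int(np.argmin(np.linalg.norm(W - c, axis=1))) for c in cities])
```

As the neighborhood shrinks, the chain stretches out and snaps onto the cities; because consecutive units stay close in the plane, consecutive cities along the extracted tour also tend to be close, yielding a short (though not provably optimal) path.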


Figure 3.5.7. Solving 30-city TSP using a SOFM consisting
of a "chain" of one hundred units. Filled circles correspond
to fixed city locations. Open squares correspond to weight vector
coordinates of units forming the chain.

Applications of SOFMs can be found in many areas,
including trajectory planning for a robot arm (Kohonen, 1989;
Ritter and Schulten, 1988a), and combinatorial optimization (Angeniol
et al., 1988; Hueter, 1988; Ritter and Schulten, 1988a). In practical
applications of self-organizing feature maps, input vectors are
often high dimensional. Such is the case in speech processing.
We conclude this section by highlighting an interesting practical
application of feature maps as phonotopic maps for continuous
speech processing (Kohonen, 1988).

In speech processing (transcription, recognition,
etc.) a microphone signal is first digitally preprocessed and
converted into a 15-channel spectral representation covering
the range of frequencies from 200 Hz to 5 kHz. Let these channels
together constitute the 15-dimensional input vector **x**(*t*)
to a feature map (normally, each vector is preprocessed by subtracting
the average from all components and normalizing its length).
Here, an 8 × 12 hexagonal array of units is assumed
as depicted in Figure 3.5.8; note how each unit in this array
has six immediate neighbors. The self-organizing process of Equation
(3.5.1) is used to create a "topographic," two-dimensional
map of speech elements onto the hexagonal array.

Figure 3.5.8. Phonotopic feature map for Finnish
speech. Units, shown as circles, are labeled with the symbols
of the phonemes to which they adapted to give the best responses.
Some units are shown to respond to two phonemes. (From T. Kohonen,
1988, The "Neural" Phonetic Typewriter, *IEEE Computer
Magazine*, **21**(3), pp. 11-22, © 1988 IEEE.)

The input vectors **x**(*t*), representing
short-time spectra of the speech waveform, are computed every
9.83 milliseconds. These inputs are presented in their natural
order as inputs to the units in the array. After training on
sufficiently large segments of continuous speech (Finnish speech
in this case) and subsequent calibration of the resulting map
with standard reference phoneme spectra, the phonotopic map of
Figure 3.5.8 emerges. Here, the units are labeled with the symbols
of the phonemes to which they "learned" to give the
best responses. A striking result is that the various units in
the array become sensitized to spectra of different phonemes and
their variations in a two-dimensional order, although teaching
was not done using the phonemes. One can attribute this to the
fact that the input spectra are clustered around phonemes, and
the self-organizing process finds these clusters. Some of the
units in the array have double labels, indicating units that
respond to two phonemes. For example, the distinction of 'k',
'p', and 't' from this map is not reliable [a solution for such
a problem is described in Kohonen (1988)]. This phonotopic map
can be used as the basis for isolated-word recognition. Here,
the signal **x**(*t*) corresponding to an uttered word
induces an ordered sequence of responses in the units in the array.
Figure 3.5.9 shows the sequence of responses obtained from the
phonotopic map when the Finnish word "humpilla" (name
of a place) was uttered. This sequence of responses defines a
phonemic transcription of the uttered word. Then this transcription
can be recognized by comparing it with reference transcriptions
collected from a great many words. Also, these phonemic trajectories
provide the means for the visualization of the phonemes of speech,
which may be useful for speech training therapy. Further extensions
of the use of SOFMs in automatic speech recognition can be found
in Tattersal et al. (1990).
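The response-sequence computation described above, mapping each preprocessed spectral frame to its winner unit, can be sketched as follows. The random weight matrix and frames stand in for a trained phonotopic map and real spectra; the hexagonal geometry enters here only through the number of units (8 × 12 = 96).

```python
import numpy as np

def preprocess(x):
    """Subtract the average from all components and normalize the length,
    as described for the spectral input vectors."""
    x = x - x.mean()
    return x / np.linalg.norm(x)

def trajectory(W, frames):
    """Winner-unit index for each 15-dimensional spectral frame x(t)."""
    return [int(np.argmin(np.linalg.norm(W - preprocess(x), axis=1)))
            for x in frames]

rng = np.random.default_rng(0)
W = rng.random((96, 15))         # an 8 x 12 map; a trained map would go here
frames = rng.random((40, 15))    # one 15-channel frame every 9.83 ms
path = trajectory(W, frames)     # the phonemic transcription trajectory
```

The resulting index sequence is the "trajectory" over the map; recognition then amounts to comparing such sequences against reference transcriptions.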

Figure 3.5.9. Sequence of the responses of units
obtained from the phonotopic map when the Finnish word "humpilla"
was uttered. The arrows correspond to intervals of 9.83 milliseconds.
(From T. Kohonen, 1988, The "Neural" Phonetic Typewriter,
*IEEE Computer Magazine*, **21**(3), pp. 11-22, ©
1988 IEEE.)

Kohonen (1988) employed the phonotopic map in a
speech transcription system implemented in hardware in a PC environment.
This system is used as a "phonetic typewriter" which
can produce text from arbitrary dictation. When trained on speech
from half a dozen speakers, using office text, names, and the
most frequent words of the language (this amounts to several thousand
words), the phonetic typewriter had a transcription accuracy between
92 and 97 percent, depending on the speaker and difficulty of
text. This system can be adapted on-line to new speakers by fine-tuning
the phonotopic map on a dictation of about 100 words.