4.7 Theory of Feature Mapping

The characterization of topology-preserving feature maps has received special attention in the literature (Kohonen, 1982b; Cottrell and Fort, 1986; Ritter and Schulten, 1986 and 1988b; Tolat, 1990; Heskes and Kappen, 1993a; Lo et al., 1993; Kohonen, 1993b). In particular, Takeuchi and Amari (1979) and Amari (1980 and 1983) have extensively studied a continuous-time dynamical version of this map to investigate the topological relation between the self-organized map and the input space governed by the density p(x), the resolution and stability of the map, and convergence speed. The characterization of a general feature map is difficult, and much of the analysis has been done under simplifying assumptions.

In the following, a one-dimensional version of the self-organizing feature map of Kohonen is characterized following the approach of Ritter and Schulten (1986). A continuous dynamical version of Kohonen's map is also described and analyzed.

4.7.1 Characterization of Kohonen's Feature Map

Consider the criterion function J(w) defined by (Ritter and Schulten, 1988a)

J(w) = \frac{1}{2} \sum_{k} \sum_{i} \Lambda(i, i^*) \, \lVert x^k - w_i \rVert^2        (4.7.1)

where i* is the label of the winner unit upon the presentation of stimulus (input) x^k and Λ(i, i*) is the neighborhood function that was introduced in Section 3.5. It can be seen that Equation (4.7.1) is an extension of the competitive learning criterion function of Equation (4.6.2). Performing gradient descent on (4.7.1) yields

\Delta w_i = -\eta \frac{\partial J}{\partial w_i} = \eta \sum_{k} \Lambda(i, i^*) \left( x^k - w_i \right)        (4.7.2)

which is just the batch mode version of Kohonen's self-organizing rule in Equation (3.5.1). Thus, Kohonen's rule is a stochastic gradient descent search and leads, on average and for small η, to a local minimum of J in Equation (4.7.1). These minima are given as solutions to

\sum_{k} \Lambda(i, i^*) \left( x^k - w_i \right) = 0, \qquad \text{for all } i        (4.7.3)

This equation is not easy to solve; it depends on the choice of Λ and the distribution p(x). Actually, we desire the global minimum of the criterion function J; local minima of J correspond to topological defects such as kinks in one-dimensional maps and twists in two-dimensional maps (Kohonen, 1989; Geszti, 1990).
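
To make the batch rule in Equation (4.7.2) concrete, the following sketch (a minimal illustration, not from the original text; the Gaussian neighborhood, learning rate, and array sizes are assumed choices) performs batch updates of a one-dimensional chain of units on synthetic two-dimensional stimuli and reports the resulting value of the criterion J of Equation (4.7.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: a 1-D chain of M units with 2-D weight vectors,
# trained on K stimuli drawn uniformly from the unit square.
M, K, eta, sigma = 20, 500, 0.5, 2.0
X = rng.random((K, 2))                      # stimuli x^k
W = rng.random((M, 2))                      # weight vectors w_i
r = np.arange(M)                            # unit coordinates on the chain

def neighborhood(i_star):
    """Assumed Gaussian neighborhood Lambda(i, i*) centered on the winner."""
    return np.exp(-(r[:, None] - r[i_star][None, :]) ** 2 / (2 * sigma ** 2))

for epoch in range(50):
    # Winner i* for every stimulus (nearest weight vector).
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)      # (K, M)
    i_star = d2.argmin(axis=1)
    Lam = neighborhood(i_star)                                # (M, K)
    # Batch update, Eq. (4.7.2): dw_i = eta * sum_k Lambda(i, i*) (x^k - w_i),
    # here scaled by 1/K for convenience.
    W += eta * (Lam @ X - Lam.sum(axis=1, keepdims=True) * W) / K
    # Criterion J of Eq. (4.7.1), per stimulus.
    J = 0.5 * (Lam.T * d2).sum() / K

print("J per stimulus after 50 batch updates:", J)
```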

The analysis of feature maps becomes more tractable if one replaces Equation (4.7.3) with a continuous version that assumes a continuum of units and where the distribution p(x) appears explicitly, namely

\int \Lambda\!\left(r, r^*(x)\right) \left[ x - w(r) \right] p(x) \, dx = 0        (4.7.4)

where r* = r*(x) is the coordinate vector of the winning unit upon the presentation of input x.

An implicit partial differential equation for w can be derived from Equation (4.7.4) [Ritter and Schulten, 1986]. However, for the case of two- or higher-dimensional maps, no explicit solutions exist for w(r) given an arbitrary p(x). On the other hand, solutions of Equation (4.7.4) are relatively easy to find for the one-dimensional map with scalar r and a given input distribution p(x). Here, the equilibrium w* satisfies (Ritter and Schulten, 1986)

\frac{dw^*}{dr} \propto p(w^*)^{-2/3}        (4.7.5)

which, in turn, satisfies the implicit differential equation corresponding to Equation (4.7.4), given by (assuming a sharply peaked, symmetric Λ):

\frac{d^2 w}{dr^2} = -\frac{2}{3} \, \frac{p'(w)}{p(w)} \left( \frac{dw}{dr} \right)^{2}        (4.7.6)

In Equations (4.7.5) and (4.7.6), the term p(w) is given by p(x)|x=w. Equation (4.7.5) shows that the density of the units in w space around point r is proportional to p(w)^{2/3}. This verifies the density-preserving feature of the map. Ideally, however, we would have a unit density proportional to p(w) for zero distortion. Therefore, a self-organizing feature map tends to undersample high-probability regions and oversample low-probability ones.
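
The following numerical illustration (a sketch with an assumed piecewise-constant p(x); not part of the original analysis) constructs the one-dimensional equilibrium implied by Equation (4.7.5) by placing units at equal quantiles of a density proportional to p^{2/3}, and compares it with the zero-distortion placement proportional to p:

```python
import numpy as np

# Assumed input density on [0, 1]: 4x more likely on [0, 0.5) than on [0.5, 1].
x = np.linspace(0.0, 1.0, 2001)
p = np.where(x < 0.5, 1.6, 0.4)          # integrates to 1

def place_units(density, n_units=50):
    """Place units at equal quantiles of the given (unnormalized) density,
    so that the unit density in w-space is proportional to it."""
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    targets = (np.arange(n_units) + 0.5) / n_units
    return np.interp(targets, cdf, x)

w_som   = place_units(p ** (2.0 / 3.0))  # Eq. (4.7.5): unit density ~ p^(2/3)
w_ideal = place_units(p)                 # zero-distortion reference: density ~ p

for name, w in [("SOM equilibrium (p^(2/3))", w_som), ("ideal (p)", w_ideal)]:
    print(f"{name}: fraction of units in the high-probability half = "
          f"{np.mean(w < 0.5):.2f}")
# The SOM equilibrium places fewer units in the high-probability half than the
# ideal map: it undersamples high-probability regions and oversamples low ones.
```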

Finally, one may obtain the equilibria w* by solving Equation (4.7.6). The local stability of some of these equilibria is ensured (with a probability approaching 1) if the learning coefficient η = η(t) is sufficiently small, positive, and decays according to the following necessary and sufficient conditions (Ritter and Schulten, 1988b)

\int_{0}^{\infty} \eta(t) \, dt = \infty        (4.7.7a)

and

\lim_{t \to \infty} \eta(t) = 0        (4.7.7b)

In particular, the decay law η(t) ∝ t^{−α} with 0 < α ≤ 1 ensures convergence. For laws with α > 1 or for exponential decay laws, Equation (4.7.7a) is not fulfilled, and some residual error remains even in the limit t → ∞. It can also be shown that during convergence the map first becomes untangled and fairly even, and then moves into a refinement phase in which it adapts to the details of p(x). Occasionally, the untangling phase can slow convergence, because some types of tangle (e.g., kinks and twists) take a long time to remove. Geszti (1990) suggested the use of a strongly asymmetric neighborhood function in order to speed up learning by breaking the symmetry effects responsible for the slow untangling of kinks and twists.
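
A quick numerical check of the conditions in Equations (4.7.7a) and (4.7.7b), sketched below with assumed schedules, shows that the power-law schedules η(t) = t^{−α} with α ≤ 1 have divergent partial sums while still decaying to zero, whereas the α > 1 and exponential schedules do not:

```python
import numpy as np

def partial_sum(schedule, T):
    t = np.arange(1, T + 1, dtype=float)
    return schedule(t).sum(), schedule(np.array([float(T)]))[0]

schedules = {
    "t^-0.5 (alpha = 0.5)": lambda t: t ** -0.5,
    "t^-1.0 (alpha = 1.0)": lambda t: t ** -1.0,
    "t^-1.5 (alpha = 1.5)": lambda t: t ** -1.5,
    "0.99^t (exponential)": lambda t: 0.99 ** t,
}

for name, sched in schedules.items():
    s3, _ = partial_sum(sched, 10**3)
    s6, eta_T = partial_sum(sched, 10**6)
    # Eq. (4.7.7a) requires the partial sums to keep growing (diverge) as T grows;
    # Eq. (4.7.7b) requires eta(T) -> 0.
    print(f"{name:22s} sum to 1e3 = {s3:9.2f}  sum to 1e6 = {s6:9.2f}  "
          f"eta(1e6) = {eta_T:.1e}")
# Only the alpha <= 1 schedules keep growing between T = 1e3 and T = 1e6; the
# alpha > 1 and exponential schedules saturate, so condition (4.7.7a) fails for them.
```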

4.7.2 Self-Organizing Neural Fields

Consider a continuum of units arranged as an infinite two-dimensional array (neural field). Each point (unit) of this array is represented by a position vector r and has an associated potential u(r). The output of the unit at r is assumed to be a nonlinear function of its potential, y(r) = f[u(r)], where f is either a monotonically non-decreasing positive saturating activation function or a step function. Associated with each unit r is a set of input weights w(r) and a set of lateral weights ω(r, r′) = ω(r − r′). This lateral weight distribution is assumed to be of the on-center off-surround type, as shown in Figure 4.7.1 for the one-dimensional case. Here, a unit at position r makes excitatory connections with all of its neighbors located within a certain distance of r, and makes inhibitory connections with all other units.


Figure 4.7.1. One-dimensional plot of a neural field's lateral weight distribution.

The dynamics of the neural field potential u(r, t) are given by

\tau \frac{\partial u(r, t)}{\partial t} = -u(r, t) + s(r, x) + h        (4.7.8)

where

s(r, x) = \int \omega(r - r') \, y(r', t) \, dr' + w(r)^{T} x        (4.7.9)

and h is a constant bias field. In Equation (4.7.8), it is assumed that the potential u(r, t) decays with time constant τ to the resting potential h in the absence of any stimulation. It is also assumed that this potential increases in proportion to the total stimulus s(r, x), which is the sum of the lateral stimulus ∫ω(r − r′) y(r′) dr′ and the input stimulus w(r)^T x due to the input signal x ∈ R^n. A conceptual diagram of the above neural field is shown in Figure 4.7.2.


Figure 4.7.2. Self-organizing neural field.

In Equation (4.7.8), the rates of change of ω and w, if any, are assumed to be much slower than that of the neural field potential. The input signal (pattern) x is a random time sequence, and it is assumed that each pattern x is chosen according to the probability density p(x). We also assume that inputs are applied to the neural field for a duration that is longer than the time constant τ of the neural field potential. On the other hand, the duration of stimulus x is assumed to be much shorter than the time constant τ′ of the weights w. Thus, the potential distribution u(r, t) can be considered to change in a quasi-equilibrium manner, denoted by u(r, x).

An initial excitation pattern applied to the neural field changes according to the dynamics given in Equation (4.7.8) and eventually converges to one of the equilibrium solutions. Stable equilibrium solutions are the potential fields u(r) which the neural field can retain persistently under a constant input x. The equilibria of Equation (4.7.8) must satisfy ∂u/∂t = 0, or

u^*(r) = s(r, u^*) + h        (4.7.10)

where s(r, u*) is the total stimulus at equilibrium. When the lateral connection distribution ω(r − r′) is strongly off-surround inhibitory, then for any given x only a local excitation pattern is aroused as a stable equilibrium satisfying Equation (4.7.10) (Amari, 1990). Here, a local excitation is a pattern in which the excitation is concentrated on units in a small local region; i.e., u*(r) is positive only on a small neighborhood centered at a maximally excited unit r0. Thus u*(r, x) represents a mapping from the input space onto the neural field.
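
The emergence of such a local excitation can be illustrated by integrating a discretized one-dimensional version of Equations (4.7.8) and (4.7.9). The sketch below uses a step activation function and an assumed on-center off-surround kernel; all parameter values are illustrative choices rather than values from the text:

```python
import numpy as np

# Discretized 1-D neural field: N units on [-L, L].
N, L, tau, h, dt = 201, 10.0, 1.0, -0.4, 0.05
r = np.linspace(-L, L, N)
dr = r[1] - r[0]

# Assumed on-center off-surround lateral kernel omega(r - r').
diff = r[:, None] - r[None, :]
omega = 1.5 * np.exp(-diff**2 / (2 * 0.5**2)) - 0.75 * np.exp(-diff**2 / (2 * 2.0**2))

# Assumed external input stimulus w(r)^T x: a bump peaked at r = 2.
wx = 0.8 * np.exp(-(r - 2.0)**2 / (2 * 1.0**2))

u = np.zeros(N)                          # start from the resting field
for _ in range(4000):
    y = (u > 0).astype(float)            # step activation f(u)
    s = omega @ y * dr + wx              # Eq. (4.7.9): lateral + input stimuli
    u += dt / tau * (-u + s + h)         # Eq. (4.7.8)

if np.any(u > 0):
    active = r[u > 0]
    print(f"active region ~ [{active.min():.2f}, {active.max():.2f}], "
          f"peak at r = {r[u.argmax()]:.2f}")
else:
    print("no active region (0-solution)")
# With these settings the excitation settles into a single local bump centered
# near the most strongly stimulated unit (r = 2).
```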

Let us now look at the dynamics of the self-organizing process. We start by assuming a particular update rule for the input weights of the neural field. One biologically plausible update rule is the Hebbian rule:

\tau' \frac{\partial w(r, t)}{\partial t} = -w(r, t) + c \, y^*(r, x) \, x        (4.7.11)

where y*(r, x) = f[u*(r, x)] is the neural field's equilibrium output activity due to input x and c is a positive constant. In Equation (4.7.11), we use the earlier assumption τ′ ≫ τ. Next, we assume strong mixing in Equation (4.7.11), which allows us to write an expression for the average learning equation (absorbing τ′ into the time scale and c into x) as

\frac{\partial w(r, t)}{\partial t} = -w(r, t) + \left\langle y^*(r, x) \, x \right\rangle        (4.7.12)

where the averaging is over all possible x. The equilibria of Equation (4.7.12) are given by

w(r) = \left\langle y^*(r, x) \, x \right\rangle = \int y^*(r, x) \, x \, p(x) \, dx        (4.7.13)

If we now transpose Equation (4.7.12) and multiply it by an arbitrary input vector x, we arrive at an equation for the change in the input stimulus w(r)^T x:

\frac{\partial}{\partial t}\left[ w(r, t)^{T} x \right] = -w(r, t)^{T} x + \left\langle y^*(r, x') \, \left( x'^{T} x \right) \right\rangle        (4.7.14)

The vector inner product x^T x′ represents the similarity of two input signals x and x′ and hence the topology of the signal space (Takeuchi and Amari, 1979). Note how Equation (4.7.14) relates the topology of the input stimulus set {x} with that of the neural field.

On the other hand, if one assumes a learning rule where a unit r updates its input weight vector in proportion to the correlation of its equilibrium potential u*(r, x) and the difference x - w(r), one arrives at the average differential equation

\frac{\partial w(r, t)}{\partial t} = \left\langle u^*(r, x) \left[ x - w(r, t) \right] \right\rangle = \int u^*(r, x) \left[ x - w(r, t) \right] p(x) \, dx        (4.7.15)

This learning rule is equivalent to the averaged continuum version of Kohonen's self-organizing feature map in Equation (4.7.2), if one views the potential distribution u*(r, x) as the weighting neighborhood function Λ. Here, self-organization will emerge if the dynamics of the potential field evolve such that the quasi-stable equilibrium potential u* starts out positive for all r and its excited region then shrinks slowly and monotonically in diameter with time. This may be accomplished through proper control of the bias field h, as described below.

In general, it is difficult to solve Equations (4.7.8) and (4.7.12) [or (4.7.15)]. However, some properties of feature map formation can be deduced from these equations for the special, yet instructive, case of a one-dimensional neural field (Takeuchi and Amari, 1979; Amari, 1980 and 1983). The dynamics of the potential field in Equation (4.7.8) for a one-dimensional neural field were analyzed in detail by Amari (1977b) and Kishimoto and Amari (1979) for a step activation function and a continuous monotonically non-decreasing activation function, respectively. It was shown that with ω(r − r′) as shown in Figure 4.7.1 and f(u) = step(u), there exist stable equilibrium solutions u*(r) for the x = 0 case. The 0-solution potential field, u*(r) ≤ 0 for all r, and the ∞-solution field, u*(r) > 0 for all r, are among these stable solutions. The 0-solution is stable if and only if h < 0. On the other hand, the ∞-solution is stable if and only if h > −2A(∞), where A(∞) is the limiting value of the function A(a) defined below. Local excitations (also known as a-solutions), where u*(r) is positive only over a finite interval [a1, a2] of width a = a2 − a1, are also possible. An a-solution exists if and only if h + A(a) = 0. Here, A(a) is the definite integral defined by

A(a) = \int_{0}^{a} \omega(r) \, dr        (4.7.16)

and plotted in Figure 4.7.3. Amari (see also Krekelberg and Kok, 1993) also showed that a single a-solution can exist for the case of a non-zero input stimulus, and that the corresponding active region of the neural field is centered at the unit r receiving the maximum input. Furthermore, the width a of this active region shrinks monotonically as the bias field h is decreased. Thus, one may exploit the fact that the field potential/neighborhood function u*(r, x) is controlled by the bias field h in order to control the convergence of the self-organizing process. Here, the uniform bias field h is started at a value h > −2A(∞) and is slowly decreased towards negative values. This, in turn, causes u* to start at the ∞-solution, then gradually move through a-solutions of decreasing width a until, ultimately, u* becomes the 0-solution. For further analysis of the self-organizing process in a neural field, the reader is referred to Zhang (1991).
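
The effect of the bias field on the width of the local excitation can be illustrated numerically. The sketch below assumes a particular on-center off-surround kernel ω, computes A(a) from Equation (4.7.16), and solves h + A(a) = 0 on the decreasing branch of A (the stable case in Amari's analysis) for several values of h:

```python
import numpy as np

# Assumed on-center off-surround lateral kernel omega(r).
def omega(r):
    return 1.5 * np.exp(-r**2 / (2 * 0.5**2)) - 0.75 * np.exp(-r**2 / (2 * 2.0**2))

# A(a) = integral of omega from 0 to a, Eq. (4.7.16), by the trapezoidal rule.
a_grid = np.linspace(0.0, 10.0, 4001)
A = np.concatenate(([0.0],
                    np.cumsum((omega(a_grid[1:]) + omega(a_grid[:-1])) / 2
                              * np.diff(a_grid))))

# Stable a-solutions lie on the decreasing branch of A, where omega(a) < 0.
peak = A.argmax()
A_dec, a_dec = A[peak:], a_grid[peak:]

for h in [0.8, 0.4, 0.0, -0.2, -0.5]:
    # Solve h + A(a) = 0 on the decreasing branch (A is monotone there).
    if A_dec[-1] <= -h <= A_dec[0]:
        a = np.interp(-h, A_dec[::-1], a_dec[::-1])   # interp needs ascending x
        print(f"h = {h:+.1f}  ->  excitation width a = {a:.2f}")
    else:
        print(f"h = {h:+.1f}  ->  no a-solution on this branch")
# The width a decreases as h is lowered, until only the 0-solution remains.
```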


Figure 4.7.3. A plot of A(a) of Equation (4.7.16).

Kohonen (1993a and 1993b) proposed a self-organizing map model for which he gives physiological justification. The model is similar to Amari's self-organizing neural field except that it uses a discrete two-dimensional array of units. The model assumes sharp self-excitatory, off-surround lateral interconnections so that the map's activity stabilizes with the unit receiving the maximum excitation active and all other units inactive. Kohonen's model employs unit potential dynamics similar to those of Equation (4.7.8). On the other hand, Kohonen uses a learning equation more complex than those in Equations (4.7.12) and (4.7.15). This equation is given, for the jth weight (synapse) wij of unit i, by the pseudo-Hebbian learning rule

\frac{dw_{ij}}{dt} = \alpha \left( \sum_{l} h_{il} \, y_l \right) x_j - \beta \left( \sum_{r \in S_{ij}} w_{ir} x_r \right) w_{ij}        (4.7.17)

where α and β are positive constants. The term Σ_l h_il y_l in Equation (4.7.17) models a natural "transient" neighborhood function. It is a weighted sum of the output activities y_l of nearby units, in which h_il describes the strength of the diffuse chemical effect of cell l on cell i and is a function of the distance between these units. This weighted sum of output activities replaces the output activity of the cell itself, y_i, in Hebb's learning rule. The second term, on the other hand, acts as a stabilizing term which models "forgetting" effects, or disturbance due to adjacent synapses. Typically, forgetting effects in w_ij are proportional to the weight w_ij itself. In addition, if the disturbance caused by synaptic site r is mediated through the postsynaptic potential, the forgetting effect must further be proportional to w_ir x_r. This phenomenon is modeled by the sum Σ_{r∈S_ij} w_ir x_r in the forgetting term. Here, the summation is taken over a subset S_ij of the synapses of unit i that are located near the jth synapse w_ij and approximately act as one collectively interacting set. The major difference between the learning rules in Equations (4.7.17) and (4.7.12) is that, in the former, the "neighborhood" function is determined by a transient activity due to a diffuse chemical effect of nearby cell potentials, whereas in the latter it is determined by a stable region of the neural field potential.
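
A minimal simulation sketch of a rule of this form is given below. The Gaussian choice of h_il, the winner-take-all output, the treatment of S_ij as all synapses of unit i, and all parameter values are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative discrete 2-D array of side x side units with inputs in R^n.
side, n, alpha, beta, steps = 8, 3, 0.05, 0.05, 4000
grid = np.stack(np.meshgrid(np.arange(side), np.arange(side)), -1).reshape(-1, 2)
d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
H = np.exp(-d2 / (2 * 2.0 ** 2))           # assumed Gaussian h_il, width 2 units
W = 1.0 + rng.normal(0.0, 0.1, (side * side, n))

for _ in range(steps):
    x = rng.random(n)                      # input pattern
    y = np.zeros(side * side)              # winner-take-all map activity:
    y[(W @ x).argmax()] = 1.0              # max-excitation unit active, others off
    I = H @ y                              # transient neighborhood: I_i = sum_l h_il y_l
    # Discrete-time step of a rule of the form (4.7.17), with S_ij taken to be
    # all synapses of unit i, so the forgetting factor is w_i^T x.
    W += alpha * I[:, None] * x[None, :] - beta * (W @ x)[:, None] * W

norms = np.linalg.norm(W, axis=1)
print(f"weight norms: min {norms.min():.2f}, mean {norms.mean():.2f}, "
      f"max {norms.max():.2f}")
# The forgetting term keeps the weight norms bounded (at or below roughly
# sqrt(alpha/beta)), while the transient neighborhood couples nearby units' updates.
```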

Under the assumption that the index r ranges over all components of the input signal x, and regarding the neighborhood term Σ_l h_il y_l as a positive scalar independent of w and x (absorbed below into α), the vector form of Equation (4.7.17) takes the Riccati differential equation form

\frac{dw}{dt} = \alpha x - \beta \left( w^{T} x \right) w        (4.7.18)

where the constant neighborhood factor has been absorbed into α > 0. Now, multiplying both sides of Equation (4.7.18) by 2w^T leads to the differential equation

\frac{d \lVert w \rVert^{2}}{dt} = 2 w^{T} \frac{dw}{dt} = 2 \left( w^{T} x \right) \left( \alpha - \beta \lVert w \rVert^{2} \right)        (4.7.19)

Thus, for arbitrary x with w^T x > 0, Equation (4.7.19) converges to ‖w*‖² = α/β; that is, ‖w*‖ = (α/β)^{1/2}. On the other hand, the solution for the direction of w* cannot be determined in closed form from the deterministic differential Equation (4.7.18). However, a solution for the expected value of w may be found if Equation (4.7.18) is treated as a stochastic differential equation with strong mixing, in accordance with the discussion of Section 4.3. Taking the expected value of both sides of Equation (4.7.18) and solving for its equilibrium points (by setting d⟨w⟩/dt = 0) gives

w^{*} = \sqrt{\frac{\alpha}{\beta}} \, \frac{\langle x \rangle}{\lVert \langle x \rangle \rVert}        (4.7.20)

Furthermore, this equilibrium point can be shown to be stable (Kohonen, 1989).

From the above analysis, it can be concluded that the synaptic weight vector w is automatically normalized to the length (α/β)^{1/2}, independent of the input signal x, and that w rotates such that its average direction is aligned with the mean of x. This is the expected result of a self-organizing map when a uniform nondecreasing neighborhood function is used. In general, though, the neighborhood term is nonuniform and time varying, which makes the analysis of Equation (4.7.17) much more difficult.
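
The normalization and alignment behavior can be verified with a direct Euler simulation of Equation (4.7.18); the constants and the input distribution below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta, dt, steps = 2.0, 0.5, 0.01, 20000
mean_x = np.array([1.0, 2.0, 2.0])        # assumed mean of the input distribution
w = np.array([1.0, 0.0, 0.0])             # initial weight vector

for _ in range(steps):
    x = mean_x + rng.normal(0.0, 0.3, 3)            # random input with mean mean_x
    w += dt * (alpha * x - beta * (w @ x) * w)      # Eq. (4.7.18)

cos = w @ mean_x / (np.linalg.norm(w) * np.linalg.norm(mean_x))
print("final |w|        :", np.linalg.norm(w))
print("sqrt(alpha/beta) :", np.sqrt(alpha / beta))
print("cos(w, <x>)      :", cos)
# |w| approaches sqrt(alpha/beta) = 2 and w aligns with the mean of x,
# in agreement with Eq. (4.7.20).
```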
