In this section, an artificial neural network model
motivated by biological neurons with "locally-tuned" responses
is described. Neurons with locally-tuned response characteristics
can be found in many parts of biological nervous systems. These
nerve cells have response characteristics which are "selective"
for some finite range of the input signal space. The cochlear
stereocilia cells, for example, have a locally-tuned response
to frequency which is a consequence of their biophysical properties.
The present model is also motivated by earlier work on radial
basis functions (Medgassy, 1961) which are utilized for interpolation
(Micchelli, 1986; Powell, 1987), probability density estimation
(Parzen, 1962; Duda and Hart, 1973; Specht, 1990), and approximations
of smooth multivariate functions (Poggio and Girosi, 1989). The
model is commonly referred to as the radial basis function (RBF)
network.

The most important feature that distinguishes the
RBF network from earlier radial basis function-based models is
its adaptive nature, which generally allows it to utilize a relatively
small number of locally-tuned units (RBF's). RBF networks were
independently proposed by Broomhead and Lowe (1988), Lee and Kil
(1988), Niranjan and Fallside (1988), and Moody and Darken (1989a,
1989b). Similar schemes were also suggested by Hanson and Burr
(1987), Lapedes and Farber (1987), Casdagli (1989), Poggio and
Girosi (1990b), and others. The following is a description of
the basic RBF network architecture and its associated training
algorithm.

The RBF network has a feedforward structure consisting
of a single hidden layer of *J* locally-tuned units which
are fully interconnected to an output layer of *L* linear
units as shown in Figure 6.1.1.

Figure 6.1.1. A radial basis function neural network
consisting of a single hidden layer of locally-tuned units which
is fully interconnected to an output layer of linear units. For
clarity, only hidden to output layer connections for the *l*th
output unit are shown.

All hidden units simultaneously receive the *n*-dimensional
real-valued input vector **x**. Notice the absence of hidden
layer weights in Figure 6.1.1. This is because the hidden unit
outputs are not calculated using the weighted-sum/sigmoidal activation
mechanism as in the previous chapter. Rather, here, each hidden
unit output z_j
is obtained by calculating the "closeness" of the input
**x** to an *n*-dimensional parameter vector μ_j
associated with the *j*th hidden unit. Here, the response
characteristics of the *j*th hidden unit are given by:

$$ z_j = K\!\left( \frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert}{\sigma_j} \right) \tag{6.1.1} $$

where K is a strictly positive radially-symmetric
function (kernel) with a unique maximum at its "center"
μ_j and which drops
off rapidly to zero away from the center. The parameter σ_j
is the "width" of the receptive field in the input space
for unit *j*. This implies that z_j
has an appreciable value only when the "distance" ‖**x** − μ_j‖
is smaller than the width σ_j.
Given an input vector **x**, the output of the RBF network
is the *L*-dimensional activity vector **y** whose *l*th
component is given by:

$$ y_l = \sum_{j=1}^{J} w_{lj}\, z_j(\mathbf{x}) \tag{6.1.2} $$

It is interesting to note here that for *L* = 1
the mapping in Equation (6.1.2) is similar in form to that employed
by a PTG, as in Equation (1.4.1). However, in the RBF net, a
choice is made to use radially symmetric kernels as "hidden
units" as opposed to monomials.

RBF networks are best suited for approximating continuous
or piecewise continuous real-valued mappings f: R^n → R^L,
where *n* is sufficiently small; these approximation problems
include classification problems as a special case. According
to Equations (6.1.1) and (6.1.2), the RBF network may be viewed
as approximating a desired function *f*(**x**) by superposition
of non-orthogonal bell-shaped basis functions. The degree of
accuracy can be controlled by three parameters: the number of
basis functions used, their location, and their width. In fact,
like feedforward neural networks with a single hidden layer of
sigmoidal units, RBF networks can be shown to be universal
approximators (Poggio and Girosi, 1989; Hartman et al., 1990;
Baldi, 1991; Park and Sandberg, 1991, 1993).

A special but commonly used RBF network assumes
a Gaussian basis function for the hidden units:

$$ z_j = \exp\!\left( -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^{2}}{2\sigma_j^{2}} \right) \tag{6.1.3} $$

where σ_j
and μ_j are the standard
deviation and mean of the *j*th unit receptive field, respectively,
and the norm is the Euclidean norm. Another possible choice for
the basis function is the logistic function of the form:

$$ z_j = \frac{1}{1 + \exp\!\left( \dfrac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^{2}}{\sigma_j^{2}} - \theta_j \right)} \tag{6.1.4} $$

where θ_j
is an adjustable bias. In fact, with the basis function in Equation
(6.1.4), the only difference between a RBF network and a feedforward
neural network with a single hidden layer of sigmoidal units is
the similarity computation performed by the hidden units. If
we think of μ_j as
the parameter (weight) vector associated with the *j*th hidden
unit, then it is easy to see that an RBF network can be obtained
from a single hidden layer neural network with unipolar sigmoid-type
units and linear output units (like the one in Figure 5.1.1) by
simply replacing the *j*th hidden unit weighted-sum net_j = **x**^T μ_j
by the negative of the normalized Euclidean distance −‖**x** − μ_j‖²/σ_j².
On the other hand, the use of the Gaussian basis function in
Equation (6.1.3) leads to hidden units with Gaussian-type activation
functions and with a Euclidean distance similarity computation.
In this case, no bias is needed.
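
To make the forward computation in Equations (6.1.1) through (6.1.3) concrete, the following minimal sketch (in Python with NumPy; the function and array names are our own and not part of the original formulation) evaluates a Gaussian RBF network with linear output units for a batch of input vectors.

```python
import numpy as np

def rbf_forward(X, centers, widths, W):
    """Forward pass of a Gaussian RBF network, Equations (6.1.1)-(6.1.3).

    X       : (m, n) array of m input vectors x.
    centers : (J, n) array of receptive field centers mu_j.
    widths  : (J,)   array of receptive field widths sigma_j.
    W       : (L, J) array of output layer weights w_lj.
    Returns (Z, Y): hidden activities (m, J) and linear outputs (m, L).
    """
    # Squared Euclidean distances ||x - mu_j||^2 for every (input, unit) pair
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Gaussian basis functions, Equation (6.1.3)
    Z = np.exp(-d2 / (2.0 * widths ** 2))
    # Linear output units, Equation (6.1.2): y_l = sum_j w_lj z_j
    Y = Z @ W.T
    return Z, Y
```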

Next, we turn our attention to the training of RBF
networks. Consider a training set of *m* labeled pairs {**x**^i,
**d**^i} which
represent associations of a given mapping or samples of a continuous
multivariate function. Also, consider the SSE criterion function
as an error function *E* that we desire to minimize over
the given training set. In other words, we would like to develop
a training method that minimizes *E* by adaptively updating
the free parameters of the RBF network. These parameters are
the receptive field centers (means μ_j
of the hidden layer Gaussian units), the receptive field widths
(standard deviations σ_j),
and the output layer weights (w_lj).

Because of the differentiable nature of the RBF
network's transfer characteristics, one of the first training
methods that comes to mind is a fully supervised gradient descent
method over *E* (Moody and Darken, 1989a; Poggio and Girosi,
1989). In particular, μ_j, σ_j, and w_lj
are updated as follows: Δμ_j = −ρ_μ ∂E/∂μ_j, Δσ_j = −ρ_σ ∂E/∂σ_j,
and Δw_lj = −ρ_w ∂E/∂w_lj, where ρ_μ, ρ_σ, and ρ_w
are small positive constants. This method, although capable of
matching or exceeding the performance of backprop trained networks,
still gives training times comparable to those of sigmoidal-type
networks (Wettschereck and Dietterich, 1992).
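
Assuming Gaussian hidden units and linear outputs, the gradients of the SSE with respect to μ_j, σ_j, and w_lj take a simple closed form. The sketch below (illustrative only; the learning rates and names are our own choices) performs one such fully supervised update for a single training pair.

```python
import numpy as np

def rbf_sgd_step(x, d, centers, widths, W, rho_mu=0.01, rho_sigma=0.01, rho_w=0.01):
    """One fully supervised gradient descent step on the SSE for a single
    training pair (x, d), assuming Gaussian hidden units (Equation 6.1.3)
    and linear output units (Equation 6.1.2).  Arrays are updated in place:
    x (n,), d (L,), centers (J, n), widths (J,), W (L, J)."""
    diff = x - centers                      # (J, n)
    d2 = (diff ** 2).sum(axis=1)            # ||x - mu_j||^2
    z = np.exp(-d2 / (2.0 * widths ** 2))   # hidden activities z_j
    y = W @ z                               # network outputs y_l
    e = d - y                               # output errors (d_l - y_l)

    # delta_j = sum_l (d_l - y_l) w_lj: the error signal reaching hidden unit j
    delta = W.T @ e

    # Gradient descent updates for w_lj, mu_j and sigma_j
    W += rho_w * np.outer(e, z)
    centers += rho_mu * (delta * z / widths ** 2)[:, None] * diff
    widths += rho_sigma * delta * z * d2 / widths ** 3
    return 0.5 * float(e @ e)               # SSE for this pair
```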

One reason for the slow convergence of the above
supervised gradient descent trained RBF network is its inefficient
use of the locally-tuned representation of the hidden layer units.
When the hidden unit receptive fields are narrow, only a small
fraction of the total number of units in the network will be activated
for a given input **x**; the activated units are the ones
with centers very close to the input vector in the input space.
Thus, only those units which were activated need be updated for
each input presentation. The above supervised learning method,
though, places no restrictions on maintaining small values for
σ_j. Thus, the supervised
learning method is not guaranteed to utilize the computational
advantages of locality. One way to rectify this problem is to
use gradient descent-based learning only for the basis function
centers and use a method that maintains small values for the σ_j's.
Examples of learning methods which take advantage of the locality
property of the hidden units are presented below.

A training strategy that decouples learning at the
hidden layer from that at the output layer is possible for RBF
networks due to the local receptive field nature of the hidden
units. This strategy has been shown to be very effective in terms
of training speed, though this advantage is generally offset
by reduced generalization ability, unless a large number of basis
functions is used. In the following, we describe efficient methods
for locating the receptive field centers and computing receptive
field widths. As for the output layer weights, once the hidden
units are synthesized, these weights can be easily computed using
the delta rule [Equation (5.1.2)]. One may view this computation
as finding the proper normalization coefficients of the basis
functions. That is, the weight w_lj
determines the amount of contribution of the *j*th basis
function to the *l*th output of the RBF net.

Several schemes have been suggested to find proper
receptive field centers and widths without propagating the output
error back through the network. The idea here is to populate
dense regions of the input space with receptive fields. One method
places the centers of the receptive fields according to some coarse
lattice defined over the input space (Broomhead and Lowe, 1988).
Assuming a uniform lattice with *k* divisions along each
dimension of an *n*-dimensional input space, this lattice
would require k^n
basis functions to cover the input space. This exponential growth
renders this approach impractical for a high dimensional space.
An alternative approach is to center *k* receptive fields
on a set of *k* randomly chosen training samples. Here,
unless we have prior knowledge about the location of prototype
input vectors and/or the regions of the input space containing
meaningful data, a large number of receptive fields would be required
to adequately represent the distribution of the input vectors
in a high dimensional space.

Moody and Darken (1989a) employed unsupervised learning
of the receptive field centers μ_j
in which a relatively small number of RBF's are used; the adaptive
centers learn to represent only the parts of input space which
are richly represented by clusters of data. The adaptive strategy
also helps reduce sampling error since it allows the μ_j's to be
determined by a large number of training samples. Here, the *k*-means
clustering algorithm (MacQueen, 1967; Anderberg, 1973) is used
to locate a set of *k* RBF centers which represents a local
minimum of the SSE between the training set vectors **x** and
the nearest of the *k* receptive field centers μ_j
(this SSE criterion function is given by Equation (4.6.4) with
**w** replaced by μ). In the basic *k*-means algorithm,
the *k* RBFs are initially assigned centers μ_j,
*j* = 1, 2, ..., *k*, which
are set equal to *k* randomly selected training vectors.
The remaining training vectors are assigned to class *j*
of the closest center μ_j.
Next, the centers are recomputed as the average of the training
vectors in their class. This two-step process is repeated until
all centers stop changing. An incremental version of this batch
mode process may also be used which requires no storage of past
training vectors or cluster membership information. Here, at
each time step, a random training vector **x** is selected
and the center μ_j* of the nearest (in a Euclidean
distance sense) receptive field is updated according to:

$$ \Delta\boldsymbol{\mu}_{j^*} = \eta\,\left(\mathbf{x} - \boldsymbol{\mu}_{j^*}\right) \tag{6.1.5} $$

where η is a small positive constant. Equation (6.1.5)
is the simple competitive rule which we have analyzed in Section
4.6.1. Similarly, we may use learning vector quantization (LVQ)
or one of its variants (see Section 3.4.2) to effectively locate
the *k* RBF centers (Vogt, 1993). Generally speaking, there
is no formal method for specifying the required number *k*
of hidden units in an RBF network. Cross-validation is normally
used to decide on *k*.
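
A minimal sketch of the incremental *k*-means procedure just described, with the competitive update of Equation (6.1.5), is given below (the function name, learning rate η, and number of epochs are our own illustrative choices).

```python
import numpy as np

def incremental_kmeans(X, k, eta=0.05, epochs=20, seed=None):
    """Locate k RBF centers with the incremental k-means (simple competitive)
    rule of Equation (6.1.5).  Centers are initialized to k randomly chosen
    training vectors; then the winning (nearest) center is nudged toward each
    randomly selected sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x = X[i]
            j = np.argmin(((x - centers) ** 2).sum(axis=1))  # nearest center
            centers[j] += eta * (x - centers[j])             # Equation (6.1.5)
    return centers
```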

Once the receptive field centers are found using
one of the above methods, their widths can be determined by one
of several heuristics in order to get smooth interpolation. Theoretically
speaking, RBF networks with the same width σ_j = σ
in each hidden kernel unit have the capability of universal approximation
(Park and Sandberg, 1991). This suggests that we may simply use
a single global fixed value σ for all σ_j's
in the network. In order to preserve the local response characteristics
of the hidden units, one should choose a relatively small (positive)
value for this global width parameter. The actual value of σ for
a particular training set may be found by cross-validation. Empirical
results (Moody and Darken, 1989a) suggest that a "good"
estimate for the global width parameter is the average width
⟨‖μ_i − μ_j‖⟩,
which represents a global average over all Euclidean distances
between the center of each unit *i* and that of its nearest
neighbor *j*. Other heuristics based on local computations
may be used which yield individually-tuned widths σ_j.
For example, the width for unit *j* may be set to the distance
σ_j = α‖μ_j − μ_i‖, where μ_i
is the center of the nearest neighbor to unit *j* (usually, α
is taken between 1.0 and 1.5).
may make use of the category label of the nearest training vector.
If that category label is different from that represented by
the current RBF unit, it would be advisable to use a smaller width
which narrows the bell-shaped receptive field of the current unit.
This leads to a sharpening of the class domains and allows for
better approximation.
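
The width heuristics above are easy to implement once the centers are known. The following sketch (our own; α, the per-unit flag, and the variable names are assumptions) computes either a single global width equal to the average nearest-neighbor distance or individually tuned widths σ_j = α‖μ_j − μ_i‖.

```python
import numpy as np

def nearest_neighbor_widths(centers, alpha=1.25, per_unit=True):
    """Width heuristics for RBF units given their centers (k, n array).
    Returns a (k,) array of widths sigma_j."""
    d = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)                 # ignore self-distances
    nn = d.min(axis=1)                          # distance to nearest neighboring center
    if per_unit:
        return alpha * nn                       # sigma_j = alpha * ||mu_j - mu_i||
    return np.full(len(centers), nn.mean())     # one global width for all units
```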

We have already noted that the output layer weights,
w_lj, can
be adaptively computed using the delta rule

$$ \Delta w_{lj} = \rho\,\left(d_l - y_l\right) f'(net_l)\, z_j \tag{6.1.6} $$

once the hidden layer parameters are obtained. Here,
the term f′(net_l)
can be dropped for the case of linear units. Equation (6.1.6)
drives the output layer weights to minimize the SSE criterion
function [recall Equation (5.1.13)], for sufficiently small ρ.
Alternatively, for the case of linear output units, one may formulate
the problem of computing the weights as a set of simultaneous
linear equations and employ the generalized-inverse method [recall
Equation (3.1.42)] to obtain the minimum SSE solution. Without
loss of generality, consider a single output RBF net, and denote
by **w** = [w_1 w_2 ... w_J]^T
the weight vector of the output unit. Now, recalling Equations
(3.1.39) through (3.1.42), the minimum SSE solution for the system
of equations **Z**^T **w** = **d**
is given by (assuming an overdetermined system; i.e., *m* > *J*)

$$ \mathbf{w}^{*} = \mathbf{Z}^{\dagger}\,\mathbf{d} = \left(\mathbf{Z}\mathbf{Z}^{T}\right)^{-1}\mathbf{Z}\,\mathbf{d} \tag{6.1.7} $$

where **Z** = [**z**^1 **z**^2 ... **z**^m]
is a *J* × *m* matrix, and **d** = [d_1 d_2 ... d_m]^T.
Here, **z**^i
is the output of the hidden layer for input **x**^i.
Therefore, the *ji*th element of matrix **Z** may be
expressed explicitly as

$$ z_{ji} = \exp\!\left( -\frac{\lVert \mathbf{x}^{i} - \boldsymbol{\mu}_j \rVert^{2}}{2\sigma_j^{2}} \right) \tag{6.1.8} $$

with the parameters μ_j
and σ_j assumed to have been computed using
the earlier described methods.
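
For a single-output net, the minimum SSE weights of Equation (6.1.7) can be obtained directly once **Z** is assembled from Equation (6.1.8). The sketch below uses NumPy's pseudo-inverse rather than forming (ZZ^T)^{−1} explicitly; this is our own choice for numerical convenience and yields the same least-squares solution.

```python
import numpy as np

def output_weights(X, d, centers, widths):
    """Minimum-SSE output weights for a single-output Gaussian RBF net.
    X (m, n) training inputs, d (m,) targets, centers (J, n), widths (J,)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (m, J)
    Zt = np.exp(-d2 / (2.0 * widths ** 2))   # Z^T: row i holds z(x^i), Eq. (6.1.8)
    return np.linalg.pinv(Zt) @ d            # w* = Z† d, Eq. (6.1.7)
```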

For "strict" interpolation problems, it
is desired that an interpolation function be found which is constrained
to "exactly" map the sample points **x***i*
into their associated targets *d**i*,
for *i* = 1, 2, ..., *m*. It is well known that a polynomial
with finite order *r* = *m* -
1 is capable of performing strict interpolation on *m* samples
{**x***i*, *d**i*},
with distinct **x***i*'s
in R*n* (see Problem
1.3.4). A similar result is available for RBF nets. This result
states that there is a class of radial-basis functions which guarantee
that an RBF net with *m* such functions is capable of strict
interpolation of *m* sample point in R*n*
(Micchelli, 1986 ; Light, 1992b), the Gaussian function in Equation
(6.1.3) is one example. Furthermore, there is no need to search
for the centers *j*;
one can just set *j* = **x***j*
for *j* = 1, 2, ..., *m*.
Thus, for strict interpolation, the **Z** matrix in Equation
(6.1.8) becomes the *m* × *m* matrix

$$ z_{ji} = \exp\!\left( -\frac{\lVert \mathbf{x}^{i} - \mathbf{x}^{j} \rVert^{2}}{2\sigma_j^{2}} \right), \qquad i, j = 1, 2, ..., m \tag{6.1.9} $$

which we refer to as the interpolation matrix. Note
that the appropriate width parameters σ_j
still need to be found; the choice of these parameters affects
the interpolation quality of the RBF net.

According to the above discussion, an exact solution
**w*** is assured, provided that **Z** is nonsingular. Hence, **w***
can be computed as

$$ \mathbf{w}^{*} = \left(\mathbf{Z}^{T}\right)^{-1}\mathbf{d} \tag{6.1.10} $$

Although in theory Equation (6.1.10) always assures
a solution to the strict interpolation problem, in practice the
direct computation of (**Z**^T)^{−1} can become ill-conditioned
due to the possibility of **Z**^T
being nearly singular. Alternatively, one may resort to Equation
(6.1.6) for an adaptive computation of **w***.
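
A small sketch of the strict interpolation design of Equations (6.1.9) and (6.1.10) follows (Gaussian units with a common width σ are assumed; the function names are our own, and in a nearly singular case one would fall back to the adaptive rule of Equation (6.1.6) or a pseudo-inverse).

```python
import numpy as np

def strict_interpolation_rbf(X, d, sigma):
    """Strict interpolation design: one Gaussian unit is centered on each of
    the m training points X (m, n), and the weights solve Z^T w = d exactly."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise ||x^i - x^j||^2
    Z = np.exp(-d2 / (2.0 * sigma ** 2))      # m x m interpolation matrix, Eq. (6.1.9)
    return np.linalg.solve(Z.T, d)            # w* = (Z^T)^{-1} d, Eq. (6.1.10)

def strict_interpolation_predict(X_new, X, w, sigma):
    """Evaluate the strictly interpolating RBF net at new inputs X_new (m', n)."""
    d2 = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2)) @ w
```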

Receptive field properties play an important role
in the quality of an RBF network's approximation capability.
To see this, consider a single input/single output RBF network
for approximating a continuous function *f*: R → R. Approximation
error, due to error in the "fit" of the RBF network
to that of the target function *f*, occurs when the receptive
fields (e.g., Gaussians) are either too broad and/or too widely
spaced relative to the fine spatial structure of *f*. In
other words, these factors act to locally limit the high frequency
content of the approximating network. According to Nyquist's
sampling criterion, the highest frequency which may be recovered
from a sampled signal is one half the sampling frequency. Therefore,
when the receptive field density is not high enough, the high
frequency fine structure in the function being approximated is
lost. The high frequency fine structure of *f* can also
be "blurred" when the receptive fields are excessively
wide. By employing the Taylor series expansion, it can be shown
(Hoskins et al., 1993) that when the width parameter σ is large,
the RBF net exhibits polynomial behavior with an order successively
decreasing as the RBF widths increase. In other words, the net's
output approaches that of a polynomial function whose order is
decreasing in σ. Therefore, it is important that receptive field
densities and widths be chosen to match the frequency transfer
characteristics imposed by the function *f* (Mel and Omohundro,
1991). These results also suggest that even for moderately high
dimensional input spaces, a relatively large number of RBF's must
be used if the training data represent high frequency content
mappings (functions) and if low approximation error is desired.
These observations can be generalized to the case of RBF network
approximation of multivariate functions.

**Example 6.1.1:** The following
example illustrates the application of the RBF net for approximating
the function *g*(*x*) (refer to the solid line
plot in Figure 6.1.2) from the fifteen noise-free samples (x^j, g(x^j)),
*j* = 1, 2, ..., 15, shown in Figure 6.1.2.
We will employ the method of strict interpolation for designing
the RBF net. Hence, 15 Gaussian hidden units are used (all having
the same width parameter σ) with the *j*th Gaussian unit having
its center μ_j equal
to x^j.
The design is completed by computing the weight vector **w**
of the output linear unit using Equation (6.1.10). Three designs
are generated which correspond to the values σ = 0.5, 1.0, and
1.5. We then tested these networks with two hundred inputs *x*,
uniformly sampled in the interval [−8, 12].
The output of the RBF net is shown in Figure 6.1.2 for σ = 0.5
(dotted line), σ = 1.0 (dashed line), and σ = 1.5
(dotted-dashed line). The value σ = 1.0 is close
to the average distance among all 15 sample points, and it resulted
in better interpolation of *g*(*x*) compared to σ = 0.5
and σ = 1.5. As expected, these results show poor extrapolation
capabilities by the RBF net, regardless of the value of σ (check
the net output in Figure 6.1.2 outside the range covered by the
samples, e.g., for *x* > 10). It is interesting to note the
excessive overfit by the RBF net for relatively high σ (compare
to the polynomial-based strict interpolation of the same data
shown in Figure 5.2.2). Finally, by comparing the above results
to those in Figure 5.2.3 one can see that more accurate interpolation
is possible with sigmoidal hidden unit nets; this is mainly attributed
to the ability of feedforward multilayer sigmoidal unit nets to
approximate the first derivative of *g*(*x*).

Figure 6.1.2. RBF net approximation of the function
*g*(*x*) (shown as a solid line), based on strict
interpolation using the 15 samples shown (small circles). The
RBF net employs 15 Gaussian hidden units and its output is shown
for three hidden unit widths: σ = 0.5 (dotted line),
σ = 1.0 (dashed line), and σ = 1.5 (dotted-dashed
line). (Compare these results to those in Figures 5.2.2 and 5.2.3.)

**6.1.1 RBF Networks Versus Backprop Networks**

RBF networks have been applied with success to function
approximation (Broomhead and Lowe, 1988; Lee and Kil, 1988; Casdagli,
1989; Moody and Darken, 1989a, 1989b) and classification (Niranjan
and Fallside, 1988; Nowlan, 1990; Lee, 1991; Wettschereck and
Dietterich, 1992; Vogt, 1993). On difficult approximation/prediction
tasks (e.g., predicting the Glass-Mackey chaotic series of Problem
5.4.1 *T* time steps (*T* > 50) in the
future), RBF networks which employ clustering for locating hidden
unit receptive field centers can achieve a performance comparable
to backprop networks (backprop-trained feedforward networks with
sigmoidal hidden units), while requiring orders of magnitude less
training time than backprop. However, the RBF network typically
requires ten or more times as much data to achieve the same accuracy
as a backprop network. The accuracy of RBF networks may be further
improved if supervised learning of receptive field centers is
used (Wettschereck and Dietterich, 1992) but the speed advantage
over backprop networks is compromised. For difficult classification
tasks, RBF networks or their modified versions (see Section 6.1.2)
employing sufficient training data and hidden units can lead to
better classification rates (Wettschereck and Dietterich, 1992)
and smaller "false-positive" classification errors (Lee,
1991) compared to backprop networks. In the following, qualitative
arguments are given for the above simulation-based observations
on the performance of RBF and backprop networks.

Some of the reasons for the training speed advantage
of RBF networks have been presented earlier in this section.
Basically, since the receptive field representation is well localized,
only a small fraction of the hidden units in an RBF network responds
to any particular input vector. This allows the use of efficient
self-organization (clustering) algorithms for adapting such units
in a training mode that does not involve the network's output
units. On the other hand, all units in a backprop network must
be evaluated and their weights updated for every input vector.
Another important reason for the faster training speed of RBF
networks is the hybrid two-stage training scheme employed, which
decouples the learning tasks of the hidden and output layers,
thus eliminating the need for the slow back error propagation.

The RBF network with self-organized receptive fields
needs more data and more hidden units to achieve similar precision
to that of the backprop network. When used for function approximation,
the backprop network performs global fit to the training data,
whereas the RBF network performs local fit. This results in greater
generalization by the backprop network from each training example.
It also utilizes the network's free parameters more efficiently,
which leads to a smaller number of hidden units. Furthermore,
the backprop network is a better candidate net when extrapolation
is desired. This is primarily due to the ability of feedforward
nets, with sigmoidal hidden units, to approximate a function and
its derivatives (see Section 5.2.5). On the other hand, the local
nature of the hidden unit receptive fields in RBF nets prevents
them from being able to "see" beyond the training data.
This makes the RBF net a poor extrapolator.

When used as a classifier, the RBF net can lead
to low "false-positive" classification rates. This
property is due to the same reason that makes RBF nets poor extrapolators.
Regions of the input space which are far from training vectors
are usually mapped to low values by the localized receptive fields
of the hidden units. By contrast, the sigmoidal hidden units
in the backprop network can have high output even in regions far
away from those populated by training data. This causes the backprop
network/classifier to indicate high confidence classifications
to meaningless inputs. False-positive classification may be reduced
in backprop networks by employing the "training with rubbish"
strategy discussed at the end of Section 5.3.3. However, when
dealing with high dimensional input spaces, this strategy
generally requires an excessively large training set due to the
large number of possible "rubbish" pattern combinations.

Which network is better to use for which tasks?
The backprop network is better to use when training data is expensive
(or hard to generate) and/or retrieval speed, assuming a serial
machine implementation, is critical (the smaller backprop network
size requires less storage and leads to faster retrievals compared
to RBF networks). However, if the data is cheap and plentiful
and if on-line training is required (e.g., the case of adaptive
signal processing or adaptive control where data is acquired at
a high rate and cannot be saved), then the RBF network is superior.

In their work on RBF networks, Moody and Darken
(1989a) suggested the use of normalized hidden unit activities
according to

$$ \bar{z}_j = \frac{z_j}{\sum_{i=1}^{J} z_i} \tag{6.1.11} $$

based on empirical evidence of improved approximation
properties. The use of Equation (6.1.11) implies that Σ_j z̄_j = 1
for all inputs **x**; i.e., the unweighted sum of all hidden
unit activities in an RBF network results in the unity function.
Here, the RBF network realizes a "partition of unity,"
which is a desired mathematical property in function decomposition/approximation
(Werntges, 1993); the motivation being that a superposition of
basis functions that can represent the unity function (*f*(**x**) = 1)
"exactly" would also suppress spurious structure when
fitting a non-trivial function. In other words, the normalization
in Equation (6.1.11) leads to a form of "smoothness"
regularization.
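
A sketch of the normalization in Equation (6.1.11) for Gaussian hidden units is given below (the names are ours; the small constant added to the denominator is only a guard against numerical underflow far from all centers).

```python
import numpy as np

def normalized_rbf_activities(X, centers, widths, eps=1e-12):
    """Normalized hidden unit activities of Equation (6.1.11): each Gaussian
    response is divided by the sum of all hidden responses, so the activities
    for every input sum to one (a "partition of unity")."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Z = np.exp(-d2 / (2.0 * widths ** 2))
    return Z / (Z.sum(axis=1, keepdims=True) + eps)   # rows sum to (nearly) 1
```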

Another justification for the normalization of hidden
unit outputs may be given based on statistical arguments. If
one interprets z_j
in Equation (6.1.1) as the probability P_j(**x**^k)
of observing **x**^k
under Gaussian distribution *j*:

$$ P_j(\mathbf{x}^{k}) = a \exp\!\left( -\frac{\lVert \mathbf{x}^{k} - \boldsymbol{\mu}_j \rVert^{2}}{2\sigma^{2}} \right) \tag{6.1.12} $$

(where *a* is a normalization constant and σ_j = σ
for all *j*) and one also assumes that all Gaussians are selected
with equal probability, then the probability of Gaussian *j*
having generated **x**^k,
given that we have observed **x**^k,
is:

$$ P(j \mid \mathbf{x}^{k}) = \frac{P_j(\mathbf{x}^{k})}{\sum_{i=1}^{J} P_i(\mathbf{x}^{k})} \tag{6.1.13} $$

Therefore, the normalization in Equation (6.1.11)
now has a statistical significance: it represents the conditional
probability P(j | **x**^k) of unit *j* having generated **x**^k.

Another variation of RBF networks involves the so-called
"soft" competition among Gaussian units for locating
the centers μ_j (Nowlan,
1990). The clustering of the μ_j's
according to the incremental *k*-means algorithm is equivalent
to a "hard" competition winner-take-all operation where,
upon the presentation of input **x**^k,
the RBF unit with the highest output z_j
updates its mean μ_j
according to Equation (6.1.5). This in effect realizes an iterative
version of the "approximate" maximum likelihood estimate
(Nowlan, 1990):

$$ \boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{\mathbf{x}^{k} \in S_j} \mathbf{x}^{k} \tag{6.1.14} $$

where S_j
is the set of exemplars closest to Gaussian *j*, and N_j
is the number of vectors contained in this set. Rather than using
the approximation in Equation (6.1.14), the "exact"
maximum likelihood estimate for μ_j
is given by (Nowlan, 1990):

$$ \boldsymbol{\mu}_j = \frac{\sum_{k=1}^{m} P(j \mid \mathbf{x}^{k})\, \mathbf{x}^{k}}{\sum_{k=1}^{m} P(j \mid \mathbf{x}^{k})} \tag{6.1.15} $$

where P(j | **x**^k)
is given by Equation (6.1.13). In this "soft" competitive
model, all hidden unit centers are updated according to an iterative
version of Equation (6.1.15). One drawback of this "soft"
clustering method is its computational requirements, in that all
μ_j's, rather than only
the mean of the winner, are updated for each input. However,
the high performance of RBF networks employing "soft"
competition may justify this added training computational cost.
For example, consider the classical vowel recognition task of
Peterson and Barney (1952). Here, the data is obtained by spectrographic
analysis and consists of the first and second formant frequencies
of 10 vowels contained in words spoken by a total of 67 men, women
and children. The spoken words consisted of 10 monosyllabic words
each beginning with the letter "h" and ending with "d"
and differing only in the vowel. The words used to obtain the
data were *heed*, *hid*, *head*, *had*, *hud*,
*hod*, *heard*, *hood*, *who'd*, and *hawed*.
This vowel data is randomly split into two sets, resulting in
338 training examples and 333 test examples. A plot of the test
examples is shown in Figure 6.1.3. An RBF network employing 100
Gaussian hidden units and soft competition for locating the Gaussian
means is capable of 87.1 percent correct classification on the
333 example test set of the vowel data after being trained with
the 338 training examples (Nowlan, 1990). This performance exceeds
the 82.0%, 82.0%, and 80.2% recognition rates reported for a 100
unit *k*-means-trained RBF network (Moody and Darken, 1989b),
*k*-nearest neighbor network (Huang and Lippmann, 1988),
and backprop network (Huang and Lippmann, 1988), respectively
(the decision boundaries shown in Figure 6.1.3 are those generated
by the backprop network). A related general framework for designing
optimal RBF classifiers can be found in Fakhr (1993).
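
For completeness, the following batch sketch (our own simplification; Nowlan's formulation is incremental) implements the "soft" competition of Equations (6.1.13) and (6.1.15) under the equal-prior, common-width assumptions of Equation (6.1.12).

```python
import numpy as np

def soft_competition_centers(X, centers, sigma, iters=50, eps=1e-12):
    """Batch "soft" competition: every center is pulled toward every sample,
    weighted by the posterior P(j | x^k) that Gaussian j generated x^k.
    X (m, n) samples, centers (J, n) initial means, sigma common width."""
    centers = centers.copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (m, J)
        P = np.exp(-d2 / (2.0 * sigma ** 2))                           # P_j(x^k), Eq. (6.1.12)
        P /= P.sum(axis=1, keepdims=True) + eps                        # P(j | x^k), Eq. (6.1.13)
        # Exact maximum likelihood estimate of each mean, Eq. (6.1.15)
        centers = (P.T @ X) / (P.sum(axis=0)[:, None] + eps)
        # (The "hard" k-means rule would instead assign each x^k only to its winner.)
    return centers
```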

Figure 6.1.3. A plot of the test samples for the
10 vowel problem of Peterson and Barney (1952). The lines are
class boundaries generated by a two layer feedforward net trained
with backprop on training samples. (Adapted from W. Y. Huang
and R. P. Lippmann, 1988, with permission of the American Institute
of Physics).

We conclude this section by considering a network
of "semilocal activation" hidden units (Hartman and
Keeler, 1991a). This network has been found to retain training
speeds comparable to those of RBF networks, with the advantages of requiring
a smaller number of units to cover high-dimensional input spaces
and producing high approximation accuracy. Semilocal activation
networks are particularly advantageous when the training set has
irrelevant input exemplars.

An RBF unit responds to a localized region of the
input space. Figure 6.1.4 (a) shows the response of a two input
Gaussian RBF. On the other hand, a sigmoid unit responds to a
semi-infinite region by partitioning the input space with a "sigmoidal"
hypersurface, as shown in Figure 6.1.4 (b). RBF's have greater
flexibility in discriminating finite regions of the input space,
but this comes at the expense of a great increase in the number
of required units. To overcome this tradeoff, "Gaussian-bar"
units with the response depicted in Figure 6.1.4 (c) may be used
to replace the RBF's. Analytically, the output of the *j*th
Gaussian-bar unit is given by:

$$ z_j = \sum_{i=1}^{n} w_{ji} \exp\!\left( -\frac{\left(x_i - \mu_{ji}\right)^{2}}{2\sigma_{ji}^{2}} \right) \tag{6.1.16} $$

where *i* indexes the input dimension and w_ji
is a positive parameter signifying the *i*th weight of the
*j*th hidden unit. For comparison purposes, we write the
Gaussian RBF as a product:

$$ z_j = \prod_{i=1}^{n} \exp\!\left( -\frac{\left(x_i - \mu_{ji}\right)^{2}}{2\sigma_{ji}^{2}} \right) = \exp\!\left( -\sum_{i=1}^{n} \frac{\left(x_i - \mu_{ji}\right)^{2}}{2\sigma_{ji}^{2}} \right) \tag{6.1.17} $$

According to Equation (6.1.16), the Gaussian-bar
unit responds if any of the *n* component Gaussians is activated (assuming
the scaling factors w_ji
are non-zero), while a Gaussian RBF requires all component Gaussians
to be activated. Thus a Gaussian-bar unit is more like an "ORing"
device and a pure Gaussian is more like an "ANDing"
device. Note that a Gaussian-bar network has significantly more
free parameters to adjust compared to a Gaussian RBF network of
the same size (number of units). The output units in a Gaussian-bar
network can be linear or Gaussian-bar.
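
The "ORing" versus "ANDing" distinction between Equations (6.1.16) and (6.1.17) is easy to see in code; the short sketch below (our own illustration, with assumed variable names) evaluates both unit types for a single input vector.

```python
import numpy as np

def gaussian_bar_unit(x, mu_j, sigma_j, w_j):
    """Gaussian-bar response, Equation (6.1.16): a weighted SUM of the n
    one-dimensional component Gaussians ("ORing" behavior).
    x, mu_j, sigma_j, w_j are all (n,) arrays."""
    g = np.exp(-((x - mu_j) ** 2) / (2.0 * sigma_j ** 2))  # per-dimension Gaussians
    return float(w_j @ g)

def gaussian_rbf_unit(x, mu_j, sigma_j):
    """Ordinary Gaussian RBF, Equation (6.1.17): the PRODUCT of the same
    component Gaussians ("ANDing" behavior)."""
    g = np.exp(-((x - mu_j) ** 2) / (2.0 * sigma_j ** 2))
    return float(np.prod(g))
```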

Because of their semilocal receptive fields, the
centers μ_j of the
hidden units cannot be determined effectively using competitive
learning as in RBF networks. Therefore, supervised gradient descent-based
learning is normally used to update all network parameters.


Figure 6.1.4. Response characteristics for two-input
(a) Gaussian, (b) sigmoid, and (c) Gaussian-bar units.

Since the above Gaussian-bar network employs parameter
update equations which are non-linear in their parameters, one
might suspect that the training speed of such a network is compromised.
On the contrary, simulations involving difficult function
prediction tasks have shown that Gaussian-bar networks train
significantly faster than sigmoid networks, and slower than, but of
the same order as, RBF networks. One possible explanation for
the training speed of Gaussian-bar networks could be their built-in
automatic dynamic reduction of the network architecture (Hartman
and Keeler, 1991b), as explained next.

A Gaussian-bar unit can effectively "prune"
input dimension *i* by one of the following mechanisms: w_ji
becoming zero, μ_ji
moving away from the data, and/or σ_ji
shrinking to a very small value. These mechanisms can occur completely
independently for each input dimension. On the other hand, moving
any one of the μ_ji's
away from the data or shrinking any one of the σ_ji's
to zero deactivates a Gaussian unit completely. Sigmoid units
may also be pruned (according to the techniques of Section 5.2.5)
but such pruning is limited to synaptic weights. Therefore, Gaussian-bar
networks have greater pruning flexibility than sigmoid or Gaussian
RBF networks. Training time could also be reduced by monitoring
pruned units and excluding them from the calculations. Since
pruning may lead to very small σ_ji's
which, in turn, create a spike response at μ_ji,
it is desirable to move such μ_ji
to a location far away from the data in order to eliminate the
adverse effect of these spikes on generalization. Here, one may avoid
this danger, reduce storage requirements, and increase retrieval
speed by postprocessing trained networks to remove the pruned
units of the network.

Many other versions of RBF networks can be found
in the literature (see Moody, 1989; Jones et al., 1990; Saha and
Keeler, 1990; Bishop, 1991; Kadirkamanathan et al., 1991; Mel
and Omohundro, 1991; Platt, 1991; Musavi et al., 1992; Wettschereck
and Dietterich, 1992; Lay and Hwang, 1993). Roy and Govil (1993)
presented a method based on linear programming models which simultaneously
adds RBF units and trains the RBF network in polynomial time for
classification tasks. This training method is described in detail
in Section 6.3.1 in connection with a hyperspherical classifier
net similar to the RBF network.