6.1 Radial Basis Function (RBF) Networks

In this section, we describe an artificial neural network model motivated by biological neurons with "locally-tuned" responses. Neurons with locally-tuned response characteristics can be found in many parts of biological nervous systems; these nerve cells have response characteristics which are "selective" for some finite range of the input signal space. The cochlear stereocilia cells, for example, have a locally-tuned response to frequency which is a consequence of their biophysical properties. The present model is also motivated by earlier work on radial basis functions (Medgyessy, 1961), which have been utilized for interpolation (Micchelli, 1986; Powell, 1987), probability density estimation (Parzen, 1962; Duda and Hart, 1973; Specht, 1990), and approximation of smooth multivariate functions (Poggio and Girosi, 1989). The model is commonly referred to as the radial basis function (RBF) network.

The most important feature that distinguishes the RBF network from earlier radial basis function-based models is its adaptive nature, which generally allows it to utilize a relatively small number of locally-tuned units (RBF's). RBF networks were independently proposed by Broomhead and Lowe (1988), Lee and Kil (1988), Niranjan and Fallside (1988), and Moody and Darken (1989a, 1989b). Similar schemes were also suggested by Hanson and Burr (1987), Lapedes and Farber (1987), Casdagli (1989), Poggio and Girosi (1990b), and others. The following is a description of the basic RBF network architecture and its associated training algorithm.

The RBF network has a feedforward structure consisting of a single hidden layer of J locally-tuned units which are fully interconnected to an output layer of L linear units as shown in Figure 6.1.1.


Figure 6.1.1. A radial basis function neural network consisting of a single hidden layer of locally-tuned units which is fully interconnected to an output layer of linear units. For clarity, only hidden to output layer connections for the lth output unit are shown.

All hidden units simultaneously receive the n-dimensional real-valued input vector x. Notice the absence of hidden layer weights in Figure 6.1.1. This is because the hidden unit outputs are not calculated using the weighted-sum/sigmoidal activation mechanism as in the previous chapter. Rather, each hidden unit output z_j is obtained by calculating the "closeness" of the input x to an n-dimensional parameter vector μ_j associated with the jth hidden unit. The response characteristics of the jth hidden unit are given by:

z_j = K\!\left( \frac{\|x - \mu_j\|}{\sigma_j} \right)        (6.1.1)

where K is a strictly positive radially-symmetric function (kernel) with a unique maximum at its "center" μ_j and which drops off rapidly to zero away from the center. The parameter σ_j is the "width" of the receptive field in the input space for unit j. This implies that z_j has an appreciable value only when the "distance" ||x − μ_j|| is smaller than the width σ_j. Given an input vector x, the output of the RBF network is the L-dimensional activity vector y whose lth component is given by:

y_l = \sum_{j=1}^{J} w_{lj}\, z_j(x), \qquad l = 1, 2, \ldots, L        (6.1.2)

It is interesting to note here that for L = 1 the mapping in Equation (6.1.2) is similar in form to that employed by a PTG, as in Equation (1.4.1). However, in the RBF net, a choice is made to use radially symmetric kernels as "hidden units" as opposed to monomials.

RBF networks are best suited for approximating continuous or piecewise continuous real-valued mappings where n is sufficiently small; these approximation problems include classification problems as a special case. According to Equations (6.1.1) and (6.1.2), the RBF network may be viewed as approximating a desired function f(x) by a superposition of non-orthogonal bell-shaped basis functions. The degree of accuracy can be controlled by three parameters: the number of basis functions used, their location, and their width. In fact, like feedforward neural networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators (Poggio and Girosi, 1989; Hartman et al., 1990; Baldi, 1991; Park and Sandberg, 1991, 1993).

A special but commonly used RBF network assumes a Gaussian basis function for the hidden units:

z_j = \exp\!\left( -\frac{\|x - \mu_j\|^2}{2\sigma_j^2} \right)        (6.1.3)

where σ_j and μ_j are the standard deviation (width) and mean (center) of the jth unit receptive field, respectively, and the norm is the Euclidean norm. Another possible choice for the basis function is the logistic function of the form:

z_j = \frac{1}{1 + \exp\!\left( \frac{\|x - \mu_j\|^2}{\sigma_j^2} - \theta_j \right)}        (6.1.4)

where θ_j is an adjustable bias. In fact, with the basis function in Equation (6.1.4), the only difference between an RBF network and a feedforward neural network with a single hidden layer of sigmoidal units is the similarity computation performed by the hidden units. If we think of μ_j as the parameter (weight) vector associated with the jth hidden unit, then it is easy to see that an RBF network can be obtained from a single hidden layer neural network with unipolar sigmoid-type units and linear output units (like the one in Figure 5.1.1) by simply replacing the jth hidden unit weighted-sum net_j = x^T μ_j by the negative of the normalized squared Euclidean distance, −||x − μ_j||^2/σ_j^2. On the other hand, the use of the Gaussian basis function in Equation (6.1.3) leads to hidden units with Gaussian-type activation functions and with a Euclidean distance similarity computation. In this case, no bias is needed.
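To make Equations (6.1.1) through (6.1.4) concrete, the following minimal sketch (in Python with NumPy; all function and variable names here are ours, introduced only for illustration) computes the hidden unit activities for the Gaussian and logistic basis functions and the linear output layer of Equation (6.1.2):

    import numpy as np

    def gaussian_kernel(x, mu, sigma):
        # Equation (6.1.3): Gaussian basis; x is (n,), mu is (J, n), sigma is (J,)
        return np.exp(-np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma ** 2))

    def logistic_kernel(x, mu, sigma, theta):
        # Equation (6.1.4): logistic basis with adjustable biases theta, shape (J,)
        return 1.0 / (1.0 + np.exp(np.sum((x - mu) ** 2, axis=1) / sigma ** 2 - theta))

    def rbf_forward(x, mu, sigma, W):
        # Equations (6.1.1)-(6.1.2): hidden activities z followed by linear output units
        z = gaussian_kernel(x, mu, sigma)   # (J,) hidden layer activities
        y = W @ z                           # (L,) network outputs, W is (L, J)
        return y, z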

Next, we turn our attention to the training of RBF networks. Consider a training set of m labeled pairs {x_i, d_i} which represent associations of a given mapping or samples of a continuous multivariate function. Also, consider the SSE criterion function as an error function E that we desire to minimize over the given training set. In other words, we would like to develop a training method that minimizes E by adaptively updating the free parameters of the RBF network. These parameters are the receptive field centers (the means μ_j of the hidden layer Gaussian units), the receptive field widths (the standard deviations σ_j), and the output layer weights w_lj.

Because of the differentiable nature of the RBF network's transfer characteristics, one of the first training methods that comes to mind is a fully supervised gradient descent method over E (Moody and Darken, 1989a; Poggio and Girosi, 1989). In particular, μ_j, σ_j, and w_lj are updated as follows: Δμ_j = −ρ_μ ∂E/∂μ_j, Δσ_j = −ρ_σ ∂E/∂σ_j, and Δw_lj = −ρ_w ∂E/∂w_lj, where ρ_μ, ρ_σ, and ρ_w are small positive constants. This method, although capable of matching or exceeding the performance of backprop-trained networks, still gives training times comparable to those of sigmoidal-type networks (Wettschereck and Dietterich, 1992).
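As an illustration only, the following sketch spells out one such gradient descent step for a net with Gaussian hidden units and linear output units, using the single-pattern SSE E = ½ Σ_l (d_l − y_l)^2; the gradients are written out by hand, and the learning-rate names rho_mu, rho_sigma, and rho_w are ours:

    import numpy as np

    def supervised_rbf_step(x, d, mu, sigma, W, rho_mu=1e-3, rho_sigma=1e-3, rho_w=1e-2):
        # One fully supervised gradient descent step on E = 0.5 * sum_l (d_l - y_l)^2.
        diff = x - mu                                  # (J, n)
        dist2 = np.sum(diff ** 2, axis=1)              # ||x - mu_j||^2
        z = np.exp(-dist2 / (2.0 * sigma ** 2))        # Gaussian hidden activities
        y = W @ z                                      # linear outputs, W is (L, J)
        e = d - y                                      # output errors d_l - y_l
        delta = W.T @ e                                # error reaching hidden unit j
        W += rho_w * np.outer(e, z)                    # Delta w_lj = rho_w (d_l - y_l) z_j
        mu += rho_mu * (delta * z / sigma ** 2)[:, None] * diff
        sigma += rho_sigma * delta * z * dist2 / sigma ** 3
        return mu, sigma, W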

One reason for the slow convergence of the above supervised gradient descent trained RBF network is its inefficient use of the locally-tuned representation of the hidden layer units. When the hidden unit receptive fields are narrow, only a small fraction of the total number of units in the network is activated for a given input x; the activated units are the ones with centers very close to the input vector in the input space. Thus, only those units which were activated need be updated for each input presentation. The above supervised learning method, though, places no restrictions on maintaining small values for σ_j. Thus, the supervised learning method is not guaranteed to utilize the computational advantages of locality. One way to rectify this problem is to restrict gradient descent-based learning to the basis function centers and use a method that maintains small values for the σ_j's. Examples of learning methods which take advantage of the locality property of the hidden units are presented below.

A training strategy that decouples learning at the hidden layer from that at the output layer is possible for RBF networks due to the local receptive field nature of the hidden units. This strategy has been shown to be very effective in terms of training speed, though this advantage is generally offset by reduced generalization ability unless a large number of basis functions is used. In the following, we describe efficient methods for locating the receptive field centers and computing receptive field widths. As for the output layer weights, once the hidden units are synthesized, these weights can be easily computed using the delta rule [Equation (5.1.2)]. One may view this computation as finding the proper normalization coefficients of the basis functions. That is, the weight w_lj determines the amount of contribution of the jth basis function to the lth output of the RBF net.

Several schemes have been suggested to find proper receptive field centers and widths without propagating the output error back through the network. The idea here is to populate dense regions of the input space with receptive fields. One method places the centers of the receptive fields according to some coarse lattice defined over the input space (Broomhead and Lowe, 1988). Assuming a uniform lattice with k divisions along each dimension of an n-dimensional input space, this lattice would require k^n basis functions to cover the input space. This exponential growth renders the approach impractical for high dimensional spaces. An alternative approach is to center k receptive fields on a set of k randomly chosen training samples. Here, unless we have prior knowledge about the location of prototype input vectors and/or the regions of the input space containing meaningful data, a large number of receptive fields would be required to adequately represent the distribution of the input vectors in a high dimensional space.

Moody and Darken (1989a) employed unsupervised learning of the receptive field centers μ_j in which a relatively small number of RBF's are used; the adaptive centers learn to represent only the parts of the input space which are richly represented by clusters of data. The adaptive strategy also helps reduce sampling error, since it allows the μ_j's to be determined by a large number of training samples. Here, the k-means clustering algorithm (MacQueen, 1967; Anderberg, 1973) is used to locate a set of k RBF centers which represents a local minimum of the SSE between the training set vectors x and the nearest of the k receptive field centers μ_j (this SSE criterion function is given by Equation (4.6.4) with w replaced by μ). In the basic k-means algorithm, the k RBF's are initially assigned centers μ_j, j = 1, 2, ..., k, which are set equal to k randomly selected training vectors. The remaining training vectors are assigned to the class j of the closest center μ_j. Next, each center is recomputed as the average of the training vectors in its class. These two steps are repeated until all centers stop changing. An incremental version of this batch mode process may also be used which requires no storage of past training vectors or cluster membership information. Here, at each time step, a random training vector x is selected and the center of the nearest (in a Euclidean distance sense) receptive field is updated according to:

\Delta \mu_j = \rho\,(x - \mu_j)        (6.1.5)

where ρ is a small positive constant. Equation (6.1.5) is the simple competitive rule which we have analyzed in Section 4.6.1. Similarly, we may use learning vector quantization (LVQ) or one of its variants (see Section 3.4.2) to effectively locate the k RBF centers (Vogt, 1993). Generally speaking, there is no formal method for specifying the required number k of hidden units in an RBF network. Cross-validation is normally used to decide on k.
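A minimal sketch of the incremental (competitive) center update of Equation (6.1.5); only the winning (nearest) center is moved toward the presented input, and the names below are ours:

    import numpy as np

    def update_nearest_center(x, mu, rho=0.05):
        # Simple competitive (winner-take-all) rule, Equation (6.1.5): only the
        # receptive field center closest to x is moved toward x.
        j = np.argmin(np.sum((mu - x) ** 2, axis=1))   # index of the nearest center
        mu[j] += rho * (x - mu[j])
        return mu

Cycling this update over randomly selected training vectors approximates the batch k-means solution without storing past samples or cluster memberships.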

Once the receptive field centers are found using one of the above methods, their widths can be determined by one of several heuristics in order to obtain smooth interpolation. Theoretically speaking, RBF networks with the same width σ in each hidden kernel unit retain the capability of universal approximation (Park and Sandberg, 1991). This suggests that we may simply use a single global fixed value σ for all σ_j's in the network. In order to preserve the local response characteristics of the hidden units, one should choose a relatively small (positive) value for this global width parameter. The actual value of σ for a particular training set may be found by cross-validation. Empirical results (Moody and Darken, 1989a) suggest that a "good" estimate for the global width parameter is the average, taken over all units, of the Euclidean distance between the center of each unit and that of its nearest neighbor. Other heuristics based on local computations may be used which yield individually-tuned widths σ_j. For example, the width for unit j may be set to σ_j = γ||μ_j − μ_i||, where μ_i is the center of the nearest neighbor to unit j (usually, γ is taken between 1.0 and 1.5). For classification tasks, one may make use of the category label of the nearest training vector: if that category label is different from the one represented by the current RBF unit, it is advisable to use a smaller width, which narrows the bell-shaped receptive field of the current unit. This leads to a sharpening of the class domains and allows for better approximation.
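The two width heuristics just described may be sketched as follows (again with NumPy; the function names and the scaling factor gamma are ours):

    import numpy as np

    def global_width(mu):
        # Global heuristic: average distance between each center and its nearest
        # neighboring center (Moody and Darken, 1989a).
        d = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)                    # ignore self-distances
        return d.min(axis=1).mean()

    def local_widths(mu, gamma=1.2):
        # Local heuristic: sigma_j = gamma * distance to the nearest neighboring
        # center, with gamma typically between 1.0 and 1.5.
        d = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)
        return gamma * d.min(axis=1)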

We have already noted that the output layer weights, w_lj, can be adaptively computed using the delta rule

\Delta w_{lj} = \rho\,(d_l - y_l)\, f'(net_l)\, z_j        (6.1.6)

once the hidden layer parameters are obtained. Here, the term f'(net_l) can be dropped for the case of linear units. Equation (6.1.6) drives the output layer weights to minimize the SSE criterion function [recall Equation (5.1.13)] for sufficiently small ρ. Alternatively, for the case of linear output units, one may formulate the problem of computing the weights as a set of simultaneous linear equations and employ the generalized-inverse method [recall Equation (3.1.42)] to obtain the minimum SSE solution. Without loss of generality, consider a single output RBF net, and denote by w = [w_1 w_2 ... w_J]^T the weight vector of the output unit. Now, recalling Equations (3.1.39) through (3.1.42), the minimum SSE solution for the system of equations Z^T w = d is given by (assuming an overdetermined system; i.e., m ≥ J)

w^* = Z^{\dagger} d = (Z Z^T)^{-1} Z\, d        (6.1.7)

where Z = [z_1 z_2 ... z_m] is a J × m matrix, and d = [d_1 d_2 ... d_m]^T. Here, z_i is the output vector of the hidden layer for input x_i. Therefore, the jith element of matrix Z may be expressed explicitly as

z_{ji} = z_j(x_i) = K\!\left( \frac{\|x_i - \mu_j\|}{\sigma_j} \right)        (6.1.8)

with the parameters μ_j and σ_j assumed to have been computed using the earlier described methods.
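For a single-output net with Gaussian hidden units, the minimum SSE weights of Equation (6.1.7) can be computed directly; the sketch below builds the J × m matrix Z of Equation (6.1.8) and solves the overdetermined system Z^T w = d in the least-squares sense (the function name is ours, and np.linalg.lstsq is used rather than forming (ZZ^T)^{-1} explicitly):

    import numpy as np

    def output_weights_min_sse(X, d, mu, sigma):
        # X: (m, n) training inputs, d: (m,) targets, mu: (J, n) centers, sigma: (J,) widths.
        dist2 = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # (m, J)
        Z = np.exp(-dist2 / (2.0 * sigma ** 2)).T                       # (J, m), z_ji = z_j(x_i)
        # Minimum SSE solution of Z^T w = d, Equation (6.1.7).
        w, *_ = np.linalg.lstsq(Z.T, d, rcond=None)
        return w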

For "strict" interpolation problems, it is desired that an interpolation function be found which is constrained to "exactly" map the sample points xi into their associated targets di, for i = 1, 2, ..., m. It is well known that a polynomial with finite order r = m - 1 is capable of performing strict interpolation on m samples {xidi}, with distinct xi's in Rn (see Problem 1.3.4). A similar result is available for RBF nets. This result states that there is a class of radial-basis functions which guarantee that an RBF net with m such functions is capable of strict interpolation of m sample point in Rn (Micchelli, 1986 ; Light, 1992b), the Gaussian function in Equation (6.1.3) is one example. Furthermore, there is no need to search for the centers j; one can just set j = xj for j = 1, 2, ..., m. Thus, for strict interpolation, the Z matrix in Equation (6.1.8) becomes the m × m matrix

Z = \left[ z_{ji} \right] = \left[ K\!\left( \frac{\|x_i - x_j\|}{\sigma_j} \right) \right], \qquad i, j = 1, 2, \ldots, m        (6.1.9)

which we refer to as the interpolation matrix. Note that the appropriate width parameters σ_j still need to be found; the choice of these parameters affects the interpolation quality of the RBF net.

According to the above discussion, an exact solution w* is assured. This requires Z to be nonsingular. Hence, w* can be computed as

w^* = (Z^T)^{-1} d        (6.1.10)

Although in theory Equation (6.1.10) always assures a solution to the strict interpolation problem, in practice the direct computation of (Z^T)^{-1} can become ill-conditioned due to the possibility of Z^T being nearly singular. Alternatively, one may resort to Equation (6.1.6) for an adaptive computation of w*.
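A sketch of the strict interpolation design, assuming a common width σ for all m Gaussian units (one unit per training sample, μ_j = x_j); rather than inverting Z^T as in Equation (6.1.10), the linear system is solved directly for numerical robustness:

    import numpy as np

    def strict_interpolation_weights(X, d, sigma):
        # Builds the m x m interpolation matrix of Equation (6.1.9) for Gaussian
        # units centered on the training points themselves, then solves Z^T w = d.
        dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)    # (m, m)
        Z = np.exp(-dist2 / (2.0 * sigma ** 2))                         # symmetric for a common sigma
        return np.linalg.solve(Z.T, d)                                  # w* of Equation (6.1.10)

Example 6.1.1 below is exactly this construction, with m = 15 and σ taking the values 0.5, 1.0, and 1.5.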

Receptive field properties play an important role in the quality of an RBF network's approximation capability. To see this, consider a single input/single output RBF network for approximating a continuous function f: R → R. Approximation error, due to error in the "fit" of the RBF network to the target function f, occurs when the receptive fields (e.g., Gaussians) are either too broad and/or too widely spaced relative to the fine spatial structure of f. In other words, these factors act to locally limit the high frequency content of the approximating network. According to Nyquist's sampling criterion, the highest frequency which may be recovered from a sampled signal is one half the sampling frequency. Therefore, when the receptive field density is not high enough, the high frequency fine structure in the function being approximated is lost. The high frequency fine structure of f can also be "blurred" when the receptive fields are excessively wide. By employing the Taylor series expansion, it can be shown (Hoskins et al., 1993) that when the width parameter σ is large, the RBF net exhibits polynomial behavior, with an order that successively decreases as the RBF widths increase; in other words, the net's output approaches that of a polynomial function whose order decreases with increasing σ. Therefore, it is important that receptive field densities and widths be chosen to match the frequency transfer characteristics imposed by the function f (Mel and Omohundro, 1991). These results also suggest that even for moderately high dimensional input spaces, a relatively large number of RBF's must be used if the training data represent high frequency content mappings (functions) and if low approximation error is desired. These observations generalize to RBF network approximation of multivariate functions.

Example 6.1.1: The following example illustrates the application of the RBF net for approximating the function g(x) (refer to the solid line plot in Figure 6.1.2) from the fifteen noise-free samples (x_j, g(x_j)), j = 1, 2, ..., 15, shown in Figure 6.1.2. We employ the method of strict interpolation for designing the RBF net. Hence, 15 Gaussian hidden units are used (all having the same width parameter σ), with the jth Gaussian unit having its center μ_j equal to x_j. The design is completed by computing the weight vector w of the output linear unit using Equation (6.1.10). Three designs are generated which correspond to the values σ = 0.5, 1.0, and 1.5. We then tested these networks with two hundred inputs x, uniformly sampled in the interval [−8, 12]. The output of the RBF net is shown in Figure 6.1.2 for σ = 0.5 (dotted line), σ = 1.0 (dashed line), and σ = 1.5 (dotted-dashed line). The value σ = 1.0 is close to the average distance among the 15 sample points, and it resulted in better interpolation of g(x) compared to σ = 0.5 and σ = 1.5. As expected, these results show poor extrapolation capabilities by the RBF net, regardless of the value of σ (check the net output in Figure 6.1.2 for inputs beyond the range of the training samples, e.g., x > 10). It is also interesting to note the excessive overfit by the RBF net for relatively high σ (compare to the polynomial-based strict interpolation of the same data shown in Figure 5.2.2). Finally, by comparing the above results to those in Figure 5.2.3, one can see that more accurate interpolation is possible with sigmoidal hidden unit nets; this is mainly attributed to the ability of feedforward multilayer sigmoidal unit nets to approximate the first derivative of g(x).

Figure 6.1.2. RBF net approximation of the function g(x) (shown as a solid line), based on strict interpolation using the 15 samples shown (small circles). The RBF net employs 15 Gaussian hidden units, and its output is shown for three hidden unit widths: σ = 0.5 (dotted line), σ = 1.0 (dashed line), and σ = 1.5 (dotted-dashed line). (Compare these results to those in Figures 5.2.2 and 5.2.3.)

6.1.1 RBF Networks Versus Backprop Networks

RBF networks have been applied with success to function approximation (Broomhead and Lowe, 1988; Lee and Kil, 1988; Casdagli, 1989; Moody and Darken, 1989a, 1989b) and classification (Niranjan and Fallside, 1988; Nowlan, 1990; Lee, 1991; Wettschereck and Dietterich, 1992; Vogt, 1993). On difficult approximation/prediction tasks [e.g., predicting the Mackey-Glass chaotic series of Problem 5.4.1 T time steps (T > 50) into the future], RBF networks which employ clustering for locating hidden unit receptive field centers can achieve a performance comparable to backprop networks (backprop-trained feedforward networks with sigmoidal hidden units), while requiring orders of magnitude less training time than backprop. However, the RBF network typically requires ten times or more data to achieve the same accuracy as a backprop network. The accuracy of RBF networks may be further improved if supervised learning of receptive field centers is used (Wettschereck and Dietterich, 1992), but the speed advantage over backprop networks is then compromised. For difficult classification tasks, RBF networks or their modified versions (see Section 6.1.2), employing sufficient training data and hidden units, can lead to better classification rates (Wettschereck and Dietterich, 1992) and smaller "false-positive" classification errors (Lee, 1991) compared to backprop networks. In the following, qualitative arguments are given for the above simulation-based observations on the performance of RBF and backprop networks.

Some of the reasons for the training speed advantage of RBF networks have been presented earlier in this section. Basically, since the receptive field representation is well localized, only a small fraction of the hidden units in an RBF network responds to any particular input vector. This allows the use of efficient self-organization (clustering) algorithms for adapting such units in a training mode that does not involve the network's output units. On the other hand, all units in a backprop network must be evaluated and their weights updated for every input vector. Another important reason for the faster training speed of RBF networks is the hybrid two-stage training scheme employed, which decouples the learning task for both hidden and output layers thus eliminating the need for the slow back error propagation.

The RBF network with self-organized receptive fields needs more data and more hidden units to achieve precision similar to that of the backprop network. When used for function approximation, the backprop network performs a global fit to the training data, whereas the RBF network performs a local fit. As a result, the backprop network generalizes more from each training example and utilizes its free parameters more efficiently, which leads to a smaller number of hidden units. Furthermore, the backprop network is a better candidate net when extrapolation is desired. This is primarily due to the ability of feedforward nets with sigmoidal hidden units to approximate a function and its derivatives (see Section 5.2.5). On the other hand, the local nature of the hidden unit receptive fields in RBF nets prevents them from being able to "see" beyond the training data. This makes the RBF net a poor extrapolator.

When used as a classifier, the RBF net can lead to low "false-positive" classification rates. This property is due to the same reason that makes RBF nets poor extrapolators: regions of the input space which are far from the training vectors are usually mapped to low values by the localized receptive fields of the hidden units. By contrast, the sigmoidal hidden units in the backprop network can have high output even in regions far away from those populated by training data. This causes the backprop network/classifier to assign high-confidence classifications to meaningless inputs. False-positive classification may be reduced in backprop networks by employing the "training with rubbish" strategy discussed at the end of Section 5.3.3. However, when dealing with high dimensional input spaces, this strategy generally requires an excessively large training set due to the large number of possible "rubbish" pattern combinations.

Which network is better to use for which tasks? The backprop network is better to use when training data is expensive (or hard to generate) and/or retrieval speed, assuming a serial machine implementation, is critical (the smaller backprop network size requires less storage and leads to faster retrievals compared to RBF networks). However, if the data is cheap and plentiful and if on-line training is required (e.g., the case of adaptive signal processing or adaptive control where data is acquired at a high rate and cannot be saved), then the RBF network is superior.

6.1.2 RBF Network Variations

In their work on RBF networks, Moody and Darken (1989a) suggested the use of normalized hidden unit activities according to

\bar{z}_j = \frac{z_j}{\sum_{i=1}^{J} z_i}        (6.1.11)

based on empirical evidence of improved approximation properties. The use of Equation (6.1.11) implies that Σ_j z̄_j = 1 for all inputs x; i.e., the unweighted sum of all normalized hidden unit activities in an RBF network results in the unity function. Here, the RBF network realizes a "partition of unity," which is a desired mathematical property in function decomposition/approximation (Werntges, 1993); the motivation being that a superposition of basis functions that can represent the unity function (f(x) = 1) "exactly" would also suppress spurious structure when fitting a non-trivial function. In other words, the normalization in Equation (6.1.11) leads to a form of "smoothness" regularization.
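In code, the normalization of Equation (6.1.11) is a one-line change to the hidden layer (the function name is ours):

    import numpy as np

    def normalized_activities(z):
        # Equation (6.1.11): normalized hidden activities; they always sum to one,
        # so the hidden layer realizes a partition of unity.
        return z / np.sum(z)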

Another justification for the normalization of hidden unit outputs may be given based on statistical arguments. If one interprets z_j in Equation (6.1.1) as the probability P_j(x_k) of observing x_k under Gaussian distribution j:

P_j(x_k) = a \exp\!\left( -\frac{\|x_k - \mu_j\|^2}{2\sigma^2} \right)        (6.1.12)

(where a is a normalization constant and σ_j = σ for all j) and also assumes that all Gaussians are selected with equal probability, then the probability of Gaussian j having generated x_k, given that we have observed x_k, is:

P(j \mid x_k) = \frac{P_j(x_k)}{\sum_{i=1}^{J} P_i(x_k)}        (6.1.13)

Therefore, the normalization in Equation (6.1.11) now has a statistical significance: it represents the conditional probability of unit j having generated x_k.

Another variation of RBF networks involves the so-called "soft" competition among Gaussian units for locating the centers μ_j (Nowlan, 1990). The clustering of the μ_j's according to the incremental k-means algorithm is equivalent to a "hard" competition, winner-take-all operation where, upon the presentation of input x_k, the RBF unit with the highest output z_j updates its mean μ_j according to Equation (6.1.5). This in effect realizes an iterative version of the "approximate" maximum likelihood estimate (Nowlan, 1990):

\mu_j = \frac{1}{N_j} \sum_{x_k \in S_j} x_k        (6.1.14)

where S_j is the set of exemplars closest to Gaussian j, and N_j is the number of vectors contained in this set. Rather than using the approximation in Equation (6.1.14), the "exact" maximum likelihood estimate for μ_j is given by (Nowlan, 1990):

\mu_j = \frac{\sum_{k=1}^{m} P(j \mid x_k)\, x_k}{\sum_{k=1}^{m} P(j \mid x_k)}        (6.1.15)

where P(j|x_k) is given by Equation (6.1.13). In this "soft" competitive model, all hidden unit centers are updated according to an iterative version of Equation (6.1.15). One drawback of this "soft" clustering method is its computational requirements, in that all μ_j's, rather than only the mean of the winner, are updated for each input. However, the high performance of RBF networks employing "soft" competition may justify this added training cost. For example, consider the classical vowel recognition task of Peterson and Barney (1952). Here, the data is obtained by spectrographic analysis and consists of the first and second formant frequencies of 10 vowels contained in words spoken by a total of 67 men, women, and children. The spoken words consisted of 10 monosyllabic words, each beginning with the letter "h" and ending with "d" and differing only in the vowel. The words used to obtain the data were heed, hid, head, had, hud, hod, heard, hood, who'd, and hawed. This vowel data is randomly split into two sets, resulting in 338 training examples and 333 test examples. A plot of the test examples is shown in Figure 6.1.3. An RBF network employing 100 Gaussian hidden units and soft competition for locating the Gaussian means is capable of 87.1 percent correct classification on the 333 example test set of the vowel data after being trained with the 338 training examples (Nowlan, 1990). This performance exceeds the 82.0%, 82.0%, and 80.2% recognition rates reported for a 100 unit k-means-trained RBF network (Moody and Darken, 1989b), a k-nearest neighbor classifier (Huang and Lippmann, 1988), and a backprop network (Huang and Lippmann, 1988), respectively (the decision boundaries shown in Figure 6.1.3 are those generated by the backprop network). A related general framework for designing optimal RBF classifiers can be found in Fakhr (1993).
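A batch sketch of the "soft" competition update, assuming equal and fixed widths σ as in Equation (6.1.12); each pass recomputes the posteriors of Equation (6.1.13) and then applies the exact maximum likelihood estimate of Equation (6.1.15) (the incremental, per-pattern version is what is actually used in Nowlan, 1990; the function name is ours):

    import numpy as np

    def soft_competition_centers(X, mu, sigma, n_iter=10):
        # X: (m, n) training vectors, mu: (J, n) initial centers, sigma: scalar width.
        for _ in range(n_iter):
            dist2 = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # (m, J)
            P = np.exp(-dist2 / (2.0 * sigma ** 2))                         # unnormalized P_j(x_k)
            P /= P.sum(axis=1, keepdims=True)                               # posteriors P(j | x_k), Eq. (6.1.13)
            mu = (P.T @ X) / P.sum(axis=0)[:, None]                         # Equation (6.1.15)
        return mu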

Figure 6.1.3. A plot of the test samples for the 10 vowel problem of Peterson and Barney (1952). The lines are class boundaries generated by a two layer feedforward net trained with backprop on training samples. (Adapted from W. Y. Huang and R. P. Lippmann, 1988, with permission of the American Institute of Physics).

We conclude this section by considering a network of "semilocal activation" hidden units (Hartman and Keeler, 1991a). This network has been found to retain training speeds comparable to RBF networks, with the advantages of requiring a smaller number of units to cover high-dimensional input spaces and producing high approximation accuracy. Semilocal activation networks are particularly advantageous when the training set contains irrelevant input dimensions.

An RBF unit responds to a localized region of the input space. Figure 6.1.4 (a) shows the response of a two input Gaussian RBF. On the other hand, a sigmoid unit responds to a semi-infinite region by partitioning the input space with a "sigmoidal" hypersurface, as shown in Figure 6.1.4 (b). RBF's have greater flexibility in discriminating finite regions of the input space, but this comes at the expense of a great increase in the number of required units. To overcome this tradeoff, "Gaussian-bar" units with the response depicted in Figure 6.1.4 (c) may be used to replace the RBF's. Analytically, the output of the jth Gaussian-bar unit is given by:

z_j = \sum_{i=1}^{n} w_{ji} \exp\!\left( -\frac{(x_i - \mu_{ji})^2}{2\sigma_{ji}^2} \right)        (6.1.16)

where i indexes the input dimension and wji is a positive parameter signifying the ith weight of the jth hidden unit. For comparison purposes, we write the Gaussian RBF as a product

z_j = \exp\!\left( -\frac{\|x - \mu_j\|^2}{2\sigma_j^2} \right) = \prod_{i=1}^{n} \exp\!\left( -\frac{(x_i - \mu_{ji})^2}{2\sigma_j^2} \right)        (6.1.17)

According to Equation (6.1.16), the Gaussian-bar unit responds if any of the n component Gaussians is activated (assuming the scaling factors w_ji are non-zero), while a Gaussian RBF requires all component Gaussians to be activated. Thus, a Gaussian-bar unit acts more like an "ORing" device and a pure Gaussian more like an "ANDing" device. Note that a Gaussian-bar network has significantly more free parameters to adjust than a Gaussian RBF network of the same size (number of units). The output units in a Gaussian-bar network can be linear or Gaussian-bar.
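The "ORing" versus "ANDing" contrast between Equations (6.1.16) and (6.1.17) is easy to see in code (a sketch; x, mu_j, sigma_j, and w_j are n-dimensional vectors for a single hidden unit, and the names are ours):

    import numpy as np

    def gaussian_bar_unit(x, mu_j, sigma_j, w_j):
        # Equation (6.1.16): a weighted sum of one-dimensional Gaussians; the unit
        # responds if any single input dimension lies near its center ("ORing").
        return np.sum(w_j * np.exp(-(x - mu_j) ** 2 / (2.0 * sigma_j ** 2)))

    def gaussian_rbf_unit(x, mu_j, sigma_j):
        # Equation (6.1.17): the product form; every dimension must be near its
        # center for the unit to respond ("ANDing").
        return np.prod(np.exp(-(x - mu_j) ** 2 / (2.0 * sigma_j ** 2)))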

Because of their semilocal receptive fields, the centers μ_j of the hidden units cannot be determined effectively using competitive learning as in RBF networks. Therefore, supervised gradient descent-based learning is normally used to update all network parameters.


Figure 6.1.4. Response characteristics for two-input (a) Gaussian, (b) sigmoid, and (c) Gaussian-bar units.

Since the above Gaussian-bar network employs parameter update equations which are non-linear in their parameters, one might suspect that the training speed of such a network is compromised. On the contrary, simulations involving difficult function prediction tasks have shown that training Gaussian-bar networks is significantly faster than training sigmoid networks, and slower than but of the same order as training RBF networks. One possible explanation for the training speed of Gaussian-bar networks could be their built-in automatic dynamic reduction of the network architecture (Hartman and Keeler, 1991b), as explained next.

A Gaussian-bar unit can effectively "prune" input dimension i by one of the following mechanisms: w_ji becoming zero, μ_ji moving away from the data, and/or σ_ji shrinking to a very small value. These mechanisms can occur completely independently for each input dimension. On the other hand, moving any one of the μ_ji's away from the data or shrinking σ_ji to zero deactivates a Gaussian unit completely. Sigmoid units may also be pruned (according to the techniques of Section 5.2.5), but such pruning is limited to synaptic weights. Therefore, Gaussian-bar networks have greater pruning flexibility than sigmoid or Gaussian RBF networks. Training time can also be reduced by monitoring pruned units and excluding them from the calculations. Since pruning may lead to very small σ_ji's which, in turn, create a spike response at μ_ji, it is desirable to move such μ_ji to a location far away from the data in order to eliminate the danger these spikes pose to generalization. Here, one may avoid this danger, reduce storage requirements, and increase retrieval speed by postprocessing trained networks to remove the pruned units of the network.
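The three pruning mechanisms can be detected with simple post-training checks; the thresholds below (w_tol, sigma_tol, and the 3σ margin) are illustrative choices of ours, not values from Hartman and Keeler (1991b):

    import numpy as np

    def pruned_dimensions(w_j, mu_j, sigma_j, X, w_tol=1e-3, sigma_tol=1e-3):
        # Flags input dimensions that a Gaussian-bar unit has effectively pruned:
        # negligible weight, a very narrow width, or a center far outside the data.
        lo, hi = X.min(axis=0), X.max(axis=0)
        off_data = (mu_j < lo - 3.0 * sigma_j) | (mu_j > hi + 3.0 * sigma_j)
        return (np.abs(w_j) < w_tol) | (sigma_j < sigma_tol) | off_data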

Many other versions of RBF networks can be found in the literature (see Moody, 1989; Jones et al., 1990; Saha and Keeler, 1990; Bishop, 1991; Kadirkamanathan et al., 1991; Mel and Omohundro, 1991; Platt, 1991; Musavi et al., 1992; Wettschereck and Dietterich, 1992; Lay and Hwang, 1993). Roy and Govil (1993) presented a method based on linear programming models which simultaneously adds RBF units and trains the RBF network in polynomial time for classification tasks. This training method is described in detail in Section 6.3.1 in connection with a hyperspherical classifier net similar to the RBF network.
