6.3 Unit-Allocating Adaptive Networks

In Chapter Four (Section 4.9) we saw that training a multilayer feedforward neural network with fixed resources requires, in the worst case, exponential time. On the other hand, worst-case polynomial time training is possible if we are willing to use unit-allocating nets, which are capable of allocating new units, as needed, as more patterns are learned (Baum, 1989).

In the context of pattern classification, the classical k-nearest neighbors classifier (Fix and Hodges, 1951; Duda and Hart, 1973) is an extreme example of a unit-allocating machine with O(1) learning complexity. The k-nearest neighbors classifier assigns to any new input the class most heavily represented among its k nearest neighbors. This classifier represents an extreme unit-allocating machine since it allocates a new unit for every learned example in a training set. There are no computations involved in adjusting the "parameters" of allocated units: Each new allocated unit stores exactly the current example (vector) presented. In other words, no transformation or abstraction of the examples in the training set is required, and one can immediately proceed to use this machine for classification regardless of the size of the training set. Therefore, the training time of this classifier does not scale with the number of examples, m, which means that the learning complexity is O(1). Surprisingly, it has been shown (Cover and Hart, 1967) that in the limit of an infinite training set, this simple classifier has a probability of classification error less than twice the minimum achievable error probability of the optimal Bayes classifier, for any integer value of k ≥ 1. Unfortunately, the performance of the nearest neighbor classifier deteriorates for training sets of small size. Also, the convergence to the above asymptotic performance can be arbitrarily slow, and the classification error rate need not even decrease monotonically with m (Duda and Hart, 1973).
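The storage-only character of nearest-neighbor learning can be made concrete with a short sketch (a minimal Python illustration; the function and variable names are ours, not from any particular library):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest stored examples.

    "Training" is just storage of the examples (O(1) per example);
    all of the work happens at classification time.
    """
    # Euclidean distances from x to every stored example
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]                   # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # most heavily represented class

# Two-class toy problem: class 0 near the origin, class 1 near (5, 5)
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [5.0, 5.0], [4.5, 5.5], [5.5, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X, y, np.array([0.6, 0.6]), k=3))   # → 0
print(knn_classify(X, y, np.array([5.2, 5.0]), k=3))   # → 1
```

Note that the classification cost grows with the number of stored examples, which is exactly the practical drawback discussed next.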

Even when utilizing a large training set, k-nearest neighbors classifiers are impractical as on-line classifiers due to the large number of computations required in classifying a new input. Thus, we are forced to use far fewer than one unit for every training sample; i.e., we must create and load an abstraction of the training data. This obviously leads to higher learning complexity than O(1).

Practical trainable networks should have a number of desirable attributes. The most significant of these attributes are fast learning speed, accurate learning, and compact representation of training data. The reason for desiring the first two attributes is obvious. On the other hand, the formation of a compact representation is important for two reasons: Good generalization (fewer free parameters lead to less overfitting) and feasibility of hardware (VLSI) realization, since silicon surface area is at a premium.

In the following, we consider three practical unit allocating networks. These networks are capable of forming compact representations of data easily and rapidly. Two of the networks considered are classifier networks. The third network is capable of classification as well as approximation of continuous functions.

6.3.1 Hyperspherical Classifiers

Pattern classification in n-dimensional space consists of partitioning the space into category (class) regions with decision boundaries and assigning an unknown point in this space to the class in whose region it falls. The typical geometrical shapes of the decision boundaries for classical pattern classifiers are hyperplanes and hypersurfaces. In this section, we discuss two unit-allocating adaptive networks/classifiers which employ hyperspherical boundary forms.

Hyperspherical classifiers were introduced by Cooper (1962, 1966), Batchelor and Wilkins (1968), and Batchelor (1969) [see Batchelor (1974) for a summary of early work]. Like the nearest-neighbor classifier, a hyperspherical classifier is based upon the storage of examples represented as points in a metric space (e.g., Euclidean space). The metric defined on this space is a measure of the distance between an unknown input pattern and a known category. Each stored point has associated with it a finite "radius" that defines the point's region of influence. The interior of the resulting hypersphere represents the decision region associated with the center point's category. This region of influence makes a hyperspherical classifier typically more conservative in terms of storage than the nearest neighbor classifier. Furthermore, the finite radii of the regions of influence can make a hyperspherical classifier abstain from classifying patterns from unknown categories (these patterns are typically represented as points in the input space which are far away from any underlying class regions). This latter feature enhances the classifier's ability to reject "rubbish."

Restricted Coulomb Energy (RCE) Classifier

The following is a description of a specific network realization of a hyperspherical classifier proposed by Reilly et al. (1982) and Reilly and Cooper (1990). This model is named the "restricted Coulomb energy" (RCE) network. The name is derived from the form of the "potential function" governing the mapping characteristics, which has been interpreted (Scofield et al., 1988) as a restricted form of a "high-dimensional Coulomb potential" between a positive test charge and negative charges placed at various sites.

The architecture of the RCE network contains two layers: A hidden layer and an output layer. The hidden layer is fully interconnected to all components of an input pattern (vector) x ∈ Rn. The output layer consists of L units. The output layer is sparsely connected to the hidden layer; each hidden unit projects its output to one and only one output unit. The architecture of the RCE net is shown in Figure 6.3.1. Each unit in the output layer corresponds to a pattern category. The network assigns an input pattern to a category l if the output cell yl is activated in response to the input. The decision of the network is "unambiguous" if one and only one output unit is active upon the presentation of an input; otherwise, the decision is "ambiguous."


Figure 6.3.1. RCE network architecture.

The transfer characteristics of the jth hidden unit are given by

zj = f[rj − D(μj, x)]    (6.3.1)

where μj ∈ Rn is a parameter vector called "center", rj ∈ R is a threshold or "radius," and D is some predefined distance metric between vectors μj and x (e.g., Euclidean distance between real-valued vectors or Hamming distance between binary-valued vectors). Here, f is the threshold activation function given by

f(u) = 1 if u > 0, and f(u) = 0 otherwise    (6.3.2)

On the other hand, the transfer function of a unit in the output layer is the logical OR function. The jth hidden unit in the RCE net is associated with a hyperspherical region of the input space which defines the unit's region of influence. The location of this region is defined by the center j and its size is determined by the radius rj. According to Equation (6.3.1), any input pattern falling within the influence region of a hidden unit will cause this unit to fire. Thus, the hidden units define a collection of hyperspheres in the space of input patterns. Some of these hyperspheres may overlap. When a pattern falls within the regions of influence of several hidden units, they will all "fire" and switch on the output units they are connected to.
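The behavior of Equations (6.3.1) and (6.3.2), together with the OR output layer, can be sketched as follows (a minimal illustration assuming the Euclidean metric; the function names are ours):

```python
import numpy as np

def rce_hidden_output(x, center, radius):
    """Eqs. (6.3.1)-(6.3.2): the unit fires iff x falls inside its
    region of influence, i.e. D(center, x) < radius."""
    return 1 if np.linalg.norm(x - center) < radius else 0

def rce_classify(x, hidden_units):
    """hidden_units: list of (center, radius, category) triples.
    Each output unit ORs together the hidden units of its category."""
    fired = {cat for (c, r, cat) in hidden_units
             if rce_hidden_output(x, c, r)}
    if len(fired) == 1:
        return fired.pop()          # unambiguous decision
    return None                     # ambiguous, or no category fires

units = [(np.array([0.0, 0.0]), 1.0, "A"),
         (np.array([3.0, 0.0]), 1.0, "B")]
print(rce_classify(np.array([0.2, 0.1]), units))   # → A
print(rce_classify(np.array([1.5, 0.0]), units))   # → None ("rubbish" rejected)
```

The second query point lies outside every region of influence, so the network abstains, illustrating the rejection behavior discussed above.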

Training the RCE net involves two mechanisms: Unit commitment and modification of hidden unit radii. Units may be committed to the hidden and output layers. When committed, units are interconnected so that they do not violate the RCE interconnectivity pattern described above.

Initially, the network starts with no units. An arbitrary sample pattern x1 is selected from the training set and one hidden unit and one output unit are allocated. The allocated hidden unit center μ1 is set equal to x1 and its radius r1 is set equal to a user-defined parameter rmax (rmax is the maximum size of the region of influence ever assigned to a hidden unit). This unit is made fully interconnected to the input pattern and projects its output z1 to the allocated output unit (OR gate). This output unit represents the category of the input x1. Next, we choose a second arbitrary example x2 and feed it to the current network. Here, one of three scenarios emerges. First, if x2 causes the output unit to fire and if x2 belongs to the category represented by this unit, then do nothing and continue training with a new input. In general, this scenario might occur at a point during training where the network has multiple hidden and output units representing various categories. In this case, if the input pattern causes only the output unit representing the correct category to fire, then do nothing and continue the training session with a new input. On the other hand, the correct output unit may fire along with one or more other output units. This indicates that the regions of influence of hidden units representing various categories overlap and that the present input pattern lies inside the overlap region. Here, proceed by reducing the threshold values (radii) of all active hidden units that are associated with categories other than the correct one until they become inactive.

The second scenario involves the case when the input x2 happens to belong to the same category as x1 but does not cause the output unit to fire. Here, allocate a new hidden unit with center at μ2 = x2 and radius rmax, and connect the output z2 of this unit to the output unit. The general version of this scenario occurs when the network has multiple output units. Now, if the input pattern causes no output units (including the one representing the category of the input) to fire, then allocate a new hidden unit centered at the current input vector/pattern and assign it a radius r = min(rmax, dmin), where dmin is the distance from this new center to the nearest center of a hidden unit representing any category different from that of the current input pattern. The newly allocated unit is connected to the output unit representing the category of the input pattern. Note that this setting of r may cause one or more output units representing the wrong category to fire. This should not be a problem since the region-of-influence shrinking mechanism described under the first scenario will ultimately rectify the situation. If, under this scenario, some hidden units representing the wrong category fire, then the radii of such units are shrunk as described earlier under the first scenario.

Finally, the third scenario represents the case of an input with a new category that is not represented by the network. Here, as in the first step of the training procedure, a hidden unit centered at this input is allocated and its radius is set as in the second scenario. Also, a new output unit representing the new category is added which receives an input from the newly allocated hidden unit. Again if existing hidden units become active under this scenario, then their radii are shrunk until they become inactive. The training phase continues (by cycling through the training set or by updating in response to a stream of examples) until no new units are allocated and the size of the regions of influence of all hidden units converges.
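The three training scenarios above can be condensed into a rough sketch (an illustrative simplification, not Reilly et al.'s exact procedure; the shrink margin eps and the epoch cap are our assumptions):

```python
import numpy as np

def rce_train(samples, r_max, epochs=5, eps=1e-3):
    """Sketch of RCE training: unit commitment plus radius shrinking.
    samples: iterable of (x, category) pairs. Returns hidden units as
    [center, radius, category] lists. eps is an assumed small margin
    used when shrinking a radius just below an offending distance."""
    units = []                                   # [center, radius, category]
    for _ in range(epochs):
        changed = False
        for x, cat in samples:
            # Shrink any wrong-category unit whose region covers x
            for u in units:
                d = np.linalg.norm(np.array(x) - u[0])
                if u[2] != cat and d < u[1]:
                    u[1] = d - eps               # unit becomes inactive for x
                    changed = True
            # Is x covered by a correct-category unit?
            covered = any(u[2] == cat and np.linalg.norm(np.array(x) - u[0]) < u[1]
                          for u in units)
            if not covered:
                # Commit a new unit; radius capped by the nearest foreign center
                d_min = min((np.linalg.norm(np.array(x) - u[0])
                             for u in units if u[2] != cat),
                            default=r_max)
                units.append([np.array(x, float), min(r_max, d_min), cat])
                changed = True
        if not changed:
            break                                # units and radii converged
    return units

pts = [((0.0, 0.0), "A"), ((0.3, 0.0), "A"), ((2.0, 0.0), "B")]
for c, r, cat in rce_train(pts, r_max=1.0):
    print(cat, r)
```

On this toy set the first class-A pattern commits one unit that also covers the second, and the class-B pattern commits a second unit, so training converges with two hidden units.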

The RCE net is capable of developing proper separating boundaries for nonlinearly separable problems. The reader is referred to Figure 6.3.2 for a schematic representation of separating boundaries realized by the regions of influence for a nonlinearly separable two-class problem in two-dimensional pattern space. The RCE net can also handle the case where a single category is contained in several disjoint regions. In principle, an arbitrary degree of accuracy in the separating boundaries can be achieved if no restriction is placed on the size of the training set. Dynamic category learning is also possible with the RCE network. That is, new classes of patterns can be introduced at arbitrary points during training without the need to retrain the network on all of its previously learned data. Note that, in its present form, the RCE network is not suitable for handling overlapping class regions. Here, the learning algorithm will tend to drive the radii of all hidden unit regions of influence to zero. It also leads to allocating a large number of hidden units, approximately equal to the number of training examples coming from the regions of overlap.













Figure 6.3.2. Schematic diagram for an RCE classifier in two-dimensional space solving a nonlinearly separable problem.

Several variations of the RCE network are possible. For example, one might employ mechanisms that allow the centers of the hyperspheres to drift to more optimal locations in the input space. A second variation would be to allow the hyperspheres to grow. These two mechanisms have been considered for more general hyperspherical classifiers than RCE classifiers (Batchelor, 1974). Modifications to the RCE network for handling overlapping class regions can be found in Reilly and Cooper (1990). Empirical examination of RCE classifiers appeared in Lee and Lippmann (1990) and Hudak (1992).

Polynomial-Time Trained Hyperspherical Classifier

In the following, we describe a classifier network with an architecture identical to the RCE network (see Figure 6.3.1), but which employs a training algorithm that is shown to construct and train the classifier network in polynomial time. This polynomial time classifier (PTC) uses clustering and linear programming models to incrementally generate the hidden layer. Our description of the PTC net is based on the work of Roy and Mukhopadhyay (1991), Mukhopadhyay et al. (1993), and Roy et al. (1993).

As in the RCE net, the PTC net uses hidden units to cover class regions. However, the region of influence of each hidden unit need not be restricted to a hypersphere. A quadratic region of influence is assumed, which defines the transfer characteristics of a hidden unit according to:

z = f[g(x) − ε]    (6.3.3)

where

g(x) = w0 + Σ(i=1..n) wi xi + Σ(i=1..n) Σ(j=i..n) wij xi xj    (6.3.4)

and w0, wi, and wij are modifiable real-valued parameters to be learned. Here, ε is greater than or equal to a small positive constant (say 0.001) and its value is computed as part of the training procedure described below. With the transfer characteristics as in Equation (6.3.3), one may view each hidden unit as a quadratic threshold gate (QTG), introduced in Chapter One. A quadratic region of influence contains the hypersphere as a special case but also allows for the realization of hyperellipsoids and hyperboloids. This enables the PTC to form more accurate boundary regions than the RCE classifier. Other regions of influence may be used in the PTC as long as they are represented by functions which are linear in the parameters to be learned: w0, wi, and wij. For example, higher-order polynomials in the inputs, whose coefficients enter linearly, are also acceptable functions for describing the regions of influence.

The learning algorithm determines the parameters of the hidden units in such a way that only sample patterns of a designated class are "covered" by a hidden unit representing this class. The algorithm also attempts to minimize the number of hidden units required to accurately solve a given classification problem.

Initially, the learning algorithm attempts to use a single hidden unit to cover a whole class region. If this fails, the sample patterns in that class are split into two or more clusters, using a clustering procedure (e.g., the k-means procedure described in Section 6.1) and then attempts are made to adapt separate hidden units to cover each of these clusters. If that fails, or if only some clusters are covered, then the uncovered clusters are further split to be separately covered until covers are provided for each ultimate cluster. The idea here is to allow a hidden unit to cover (represent) as many of the sample patterns within a given class as is feasibly possible (without including sample patterns from any other class), thereby minimizing the total number of hidden units needed to cover that class.
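The cover-or-split recursion just described can be sketched abstractly (a minimal illustration; try_cover and split stand in for the linear-programming and clustering steps and are supplied by the caller):

```python
def grow_hidden_layer(class_patterns, other_patterns, try_cover, split):
    """Cover-or-split recursion from the PTC training procedure.
    try_cover(cluster, other_patterns) returns a hidden unit, or None
    if no single unit can cover the cluster (infeasible program);
    split(cluster) breaks a cluster into smaller ones (e.g. k-means).
    Both are supplied by the caller; the names are illustrative."""
    units, pending = [], [list(class_patterns)]
    while pending:
        cluster = pending.pop()
        unit = try_cover(cluster, other_patterns)
        if unit is not None:
            units.append(unit)               # cluster covered by one unit
        elif len(cluster) > 1:
            pending.extend(split(cluster))   # split and retry each piece
        # a single uncoverable point does not arise in principle: a small
        # enough region around one pattern is always feasible
    return units

# Toy stand-ins: a "cover" succeeds only for singleton clusters,
# and a split halves the cluster.
cover1 = lambda c, others: tuple(c) if len(c) == 1 else None
halve = lambda c: [c[:len(c) // 2], c[len(c) // 2:]]
print(len(grow_hidden_layer([1, 2, 3, 4], [], cover1, halve)))   # → 4
```

With the toy stand-ins the recursion bottoms out at single-point clusters, which is exactly the worst case used in the complexity argument below.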

The parameters of each hidden unit are computed by solving a linear programming problem [for an accessible description of linear programming, the reader is referred to Chapter Five in Duda and Hart (1973)]. The linear program is used to adjust the location and boundaries of the region of influence of a hidden unit representing a given class cluster such that sample patterns from this cluster cause the net inputs to the hidden unit [the argument of f in Equation (6.3.3)] to be at least slightly positive and those from all other classes to be at least slightly negative. Linear programming is appropriate here because the regions of influence are defined by quadratics which are linear functions of their parameters. Formally put, the linear program set up for adjusting the parameters of the jth hidden unit is as follows:

Find w0, wi, wij, and εj such that

gj(xk) ≥ εj for every pattern xk in the cluster assigned to unit j,
gj(xk) ≤ −εj for every pattern xk from all other classes,    (6.3.5)
εj ≥ δ

where gj(xk) denotes the net input of unit j [the argument of f in Equation (6.3.3)] for pattern xk, and δ is a small positive constant (say, 0.001). The positive margin εj ensures a finite separation between the classes and prevents the formation of common boundaries. Unit j becomes a permanent fixed unit of the PTC net if and only if the above problem has a feasible solution. Roy et al. (1993) gave an extension to the above training method which enhances the PTC performance for data with outlier patterns. An alternative to linear programming is to use the Ho-Kashyap algorithm described in Section 3.1.4, which guarantees class separation (with finite separation between classes) if a solution exists or, otherwise, gives an indication of nonlinear separability.
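A feasibility version of this linear program can be set up with an off-the-shelf solver. The sketch below uses scipy.optimize.linprog and fixes the margin at a small constant rather than optimizing it; the two-dimensional quadratic feature ordering and the function names are our assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def quad_features(x):
    """Quadratic feature vector for n = 2: [1, x1, x2, x1^2, x2^2, x1*x2],
    so the net input is linear in the parameters w."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

def fit_hidden_unit(cluster, others, delta=0.001):
    """Feasibility linear program for one hidden unit: find w with
    w . phi(x) >= delta on the cluster and w . phi(x) <= -delta on all
    other patterns. Returns w, or None if infeasible (split the cluster)."""
    A_ub, b_ub = [], []
    for x in cluster:                 # w.phi(x) >= delta  ->  -w.phi(x) <= -delta
        A_ub.append(-quad_features(x)); b_ub.append(-delta)
    for x in others:                  # w.phi(x) <= -delta
        A_ub.append(quad_features(x)); b_ub.append(-delta)
    res = linprog(c=np.zeros(6), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * 6)   # weights are unbounded
    return res.x if res.success else None

# Class A surrounds the origin, class B lies outside: an ellipse-like
# quadratic boundary is feasible with a single hidden unit.
A = [(0.0, 0.0), (0.5, 0.2), (-0.4, 0.3)]
B = [(2.0, 0.0), (0.0, 2.0), (-2.0, -1.0)]
print(fit_hidden_unit(A, B) is not None)   # → True
```

Note the explicit bounds: linprog constrains variables to be nonnegative by default, which would wrongly restrict the weights.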

Similar to the RCE net, all hidden units in the PTC whose respective regions of influence cover clusters of the same class have their outputs connected to a unique logical OR unit (or an LTG realizing the OR function) in the output layer.

The polynomial complexity of the above training algorithm is shown next. For each class label c = 1, 2, ..., L, let mc be the number of pattern vectors (for a total of m = m1 + m2 + ... + mL patterns) to be covered. Consider the worst case scenario (from a computational point of view) where a separate cover (hidden unit) is required for each training pattern. Thus, in this case, all of the linear programs from Equation (6.3.5) for a class will be infeasible until the class is broken up into mc single point clusters. All single point clusters will produce feasible solutions, which implies that we have just designed a PTC with perfect recall on the training set (note, however, that such a design will have poor generalization!). Let us further assume the worst case scenario in which the mc pattern vectors are broken up into one extra cluster at each clustering stage. Using simple counting arguments (the stage with k clusters requires k linear programs, for k = 1, 2, ..., mc), we can readily show that for successful training a total of Σ(c=1..L) mc(mc + 1)/2 linear programs (feasible and infeasible combined) are solved and clustering is performed fewer than m times. Now, since each linear program can be solved in polynomial time (Karmarkar, 1984; Khachian, 1979) and each clustering operation to obtain a specified number of clusters can also be performed in polynomial time (Everitt, 1980; Hartigan, 1975), it follows that the above learning algorithm is of polynomial complexity.
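Under the worst case just described, class c passes through stages with 1, 2, ..., mc clusters, solving one linear program per cluster at each stage. A quick count (our illustration):

```python
def lp_count(class_sizes):
    """Worst-case number of linear programs solved: the stage with k
    clusters solves k programs, for k = 1, 2, ..., m_c, per class."""
    return sum(m * (m + 1) // 2 for m in class_sizes)

# Two classes with 3 and 2 patterns: (1+2+3) + (1+2) = 9 programs,
# which grows only quadratically in the total sample size m.
print(lp_count([3, 2]))   # → 9
```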

What remains to be seen is the generalization capability of PTC nets. As for most artificial neural nets, only empirical studies of generalization are available. One such study (Roy et al., 1993) reported comparable classification performance of a PTC net to the k-nearest neighbors classifier and backprop nets on relatively small sample classification tasks. In general, a PTC net allocates a much smaller number of hidden units compared to the RCE net when trained on the same data. However, the PTC training procedure requires simultaneous access to all training examples which makes the PTC net inapplicable for on-line implementations.

6.3.2 Cascade-Correlation Network

The Cascade-Correlation Network (CCN) proposed by Fahlman and Lebiere (1990) is yet another example of a unit-allocating architecture. The CCN was developed in an attempt to solve the so-called "moving target problem," which is blamed for much of the slowness of backprop learning. Because all of the weights in a backprop net are changing at once, each hidden unit sees a constantly changing environment. Therefore, instead of moving quickly to assume useful roles in the overall problem solution, the hidden units engage in a complex "dance" with much wasted motion (Fahlman and Lebiere, 1990).

The CCN differs from all networks considered so far in two major ways: (1) It builds a deep net of cascaded units (as opposed to a net with a wide hidden layer) and (2) it can allocate more than one type of hidden unit; for example, sigmoidal units and Gaussian units may coexist in the same network. This network is suited for classification tasks or approximation of continuous functions. The CCN has a significant learning speed advantage over backprop nets, since units are trained individually without requiring back propagation of error signals.

Figure 6.3.3. The cascade-correlation network architecture with three hidden units. The number of each hidden unit represents the order in which it has been allocated.

The CCN architecture after allocating three hidden units is illustrated in Figure 6.3.3. The number of output units is dictated by the application at hand, and by the way the designer chooses to encode the outputs. The hidden units can be sigmoid units, Gaussian units, etc. or a mixture of such units. An important requirement on a candidate hidden unit is that it has a differentiable transfer function. The original CCN uses sigmoid units as hidden units. The output units can also take various forms, but typically a sigmoid (or hyperbolic tangent) activation unit is employed for classification tasks. Linear output units are employed when the CCN is used for approximating mappings with real-valued outputs (e.g. function approximation/interpolation).

Every input (including a bias x0 = 1) is connected to every output unit by a connection with an adjustable weight. Each allocated hidden unit receives a connection from each pre-existing hidden unit. Therefore, each added hidden unit defines a new one-unit layer. This may lead to a very deep network with high fan-in for the hidden and output units.
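The cascaded connectivity can be sketched as a forward pass in which each new hidden unit receives the bias, the external inputs, and the outputs of all earlier hidden units (a minimal illustration with sigmoid hidden units and linear outputs; the shapes and names are our assumptions):

```python
import numpy as np

def ccn_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade: hidden unit j sees the bias, the
    external inputs, and the outputs of units 1..j-1, so its weight
    vector is one entry longer than the previous unit's."""
    a = np.concatenate(([1.0], x))            # bias + external inputs
    for w in hidden_weights:                  # w has exactly len(a) entries
        z = 1.0 / (1.0 + np.exp(-w @ a))      # sigmoid hidden unit
        a = np.append(a, z)                   # cascades into later units
    return output_weights @ a                 # linear output units

x = np.array([0.5, -1.0])
hw = [np.ones(3), np.ones(4)]                 # two cascaded hidden units
ow = np.ones((1, 5))                          # one linear output unit
print(ccn_forward(x, hw, ow).shape)           # → (1,)
```

The growing weight-vector lengths (3, then 4, then 5 in the example) are exactly the high fan-in property discussed at the end of this section.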

The learning algorithm consists of two phases: Output unit training and hidden unit training. Initially, the network has no hidden units. In the first phase, output unit weights are trained (e.g., using the delta rule) to minimize the usual sum of squared error (SSE) measure at the output layer. At the completion of training, if the SSE remains above a predefined threshold, the residual errors (the differences between the actual and desired outputs) are recorded for each output unit l on each training pattern xk, k = 1, 2, ..., m. Also, a new hidden unit is inserted whose weights are determined by the hidden unit training phase described below. Note that if the first training phase converges (the SSE is very small) with no hidden units, then training is stopped (this convergence would be an indication that the training set is linearly separable, assuming we are dealing with a classification task).

In the hidden unit training phase, a pool of randomly initialized candidate hidden units (typically four to eight units) is trained in parallel. Later, one trained unit from this pool, the one which best optimizes some performance measure, is selected for permanent placement in the net. This multiple candidate training strategy minimizes the chance that a useless unit will be permanently allocated because an individual candidate has gotten stuck at a poor set of weights during training. Each candidate hidden unit receives trainable input connections from all of the network's external inputs and from all pre-existing hidden units, if any. During this phase, the outputs of these candidate units are not yet connected to any output units in the network. Next, the weights of each candidate unit are adjusted, independently of other candidates in the pool, in the direction of maximizing a performance measure E. Here, E is chosen as the sum, over all output units, of the magnitude of the covariance between the candidate unit's output zk and the residual output error el(k) observed at output unit l. Formally, the criterion E is defined as:

E = Σ(l=1..L) | Σ(k=1..m) (zk − z̄)(el(k) − ēl) |    (6.3.6)

where z̄ and ēl are average values taken over all patterns xk. The maximization of E by each candidate unit is done by incrementing the weight vector w for this unit by an amount proportional to the gradient ∂E/∂w; i.e., performing steepest gradient ascent on E. Note that z̄ is recomputed every time w is incremented, by averaging the unit's outputs over all training patterns. During this training phase, the weights of any pre-existing hidden units are frozen. In fact, once allocated, a hidden unit never changes its weights. After the training reaches an asymptote, the candidate unit which achieves the highest covariance score E is added to the network by fanning out its output to all output layer units through adjustable weights. The motivation here is that, by maximizing covariance, a unit becomes tuned to the features in the input pattern which have not been captured by the existing net. This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors. At this point, the network repeats the output layer training phase using the delta learning rule, and the residual output errors are recomputed. The two training phases continue to alternate until the output SSE is sufficiently small, which is almost always possible.
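The covariance criterion of Equation (6.3.6) is straightforward to compute over a batch of patterns (a minimal sketch; the array shapes and names are our assumptions):

```python
import numpy as np

def correlation_score(z, E_resid):
    """Eq. (6.3.6): sum over output units l of the magnitude of the
    covariance between the candidate output and the residual error.
    z: candidate unit outputs over the m patterns, shape (m,).
    E_resid: residual output errors, shape (m, L)."""
    dz = z - z.mean()                         # z_k - z_bar
    dE = E_resid - E_resid.mean(axis=0)       # e_l(k) - e_bar_l, per unit
    return np.abs(dz @ dE).sum()              # |covariance|, summed over l

rng = np.random.default_rng(0)
e = rng.normal(size=(100, 2))                 # residual errors, 2 output units
s_corr = correlation_score(e[:, 0], e)        # candidate tracking unit 0's error
s_rand = correlation_score(rng.normal(size=100), e)   # unrelated candidate
print(s_corr > s_rand)                        # → True
```

A candidate whose output covaries with the remaining error scores high and is the one worth keeping, which is the selection rule described above.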

The CCN has been empirically shown to be capable of learning some hard classification-type tasks 10 to 100 times faster than backprop. Fahlman and Lebiere (1990) have empirically estimated the learning time in epochs to be roughly J log(J), where J is the number of hidden units ultimately needed to solve the given task. Unfortunately, though, a precise value of J is almost impossible to determine in advance. In addition, the CCN is capable of learning difficult tasks on which backprop nets have been found to get stuck in local minima. One of these difficult tasks is the n-bit parity problem. Another is the two spiral problem shown in Figure 6.3.4 (a).

The two spiral problem was proposed by Alex Wieland as a benchmark which is extremely hard for a standard backprop net to solve. The task requires a network with two real-valued inputs and a single output to learn a mapping that distinguishes between points on two intertwined spirals. The associated training set consists of 194 point coordinates (x1, x2), half of which come from each spiral. The output should be +1 for the first spiral and −1 for the other spiral. Figure 6.3.4 (b) shows a solution to this problem generated by a trained CCN (Fahlman and Lebiere, 1990). This task typically requires 12 to 19 sigmoidal hidden units and an average of 1,700 training cycles when a pool of 8 candidate hidden units is used during training. For comparison purposes, we show in Figure 6.3.4 (c) a solution generated by a backprop network employing three hidden layers of five units each and with "shortcut" connections between layers (Lang and Witbrock, 1989). Here, each unit receives incoming connections from every unit in every earlier layer, not just from the immediately preceding layer.

The backprop net requires about 20,000 training cycles, and about 8,000 cycles if Fahlman's version of backprop (see Section 5.2.3) is used. Therefore, the CCN outperforms standard backprop in training cycles by a factor of 10, while building a network of about the same complexity. In terms of actual computation on a serial machine, however, the CCN shows a 50-fold speedup over standard backprop on the two spiral task. This is due to the lower number of computations constituting a single CCN training cycle compared to that of standard backprop. Note also that the solution generated by the CCN is qualitatively better than that generated by backprop (the reader might have already noticed a difference in the spiral directions between Figures 6.3.4 (b) and (c); this is simply because the simulation in (c) used training spirals which have the opposite direction to those shown in Figure 6.3.4 (a)).









Figure 6.3.4. (a) Training samples for the two spiral problem. (b) Solution found by a cascade-correlation network. (c) Solution found by a 3-hidden layer backprop network which employs "short-cut" connections. [(a) and (b) are from S. E. Fahlman and C. Lebiere, 1990, and (c) is from K. J. Lang and M. J. Witbrock, 1989, with permission of Morgan Kaufmann.]

On the other hand, when used for function approximation, backprop outperforms the CCN. Simulations with the Mackey-Glass time series prediction task show poor generalization performance for the CCN (Crowder, 1991); in this case, the CCN suffers from overfitting. Another undesirable feature of the CCN is its inefficient hardware implementation: The deep layered architecture leads to a delay in response proportional to the number of layers. Also, the high fan-in of the hidden units imposes additional implementation constraints for VLSI technology; high device fan-in leads to increased device capacitance and thus slower devices. Finally, it should be noted that the CCN requires that all training patterns be available for computing the averages z̄ and ēl after each training cycle, which makes the CCN inappropriate for on-line implementations.
