In Chapter Four (Section 4.9) we saw that training
a multilayer feedforward neural network with fixed resources requires,
in the worst case, exponential time. On the other hand, polynomial
time training is possible, in the worst case, if we are willing
to use unit-allocating nets which are capable of allocating new
units, as needed, as more patterns are learned (Baum, 1989).

In the context of pattern classification, the classical
*k*-nearest neighbors classifier (Fix and Hodges, 1951; Duda
and Hart, 1973) is an extreme example of a unit-allocating machine
with O(1) learning complexity. The *k*-nearest neighbors
classifier assigns to any new input the class most heavily represented
among its *k*-nearest neighbors. This classifier represents
an extreme unit-allocating machine since it allocates a new unit
for every learned example in a training set. There are no computations
involved in adjusting the "parameters" of allocated
units: Each new allocated unit stores exactly the current example
(vector) presented. In other words, no transformation or abstraction
of the examples in the training set is required and one can immediately
proceed to use this machine for classification regardless of the
size of the training set. Therefore, the training time of this
classifier does not scale with the number of examples, *m*,
which means that the complexity is O(1). Surprisingly, it has
been shown (Cover and Hart, 1967) that in the limit of an infinite
training set, this simple classifier has a probability of classification
error less than twice the minimum achievable error probability
of the optimal Bayes classifier for any integer value of *k* ≥ 1.
Unfortunately, the performance of the nearest neighbor classifier
deteriorates for training sets of small size. Also, the convergence
to the above asymptotic performance can be arbitrarily slow, and
the classification error rate need not even decrease monotonically
with *m* (Duda and Hart, 1973).
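The *k*-nearest neighbors rule described above can be sketched in a few lines. The following is a minimal Python illustration; the Euclidean metric and majority voting via `Counter` are concrete choices of ours, not prescribed by the classifier's definition:

```python
from collections import Counter
import math

def knn_classify(x, training_set, k):
    """Assign x the class most heavily represented among its k nearest
    neighbors. training_set is a list of (vector, label) pairs; note that
    "training" is O(1): each example is simply stored as-is."""
    neighbors = sorted(training_set, key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Storing the training set is the entire "learning" step.
data = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
        ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_classify((0.2, 0.1), data, k=3))  # nearest 3: two A's, one B -> "A"
```

Classification, by contrast, requires a distance computation against every stored example, which is exactly the on-line impracticality noted below.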

Even when utilizing a large training set, *k*-nearest
neighbors classifiers are impractical as on-line classifiers due
to the large number of computations required in classifying a
new input. Thus, we are forced to use far fewer than one unit
for every training sample; i.e., we must create and load an abstraction
of the training data. This obviously leads to higher learning
complexity than O(1).

Practical trainable networks should have a number
of desirable attributes. The most significant of these attributes
are fast learning speed, accurate learning, and compact representation
of training data. The reason for desiring the first two attributes
is obvious. On the other hand, the formation of a compact representation
is important for two reasons: Good generalization (fewer free parameters
lead to less overfitting) and feasibility of hardware (VLSI)
realization, since silicon surface area is at a premium.

In the following, we consider three practical unit
allocating networks. These networks are capable of forming compact
representations of data easily and rapidly. Two of the networks
considered are classifier networks. The third network is capable
of classification as well as approximation of continuous functions.

**6.3.1 Hyperspherical Classifiers**

Hyperspherical classifiers were introduced by Cooper
(1962, 1966), Batchelor and Wilkins (1968) and Batchelor (1969)
[See Batchelor (1974) for a summary of early work]. Like the
nearest-neighbor classifier, a hyperspherical classifier is based
upon the storage of examples represented as points in a metric
space (e.g., Euclidean space). The metric defined on this space
is a measure of the distance between an unknown input pattern
and a known category. Each stored point has associated with it
a finite "radius" that defines the point's region of
influence. The interior of the resulting hypersphere represents
the decision region associated with the center point's category.
This region of influence makes a hyperspherical classifier typically
more conservative in terms of storage than the nearest neighbor
classifier. Furthermore, the finite radii of the regions of influence
can make a hyperspherical classifier abstain from classifying
patterns from unknown categories (these patterns are typically
represented as points in the input space which are far away from
any underlying class regions). This latter feature enhances the
classifier's ability to reject "rubbish."

Restricted Coulomb Energy (RCE) Classifier

The following is a description of a specific network
realization of a hyperspherical classifier proposed by Reilly
et al. (1982) and Reilly and Cooper (1990). This model is named
the "restricted Coulomb energy" (RCE) network. The
name is derived from the form of the "potential function"
governing the mapping characteristics, which has been interpreted
(Scofield et al., 1988) as a restricted form of a "high-dimensional
Coulomb potential" between a positive test charge and negative
charges placed at various sites.

The architecture of the RCE network contains two
layers: A hidden layer and an output layer. The hidden layer
is fully interconnected to all components of an input pattern
(vector) **x** ∈ R*n*.
The output layer consists of *L* units. The output layer
is sparsely connected to the hidden layer; each hidden unit projects
its output to one and only one output unit. The architecture
of the RCE net is shown in Figure 6.3.1. Each unit in the output
layer corresponds to a pattern category. The network assigns
an input pattern to a category *l* if the output cell *y**l*
is activated in response to the input. The decision of the network
is "unambiguous" if one and only one output unit is
active upon the presentation of an input, otherwise, the decision
is "ambiguous."

The transfer characteristic of the *j*th hidden
unit is given by

*z**j* = *f* [*r**j* − *D*(**μ***j*, **x**)]   (6.3.1)

where **μ***j* ∈ R*n*
is a parameter vector called the "center," *r**j* ∈ R
is a threshold or "radius," and *D* is some predefined
distance metric between the vectors **μ***j*
and **x** (e.g., Euclidean distance between real-valued vectors
or Hamming distance between binary-valued vectors). Here, *f*
is the threshold activation function given by

*f*(*u*) = 1 if *u* > 0; *f*(*u*) = 0 otherwise   (6.3.2)

On the other hand, the transfer function of a unit
in the output layer is the logical OR function. The *j*th
hidden unit in the RCE net is associated with a hyperspherical
region of the input space which defines the unit's region of influence.
The location of this region is defined by the center **μ***j*
and its size is determined by the radius *r**j*.
According to Equation (6.3.1), any input pattern falling within
the influence region of a hidden unit will cause this unit to
fire. Thus, the hidden units define a collection of hyperspheres
in the space of input patterns. Some of these hyperspheres may
overlap. When a pattern falls within the regions of influence
of several hidden units, they will all "fire" and switch
on the output units they are connected to.
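Equations (6.3.1) and (6.3.2) and the OR output layer translate directly into code. In the sketch below, the Euclidean metric plays the role of *D*, and representing each hidden unit as a (center, radius, class) tuple is our own packing:

```python
import math

def hidden_unit_output(x, center, radius):
    """Equation (6.3.1) with threshold f of Equation (6.3.2):
    the unit fires iff x lies strictly inside its hypersphere."""
    return 1 if radius - math.dist(center, x) > 0 else 0

def rce_classify(x, hidden_units):
    """Each output unit is a logical OR over the hidden units of its
    category; the decision is returned only when it is unambiguous."""
    active = {cls for center, radius, cls in hidden_units
              if hidden_unit_output(x, center, radius)}
    if len(active) == 1:
        return active.pop()   # exactly one output unit fired
    return None               # ambiguous, or no region of influence covers x

units = [((0.0, 0.0), 1.0, "A"), ((3.0, 0.0), 1.0, "B")]
print(rce_classify((0.2, 0.1), units))  # -> "A"
print(rce_classify((1.8, 0.0), units))  # -> None (rejected as "rubbish")
```

Returning `None` for inputs outside every region of influence is the rejection behavior discussed earlier for hyperspherical classifiers.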

Training the RCE net involves two mechanisms: Unit
commitment and modification of hidden unit radii. Units may be
committed to the hidden and output layers. When committed, units
are interconnected so that they do not violate the RCE interconnectivity
pattern described above.

Initially, the network starts with no units. An
arbitrary sample pattern **x**1
is selected from the training set and one hidden unit and one
output unit are allocated. The allocated hidden unit center **μ**1
is set equal to **x**1
and its radius *r*1
is set equal to a user-defined parameter *r*max
(*r*max is the maximum
size of the region of influence ever assigned to a hidden unit).
This unit is made fully interconnected to the input pattern and
projects its output *z*1
to the allocated output unit (OR gate). This output unit represents
the category of the input **x**1.
Next, we choose a second arbitrary example **x**2
and feed it to the current network. Here, one of three scenarios
emerges. First, if **x**2
causes the output unit to fire and if **x**2
belongs to the category represented by this unit, then do nothing
and continue training with a new input. In general, this scenario
might occur at a point during training where the network has multiple
hidden and output units representing various categories. In this
case, if the input pattern causes only the output unit representing
the correct category to fire, then do nothing and continue the
training session with a new input. On the other hand, the correct
output unit may fire along with one or more other output units.
This indicates that the regions of influence of hidden units
representing various categories overlap and that the present input
pattern lies inside the overlap region. Here, proceed by reducing
the threshold values (radii) of all active hidden units that are
associated with categories other than the correct one until they
become inactive.

The second scenario involves the case when the input
**x**2 happens to belong
to the same category as **x**1
but does not cause the output unit to fire. Here, allocate a
new hidden unit with center **μ**2 = **x**2
and radius *r*max
and connect the output *z*2
of this unit to the output unit. The general version of this
scenario occurs when the network has multiple output units. Now,
if the input pattern causes no output units (including the one
representing the category of the input) to fire, then allocate
a new hidden unit centered at the current input vector/pattern
and assign it a radius *r* = min(*r*max,
*d*min), where *d*min
is the distance from this new center to the nearest center of
a hidden unit representing any category different from that of
the current input pattern. The new allocated unit is connected
to the output unit representing the category of the input pattern.
Note that this setting of *r* may cause one or more output
units representing the wrong category to fire. This should not
be a problem since the shrinking of the region of influence mechanism
described under the first scenario will ultimately rectify the
situation. If, under this scenario, some hidden units representing
the wrong category fire, then the radii of such units are shrunk
as described earlier under the first scenario.

Finally, the third scenario represents the case
of an input with a new category that is not represented by the
network. Here, as in the first step of the training procedure,
a hidden unit centered at this input is allocated and its radius
is set as in the second scenario. Also, a new output unit representing
the new category is added which receives an input from the newly
allocated hidden unit. Again if existing hidden units become
active under this scenario, then their radii are shrunk until
they become inactive. The training phase continues (by cycling
through the training set or by updating in response to a stream
of examples) until no new units are allocated and the size of
the regions of influence of all hidden units converges.

The RCE net is capable of developing proper separating
boundaries for nonlinearly separable problems. The reader is
referred to Figure 6.3.2 for a schematic representation of separating
boundaries realized by the regions of influence for a nonlinearly
separable two-class problem in two-dimensional pattern space.
The RCE net can also handle the case where a single category
is contained in several disjoint regions. In principle, an arbitrary
degree of accuracy in the separating boundaries can be achieved
if no restriction is placed on the size of the training set.
Dynamic category learning is also possible with the RCE network.
That is, new classes of patterns can be introduced at arbitrary
points in training without always involving the need to retrain
the network on all of its previously trained data. Note that,
in its present form, the RCE network is not suitable for handling
overlapping class regions. Here, the learning algorithm will
tend to drive the radii of all hidden unit regions of influence
to zero. It also leads to allocating a large number of hidden
units approximately equal to the number of training examples coming
from the regions of overlap.

Figure 6.3.2. Schematic diagram for an RCE classifier
in two-dimensional space solving a nonlinearly separable problem.

Several variations of the RCE network are possible.
For example, one might employ mechanisms that allow the centers
of the hyperspheres to drift to more optimal locations in the
input space. A second variation would be to allow the hyperspheres
to grow. These two mechanisms have been considered for more general
hyperspherical classifiers than RCE classifiers (Batchelor, 1974).
Modifications to the RCE network for handling overlapping class
regions can be found in Reilly and Cooper (1990). Empirical examination
of RCE classifiers appeared in Lee and Lippmann (1990) and Hudak
(1992).

Polynomial-Time Trained Hyperspherical Classifier

In the following, we describe a classifier network
with an architecture identical to the RCE network (see Figure
6.3.1), but which employs a training algorithm that is shown to
construct and train the classifier network in polynomial time.
This polynomial time classifier (PTC) uses clustering and linear
programming models to incrementally generate the hidden layer.
Our description of the PTC net is based on the work of Roy and
Mukhopadhyay (1991), Mukhopadhyay et al. (1993), and Roy et al.
(1993).

As in the RCE net, the PTC net uses hidden units
to cover class regions. However, the region of influence of each
hidden unit need not be restricted to the hypersphere. A quadratic
region of influence is assumed which defines the transfer characteristics
of a hidden unit according to:

*z* = *f* [*g*(**x**)]   (6.3.3)

where

*g*(**x**) = *w*0 + Σ*i* *w**i* *x**i* + Σ*i*,*j* *w**ij* *x**i* *x**j*   (6.3.4)

(the sums running over *i* = 1, ..., *n* and over pairs *i* ≤ *j*)
and *w*0,
*w**i*, and
*w**ij* are
modifiable real-valued parameters to be learned; *f* outputs 1
when its argument is at least ε, and 0 otherwise. Here, ε is greater
than or equal to a small positive constant (say 0.001) and its
value is computed as part of the training procedure described
below. With the transfer characteristics as in Equation (6.3.3),
one may view each hidden unit as a quadratic threshold gate (QTG),
introduced in Chapter One. A quadratic region of influence contains
the hypersphere as a special case but allows for the realization
of hyperellipsoids and hyperboloids. This enables the PTC to
form more accurate boundary regions than the RCE classifier.
Other regions of influence may be used in the PTC as long as they
are represented by functions which are linear in the parameters
to be learned: *w*0,
*w**i*, and
*w**ij*. For
example, a higher-degree polynomial in the components of **x**,
which remains linear in its learnable coefficients, is also an
acceptable function for describing the regions of influence.
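The structural point that makes training tractable is that the quadratic discriminant of Equation (6.3.4) is linear in its adjustable parameters. This is easy to verify numerically; the sketch below packs the terms (1, *x**i*, *x**i**x**j*) into a feature vector of our own ordering so that *g*(**x**) becomes a dot product with the weight vector:

```python
from itertools import combinations_with_replacement

def quadratic_features(x):
    """phi(x) = (1, x_1..x_n, x_i*x_j for i <= j). Then g(x) = w . phi(x)
    is linear in the weights w, which is what allows the hidden-unit
    parameters to be found by linear programming."""
    phi = [1.0] + list(x)
    phi += [x[i] * x[j]
            for i, j in combinations_with_replacement(range(len(x)), 2)]
    return phi

def g(w, x):
    """The discriminant of Equation (6.3.4), as a dot product."""
    return sum(wi * pi for wi, pi in zip(w, quadratic_features(x)))

# Linearity in the parameters: g(a*w1 + b*w2, x) == a*g(w1, x) + b*g(w2, x).
```

The same packing works for any region of influence built from fixed basis functions of **x** with learnable coefficients.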

The learning algorithm determines the parameters
of the hidden units in such a way that only sample patterns of
a designated class are "covered" by a hidden unit representing
this class. The algorithm also attempts to minimize the number
of hidden units required to accurately solve a given classification
problem.

Initially, the learning algorithm attempts to use
a single hidden unit to cover a whole class region. If this fails,
the sample patterns in that class are split into two or more clusters,
using a clustering procedure (e.g., the *k*-means procedure
described in Section 6.1) and then attempts are made to adapt
separate hidden units to cover each of these clusters. If that
fails, or if only some clusters are covered, then the uncovered
clusters are further split to be separately covered until covers
are provided for each ultimate cluster. The idea here is to allow
a hidden unit to cover (represent) as many of the sample patterns
within a given class as is feasibly possible (without including
sample patterns from any other class), thereby minimizing the
total number of hidden units needed to cover that class.

The parameters of each hidden unit are computed
by solving a linear programming problem [for an accessible description
of linear programming, the reader is referred to Chapter Five
in Duda and Hart (1973)]. The linear program is used to adjust
the location and boundaries of the region of influence of a hidden
unit representing a given class cluster such that sample patterns
from this cluster cause the net inputs to the hidden unit [the
argument of *f* in Equation (6.3.3)] to be at least slightly
positive and those from all other classes to be at least slightly
negative. Linear programming is appropriate here because the
regions of influence are defined by quadratics which are linear
functions of their parameters. Formally put, the linear program
set up for adjusting the parameters of the *j*th hidden unit
is as follows: find *w*0, *w**i*, *w**ij*, and the margin ε*j* satisfying

*g*(**x***k*) ≥ ε*j* for every pattern **x***k* in the cluster covered by unit *j*
*g*(**x***k*) ≤ −ε*j* for every pattern **x***k* of all other classes   (6.3.5)
ε*j* ≥ 0.001

The positive margin ε*j*
ensures a finite separation between the classes and prevents the
formation of common boundaries. Unit *j* becomes a permanent
fixed unit of the PTC net if and only if the solution to the above
problem is feasible. Roy et al. (1993) gave an extension to the
above training method which enhances the PTC performance for data
with outlier patterns. An alternative to linear programming is
to use the Ho-Kashyap algorithm described in Section 3.1.4 which
guarantees class separation (with finite separation between classes)
if a solution exists or, otherwise, gives an indication of nonlinear
separability.
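The hidden-unit linear program can be assembled mechanically from the cluster and non-cluster patterns. The sketch below builds only the constraint system (in the standard A·v ≤ b inequality form, with our own packing of the unknowns as v = (w, ε)); feeding A and b to any LP solver, e.g. `scipy.optimize.linprog`, then tests feasibility of the unit:

```python
def ptc_lp(cluster_feats, other_feats, eps_min=0.001):
    """Assemble A v <= b for the hidden-unit linear program described in
    the text. cluster_feats / other_feats are lists of feature vectors
    phi(x), e.g. the terms (1, x_i, x_i*x_j) of Equation (6.3.4)."""
    p = len(cluster_feats[0])
    A, b = [], []
    for phi in cluster_feats:        # w.phi >= eps  ->  -w.phi + eps <= 0
        A.append([-c for c in phi] + [1.0])
        b.append(0.0)
    for phi in other_feats:          # w.phi <= -eps ->   w.phi + eps <= 0
        A.append(list(phi) + [1.0])
        b.append(0.0)
    A.append([0.0] * p + [-1.0])     # eps >= eps_min
    b.append(-eps_min)
    return A, b
```

If the solver reports infeasibility, the cluster must be split further, exactly as in the clustering loop described above.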

Similar to the RCE net, all hidden units in the
PTC whose respective regions of influence cover clusters of the
same class have their outputs connected to a unique logical OR
unit (or an LTG realizing the OR function) in the output layer.

The polynomial complexity of the above training
algorithm is shown next. For each class label *c* = 1, 2, ..., *L*,
let *m**c* be
the number of pattern vectors (for a total of *m* = *m*1 + *m*2 + ... + *m**L*
patterns) to be covered. Consider the worst case scenario (from
a computational point of view) where a separate cover (hidden
unit) is required for each training pattern. Thus, in this case,
all of the linear programs from Equation (6.3.5) for a class will
be infeasible until the class is broken up into *m**c*
single point clusters. All single point clusters will produce
feasible solutions which implies that we have just designed a
PTC with perfect recall on the training set (note, however, that
such a design will have poor generalization!). Let us further
assume the worst case scenario in which the *m**c*
pattern vectors are broken up into one extra cluster at each clustering
stage. Using simple counting arguments, we can readily show that
for successful training a total of *m**c*(*m**c* + 1)/2 linear
programs per class (feasible and infeasible combined) are solved and clustering
is performed on the order of *m* times. Now, since each linear program can
be solved in polynomial time (Karmarkar, 1984; Khachian, 1979)
and each clustering operation to obtain a specified number of
clusters can also be performed in polynomial time (Everitt, 1980;
Hartigan, 1975), it follows that the above learning algorithm
is of polynomial complexity.
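The counting argument can be checked directly. Under the worst-case splitting described above, class *c* passes through stages of 1, 2, ..., *m**c* clusters; if the stage with *k* clusters requires solving *k* linear programs, the per-class total is *m**c*(*m**c* + 1)/2, i.e. quadratic rather than exponential in the class size:

```python
def worst_case_lp_count(class_sizes):
    """Worst case of the text: class c is split into 1, 2, ..., m_c
    clusters, the k-cluster stage solving k linear programs, so class c
    contributes m_c * (m_c + 1) / 2 programs in total."""
    return sum(m * (m + 1) // 2 for m in class_sizes)

print(worst_case_lp_count([3, 4]))  # 6 + 10 = 16 linear programs
```

Since each individual linear program and each clustering operation also runs in polynomial time, the product remains polynomial in *m*.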

What remains to be seen is the generalization capability
of PTC nets. As for most artificial neural nets, only empirical
studies of generalization are available. One such study (Roy
et al., 1993) reported comparable classification performance of
a PTC net to the *k*-nearest neighbors classifier and backprop
nets on relatively small sample classification tasks. In general,
a PTC net allocates a much smaller number of hidden units compared
to the RCE net when trained on the same data. However, the PTC
training procedure requires simultaneous access to all training
examples which makes the PTC net inapplicable for on-line implementations.

**6.3.2 Cascade-Correlation Network**

The Cascade-Correlation Network (CCN) proposed by
Fahlman and Lebiere (1990) is yet another example of a unit-allocating
architecture. The CCN was developed in an attempt to solve the
so-called "moving target problem," which is attributed
to the slowness of backprop learning. Because all of the weights
in a backprop net are changing at once, each hidden unit sees
a constantly changing environment. Therefore, instead of moving
quickly to assume useful roles in the overall problem solution,
the hidden units engage in a complex "dance" with much
wasted motion (Fahlman and Lebiere, 1990).

The CCN differs from all networks considered so
far in two major ways: (1) It builds a deep net of cascaded units
(as opposed to a net with a wide hidden layer) and (2) it can
allocate more than one type of hidden unit; for example, sigmoidal
units and Gaussian units may coexist in the same network. This
network is suited for classification tasks or approximation of
continuous functions. The CCN has a significant learning speed
advantage over backprop nets, since units are trained individually
without requiring back propagation of error signals.

Figure 6.3.3. The cascade-correlation network architecture
with three hidden units. The number of each hidden unit represents
the order in which it has been allocated.

The CCN architecture after allocating three hidden
units is illustrated in Figure 6.3.3. The number of output units
is dictated by the application at hand, and by the way the designer
chooses to encode the outputs. The hidden units can be sigmoid
units, Gaussian units, etc. or a mixture of such units. An important
requirement on a candidate hidden unit is that it has a differentiable
transfer function. The original CCN uses sigmoid units as hidden
units. The output units can also take various forms, but typically
a sigmoid (or hyperbolic tangent) activation unit is employed
for classification tasks. Linear output units are employed when
the CCN is used for approximating mappings with real-valued outputs
(e.g. function approximation/interpolation).

Every input (including a bias *x*0 = 1)
is connected to every output unit by a connection with an adjustable
weight. Each allocated hidden unit receives a connection from
each pre-existing hidden unit. Therefore, each added hidden unit
defines a new one-unit layer. This may lead to a very deep network
with high fan-in for the hidden and output units.
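The cascaded wiring just described is easiest to see as a forward pass. The sketch below uses tanh hidden units and linear outputs (one of the permitted unit choices, not the only one); the weight-vector layout, bias first, then external inputs, then earlier hidden outputs in allocation order, is our own convention:

```python
import math

def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade-correlation net (a sketch).
    Hidden unit j sees the bias, every external input, and the outputs
    of all previously allocated hidden units -- each added unit thus
    forms its own one-unit layer."""
    activations = [1.0] + list(x)            # bias x_0 = 1 plus inputs
    for w in hidden_weights:                 # in allocation order
        net = sum(wi * ai for wi, ai in zip(w, activations))
        activations.append(math.tanh(net))   # differentiable hidden unit
    return [sum(wi * ai for wi, ai in zip(w, activations))
            for w in output_weights]         # linear output units
```

Note that the *j*th hidden weight vector is longer than the (*j*−1)th by one entry, which is exactly the growing fan-in the text warns about.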

The learning algorithm consists of two phases: Output
unit training and hidden unit training. Initially, the network
has no hidden units. In the first phase, output unit weights
are trained (e.g., using the delta rule) to minimize the usual
sum of squared error (SSE) measure at the output layer. At the
completion of training, if the SSE remains above a predefined
threshold, the residual errors *e**lk* (the difference
between the actual and desired outputs)
are recorded for each output unit *l* on each training pattern
**x***k*, *k* = 1, 2, ..., *m*.
Also, a new hidden unit is inserted whose weights are determined
by the hidden unit training phase described below. Note that
if the first training phase converges (SSE error is very small)
with no hidden units, then training is stopped (this convergence
would be an indication that the training set is linearly separable,
assuming we are dealing with a classification task).

In the hidden unit training phase, a pool of randomly
initialized candidate hidden units (typically four to eight units)
is trained in parallel. Later, one trained unit from this pool,
the one which best optimizes some performance measure, is selected
for permanent placement in the net. This multiple candidate training
strategy minimizes the chance that a useless unit will be permanently
allocated because an individual candidate has gotten stuck at
a poor set of weights during training. Each candidate hidden
unit receives trainable input connections from all of the network's
external inputs and from all pre-existing hidden units, if any.
During this phase, the outputs of these candidate units are not
yet connected to any output units in the network. Next, the weights
of each candidate unit are adjusted, independently of other candidates
in the pool, in the direction of maximizing a performance measure
*E*. Here, *E* is chosen as the sum, over all output
units, of the magnitude of the covariance between the candidate
unit's output *z**k*
and the residual output error *e**lk* observed
at unit *l*. Formally, the criterion *E* is defined
as:

*E* = Σ*l* | Σ*k* (*z**k* − z̄)(*e**lk* − ē*l*) |   (6.3.6)

where z̄ and ē*l*
are average values taken over all patterns **x***k*.
The maximization of *E* by each candidate unit is done by
incrementing the weight vector **w** for this unit by an amount
proportional to the gradient ∂*E*/∂**w**; i.e., performing
steepest gradient ascent on *E*. Note that z̄
is recomputed every time **w** is incremented by averaging
the unit's outputs due to all training patterns. During this
training phase, the weights of any pre-existing hidden units are
frozen. In fact once allocated, a hidden unit never changes its
weights. After the training reaches an asymptote, the candidate
unit which achieves the highest covariance score *E* is added
to the network by fanning out its output to all output layer units
through adjustable weights. The motivation here is that, by maximizing
covariance, a unit becomes tuned to the features in the input
pattern which have not been captured by the existing net. This
unit then becomes a permanent feature-detector in the network,
available for producing outputs or for creating other, more complex
feature detectors. At this point, the network repeats the output
layer training phase using the delta learning rule, and the residual
output errors *e**lk* are recomputed. The two
training phases continue to alternate until the output SSE is
sufficiently small, which is almost always possible.
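The covariance criterion of Equation (6.3.6) is a straightforward computation over the whole training set, which also makes plain why all patterns must be available at once. A minimal sketch, with `z` the candidate's outputs over the *m* patterns and `e[l][k]` the residual error of output unit *l* on pattern *k*:

```python
def candidate_score(z, e):
    """E = sum over output units l of
    | sum over patterns k of (z_k - z_bar)(e_lk - e_bar_l) |,
    i.e. Equation (6.3.6): the summed magnitude of the covariance
    between the candidate's output and each residual output error."""
    m = len(z)
    z_bar = sum(z) / m
    E = 0.0
    for e_l in e:
        e_bar = sum(e_l) / m
        E += abs(sum((zk - ek_zbar) * (ek - e_bar)
                     for zk, ek, ek_zbar in zip(z, e_l, [z_bar] * m)))
    return E
```

A candidate whose output co-varies strongly (in either direction, hence the magnitude) with the residual errors scores highest and is the one permanently installed.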

The CCN has been empirically shown to be capable
of learning some hard classification-type tasks 10 to 100 times
faster than backprop. Fahlman and Lebiere (1990) have empirically
estimated the learning time in epochs to be roughly *J* log(*J*),
where *J* is the number of hidden units ultimately needed
to solve the given task. Unfortunately, though, a precise value
of *J* is almost impossible to determine. In addition, the
CCN is capable of learning difficult tasks on which backprop nets
have been found to get stuck in local minima. One of these difficult
tasks is the *n*-bit parity problem. Another is the two
spiral problem shown in Figure 6.3.4 (a).

The two spiral problem was proposed by Alex Wieland
as a benchmark which is extremely hard for a standard backprop
net to solve. The task requires a network with two real-valued
inputs and a single output to learn a mapping that distinguishes
between points on two intertwined spirals. The associated training
set consists of 194 point coordinates (*x*1, *x*2),
half of which come from each spiral. The output should be +1
for the first spiral and -1
for the other spiral. Figure 6.3.4 (b) shows a solution to this
problem generated by a trained CCN (Fahlman and Lebiere, 1990).
This task requires, typically, 12 to 19 sigmoidal hidden units
and an average of 1,700 training cycles when a pool of 8 candidate
hidden units is used during training. For comparison purposes,
we show in Figure 6.3.4 (c) a solution generated by a backprop
network employing three hidden layers of five units each and with
"shortcut" connections between layers (Lang and Witbrock,
1989). Here, each unit receives incoming connections from every
unit in every earlier layer, not just from the immediately preceding
layer.
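For readers who wish to reproduce the benchmark, the chapter does not give the generating equations, so the following uses the parameterization commonly attributed to the benchmark (97 points per spiral, the second spiral being the point reflection of the first); treat the constants as an assumption rather than as part of the original text:

```python
import math

def two_spirals():
    """Generate the 194-point two-spiral training set: 97 points per
    spiral, labels +1 and -1, the second spiral obtained by reflecting
    the first through the origin."""
    data = []
    for i in range(97):
        angle = i * math.pi / 16.0
        radius = 6.5 * (104 - i) / 104.0
        x, y = radius * math.sin(angle), radius * math.cos(angle)
        data.append(((x, y), +1))     # first spiral
        data.append(((-x, -y), -1))   # second spiral
    return data

print(len(two_spirals()))  # 194
```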

The backprop net requires about 20,000 training
cycles and about 8,000 cycles if Fahlman's version of backprop
(see Section 5.2.3) is used. Therefore, the CCN outperforms standard
backprop in training cycles by a factor of 10, while building
a network of about the same complexity. In terms of actual computation
on a serial machine, however, the CCN shows 50-fold speedup over
standard backprop on the two spiral task. This is due to the
lower number of computations constituting a single CCN training
cycle compared to that of standard backprop. Note also that the
solution generated by the CCN is qualitatively better than that
generated by backprop (the reader might have already noticed a
difference in the spiral directions between Figures 6.3.4 (b)
and (c); this is simply because the simulation in (c) used training
spirals which have the opposite direction to those shown in Figure
6.3.4 (a)).


Figure 6.3.4. (a) Training samples for the two spiral
problem. (b) Solution found by a cascade-correlation network.
(c) Solution found by a 3-hidden layer backprop network which
employs "short-cut" connections. [(a) and (b) are from
S. E. Fahlman and C. Lebiere, 1990, and (c) is from K. J. Lang
and M. J. Witbrock, 1989, with permission of Morgan Kaufmann.]

On the other hand, when used for function approximation,
backprop outperforms the CCN. Simulations with the Mackey-Glass
time series prediction task show poor generalization performance
for the CCN (Crowder, 1991); in this case, the CCN suffers from
overfitting. Another undesirable feature of the CCN is its inefficient
hardware implementation: The deep layered architecture leads to
a delay in response proportional to the number of layers. Also,
the high fan-in of the hidden units imposes additional implementation
constraints for VLSI technology; high device fan-in leads to increased
device capacitance and thus slower devices. Finally, it should
be noted that the CCN requires that all training patterns be available
for computing the averages z̄ and ē*l*
after each training cycle, which makes the CCN inappropriate for
on-line implementations.