Another neural network model which utilizes hidden
units with localized receptive fields and which allows for efficient
supervised training is the cerebellar model articulation controller
(CMAC). This network was developed by Albus (1971) as a model
of the cerebellum and was later applied to the control of robot
manipulators (Albus, 1975; 1979; 1981). A similar model for the
cerebellum was also independently developed by Marr (1969).

There exist many variants of and extensions to Albus's
CMAC. In this section, the CMAC version reported by Miller et
al. (1990c) is described. The CMAC consists of two mappings (processing
stages). The first is a nonlinear transformation which maps the
network input **x** ∈ R*n* into a higher dimensional vector **z**.
The vector **z** is a sparse vector in which at most *c* of
its components are non-zero (*c* is called a generalization
parameter and is user specified). The second mapping generates
the CMAC output **y** ∈ R*L* through a linear matrix-vector
product **Wz**, where **W** is an *L* × *J* matrix of modifiable
real-valued weights.

The CMAC has a built-in capability of local generalization:
Two similar (in terms of Euclidean distance) inputs **x** and
**x**' are mapped by the first mapping stage to similar binary
vectors **z** and **z**', respectively, while dissimilar inputs
map into dissimilar vectors.
In addition to this local generalization feature, the first mapping
transforms the *n*-dimensional input vector **x** into
a *J*-dimensional binary vector **z**, with *J* >> *n*.

This mapping is realized as a cascade of three layers:
A layer of input sensor units that feeds its binary outputs to
a layer of logical AND units which, in turn, is sparsely interconnected
to a layer of logical OR units. The output of the OR layer is
the vector **z**. Figure 6.2.1 shows a schematic diagram of
the CMAC with the first processing stage (mapping) shown inside
the dashed rectangle. The specific interconnection patterns between
adjacent layers of the CMAC are considered next.

In addition to supplying the generalization parameter
*c*, the user must also specify a discretization of the
input space. Each component of the input vector **x** is fed
to a series of sensor units with overlapping receptive fields.
Each sensor unit produces a '1' if the input falls within its
receptive field and a '0' otherwise. The width of the receptive
field of each sensor controls input generalization, while the
offset between adjacent fields controls input quantization. The
ratio of receptive field width to receptive field offset defines
the generalization parameter *c*.
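
To make the roles of field width and offset concrete, the following minimal sketch (in Python; the function name, the clipping behavior, and the choice of an offset of one quantization level are our own illustrative assumptions) shows the one-dimensional case, where an input excites the *c* overlapping sensor units whose fields contain it:

```python
import numpy as np

def active_sensors(x, c, num_levels):
    """Return the indices of the overlapping sensor units whose receptive
    fields contain the scalar input x (already quantized to an integer level
    in [0, num_levels)).  Each field spans c levels and adjacent fields are
    offset by one level, so an interior input excites exactly c sensors."""
    q = int(np.clip(x, 0, num_levels - 1))
    # Sensor i covers levels [i, i + c); it fires when i <= q < i + c.
    return [i for i in range(q - c + 1, q + 1) if i >= 0]

# Example: generalization parameter c = 4, input quantized into 100 levels
print(active_sensors(37, c=4, num_levels=100))   # -> [34, 35, 36, 37]
```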

The binary outputs of the sensor units are fed into
the layer of logical AND units. Each AND unit receives an input
from a group of *n* sensors, one for each distinct input variable;
thus the unit's input receptive field is the interior of a hypercube
in the input hyperspace (the interior
of a square in the two-dimensional input space of Figure 6.2.1).
The AND units are divided into *c* subsets. The receptive
fields of the sensor units connected to each of the subsets are
organized so as to span the input space without overlap. Each
input vector excites one AND unit from each subset, for a total
of *c* excited units for any input. There exist many ways
of organizing the receptive fields of the individual subsets which
produce the above excitation pattern (e.g., Miller et al., 1990c;
Parks and Militzer, 1991; Lane et al., 1992). Miller et al. employ
an organization scheme similar to Albus's original scheme where
each of the subsets of AND units is identical in its receptive
field organization, but each subset is offset relative to the
others along hyperdiagonals in the input hyperspace. Here, adjacent
subsets are offset by the quantization level of each input.
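
The diagonal-offset organization can be illustrated with a small sketch (our own indexing convention, not necessarily the exact scheme of Miller et al.): each of the *c* subsets tiles the quantized input space with hypercubes of side *c*, and subset *s* is shifted by *s* quantization levels along every axis, so each input selects exactly one cell (one AND unit) per subset.

```python
import numpy as np

def active_and_units(q, c):
    """For a vector q of quantized input levels (integers), return the single
    AND unit excited in each of the c subsets.  Subset s tiles the input space
    with hypercubes of side c shifted by s levels along every axis (the
    hyperdiagonal offset), so exactly c AND units fire for any input."""
    q = np.asarray(q, dtype=int)
    cells = []
    for s in range(c):
        cell = tuple((q + s) // c)      # coordinates of the excited cell in subset s
        cells.append((s, cell))         # (subset index, cell coordinates)
    return cells

# A 2-D input quantized to levels (37, 12) with c = 4 excites 4 AND units:
for unit in active_and_units([37, 12], c=4):
    print(unit)
```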

The number of AND units (also called state-space detectors)
resulting from the above organization can be very large
for many practical problems. For example, a system with 10 inputs,
each quantized into 100 different levels, would have 100^10
= 10^20 vectors (points)
in its input space, and would require a correspondingly large
number of AND units! However, most practical problems do not
involve the whole input space; most of the possible input vectors
would never be encountered. Therefore, one can significantly
reduce the size of the adaptive output layer and hence reduce
storage requirements and training time by transforming the binary
output vector generated by the AND layer into a lower dimensional
vector **z** (however, the dimension of **z** is still much
larger than *n*). This is accomplished in the CMAC by randomly
connecting the AND unit outputs to a smaller set of OR units,
as shown in Figure 6.2.1. Since exactly *c* AND units are
excited by any input, at most *c* OR units will be excited
by any input. This leads to a highly sparse vector **z**.
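
Continuing the sketch above (which assumes `numpy` and `active_and_units` are already in scope), the random AND-to-OR interconnection can be imitated by hashing each excited AND unit's identity onto one of a much smaller number of OR units; the hashing scheme and the name `num_or_units` are illustrative assumptions, not the specific wiring used in any published CMAC.

```python
def cmac_feature_vector(q, c, num_or_units):
    """Map the quantized input q to the sparse binary vector z by hashing each
    excited AND unit (subset index plus cell coordinates) onto one of
    num_or_units OR units.  At most c components of z end up equal to 1."""
    z = np.zeros(num_or_units, dtype=np.uint8)
    for s, cell in active_and_units(q, c):     # from the previous sketch
        j = hash((s, cell)) % num_or_units     # fixed pseudo-random connection
        z[j] = 1                               # OR unit j is excited
    return z

z = cmac_feature_vector([37, 12], c=4, num_or_units=512)
print(int(z.sum()))                            # at most c = 4 non-zero components
```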

The final output of the CMAC is generated by multiplying
the vector **z** by the weight matrix of the output layer.
The *l*th row of this weight matrix (corresponding to the
*l*th output unit) is adaptively and independently adjusted
(e.g., using the LMS learning rule) in order to approximate a given
function *f**l*(**x**)
implemented by the *l*th output unit. We may also use the
CMAC as a classifier by adding nonlinear activations such as sigmoids
or threshold activation functions to the output units and employing
the delta rule or the perceptron rule (or adaptive Ho-Kashyap
rule), respectively. The high degree of sparsity of the vector
**z** typically leads to fast learning. Additional details
on the learning behavior of the CMAC can be found in Wong and
Sideris (1992).
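
For concreteness, a single LMS-style update of the output weights might look like the sketch below (an illustration under our own conventions, spreading the correction evenly over the active OR units, rather than the exact update of Miller et al.). Because **z** is sparse and binary, only the handful of columns of **W** attached to excited OR units are ever read or modified, which is the source of the fast learning noted above.

```python
import numpy as np

def lms_update(W, z, target, eta=0.5):
    """One LMS-style step for the linear output layer y = W z.  Since z is
    binary and sparse, only the (at most c) columns of W belonging to the
    excited OR units are read and adjusted."""
    active = np.flatnonzero(z)                  # indices of excited OR units
    y = W[:, active].sum(axis=1)                # current output (active entries of z are 1)
    err = target - y
    W[:, active] += eta * err[:, None] / max(len(active), 1)
    return y, err

W = np.zeros((1, 512))                          # one output unit, 512 OR units
z = cmac_feature_vector([37, 12], c=4, num_or_units=512)   # from the sketch above
for _ in range(10):
    lms_update(W, z, target=np.array([0.7]))
print((W @ z).item())                           # close to 0.7 for this stored input
```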

Because of its intrinsic local generalization property,
the CMAC can most successfully approximate functions that are
slowly varying. The CMAC will fail to approximate functions that
oscillate rapidly or which are highly nonlinear (Cotter and Guillerm,
1992; Brown et al., 1993). Thus, the CMAC does not have universal
approximation capabilities like those of multilayer feedforward
nets of sigmoid units or RBF nets.

One appealing feature of the CMAC is its efficient
realization in software in terms of training time and real-time
operation. The CMAC also has practical hardware realizations
using logic cell arrays in VLSI technology (Miller et al., 1990b).
Examples of applications of CMAC include real-time robotics (Miller
et al., 1990d), pattern recognition (Glanz and Miller, 1987),
and signal processing (Glanz and Miller, 1989).

**6.2.1 CMAC Relation to Rosenblatt's Perceptron
and Other Models**

One of the earliest adaptive artificial neural network
models is Rosenblatt's perceptron (Rosenblatt, 1961). In its
most basic form, this model consists of a hidden layer with a large
number of units, each computing a random Boolean function, connected
to an output layer of one or more LTGs, as illustrated in Figure
6.2.2. (Historically speaking, the term perceptron was originally
coined for the architecture of Figure 6.2.2 or its variants.
However, in the current literature, the term perceptron is usually
used to refer to the unit in Figure 3.1.1 or 3.1.4).

In Rosenblatt's perceptron, each hidden unit is
restricted to receive a small number of inputs relative to the
total number of inputs in a given input pattern. Here, the input
pattern is typically assumed to be a two-dimensional binary image
formed on a "retina". As shown in Figure 6.2.2, each
hidden unit "sees" only a small piece of the binary
input image. The idea here is that some of the many random hidden
units might get "lucky" and detect "critical features"
of the input image, thus allowing the output LTG(s) to perfectly
classify the input image, after being trained with the perceptron
rule. Ultimately, the hidden random Boolean units are intended
to map a given nonlinearly separable training set of patterns
onto vectors **z** ∈ {0, 1}*J*
of a high-dimensional feature space such that the training set
becomes linearly separable. Note that these ideas are similar
to those discussed in relation to polynomial threshold gates (PTGs)
in Chapter One (refer to Figure 1.3.4). The basic difference
here is that a binary input PTG employs AND gates as its hidden
units, as opposed to the potentially more powerful Boolean units
employed in Rosenblatt's perceptron. Another important difference
is that a PTG allows one of the hidden AND units to "see"
(cover) the whole input pattern, with the other AND units covering
substantial portions of the input.

Rosenblatt's perceptron also has common features
with the CMAC model discussed above. Both models restrict the
amount of input information seen by each hidden unit, employ a
nonlinear Boolean transformation via the hidden units, and allow
for adaptive computation of the output layer weights. If we take
a second look at the CMAC architecture in Figure 6.2.1, we see
that the first mapping (represented by the circuitry inside the
dashed rectangle) is realized by a Boolean AND-OR network. Here,
the *j*th OR unit and the set of AND units feeding into it
can be thought of as generating a random Boolean function and
may be compared to a random Boolean unit in Figure 6.2.2. The
Boolean functions *z**j*
in the CMAC acquire their random nature from the random interconnectivity
pattern assumed between the two layers of AND and OR units. However,
due to the sparsity of connections between these two layers,
the class of Boolean functions realized by the *z**j*'s
has neither the richness nor the diversity of the uniformly random
Boolean functions in Rosenblatt's perceptron. These two models also have a minor
difference in that the CMAC normally uses linear output units
with LMS training, while the perceptron uses LTGs trained with
the perceptron rule. This difference is due to the different
nature of intended applications for each model: The CMAC is primarily
used as a continuous, smooth function approximator whereas Rosenblatt's
perceptron was originally intended as a pattern classifier. One
may also note the more structured receptive field organization
in the CMAC compared to the perceptron. Later in this section,
we will see that Rosenblatt's perceptron does not have the intrinsic
local generalization feature of the CMAC.

The non-universality of the CMAC is also shared
by the perceptron of Rosenblatt. This limitation is due to the
localized nature of the hidden unit receptive fields defined on
the input image. For example, it has been shown (Minsky and Papert,
1969) that this particular perceptron model cannot determine whether
or not all the parts of its input image (geometric figure) are
connected to one another, nor can it determine whether or not
the number of 'on' pixels in a finite input image is odd. The
latter task is equivalent to the parity problem and can only be
solved by Rosenblatt's perceptron if at least one hidden unit
is allowed to have its receptive field span the entire input image.

The limitations of Rosenblatt's model can be relaxed
by allowing every hidden unit to see all inputs. However, this
becomes impractical when the dimension *n* of the input is
large since there would be 2^(2^*n*) possible Boolean
functions for each hidden random Boolean unit to choose from.
Therefore, there is very little chance for the hidden units to
randomly become a detector of "critical features," unless
we start with an exponentially large number of hidden units.
But this requirement renders the model impractical.

Yet another weakness of Rosenblatt's perceptron
is that it is not "robustness-preserving." In other
words, it does not allow for good local generalization. To see
this, consider two similar unipolar binary input patterns (vectors)
**x**' and **x**". Here, we use as a similarity measure
the normalized Hamming distance *D***x**
between the two input patterns,

$$D_{\mathbf{x}}(\mathbf{x}', \mathbf{x}'') = \frac{1}{n}\sum_{i=1}^{n}\left|x_i' - x_i''\right| \qquad (6.2.1)$$

If **x**' is similar to **x**", then *D*(**x**', **x**")
is much smaller than 1. Now, because of the uniform random nature
of the hidden Boolean units, the output of any hidden unit *z**j*
is one (or zero) with a probability of 0.5 regardless of the input.
Thus, the activation patterns **z** at the output of the hidden
layer are completely uncorrelated. In particular, the normalized
Hamming distance *D***z**
between the two **z** vectors corresponding to any two input
vectors is approximately equal to 1/2. Therefore,
this model is not robustness-preserving.
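
The following sketch illustrates this numerically. It is an illustration only: a hash function stands in for uniformly random Boolean functions of the full input, which is exactly the uniform-randomness assumption used in the argument above; the unit count and input size are arbitrary choices.

```python
import hashlib
import numpy as np

def random_boolean_units(x, J):
    """Evaluate J pseudo-random Boolean functions of the full binary input x.
    A hash of (unit index, input bits) stands in for an (unrealizably large)
    random truth table, so any two distinct inputs receive independent
    coin-flip outputs from every hidden unit."""
    key = x.astype(np.uint8).tobytes()
    z = np.empty(J, dtype=np.uint8)
    for j in range(J):
        digest = hashlib.blake2b(j.to_bytes(4, "little") + key, digest_size=1).digest()
        z[j] = digest[0] & 1
    return z

rng = np.random.default_rng(0)
n, J = 64, 4000
x1 = rng.integers(0, 2, size=n, dtype=np.uint8)
x2 = x1.copy()
x2[0] ^= 1                          # flip a single bit: D_x = 1/64
Dz = np.mean(random_boolean_units(x1, J) != random_boolean_units(x2, J))
print(round(float(Dz), 3))          # close to 0.5 although the inputs are nearly identical
```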

Gallant and Smith (1987) and independently Hassoun
(1988) have proposed a practical classifier model inspired by
Rosenblatt's perceptron, but which solves some of the problems
associated with Rosenblatt's model. This model, the Gallant-Smith-Hassoun
(GSH) model, is also similar to a version of an early model studied
by Gamba (1961) and referred to as the Gamba Perceptron (Minsky
and Papert, 1969). The main distinguishing features of the GSH
model are that every hidden unit sees the whole input pattern
**x** and that the hidden units are random LTGs. For the trainable
output units, the GSH model uses Ho-Kashyap learning in Hassoun's
version and the pocket algorithm [a modified version of the perceptron
learning rule that converges for nonlinearly separable problems.
For details, see Gallant (1993)] in the Gallant and Smith model.
The hidden LTGs assume fixed integer weights and bias (threshold)
generated randomly in some range [-*a*,
+*a*].
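
A minimal sketch of the GSH hidden layer is given below (our own naming and firing convention; the trainable output unit and its pocket or Ho-Kashyap training are omitted). Every hidden LTG sees the entire input and uses fixed random integer weights and a threshold drawn from [-*a*, +*a*].

```python
import numpy as np

def gsh_hidden_layer(n, J, a=5, seed=0):
    """Hidden layer of the GSH model: J linear threshold gates with fixed
    random integer weights and thresholds drawn uniformly from [-a, +a],
    each gate receiving the entire n-bit input pattern."""
    rng = np.random.default_rng(seed)
    W = rng.integers(-a, a + 1, size=(J, n))       # fixed random integer weights
    theta = rng.integers(-a, a + 1, size=J)        # fixed random integer thresholds
    def forward(x):                                # x is a 0/1 vector of length n
        return (W @ x > theta).astype(np.uint8)    # assumed firing convention: u > theta
    return forward

hidden = gsh_hidden_layer(n=16, J=200)
x = np.random.default_rng(1).integers(0, 2, size=16)
z = hidden(x)                                      # distributed binary feature vector
print(int(z.sum()), "of", z.size, "hidden LTGs are active")
```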

The use of random LTGs as opposed to random Boolean
units as hidden units has the advantage of a "rich"
distributed representation of the critical features in the hidden
activation vector **z**. This distributed representation coupled
with the ability of hidden LTGs to see the full input pattern
makes the GSH model a universal approximator of binary mappings.
However, an important question here is how many hidden random
LTGs are required to realize any arbitrary *n*-input binary
mapping of *m* points/vectors? This question can be easily
answered for the case where the hidden LTGs are allowed to have
arbitrary (not random) parameters (weights and thresholds). In
this case, and recalling Problem 2.1.2 and Theorem 1.4.1, we find
that *m* hidden LTGs are sufficient to realize any binary
function of *m* points in {0, 1}*n*.
Now, if we assume hidden LTGs with random parameters, we might
intuitively expect that the required number of hidden units in
the GSH model for approximating any binary function of *m*
points will be much greater than *m*. Empirical results
show that the above intuitive answer is not correct! Simulations
with the 7-bit parity function, random functions, and other completely
as well as partially specified Boolean functions reveal that the
required number of random LTGs is on the order of *m*
(Gallant and Smith, 1987). Note that for the worst
case scenario of a complex completely specified *n*-input
Boolean function where *m* = 2^*n*,
the number of hidden LTGs in the GSH model still scales exponentially
in *n*.
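
A small-scale experiment in this spirit is sketched below (a 4-bit parity problem rather than the 7-bit one, with our own parameter choices; perceptron convergence within a fixed number of epochs is used as a practical stand-in for a linear-separability test). It lets one probe how many random LTG features are typically needed before the parity labels become linearly separable in the hidden representation.

```python
import numpy as np
from itertools import product

def parity_separable_with_random_ltgs(n_bits=4, J=32, a=5, seed=0, max_epochs=500):
    """Empirically test whether J random-integer-weight LTG features make the
    n-bit parity problem linearly separable, by running the perceptron rule on
    the hidden representations until convergence (or an epoch cap)."""
    rng = np.random.default_rng(seed)
    X = np.array(list(product([0, 1], repeat=n_bits)))     # all 2^n input points
    t = X.sum(axis=1) % 2                                   # parity targets
    Wh = rng.integers(-a, a + 1, size=(J, n_bits))          # fixed hidden LTG weights
    th = rng.integers(-a, a + 1, size=J)                    # fixed hidden thresholds
    Z = (X @ Wh.T > th).astype(int)                         # hidden representations
    Z = np.hstack([Z, np.ones((len(X), 1), dtype=int)])     # bias input for the output LTG
    w = np.zeros(J + 1)
    for _ in range(max_epochs):
        errors = 0
        for z, target in zip(Z, t):
            y = int(w @ z > 0)
            if y != target:                                 # perceptron rule update
                w += (target - y) * z
                errors += 1
        if errors == 0:
            return True                                     # converged: separable
    return False

for J in (8, 16, 24, 32, 48):                               # m = 16 points for 4-bit parity
    print(J, parity_separable_with_random_ltgs(J=J))
```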

The above result on the size of the hidden layer
in the GSH model may be partially explained in terms of the mapping
properties of the random LTG layer. Consider two *n*-dimensional
binary input vectors **x**' and **x**" which are mapped
by the random LTG layer of the GSH model into the *J*-dimensional
binary vectors **z**' and **z**", respectively. Let
us assume *n* is large. Also, assume that the weights and
thresholds of the LTGs are generated according to the normal distributions
$N(0, \sigma_w^2)$ and $N(0, \sigma_\theta^2)$, respectively.
Using a result of Amari (1974, 1990), we may relate the normalized
Hamming distances *D***z**
and *D***x** according
to

$$D_{\mathbf{z}} = \frac{1}{\pi}\cos^{-1}\!\left(1 - A\,D_{\mathbf{x}}\right) \qquad (6.2.2)$$

where

$$A = \frac{n\,\sigma_w^2}{n\,\sigma_w^2 + 2\,\sigma_\theta^2} \qquad (6.2.3)$$

The parameter *A* is close to unity if the weights
and thresholds are identically distributed or as long as $\sigma_\theta^2 \ll n\,\sigma_w^2$.
Figure 6.2.3 shows a plot of Equation (6.2.2) for *A* = 1.

Figure 6.2.3. Normalized Hamming distance between
two hidden layer activation vectors versus the distance between
the two corresponding input vectors.
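
The qualitative shape of this curve can be reproduced with a short Monte Carlo sketch that does not rely on the analytic expression (the parameter values, the use of zero-mean unit-variance normals, and the estimate over a single weight sample are our own assumptions):

```python
import numpy as np

def estimate_dz_curve(n=200, J=4000, sigma_w=1.0, sigma_theta=1.0, seed=0):
    """Estimate the hidden-layer distance D_z produced by a layer of J LTGs
    with N(0, sigma_w^2) weights and N(0, sigma_theta^2) thresholds, for
    input pairs at several normalized Hamming distances D_x."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma_w, size=(J, n))
    theta = rng.normal(0.0, sigma_theta, size=J)
    x1 = rng.integers(0, 2, size=n)
    for flips in (0, 2, 10, 50, 100, 200):
        x2 = x1.copy()
        idx = rng.choice(n, size=flips, replace=False)
        x2[idx] ^= 1                                   # flip `flips` randomly chosen bits
        z1 = W @ x1 > theta
        z2 = W @ x2 > theta
        print(f"D_x = {flips / n:.2f}   D_z = {np.mean(z1 != z2):.3f}")

estimate_dz_curve()
```

Small input distances already produce a noticeably larger hidden-layer distance, while maximally different inputs give a D_z near 1/2.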

For small values of *D***x**,
*D***z** is small
so that the output activity of the hidden layer is similar for
similar inputs. Therefore, the random LTG layer exhibits robustness-preserving
features. Also, since the slope of the *D***z** versus *D***x**
curve approaches infinity as *D***x**
approaches zero, the differences among similar vectors in a small
neighborhood of the input space are amplified in the corresponding
**z** vectors. Such a property is useful for recognizing differences
among very similar input vectors as is often desired when dealing
with Boolean mappings (e.g., parity) or highly nonlinear functions.
Equation 6.2.2 also implies that very different inputs map into
very different outputs when *A* is close to unity. This richness
of distributed representation of input features at the hidden
layer's output helps increase the probability that the output unit(s)
will find good approximations to arbitrary mappings. In addition
to the above desirable features, the GSH model is very appealing
from a hardware realization point of view. Here, the restricted
dynamic range integer parameters of the hidden LTGs allow for
their simple realization in VLSI technology. We also note that
the range [-*a*, +*a*]
of the hidden LTGs' integer parameters can be made as small as
[-1, +1];
however, we pay for this desirable reduced dynamic range by having
to increase the number of hidden LTGs. This phenomenon is intuitively
justifiable and has been verified empirically in Hassoun (1988).