6.2 Cerebellar Model Articulation Controller (CMAC)

Another neural network model which utilizes hidden units with localized receptive fields and which allows for efficient supervised training is the cerebellar model articulation controller (CMAC). This network was developed by Albus (1971) as a model of the cerebellum and was later applied to the control of robot manipulators (Albus, 1975; 1979; 1981). A similar model for the cerebellum was also independently developed by Marr (1969).

There exist many variants of and extensions to Albus's CMAC. In this section, the CMAC version reported by Miller et al. (1990c) is described. The CMAC consists of two mappings (processing stages). The first is a nonlinear transformation which maps the network input x ∈ R^n into a higher dimensional binary vector z ∈ {0, 1}^J. The vector z is a sparse vector in which at most c of its components are non-zero (c is called a generalization parameter and is user specified). The second mapping generates the CMAC output y ∈ R^L through a linear matrix-vector product Wz, where W is an L × J matrix of modifiable real-valued weights.

The CMAC has a built-in capability of local generalization: Two similar (in terms of Euclidean distance) inputs x and x′ are mapped by the first mapping stage to similar binary vectors z and z′, respectively, while dissimilar inputs map into dissimilar vectors. In addition to this local generalization feature, the first mapping transforms the n-dimensional input vector x into a J-dimensional binary vector z, with J >> n.

This mapping is realized as a cascade of three layers: A layer of input sensor units that feeds its binary outputs to a layer of logical AND units which, in turn, is sparsely interconnected to a layer of logical OR units. The output of the OR layer is the vector z. Figure 6.2.1 shows a schematic diagram of the CMAC with the first processing stage (mapping) shown inside the dashed rectangle. The specific interconnection patterns between adjacent layers of the CMAC are considered next.


Figure 6.2.1. Schematic illustration of a CMAC for a two-dimensional input.

In addition to supplying the generalization parameter c, the user must also specify a discretization of the input space. Each component of the input vector x is fed to a series of sensor units with overlapping receptive fields. Each sensor unit produces a '1' if the input falls within its receptive field and a '0' otherwise. The width of the receptive field of each sensor controls input generalization, while the offset between adjacent fields controls input quantization. The ratio of receptive field width to receptive field offset defines the generalization parameter c.

The binary outputs of the sensor units are fed into the layer of logical AND units. Each AND unit receives an input from a group of n sensors, each sensor corresponding to one distinct input variable, and thus the unit's input receptive field is the interior of a hypercube in the input hyperspace (the interior of a square in the two-dimensional input space of Figure 6.2.1). The AND units are divided into c subsets. The receptive fields of the sensor units connected to each of the subsets are organized so as to span the input space without overlap. Each input vector excites one AND unit from each subset, for a total of c excited units for any input. There exist many ways of organizing the receptive fields of the individual subsets which produce the above excitation pattern (e.g., Miller et al., 1990c; Parks and Militzer, 1991; Lane et al., 1992). Miller et al. employ an organization scheme similar to Albus's original scheme, in which each of the subsets of AND units is identical in its receptive field organization, but each subset is offset relative to the others along hyperdiagonals in the input hyperspace. Here, adjacent subsets are offset by the quantization level of each input.
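As a concrete illustration of this organization, the following Python sketch computes the c excited AND units (tiles) for a given input, using c identical tilings offset along the input-space hyperdiagonal by one quantization step each. The function and parameter names (active_tiles, q for the quantization step) are illustrative and not part of the original formulation; this is a minimal sketch assuming the same quantization step for every input dimension.

    import numpy as np

    def active_tiles(x, c, q):
        """Return the c active AND-unit (tile) coordinates for input x.

        x : sequence of real input values (one entry per input dimension)
        c : generalization parameter (number of overlapping tilings/subsets)
        q : quantization step (receptive-field offset), assumed equal for all inputs
        Each of the c tilings partitions the input space into hypercubes of
        side c*q; tiling s is shifted by s*q along every input dimension,
        i.e., along the hyperdiagonal.
        """
        x = np.asarray(x, dtype=float)
        tiles = []
        for s in range(c):                        # exactly one tile per subset
            coords = np.floor((x + s * q) / (c * q)).astype(int)
            tiles.append((s, tuple(coords)))      # (subset index, hypercube coordinates)
        return tiles

    # Example: a two-dimensional input, c = 4 subsets, quantization step q = 1
    print(active_tiles([3.2, 7.9], c=4, q=1.0))

Nearby inputs share most of their tiles, while inputs separated by more than the receptive field width c*q share none, which is the local generalization property described above.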

The number of AND units (also called state-space detectors) resulting from the above organization can be very large for many practical problems. For example, a system with 10 inputs, each quantized into 100 different levels, would have 100^10 = 10^20 vectors (points) in its input space, and would require a correspondingly large number of AND units! However, most practical problems do not involve the whole input space; most of the possible input vectors would never be encountered. Therefore, one can significantly reduce the size of the adaptive output layer, and hence reduce storage requirements and training time, by transforming the binary output vector generated by the AND layer into a lower dimensional vector z (however, the dimension of z is still much larger than n). This is accomplished in the CMAC by randomly connecting the AND unit outputs to a smaller set of OR units, as shown in Figure 6.2.1. Since exactly c AND units are excited by any input, at most c OR units will be excited by any input. This leads to a highly sparse vector z.
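In software, this random AND-to-OR wiring is often emulated by hashing. The sketch below (reusing active_tiles from the previous sketch, with illustrative names) hashes each excited tile into a table of J OR units to produce the sparse binary vector z.

    import numpy as np

    def cmac_feature_vector(x, c, q, J, seed=0):
        """Map input x to a sparse binary vector z of length J (the OR layer).

        Each of the c excited AND units (tiles) is hashed to one of J positions;
        hash collisions play the role of the random AND-to-OR connections, so
        at most c entries of z are one.
        """
        z = np.zeros(J)
        for s, coords in active_tiles(x, c, q):   # from the previous sketch
            j = hash((seed, s, coords)) % J       # pseudo-random OR-unit index
            z[j] = 1.0
        return z

    z = cmac_feature_vector([3.2, 7.9], c=4, q=1.0, J=512)
    print(int(z.sum()), "active OR units out of", z.size)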

The final output of the CMAC is generated by multiplying the vector z by the weight matrix of the output layer. The lth row of this weight matrix (corresponding to the lth output unit) is adaptively and independently adjusted (e.g., using the LMS learning rule) in order to approximate a given function fl(x) implemented by the lth output unit. We may also use the CMAC as a classifier by adding nonlinear activations such as sigmoids or threshold activation functions to the output units and employing the delta rule or the perceptron rule (or the adaptive Ho-Kashyap rule), respectively. The high degree of sparsity of the vector z typically leads to fast learning. Additional details on the learning behavior of the CMAC can be found in Wong and Sideris (1992).
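The following hedged sketch puts the two stages together for a single output trained with the LMS rule; it reuses cmac_feature_vector from the sketch above. Dividing the learning step by c is a common practical choice (an assumption here, not prescribed by the text), since at most c weights are active for any input.

    import numpy as np

    def train_cmac_lms(samples, c, q, J, lr=0.1, epochs=50):
        """Fit one CMAC output y = w . z to (x, target) pairs with the LMS rule."""
        w = np.zeros(J)
        for _ in range(epochs):
            for x, target in samples:
                z = cmac_feature_vector(x, c, q, J)   # sparse binary features
                y = w @ z                             # CMAC output
                w += lr * (target - y) * z / c        # LMS step on the c active weights
        return w

    # Toy example: approximate f(x) = sin(x) on [0, 2*pi]
    xs = np.linspace(0.0, 2 * np.pi, 100)
    data = [([x], np.sin(x)) for x in xs]
    w = train_cmac_lms(data, c=8, q=0.1, J=256)
    z = cmac_feature_vector([np.pi / 2], c=8, q=0.1, J=256)
    print("CMAC estimate of sin(pi/2):", w @ z)

Because each update touches only the (at most) c active weights, training is fast even when J is large.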

Because of its intrinsic local generalization property, the CMAC is most successful at approximating slowly varying functions. It will fail to approximate functions that oscillate rapidly or that are highly nonlinear (Cotter and Guillerm, 1992; Brown et al., 1993). Thus, the CMAC does not have universal approximation capabilities like those of multilayer feedforward nets of sigmoid units or RBF nets.

One appealing feature of the CMAC is its efficient realization in software in terms of training time and real-time operation. The CMAC also has practical hardware realizations using logic cell arrays in VLSI technology (Miller et al., 1990b). Examples of applications of CMAC include real-time robotics (Miller et al., 1990d), pattern recognition (Glanz and Miller, 1987), and signal processing (Glanz and Miller, 1989).

6.2.1 CMAC Relation to Rosenblatt's Perceptron and Other Models

One of the earliest adaptive artificial neural network models is Rosenblatt's perceptron (Rosenblatt, 1961). In its most basic form, this model consists of a hidden layer containing a large number of units that compute random Boolean functions, connected to an output layer of one or more LTGs, as illustrated in Figure 6.2.2. (Historically speaking, the term perceptron was originally coined for the architecture of Figure 6.2.2 or its variants. However, in the current literature, the term perceptron is usually used to refer to the unit in Figure 3.1.1 or 3.1.4.)

In Rosenblatt's perceptron, each hidden unit is restricted to receive a small number of inputs relative to the total number of inputs in a given input pattern. Here, the input pattern is typically assumed to be a two-dimensional binary image formed on a "retina". As shown in Figure 6.2.2, each hidden unit "sees" only a small piece of the binary input image. The idea here is that some of the many random hidden units might get "lucky" and detect "critical features" of the input image, thus allowing the output LTG(s) to perfectly classify the input image, after being trained with the perceptron rule. Ultimately, the hidden random Boolean units are intended to map a given nonlinearly separable training set of patterns onto vectors z ∈ {0, 1}^J of a high-dimensional feature space such that the training set becomes linearly separable. Note that these ideas are similar to those discussed in relation to polynomial threshold gates (PTGs) in Chapter One (refer to Figure 1.3.4). The basic difference here is that a binary input PTG employs AND gates as its hidden units, as opposed to the potentially more powerful Boolean units employed in Rosenblatt's perceptron. Another important difference is that a PTG allows one of the hidden AND units to "see" (cover) the whole input pattern, with the other AND units covering substantial portions of the input.
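A minimal sketch of this architecture follows, assuming lookup-table random Boolean hidden units with a small fan-in over a flattened binary retina and a single output LTG trained with the perceptron rule; all function names and sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_random_boolean_units(n_pixels, n_hidden, fan_in):
        """Each hidden unit: a random subset of fan_in pixels plus a random
        truth table over the 2**fan_in possible sub-patterns."""
        subsets = [rng.choice(n_pixels, size=fan_in, replace=False) for _ in range(n_hidden)]
        tables = [rng.integers(0, 2, size=2 ** fan_in) for _ in range(n_hidden)]
        return subsets, tables

    def hidden_layer(x, subsets, tables):
        """Map a flattened binary image x to the binary feature vector z."""
        z = np.empty(len(subsets))
        for j, (idx, table) in enumerate(zip(subsets, tables)):
            address = int("".join(str(int(b)) for b in x[idx]), 2)
            z[j] = table[address]
        return z

    def train_output_ltg(X, y, subsets, tables, epochs=100):
        """Train the output LTG on the hidden activations with the perceptron rule."""
        w, b = np.zeros(len(subsets)), 0.0
        for _ in range(epochs):
            for x, t in zip(X, y):                # targets t in {0, 1}
                z = hidden_layer(x, subsets, tables)
                out = float(w @ z + b > 0)
                w += (t - out) * z                # perceptron-rule update
                b += (t - out)
        return w, b

    # Usage sketch: an 8-by-8 binary retina, 300 hidden units, fan-in of 5 pixels each
    subsets, tables = make_random_boolean_units(n_pixels=64, n_hidden=300, fan_in=5)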


Figure 6.2.2. Rosenblatt's perceptron.

Rosenblatt's perceptron also has common features with the CMAC model discussed above. Both models restrict the amount of input information seen by each hidden unit, employ a nonlinear Boolean transformation via the hidden units, and allow for adaptive computation of the output layer weights. If we take a second look at the CMAC architecture in Figure 6.2.1, we see that the first mapping (represented by the circuitry inside the dashed rectangle) is realized by a Boolean AND-OR network. Here, the jth OR unit and the set of AND units feeding into it can be thought of as generating a random Boolean function and may be compared to a random Boolean unit in Figure 6.2.2. The Boolean functions zj in the CMAC acquire their random nature from the random interconnectivity pattern assumed between the two layers of AND and OR units. However, due to the sparsity of connections between these two layers, the class of Boolean functions realized by the zj's has neither the richness nor the diversity of the uniformly random Boolean functions in Rosenblatt's perceptron. These two models also have a minor difference in that the CMAC normally uses linear output units with LMS training, while the perceptron uses LTGs trained with the perceptron rule. This difference is due to the different nature of the intended applications of each model: The CMAC is primarily used as a continuous, smooth function approximator, whereas Rosenblatt's perceptron was originally intended as a pattern classifier. One may also note the more structured receptive field organization in the CMAC compared to the perceptron. Later in this section, we will see that Rosenblatt's perceptron does not have the intrinsic local generalization feature of the CMAC.

The non-universality of the CMAC is also shared by Rosenblatt's perceptron. This limitation is due to the localized nature of the hidden unit receptive fields defined on the input image. For example, it has been shown (Minsky and Papert, 1969) that this particular perceptron model cannot determine whether or not all the parts of its input image (geometric figure) are connected to one another, nor can it determine whether or not the number of 'on' pixels in a finite input image is odd. The latter task is equivalent to the parity problem and can only be solved by Rosenblatt's perceptron if at least one hidden unit is allowed to have its receptive field span the entire input image.

The limitations of Rosenblatt's model can be relaxed by allowing every hidden unit to see all inputs. However, this becomes impractical when the dimension n of the input is large, since there would be 2^(2^n) possible Boolean functions for each hidden random Boolean unit to choose from. Therefore, there is very little chance for a hidden unit to randomly become a detector of "critical features," unless we start with an exponentially large number of hidden units. But this requirement renders the model impractical.

Yet another weakness of Rosenblatt's perceptron is that it is not "robustness-preserving." In other words, it does not allow for good local generalization. To see this, consider two similar unipolar binary input patterns (vectors) x' and x". Here, we use as the similarity measure the normalized Hamming distance Dx between the two input patterns

$D_x(\mathbf{x}', \mathbf{x}'') = \frac{1}{n} \sum_{i=1}^{n} \left| x_i' - x_i'' \right|$                (6.2.1)

If x' is similar to x", then D(x', x") is much smaller than 1. Now, because of the uniform random nature of the hidden Boolean units, the output of any hidden unit zj is one (or zero) with a probability of 0.5 regardless of the input. Thus, the activation patterns z at the output of the hidden layer are completely uncorrelated. In particular, the normalized Hamming distance Dz between the two z vectors corresponding to any two input vectors is approximately equal to . Therefore, this model is not robustness-preserving.

Gallant and Smith (1987) and, independently, Hassoun (1988) have proposed a practical classifier model inspired by Rosenblatt's perceptron, but which solves some of the problems associated with Rosenblatt's model. This model, the Gallant-Smith-Hassoun (GSH) model, is also similar to a version of an early model studied by Gamba (1961) and referred to as the Gamba Perceptron (Minsky and Papert, 1969). The main distinguishing features of the GSH model are that every hidden unit sees the whole input pattern x and that the hidden units are random LTGs. For the trainable output units, the GSH model uses Ho-Kashyap learning in Hassoun's version and the pocket algorithm [a modified version of the perceptron learning rule that converges for nonlinearly separable problems; for details, see Gallant (1993)] in the Gallant and Smith version. The hidden LTGs assume fixed integer weights and biases (thresholds) generated randomly in some range [-a, +a].

The use of random LTGs, as opposed to random Boolean units, as hidden units has the advantage of a "rich" distributed representation of the critical features in the hidden activation vector z. This distributed representation, coupled with the ability of hidden LTGs to see the full input pattern, makes the GSH model a universal approximator of binary mappings. However, an important question here is how many hidden random LTGs are required to realize an arbitrary n-input binary mapping of m points/vectors? This question can be easily answered for the case where the hidden LTGs are allowed to have arbitrary (not random) parameters (weights and thresholds). In this case, and recalling Problem 2.1.2 and Theorem 1.4.1, we find that m hidden LTGs are sufficient to realize any binary function of m points in {0, 1}^n. Now, if we assume hidden LTGs with random parameters, we might intuitively expect that the required number of hidden units in the GSH model for approximating any binary function of m points will be much greater than m. Empirical results show that this intuitive answer is not correct! Simulations with the 7-bit parity function, random functions, and other completely as well as partially specified Boolean functions reveal that the required number of random LTGs remains on the order of m (Gallant and Smith, 1987). Note that for the worst case scenario of a complex completely specified n-input Boolean function where m = 2^n, the number of hidden LTGs in the GSH model still scales exponentially in n.
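The sketch below illustrates the GSH architecture under the assumptions stated above: hidden LTGs with fixed random integer weights and thresholds drawn from [-a, +a], each seeing the full input, and a single trainable output LTG. The ordinary perceptron rule is used here as a simple stand-in for the pocket algorithm or Ho-Kashyap learning; names and sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)

    def make_gsh_hidden(n_inputs, J, a=5):
        """Fixed random integer weights and thresholds in [-a, +a] for J hidden LTGs."""
        W = rng.integers(-a, a + 1, size=(J, n_inputs)).astype(float)
        theta = rng.integers(-a, a + 1, size=J).astype(float)
        return W, theta

    def gsh_hidden(x, W, theta):
        """Hidden layer: J linear threshold gates, each seeing the whole input."""
        return (W @ x - theta > 0).astype(float)

    def train_output(X, y, W, theta, epochs=200):
        """Trainable output LTG on the hidden activations (perceptron-rule stand-in)."""
        v, b = np.zeros(W.shape[0]), 0.0
        for _ in range(epochs):
            for x, t in zip(X, y):
                z = gsh_hidden(x, W, theta)
                out = float(v @ z + b > 0)
                v += (t - out) * z
                b += (t - out)
        return v, b

    # Toy test: 4-bit parity, which no single LTG can realize on the raw inputs
    n = 4
    X = np.array([[(i >> k) & 1 for k in range(n)] for i in range(2 ** n)], dtype=float)
    y = X.sum(axis=1) % 2
    W, theta = make_gsh_hidden(n, J=64)
    v, b = train_output(X, y, W, theta)
    pred = np.array([float(v @ gsh_hidden(x, W, theta) + b > 0) for x in X])
    print("training errors:", int(np.sum(pred != y)))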

The above result on the size of the hidden layer in the GSH model may be partially explained in terms of the mapping properties of the random LTG layer. Consider two n-dimensional binary input vectors x' and x" which are mapped by the random LTG layer of the GSH model into the J-dimensional binary vectors z' and z", respectively. Let us assume that n is large and that the weights and thresholds of the LTGs are generated according to normal distributions. Using a result of Amari (1974, 1990), we may relate the normalized Hamming distances Dz and Dx according to

(6.2.2)

where

(6.2.3)

The parameter A is close to unity if the weights and thresholds are identically distributed. Figure 6.2.3 shows a plot of Equation (6.2.2) for A = 1.

Figure 6.2.3. Normalized Hamming distance between two hidden layer activation vectors versus the distance between the two corresponding input vectors.

For small values of Dx, Dz is small, so that the output activity of the hidden layer is similar for similar inputs. Therefore, the random LTG layer exhibits robustness-preserving features. Also, since the slope of the Dz versus Dx curve approaches infinity as Dx approaches zero, the differences among similar vectors in a small neighborhood of the input space are amplified in the corresponding z vectors. Such a property is useful for recognizing differences among very similar input vectors, as is often desired when dealing with Boolean mappings (e.g., parity) or highly nonlinear functions. Equation (6.2.2) also implies that very different inputs map into very different outputs. This richness of distributed representation of input features at the hidden layer's output helps increase the probability that the output unit(s) find good approximations to arbitrary mappings. In addition to the above desirable features, the GSH model is very appealing from a hardware realization point of view. Here, the restricted dynamic range of the hidden LTGs' integer parameters allows for their simple realization in VLSI technology. We also note that the range [-a, +a] of the hidden LTGs' integer parameters can be made as small as [-1, +1]; however, we pay for this reduced dynamic range by having to increase the number of hidden LTGs. This trade-off is intuitively justifiable and has been verified empirically in Hassoun (1988).
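As an illustrative numerical check of this behavior (a sketch, not a reproduction of the cited analysis), the following code measures the average hidden-layer distance produced by a layer of LTGs with Gaussian random weights and thresholds for input pairs at increasing Hamming distances; the resulting curve is steep near Dx = 0 and saturates for dissimilar inputs, in qualitative agreement with the shape described for Figure 6.2.3.

    import numpy as np

    rng = np.random.default_rng(3)

    def ltg_layer(x, W, theta):
        """J linear threshold gates with weights W and thresholds theta."""
        return (W @ x - theta > 0).astype(float)

    n, J, trials = 200, 1000, 20
    W = rng.normal(0.0, 1.0, size=(J, n))        # random Gaussian weights
    theta = rng.normal(0.0, 1.0, size=J)         # random Gaussian thresholds

    for flips in (1, 2, 5, 10, 50, 100, 200):
        dz = 0.0
        for _ in range(trials):
            x1 = rng.integers(0, 2, size=n).astype(float)
            x2 = x1.copy()
            idx = rng.choice(n, size=flips, replace=False)
            x2[idx] = 1 - x2[idx]                # flip 'flips' randomly chosen bits
            dz += np.mean(ltg_layer(x1, W, theta) != ltg_layer(x2, W, theta))
        print(f"D_x = {flips / n:5.3f}   average D_z = {dz / trials:5.3f}")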
