5.3 Applications

Backprop is by far the most popular supervised learning method for multilayer neural networks. Backprop and its variations have been applied to a wide variety of problems including pattern recognition, signal processing, image compression, speech recognition, medical diagnosis, prediction, nonlinear system modeling, and control.

The most appealing feature of backprop is its adaptive nature, which allows complex processes to be modeled through learning from measurements or examples. The method requires neither a specific mathematical model of the problem being solved nor expert knowledge about it. The purpose of this section is to give the reader a flavor of the various areas of application of backprop and to illustrate some strategies that might be used to enhance the training process on some nontrivial real-world problems.

5.3.1 NETtalk

One of the earliest applications of backprop was to train a network to convert English text into speech (Sejnowski and Rosenberg, 1987). The system, known as NETtalk, consisted of two modules: a mapping network and a commercial speech synthesis module. The mapping network is a feedforward neural network having 80 units in the hidden layer and 26 units in the output layer. The output units form a 1-out-of-26 code which encodes phonemes. The output of the neural network drives the speech synthesizer, which in turn generates the sounds associated with the input phonemes. The input to the neural network is a 203-dimensional binary vector which encodes a window of seven consecutive characters (29 bits for each of the seven characters, including punctuation; each character is encoded using a 1-out-of-29 binary code). The desired output was a phoneme code giving the pronunciation of the letter at the center of the input window.

A block diagram for NETtalk is shown in Figure 5.3.1. When trained on 1024 words from a set of English phoneme exemplars, NETtalk was capable of intelligible speech after only 10 training cycles and obtained an accuracy of 95% on the training set after 50 cycles. The network first learned to recognize the division points between words and then gradually learned to map phonemes, sounding rather like a child learning to talk. The network was capable of distinguishing between vowels and consonants, and when tested on new text it achieved a generalization accuracy of 78%. Upon adding random noise to the weights, or removing a few units, the network's performance was found to degrade continuously, as opposed to catastrophically (as for a serial digital system).

Figure 5.3.1. NETtalk: A backprop-trained neural network that converts English text to speech.

A similar commercially available rule-based system (DEC-talk), which employs an expert system of hand-coded linguistic rules, performs better than NETtalk on the same task. However, the significance of NETtalk is in its relatively short development time; NETtalk simply learned from a limited set of examples, whereas DEC-talk embodies rules which are the result of several years of analysis by many linguists. This application illustrates the ease, relative to an expert system approach, with which a neural network-based system can be developed even when a problem is not fully understood.

5.3.2 Glove-Talk

Using adaptive networks, it is possible to build device interfaces between a person's movements and a complex physical device. Such interfaces simplify the design of a compatible mapping by learning the mapping automatically during a training phase, and they also allow the mapping to be tailored to individual users.

Glove-Talk is a neural network-based adaptive interface system which maps hand gestures to speech (Fels and Hinton, 1993). Here, a bank of five feedforward neural networks with single hidden layers and backprop training is used to map sensor signals generated by a data glove to appropriate commands (words), which in turn are sent to a speech synthesizer which then speaks the word. A block diagram of the Glove-Talk system is shown in Figure 5.3.2. The hand-gesture data generated by the data glove consist of 16 parameters representing the x, y, z, roll, pitch, and yaw of the hand relative to a fixed reference, together with ten finger flex angles. These parameters are measured every 1/60th of a second.

Figure 5.3.2. Glove-Talk: A neural network-based system that maps hand gestures to speech. (From S. S. Fels and G. E. Hinton, 1993, Glove-Talk: A Neural Network Interface Between a Data-Glove and a Speech Synthesizer, IEEE Transactions on Neural Networks, 4(1), pp. 2-8, ©1993 IEEE.)

Glove-Talk is designed to map complete hand gestures to whole words without mapping temporal constituents of the gesture. The trained system works as follows. The user forms a hand shape for a given root word (see Figure 5.3.3 for examples of root words/hand gestures). Then a movement of the hand forward and back in one of six directions determines the word ending: an upward direction signifies an -s (plural) ending, a direction toward the user signifies an -ed ending, a direction away from the user signifies an -ing ending, a direction to the user's right signifies an -er ending, a direction to the user's left signifies a -ly ending, and a downward direction signifies no ending (normal form). The duration and magnitude of the gesture determine the speech rate and stress. The exact time at which the word is spoken is determined by the hand trajectory network, which, upon detecting a deceleration phase of forward movement, sends a signal back to the preprocessor to enable it to pass the appropriate buffered data to each of four neural networks that perform the hand-gesture to word mapping. The four neural networks are labeled in Figure 5.3.2 as the hand shape, hand direction, hand speed, and hand displacement networks.
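The direction-to-ending convention above can be summarized as a simple lookup table. This is only an illustration of the convention; the direction labels and the naive suffix concatenation are our assumptions, and the real system maps gestures through trained networks rather than a table.

```python
# Illustrative table of Glove-Talk's six movement directions and the
# word endings they signify, as described in the text.
ENDING_FOR_DIRECTION = {
    "up": "s",               # plural -s
    "toward_user": "ed",
    "away_from_user": "ing",
    "right": "er",
    "left": "ly",
    "down": "",              # no ending (normal form)
}

def inflect(root, direction):
    """Naive concatenation of a root word and its gesture-selected ending."""
    return root + ENDING_FOR_DIRECTION[direction]
```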

Figure 5.3.3. Examples of root words for several hand gestures. (From S. S. Fels and G. E. Hinton, 1993, Glove-Talk: A Neural Network Interface Between a Data-Glove and a Speech Synthesizer, IEEE Transactions on Neural Networks, 4(1), pp. 2-8, ©1993 IEEE.)

The hand shape network has 80 hidden units which are fully connected to 66 output units. The output units employ a 1-out-of-66 binary encoding to classify the vocabulary of 66 root words used by Glove-Talk. The input vector to this network consists of preprocessed versions of the fingers' flex angles and the hand's roll, pitch, and yaw (the flex angles are linearly scaled to lie between zero and one, and the sines and cosines of the roll, pitch, and yaw are used). Full connectivity between the inputs and the 80 hidden units is assumed. Since the outputs of the units in the output layer may be viewed as representing probability distributions across mutually exclusive alternatives, the softmax activation function y_l = exp(net_l) / Σ_j exp(net_j) was used for the L output units. Also, the relative entropy criterion/error function E = -Σ_l d_l ln(y_l) was used, which is appropriate for the above selected activations. This error function reduces to E = -ln(y_l) if binary targets d_l are used (recall the binary 1-out-of-L encoding for the output vector), where y_l is the output of the correct output unit. With these choices of output unit activation and error function, backprop is used to train the hand shape net.
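The output activation and error function just named can be sketched as follows, under the assumption that the activation is the standard softmax and the criterion is the relative entropy (cross-entropy) error that reduces to E = -ln(y_l) for 1-out-of-L targets; the function names are ours.

```python
import math

def softmax(net):
    """Softmax over the net inputs of the L output units."""
    exps = [math.exp(v) for v in net]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(d, y):
    """E = -sum_l d_l * ln(y_l); for a 1-out-of-L target this is -ln(y_l)."""
    return -sum(dl * math.log(yl) for dl, yl in zip(d, y) if dl > 0)
```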

The role of the hand direction network is to translate the direction of hand movement into one of the six possible endings of the word. This network used ten hidden units feeding into six output units. It employs output units whose activations and associated error functions are similar to those in the hand shape network; the six output units used a 1-out-of-6 encoding to represent the six possible endings. The input to this network is a 10-time-step window of x, y, and z values. The window includes the data at the "enable signal time" of the hand trajectory network and at the previous nine time steps.

The hand speed network maps a 20-time-step window of directionless hand speed and acceleration (current speed minus previous speed), for a total of 40 inputs, to a real-valued speaking rate. This window is determined by the hand trajectory network's enable signal time and the previous 19 time steps. Eight output units whose activities are in the range 0 to 1 are used. Post-processing is used to obtain a real-valued speaking rate (very fast, fast, slow, or very slow). This network has 15 hidden units. It employs standard backprop learning.

The hand displacement neural network of Glove-Talk's hand-gesture to word mapping maps the same inputs as the hand speed network to the output of a single sigmoid output unit, through five hidden units. Standard backprop is used for training. During retrieval, the value of the output unit was thresholded at 0.5 to decide whether to stress the word. Here, the amount of hand displacement determines whether a word is stressed.

Finally, the hand trajectory network has 10 hidden units feeding into a single sigmoidal unit which represents a binary decision about whether the most recent time is the right time to read the hand shape. The input data is a window of 10 time steps with five parameters at each time step, for a total of 50 inputs. The parameters are x, y, z, speed, and acceleration. The performance of this neural net is critical to the performance of the whole system; if the enable signal controlling the output of the preprocessor to the rest of the system is generated at the wrong time, the hand shape and direction of the movement may be wrong. Once the hand trajectory network is trained properly, the training of the other networks may be performed. Targets are presented to the user (as instructions) and he/she simply makes the appropriate hand gesture. An example of a target is "going-fast (stressed)." Here, the user makes the second hand shape in Figure 5.3.3 and quickly moves his/her hand far away and back again. Since the input/output data required by each network differ, the neural networks were trained independently. Thus, when training the hand shape network, the user focused attention on getting the correct hand shape and paid less attention to the hand's trajectory. Similarly, when training the other networks, attention was focused on the relevant aspect of the gesture.

With a 203 gesture-to-word vocabulary (66 root words, each with up to six different endings), Glove-Talk produced the correct word more than 99% of the time, although no word (output) was produced about 5% of the time (this is due to a lack of response from the hand trajectory network).

The major lesson to learn from this project is that complex problems are sometimes best handled using a modular neural network architecture, where a separate network is used for each well-defined subtask. However, this approach is only meaningful if the problem we are trying to solve admits naturally defined subtasks.

5.3.3 Hand-Written ZIP Code Recognition

The recognition of handwritten digits is a classical problem in pattern recognition. Specifically, the post office is interested in the recognition of handwritten ZIP codes on pieces of mail. A backprop network has been designed to recognize segmented numerals digitized from handwritten ZIP codes that appeared on US mail (Le Cun et al., 1989). Figure 5.3.4 shows examples of such handwritten numerals.

Figure 5.3.4. Examples of normalized digits which were segmented from handwritten US ZIP codes. (From Y. Le Cun et al., 1989, with permission of the MIT Press.)

The neural network was trained on 7,291 examples and was tested on 2,007 new examples. As can be seen from the figure, the data set contained numerous examples that are ambiguous, unclassifiable, or even misclassified. The examples are preprocessed through a simple linear transformation which makes the raw segmented digits fit in a 16 by 16 gray-level image. The transformation preserves the aspect ratios of the digits. The gray levels in each image were scaled and translated to fall within the range -1 to +1.

The network consisted of three hidden layers H1, H2, and H3 and an output layer. Layer H1 is connected to the input image and layer H3 feeds its outputs into the output layer. The output layer has 10 units and uses 1-out-of-10 coding. Layer H3 has 30 units and is fully connected to H2. The output layer is fully connected to H3. On the other hand, "weight sharing" interconnections are used between the inputs and layer H1 and between layers H1 and H2. The network is represented in Figure 5.3.5.

Weight sharing refers to having several connections controlled by a single weight (Rumelhart et al., 1986b). It imposes equality constraints among connection strengths, thus reducing the number of free parameters in the network, which leads to improved generalization. Another motivation for employing weight sharing is to encourage the hidden layers H1 and H2 to develop feature-detection properties that simplify the classification task ultimately implemented by layer H3 and the output layer. This strategy is crucial for high-accuracy classification performance here, since the network operates directly on low-level input images, rather than on a relatively small training set of carefully preprocessed invariant features.

Figure 5.3.5. Network architecture for the handwritten ZIP code recognition neural network. (From Y. Le Cun et al., 1989, with permission of the MIT Press.)

The first hidden layer (H1) is composed of "feature maps". All units in a feature map share the same set of weights (except for the threshold, which may differ from unit to unit). There are 12 feature maps, each an 8 by 8 group of units (64 units per feature map). Each unit in a feature map receives inputs from a 5 by 5 window of the input image. Two neighboring units in a feature map in H1 have their 5 by 5 receptive fields (in the input layer) two pixels apart. Here, the motivation is that the exact position of a feature need not be determined with high precision. Each of the 64 units in a given feature map performs the same operation on corresponding parts of the image. The function performed by a feature map can thus be interpreted as detecting the presence of one of 12 possible microfeatures at arbitrary positions in the input image.
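A feature map's shared-weight computation can be sketched as below: every unit applies the same 5 by 5 weight kernel to an input window, with neighboring units' windows offset by two pixels. This is a simplified sketch (plain Python lists, no edge padding), so the example sizes are illustrative rather than the exact 16 by 16 to 8 by 8 geometry of the original network.

```python
import math

def feature_map(image, kernel, stride=2):
    """One feature map: all units share the same kernel (weight sharing),
    with receptive fields offset by `stride` pixels."""
    k = len(kernel)
    n = len(image)
    out_size = (n - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            s = sum(kernel[a][b] * image[i * stride + a][j * stride + b]
                    for a in range(k) for b in range(k))
            row.append(math.tanh(s))  # hyperbolic tangent activation
        out.append(row)
    return out
```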

Layer H2 is also composed of 12 feature maps, eachcontaining 4 by 4 units. The connection scheme between layersH1 and H2 is quite similar to the one described above betweenH1 and the input layer, but with slightly more complications dueto the multiple 8 by 8 feature maps in H1. Each unit in H2 receivesinput from a subset (eight in this case) of the 12 maps in H1. Its receptive field is composed of a 5 by 5 window centered atidentical positions within each of the selected subset maps inH1. Again, all units in a given map in H2 share their weights.

As a result of the above structure, the network has 1,256 units, 64,660 connections, and 9,760 independent weights. All units use a hyperbolic tangent activation function. Before training, the weights were initialized from a uniform random distribution between -2.4 and +2.4 and further normalized by dividing each weight by the fan-in of the corresponding unit.
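The initialization rule just described (uniform draws in (-2.4, 2.4), divided by the receiving unit's fan-in) might be sketched as follows; the function name and seeding are ours.

```python
import random

def init_weights(fan_in, seed=0):
    """Draw fan_in weights uniformly from (-2.4, 2.4), then divide by fan-in."""
    rng = random.Random(seed)
    return [rng.uniform(-2.4, 2.4) / fan_in for _ in range(fan_in)]
```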

Backprop based on the approximate Newton method (described in Section 5.2.3) was employed in an incremental mode. The network was trained for 23 cycles through the training set (which required 3 days of CPU time on a Sun SparcStation 1). The percentage of misclassified patterns was 0.14% on the training set and 5.0% on the test set. Another performance test was performed employing a rejection criterion, where an input pattern was rejected if the difference between the levels of the two most active units in the output layer fell below a given threshold. For this rejection threshold, the network classification error on the test patterns was reduced to 1%, but at the cost of a 12% rejection rate. Additional weight pruning based on information theoretic ideas in a four hidden layer architecture similar to the one described above resulted in a network with substantially fewer free parameters than that described above, and improved performance to 99% correct generalization with a rejection rate of only 9% (Le Cun et al., 1990). For comparison, a fully interconnected feedforward neural network with 40 hidden units (10,690 free weights), employing no weight sharing, trained on the same task produced a 1.6% misclassification rate on the training set, and required 19.4% rejections to reach a 1% error rate on the test set.

Thus, when dealing with large amounts of low-level information (as opposed to carefully preprocessed feature data), proper constraints should be placed on the network so that the number of free parameters is reduced as much as possible without overly reducing its computational power. Also, incorporating a priori knowledge about the task (like the architecture for developing translation-invariant features implemented by layers H1 and H2 in the above network) can be very helpful in arriving at a practical solution to an otherwise difficult problem.

A similar experiment on handwritten digits scanned from real bank checks was reported by Martin and Pittman (1991). Digits were automatically presegmented and size-normalized to a 15 by 24 gray-scale array, with pixel values from 0 to 1.0. A total of 35,200 samples were available for training and another 4,000 samples for testing. Here, various nets were trained using backprop to error rates of 2-3%. All nets had two hidden layers and ten units in their output layers, which employed 1-out-of-10 encoding. Three types of networks were trained: global fully interconnected nets with 150 units in the first hidden layer and 50 units in the second layer; local nets with 540 units in the first hidden layer receiving input from 5 by 8 local and overlapping regions (offset by 2 pixels) on the input array, these hidden units being fully interconnected to 100 units in the second hidden layer which, in turn, were fully interconnected to the units in the output layer; and, finally, shared-weight nets with approximately the same number of units in each layer as the local nets. The shared-weight nets employed a weight-sharing strategy, similar to the one in Figure 5.3.5, between the input and the first hidden layer and between the first and second hidden layers. Full interconnectivity was assumed between the second hidden layer and the output layer.

With the full 35,200-sample training set, and with a rejection rate of 9.6%, the generalization error was 1.7%, 1.1%, and 1.7% for the global, local, and local shared-weight nets, respectively. When the size of the training set was reduced to the 1,000 to 4,000 range, the local shared-weight net (with about 6,500 independent weights) was substantially better than the global (at 63,000 independent weights) and local (at approximately 79,000 independent weights) nets. All these results suggest another way of achieving good generalization: use a very large, "representative" training set. This works as long as the network is big enough to load the training set without too much effort to customize the interconnectivity patterns between hidden layers. However, a network which is about one order of magnitude smaller in terms of independent connections is easier to implement because of the reduced storage it requires.

Improved recognition performance can be achieved by training the above networks to reject the type of unclassifiable images ("rubbish") typically produced by the segmentation process, by actually including images of rubbish in the training set (Bromley and Denker, 1993). Yet another approach to improving recognition performance involves integrating character segmentation and recognition within one neural network (Rumelhart, 1989; Martin, 1990; Keeler et al., 1991; Keeler and Rumelhart, 1992; Martin, 1993).

5.3.4 ALVINN: A Trainable Autonomous Land Vehicle

ALVINN (Autonomous Land Vehicle In a Neural Network) is a backprop-trained feedforward network designed to drive a modified Chevy van (Pomerleau, 1991). It is an example of a successful application using sensor data in real time to perform a real-world perception task. Using a real-time learning technique, ALVINN quickly learned to autonomously control the van by observing the reactions of a human driver.

ALVINN's architecture consists of a single-hidden-layer fully interconnected feedforward net with 5 sigmoidal units in the hidden layer and 30 linear output units. The input is a 30 by 32 pattern reduced from the image of an on-board camera. The steering direction generated by the network is taken to be the center of mass of the activity pattern generated by the output units. This allows finer steering corrections than using only the most active output unit.
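The center-of-mass readout can be sketched as below: the steering direction is the activity-weighted mean position along the 30 output units, rather than the index of the most active unit, which permits corrections finer than the unit spacing. The function name is ours.

```python
def steering_center_of_mass(outputs):
    """Steering direction as the center of mass of the output activity
    pattern (a weighted mean of unit positions)."""
    total = sum(outputs)
    return sum(i * y for i, y in enumerate(outputs)) / total
```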

During the training phase, the network is presented with road images as inputs and the corresponding steering signal generated by the human driver as the desired output. Backprop training is used with a constant learning rate for each weight that is scaled by the fan-in of the unit to which the weight projects. A steadily increasing momentum coefficient is also used during training. The desired steering angle is presented to the network as a Gaussian distribution of activation centered around the steering direction that will keep the vehicle centered on the road. The desired activation pattern was generated as d_l = exp(-D_l^2 / 10), where d_l represents the desired output for unit l and D_l is the lth unit's distance from the correct steering direction point along the output vector. The variance 10 was determined empirically. The Gaussian target pattern makes the learning task easier than a "1-out-of-30" binary target pattern, since slightly different road images require the network to respond with only slightly different output vectors.
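The Gaussian desired-output pattern d_l = exp(-D_l^2 / 10) can be sketched as follows; the 30-unit output layer and the empirically chosen variance parameter 10 come from the text, while the function name is ours.

```python
import math

def gaussian_targets(correct_unit, n_units=30, variance=10.0):
    """Desired outputs d_l = exp(-D_l**2 / variance), where D_l is unit l's
    distance from the correct steering unit along the output vector."""
    return [math.exp(-(l - correct_unit) ** 2 / variance)
            for l in range(n_units)]
```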

Since the human driver tends to steer the vehicle down the center of the road, the network will not be presented with enough situations in which it must recover from misalignment errors. A second problem is that when training the network with only the current image of the road, one runs the risk of overlearning from repetitive inputs, thus causing the network to "forget" what it had learned from earlier training.

These two problems are handled by ALVINN as follows. First, each input image is laterally shifted to create 14 additional images in which the vehicle appears to be shifted by various amounts relative to the road center. These images are shown in Figure 5.3.6. A correct steering direction is then generated and used as the desired target for each of the shifted images. Second, in order to eliminate the problem of overtraining on repetitive images, each training cycle consists of a pass through a buffer of 200 images which includes the current original image and its 14 shifted versions. After each training cycle, a new road image and its 14 shifted versions replace 15 patterns from the current set of 200 road scenes. Ten of the fifteen patterns to be replaced are the ones with the lowest error; the other five patterns are chosen randomly.
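The buffer update above can be sketched as follows: drop the ten lowest-error patterns plus five random ones, then append the new image and its 14 shifted versions. The list-based bookkeeping and the function name are our assumptions, not Pomerleau's implementation.

```python
import random

def replace_patterns(buffer, errors, new_patterns, seed=0):
    """Replace 15 of the 200 buffered patterns: the 10 with lowest error
    plus 5 chosen at random, then append the 15 new patterns."""
    rng = random.Random(seed)
    order = sorted(range(len(buffer)), key=lambda i: errors[i])
    drop = set(order[:10])                      # 10 lowest-error patterns
    remaining = [i for i in range(len(buffer)) if i not in drop]
    drop.update(rng.sample(remaining, 5))       # plus 5 random patterns
    kept = [p for i, p in enumerate(buffer) if i not in drop]
    return kept + list(new_patterns)            # buffer size is preserved
```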

Figure 5.3.6. Shifted video images from a singleoriginal video image used to enrich the training set used to trainALVINN. (From D. A. Pomerleau, 1991, with permission of the MITPress.)

ALVINN requires approximately 50 iterations through this dynamically evolving set of 200 patterns to learn to drive on roads it had been trained on (an on-board Sun-4 workstation took 5 minutes to do the training, during which a teacher driver drives at about 4 miles per hour over the test road). In addition to being able to drive along the same stretch of road it trained on, ALVINN can also generalize to drive along parts of the road it has never encountered, even under a wide variety of weather conditions. In the retrieval phase (autonomous driving), the system is able to process 25 images per second, allowing it to drive up to the van's maximum speed of 20 miles per hour (this maximum speed is due to constraints imposed by the hydraulic drive system). This speed is over twice the speed of any other sensor-based autonomous system that has driven this van. Further refinements of ALVINN can be found in Pomerleau (1993).

In contrast to other traditional navigation systems (e.g., Dickmanns and Zapp, 1987), which are designed to track programmer-chosen features (such as lines painted on the road), ALVINN is able to learn for each new domain which road features are important and then develop its own steering strategy to stay on the road. When trained on multiple roads, the network developed hidden unit feature detectors for the lines painted on the road, while in the absence of painted lines, some hidden units became sensitive to road edges. As a result, ALVINN is able to drive in a wider variety of situations than any other autonomous navigation system.

5.3.5 Medical Diagnosis Expert Net

Clinical diagnosis is often fraught with great difficulty because multiple, often unrelated, disease states can surface with very similar historical, symptomatologic, and clinical data. As a result, physicians' accuracy in diagnosing such diseases is often poor.

Feedforward multilayer neural networks trained with backprop have been reported to exhibit improved clinical diagnosis over physicians and traditional expert system approaches (Bounds et al., 1988; Yoon et al., 1989; Baxt, 1990). In this section, an illustrative example of a neural network-based medical diagnosis system is described which is applied to the diagnosis of coronary occlusion (Baxt, 1990).

Acute myocardial infarction (coronary occlusion) is an example of a disease which is difficult to diagnose. There have been a number of attempts to automate the diagnosis process. The most promising automated solution (Goldman et al., 1988) is able to achieve a detection rate of 88%, which is about the same rate at which physicians are able to detect the disease, and a false alarm rate of 26%, which is slightly better than the 29% false alarm rate achieved by physicians. In the following study, a feedforward fully interconnected neural network with two hidden layers and a single output unit is trained to diagnose coronary occlusion. The two hidden layers have 10 units each. All units are assumed to have unipolar sigmoidal activations, and backprop is used to train the network.

The training set consisted of data on 356 patients who had been admitted to the coronary care unit. Of the 356 patients, 236 did not have the coronary disease and 120 did. The network was trained on a randomly chosen set consisting of half of the patients who had sustained infarction and half of the patients who had not. The data on each patient consisted of twenty variables which were found to be predictive of the presence of acute myocardial infarction (examples of such variables are age, sex, nausea and vomiting, shortness of breath, diabetes, hypertension, and angina). These variables are a subset of forty-one variables collected on all patients from the emergency department records of patients admitted to the coronary care unit. A procedure was subsequently used to confirm the presence of infarction (Goldman et al., 1988) in all 356 cases. Most of the clinical input variables were coded in binary, such that 1 represented the presence of a finding and 0 its absence. Other variables, such as patient age, were coded as analog values between 0.0 and 1.0. The target value for the output was 1 for the subsequently confirmed presence of acute myocardial infarction and 0 for the confirmed absence of infarction.
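The input coding described above can be sketched as follows: binary findings become 1/0 entries and continuous variables such as age are scaled into [0.0, 1.0]. The particular variables, the age normalization, and the function name are hypothetical illustrations, not Baxt's actual coding.

```python
def encode_patient(age, max_age, findings):
    """Encode a patient record: scaled age followed by binary findings
    (1.0 = finding present, 0.0 = absent), in sorted-key order."""
    return [age / max_age] + [1.0 if findings[k] else 0.0
                              for k in sorted(findings)]
```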

After training, the network was tested on the remaining 178 patients (118 noninfarction, 60 infarction) to which it had not been exposed. This resulted in about 92% correct identification of the presence of infarction and about 96% correct identification of its absence. These results did not change substantially when the training and subsequent testing of the network were repeated after swapping the original training and testing sets.

The 92% detection rate and 4% false alarm rate for the neural network-based diagnosis of acute myocardial infarction show substantial improvements over physicians' performance of an 88% detection rate and a 29% false alarm rate. The network used routinely available data that are utilized by physicians screening patients for the presence of infarction, and it was able to discover relationships in these data that are not immediately apparent to physicians.

5.3.6 Image Compression and Dimensionality Reduction

Image compression techniques exploit the redundancy that naturally exists in most images for efficient storage and/or transmission. Here, a picture is encoded with a much smaller number of bits than the total number of bits required to describe it exactly. After retrieval, or at the receiver end of a transmission link, the encoded or "compressed" image may then be decoded into a full-sized picture. The compression of images can be posed as an optimization problem where, ideally, the encoding and decoding are done in a way that optimizes the quality of the decoded picture. A number of image compression schemes have been reported in the literature (see Gonzalez and Wintz, 1987). In the following, a neural network-based solution to this problem is described.

Consider the architecture of a single hidden layer feedforward neural network shown in Figure 5.3.7. This network has the same number of units in its output layer as inputs, and the number of hidden units is assumed to be much smaller than the dimension of the input vector. The hidden units are assumed to be of the bipolar sigmoid type, and the output units are linear. This network is trained on a set of n-dimensional real-valued vectors (patterns) xk such that each xk is mapped to itself at the output layer in an autoassociative mode. Thus, the network is trained to act as an encoder of real-valued patterns. Backprop may be used to learn such a mapping. Cottrell et al. (1987; 1989) proposed this architecture for image compression. One network they studied received inputs from an 8 × 8-pixel region (n = 64) and had 16 hidden units. Backprop was used to train the network to autoassociate randomly selected 8 × 8 patches (windows) of a given image. After training, the network was used to compress and then reconstruct the image, patch by patch, using a set of non-overlapping patches which covered the whole image. Now, to store this image, we need only store 16 data points for each 8 × 8 patch of the original image. This amounts to a 4:1 compression ratio.
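The patch-by-patch scheme can be sketched as follows; the helper names are ours, and the arithmetic simply restates the mapping above of each 64-pixel patch onto 16 hidden-unit values.

```python
def patches(image, side=8):
    """Split a square image (a list of rows) into non-overlapping
    side-by-side patches, scanned left to right, top to bottom."""
    n = len(image)
    return [[row[j:j + side] for row in image[i:i + side]]
            for i in range(0, n, side) for j in range(0, n, side)]

def compression_ratio(patch_side=8, n_hidden=16):
    """Pixels per patch divided by hidden values stored per patch."""
    return (patch_side * patch_side) / n_hidden
```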

Next, we illustrate the above method of image compression/encoding for the 256 × 256-pixel (8 bits per pixel) image "Lenna" shown in Figure 5.3.8a. Following the above training procedure, a 16-hidden-unit autoassociative net was trained on random 8 × 8 patches of the image using incremental backprop learning. Here, all pixel values are normalized to the range [-1, +1]. Typically, the learning consisted of 50,000 to 100,000 iterations at learning rates of 0.01 and 0.1 for the hidden and output layer weights, respectively. Figure 5.3.8b shows the image reproduced by the autoassociative net when tested on the training image. The reproduced image is quite close (to the eye) to the training image in Figure 5.3.8a; hence the reconstructed image is of good quality.

In order to achieve true compression for the purpose of efficient transmission over a digital communication link, the outputs of the hidden units must be quantized. Quantization consists of transforming the outputs of the hidden units [which are in the open interval (-1, +1)] to some integer range corresponding to the number of bits required for transmission. This effectively restricts the information in the hidden unit outputs to the number of bits used. In general, this transformation should be designed with care (Gonzalez and Wintz, 1987). However, in the above compression net, a simple uniformly spaced quantization may be used, for two reasons. First, the squashing activation function forces all outputs into the range -1 to +1, so no scaling is necessary. Second, backprop tended to make the hidden unit output variances about equal, so special block quantization (Huang and Schultheiss, 1963) is not needed here. The effects of quantization are tested for the training image in Figure 5.3.8a. Here, the hidden unit outputs are restricted to eight bits, i.e., 256 quantized values per output. This corresponds to a data rate of 2 bits per pixel (a total of 128 bits are needed to code each 8 × 8 patch of pixels, resulting in a data rate of 2 bits per pixel). That is, only two binary bits are transmitted for each pixel in the original image. The reconstructed image was as good to the eye as the one shown in Figure 5.3.8b; i.e., it was as good as what the network does without any quantization. The network was also capable of respectable reconstructions of the training image with fewer quantized values for the hidden unit outputs. For example, Figures 5.3.8c and d show image reconstructions with 64 quantization values (a 1.5 bits per pixel data rate) and 16 quantization values (a 1 bit per pixel data rate), respectively.
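The uniform quantization and the bits-per-pixel arithmetic above can be sketched as follows; the quantizer (map each hidden output to the center of one of 2**bits equal bins in (-1, 1)) is one simple uniform scheme consistent with the text, and the function names are ours.

```python
def quantize(y, bits=8):
    """Uniformly quantize a hidden output y in (-1, 1) to the center of
    one of 2**bits equally spaced bins."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(int((y + 1.0) / step), levels - 1)
    return -1.0 + (index + 0.5) * step

def bits_per_pixel(n_hidden=16, bits=8, patch_pixels=64):
    """Data rate: bits transmitted per patch divided by pixels per patch."""
    return n_hidden * bits / patch_pixels
```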

How does this autoassociative net perform on new (non-training) images? Intuitively, one would not expect this net to generalize from a single training image, since the network learns, in some sense, the statistics of the image it is trained on, and different images have different statistics. Surprisingly, it turns out that the network does a respectable job of reproducing images that it was not trained on, even when quantization is used. For example, Figures 5.3.9b-d show reproductions of the image "Amal" in Figure 5.3.9a, using the above net (i.e., the autoassociative net trained on the image "Lenna" in Figure 5.3.8a). Figure 5.3.9b corresponds to the case with no quantization of hidden layer outputs, while Figures 5.3.9c and d correspond to data compression rates of 1.5 bits per pixel and 1 bit per pixel, respectively. Sonehara et al. (1989) reported similar simulations with an autoassociative image compression net. They showed that the reproduction of new images improves as the number of training images increases. They also showed, empirically, that the network achieves better reproduction when the quantized hidden unit outputs are used during learning, as opposed to using quantization only during reproduction.

Figure 5.3.7. A two-layer feedforward autoassociative network for image compression and dimensionality reduction.


Figure 5.3.8. (a) The original image "Lenna" used to train the autoassociative image compression net. (b) The reconstructed image using no quantization of the hidden unit outputs. (c) Reconstructed image using a 1.5 bits per pixel data rate. (d) Reconstructed image using a 1 bit per pixel data rate.


Figure 5.3.9. (a) The image "Amal" used to test the autoassociative image compression net trained on the image in Figure 5.3.8a. (b) Reconstructed test image using no quantization. (c) Reconstructed test image using a 1.5 bits per pixel data rate. (d) Reconstructed test image using a 1 bit per pixel data rate.


Since the input is forced to be reproduced through a narrow hidden layer (bottleneck), backprop attempts to extract regularities (significant features) from the input vectors. Here, the hidden layer, which is also known as the representation layer, is expected to evolve an internal low-dimensional distributed representation of the training data. Empirical analysis of the trained compression network shows that the hidden unit activities span the principal component subspace of the image vector(s), with some noise on the first principal component due to the nonlinear nature of the hidden unit activations (Cottrell and Munro, 1988). In this net, the nonlinearity in the hidden units is theoretically of no help (Bourlard and Kamp, 1988), and indeed Cottrell et al. (1987) and Cottrell and Munro (1988) found that the nonlinearity added little advantage in their simulations. These results are further supported by Baldi and Hornik (1989), who showed that if J linear hidden units are used, the network learns to project the input onto the subspace spanned by the first J principal components of the input. Thus, the network's hidden units discard as little information as possible by evolving their respective weight vectors to point in the directions of the input's principal components. This means that autoassociative backprop learning in a two-layer feedforward neural network with linear units has no processing capability beyond that of the unsupervised Hebbian PCA nets of Section 3.3.5. [For an application of a Hebbian-type PCA net to image compression, the reader is referred to Sanger (1989).]
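The Baldi-Hornik result lends itself to a quick numerical check; the data dimensions and step size below are arbitrary illustrative choices, not values from the original paper. A linear two-layer autoassociative net trained by gradient descent should converge so that the product W2W1 equals the orthogonal projector onto the span of the first J principal components.

```python
import numpy as np

# Numerical check of the linear-autoencoder/PCA correspondence discussed
# above. Data dimensions and learning rate are illustrative assumptions.
rng = np.random.default_rng(1)
n, d, J = 2000, 10, 3

# Zero-mean data concentrated near a J-dimensional subspace.
X = rng.normal(size=(n, J)) @ rng.normal(size=(J, d)) + 0.01 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

W1 = rng.normal(0, 0.01, (J, d))     # encoder (J linear hidden units)
W2 = rng.normal(0, 0.01, (d, J))     # decoder
for _ in range(5000):                # batch gradient descent on squared error
    H = X @ W1.T                     # hidden activities (linear units)
    E = X - H @ W2.T                 # reconstruction error
    g2 = (E.T @ H) / n               # gradient for the decoder weights
    g1 = ((E @ W2).T @ X) / n        # gradient for the encoder weights
    W2 += 0.005 * g2
    W1 += 0.005 * g1

# Projector onto the span of the first J principal components of X.
U = np.linalg.svd(X, full_matrices=False)[2][:J]
P = U.T @ U
gap = np.max(np.abs(W2 @ W1 - P))    # small when the subspaces agree
```

Note that the individual weight matrices need not match the principal component vectors themselves; only the product W2W1 is pinned down, which is why the comparison is against the projector P rather than U.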

The addition of one or more encoding hidden layers with nonlinear units between the inputs and the representation layer, and one or more decoding layers between the representation layer and the output layer, provides a network which is capable of learning nonlinear representations (Kramer, 1991; Oja, 1991; Usui et al., 1991). Such networks can perform the nonlinear analog of principal component analysis (recall the discussion of nonlinear PCA in Section 3.3.6) and extract "principal manifolds." These principal manifolds can, in some cases, serve as low-dimensional representations of the data which are more useful than principal components. A three-hidden-layer autoassociative net can, theoretically, compute any continuous mapping from the inputs to the second hidden layer (representation layer), and another continuous mapping from the second hidden layer to the output layer. Thus, a three-hidden-layer autoassociative net (with a linear or nonlinear representation layer) may in principle be considered a universal nonlinear PCA net. However, such a highly nonlinear net may be problematic to train by backprop due to local minima.
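The forward pass of such a net can be sketched as follows; the layer sizes and the choice of a linear bottleneck are illustrative assumptions, not values from the references.

```python
import numpy as np

# Forward pass of a three-hidden-layer autoassociative net: nonlinear
# encoding layer -> low-dimensional representation layer -> nonlinear
# decoding layer -> output. All layer sizes are illustrative assumptions.
rng = np.random.default_rng(2)
sizes = [64, 32, 4, 32, 64]      # input, encode, representation, decode, output
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = np.tanh(Ws[0] @ x)       # nonlinear encoding layer
    z = Ws[1] @ h                # representation (bottleneck) layer, linear here
    g = np.tanh(Ws[2] @ z)       # nonlinear decoding layer
    return Ws[3] @ g, z          # reconstruction and its 4-D code

y, z = forward(rng.normal(size=64))
```

Trained with backprop to reproduce its input, such a net assigns each 64-dimensional input a nonlinear 4-dimensional "principal manifold" coordinate z.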

Another way of interpreting the above autoassociative feedforward network is from the point of view of feature extraction (Kuczewski et al., 1987; Cottrell, 1991; Hassoun et al., 1992). Here, the outputs from the representation layer are taken as low-dimensional feature vectors associated with complete images (or any other high-dimensional raw data vectors) presented at the input layer. The decoder (reconstruction) subnet, on the other hand, is only needed during the training phase and is eliminated during retrieval. The output of the representation layer can now be used as an information-rich, low-dimensional feature vector which is easy to process/classify. Reducing the dimensionality of data with minimal information loss is also important from the point of view of computational efficiency: the high-dimensional input data can be transformed into "good" representations in a lower dimensional space for further processing. Since many algorithms are exponential in the dimensionality of the input, a reduction by even a single dimension may provide significant computational savings.
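The retrieval-phase arrangement can be made concrete with a small sketch; the weights below are untrained, hypothetical stand-ins, and the five-dimensional feature size is just an example.

```python
import numpy as np

# Feature-extraction sketch: once training is complete, the decoder subnet
# is discarded and only the encoder half maps raw inputs to features.
# The weights here are untrained, hypothetical stand-ins.
rng = np.random.default_rng(3)
W_enc = rng.normal(0, 0.1, (32, 64))   # encoding hidden layer
W_rep = rng.normal(0, 0.1, (5, 32))    # representation layer (5-D features)

def extract_features(x):
    # no decoder weights appear here: they were only needed during training
    return W_rep @ np.tanh(W_enc @ x)

features = extract_features(rng.normal(size=64))   # 5-D feature vector
```

The resulting feature vectors would then be fed to a downstream processor or classifier in place of the raw high-dimensional input.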

DeMers and Cottrell (1993) presented impressive results whereby the encoder subnet of a four-hidden-layer autoassociative net is used to supply five-dimensional inputs to a feedforward neural classifier. The classifier was trained to recognize the gender of a limited set of subjects. Here, the autoassociative net was first trained using backprop, with pruning of representation layer units, to generate a five-dimensional representation from 50-dimensional inputs. The inputs were taken as the first 50 principal components of 64 × 64-pixel, 8-bit gray scale images, each of which can be considered to be a point in a 4,096-dimensional "pixel space." The data set comprised 160 images of various facial expressions of 10 male and 10 female subjects, of which 120 images were used for training and 40 for testing. The images were captured by a frame grabber and reduced to 64 × 64 pixels by averaging. Each image was then aligned along the axes of the eyes and mouth. All images were normalized to have equal brightness and variance, in order to prevent the use of first order statistics for discrimination. Finally, the gray levels of the image pixels were linearly scaled to the range [0, 0.8]. The overall encoder/classifier system resulted in 95% correct gender recognition on both the training and test sets, which was found to be comparable to the recognition rate of human beings on the same images.
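The image normalization steps described here might be sketched as below; the exact ordering of the brightness/variance normalization and the [0, 0.8] rescaling is an assumption, as the source does not specify it.

```python
import numpy as np

# Sketch of the preprocessing described above: normalize each image to
# equal brightness (mean) and variance, then linearly scale its gray
# levels to [0, 0.8]. The ordering of these two steps is an assumption.
def normalize_image(img):
    img = (img - img.mean()) / img.std()   # equal brightness and variance
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) * 0.8    # linear scaling to [0, 0.8]

out = normalize_image(np.random.default_rng(4).uniform(0, 255, (64, 64)))
```

Removing first order statistics in this way forces the downstream classifier to rely on spatial structure rather than overall image brightness or contrast.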

The high rate of correct classification in the above simulation is a clear indication of the "richness" and significance of the representations/feature vectors discovered by the nonlinear PCA autoassociative net. For another significant application of nonlinear PCA autoassociative nets, the reader is referred to Usui et al. (1991). A somewhat related recurrent multilayer autoassociative net for data clustering and signal decomposition is presented in Section 6.4.2.
