5.1 Learning Rule for Multilayer Feedforward Neural Networks

Consider the two-layer feedforward architecture shown in Figure 5.1.1. This network receives a set of scalar signals {x0, x1, ... , xn}, where x0 is a bias signal equal to 1. This set of signals constitutes an input vector x, with x ∈ R^(n+1). The layer receiving the input signal is called the hidden layer. Figure 5.1.1 shows a hidden layer having J units. The output of the hidden layer is a (J+1)-dimensional real-valued vector z, z = [z0, z1, ..., zJ]^T. Again, z0 = 1 represents a bias input and can be thought of as being generated by a "dummy" unit (with index zero) whose output z0 is clamped at 1. The vector z supplies the input for the output layer of L units. The output layer generates an L-dimensional vector y in response to the input x which, when the network is fully trained, should be identical (or very close) to a "desired" output vector d associated with x.


Figure 5.1.1. A two layer fully interconnected feedforward neural network architecture. For clarity, only selected connections are drawn.

The activation function fh of the hidden units is assumed to be a differentiable nonlinear function (typically, fh is the logistic function defined by fh(net) = 1/(1 + e^(-λ net)), or a hyperbolic tangent function fh(net) = tanh(β net), with values for λ and β close to unity). Each unit of the output layer is assumed to have the same activation function, denoted fo; the functional form of fo is determined by the desired output signal/pattern representation or the type of application. For example, if the desired output is real-valued (as in some function approximation applications), then a linear activation fo(net) = net may be used. On the other hand, if the network implements a pattern classifier with binary outputs, then a saturating nonlinearity similar to fh may be used for fo. In this case, the components of the desired output vector d must be chosen within the range of fo. It is important to note that if fh is linear, then one can always collapse the net in Figure 5.1.1 to a single layer net and thus lose the universal approximation/mapping capabilities discussed in Chapter 2. Finally, we denote by wji the weight of the jth hidden unit associated with the input signal xi. Similarly, wlj is the weight of the lth output unit associated with the hidden signal zj.
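
For concreteness, a minimal sketch of these two candidate activation functions, assuming the slope parameters λ and β mentioned above (the Python rendering is an illustration, not part of the original text):

import numpy as np

def logistic(net, lam=1.0):
    # fh(net) = 1 / (1 + exp(-lam * net)); output in (0, 1)
    return 1.0 / (1.0 + np.exp(-lam * net))

def tanh_act(net, beta=1.0):
    # fh(net) = tanh(beta * net); output in (-1, 1)
    return np.tanh(beta * net)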

Next, consider a set of m input/output pairs {xk, dk}, where dk is an L-dimensional vector representing the desired network output upon the presentation of xk. The objective here is to adaptively adjust the J(n + 1) + L(J + 1) weights of this network such that the underlying function/mapping represented by the training set is approximated or learned. Since the learning here is supervised, i.e., target outputs are available, we may define an error function to measure the degree of approximation for any given setting of the network's weights. A commonly used error function is the SSE measure, but this is by no means the only possibility, and later in this chapter, several other error functions will be discussed. Once a suitable error function is formulated, learning can be viewed (as was done in Chapters 3 and 4) as an optimization process. That is, the error function serves as a criterion function, and the learning algorithm seeks to minimize the criterion function over the space of possible weight settings. For instance, if a differentiable criterion function is used, gradient descent on such a function will naturally lead to a learning rule. This idea was arrived at independently by Bryson and Ho (1969), Amari (1967; 1968), Werbos (1974), and Parker (1985). Next, we illustrate the above idea by deriving a supervised learning rule for adjusting the weights wji and wlj such that the following error function is minimized (in a local sense) over the training set (Rumelhart et al., 1986b):

E(w) = \frac{1}{2} \sum_{l=1}^{L} (d_l - y_l)^2 \qquad (5.1.1)

Here, w represents the set of all weights in the network. Note that Equation (5.1.1) is the "instantaneous" SSE criterion of Equation (3.1.32) generalized for a multiple output network.

5.1.1 Error Backpropagation Learning Rule

Since the targets for the output units are explicitly specified, one can directly use the delta rule, derived in Section 3.1.3, for updating the wlj weights. That is,

w_{lj}^{new} = w_{lj}^{c} + \rho_o (d_l - y_l) f_o'(net_l) z_j, \qquad l = 1, 2, \ldots, L; \; j = 0, 1, \ldots, J \qquad (5.1.2)

where net_l = \sum_{j=0}^{J} w_{lj} z_j is the weighted sum for the lth output unit, fo'(netl) is the derivative of fo with respect to net, and wlj^new and wlj^c represent the updated (new) and current weight values, respectively. The zj's are computed by propagating the input vector x through the hidden layer according to:

z_j = f_h(net_j) = f_h\left( \sum_{i=0}^{n} w_{ji} x_i \right), \qquad j = 1, 2, \ldots, J \qquad (5.1.3)
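
A minimal sketch of this forward pass together with the output-layer delta rule of Equation (5.1.2), assuming tanh hidden units and a linear output activation (so that fo'(net) = 1); the array shapes and function names are illustrative assumptions:

import numpy as np

def forward(x, W_h, W_o):
    """Propagate an input x (with bias component x[0] = 1) through the two-layer net.
    W_h has shape (J, n+1) holding the wji; W_o has shape (L, J+1) holding the wlj."""
    net_h = W_h @ x                               # net_j of Eq. (5.1.3)
    z = np.concatenate(([1.0], np.tanh(net_h)))   # hidden outputs, with bias z0 = 1
    y = W_o @ z                                   # linear output units: y_l = net_l
    return z, y

def delta_rule_output(W_o, z, y, d, rho_o=0.1):
    """Eq. (5.1.2) with linear fo, so fo'(net_l) = 1."""
    return W_o + rho_o * np.outer(d - y, z)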

The learning rule for the hidden layer weights wji is not as obvious as that for the output layer, since we do not have available a set of target values (desired outputs) for hidden units. However, one may derive a learning rule for hidden units by attempting to minimize the output layer error. This amounts to propagating the output errors (dl - yl) back through the output layer towards the hidden units in an attempt to estimate "dynamic" targets for these units. Such a learning rule is termed error back-propagation or the backprop learning rule and may be viewed as an extension of the delta rule (Equation 5.1.2) used for updating the output layer. To complete the derivation of backprop for the hidden layer weights, and similar to the above derivation for the output layer weights, we perform gradient descent on the criterion function in Equation (5.1.1), but this time the gradient is calculated with respect to the hidden weights:

\Delta w_{ji} = -\rho_h \frac{\partial E}{\partial w_{ji}} \qquad (5.1.4)

where the partial derivative is to be evaluated at the current weight values. Using the chain rule for differentiation, one may express the partial derivative in Equation (5.1.4) as

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial z_j} \, \frac{\partial z_j}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}} \qquad (5.1.5)

with

\frac{\partial E}{\partial z_j} = -\sum_{l=1}^{L} (d_l - y_l) f_o'(net_l) w_{lj} \qquad (5.1.6)

\frac{\partial z_j}{\partial net_j} = f_h'(net_j) \qquad (5.1.7)

and

\frac{\partial net_j}{\partial w_{ji}} = x_i \qquad (5.1.8)

Now, upon substituting Equations (5.1.6) through (5.1.8) into Equation (5.1.5) and using Equation (5.1.4), we arrive at the desired learning rule:

\Delta w_{ji} = \rho_h \left[ \sum_{l=1}^{L} (d_l - y_l) f_o'(net_l) w_{lj} \right] f_h'(net_j) \, x_i \qquad (5.1.9)

By comparing Equation (5.1.9) to Equation (5.1.2), one can immediately define an "estimated target" dj for the jth hidden unit implicitly through the backpropagated error signal dj - zj as follows:

d_j - z_j = \sum_{l=1}^{L} (d_l - y_l) f_o'(net_l) w_{lj} \qquad (5.1.10)
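
With this implicit definition, the hidden-layer rule of Equation (5.1.9) can be written in the same form as the delta rule of Equation (5.1.2), which is the sense in which the hidden units are given "dynamic" targets (this restatement is implied by the comparison above rather than numbered in the text):

\Delta w_{ji} = \rho_h \, (d_j - z_j) \, f_h'(net_j) \, x_i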

It is usually possible to express the derivatives of the activation functions in Equations (5.1.2) and (5.1.9) in terms of the activations themselves. For example, for the logistic activation function, we have

f_h'(net_j) = \lambda \, f_h(net_j) \left[ 1 - f_h(net_j) \right] = \lambda \, z_j (1 - z_j) \qquad (5.1.11)

and for the hyperbolic tangent function, we have

f_h'(net_j) = \beta \left[ 1 - f_h^2(net_j) \right] = \beta \, (1 - z_j^2) \qquad (5.1.12)
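
In code, this means a backward pass can reuse the activations already computed during the forward pass; a small sketch under the same λ and β parameterization assumed earlier:

import numpy as np

def logistic_derivative(z, lam=1.0):
    # Eq. (5.1.11): fh'(net) = lam * fh(net) * [1 - fh(net)], with z = fh(net)
    return lam * z * (1.0 - z)

def tanh_derivative(z, beta=1.0):
    # Eq. (5.1.12): fh'(net) = beta * [1 - fh(net)^2], with z = fh(net)
    return beta * (1.0 - z ** 2)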

The above learning equations may also be extended to feedforward nets with more than one hidden layer and/or nets with connections that jump over one or more layers (see Problems 5.1.2 and 5.1.3). The complete procedure for updating the weights in a feedforward neural net utilizing the above rules is summarized below for the two layer architecture of Figure 5.1.1. We will refer to this learning procedure as incremental backprop or just backprop:

1. Initialize all weights and refer to them as "current" weights wlj^c and wji^c (see Section 5.2.1 for details).

2. Set the learning rates ρo and ρh to small positive values (refer to Section 5.2.2 for additional details).

3. Select an input pattern xk from the training set (preferably at random) and propagate it through the network, thus generating hidden and output unit activities based on the current weight settings.

4. Use the desired target dk associated with xk, and employ Equation (5.1.2) to compute the output layer weight changes Δwlj.

5. Employ Equation (5.1.9) to compute the hidden layer weight changes Δwji. Normally, the current weights are used in these computations. In general, enhanced error correction may be achieved if one employs the updated output layer weights wlj^new. However, this comes at the added cost of recomputing yl and fo'(netl).

6. Update all weights according to wlj^new = wlj^c + Δwlj and wji^new = wji^c + Δwji for the output and hidden layers, respectively.

7. Test for convergence. This is done by checking some preselected function of the output errors to see if its magnitude is below some preset threshold. If convergence is met, stop; otherwise, set wlj^c = wlj^new and wji^c = wji^new and go to step 3. It should be noted that backprop may fail to find a solution that passes the convergence test. In this case, one may try to reinitialize the search process, tune the learning parameters, and/or use more hidden units.
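
The following sketch collects steps 1 through 7 for the two-layer architecture of Figure 5.1.1. It assumes tanh hidden units, linear output units, and a simple average-SSE convergence test; the sizes, learning rates, and stopping threshold are illustrative choices rather than values prescribed by the text.

import numpy as np

def incremental_backprop(X, D, J=4, rho_o=0.1, rho_h=0.1,
                         epochs=1000, tol=1e-3, seed=0):
    """Incremental backprop for the two-layer net of Figure 5.1.1:
    n inputs -> J tanh hidden units -> L linear output units.
    X has shape (m, n), D has shape (m, L); bias inputs are appended internally."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    L = D.shape[1]
    # Step 1: initialize "current" weights (small random values; see Section 5.2.1).
    W_h = rng.uniform(-0.1, 0.1, size=(J, n + 1))    # hidden weights wji
    W_o = rng.uniform(-0.1, 0.1, size=(L, J + 1))    # output weights wlj
    for _ in range(epochs):
        sse = 0.0
        for k in rng.permutation(m):                 # Step 3: present patterns at random
            x = np.concatenate(([1.0], X[k]))        # bias input x0 = 1
            net_h = W_h @ x
            z = np.concatenate(([1.0], np.tanh(net_h)))   # bias hidden unit z0 = 1
            y = W_o @ z                              # linear fo, so fo'(netl) = 1
            err = D[k] - y
            sse += 0.5 * float(err @ err)
            dW_o = rho_o * np.outer(err, z)          # Step 4: Eq. (5.1.2)
            back_err = W_o[:, 1:].T @ err            # Step 5: Eq. (5.1.9), current weights
            dW_h = rho_h * np.outer(back_err * (1.0 - np.tanh(net_h) ** 2), x)
            W_o += dW_o                              # Step 6: update all weights
            W_h += dW_h
        if sse / m < tol:                            # Step 7: crude convergence test
            break
    return W_h, W_o

A call such as W_h, W_o = incremental_backprop(X, D, J=4) returns the trained weight matrices, which can then be used with a forward pass like the one sketched earlier.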

The above procedure is based on "incremental" learning, which means that the weights are updated after every presentation of an input pattern. An alternative is to employ "batch" learning, where weight updating is performed only after all patterns (assuming a finite training set) have been presented. Batch learning is formally obtained by summing the right-hand side of Equations (5.1.2) and (5.1.9) over all patterns xk. This amounts to gradient descent on the criterion function

E(w) = \frac{1}{2} \sum_{k=1}^{m} \sum_{l=1}^{L} \left( d_l^k - y_l^k \right)^2 \qquad (5.1.13)

Even though batch updating moves the search point w in the direction of the true gradient at each update step, the "approximate" incremental updating is more desirable for two reasons: (1) it requires less storage, and (2) it makes the search path in the weight space stochastic (here, at each time step the input vector x is drawn at random), which allows for a wider exploration of the search space and, potentially, leads to better quality solutions. When backprop converges, it converges to a local minimum of the criterion function (McInerny et al., 1989). This fact is true of any gradient descent-based learning rule when the surface being searched is nonconvex (Amari, 1990); i.e., it admits local minima. Using stochastic approximation theory, Finnoff (1993; 1994) showed that for "very small" learning rates (approaching zero), incremental backprop approaches batch backprop and produces essentially the same results. However, for small constant learning rates there is a nonnegligible stochastic element in the training process which gives incremental backprop a "quasi-annealing" character in which the cumulative gradient is continuously perturbed, allowing the search to escape local minima with small, shallow basins of attraction. Thus, solutions generated by incremental backprop are often practical ones. The local minima problem can be further eased by heuristically adding random noise to the weights (von Lehman et al., 1988) or by adding noise to the input patterns (Sietsma and Dow, 1988). In both cases, some noise reduction schedule should be employed to dynamically reduce the added noise level towards zero as learning progresses.
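
For comparison, a batch variant accumulates the same per-pattern changes from Equations (5.1.2) and (5.1.9) over the whole training set and applies them once per epoch, which corresponds to descending the summed criterion of Equation (5.1.13); again a sketch under the same illustrative tanh/linear assumptions:

import numpy as np

def batch_backprop_epoch(X, D, W_h, W_o, rho_o=0.1, rho_h=0.1):
    """One batch update: sum the Eq. (5.1.2) and Eq. (5.1.9) changes over all patterns,
    then apply them once (tanh hidden units and linear outputs assumed)."""
    dW_o = np.zeros_like(W_o)
    dW_h = np.zeros_like(W_h)
    for k in range(X.shape[0]):
        x = np.concatenate(([1.0], X[k]))
        net_h = W_h @ x
        z = np.concatenate(([1.0], np.tanh(net_h)))
        y = W_o @ z
        err = D[k] - y
        dW_o += rho_o * np.outer(err, z)
        back_err = W_o[:, 1:].T @ err
        dW_h += rho_h * np.outer(back_err * (1.0 - np.tanh(net_h) ** 2), x)
    return W_h + dW_h, W_o + dW_o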

Next, the incremental backprop learning procedure is applied to solve a two-dimensional, two-class pattern classification problem. This problem should help give a good feel for what is learned by the hidden units in a feedforward neural network, and how the various units work together to generate a desired solution.

Example 5.1.1: Consider the two-class problem shown in Figure 5.1.2. The points inside the shaded region belong to class B and all other points are in class A. A three layer feedforward neural network with backprop training is employed which is supposed to learn to distinguish between these two classes. The network consists of an 8-unit first hidden layer, followed by a second hidden layer with 4 units, followed by a 1-unit output layer. We will refer to such a network as having an 8-4-1 architecture. All units employ a hyperbolic tangent activation function. The output unit should encode the class of each input vector; a positive output indicates class B and a negative output indicates class A. Incremental backprop was used with learning rates set to 0.1. The training set consists of 500 randomly chosen points, 250 from region A and another 250 from region B. In this training set, points representing class B and class A were assigned desired output (target) values of +1 and -1, respectively. Training was performed for several hundred cycles over the training set.
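
A sketch of how the training set for this example might be generated; the function in_class_B stands in for the shaded region of Figure 5.1.2, which is only specified graphically, so it is a hypothetical placeholder:

import numpy as np

def make_training_set(in_class_B, n_per_class=250, seed=0):
    """Draw n_per_class points from each region of [-1, +1]^2 by rejection sampling.
    in_class_B(x1, x2) -> bool is a stand-in for the shaded region of Figure 5.1.2."""
    rng = np.random.default_rng(seed)
    A, B = [], []
    while len(A) < n_per_class or len(B) < n_per_class:
        p = rng.uniform(-1.0, 1.0, size=2)
        (B if in_class_B(p[0], p[1]) else A).append(p)
    X = np.vstack([A[:n_per_class], B[:n_per_class]])
    D = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])[:, None]  # A -> -1, B -> +1
    return X, D

The resulting (X, D) pair could then be fed to an 8-4-1 extension of the backprop sketch above (the earlier sketch handles a single hidden layer only).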

Figure 5.1.2. Decision regions for the pattern classification problem in Example 5.1.1.

Figure 5.1.3 shows geometrical plots of all unit responses upon testing the network with a new set of 1000 uniformly randomly generated points inside the [-1, +1]^2 region. In generating each plot, a black dot was placed at the exact coordinates of the test point (input) in the input space if and only if the corresponding unit response is positive. The boundaries between the dotted and the white regions in the plots represent approximate decision boundaries learned by the various units in the network. Figure 5.1.3 (a)-(h) represents the decision boundaries learned by the eight units in the first hidden layer. Figure 5.1.3 (i)-(l) shows the decision boundaries learned by the four units of the second hidden layer. Figure 5.1.3 (m) shows the decision boundary realized by the output unit. Note the linear nature of the separating surfaces realized by the first hidden layer units, from which complex nonlinear separating surfaces are realized by the second hidden layer units and ultimately by the output layer unit. This example also illustrates how a single hidden layer feedforward net (counting only the first two layers) is capable of realizing convex, concave, as well as disjoint decision regions, as can be seen from Figure 5.1.3 (i)-(l). Here, we neglect the output unit and view the remaining net as one with an 8-4 architecture.
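
The per-unit plots of Figure 5.1.3 can be reproduced with a short test loop of this kind; unit_response is a hypothetical callable returning the activation of the unit of interest for a given input, and matplotlib is an assumed plotting choice:

import numpy as np
import matplotlib.pyplot as plt

def plot_unit_decision_region(unit_response, n_points=1000, seed=0):
    """Place a black dot wherever the unit's response to a random test point is positive."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, size=(n_points, 2))
    pos = np.array([unit_response(p) > 0.0 for p in pts])
    plt.scatter(pts[pos, 0], pts[pos, 1], s=4, c="black")
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.show()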

The present problem can also be solved with smaller networks (a smaller number of hidden units, or even a single hidden layer). However, the training of such smaller networks with backprop may become more difficult. A smaller network with a 5-3-1 architecture, trained with a variant of the backprop learning procedure (Hassoun et al., 1990), is reported in Song (1992) to produce a separating surface comparable to the one in Figure 5.1.3 (m).

Figure 5.1.3. Separating surfaces generated by the various units in the 8-4-1 network of Example 5.1.1. (a)-(h): Separating surfaces realized by the units in the first hidden layer; (i)-(l): separating surfaces realized by the units in the second hidden layer; and (m): separating surface realized by the output unit.

Huang and Lippmann (1988) employed Monte Carlo simulations to investigate the capabilities of backprop in learning complex decision regions (see Figure 2.3.3). They reported no significant performance difference between two and three layer feedforward nets when forming complex decision regions using backprop. They also demonstrated that backprop's convergence time is excessive for complex decision regions, and that the performance of such trained classifiers is similar to that obtained with the k-nearest neighbor classifier (Duda and Hart, 1973). Villiers and Barnard (1993) reported similar simulations, but on data sets consisting of a "distribution of distributions," where a typical class is a set of clusters (distributions) in the feature space, each of which can be more or less spread out and may involve some or all of the dimensions of the feature space; the distribution of distributions thus assigns a probability to each distribution in the data set. It was found that, for networks of equal complexity (same number of weights), there is no significant difference between the quality of the "best" solutions generated by two and three layer backprop-trained feedforward networks; in fact, the two layer nets demonstrated better performance on average. As for speed of convergence, three layer nets converged faster when the number of units in the two hidden layers was roughly equal.

Gradient descent search may be eliminated altogether in favor of a stochastic global search procedure that guarantees convergence to a global solution with high probability; genetic algorithms and simulated annealing are examples of such procedures and are considered in Chapter 8. However, the assured (in probability) optimality of these global search procedures comes at the expense of slow convergence. Next, a deterministic search procedure termed global descent is presented, which helps backprop reach globally optimal solutions.

5.1.2 Global Descent-Based Error Backpropagation

Here, we describe a learning method in which the gradient descent rule in batch backprop is replaced with a "global descent" rule (Cetin et al., 1993a). This methodology is based on a global optimization scheme, acronymed TRUST (terminal repeller unconstrained subenergy tunneling), which formulates optimization in terms of the flow of a special deterministic dynamical system (Cetin et al., 1993b).

Global descent is a gradient descent on a special criterion function C(w, w*) given by

C(w, w^*) = \log\left[ \frac{1}{1 + e^{-(E(w) - E(w^*)) - a}} \right] - \frac{3}{4} k \sum_i (w_i - w_i^*)^{4/3} \, u\left( E(w) - E(w^*) \right) \qquad (5.1.14)

where w*, with component values wi*, is a fixed weight vector which can be a local minimum of E(w) or an initial weight state w0, u(·) is the unit step function, a is a shifting parameter (typically set to 2), and k is a small positive constant. The first term on the right-hand side of Equation (5.1.14) is a monotonic transformation of the original criterion function (e.g., the SSE criterion may be used) which preserves all critical points of E(w) and has the same relative ordering of the local and global minima of E(w). It also flattens the portion of E(w) above E(w*) with minimal distortion elsewhere. On the other hand, the second term is a "repeller term" which shapes the surface around w = w* in the region where E(w) ≥ E(w*) so that the flow is driven away from w*. The overall effect of this energy transformation is schematically represented for a one-dimensional criterion function in Figure 5.1.4.

Performing gradient descent on C(w, w*) leads to the "global descent" update rule

\Delta w_i = -\rho \left[ \frac{1}{1 + e^{\,E(w) - E(w^*) + a}} \right] \frac{\partial E}{\partial w_i} + k (w_i - w_i^*)^{1/3} \, u\left( E(w) - E(w^*) \right) \qquad (5.1.15)

The first term on the right-hand side of Equation (5.1.15) is a "subenergy gradient," while the second term is a "non-Lipschitzian" terminal repeller (Zak, 1989). Upon replacing the gradient descent in Equations (5.1.2) and (5.1.4) by Equation (5.1.15), where wi represents an arbitrary hidden unit or output unit weight, the modified backprop procedure may escape local minima of the original criterion function E(w) given in Equation (5.1.13). Here, batch training is required since Equation (5.1.15) necessitates a unique error surface for all patterns.
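
A sketch of a single global descent update based on the reconstruction of Equation (5.1.15) given above; since the exact published form of the rule is not reproduced here, the subenergy factor and repeller expressions below should be read as assumptions made for illustration. E and grad_E are assumed to evaluate the batch criterion of Equation (5.1.13) and its gradient over the full training set.

import numpy as np

def global_descent_step(w, w_star, E, grad_E, rho=0.1, k=0.01, a=2.0):
    """One update of the global descent rule (reconstructed Eq. 5.1.15).
    w, w_star: flattened weight vectors; E(w) -> scalar; grad_E(w) -> vector."""
    dE = E(w) - E(w_star)
    subenergy = 1.0 / (1.0 + np.exp(dE + a))       # dynamic step-size factor
    repel_on = 1.0 if dE >= 0.0 else 0.0           # unit step u(E(w) - E(w*))
    repeller = k * np.cbrt(w - w_star) * repel_on  # non-Lipschitzian terminal repeller
    return w - rho * subenergy * grad_E(w) + repeller

When E(w) < E(w*), the step function switches the repeller off and the update reduces to gradient descent with the dynamic learning rate of Equation (5.1.17); when E(w) ≥ E(w*) near w*, the gradient factor is small and the repeller dominates, which is the tunneling phase described below.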

Figure 5.1.4. A plot of a one-dimensional criterion function E(w) with local minimum at w*. The function E(w) - E(w*) is plotted below, as well as the global descent criterion function C(w, w*).

The update rule in Equation (5.1.15) automatically switches between two phases: a tunneling phase and a gradient descent phase. The tunneling phase is characterized by E(w) ≥ E(w*). Since for this condition the subenergy gradient term is nearly zero in the vicinity of the local minimum w*, the terminal repeller term in Equation (5.1.15) dominates, leading to the dynamical system

\Delta w_i = k (w_i - w_i^*)^{1/3} \qquad (5.1.16)

This system has an unstable repeller equilibrium point at w = w*; i.e., at the local minimum of E(w). The "power" of this repeller is determined by the constant k. Thus, the dynamical system given by Equation (5.1.15), when initialized with a small perturbation from w*, is repelled from this local minimum until it reaches a lower energy region E(w) < E(w*); i.e., tunneling through portions of E(w) where E(w) ≥ E(w*) is accomplished. The second phase is a gradient descent minimization phase, characterized by E(w) < E(w*). Here, the repeller term is identically zero. Thus, Equation (5.1.15) becomes

\Delta w_i = -\rho(w) \frac{\partial E}{\partial w_i} \qquad (5.1.17)

where ρ(w) is a dynamic learning rate (step size) equal to ρ / [1 + e^(E(w) - E(w*) + a)]. Note that ρ(w) is approximately equal to ρ when E(w*) is larger than E(w) + a.

Initially, w* is chosen as one corner of a domain in the form of a hyperparallelepiped of dimension J(n + 1) + L(J + 1), which is the dimension of w in the architecture of Figure 5.1.1. A slightly perturbed version of w*, namely w* + Δw, is taken as the initial state of the dynamical system in Equation (5.1.15). Here Δw is a small perturbation which drives the system into the domain of interest. If E(w* + Δw) < E(w*), the system immediately enters a gradient descent phase which equilibrates at a local minimum. Every time a new equilibrium is reached, w* is set equal to this equilibrium and Equation (5.1.15) is reinitialized with w = w* + Δw, which assures a necessary consistency in the search flow direction. Since w* is now a local minimum, E(w) ≥ E(w*) holds in the neighborhood of w*. Thus, the system enters a repelling (tunneling) phase, and the repeller at w* repels the system until it reaches a lower basin of attraction where E(w) < E(w*). As the dynamical system enters the next basin, the system automatically switches to gradient descent and equilibrates at the next lower local minimum. We then set w* equal to this new minimum and repeat the process. If, on the other hand, E(w* + Δw) ≥ E(w*) at the onset of training, then the system is initially in a tunneling phase. The tunneling will proceed to a lower basin, at which point it enters the minimization phase and follows the behavior discussed above. Training can be stopped when a minimum w* corresponding to E(w*) = 0 is reached, or when E(w*) becomes smaller than a preset threshold.

The global descent method is guaranteed to find the global minimum for functions of one variable, but not for multivariate functions. However, in the multidimensional case, the algorithm will always escape from one local minimum to another with a lower or equal functional value. Figure 5.1.5 compares the learning curve of global descent-based backprop to that of batch backprop for the four-bit parity problem in a feedforward net with four hidden units and a single output unit. The same initial random weights are used in both cases. The figure depicts one tunneling phase for the global descent algorithm before convergence to a (perfect) global minimum solution. In performing this simulation, it is found that the direction of the perturbation vector Δw is very critical for successfully reaching a global minimum. On the other hand, batch backprop converges to the first local minimum it reaches. This local solution represents a partial solution to the 4-bit parity problem (i.e., mapping error is present). Simulations using incremental backprop with the same initial weights as in the above simulations were also performed, but are not shown in the figure. Incremental backprop was able to produce both of the solutions shown in Figure 5.1.5; very small learning rates (ρo and ρh) often lead to imperfect local solutions, while relatively larger learning rates may lead to a perfect solution.

Figure 5.1.5. Learning curves for global descent-based and gradient descent-based batch backprop for the 4-bit parity problem.
