5.2 Backprop Enhancements and Variations

Learning with backprop is slow (Sutton, 1986 ; Huang and Lippmann, 1988). This is due to the characteristics of the error surface. The surface is characterized by numerous flat and steep regions. In addition, it has many troughs which are flat in the direction of search. These characteristics are particularly pronounced in classification problems, especially when the size of the training set is small (Hush et al., 1991).

Many enhancements of and variations to backprop have been proposed. These are mostly heuristic modifications with goals of increased speed of convergence, avoidance of local minima, and/or improvement in the network's ability to generalize. In this section, we present some common heuristics which may improve these aspects of backprop learning in multilayer feedforward neural networks.

5.2.1 Weight Initialization

Due to its gradient descent nature, backprop is very sensitive to initial conditions. If the choice of the initial weight vector w0 (here w is a point in the weight space being searched by backprop) happens to be located within the attraction basin of a strong local minima attractor (one where the minima is at the bottom of a steep-sided valley of the criterion/error surface), then the convergence of backprop will be fast and the solution quality will be determined by the depth of the valley relative to the depth of the global minima. On the other hand, backprop converges very slowly if w0 starts the search in a relatively flat region of the error surface.

An alternative explanation for the sensitivity of backprop to initial weights (as well as to other learning parameters) is advanced by Kolen and Pollack (1991). Using Monte Carlo simulations on simple feedforward nets with incremental backprop learning of simple functions, they discovered a complex fractal-like structure for convergence as a function of initial weights. They reported regions of high sensitivity in the weight space where two very close initial points can lead to substantially different learning curves. Thus, they hypothesize that these fractal-like structures arise in backprop due to the nonlinear nature of the dynamic learning equations which exhibit multiple attractors; rather than the gradient descent metaphor with local valleys to get stuck in, they advance a many-body metaphor where the search trajectory is determined by complex interactions with the systems attractors.

In practice, the weights are normally initialized to small zero-mean random values (Rumelhart et al., 1986b). The motivation for starting from small weights is that large weights tend to prematurely saturate units in a network and render them insensitive to the learning process (Hush et al., 1991 ; Lee et al., 1991) (this phenomenon is known as "flat spot" and is considered in Section 5.2.4). On the other hand, randomness is introduced as a symmetry breaking mechanism; it prevents units from adopting similar functions and becoming redundant.

A sensible strategy for choosing the magnitudes of the initial weights for avoiding premature saturation is to choose them such that an arbitrary unit i starts with a small and random weighted sum neti. This may be achieved by setting the initial weights of unit i to be on the order of where fi is the number of inputs (fan-in) for unit i (Wessels and Barnard, 1992). It can be easily shown that for zero-mean random uniform weights in and assuming normalized inputs which are randomly and uniformly distributed in the range , neti has zero-mean and has standard deviation . Thus, by generating uniform random weights within the range , the input to unit i (neti) is a random variable with zero mean and a standard deviation of unity, as desired.

In simulations involving single hidden layer feedforward networks for pattern classification and function approximation tasks, substantial improvements in backprop convergence speed and avoidance of "bad" local minima are possible by initializing the hidden unit weight vectors to normalized vectors selected randomly from the training set (Denoeux and Lengellé, 1993).

5.2.2 Learning Rate

The convergence speed of backprop is directly related to the learning rate parameter (o and h in Equations (5.1.2) and (5.1.9), respectively); if is small, the search path will closely approximate the gradient path, but convergence will be very slow due to the large number of update steps needed to reach a local minima. On the other hand, if is large, convergence will initially be very fast, but the algorithm will eventually oscillate and thus not reach a minimum. In general, it is desirable to have large steps when the search point is far away from a minimum with decreasing step size as the search approaches a minimum. In this section, we give a sample of the various approaches for selecting the proper learning rate.

One early proposed heuristic (Plaut et al., 1986) is to use constant learning rates which are inversely proportional to the fan-in of the corresponding units. Extensions of the idea of fan-in dependence of the learning rate have also been proposed by (Tesauro and Janssens, 1988). The increased convergence speed of backprop due to the utilization of the above method of setting the individual learning rates for each unit inversely proportional to the number of inputs to that unit has been theoretically justified by analyzing the eigenvalue distribution of the Hessian matrix of the criterion function, (Le Cun et al., 1991a). Such learning rate normalization can be intuitively thought of as maintaining balance between the learning speed of units with different fan-in. Without this normalization, after each learning iteration, units with high fan-in have their input activity (net) changed by a larger amount than units with low fan-in. Thus, and due to the nature of the sigmoidal activation function used, the units with large fan-in tend to commit their output to a saturated state prematurely and are rendered difficult to adapt (see Section 5.2.4 for additional discussion). Therefore, normalizing the learning rates of the various units by dividing by their corresponding fan-in helps speed up learning.

The optimal learning rate for fast convergence of backprop/gradient descent search is the inverse of the largest eigenvalue of the Hessian matrix H of the error function E, evaluated at the search point w. Computing the full Hessian matrix is prohibitively expensive for large networks with thousands of parameters are involved. Therefore, finding the largest eigenvalue max for speedy convergence seems rather inefficient. However, one may employ a shortcut to efficiently estimate max (Le Cun et al., 1993). This shortcut is based on a simple way of approximating the product of H by an arbitrarily chosen (random) vector z through Taylor expansion: where is a small positive constant. Now, using the power method, which amounts to iterating the procedure , the vector z converges to where cmax is the normalized eigenvector of H corresponding to max. Thus, the norm of the converged vector z gives a good estimate of |max|, and its reciprocal may now be used as the learning rate in backprop. An on-line version of this procedure is reported by Le Cun et al. (1993).

Many heuristics have been proposed so as to adapt the learning rate automatically. Chan and Fallside (1987) proposed an adaptation rule for that is based on the cosine of the angle between the gradient vectors and (here, t is an integer which represents iteration number). Sutton (1986) presented a method which can increase or decrease for each weight wi according to the number of sign changes observed in the associated partial derivative . This method was also studied empirically by Jacobs (1988). Franzini (1987) investigated a technique that heuristically adjusts , increasing it whenever is close to and decreasing it otherwise. Cater (1987) suggested using separate parameters, , one for each pattern xk. Silva and Almeida (1990, see also Vogl et al., 1988) used a method where the learning rate for a given weight wi is set to if and have the same sign, with ; if the partial derivatives have different signs, then a learning rate of is used, with . A similar, theoretically justified method for increasing the convergence speed of incremental gradient descent search is to set (t) = (t - 1) if E(t) has the same sign as , and otherwise (Pflug, 1990).

When the input vectors are assumed to be randomly and independently chosen from a probability distribution, we may view incremental backprop as a stochastic gradient descent algorithm, along the lines of the theory in Section 4.2.2. Thus, simply setting the learning rate to a constant results in persistent residual fluctuations around a local minimum w*. The variance of such fluctuations depends on the size of , the criterion function being minimized, and the training set. Based on results from stochastic approximation theory (Ljung, 1977), the "running average" schedule , with sufficiently small 0, guarantees asymptotic convergence to a local minimum w*. However, this schedule leads to very slow convergence. Here, one would like to start the search with a learning rate faster than but then ultimately converge to the rate as w* is approached. Unfortunately, increasing 0 can lead to instability for small t. Darken and Moody (1991) proposed the "search then converge" schedule which allows for faster convergence without compromising stability. In this schedule, the learning rate stays relatively high for a "search time" during which it is hoped that the weights will hover about a good minimum. Then, for times , the learning rate decreases as and the learning converges. Note that for , this schedule reduces to the running average schedule. So a procedure for optimizing is needed. A completely automatic "search then converge" schedule can be found in Darken and Moody (1992).

5.2.3 Momentum

Another simple approach to speed up backprop is through the addition of a momentum term (Plaut et al., 1986) to the right-hand side of the weight update rules in Equations (5.1.2) and (5.1.9). Here, each weight change wi is given some momentum so that it accelerates in the average downhill direction, instead of fluctuating with every change in the sign of the associated partial derivative . The addition of momentum to gradient search is formally stated as


where a is a momentum rate normally chosen between 0 and 1 and . Equation (5.2.1) is a special case of multi-stage gradient methods which have been proposed for accelerating convergence (Wegstein, 1958) and escaping local minima (Tsypkin, 1971).

The momentum term can also be viewed as a way of increasing the effective learning rate in almost-flat regions of the error surface while maintaining a learning rate close to r (here 0 r 1) in regions with high fluctuations. This can be seen by employing an N-step recursion and writing (5.2.1) as


If the search point is caught in a flat region, then will be about the same at each time-step and Equation (5.2.2) can be approximated as (with 0 a 1 and N large)


Thus, for flat regions, a momentum term leads to increasing the learning rate by a factor . On the other hand, if the search point is in a region of high fluctuation, the weight change will not gain momentum; i.e., the momentum effect vanishes. An empirical study of the effects of r and a on the convergence of backprop and on its learning curve can be found in Tollenaere (1990).

Adaptive momentum rates may also be employed. Fahlman (1989) proposed and extensively simulated a heuristic variation of backprop, called quickprop, which employs a dynamic momentum rate given by


With this adaptive a(t) substituted in (5.2.1), if the current slope is persistently smaller than the previous one but has the same sign, then a(t) is positive and the weight change will accelerate. Here, the acceleration rate is determined by the magnitude of successive differences between slope values. If the current slope is in the opposite direction from the previous one, it signals that the weights are crossing over a minimum. In this case, a(t) has a negative sign and the weight change starts to decelerate. Additional heuristics are used to handle the undesirable case where the current slope is in the same direction as the previous one, but has the same or larger magnitude; otherwise, this scenario would lead to taking an infinite step or moving the search point backwards, or up the current slope and toward a local maximum. Substituting Equation (5.2.4) in (5.2.1) leads to the update rule


It is interesting to note that Equation (5.2.5) corresponds to steepest gradient descent-based adaptation with a dynamically changing effective learning rate (t). This learning rate is given by the sum of the original constant learning rate and the reciprocal of the denominator of the second term in the right-hand side of Equation (5.2.5).

The use of error gradient information at two consecutive time steps in Equation (5.2.4) to improve convergence speed can be justified as being based on approximations of second-order search methods such as Newton's method. The Newton method (e.g., Dennis and Schnabel, 1983) is based on a quadratic model of the criterion E(w) and hence uses only the first three terms in a Taylor series expansion of E about the "current" weight vector wc:

This quadratic function is minimized by solving the equation which leads to Newton's method: . Here H is the Hessian matrix with components .

Newton's algorithm iteratively computes the weight changes w and works well when initialized within a convex region of E. In fact, the algorithm converges quickly if the search region is quadratic or nearly so. However this method is very computationally expensive since the computation H-1 requires O(N3) operations at each iteration (here, N is the dimension of the search space). Several authors have suggested computationally efficient ways of approximating Newton's method (Parker, 1987 ; Ricotti et al., 1988 ; Becker and Le Cun, 1989). Becker and Le Cun proposed an approach whereby the off-diagonal elements of H are neglected, thus arriving at the approximation


which is a "decoupled" form of Newton's rule where each weight is updated separately. The second term in the right-hand side of Equation (5.2.5) can now be viewed as an approximation of Newton's rule, since its denominator is a crude approximation of the second derivative of E at step t. In fact, this suggests that the weight update rule in Equation (5.2.5) may be used with  = 0.

As with Equation (5.2.4), special heuristics must be used in order to prevent the search from moving in the wrong gradient direction and in order to deal with regions of very small curvature, such as inflection points and plateaus, which cause wi in Equation (5.2.6) to blow-up. A simple solution is to replace the term in Equation (5.2.6) by   where m is a small positive constant. The approximate Newton method described above is capable of scaling the descent step in each direction. However, because it neglects off-diagonal Hessian terms, it is not able to rotate the search direction as in the exact Newton's method. Thus, this approximate rule is only efficient if the directions of maximal and minimal curvature of E happen to be aligned with the weight space axes. Bishop (1992) reported a somewhat efficient technique for computing the elements of the Hessian matrix exactly, using multiple feedforward propagation through the network, followed by multiple backward propagation.

Another approach for deriving theoretically justifiable update schedules for the momentum rate in Equation (5.2.1) is to adjust a(t) at each update step such that the gradient descent search direction is "locally" optimal. In "optimum" steepest descent (also known as best-step steepest descent), the learning rate is set at time t such that it minimizes the criterion function E at time step + 1; i.e., we desire a r which minimizes . Unfortunately, this optimal learning step is impractical since it requires the computation of the Hessian 2E at each time step (refer to Problem 5.2.12 for an expression for the optimal ). However, we may still use some of the properties of the optimal in order to accelerate the search, as we demonstrate next.

When w(t) is specified, the necessary condition for minimizing E[w(t+1)] is (Tompkins, 1956; Brown, 1959)


This implies that the search direction in two successive steps of optimum steepest descent are orthogonal. The easiest method to enforce the orthogonal requirement is the Gram-Schmidt orthogonalization method. Suppose that we know the search direction at time - 1, denoted d(- 1), and that we compute the "exact" gradient E(t) (used in batch backprop) at time step t [to simplify notation, we write E[w(t)] as E(t)]. Now, we can satisfy the condition of orthogonal consecutive search directions by computing a new search direction, employing Gram-Schmidt orthogonolization (Yu et al., 1993)


Performing descent search in the direction d(t) in Equation (5.2.8) leads to the weight vector update rule


where the relation w(- 1) = w(t) - w(- 1) = +r d(- 1) has been used. Comparing the component-wise weight update version of Equation (5.2.9) to Equation (5.2.1) reveals another adaptive momentum rate given by

Another similar approach is to set the current search direction d(t) to be a compromise between the current "exact" gradient E(t) and the previous search direction d(t - 1), i.e., , with . This is the basis for the conjugate gradient method in which the search direction is chosen (by appropriately setting b) so that it distorts as little as possible the minimization achieved by the previous search step. Here, the current search direction is chosen to be conjugate (with respect to H) to the previous search direction. Analytically, we require d(- 1)T H(- 1)d(t) = 0 where the Hessian is assumed to be positive-definite. In practice, b, which plays the role of an adaptive momentum, is chosen according to the Polack-Ribiére rule (Polack and Ribiére, 1969 ; Press et al., 1986) :

Thus, the search direction in the conjugate gradient method at time t is given by

Now, using and substituting the above expression for d(t) in w(t) =  d(t) leads to the weight update rule:


When E is quadratic, the conjugate gradient method theoretically converges in N or fewer iterations. In general, E is not quadratic and therefore this method would be slower than what the theory predicts. However, it is reasonable to assume that E is approximately quadratic near a local minimum. Therefore, conjugate gradient descent is expected to accelerate the convergence of backprop once the search enters a small neighborhood of a local minimum. As a general note, the basic idea of conjugate gradient search was introduced by Hestenes and Stiefel (1952). Beckman (1964) gives a good account of this method. Battiti (1992) and van der Smagt (1994) gave additional characterization of second-order backprop (such as conjugate gradient-based backprop) from the point of view of optimization. The conjugate gradient method has been applied to multilayer feedforward neural net training (Kramer and Sangiovanni-Vincentelli, 1989 ; Makram-Ebeid et al., 1989 ; van der Smagt, 1994) and is shown to outperform backprop in speed of convergence.

It is important to note that the above second-order modifications to backprop improve the speed of convergence of the weights to the "closest" local minimum. This faster convergence to local minima is the direct result of employing a better search direction as compared to incremental backprop. On the other hand, the stochastic nature of the search directions of incremental backprop and its fixed learning rates can be an advantage since it allows the search to escape shallow local minima, which generally leads to better solution quality. These observations suggest the use of hybrid learning algorithms (Møller, 1990 ; Gorse and Shepherd, 1992) where one starts with incremental backprop and then switches to conjugate gradient-based backprop for the final convergence phase. This hybrid method has its roots in a technique from numerical analysis known as Levenberg-Marquardt optimization (Press et al., 1986).

As a historical note, we mention that the concept of gradient descent was first introduced by Cauchy (1847) for use in the solution of simultaneous equations; the method has enjoyed popularity ever since. It should also be noted that some of the above enhancements to gradient search date back to the fifties and sixties and are discussed in Tsypkin (1971). Additional modifications of the gradient descent method which enhances its convergence to global minima are discussed in Section 8.1. For a good survey of gradient search the reader is referred to the book by Polyak (1987).

5.2.4 Activation Function

As indicated earlier in this section, backprop suffers from premature convergence of some units to flat spots. During training, if a unit in a multilayer network receives a weighted signal net with a large magnitude, this unit outputs a value close to one of the saturation levels of its activation function. If the corresponding target value (desired target value for an output unit, or an unknown "correct" hidden target for a hidden unit) is substantially different from that of the saturated unit, we say that the unit is incorrectly saturated or has entered a flat spot. When this happens, the size of the weight update due to backprop will be very small even though the error is relatively large, and it will take an excessively long time for such incorrectly saturated units to reverse their states. This situation can be explained by referring to Figure 5.2.1., where the activation function f(net) = tanh(b net) and its derivative f ' [given in Equation (5.1.12)] are plotted for b = 1.

Figure 5.2.1. Plots of f(net) = tanh(net) and its derivative '(net).

Here, when net has large magnitude, f ' approaches zero. Thus, the weight change approaches zero in Equations (5.1.2) and (5.1.9) even when there is a large difference between the actual and desired/correct output for a given unit.

A simple solution to the flat spot problem is to bias the derivative of the activation function (Fahlman, 1989); i.e., replace fo' and fh' in Equation (5.1.2) and (5.1.9) by fo' + e and fh' + e, respectively (a typical value for e is 0.1). Hinton (1987a) suggested the use of a nonlinear error function that goes to infinity at the points where f ' goes to zero, resulting in a finite non-zero error value (see Franzini (1987) for an example of using such an error function). The entropic criterion of Chapter 3, Equation (3.1.76), is a good choice for the error function since it leads to an output unit update rule similar to that of Equation (5.1.2) but without the fo' term (note, however, that the update rule for the hidden units would still have the derivative term. See Equation (5.2.18) in Section 5.2.6). One may also modify the basic sigmoid activation function in backprop in order to reduce flat spot effects. The use of the homotopy activation function is one such example (Yang and Yu, 1993). Here, forms a homotopy between a linear and a sigmoid function with . Initially, is set to 1; that is, all nodes have linear activations. Backprop is used to achieve a minimum in E, then is decreased (monotonically) and backprop is continued until is zero. That is, the activation function recovers its sigmoidal nature gradually as training progresses. Since and is nonzero for most of the training phase, flat spot effects are eliminated. Besides reducing the effects of flat spot, the homotopy function also helps backprop escape some local minima. This can be seen by noting that when  = 1, the error function is a polynomial of w which has a relatively smaller number of local minima than . Because we have achieved a minimum point of , which can provide a relatively better initial point for minimizing , many unwanted local minima are avoided. An alternative explanation of the effect of a gradually increasing activation function slope on the avoidance of local minima is given in Section 8.4 based on the concept of mean-field annealing.

Another method for reducing flat spot effects involves dynamically updating the activation slope (l and b in Equations (5.1.11) and (5.1.12), respectively) such that the slope of each unit is adjusted, independently, in the direction of reduced output error (Tawel, 1989 ; Kufudaki and Horejs, 1990 ; Rezgui and Tepedelenlioglu, 1990 ; Kruschke and Movellan, 1991; Sperduti and Starita, 1991, 1993). Gradient descent on the error surface in the activation function's slope space leads to the following update rules (assuming hyperbolic tangent activation functions)




for the lth output unit and the jth hidden unit, respectively. Here, o and h are small positive constants. Typically, when initialized with slopes near unity, Equations (5.2.10) and (5.2.11) reduce the activation slopes toward zero, which increases the effective dynamic range of the activation function which, in turn, reduces flat spot effects and therefore allows the weights to update rapidly in the initial stages of learning. As the algorithm begins to converge, the slope starts to increase and thus restores the saturation properties of the units. It is important to note here that the slope adaptation process just described becomes a part of the backprop weight update procedure; the slopes are updated after every weight update step.

Other nonsigmoid activation functions may be utilized as long as they are differentiable (Robinson et al., 1989). From the discussion on the approximation capabilities of multilayer feedforward networks in Section 2.3, a wide range of activation functions may be employed without compromising the universal approximation capabilities of such networks. However, the advantages of choosing one particular class of activation functions (or a mixture of various functions) is not completely understood. Moody and Yarvin (1992) reported an empirical study where they have compared feedforward networks with a single hidden layer feeding into a single linear output unit, each network employing different type of differentiable nonlinear activation function. The types of activation functions considered by Moody and Yarvin included the sigmoid logistic function, polynomials, rational functions (ratios of polynomials), and Fourier series (sums of cosines). Benchmark simulations on a few data sets representing noisy data with only mild nonlinearity and noiseless data with a high degree of nonlinearity were performed. It was found that the networks with nonsigmoidal activations attained superior performance on the highly nonlinear noiseless data. On the set of noisy data with mild nonlinearity, however, polynomials did poorly, whereas rationals and Fourier series showed better performance and were comparable to sigmoids.

Other methods for improving the training speed of feedforward multilayer networks involve replacing the sigmoid units by Gaussian or other units. These methods are covered in Chapter 6.

5.2.5 Weight Decay, Weight Elimination, and Unit Elimination

In Chapter 4 (Section 4.8), we saw that in order to guarantee good generalization, the number of degrees of freedom or number of weights (which determines a network's complexity) must be considerably smaller than the amount of information available for training. Some insight into this matter can be gained from considering an analogous problem in curve fitting (Duda and Hart, 1973 ; Wieland and Leighton, 1987). For example, consider the rational function which is plotted in Figure 5.2.2 (solid line). And, assume that we are given a set of 15 samples (shown as small circles) from which we are to find a "good" approximation to g(x). Two polynomial approximations are shown in Figure 5.2.2: An eleventh-order polynomial (dashed line) and an eighth-order polynomial (dotted line). These approximations are computed by minimizing the SSE criterion over the sample points. The higher order polynomial has about the same number of parameters as the number of training samples, and thus is shown to give a very close fit to the data, this is referred to as " memorization." However, it is clear from the figure that this polynomial does not provide good "generalization" (i.e., it does not provide reliable interpolation and/or extrapolation) over the full range of the data. On the other hand, fitting the data by an eighth-order polynomial leads to relatively better overall interpolations over a wider range of x values (refer to the dotted line in Figure 5.2.2). In this case, the number of free parameters is equal to nine which is smaller than the number of training samples. This "undetermined" nature leads to an approximation function that better matches the "smooth" function g(x) being approximated. Trying to use a yet lower order polynomial (e.g., fifth-order or less) leads to a poor approximation because this polynomial would not have sufficient "flexibility" to capture the nonlinear structure in g(x).

Figure 5.2.2. Polynomial approximation for the function (shown as a solid line), based on the 15 samples shown (small circles). The objective of the approximation is to minimize the sum of squared error criterion. The dashed line represents an eleventh-order polynomial. A better overall approximation for g(x) is given by an eighth-order polynomial (dotted line).

The reader is advised to consider the nature and complexity of this simple approximation problem by carefully studying Figure 5.2.2. Here, the total number of possible training samples of the form (xg(x)) is uncountably infinite. From this huge set of potential data, though, we close only 15 samples to try to approximate the function. In this case, the approximation involved minimizing an SSE criterion function over these few sample points. Clearly, however, a solution which is globally (or near globally) optimal in terms of sum-squared error over the training set (for example, the eleventh order polynomial) may be hardly appropriate in terms of interpolation (generalization) between data points. Thus, one should choose a class of approximation functions which penalizes unnecessary fluctuations between training sample points. Neural networks satisfy this approximation property and are thus superior to polynomials in approximating arbitrary nonlinear functions from sample points (see further discussion given below). Figure 5.2.3 show the results of simulations involving the approximation of the function g(x), with the same set of samples used in the above simulations, using single hidden layer feedforward neural nets. Here, all hidden units employ the hyperbolic tangent activation function (with a slope of 1), and the output unit is linear. These nets are trained using the incremental backprop algorithm [given by Equations (5.1.2) and (5.1.9)] with and h = 0.01. Weights are initialized randomly and uniformly over the range [-0.2, +0.2]. The training was stopped when the rate of change of the SSE became insignificantly small. The dotted line in Figure 5.2.3 is for a net with three hidden units (which amounts to 10 degrees of freedom). Surprisingly, increasing the number of hidden units to 12 units (37 degrees of freedom) improved the quality of the fit as shown by the dashed line in the figure. By comparing Figures 5.2.3 and 5.2.2, it is clear that the neural net approximation for g(x) is superior to that of polynomials in terms of accurate interpolation and extrapolation.

Figure 5.2.3. Neural network approximation for the function (shown as a solid line). The dotted line was generated by a 3-hidden unit feedforward net. The dashed line, which is shown to have substantial overlap with g(x), was generated by a 12-hidden unit feedforward net. In both cases, standard incremental backprop training was used.

The generalization superiority of the neural net can be attributed to the bounded and smooth nature of the hidden unit responses, as compared to the potentially divergent nature of polynomials. The bounded unit response localizes the nonlinear effects of individual hidden units in a neural network and allows for the approximations in different regions of the input space to be independently tuned. This approximation process is similar in its philosophy to the traditional spline technique for curve fitting (Schumaker, 1981). Hornik et al. (1990) gave related theoretical justification for the usefulness of feedforward neural nets with sigmoidal hidden units in function approximation. They showed that, in addition to approximating the training set, the derivative of the output of the network evaluated at the training data points is also a good approximation of the derivative of the unknown function being approximated. This result explains the good extrapolation capability of neural nets observed in simulations. For example, the behavior of the neural net output shown in Figure 5.2.3 for x > 10 and x < -5 is a case in point. It should be noted, though, that in most practical situations the training data is noisy. Hence, an exact fit of this data must be avoided, which means that the degrees of freedom of a neural net approximator must be constrained. Otherwise, the net will have a tendency for overfitting. This issue is explored next.

Once we decide on a particular approximation function or network architecture, generalization can be improved if the number of free parameters in the net is optimized. Since it is difficult to estimate the optimal number of weights (or units) a priori, there has been much interest in techniques that automatically remove excess weights and/or units from a network. These techniques are sometimes referred to as network " pruning" algorithms and are surveyed in Reed (1993).

One of the earliest and simplest approaches to remove excess degrees of freedom from a neural network is through the use of simple weight decay (Plaut et al., 1986 ; Hinton, 1986) in which each weight decays towards zero at a rate proportional to its magnitude, so that connections disappear unless reinforced. Hinton (1987b) gave empirical justification by showing that such weight decay improves generalization in feedforward networks. Krogh and Hertz (1992) gave some theoretical justification for this generalization phenomena.

Weight decay in the weight update equations of backprop can be accounted for by adding a complexity (regularization) term to the criterion function E that penalizes large weights,


Here, represents the relative importance of the complexity term with respect to the error term E(w) [note that the second term in Equation (5.2.12) is a regularization term as in Equation (4.1.3)]. Now, gradient search for minima of J(w) leads to the following weight update rule


which shows an exponential decay in wi if no learning occurs. Because it penalizes more weights than necessary, the criterion function in Equation (5.1.12) overly discourages the use of large weights where a single large weight costs much more than many small ones. Weigend et al. (1991) proposed a procedure of weight-elimination given by minimizing


where the penalty term on the right-hand side helps regulate weight magnitudes and w0 is a positive free parameter which must be determined. For large w0, this procedure reduces to the weight decay procedure described above and hence favors many small weights, whereas if w0 is small, fewer large weights are favored. Also, note that when , the cost of the weight approaches one (times ) which justifies the interpretation of the penalty term as a counter of large weights. In practice, a w0 close to unity is used. It should be noted that the above weight elimination procedure is very sensitive to the choice of . A heuristic for adjusting dynamically during learning is described in Weigend et al. (1991). For yet other forms of the complexity term, the reader is referred to Nowlan and Hinton (1992a and 1992b), and Section 5.2.7.

The above ideas have been extended to unit elimination (e.g., see Hanson and Pratt, 1989 ; Chauvin, 1989 ; Hassoun et al., 1990). Here, one would start with an excess of hidden units and dynamically discard redundant ones. As an example, one could penalize redundant units by replacing the weight decay term in Equation (5.2.13) by for all weights of hidden units, which leads to the hidden unit update rule


Generalization in feedforward networks can also be improved by utilizing network construction procedures, as opposed to weight or unit pruning. Here, we start with a small network and allow it to grow gradually (add more units) in response to incoming data. The idea is to keep the network as small as possible. In Chapter 6 (Section 6.3) three adaptive networks having unit-allocation capabilities are discussed. Further details on network construction procedures can be found in Marchand et al. (1990) , Frean (1990) , Fahlman and Lebiere (1990) , and Mézard and Nadal (1989).

5.2.6 Cross Validation

An alternative or complementary strategy to the above methods for improving generalization in feedforward neural networks is suggested by findings based on empirical results (Morgan and Bourlard, 1990; Weigend et al., 1991 ; Hergert et al., 1992). In simulations involving backprop training of feedforward nets on noisy data, it is found that the validation (generalization) error decreases monotonically to a minimum but then starts to increase, even as the training error continues to decrease. This phenomenon is depicted in the conceptual plot in Figure 5.2.4, and is illustrated through the computer simulation given next.

Figure 5.2.4. Training error (dashed curve) and validation error (solid curve) encountered in training multilayer feedforward neural nets using backprop.

Consider the problem of approximating the rational function g(x) plotted in Figure 5.2.3, from a set of noisy sample points. This set of points is generated from the 15 perfect samples, shown in Figure 5.2.3, by adding zero-mean normally distributed random noise whose variance is equal to 0.25. A single hidden layer feedforward neural net is used with 12 sigmoidal hidden units and a single linear output unit. It employs incremental backprop training with o = 0.05, h = 0.01, and initial random weights in . After 80 training cycles on the 15 noisy samples, the net is tested for uniformly sampled inputs x in the range [-8, 12]. The output of this 80-cycle net is shown as a dashed line in Figure 5.2.5. Next, the training continued and then stopped after 10,000 cycles. The output of the resulting net is shown as a dotted line in the figure. Comparing the two approximations in Figure 5.2.5 leads to the conclusion that the partially trained net is superior to the excessively trained net in terms of overall interpolation and extrapolation capabilities. Further insight into the dynamics of the generalization process for this problem can be gained from Figure 5.2.6. Here, the validation RMS error is monitored by testing the net on a validation set of 294 perfect samples, uniformly spaced in the interval [-8, 12], after every 10 training cycles. This validation error is shown as the dashed line in Figure 5.2.6. The training error (RMS error on the training set of 15 points) is also shown in the figure as a solid line. Note that the optimal net in terms of overall generalization capability is the one obtained after about 80 to 90 training cycles. Beyond this training point, the training error keeps decreasing, while the validation error increases. It is interesting to note the non-monotonic behavior of the validation error between training cycles 2000 and 7000. This suggests that, in general, multiple local minima may exist in the validation error curves of backprop trained feedforward neural networks. The location of these minima is a complex function of the network size, weight initialization, and learning parameters. To summarize, when training with noisy data, excessive training usually leads to overfitting. On the other hand, partial training may lead to a better approximation of the unknown function in the sense of improved interpolation and, possibly, improved extrapolation.

Figure 5.2.5. Two different neural network approximations of the rational function g(x), shown as a solid line, from noisy samples. The training samples shown are generated from the 15 perfect samples in Figure 5.2.3 by adding zero-mean normally distributed random noise with 0.25 variance. Both approximations resulted from the same net with 12 hidden units and incremental backprop learning. The dashed line represents the output of the net after 80 learning cycles. After completing 10,000 learning cycles, the same net generates the dotted line output.

Figure 5.2.6. Training and validation RMS errors for the neural net approximation of the function g(x). The training set consists of the 15 noisy samples in Figure 5.2.5. The validation set consists of 294 perfect samples uniformly spaced in the interval [-8, 12]. The validation error starts lower than the training error mainly because perfect samples are used for validation.

The generalization phenomenon depicted in Figure 5.2.4 (and illustrated by the simulation in Figures 5.2.5 and 5.2.6) does not currently have a complete theoretical justification. A qualitative explanation for it was advanced by Weigend et al. (1991). They explain that to a first approximation, backprop initially adapts the hidden units in the network such that they all attempt to fit the major features of the data. Later, as training proceeds, some of the units then start to fit the noise in the data. This later process continues as long as there is error and as long as training continues (this is exactly what happens in the simulation of Figure 5.2.5). The overall process suggests that the effective number of free parameters (weights) starts small (even if the network is oversized) and gets larger approaching the true number of adjustable parameters in the network as training proceeds. Baldi and Chauvin (1991) derived analytical results on the behavior of the validation error in LMS-trained single layer feedforward networks learning the identity map from noisy autoassociation pattern pairs. Their results agree with the above generalization phenomenon in nonlinear multilayer feedforward nets.

Therefore, a suitable strategy for improving generalization in networks of non-optimal size is to avoid " overtraining" by carefully monitoring the evolution of the validation error during training and stopping just before it starts to increase. This strategy is based on one of the early criteria in model evaluation known as cross-validation (e.g., see Stone, 1978). Here, the whole available data set is split into three parts: Training set, validation set, and prediction set. The training set is used to determine the values of the weights of the network. The validation set is used for deciding when to terminate training. Training continues as long as the performance on the validation set keeps improving. When it ceases to improve, training is stopped. The third part of the data, the prediction set, is used to estimate the expected performance (generalization) of the trained network on new data. In particular, the prediction set should not be used for validation during the training phase. Note that this heuristic requires the application to be data-rich. Some applications, though, suffer from scarcity of training data which makes this method inappropriate. The reader is referred to Finnoff et al., (1993) for an empirical study of cross-validation-based generalization and its comparison to weight decay and other generalization-inducing methods.

See how it works interactively

5.2.7 Criterion Functions

As seen earlier in Section 3.1.5, other criterion/error functions can be used from which new versions of the backprop weight update rules can be derived. Here, we consider two such criterion functions: (1) Relative entropy and (2) Minkowski-r<. Starting from the instantaneous entropy criterion ( Baum and Wilczek, 1988)


and employing gradient descent search, we may obtain the learning equations




for the output and hidden layer units, respectively. Equations (5.2.17) and (5.2.18) assume hyperbolic tangent activations at both layers.

From Equation (5.2.17) we see that the fo' term present in the corresponding equation of standard backprop [Equation (5.1.2)] has now been eliminated. Thus, the output units do not have a flat spot problem; on the other hand, fh' still appears in Equation (5.2.18) for the hidden units (this derivative appears implicitly as the term in the standard backprop equation). Therefore, the flat spot problem is only partially solved by employing the entropy criterion.

The entropy-based backprop is well suited to probabilistic training data. It has a natural interpretation in terms of learning the correct probabilities of a set of hypotheses represented by the outputs of units in a multilayer neural network. Here, the probability that the lth hypothesis is true given an input pattern xk is determined by the output of the lth unit as . The entropy criterion is a " well formed" error function ( Wittner and Denker, 1988); the reader is referred to Section 3.1.5 for a definition and discussion of "well formed" error functions. Such functions have been shown in simulations to converge faster than standard backprop ( Solla et al., 1988).

Another choice is the Minkowski-rcriterion function ( Hanson and Burr, 1988):


which leads to the following weight update equations:




where sgn is the sign function. These equations reduce to those of standard backprop for the case r = 2. The motivation behind the use of this criterion is that it can lead to maximum likelihood estimation of weights for Gaussian and non-Gaussian input data distributions by appropriately choosing r (e.g., r = 1 for data with Laplace distributions). A small r (1 r < 2) gives less weight for large deviations and tends to reduce the influence of outlier-points in the input space during learning. On the other hand, when noise is negligible, the sensitivity of the separating surfaces implemented by the hidden units to the geometry of the problem may be increased by employing r > 2. Here, fewer hidden units are recruited when learning complex nonlinearly separable mappings for larger r values ( Hanson and Burr, 1988).

If no a priori knowledge is available about the distribution of the training data, it would be difficult to estimate a value for r, unless extensive experimentation with various r values (e.g., r = 1.5, 2, 3) is done. Alternatively, an automatic method for estimating r is possible by adaptively updating r in the direction of decreasing E. Here, steepest gradient descent on E(r) results in the update rule


which when restricting r to be strictly greater than 1 (metric error measure case) may be approximated as


Note that it is important that the r update rule be invoked much less frequently than the weight update rule (for example, r is updated once every 10 training epochs of backprop).

The idea of increasing the learning robustness of backprop in noisy environments can be placed in a more general statistical framework ( White, 1989) where the technique of robust statistics ( Huber, 1981) takes effect. Here, robustness of learning refers to insensitivity to small perturbations in the underlying probability distribution p(x) of the training set. These statistical techniques motivate the replacement of the linear error in Equations (5.1.2) and (5.1.9) by a nonlinear error suppressor function which is compatible with the underlying probability density function p(x). One example is to set with 1 r < 2. This error suppresser leads to exactly the Minkowski-r weight update rule of Equations (5.2.20) and (5.2.21). In fact, the case r = 1 is equivalent to minimizing the summed absolute error criterion which is known to suppress outlier data points. Similarly, the selection ( Kosko, 1992) leads to robust backprop if p(x) has long tails such as a Cauchy distribution or some other infinite variance density.

Furthermore, regularization terms may be added to the above error functions E(w) in order to introduce some desirable effects such as good generalization, smaller effective network size, smaller weight magnitudes, faster learning, etc. ( Poggio and Girosi, 1990a; Mao and Jain, 1993). The regularization terms in Equations (5.2.12) and (5.2.14) used for enhancing generalization through weight pruning/elimination are examples. Another possible regularization term is ( Drucker and Le Cun, 1992) which has been shown to improve backprop generalization by forcing the output to be insensitive to small changes in the input. It also helps speed up convergence by generating hidden layer weight distributions that have smaller variances than that generated by standard backpropagation. (Refer to Problem 5.2.11 for yet another form of regularization.)

Weight-sharing, a method where several weights in a network are controlled by a single parameter, is another way for enhancing generalization ( Rumelhart et al., 1986b, also see Section 5.3.3 for an application). It imposes equality constraints among weights, thus reducing the number of free (effective) parameters in the network which leads to improved generalization. An automatic method for affecting weight sharing can be derived by adding the regularization term ( Nowlan and Hinton, 1992a and 1992b):


to the error function, where each pj(wi) is a Gaussian density with mean j and variance j, the j is the mixing proportion of Gaussian pj with , and wi represents an arbitrary weight in the network. The j, j, and j parameters are assumed to adapt as the network learns. The use of multiple adaptive Gaussians allows the implementation of "soft-weight sharing," in which the learning algorithm decides for itself which weights should be tied together. If the Gaussians all start with high variance, the initial grouping of weights into subsets will be very soft. As the network learns and the variance shrinks, those groupings become more and more distinct and converge to subsets influenced by the task being learned.

For gradient descent-based adaptation, one may employ the partial derivatives

, (5.2.25)

, (5.2.26)

, (5.2.27)





It should be noted that the derivation of the partial of R with respect to the mixing proportions is less straight forward than those in Equations (5.2.25) through (5.2.27) since we must maintain the sum of the j's to 1. Thus, the result in Equation (5.2.28) has been obtained by appropriate use of a Lagrange multiplier method and a bit of algebraic manipulation. The term rj(wi) in Equations (5.2.25) through (5.2.29) is the posterior probability of Gaussian j given weight wi; i.e., it measures the responsibility of Gaussian j for the ith weight. Equation (5.2.25) attempts to pull the weights towards the center of the "responsible" Gaussian. It realizes a competition mechanism among the various Gaussians for taking on responsibility of weight wi. The partial derivative for j drives j toward the weighted average of the set of weights for which Gaussian j is responsible. Similarly, one may come up with simple interpretations for the derivatives in Equations (5.2.27) and (5.2.28). To summarize, the penalty term in (5.2.24) leads to unsupervised clustering of weights (weight sharing) driven by the biases in the training set.

Goto [5.0] [5.1] [5.3] [5.4] [5.5]

Back to the Table of Contents

Back to Main Menu