8.1 Local Versus Global Search

Consider the optimization problem of finding the extreme point(s) of a real-valued multi-dimensional scalar function (objective function) of the form y: Ω → R, where the search space Ω is a compact subset of Rⁿ. An extreme point is a point x* in Ω such that y(x*) takes on its maximum (or minimum) value. In the following discussion, it will be assumed, without loss of generality, that by optimization we mean minimization. Thus, an extreme point of y is the "global" minimum of y. Multiple extreme points may exist. In addition to the global minimum (minima), the function y may also admit local minima. A point x* is a local minimum of y if y(x*) ≤ y(x) for all x such that ||x* − x|| ≤ ε, for some ε > 0. Figure 8.1.1 illustrates the concepts of global and local minima for the uni-variate scalar function y(x) = x sin(1/x) for x ∈ [0.05, 0.5].

Figure 8.1.1. Local and global minima for the function y(x) = x sin(1/x).

There exist several ways of determining the minima of a given function. Analytical techniques exploit Fermat's Stationarity Principle, which states that the gradient (or derivative in the uni-variate case) of y with respect to x is zero at all minima (and maxima). Thus, one can find these minima (and maxima) by solving the set of equations (possibly nonlinear) of the form ∇y(x) = 0. Here, a solution x* is a minimum of y if the Hessian matrix, H(x) = ∇²y(x), evaluated at x*, is positive definite (i.e., xᵀH(x*)x > 0 for all x ≠ 0), or if y″(x*) > 0 in the uni-variate case. The global minimum may then be identified by direct evaluation of y(x*). This method is theoretically sound as long as the function y is twice differentiable. In practical situations, though, the above approach is inefficient due to the computational overhead involved in its implementation on a digital computer.
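For the function of Figure 8.1.1, the stationarity condition can be checked numerically. The following sketch applies Newton's method to y′(x) = 0 for y(x) = x sin(1/x) and classifies the resulting stationary point by the sign of y″; the starting guess (chosen near the global minimum) and the iteration count are illustrative assumptions, not values from the text:

```python
import math

# First two derivatives of y(x) = x*sin(1/x), by straightforward calculus:
def yp(x):   # y'(x) = sin(1/x) - (1/x)*cos(1/x)
    return math.sin(1/x) - (1/x)*math.cos(1/x)

def ypp(x):  # y''(x) = -sin(1/x)/x^3
    return -math.sin(1/x) / x**3

def stationary_point(x, iters=20):
    """Newton's method applied to the stationarity condition y'(x) = 0."""
    for _ in range(iters):
        x = x - yp(x) / ypp(x)
    return x

x_star = stationary_point(0.22)   # converges to the stationary point near 0.223
is_minimum = ypp(x_star) > 0      # positive curvature => local minimum
```

Starting the iteration near 0.13 instead would converge to a stationary point with y″ < 0, i.e., a local maximum, illustrating why the second-derivative (Hessian) test is needed to separate minima from maxima.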

Other, more efficient optimization techniques do exist. One example is the simple (steepest) gradient descent algorithm introduced in Chapter 3 [e.g., see Equation (3.1.21)], which formed the basis for most of the learning rules discussed so far in this book. Assuming y is differentiable, gradient descent can be expressed according to the recursive rule:

x(t + 1) = x(t) − η∇y(x(t))        (8.1.1)

where η is a small positive constant, and t is a positive integer representing the iteration number. Given an initial "guess" x(0) (e.g., a random point in Ω), the recursive rule in Equation (8.1.1) implements a search strategy, whereby a sequence of vectors:

x(0), x(1), x(2), ..., x(t), ...

is generated such that

y(x(0)) ≥ y(x(1)) ≥ y(x(2)) ≥ ... ≥ y(x(t)) ≥ ...

and hence at each iteration we move closer to an optimal solution. Although computationally efficient (its convergence properties and some extensions were considered in Section 5.2.3), gradient search methods may lead only to local minima of y which happen to be "close" to the initial search point x(0). As an example, consider the one-dimensional function y(x) = x sin(1/x) shown in Figure 8.1.1. It is clear from this figure that given any initial search point x(0) ∈ [0.05, 0.12], the gradient descent algorithm will always converge to one of the two local minima shown. In this case, the region of search space containing the global solution will never be explored. Hence, for local search algorithms such as gradient descent, the quality (optimality) of the final solution is highly dependent upon the selection of the initial search point.
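This trapping behavior is easy to reproduce. In the minimal sketch below (the step size η and iteration count are illustrative assumptions), plain gradient descent on y(x) = x sin(1/x) started at x(0) = 0.07 settles into the nearby local minimum at x ≈ 0.058 and never reaches the global minimum at x ≈ 0.223:

```python
import math

def dy(x):
    # derivative of y(x) = x*sin(1/x)
    return math.sin(1/x) - (1/x)*math.cos(1/x)

eta = 1e-4    # small positive step size (assumed value)
x = 0.07      # initial search point, as in the discussion of Figure 8.1.1
for t in range(10000):
    x -= eta * dy(x)    # the recursive rule of Equation (8.1.1)

# x has converged to the local minimum near 0.058, not the global one at 0.223
```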

Global minimization (optimization) requires a global search strategy; a strategy that cannot be easily fooled by local minima. In Sections 8.2, 8.4, and 8.5 we will discuss three commonly used global search strategies. For a survey of global optimization methods, the reader is referred to the book by Törn and Žilinskas (1989). As for the remainder of this section, we shall discuss extensions which help transform a simple gradient search into a global search.

Using intuition and motivated by the saying: "There is a valley behind every mountain," we may employ gradient descent/ascent in a simple search strategy which will allow the discovery of global minima. Assuming a uni-variate objective function y(x), x ∈ [a, b] with a < b, start with x(0) = a and use gradient descent search to reach the first local minimum x1*. Next, save the value y(x1*) and proceed by ascending the function y (using Equation (8.1.1) with the negative sign replaced by a positive sign) starting from the initial point x1* + Δx, where Δx is a sufficiently small positive constant. Continue ascending until a local maximum is reached. Now, switch back to gradient descent starting from the current point (maximum) perturbed by Δx until convergence to the second local minimum x2*, and save the value y(x2*). This full search cycle is repeated until the search reaches the point x = b. At this point, all minima of y over x ∈ [a, b] have been obtained, and the global minimum is the point xi* with the smallest saved value y(xi*). This strategy will always lead to the global minimum when y is a function of a single variable; i.e., when the search space is one-dimensional.
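A minimal sketch of this descent/ascent sweep for y(x) = x sin(1/x) on [0.05, 0.5] might look as follows; the step size, iteration counts, and perturbation Δx = 10⁻³ are illustrative assumptions, and iterates are clipped to the interval so the sweep terminates at b:

```python
import math

def dy(x):  # derivative of y(x) = x*sin(1/x)
    return math.sin(1/x) - (1/x)*math.cos(1/x)

def slide(x, direction, a, b, eta=1e-4, steps=20000):
    """direction = -1: gradient descent; direction = +1: gradient ascent.
    The iterate is kept inside [a, b]."""
    for _ in range(steps):
        x = min(max(x + direction * eta * dy(x), a), b)
    return x

a, b, dx = 0.05, 0.5, 1e-3   # search interval and perturbation (assumed values)
minima, x = [], a
while x < b - dx:
    x = slide(x, -1, a, b)        # descend into the next valley (local minimum)
    minima.append(x)
    x = slide(x + dx, +1, a, b)   # climb the adjacent mountain (local maximum)
    x = x + dx                    # perturb past the maximum before descending again

# minima now holds the three valleys near 0.058, 0.091, and 0.223;
# the deepest one is the global minimum over [a, b].
global_min = min(minima, key=lambda m: m * math.sin(1/m))
```

The clipping at b plays the role of the stopping test "the search reaches x = b": once the final ascent runs into the right endpoint, the sweep ends and the saved minima are compared.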

One may also use this strategy for multi-dimensional functions, though the search is not guaranteed to find the global minimum. This is because when the search space has two or more dimensions, the perturbation Δx is now a vector, and it is unclear how to set the direction of Δx so that the search point visits all existing minima. Also, in the multidimensional case, the above gradient descent/ascent strategy may get caught in a repeating cycle, where the same local minimum and maximum solutions are repeatedly found. The reader is encouraged to stare at Figure 8.1.2, which depicts a plot of a two-dimensional function, to see how (starting from a local minimum point) the choice of Δx affects the sequence of visited peaks and valleys.

Figure 8.1.2. A plot of a two-variable function showing multiple minima, maxima, and saddle points.

For such differentiable multivariate functions, an example of a search strategy which would be useful for finding some "good" (sufficiently deep) local minima may be given as follows. Start the initial search point at a corner of the search region Ω and randomly vary the direction of the perturbation Δx after each local minimum (maximum) is found, such that Δx points in a random direction, away from the current minimum (maximum). A similar global search strategy is one which replaces the gradient ascent step in the above descent/ascent search strategy by a "tunneling step." Here, tunneling is used to move the search point away from the current minimum to another point in its vicinity such that the new point is (hopefully) on a surface of the search landscape which leads into a deeper minimum. This idea is similar to the one embodied in the global descent search strategy discussed in Section 5.1.2.

The above search methods are deterministic in nature. Stochastic (non-deterministic) ideas may also be incorporated in gradient search-based optimization, thus leading to stochastic gradient algorithms. In fact, for problems of moderate to high dimensions, the only feasible methods for global optimization are stochastic in nature (Schoen, 1991). These methods utilize noise to perturb the function y(x) being minimized in order to avoid being trapped in "bad" local minima. In order for the stochastic method to converge to a "good" solution, however, this noise should be introduced appropriately and then subsequently removed. That is, during the search process the perturbations to y are gradually removed so that the effective function being minimized becomes exactly y prior to reaching the final solution.

In its most basic form, the stochastic gradient algorithm performs gradient descent search on a perturbed form ỹ(x, N) of the function y(x), where the perturbation is additive in nature:

ỹ(x, N) = y(x) + c(t) Nᵀx        (8.1.2)

Here, N = [N1, N2, ..., Nn]ᵀ is a vector of independent noise sources, and c(t) is a parameter which controls the magnitude of the noise. To achieve the gradual reduction in noise mentioned above, c(t) must be selected in such a way that it approaches zero as t tends to infinity. A simple choice for c(t) is given by (Cichocki and Unbehauen, 1993)

c(t) = β e^(−λt)        (8.1.3)

with β ≥ 0 and λ > 0. Substituting the perturbed function ỹ in the gradient descent rule in Equation (8.1.1) gives the stochastic gradient rule:

x(t + 1) = x(t) − η[∇y(x(t)) + c(t)N(t)]        (8.1.4)
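In code, the rule in Equation (8.1.4) with the schedule of Equation (8.1.3) might be sketched as follows; the function name, parameter values, and the convex objective used in the check are illustrative assumptions:

```python
import numpy as np

def stochastic_gradient(grad, x0, eta=0.05, beta=1.0, lam=0.01,
                        steps=2000, rng=np.random.default_rng(0)):
    """x(t+1) = x(t) - eta*(grad(x(t)) + c(t)*N(t)), with c(t) = beta*exp(-lam*t)."""
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        c = beta * np.exp(-lam * t)           # decaying noise amplitude, Eq. (8.1.3)
        noise = rng.standard_normal(x.shape)  # independent zero-mean noise sources N(t)
        x = x - eta * (grad(x) + c * noise)   # stochastic gradient rule, Eq. (8.1.4)
    return x

# Sanity check on a convex bowl y(x) = ||x||^2 (gradient 2x): once the noise
# has decayed away, the iterate settles near the unique minimum at the origin.
x_final = stochastic_gradient(lambda x: 2 * x, x0=[1.0, -1.0])
```

On a multimodal objective the early, large-amplitude noise is what lets the iterate jump between basins; the convex check above only verifies that the annealed rule still behaves like plain gradient descent in the limit c(t) → 0.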

The stochastic gradient algorithm in Equation (8.1.4) is inspired by the dynamics of the diffusion process in physical phenomena such as atomic migration in crystals or chemical reactions. The dynamics of the diffusion process across potential barriers involve a combination of gradient descent and random motion according to the Smoluchowski-Kramers equation (Aluffi-Pentini et al., 1985):

dx/dt = −∇E(x) + √(2kT/m) N(t)        (8.1.5)

where E(x) is the potential energy, T is the absolute temperature, k is the Boltzmann constant, m is the reduced mass, and N(t) is a white noise process. The term √(2kT/m) N(t) is a stochastic force. Geman and Hwang (1986) [see also Chiang et al. (1987)] developed a method for global optimization, which is essentially the simulation of an annealed diffusion process. This method is based on Equation (8.1.5) with the temperature T made inversely proportional to the logarithm of positive time (t) for almost guaranteed convergence to the global minimum. The discrete-time version of Equation (8.1.5) with annealed temperature leads to the stochastic gradient rule in Equation (8.1.4). Convergence analysis of a slightly modified version of Equation (8.1.4) can be found in Gelfand and Mitter (1991).

The search rule in Equation (8.1.4) may be applied to any function y as long as the gradient information can be determined or estimated. Unlike the gradient search rule in Equation (8.1.1), the stochastic rule in Equation (8.1.4) may allow the search to escape local minima. Note that for zero mean statistically independent noise, the present search method will follow, on average, the gradient of y.

The probability that stochastic gradient search leads to the global minimum solution critically depends on the functional form of the noise amplitude schedule c(t). In the exponential schedule in Equation (8.1.3), the coefficient β controls the amplitude of the noise and λ determines the rate of damping. A sufficiently large β should be used in order for the search to explore a large range of the search space. For large λ, the stochastic effects decay very rapidly and prematurely reduce the search to the simple deterministic gradient search, thus increasing the probability of reaching a local suboptimal solution. Small values for λ are desirable since they allow the stochastic search to explore a sufficiently large number of points on the search surface, which is necessary for global optimization. However, very small values of λ lead to a very slow convergence process. Thus, the coefficient λ needs to be chosen such that we strike a balance between the desire for fast convergence and the need to ensure an optimal (or near optimal) solution. The following is an example of the application of stochastic gradient search for finding the global solution.

Example 8.1.1: Let us use the stochastic gradient rule in Equation (8.1.4) to search for minima of the function y(x) = x sin(1/x) near the origin. The function y(x) has an infinite number of local minima which become very dense in the region closest to x = 0. The function has three minima in the region x ∈ [0.05, 0.5], as shown in Figure 8.1.1, which are located approximately at 0.058, 0.091, and 0.223. This function is even; thus for each minimum x* > 0 there exists a symmetric (with respect to the vertical axis) minimum at −x*. The global minima of y are approximately at 0.223 and −0.223.

Differentiating y(x) and substituting in Equation (8.1.4) leads to the search rule

x(t + 1) = x(t) − η[sin(1/x(t)) − (1/x(t)) cos(1/x(t)) + c(t)N(t)]

where c(t) = β exp(−λt), N(t) is normally distributed random noise with zero mean and unity variance, and η is a small positive learning rate. Initial simulations are performed to test for proper setting of the parameters of c(t). Values of β > 100, together with an appropriately small λ, allowed the search to converge to "deep" minima of y(x). Figure 8.1.3 shows two trajectories of the search point x for β = 50 (dashed line) and β = 200 (solid line) with λ = 0.02, and x(0) = 0.07. The same noise sequence N(t) is used in computing these trajectories. In most simulations with x(0) close to zero, searches with large β lead to "deeper" minima compared to searches with small β (β < 100). For β = 0, the search becomes a pure gradient descent search, and with x(0) = 0.07 the local minimum at 0.058 will always be reached. On the other hand, for β > 200, the search has a very good chance of converging to the global minimum at 0.223 or its neighboring minimum at x* ≈ 0.091 (some simulations also led to the minima at 0.058, −0.058, −0.091, and −0.223).
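A simulation in the spirit of this example might be sketched as follows. The step size η and iteration count are assumptions (the text does not specify them), and the search point is clipped to [0.05, 0.5] to keep 1/x well behaved, so this run stays on the positive side of the axis:

```python
import math, random

def dy(x):  # derivative of y(x) = x*sin(1/x)
    return math.sin(1/x) - (1/x)*math.cos(1/x)

def stochastic_search(beta, lam=0.02, eta=2e-4, steps=4000, seed=1):
    rng = random.Random(seed)
    x = 0.07                                # x(0) as in the example
    for t in range(steps):
        c = beta * math.exp(-lam * t)       # noise amplitude schedule c(t)
        x -= eta * (dy(x) + c * rng.gauss(0.0, 1.0))
        x = min(max(x, 0.05), 0.5)          # keep the search inside [0.05, 0.5]
    return x

x_det = stochastic_search(beta=0.0)    # pure gradient descent: the ~0.058 minimum
x_sto = stochastic_search(beta=200.0)  # noisy search: often escapes to a deeper minimum
```

With β = 0 the trajectory reproduces the deterministic behavior described above, always ending at the local minimum near 0.058; with β = 200 the early noise lets the search point jump between basins before the annealing drives it into whichever of the three minima it last occupies.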

Figure 8.1.3. Search trajectories generated by the stochastic gradient search method of Equation (8.1.4) for the function y(x) = x sin(1/x). The search started at x(0) = 0.07 and converged to the local minimum x* ≈ 0.091 for a noise gain coefficient β = 50 (dashed line) and to the global minimum x* ≈ 0.223 for β = 200 (solid line).

The stochastic gradient rule in Equation (8.1.4) can be applied to all of the gradient descent-based learning rules for both single and multilayer feedforward neural nets. This type of weight update is sometimes referred to as Langevin-type learning (Heskes and Kappen, 1993b). For example, the Langevin-type backprop is easily formulated by adding the decaying noise terms cl(t)N(t) and cj(t)N(t) to the right-hand side of the batch version of Equations (5.1.2) and (5.1.9), respectively. The subscripts l and j imply the possibility of using different noise magnitude schedules for the output and hidden layers and/or units. The training of multilayer nets according to Langevin-type backprop can be more computationally effective than deterministic gradient descent (i.e., batch backprop). This is because a fast suboptimal schedule for c(t) can be used and still lead (on average) to better solutions than gradient descent. Also, stochastic gradient search has a better chance of escaping local minima and areas of shallow gradient, which may allow it to converge faster (Hoptroff and Hall, 1989). It should be noted here that the incremental update version of backprop [Equations (5.1.2) and (5.1.9)] may also be viewed as a stochastic gradient rule; in this case, however, the stochasticity is intrinsic to the gradient itself, arising from the minimization of an "instantaneous" error and the random presentation order of the training vectors, rather than being artificially introduced as in Langevin-type learning.
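The Langevin-type idea can be sketched for a generic batch gradient-descent learner. The example below uses a single linear unit trained by least squares rather than the multilayer equations of Chapter 5, and the schedule values are illustrative assumptions; the only change relative to deterministic batch training is the decaying noise term c(t)N added to each weight update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch task: fit y = 2*x + 1 with a single linear unit (weight w, bias b).
X = np.linspace(0.0, 1.0, 10)
Y = 2.0 * X + 1.0

w, b = 0.0, 0.0
eta, beta, lam = 0.1, 0.5, 0.01          # step size and noise schedule (assumed)
for t in range(5000):
    err = w * X + b - Y                   # batch error over all training points
    grad_w, grad_b = 2*np.mean(err*X), 2*np.mean(err)
    c = beta * np.exp(-lam * t)           # decaying noise amplitude c(t)
    w -= eta * (grad_w + c * rng.standard_normal())  # Langevin-type weight update
    b -= eta * (grad_b + c * rng.standard_normal())

# With the noise annealed away, (w, b) approaches the least-squares solution (2, 1).
```

This quadratic error surface has no local minima, so the sketch only demonstrates the mechanics of the update; the benefit described in the text appears on the multimodal error surfaces of multilayer nets, where the early noise helps the weights escape poor basins.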