Consider the optimization problem of finding the
extreme point(s) of a real-valued multi-dimensional scalar function
(objective function) of the form *y* : Ω → R,
where the search space Ω is a compact subset of R*n*.
An extreme point is a point **x***
in Ω such that *y*(**x***)
takes on its maximum (or minimum) value. In the following discussion,
it will be assumed, without loss of generality, that by optimization
we mean minimization. Thus, an extreme point of *y* is the
"global" minimum of *y*. Multiple extreme points
may exist. In addition to the global minimum (minima), the function
*y* may also admit local minima. A point **x***
is a local minimum of *y* if *y*(**x***) < *y*(**x**)
for all **x** ≠ **x*** such that ||**x*** − **x**|| < ε,
for some ε > 0. Figure 8.1.1 illustrates the concepts
of global and local minima for the uni-variate scalar function
*y*(*x*) = *x* sin(1/*x*)
for *x* ∈ [0.05, 0.5].

There exist several ways of determining the minima
of a given function. Analytical techniques exploit Fermat's Stationarity
Principle which states that the gradient (or derivative in the
uni-variate case) of *y* with respect to **x** is zero
at all minima (and maxima). Thus, one can find these minima (and
maxima) by solving the set of equations (possibly nonlinear) of
the form ∇*y*(**x**) = **0**. Here, a solution
**x*** is a minimum
of *y* if the Hessian matrix, **H**(**x**) = ∇²*y*(**x**), evaluated
at **x***, is positive
definite (i.e., **x**T**H**(**x***)**x** > 0
for all **x** ≠ **0**), or if *y*″(*x**) > 0
in the uni-variate case. The global minimum may then be identified
by direct evaluation of *y*(**x***).
This method is theoretically sound as long as the function *y*
is twice differentiable. In practical situations, though, the
above approach is inefficient due to the computational overhead
involved in its implementation on a digital computer.
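
To make this concrete, the sketch below (our own illustration, not the text's) locates the minima of *y*(*x*) = *x* sin(1/*x*) on [0.05, 0.5] by bracketing sign changes of *y*′ and keeping the stationary points where *y*″ > 0; the helper names `y`, `dy`, `d2y` and the use of SciPy's `brentq` root finder are our assumptions.

```python
import numpy as np
from scipy.optimize import brentq

# Objective and its derivatives for y(x) = x*sin(1/x).
# (y'' simplifies to -sin(1/x)/x**3, so for x > 0 a stationary
# point with sin(1/x) < 0 is a minimum.)
def y(x):   return x * np.sin(1.0 / x)
def dy(x):  return np.sin(1.0 / x) - np.cos(1.0 / x) / x
def d2y(x): return -np.sin(1.0 / x) / x**3

# Bracket sign changes of y' on a fine grid over [0.05, 0.5], then
# solve y'(x) = 0 in each bracket (Fermat's stationarity principle).
xs = np.linspace(0.05, 0.5, 2000)
minima = []
for a, b in zip(xs[:-1], xs[1:]):
    if dy(a) * dy(b) < 0:              # derivative changes sign in [a, b]
        x_star = brentq(dy, a, b)      # stationary point
        if d2y(x_star) > 0:            # second-order condition: minimum
            minima.append(x_star)

for x_star in minima:
    print(f"local minimum at x* = {x_star:.3f}, y(x*) = {y(x_star):.4f}")
# The global minimum on [0.05, 0.5] is the stationary point with the
# smallest y(x*), here x* ~ 0.223.
```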

Other, more efficient optimization techniques do
exist. One example is the simple (steepest) gradient descent
algorithm introduced in Chapter 3 [e.g., see Equation (3.1.21)],
which formed the basis for most of the learning rules discussed
so far in this book. Assuming *y* is differentiable, gradient
descent can be expressed according to the recursive rule:

**x**(*t* + 1) = **x**(*t*) − ρ∇*y*(**x**(*t*))  (8.1.1)

where ρ is a small positive constant, and *t*
is a positive integer representing the iteration number. Given
an initial "guess" **x**(0) (e.g., a random point
in Ω), the recursive rule in Equation (8.1.1) implements a search
strategy, whereby a sequence of vectors

**x**(0), **x**(1), **x**(2), ...

is generated such that

*y*(**x**(*t* + 1)) ≤ *y*(**x**(*t*))

and hence at each iteration we move closer to an
optimal solution. Although computationally efficient (its convergence
properties and some extensions were considered in Section 5.2.3),
gradient search methods may lead only to local minima of *y*
which happen to be "close" to the initial search point
**x**(0). As an example, consider the one-dimensional function
*y*(*x*) = *x* sin(1/*x*)
shown in Figure 8.1.1. It is clear from this figure that
given any initial search point **x**(0) ∈ [0.05, 0.12],
the gradient descent algorithm will always converge to one of
the two local minima shown. In this case, the region of search
space containing the global solution will never be explored.
Hence for local search algorithms, such as gradient descent,
the quality (optimality) of the final solution is highly dependent
upon the selection of the initial search point.
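
As an illustration (ours, not the text's), the following sketch implements the recursive rule of Equation (8.1.1) for this one-dimensional example and exhibits the trapping behavior; the step size `rho` and the iteration count are arbitrary choices of ours.

```python
import numpy as np

def dy(x):
    """Derivative of y(x) = x*sin(1/x)."""
    return np.sin(1.0 / x) - np.cos(1.0 / x) / x

def gradient_descent(x0, rho=1e-4, iters=20000):
    """Recursive rule of Equation (8.1.1): x(t+1) = x(t) - rho*y'(x(t))."""
    x = x0
    for _ in range(iters):
        x = x - rho * dy(x)
    return x

# Any start in [0.05, 0.12] settles into one of the two nearby local
# minima (~0.058 or ~0.091); the global basin around 0.223 is entered
# only from starting points that already lie near it.
for x0 in (0.07, 0.10, 0.30):
    print(f"x(0) = {x0:.2f}  ->  x* = {gradient_descent(x0):.3f}")
```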

Global minimization (optimization) requires a global
search strategy; a strategy that cannot be easily fooled by local
minima. In Sections 8.2, 8.4 and 8.5 we will discuss three commonly
used global search strategies. For a survey of global optimization
methods, the reader is referred to the book by Törn and Žilinskas
(1989). As for the remainder of this section, we shall discuss
extensions which help transform a simple gradient search into
a global search.

**8.1.1 A Gradient Descent/Ascent Search Strategy**

Figure 8.1.2. A plot of a two-variable function showing
multiple minima, maxima, and saddle points.

**8.1.2 Stochastic Gradient Search: Global Search
via Diffusion**

The above search methods are deterministic in nature.
Stochastic (non-deterministic) ideas may also be incorporated
in gradient search-based optimization, thus leading to stochastic
gradient algorithms. In fact, for problems of moderate to high
dimensions, the only feasible methods for global optimization
are stochastic in nature (Schoen, 1991). These methods utilize
noise to perturb the function *y*(**x**) being minimized
in order to avoid being trapped in "bad" local minima.
However, in order for the stochastic method to converge to a "good"
solution, this noise should be introduced appropriately and then
subsequently removed. That is, during the search process the
perturbations to *y* are gradually removed so that the effective
function being minimized will become exactly *y* prior to
reaching the final solution.

In its most basic form, the stochastic gradient
algorithm performs gradient descent search on a perturbed form
*ŷ*(**x**, **N**) of the function *y*(**x**),
where the perturbation is additive in nature:

*ŷ*(**x**, **N**) = *y*(**x**) + *c*(*t*)**N**T**x**  (8.1.2)

Here, **N** = [*N*₁, *N*₂, ..., *N*ₙ]T
is a vector of independent noise sources, and *c*(*t*)
is a parameter which controls the magnitude of noise. To achieve
the gradual reduction in noise mentioned above, *c*(*t*)
must be selected in such a way that it approaches zero as *t*
tends to infinity. A simple choice for *c*(*t*) is
given by (Cichocki and Unbehauen, 1993)

*c*(*t*) = β exp(−α*t*)  (8.1.3)

with β ≥ 0 and α > 0. Substituting
the perturbed function *ŷ* into the gradient
descent rule in Equation (8.1.1) gives the stochastic gradient
rule:

**x**(*t* + 1) = **x**(*t*) − ρ[∇*y*(**x**(*t*)) + *c*(*t*)**N**(*t*)]  (8.1.4)

The stochastic gradient algorithm in Equation (8.1.4)
is inspired by the dynamics of the diffusion process in physical
phenomena such as atomic migration in crystals or chemical reactions.
The dynamics of the diffusion process across potential barriers
involve a combination of gradient descent and random motion according
to the Smoluchowski-Kramers equation (Aluffi-Pentini et al., 1985):

*m*(d**x**/d*t*) = −∇*E*(**x**) + √(2*mkT*) **N**(*t*)  (8.1.5)

where *E*(**x**) is the potential energy,
*T* is the absolute temperature, *k* is the Boltzmann
constant, and *m* is the reduced mass. The term √(2*mkT*) **N**(*t*)
is a stochastic force. Geman and Hwang (1986) [see also Chiang
et al. (1987)] developed a method for global optimization, which
is essentially the simulation of an annealed diffusion process.
This method is based on Equation (8.1.5) with the temperature
*T* made inversely proportional to the logarithm of positive
time (*t*) for almost guaranteed convergence to the global
minimum. The discrete-time version of Equation (8.1.5) with annealed
temperature leads to the stochastic gradient rule in Equation
(8.1.4). Convergence analysis of a slightly modified version
of Equation (8.1.4) can be found in Gelfand and Mitter (1991).

The search rule in Equation (8.1.4) may be applied
to any function *y* as long as the gradient information can
be determined or estimated. Unlike the gradient search rule in
Equation (8.1.1), the stochastic rule in Equation (8.1.4) may
allow the search to escape local minima. Note that for zero mean
statistically independent noise, the present search method will
follow, on average, the gradient of *y*.
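
As a concrete (hypothetical) rendering of this rule, the sketch below applies Equation (8.1.4), with the exponential amplitude schedule of Equation (8.1.3), to any function whose gradient `grad` can be evaluated; the function name and all default parameter values are our own choices.

```python
import numpy as np

def stochastic_gradient_search(grad, x0, rho=1e-4, beta=200.0, alpha=0.02,
                               iters=20000, seed=0):
    """Stochastic gradient rule of Equation (8.1.4) with the exponential
    noise amplitude schedule c(t) = beta*exp(-alpha*t) of Equation (8.1.3)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        c = beta * np.exp(-alpha * t)          # decaying noise amplitude
        N = rng.standard_normal(x.shape)       # zero-mean, unit-variance noise
        x = x - rho * (grad(x) + c * N)        # Equation (8.1.4)
    return x
```

Because c(*t*) → 0 as *t* grows, the late iterations reduce to plain gradient descent on *y*, as the text requires.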

The probability that stochastic gradient search
leads to the global minimum solution critically depends on the
functional form of the noise amplitude schedule *c*(*t*).
In the exponential schedule in Equation (8.1.3), the coefficient
β controls the amplitude of the noise and α determines its rate of damping.
A sufficiently large β should be used in order for the search
to explore a large range of the search space. For large α, the
stochastic effects decay very rapidly and prematurely reduce
the search to simple deterministic gradient search, thus increasing
the probability of reaching a suboptimal local solution. Small
values of α are desirable, since they allow the stochastic search
to explore a sufficiently large number of points on the search
surface, which is necessary for global optimization. However,
very small values of α lead to a very slow convergence process.
Thus, the coefficient α needs to be chosen such that we strike
a balance between the desire for fast convergence and the need
to ensure an optimal (or near-optimal) solution. The following
is an example of the application of stochastic gradient search
for finding the global solution.

**Example 8.1.1:** Let us
use the stochastic gradient rule in Equation (8.1.4) to search
for minima of the function *y*(*x*) = *x* sin(1/*x*)
near the origin. The function *y*(*x*) has an infinite
number of local minima, which become very dense in the region closest
to *x* = 0. The function has three minima in the
region *x* ∈ [0.05, 0.5], as is shown in Figure
8.1.1, which are located approximately at 0.058, 0.091, and 0.223.
This function is even; thus for each minimum *x** > 0
there exists a symmetric (with respect to the vertical axis) minimum
at −*x**.
The global minima of *y* are approximately at 0.223 and
−0.223.

Differentiating *y*(*x*) and substituting
in Equation (8.1.4) leads to the search rule

*x*(*t* + 1) = *x*(*t*) − ρ[sin(1/*x*(*t*)) − (1/*x*(*t*)) cos(1/*x*(*t*)) + *c*(*t*)*N*(*t*)]

where *c*(*t*) = β exp(−α*t*),
*N*(*t*) is normally distributed random noise with zero
mean and unit variance, and ρ is a small positive constant. Initial
simulations are performed to test for proper setting of the parameters
of *c*(*t*). Values of β > 100
allowed the search to converge to "deep" minima of *y*(*x*).
Figure 8.1.3 shows two trajectories of the search point *x*
for β = 50 (dashed line) and β = 200 (solid
line), with α = 0.02 and *x*(0) = 0.07.
The same noise sequence *N*(*t*) is used in computing
these trajectories. In most simulations with *x*(0) close
to zero, searches with large β lead to "deeper" minima
compared to searches with small β (β < 100). For β = 0,
the search becomes a pure gradient descent search, and with *x*(0) = 0.07
the local minimum at 0.058 will always be reached. On the other
hand, for β > 200, the search has a very good chance
of converging to the global minimum at 0.223 or its neighboring
minimum at *x** ≈ 0.091
(some simulations also led to the minima at 0.058, −0.058,
−0.091, and
−0.223).
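
A rough reproduction of this experiment (our sketch, not the book's code; the clipping of the search to [0.05, 0.5] and the final noise-free "polish" descent are our additions, and the exact outcome depends on the noise seed) might look like:

```python
import numpy as np

def dy(x):
    """Derivative of y(x) = x*sin(1/x)."""
    return np.sin(1.0 / x) - np.cos(1.0 / x) / x

def example_8_1_1(beta, alpha=0.02, rho=1e-4, x0=0.07, seed=1):
    rng = np.random.default_rng(seed)
    x = x0
    for t in range(2000):                       # stochastic phase, c(t) -> 0
        c = beta * np.exp(-alpha * t)
        x = x - rho * (dy(x) + c * rng.standard_normal())
        x = min(max(x, 0.05), 0.5)              # keep x in the studied region
    for _ in range(20000):                      # noise-free descent to settle
        x = min(max(x - rho * dy(x), 0.05), 0.5)
    return x

# Qualitative behaviour of Figure 8.1.3: beta = 0 is pure gradient descent
# (ends near 0.058 from x(0) = 0.07); larger beta can escape toward the
# deeper minima near 0.091 and 0.223.
for beta in (0.0, 50.0, 200.0):
    print(f"beta = {beta:5.1f}  ->  x* ~ {example_8_1_1(beta):.3f}")
```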

Figure 8.1.3. Search trajectories generated by the
stochastic gradient search method of Equation (8.1.4) for the
function *y*(*x*) = *x* sin(1/*x*).
The search started at *x*(0) = 0.07 and converged
to the local minimum *x** ≈ 0.091
for a noise gain coefficient β = 50 (dashed line) and
to the global minimum *x** ≈ 0.223
for β = 200 (solid line).

The stochastic gradient rule in Equation (8.1.4)
can be applied to all of the gradient descent-based learning rules
for both single and multilayer feedforward neural nets. This
type of weight update is sometimes referred to as Langevin-type
learning (Heskes and Kappen, 1993b). For example, the Langevin-type
backprop is easily formulated by adding the decaying noise terms
*c*ₗ(*t*)*N*(*t*)
and *c*ⱼ(*t*)*N*(*t*)
to the right hand side of the batch version of Equations (5.1.2)
and (5.1.9), respectively. The subscripts *l* and *j*
imply the possibility for using different noise magnitude schedules
for the output and hidden layers and/or units. The training of
multilayer nets according to Langevin-type backprop can be more
computationally effective than deterministic gradient descent
(i.e., batch backprop). This is because a fast suboptimal schedule
for *c*(*t*) can be used and still lead (on the average)
to better solutions than gradient descent. Also, stochastic gradient
search has a better chance of escaping local minima and areas
of shallow gradient, which may allow it to converge faster (Hoptroff
and Hall, 1989). It should be noted here that the incremental
update version of backprop [Equations (5.1.2) and (5.1.9)] may
also be viewed as a stochastic gradient rule, though there the stochasticity
is intrinsic to the gradient itself, due to the nature of the
minimization of an "instantaneous" error and the random
presentation order of the training vectors, as opposed to being
artificially introduced, as in Langevin-type learning.
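
As a minimal sketch of such a Langevin-type weight update (our illustration; the name `langevin_step` and all parameter values are ours, and the gradients are assumed to come from a batch backprop pass), one noisy update over a list of weight matrices could be written:

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(weights, grads, t, rho=0.1, beta=1.0, alpha=1e-3):
    """One Langevin-type update: the usual gradient descent step plus
    additive zero-mean noise whose amplitude c(t) decays exponentially,
    as in Equations (8.1.3) and (8.1.4). Per the text, beta and alpha
    could be chosen differently for output and hidden layers."""
    c = beta * np.exp(-alpha * t)
    return [W - rho * (G + c * rng.standard_normal(W.shape))
            for W, G in zip(weights, grads)]

# Usage: weights = langevin_step(weights, grads, t) inside the training
# loop, where grads holds the batch gradients for each layer.
```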