Consider the optimization problem of finding the
extreme point(s) of a real-valued multi-dimensional scalar function
(objective function) of the form y : R,
where the search space is a compact subset of Rn.
An extreme point is a point x*
in such that y(x*)
takes on its maximum (or minimum) value. In the following discussion,
it will be assumed, without loss of generality, that by optimization
we mean minimization. Thus, an extreme point of y is the
"global" minimum of y. Multiple extreme points
may exist. In addition to the global minimum (minima), the function
y may also admit local minima. A point x*
is a local minimum of y if y(x*) < y(x)
for all x such that ||x* - x|| ,
for some > 0. Figure 8.1.1 illustrates the concepts
of global and local minima for the uni-variate scalar function
y(x) = x sin
for x [0.05, 0.5].
.
There exists several ways of determining the minima
of a given function. Analytical techniques exploit Fermat's Stationarity
Principle which states that the gradient (or derivative in the
uni-variate case) of y with respect to x is zero
at all minima (and maxima). Thus, one can find these minima (and
maxima) by solving the set of equations (possibly nonlinear) of
the form y(x) = 0. Here, a solution
x* is a minimum
of y if the Hessian matrix,
, evaluated
at x*, is positive
definite (i.e., xTH(x*)x > 0
for all x 0), or if
in the uni-variate case. The global minimum may then be identified
by direct evaluation of y(x*).
This method is theoretically sound as long as the function y
is twice differentiable. In practical situations, though, the
above approach is inefficient due to the computational overhead
involved in its implementation on a digital computer.
Other, more efficient optimization techniques do
exist. One example is the simple (steepest) gradient descent
algorithm introduced in Chapter 3 [e.g., see Equation (3.1.21)],
which formed the basis for most of the learning rules discussed
so far in this book. Assuming y is differentiable, gradient
descent can be expressed according to the recursive rule:
(8.1.1)
where is a small positive constant, and t
is a positive integer representing the iteration number. Given
an initial "guess" x(0) (e.g., a random point
in ), the recursive rule in Equation (8.1.1) implements a search
strategy, whereby a sequence of vectors:
is generated such that
and hence at each iteration we move closer to an
optimal solution. Although computationally efficient (its convergence
properties and some extensions were considered in Section 5.2.3),
gradient search methods may lead only to local minima of y
which happen to be "close" to the initial search point
x(0). As an example, consider the one-dimensional function
y(x) = x sin
shown in Figure 8.1.1. It is clear that from this figure that
given any initial search point x(0) [0.05, 0.12],
the gradient descent algorithm will always converge to one of
the two local minima shown. In this case, the region of search
space containing the global solution will never be explored.
Hence for local search algorithms, such as gradient descent,
the quality (optimality) of the final solution is highly dependent
upon the selection of the initial search point.
Global minimization (optimization) requires a global
search strategy; a strategy that cannot be easily fooled by local
minima. In Sections 8.2, 8.4 and 8.5 we will discuss three commonly
used global search strategies. For a survey of global optimization
methods, the reader is referred to the book by Törn and
ilinskas
(1989). As for the remainder of this section, we shall discuss
extensions which help transform a simple gradient search into
a global search.
8.1.1 A Gradient Descent/Ascent Search Strategy
Using intuition and motivated by the saying: "There
is a valley behind every mountain," we may employ gradient
descent/ascent in a simple search strategy which will allow the
discovery of global minima. Assuming a uni-variate objective
function y(x), x [a, b]
with a < b, start with x(0) = a
and use gradient descent search to reach the first local minimum
One may also use this strategy for multi-dimensional
functions, though the search is not guaranteed to find the global
minimum. This is because when the search space has two or more
dimensions, the perturbation x
is now a vector and it is unclear how to set the direction of
x so that the search
point visits all existing minima. Also, in the multidimensional
case, the above gradient descent/ascent strategy may get caught
in a repeating cycle, where the same local minimum and maximum
solutions are repeatedly found. The reader is encouraged to stare
at Figure 8.1.2 which depicts a plot of a two-dimensional function
to see how (starting from a local minimum point) the choice of
x effects the sequence
of visited peaks and valleys.
Figure 8.1.2. A plot of a two-variable function showing
multiple minima, maxima, and saddle points.
For such differentiable multivariate functions,
an example of a search strategy which would be useful for finding
some "good" (sufficiently deep) local minima may be
given as follows. Start the initial search point at a corner
of the search region () and randomly vary the direction of the
perturbation x
after each local minimum (maximum) is found, such that x
points in a random direction, away from the current minimum (maximum).
A similar global search strategy is one which replaces the gradient
ascent step in the above descent/ascent search strategy by a "tunneling
step." Here, tunneling is used to move the search point
away from the current minimum to another point in its vicinity
such that the new point is (hopefully) on a surface of the search
landscape which leads into a deeper minimum. This idea is similar
to the one embodied in the global descent search strategy discussed
in Section 5.1.2.
8.1.2 Stochastic Gradient Search: Global Search
via Diffusion
The above search methods are deterministic in nature.
Stochastic (non-deterministic) ideas may also be incorporated
in gradient search-based optimization, thus leading to stochastic
gradient algorithms. In fact, for problems of moderate to high
dimensions, the only feasible methods for global optimization
are stochastic in nature (Schoen, 1991). These methods utilize
noise to perturb the function y(x) being minimized
in order to avoid being trapped in "bad" local minima.
Though in order for the stochastic method to converge to a "good"
solution, this noise should be introduced appropriately and then
subsequently removed. That is, during the search process the
perturbations to y are gradually removed so that the effective
function being minimized will become exactly y prior to
reaching the final solution.
In its most basic form, the stochastic gradient
algorithm performs gradient descent search on a perturbed form
Here, N = [N1, N2, ..., Nn]T
is a vector of independent noise sources, and c(t)
is a parameter which controls the magnitude of noise. To achieve
the gradual reduction in noise mentioned above, c(t)
must be selected in such a way that it approaches zero as t
tends to infinity. A simple choice for c(t) is
given by (Cichocki and Unbehauen, 1993)
c(t) = e-t
(8.1.3)
with 0 and > 0. Substituting
the perturbed function
The stochastic gradient algorithm in Equation (8.1.4)
is inspired by the dynamics of the diffusion process in physical
phenomena such as atomic migration in crystals or chemical reactions.
The dynamics of the diffusion process across potential barriers
involve a combination of gradient descent and random motion according
to the Smoluchowski-Kramers equation (Aluffi-Pentini et al., 1985):
where E(x) is the potential energy,
T is the absolute temperature, k is the Boltzmann
constant, and m is the reduced mass. The term
The search rule in Equation (8.1.4) may be applied
to any function y as long as the gradient information can
be determined or estimated. Unlike the gradient search rule in
Equation (8.1.1), the stochastic rule in Equation (8.1.4) may
allow the search to escape local minima. Note that for zero mean
statistically independent noise, the present search method will
follow, on average, the gradient of y.
The probability that stochastic gradient search
leads to the global minimum solution critically depends on the
functional form of the noise amplitude schedule c(t).
In the exponential schedule in Equation (8.1.3), the coefficient
controls the amplitude of noise and determines the rate of damping.
A sufficiently large should be used in order for the search
to explore a large range of the search space. For large , the
stochastic effects decay very rapidly, and prematurely reduce
the search to the simple deterministic gradient search, thus increasing
the probability of reaching a local suboptimal solution. Small
values for are desirable since they allow the stochastic search
to explore a sufficiently large number of points on the search
surface, which is necessary for global optimization. However,
very small values of lead to a very slow convergence process.
Thus, the coefficient needs to be chosen such that we strike
a balance between the desire for fast convergence and the need
to ensure an optimal (or near optimal) solution. The following
is an example of the application of stochastic gradient search
for finding the global solution.
Example 8.1.1: Let us
use the stochastic gradient rule in Equation (8.1.4) to search
for minima of the function y(x) = x sin
Differentiating y(x) and substituting
in Equation (8.1.4) leads to the search rule
where c(t) = exp(-t),
N(t) is normally distributed random noise with zero
mean and unity variance, and
Figure 8.1.3. Search trajectories generated by the
stochastic gradient search method of Equation (8.1.4) for the
function y(x) = x sin
The stochastic gradient rule in Equation (8.1.4)
can be applied to all of the gradient descent-based learning rules
for both single and multilayer feedforward neural nets. This
type of weight update is sometimes referred to as Langevin-type
learning (Heskes and Kappen, 1993b). For example, the Langevin-type
backprop is easily formulated by adding the decaying noise terms
cl(t)N(t)
and cj(t)N(t)
to the right hand side of the batch version of Equations (5.1.2)
and (5.1.9), respectively. The subscripts l and j
imply the possibility for using different noise magnitude schedules
for the output and hidden layers and/or units. The training of
multilayer nets according to Langevin-type backprop can be more
computationally effective than deterministic gradient descent
(i.e., batch backprop). This is because a fast suboptimal schedule
for c(t) can be used and still lead (on the average)
to better solutions than gradient descent. Also, stochastic gradient
search has a better chance of escaping local minima and areas
of shallow gradient, which may allow it to converge faster (Hoptroff
and Hall, 1989). It should be noted here that the incremental
update version of backprop [Equations (5.1.2) and (5.1.9)] may
also be viewed as a stochastic gradient rule; though, the stochasticity
is intrinsic to the gradient itself, due to the nature of the
minimization of an "instantaneous" error and the random
presentation order of the training vectors, as opposed to being
artificially introduced, as in Langevin-type learning.
. Next, save the value
and proceed by ascending the function y (using Equation
(8.1.1) with the negative sign replaced by a positive sign) starting
from the initial point
where x
is a sufficiently small positive constant. Continue ascending
until a local maximum is reached. Now, switch back to gradient
descent starting from the current point (maximum) perturbed by
x until convergence
to the second local minimum
, and save
the value
. This full search cycle is
repeated until the search reaches the point x = b.
At this point, all minima of y over x [a, b]
have been obtained and the global minimum is the point
satisfying
. This strategy will always
lead to the global minimum when y is a function of a single
variable; i.e., when the search space is one-dimensional.
(x, N) of the function y(x)
where the perturbation is additive in nature:
(8.1.2)
in the gradient
descent rule in Equation (8.1.1) gives the stochastic gradient
rule:
(8.1.4)
(8.1.5)
is a stochastic force. Geman and Hwang (1986) [see also Chiang
et al. (1987)] developed a method for global optimization, which
is essentially the simulation of an annealed diffusion process.
This method is based on Equation (8.1.5) with the temperature
T made inversely proportional to the logarithm of positive
time (t) for almost guaranteed convergence to the global
minimum. The discrete-time version of Equation (8.1.5) with annealed
temperature leads to the stochastic gradient rule in Equation
(8.1.4). Convergence analysis of a slightly modified version
of Equation (8.1.4) can be found in Gelfand and Mitter (1991).
near the origin. The function y(x) has an infinite
number of local minima which become very dense in the region closest
to x = 0. The function has three minima in the
region x [0.05, 0.5], as is shown in Figure
8.1.1, which are located approximately at 0.058, 0.091, and 0.223.
This function is even; thus for each minima x* > 0
there exists a symmetric (with respect to the vertical axis) minima
at -x*.
The global minima of y are approximately at 0.223 and
-0.223.

. Initial
simulations are performed to test for proper setting of the parameters
of c(t). Values of > 100 and
allowed the search to converge to "deep" minima of y(x).
Figure 8.1.3 shows two trajectories of the search point x
for = 50 (dashed line) and = 200 (solid
line) with = 0.02, and x(0) = 0.07.
The same noise sequence N(t) is used in computing
these trajectories. In most simulations with x(0) close
to zero, searches with large lead to "deeper" minima
compared to searches with small ( < 100). For = 0,
the search becomes a pure gradient descent search, and with x(0) = 0.07
the local minima at 0.058 will always be reached. On the other
hand, for > 200, the search has a very good chance
of converging to the global minimum at 0.223 or its neighboring
minimum at x* 0.091
(some simulations also led to the minima at 0.058, -0.058,
-0.091 and
).
.
The search started at x(0) = 0.07 and converged
to the local minimum x* 0.091
for a noise gain coefficient = 50 (dashed line) and
to the global minimum x* 0.223
for
(solid line).