**4.1 Learning as a Search Mechanism**

Supervised learning can be related to classical
approximation theory (Poggio and Girosi, 1990a). Here, the idea
is to approximate or interpolate a continuous multivariate function
*g*(**x**), from samples {**x**, *g*(**x**)},
by an approximation function (or class of functions) *G*(**w**, **x**),
where **w** ∈ ℝᵈ is a parameter vector with *d* degrees of freedom, and **x**
belongs to a compact set Ω ⊂ ℝⁿ.
In this case, the set of samples {**x**, *g*(**x**)},
**x** ∈ Ω, is referred to as a training set. The approximation
problem is to find an optimal parameter vector that provides the
"best" approximation of *g* on the set Ω for
a given class of functions *G*. Formally stated, we
desire a solution **w*** ∈ ℝᵈ such that

||*G*(**w***, **x**) − *g*(**x**)|| < ε  for all **x** ∈ Ω    (4.1.1)

where the tolerance ε is a positive real number and
||·|| is any appropriate norm. For the case
where ||·|| is the Euclidean norm, it is well
known that an appropriate criterion function *J* is

*J*(**w**) = Σᵢ [*g*(**x**ᵢ) − *G*(**w**, **x**ᵢ)]²    (4.1.2)

where the sum is taken over the training pairs {**x**ᵢ, *g*(**x**ᵢ)}. The
global minimum of *J* represents the minimum sum of squared
errors (SSE) solution. The choice of approximation function
*G*, the criterion function *J*, and the search mechanism
for **w*** all play
critical roles in determining the quality and properties of the
resulting solution/approximation.
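As a concrete sketch, the SSE criterion of Equation (4.1.2) can be evaluated numerically. The single-sigmoid approximator *G* and the toy training set below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Hypothetical approximator G(w, x): a single sigmoidal unit,
# G(w, x) = 1 / (1 + exp(-(w[1]*x + w[0]))).  Illustrative only.
def G(w, x):
    return 1.0 / (1.0 + np.exp(-(w[1] * x + w[0])))

# SSE criterion J(w) of Equation (4.1.2), summed over the training set.
def J(w, x_train, g_train):
    residuals = g_train - G(w, x_train)
    return np.sum(residuals ** 2)

# Toy training set {x, g(x)}: here g itself happens to be sigmoidal,
# so the global minimum of J is exactly zero.
x_train = np.linspace(-2.0, 2.0, 20)
g_train = 1.0 / (1.0 + np.exp(-2.0 * x_train))

w_exact = np.array([0.0, 2.0])   # parameters that reproduce g exactly
w_poor = np.array([1.0, -1.0])   # an arbitrary, worse parameter vector
print(J(w_exact, x_train, g_train))  # 0.0 (the global SSE minimum)
print(J(w_poor, x_train, g_train))   # a strictly larger SSE
```

Evaluating *J* at different parameter vectors in this way is exactly what a search mechanism does repeatedly while seeking **w***.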

Usually we know our objective and this knowledge
is translated into an appropriate criterion function *J*.
In terms of search mechanisms, gradient-based search is appropriate
(and is simple to implement) for cases where it is known that
*J* is differentiable and bounded. Of course, if gradient
search is used, we must be willing to accept locally-optimal solutions.
In general, only global search mechanisms, which are computationally
very expensive, can be expected to find globally optimal solutions (global
search-based learning is the subject of Chapter 8). Among the above three
factors, the most critical in terms of affecting the quality of
the approximation in Equation (4.1.1) is the choice of approximation
function *G*.
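A minimal sketch of gradient-based search, assuming only that *J* is differentiable: here the gradient is estimated numerically, and the quadratic criterion is an illustrative stand-in with a single (hence global) minimum, so the search converges to it.

```python
import numpy as np

# Central-difference estimate of the gradient of a criterion J at w.
def num_grad(J, w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        grad[i] = (J(w + e) - J(w - e)) / (2 * eps)
    return grad

# Plain gradient descent: follow -grad J with a fixed learning rate.
# With a non-convex J this can stall in a local minimum.
def gradient_descent(J, w0, lr=0.1, steps=500):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * num_grad(J, w)
    return w

# Illustrative quadratic criterion with its unique minimum at (1, -2).
J_quad = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
w_star = gradient_descent(J_quad, [5.0, 5.0])
print(w_star)  # close to [1, -2]
```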

In classical analysis, polynomials and rational
functions are typically used for function approximation. On the
other hand, for artificial neural networks, the approximation
functions are usually chosen from the class of smooth sigmoidal-type
functions, and the approximation is constructed as a superposition
of such sigmoidal functions. In general, there are two important
issues in selecting an appropriate class of basis functions: universality
and generalization. By universality, we mean the ability of *G*
to represent, to any desired degree of accuracy, the class of
functions *g* being approximated; in Chapter 2, we have established
the universality of feedforward layered neural nets. On the other
hand, generalization means the ability of *G* to correctly
map new points **x**, drawn from the same underlying input
distribution *p*(**x**), which were not seen during the
learning phase. Thus, an interesting question here is how well
a neural network compares to other universal approximation
functions in terms of generalization (some insight into answering
this question is given in Section 5.2.5). Later in this chapter,
we will see that for feedforward neural networks, the number of
degrees of freedom (weights) in *G* plays a critical role
in determining the degree of data overfitting, which directly
affects generalization quality.
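The effect of the number of degrees of freedom on overfitting can be sketched with polynomial approximators standing in for *G*; the target function, noise level, and degrees below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target g(x) = sin(x), sampled with noise.  A held-out set drawn from
# the same input region probes generalization.
x_train = np.linspace(0.0, 3.0, 10)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(0.1, 2.9, 50)
y_test = np.sin(x_test)

# SSE of a polynomial with coefficients coef on a data set.
def sse(coef, x, y):
    return np.sum((np.polyval(coef, x) - y) ** 2)

# The polynomial degree sets the number of degrees of freedom in w.
for deg in (3, 9):
    coef = np.polyfit(x_train, y_train, deg)
    print(deg, sse(coef, x_train, y_train), sse(coef, x_test, y_test))
# Degree 9 nearly interpolates the 10 noisy training points (training
# SSE near zero), yet typically generalizes worse than degree 3 on the
# held-out set: it has fit the noise.
```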

Another way to control generalization is through
conditioning of the criterion function. A "regularization"
term (Poggio and Girosi, 1990a) may be added to an initial criterion
function *J* according to

*J*′(**w**) = *J*(**w**) + λ ||*P*(**w**)||²    (4.1.3)

The λ||*P*(**w**)||²
term in Equation (4.1.3) is used to embed *a priori* information
about the function *g*, such as smoothness, invariance, etc.
In this case, *J*(**w**) is the quantity to be
minimized subject to the regularization constraint that ||*P*(**w**)||²
is kept "small," with the Lagrange multiplier λ determining
the degree of regularization. These ideas can also be extended
to unsupervised learning, where the term λ||*P*(**w**)||²
may be thought of as a constraint satisfaction term; such a term
may help condition the criterion so that
the search process is stabilized. Examples of this regularization
strategy have already been encountered in the unsupervised Hebbian-type
learning rules presented in Chapter 3 [e.g., refer to Equation
(3.3.8)].
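As a sketch of Equation (4.1.3), consider the common special case ||*P*(**w**)||² = ||**w**||² (weight decay) with a linear model, for which the regularized SSE minimizer is available in closed form. The data and the value of λ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimizer of J(w) + lam * ||w||^2 for a linear model G(w, x) = X @ w:
# the ridge-regression normal equations (X'X + lam*I) w = X'y.
def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative linear data: y = X w_true + noise.
X = rng.standard_normal((20, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.05 * rng.standard_normal(20)

w0 = ridge_fit(X, y, 0.0)    # plain SSE solution (lambda = 0)
w1 = ridge_fit(X, y, 10.0)   # regularized solution: smaller-norm weights
print(np.linalg.norm(w0), np.linalg.norm(w1))
```

Raising λ trades training-set fit for a smaller ||**w**||², which is one simple way the regularizer imposes a priori smoothness on the solution.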
