Learning in an artificial neural network, whether it is supervised, reinforcement, or unsupervised, can be viewed as a process of searching a multi-dimensional parameter space for a state which optimizes a predefined criterion function J (J is also commonly referred to as an error function, a cost function, or an objective function). In fact, all of the learning rules considered in Chapter 3, except Oja's rule (refer to Table 3.1), have well-defined analytical criterion functions. These learning rules implement local search mechanisms (i.e., gradient search) to obtain weight vector solutions which (locally) optimize the associated criterion function. Therefore, it is the criterion function that determines the nature of a learning rule. For example, supervised learning rules are designed so as to minimize an error measure between the network's output and the corresponding desired output, unsupervised Hebbian learning is designed to maximize the variance of the output of a given unit, and reinforcement learning is designed to maximize the average reinforcement signal.

Supervised learning can be related to classical approximation theory (Poggio and Girosi, 1990a). Here, the idea is to approximate or interpolate a continuous multivariate function g(x), from samples {x, g(x)}, by an approximation function (or class of functions) G(w, x), where w  Rd is a parameter vector with d degrees of freedom, and x belongs to a compact set   Rn. In this case, the set of samples {xg(x)}, x  , is referred to as a training set. The approximation problem is to find an optimal parameter vector that provides the "best" approximation of g on the set for a given class of functions G. Formally stated, we desire a solution w*  Rd such that

(4.1.1)

where the tolerance is a positive real number and is any appropriate norm. For the case where is the Euclidean norm, it is well known that an appropriate criterion function J is

(4.1.2)

whose global minimum represents the minimum sum of square error (SSE) solution. The choice of approximation function G, the criterion function J, and the search mechanism for w* all play critical roles in determining the quality and properties of the resulting solution/approximation.

Usually we know our objective and this knowledge is translated into an appropriate criterion function J. In terms of search mechanisms, gradient-based search is appropriate (and is simple to implement) for cases where it is known that J is differentiable and bounded. Of course, if gradient search is used, we must be willing to accept locally-optimal solutions. In general, only global search mechanisms, which are computationally very expensive, may lead to global optimal solutions (global search-based learning is the subject of Chapter 8). Among the above three factors, the most critical in terms of affecting the quality of the approximation in Equation (4.1.1) is the choice of approximation function G.

In classical analysis, polynomials and rational functions are typically used for function approximation. On the other hand, for artificial neural networks, the approximation functions are usually chosen from the class of smooth sigmoidal-type functions, and the approximation is constructed as a superposition of such sigmoidal functions. In general, there are two important issues in selecting an appropriate class of basis functions: universality and generalization. By universality, we mean the ability of G to represent, to any desired degree of accuracy, the class of functions g being approximated; in Chapter 2, we have established the universality of feedforward layered neural nets. On the other hand, generalization means the ability of G to correctly map new points x, drawn from the same underlying input distribution p(x), which were not seen during the learning phase. Thus, an interesting question here is how well does a neural network compare to other universal approximation functions in terms of generalization (some insight into answering this question is given in Section 5.2.5). Later in this chapter, we will see that for feedforward neural networks, the number of degrees of freedom (weights) in G plays a critical role in determining the degree of data overfitting, which directly affects generalization quality.

Another way to control generalization is through criterion function conditioning. A "regularization" term (Poggio and Girosi, 1990a) may be added to an initial criterion function according to

(4.1.3)

The ||P(w)||2 term in Equation (4.1.3) is used to imbed a priori information about the function g, such as smoothness, invariance, etc. In this case, is the quantity to be minimized subject to the regularization constraint that is kept "small" with the Lagrange multiplier determining the degree of regularization. These ideas can also be extended to unsupervised learning where the term may be thought of as a constraint satisfaction term; such a term may help condition the criterion so that the search process is stabilized. Examples of this regularization strategy have already been encountered in the unsupervised Hebbian-type learning rules presented in Chapter 3 [e.g., refer to Equation (3.3.8)].

Goto [4.0] [4.2] [4.3] [4.4] [4.5] [4.6] [4.7] [4.8] [4.9] [4.10]