4.8 Generalization

In Chapter 2, we analyzed the capabilities of some neural network architectures for realizing arbitrary mappings. We found that a feedforward neural net with a single hidden layer containing a sufficiently large number of sigmoidal activation units is capable of approximating any mapping (or continuous multivariate function) to within any desired degree of accuracy. However, the results of Chapter 2 on the computational capabilities of layered neural networks say nothing about the synthesis/learning procedure needed to set the interconnection weights of these networks. What remains to be seen is whether such networks are capable of finding the necessary weight configuration for a given mapping by employing a suitable learning algorithm.

Later chapters in this book address the question of learning in specific neural network architectures by extending appropriate learning rules covered in Chapter 3. In the remainder of this chapter, we address two important issues related to learning in neural networks: generalization and complexity. The following discussion is general in nature, and thus it holds for a wide range of neural network paradigms.

Generalization is measured by the ability of a trained network to generate the correct output for a new, randomly chosen input drawn from the same probability density p(x) that governs the training set. Two cases are considered in this section: average generalization and worst-case generalization. This section also considers the generalization capabilities of stochastic neural networks.

4.8.1 Generalization Capabilities of Deterministic Networks

One important performance measure of trainable neural networks is the size of the training set needed to bound their generalization error below some specified number. Schwartz et al. (1990) gave a theoretical framework for calculating the average probability of correct generalization for a neural net trained with a training set of size m. Here, the averaging is over all possible networks (of fixed architecture) consistent with the training set, and the only assumptions about the network's architecture are that it is deterministic (employs deterministic units) and that it is a universal architecture (or faithful model) of the class of functions/mappings being learned. It is assumed that these functions are of the form f: R^n → {0, 1}, but the ideas can be extended to multiple and/or continuous-valued outputs as well.

The following analysis is based on the theoretical framework of Schwartz et al. (1990), and it also draws on some clarifications given by Hertz et al. (1991). The main result of this analysis is rather surprising; one can calculate the average probability of correct generalization for any training set of size m if one knows a certain function that can (in theory) be calculated before training begins. However, it should be kept in mind that this result is only meaningful when interpreted in an average sense, and does not necessarily represent the typical situation encountered in a specific training scheme.

Consider a class of networks with a certain fixed architecture specified by the number of layers, the number of units within each layer, and the interconnectivity pattern between layers. Let us define the quantity V0 as

V_0 = \int P(\mathbf{w}) \, d\mathbf{w}    (4.8.1)

which stands for the total "volume" of the weight space, where w represents the weights of an arbitrary network and P(w) is some a priori weight probability density function. Thus, each network is represented as a point w in weight space which implements a function f_w(x). We may now partition the weight space into a set of disjoint regions, one for each function f_w that this class of networks can implement. The volume of the region of weight space that implements a particular function f(x) is given by

V_0(f) = \int P(\mathbf{w}) \, I[f, f_{\mathbf{w}}] \, d\mathbf{w}    (4.8.2)

where

I[f, f_{\mathbf{w}}] = \begin{cases} 1 & \text{if } f_{\mathbf{w}}(\mathbf{x}) = f(\mathbf{x}) \text{ for all } \mathbf{x} \\ 0 & \text{otherwise} \end{cases}

Each time an example {x^k, f_d(x^k)} of a desired function f_d is presented and is successfully learned (supervised learning is assumed), the weight vector w is modified so that it enters the region of weight space that is compatible with the presented example. If m examples are learned, then the volume of this region is given by

V_m = \int P(\mathbf{w}) \prod_{k=1}^{m} I(f_{\mathbf{w}}, \mathbf{x}^k) \, d\mathbf{w}    (4.8.3)

where

I(f_{\mathbf{w}}, \mathbf{x}^k) = \begin{cases} 1 & \text{if } f_{\mathbf{w}}(\mathbf{x}^k) = f_d(\mathbf{x}^k) \\ 0 & \text{otherwise} \end{cases}

The region Vm represents the total volume of weight space which realizes the desired function fd as well as all other functions f that agree with fd on the desired training set. Thus, if a new input is presented to the trained network it will be ambiguous with respect to a number of functions represented by Vm (recall the discussion on ambiguity for the simple case of a single threshold gate in Section 1.5). As the number of learned examples m is increased, the expected ambiguity decreases.
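
To make this shrinking-volume picture concrete, the following Monte Carlo sketch (not part of the original analysis) estimates the relative volumes of Equations (4.8.2) and (4.8.3) by sampling weights from the prior P(w). The choice of a single threshold gate on n Boolean inputs, the standard normal prior, and names such as f_w and w_teacher are all illustrative assumptions.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 3                                    # number of Boolean inputs
inputs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def f_w(w, x):
    # Boolean function implemented by threshold-gate weights w (last entry = bias)
    return (x @ w[:n] + w[n] > 0).astype(int)

# Draw candidate networks from the prior P(w); each sample is a point in weight space.
W = rng.standard_normal((20_000, n + 1))

# Choose the desired function f_d as the one implemented by a random "teacher" network.
w_teacher = rng.standard_normal(n + 1)
f_d = f_w(w_teacher, inputs)

# Relative volume V_0(f_d)/V_0 (Eq. 4.8.2): fraction of the prior implementing f_d exactly.
vol_fd = np.mean([np.array_equal(f_w(w, inputs), f_d) for w in W])
print("V_0(f_d)/V_0 ~", vol_fd)

# Relative volume V_m/V_0 (Eq. 4.8.3): fraction consistent with m random training examples.
for m in (1, 2, 4, 8):
    idx = rng.integers(0, len(inputs), size=m)
    xs, ys = inputs[idx], f_d[idx]
    vol_m = np.mean([np.array_equal(f_w(w, xs), ys) for w in W])
    print(f"m = {m}: V_m/V_0 ~ {vol_m:.3f}")

As m grows, the consistent fraction V_m/V_0 shrinks toward V_0(f_d)/V_0, mirroring the reduction in ambiguity described above.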

Next, the volume of weight space consistent with both the training examples and a "particular" function f is given by

V_m(f) = \int P(\mathbf{w}) \, I[f, f_{\mathbf{w}}] \prod_{k=1}^{m} I(f, \mathbf{x}^k) \, d\mathbf{w} = V_0(f) \prod_{k=1}^{m} I(f, \mathbf{x}^k)    (4.8.4)

where Equation (4.8.2) was used. Note that f_w in I has been replaced by f and that the product term factors outside of the integral. Now, assuming independent input vectors x^k generated randomly from the distribution p(x), the factors I(f, x^k) in Equation (4.8.4) are independent. Thus, averaging V_m(f) over all x^k gives

\langle V_m(f) \rangle = V_0(f) \, [g(f)]^m, \qquad \text{where } g(f) = \langle I(f, \mathbf{x}) \rangle_{p(\mathbf{x})}    (4.8.5)

The quantity g(f) takes on values between 0 and 1. It is referred to as the generalization ability of f; i.e., g(f) may be viewed as the probability that f(x) = f_d(x) for an input x randomly chosen from p(x). As an example, for a completely specified n-input Boolean function f_d, g(f) is given by (assuming that all 2^n inputs are equally likely)

g(f) = 1 - \frac{d_H(f, f_d)}{2^n}

where d_H(f, f_d) is the number of bits by which f and f_d differ.
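
For this Boolean case, g(f) is simply one minus the normalized Hamming distance between the truth tables of f and f_d. A minimal sketch follows; the truth tables used below are hypothetical examples.

import numpy as np

def generalization_ability(f_table, fd_table):
    # g(f) = probability that f(x) = f_d(x) under the uniform input density
    f_table = np.asarray(f_table)
    fd_table = np.asarray(fd_table)
    d_hamming = np.sum(f_table != fd_table)       # d_H(f, f_d)
    return 1.0 - d_hamming / f_table.size         # f_table.size = 2^n

# Example with n = 3: target f_d = XOR of the first two inputs (truth-table
# order x = 000, 001, ..., 111); f agrees with f_d on all but two patterns.
fd = [0, 0, 1, 1, 1, 1, 0, 0]
f  = [0, 0, 1, 0, 1, 1, 0, 1]
print(generalization_ability(f, fd))              # 1 - 2/8 = 0.75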

Let us define the probability P_m(f) that a particular function f can be implemented after training on m examples of f_d. This probability is equal to the average fraction of the remaining weight space that f occupies:

P_m(f) = \left\langle \frac{V_m(f)}{V_m} \right\rangle \approx \frac{\langle V_m(f) \rangle}{\langle V_m \rangle} = \frac{V_0(f) \, [g(f)]^m}{\sum_{f'} V_0(f') \, [g(f')]^m}    (4.8.6)

The approximation in Equation (4.8.6) is based on the assumption that V_m does not vary much with the particular training sequence; i.e., V_m ≈ ⟨V_m⟩ for each probable sequence. This assumption is expected to be valid as long as m is small compared to the total number of possible input combinations.

Good generalization requires that P_m(f) be small for functions f with low generalization ability g(f). Let us use Equation (4.8.6) to compute the distribution of generalization ability g(f) across all possible f's after successful training with m examples:

\rho_m(g) \propto g^m \rho_0(g)    (4.8.7)

where ρ_0(g) denotes the initial (m = 0) distribution of generalization ability over the functions realizable by the network.

Note that an exact ρ_m(g) can be derived by dividing the right-hand side of Equation (4.8.7) by the normalizing integral of g^m ρ_0(g) over 0 ≤ g ≤ 1. The above result is interesting since it allows us to compute, before learning, the distribution of generalization ability after training with m examples. The form of Equation (4.8.7) shows that the distribution ρ_m(g) tends to get concentrated at higher and higher values of g as more and more examples are learned. Thus, during learning, although the allowed volume of weight (or function) space shrinks, the remaining regions tend to have large generalization ability.
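
The concentration of ρ_m(g) near g = 1 can be checked numerically. The sketch below assumes an arbitrary initial density ρ_0(g) on a discrete grid and renormalizes g^m ρ_0(g) for increasing m; the Gaussian-shaped prior is only an illustrative choice.

import numpy as np

g = np.linspace(0.0, 1.0, 1001)
dg = g[1] - g[0]

# Assumed initial density rho_0(g); any normalized density on [0, 1] would do.
rho_0 = np.exp(-10.0 * (g - 0.6) ** 2)
rho_0 /= rho_0.sum() * dg

for m in (0, 10, 50, 200):
    rho_m = g ** m * rho_0                # unnormalized right-hand side of Eq. (4.8.7)
    rho_m /= rho_m.sum() * dg             # exact rho_m(g) after normalization
    mean_g = (g * rho_m).sum() * dg       # average generalization ability under rho_m
    print(f"m = {m:4d}: <g> ~ {mean_g:.3f}")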

Another useful measure of generalization is the average generalization ability G(m) given by

G(m) = \int_0^1 g \, \rho_m(g) \, dg = \frac{\int_0^1 g^{\,m+1} \rho_0(g) \, dg}{\int_0^1 g^{\,m} \rho_0(g) \, dg}    (4.8.8)

which is the ratio between the (m + 1)st and the mth moments of ρ_0(g) and can be computed if ρ_0(g) is given or estimated. G(m) gives the entire "learning curve"; i.e., it gives the average expected success rate as a function of m. Equation (4.8.8) allows us to predict the number of examples, m, necessary to train the network to a desired average generalization performance. We may also define the average prediction error as 1 - G(m). The asymptotic behavior (m → ∞) of the average prediction error is determined by the form of the initial distribution ρ_0(g) near g = 1. If a finite gap exists between g = 1 and the next highest g for which ρ_0(g) is nonzero, then the prediction error decays to zero exponentially in m. If, on the other hand, there is no such gap in ρ_0(g), then the prediction error decays as 1/m. These two behaviors of the learning curve have also been verified through numerical experiments. The nature of the gap in the distribution of generalizations near the region of perfect generalization (g = 1) is not completely understood. Such gaps have been detected in experiments involving the learning of binary mappings (Cohn and Tesauro, 1991 and 1992). It is speculated that the gap could be due to dynamic effects of the learning process, where the learning algorithm may, for some reason, avoid the observed near-perfect solutions. Another possibility is that the gap is inherent in the nature of the binary mappings themselves.
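
The two asymptotic behaviors of the prediction error 1 - G(m) can be illustrated by evaluating the moment ratio of Equation (4.8.8) for two assumed initial distributions, one with a finite gap below g = 1 and one without. Both priors in the sketch below are hypothetical.

import numpy as np

g = np.linspace(0.0, 1.0, 20001)
dg = g[1] - g[0]

# Assumed prior with a finite gap: continuous mass up to g = 0.9 plus a point mass at g = 1.
rho_gap = np.where(g <= 0.9, 1.0, 0.0)
rho_gap[-1] = 0.05 / dg

# Assumed prior without a gap: uniform density all the way up to g = 1.
rho_nogap = np.ones_like(g)

def G(rho, m):
    # Ratio of the (m+1)st to the mth moment of rho_0(g), Eq. (4.8.8)
    return (g ** (m + 1) * rho).sum() / (g ** m * rho).sum()

for m in (10, 100, 1000):
    print(f"m = {m:5d}: gap prior 1-G(m) = {1.0 - G(rho_gap, m):.2e}, "
          f"no-gap prior 1-G(m) = {1.0 - G(rho_nogap, m):.2e}")

For the gap prior the error falls off exponentially once m is large (the rate being set by the next-highest g, here 0.9), while the uniform no-gap prior gives approximately 1/(m + 2), i.e., a 1/m decay.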

The above approach, though theoretically interesting, is of little practical use for estimating m, since it requires knowledge of the distribution ρ_0(g), whose estimation is computationally expensive. It also gives results that are valid only in an average sense and do not necessarily represent the typical situation encountered in a specific training scheme.

Next, we summarize a result that characterizes the generalization ability of a deterministic feedforward neural network in the "worst" case. Here, also, the case of learning a binary-valued output function f: R^n → {0, 1} is treated. Consider a set of m labeled training example pairs (x, y) selected randomly from some arbitrary probability distribution p(x, y), with x ∈ R^n and y = f(x) ∈ {0, 1}. Also, consider a single-hidden-layer feedforward neural net with k LTGs and d weights that has been trained on the m examples so that at least a fraction 1 - ε/2 of the examples are correctly classified, where 0 < ε ≤ 1/8. Then, with a probability approaching 1, this network will correctly classify a fraction 1 - ε of future random test examples drawn from p(x, y), as long as (Baum and Haussler, 1989)

m \geq \frac{32 d}{\epsilon} \ln \frac{32 k}{\epsilon}    (4.8.9)

Ignoring the log term, we may write Equation (4.8.9), to a first order approximation, as

m \geq \frac{d}{\epsilon}    (4.8.10)

which requires m ≫ d for good generalization (i.e., for small ε). It is interesting to note that this is the same condition for "good" generalization (low ambiguity) for a single LTG derived by Cover (1965) (refer to Section 1.5) and obtained empirically by Widrow (1987). Therefore, one may note that in the limit of large m the architecture of the network is not important in determining the worst-case generalization behavior; what matters is the ratio of the number of degrees of freedom (weights) to the training set size. On the other hand, none of the above theories may hold for the case of a small training set. In this latter case, the size and architecture of the network and the learning scheme all play a role in determining generalization quality (see the next chapter for more details). It should also be noted that the architecture of the net can play an important role in determining the speed of convergence of a given class of learning methods, as discussed later in Section 4.9.
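
As a rough illustration of how the worst-case bound scales, the following sketch evaluates Equation (4.8.9) and the first-order approximation of Equation (4.8.10) for some hypothetical values of d, k, and ε.

import math

def baum_haussler_m(d, k, eps):
    # Training-set size sufficient for worst-case generalization error eps, Eq. (4.8.9)
    return (32.0 * d / eps) * math.log(32.0 * k / eps)

d, k = 1000, 20          # number of weights and number of LTGs (illustrative values)
for eps in (0.1, 0.05, 0.01):
    print(f"eps = {eps}: bound m >= {baum_haussler_m(d, k, eps):,.0f}"
          f"   (first-order m ~ d/eps = {d / eps:,.0f})")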

Similar results for worst case generalization are reported in Blumer et al. (1989). A more general learning curve based on statistical physics and VC dimension theories (Vapnik and Chervonenkis, 1971) which applies to a general class of networks can be found in Haussler et al. (1992). For generalization results with noisy target signals the reader is referred to Amari et al. (1992).

4.8.2 Generalization in Stochastic Networks

This section deals with the asymptotic learning behavior of a general stochastic learning dichotomy machine (classifier). We desire a relation between the generalization error and the training error in terms of the number of free parameters of the machine (machine complexity) and the size of the training set. The results in this section are based on the work of Amari and Murata (1993). Consider a parametric family of stochastic machines where a machine is specified by a d-dimensional parameter vector w such that the probability of output y, given an input x, is specified by P(y | x, w). As an example, one may assume the machine to be a stochastic multilayer neural network parameterized by a weight vector w ∈ R^d, which, for a given input x ∈ R^n, emits a binary output with probability

P(y = 1 \mid \mathbf{x}, \mathbf{w}) = f\big(g(\mathbf{x}, \mathbf{w})\big)    (4.8.11)

and

P(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - f\big(g(\mathbf{x}, \mathbf{w})\big)

where

f(u) = \frac{1}{1 + e^{-u}}    (4.8.12)

and g(x, w) may be considered as a smooth deterministic function (e.g., superposition of multivariate sigmoid functions typically employed in layered neural nets). Thus, in this example, the stochastic nature of the machine is determined by its stochastic output unit.
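
A minimal simulation of such a stochastic output unit is sketched below; taking g(x, w) to be a linear function of the input, and the particular weight and input values, are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def g(x, w):
    # Assumed smooth deterministic function; a linear map for illustration
    return x @ w

def stochastic_output(x, w):
    # Emit y = 1 with probability P(1 | x, w) = sigmoid(g(x, w)), Eq. (4.8.11)
    return int(rng.random() < sigmoid(g(x, w)))

w = np.array([1.5, -0.5, 0.25])
x = np.array([1.0, 2.0, -1.0])
samples = [stochastic_output(x, w) for _ in range(10_000)]
print("empirical  P(y = 1 | x, w):", np.mean(samples))
print("theoretical P(y = 1 | x, w):", sigmoid(g(x, w)))

Averaged over many trials, the empirical firing frequency approaches f(g(x, w)), as required by Equation (4.8.11).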

Assume that there exists a true machine that can be faithfully represented by one of the above family of stochastic machines with parameter w^0. The true machine receives inputs x^k, k = 1, 2, ..., m, which are randomly generated according to a fixed but unknown probability distribution p(x), and emits the corresponding outputs y^k. The machine characterized by the maximum likelihood estimator ŵ (refer to Section 3.1.5 for a definition) will be our candidate trained machine. This machine predicts y for a given x with probability P(y | x, ŵ). An entropic loss function is used to evaluate the generalization of a trained machine for a new example (x^{m+1}, y^{m+1}).

Let ε_gen(m) be the average predictive entropy (also known as the average entropic loss) of a trained machine parameterized by ŵ for a new example (x^{m+1}, y^{m+1}):

\varepsilon_{\mathrm{gen}}(m) = \big\langle -\ln P\big(y^{m+1} \mid \mathbf{x}^{m+1}, \hat{\mathbf{w}}\big) \big\rangle    (4.8.13)

Similarly, we define ε_train(m) as the average entropic loss over the m training examples used to obtain ŵ:

\varepsilon_{\mathrm{train}}(m) = \left\langle -\frac{1}{m} \sum_{k=1}^{m} \ln P\big(y^{k} \mid \mathbf{x}^{k}, \hat{\mathbf{w}}\big) \right\rangle    (4.8.14)

Finally, let H_0 be the average entropic error of the true machine:

H_0 = \big\langle -\ln P(y \mid \mathbf{x}, \mathbf{w}^0) \big\rangle    (4.8.15)
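
The three entropic quantities can be estimated numerically for a simple faithful machine. The sketch below uses a single stochastic logistic output unit as the machine family, generates data from a "true" parameter w0, and obtains the maximum likelihood estimate w_hat by plain gradient descent; the dimensions, sample sizes, and training procedure are illustrative assumptions rather than the setup of Amari and Murata.

import numpy as np

rng = np.random.default_rng(1)
d, m, m_test = 5, 200, 20_000

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def entropic_loss(w, X, y):
    # Average -ln P(y | x, w) for a logistic machine
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w0 = rng.standard_normal(d)                      # parameter of the true machine
X_train = rng.standard_normal((m, d))
y_train = (rng.random(m) < sigmoid(X_train @ w0)).astype(float)

# Maximum likelihood estimate w_hat by plain gradient descent on the training loss.
w_hat = np.zeros(d)
for _ in range(5000):
    grad = X_train.T @ (sigmoid(X_train @ w_hat) - y_train) / m
    w_hat -= 0.5 * grad

X_test = rng.standard_normal((m_test, d))
y_test = (rng.random(m_test) < sigmoid(X_test @ w0)).astype(float)

print("train entropic loss (Eq. 4.8.14):", entropic_loss(w_hat, X_train, y_train))
print("gen. entropic loss  (Eq. 4.8.13):", entropic_loss(w_hat, X_test, y_test))
print("H_0 of true machine (Eq. 4.8.15):", entropic_loss(w0, X_test, y_test))

With m = 200 and d = 5, the measured gap between the generalization and training losses should be roughly of order d/m = 0.025, anticipating Theorem 4.8.1 below (the theorem concerns an average over training sets, so a single run shows statistical fluctuations).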

Amari and Murata proved the following theorem for training and generalization error.

Theorem 4.8.1 (Amari and Murata, 1993): The asymptotic learning curve for the entropic training error is given by

\varepsilon_{\mathrm{train}}(m) \simeq H_0 - \frac{d}{2m}    (4.8.16)

and for the entropic generalization error by

\varepsilon_{\mathrm{gen}}(m) \simeq H_0 + \frac{d}{2m}    (4.8.17)

The proof of Theorem 4.8.1 uses standard techniques of asymptotic statistics and is omitted here; the reader is referred to the original paper by Amari and Murata (1993) for the proof.

In general, H0 is unknown and it can be eliminated from Equation (4.8.17) by substituting its value from Equation (4.8.16). This gives

\varepsilon_{\mathrm{gen}}(m) \simeq \varepsilon_{\mathrm{train}}(m) + \frac{d}{m}    (4.8.18)

which shows that, for a faithful stochastic machine in the limit m ≫ d, the generalization error approaches the training error of the machine trained on m examples, which by Equation (4.8.16) in turn approaches the entropic error H_0 of the true machine.

Again, the particular network architecture is of no importance here as long as it allows a faithful realization of the true machine and m ≫ d. It is interesting to note that this result is similar to the worst-case learning curve for deterministic machines [Equation (4.8.10)] when the training error is zero. The result is also in agreement with Cover's result on classifier ambiguity in Equation (1.5.3), where the d/m term in Equation (4.8.18) may be viewed as the probability of an ambiguous response on the (m + 1)st input. In fact, Amari (1993) also proved that the average predictive entropy ε_gen(m) for a general deterministic dichotomy machine (e.g., a feedforward neural net classifier) converges to 0 at a rate proportional to d/m in the limit m → ∞.
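
In practice, Equation (4.8.18) lets one predict the entropic generalization error of a faithful stochastic machine directly from its training error, without a separate test set. The training-error values in this short illustration are hypothetical.

# Predict eps_gen from eps_train via Eq. (4.8.18): eps_gen ~ eps_train + d/m
d = 5                        # number of free parameters of the machine (illustrative)
for m, eps_train in ((100, 0.301), (1000, 0.322), (10_000, 0.3245)):
    eps_gen_pred = eps_train + d / m
    print(f"m = {m:6d}: eps_train = {eps_train:.4f} -> predicted eps_gen ~ {eps_gen_pred:.4f}")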
