**4.8.1 Generalization Capabilities of Deterministic Networks**

One important performance measure of trainable neural
networks is the size of the training set needed to bound their
generalization error below some specified number. Schwartz et
al. (1990) gave a theoretical framework for calculating the average
probability of correct generalization for a neural net trained
with a training set of size *m*. Here, the averaging is
over all possible networks (of fixed architecture) consistent
with the training set, and the only assumptions about the network's
architecture are that it is deterministic (employs deterministic
units) and that it is a universal architecture (or faithful model)
of the class of functions/mappings being learned. It is assumed
that these functions are of the form *f* : R*n* → {0, 1},
but the ideas can be extended to multiple and/or continuous-valued
outputs as well.

The following analysis is based on the theoretical
framework of Schwartz et al. (1990), and it also draws on some
clarifications given by Hertz et al. (1991). The main result
of this analysis is rather surprising: one can calculate the average
probability of correct generalization for any training set of
size *m* if one knows a certain function that can (in theory)
be calculated before training begins. However, it should be kept
in mind that this result is only meaningful when interpreted in
an average sense, and does not necessarily represent the typical
situation encountered in a specific training scheme.

Consider a class of networks with a certain fixed
architecture specified by the number of layers, the number of
units within each layer, and the interconnectivity pattern between
layers. Let us define the quantity *V*0 as

$$V_0 = \int \rho(\mathbf{w})\, d\mathbf{w} \tag{4.8.1}$$

which stands for the total "volume" of
the weight space, where **w** represents the weights of an
arbitrary network and ρ(**w**) is some *a priori* weight probability
density function. Thus, each network is represented as a point
**w** in weight space which implements a function *f***w**(**x**).
We may now partition the weight space into a set of disjoint
regions, one for each function *f***w**
that this class of networks can implement. The volume of the
region of weight space that implements a particular function *f*(**x**)
is given by

$$V_0(f) = \int \rho(\mathbf{w})\,\Theta(f_{\mathbf{w}}, f)\, d\mathbf{w} \tag{4.8.2}$$

where $\Theta(f_{\mathbf{w}}, f) = 1$ if $f_{\mathbf{w}}(\mathbf{x}) = f(\mathbf{x})$ for all inputs **x**, and 0 otherwise.

Each time an example {**x***k*, *f**d*(**x***k*)}
of a desired function *f**d*
is presented and is successfully learned (supervised learning
is assumed), the weight vector **w** is modified so that it
enters the region of weight space that is compatible with the
presented example. If *m* examples are learned, then the
volume of this region is given by

$$V_m = \int \rho(\mathbf{w}) \prod_{k=1}^{m} I(f_{\mathbf{w}}, \mathbf{x}^k)\, d\mathbf{w} \tag{4.8.3}$$

where

$$I(f_{\mathbf{w}}, \mathbf{x}^k) = \begin{cases} 1 & \text{if } f_{\mathbf{w}}(\mathbf{x}^k) = f_d(\mathbf{x}^k) \\ 0 & \text{otherwise} \end{cases}$$

The region *V**m*
represents the total volume of weight space which realizes the
desired function *f**d*
as well as all other functions *f*
that agree with *f**d*
on the training set. Thus, if
a new input is presented to the trained network it will be ambiguous
with respect to a number of functions represented by *V**m*
(recall the discussion on ambiguity for the simple case of a single
threshold gate in Section 1.5). As the number of learned examples
*m* is increased, the expected ambiguity decreases.
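The shrinking volume *V**m* of Equation (4.8.3) can be estimated numerically by Monte Carlo sampling: draw weight vectors from the prior ρ(**w**) and count the fraction consistent with the first *m* training examples. The sketch below (not from the text) does this for a single LTG with a standard Gaussian weight prior and a linearly separable target *f**d*; all specifics are illustrative assumptions.

```python
import random

def ltg(w, x):
    """Single linear threshold gate: f_w(x) = 1 if w.x > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

rng = random.Random(0)
n = 5
w_true = [rng.gauss(0, 1) for _ in range(n)]              # defines f_d
examples = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(16)]
labels = [ltg(w_true, x) for x in examples]

def volume_fraction(m, samples=10_000):
    """Monte Carlo estimate of V_m / V_0 under the Gaussian prior rho(w)."""
    hits = 0
    for _ in range(samples):
        w = [rng.gauss(0, 1) for _ in range(n)]
        if all(ltg(w, x) == y for x, y in zip(examples[:m], labels[:m])):
            hits += 1
    return hits / samples

for m in (0, 4, 16):
    print(m, volume_fraction(m))   # the consistent fraction shrinks with m
```

Each learned example removes the part of weight space that disagrees with it, which is the geometric content of Equation (4.8.3).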

Next, the volume of weight space consistent with
both the training examples and a "particular" function
*f* is given by

$$V_m(f) = V_0(f) \prod_{k=1}^{m} I(f, \mathbf{x}^k) \tag{4.8.4}$$

where Equation (4.8.2) was used. Note that *f***w**
in *I* has been replaced by *f* (the two functions agree
over the region of integration) and that the product
term factors outside of the integral. Now, assuming independent
input vectors **x***k*
generated randomly from the distribution *p*(**x**), the factors
*I*(*f*, **x***k*)
in Equation (4.8.4) are independent. Thus, averaging *V**m*(*f *)
over all **x***k*
gives

$$\left\langle V_m(f) \right\rangle = V_0(f)\,[g(f)]^m, \qquad g(f) = \big\langle I(f, \mathbf{x}) \big\rangle = \sum_{\mathbf{x}} p(\mathbf{x})\, I(f, \mathbf{x}) \tag{4.8.5}$$

The quantity *g*(*f *) takes on values
between 0 and 1. It is referred to as the generalization ability
of *f *; i.e., *g*(*f *) may be viewed
as the probability that *f*(**x**) = *f**d*(**x**) for an input **x** randomly chosen
from *p*(**x**). As an example, for a completely specified
*n*-input Boolean function *f**d*,
*g*(*f *) is given by (assuming that all 2*n*
inputs are equally likely)

$$g(f) = 1 - \frac{d_H(f, f_d)}{2^n}$$

where *d**H*(*f*, *f**d*) is the
number of bits by which *f* and *f**d*
differ.
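A small sketch of the Boolean-function example above, with functions represented as truth tables; the XOR/OR pair and the helper names are illustrative choices, not from the text:

```python
# Generalization ability g(f) = 1 - d_H(f, f_d) / 2^n for n-input Boolean
# functions represented as length-2^n truth tables.

def hamming(f, f_d):
    """Number of input patterns on which the two truth tables differ."""
    return sum(a != b for a, b in zip(f, f_d))

def g(f, f_d):
    """Probability that f agrees with f_d on a uniformly random input."""
    return 1.0 - hamming(f, f_d) / len(f_d)

# n = 2: f_d is XOR; f is OR, which disagrees with XOR only on input (1, 1).
f_d = (0, 1, 1, 0)   # truth table of XOR over inputs 00, 01, 10, 11
f   = (0, 1, 1, 1)   # truth table of OR
print(g(f, f_d))     # 0.75: f matches f_d on 3 of the 4 inputs
```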

Let us define the probability *P**m*(*f *)
that a particular function *f* can be implemented after training
on *m* examples of *f**d*.
This probability is equal to the average fraction of the remaining
weight space that *f* occupies:

$$P_m(f) = \left\langle \frac{V_m(f)}{V_m} \right\rangle \approx \frac{\langle V_m(f)\rangle}{\langle V_m\rangle} = \frac{V_0(f)\,[g(f)]^m}{\sum_{f'} V_0(f')\,[g(f')]^m} \tag{4.8.6}$$

The approximation in Equation (4.8.6) is based on
the assumption that *V**m*
does not vary much with the particular training sequence; i.e.,
*V**m* ≈ ⟨*V**m*⟩
for each probable sequence. This assumption is expected to be
valid as long as *m* is small compared to the total number
of possible input combinations.

Good generalization requires that *P**m*(*f *)
be small for functions *f* of low generalization ability. Let
us use Equation (4.8.6) to compute the distribution
of generalization ability *g*(*f *) across all
possible *f *'s after successful training with *m*
examples:

$$\rho_m(g) = \sum_f P_m(f)\,\delta\big(g - g(f)\big) \propto \rho_0(g)\, g^m \tag{4.8.7}$$

where ρ0(*g*) is the corresponding distribution of *g*
before training. Note that an exact ρ*m*(*g*)
can be derived by dividing the right-hand side of Equation (4.8.7)
by $\int_0^1 \rho_0(g')\,(g')^m\, dg'$. The above result is interesting since it allows us to compute,
before learning, the distribution of generalization ability after
training with *m* examples. The form of Equation (4.8.7)
shows that the distribution ρ*m*(*g*)
tends to get concentrated at higher and higher values of *g*
as more and more examples are learned. Thus, during learning,
although the allowed volume of weight (or function) space shrinks,
the remaining regions tend to have large generalization ability.
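The reweighting ρ*m*(*g*) ∝ ρ0(*g*)*g**m* of Equation (4.8.7) can be checked on a discretized *g*-axis; the initial density ρ0 below is invented purely for illustration:

```python
import numpy as np

g = np.linspace(0.0, 1.0, 1001)
rho0 = (1.0 - g) ** 2 + 0.05     # illustrative initial density rho_0(g)

def mean_g(m):
    """Mean of rho_m(g), with rho_m obtained by the g**m reweighting."""
    w = rho0 * g ** m            # unnormalized rho_m(g) of Eq. (4.8.7)
    return float((g * w).sum() / w.sum())

for m in (0, 10, 100):
    print(f"m = {m:3d}  mean generalization ability = {mean_g(m):.3f}")
```

The mean of ρ*m*(*g*) increases monotonically with *m*, illustrating the concentration at high *g* described above.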

Another useful measure of generalization is the
average generalization ability *G*(*m*) given by

$$G(m) = \int_0^1 g\,\rho_m(g)\,dg = \frac{\int_0^1 g^{\,m+1}\rho_0(g)\,dg}{\int_0^1 g^{\,m}\rho_0(g)\,dg} \tag{4.8.8}$$

which is the ratio between the (*m *+ 1)st
and the *m*th moments of ρ0(*g*)
and can be computed if ρ0(*g*)
is given or estimated. *G*(*m*) gives the entire "learning
curve"; i.e., it gives the average expected success rate
as a function of *m*. Equation (4.8.8) allows us to predict
the number of examples, *m*, necessary to train the network
to a desired average generalization performance. We may also
define the average prediction error as 1 − *G*(*m*).
The asymptotic behavior (*m* → ∞) of the average
prediction error is determined by the form of the initial distribution
ρ0(*g*) near *g* = 1.
If a finite gap of width Δ exists between *g* = 1 and the next
highest *g* for which ρ0(*g*)
is nonzero, then the prediction error decays to zero exponentially,
as $e^{-\Delta m}$. If, on the other hand, there is no such gap in ρ0(*g*),
then the prediction error decays only as 1/*m*. These two behaviors of
the learning curve have also been verified through numerical experiments.
The nature of the gap in the distribution of generalizations
near the region of perfect generalization (*g* = 1)
is not completely understood. These gaps have been detected in
experiments involving the learning of binary mappings (Cohn and
Tesauro, 1991; 1992). It is speculated that such a gap could
be due to the dynamic effects of the learning process, where the
learning algorithm may, for some reason, avoid the observed near-perfect
solutions. Another possibility is that the gap is inherent in
the nature of the binary mappings themselves.
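Both decay regimes can be checked numerically from Equation (4.8.8) by computing 1 − *G*(*m*) as a ratio of moments of ρ0(*g*). The two initial densities below (one with a gap below *g* = 1, one with support up to *g* = 1) are illustrative assumptions, not from the text:

```python
import numpy as np

def prediction_error(g, rho0, m):
    """1 - G(m): one minus the ratio of the (m+1)st to the m-th moment."""
    w = rho0 * g ** m
    return float(1.0 - (g * w).sum() / w.sum())

# Gap case: an atom of weight at g = 1 plus density supported on g <= 0.8.
g_gap = np.concatenate([np.linspace(0.0, 0.8, 801), [1.0]])
rho_gap = np.concatenate([np.ones(801), [50.0]])

# No-gap case: a uniform density with support all the way up to g = 1,
# for which, in the continuum limit, 1 - G(m) = 1/(m + 2).
g_ng = np.linspace(0.0, 1.0, 20_001)
rho_ng = np.ones_like(g_ng)

for m in (10, 100, 1000):
    print(m, prediction_error(g_gap, rho_gap, m),
             prediction_error(g_ng, rho_ng, m))
# The gap case decays exponentially in m; the no-gap case only as 1/m.
```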

The above approach, though theoretically interesting,
is of little practical use for estimating *m*, since it requires
the knowledge of the distribution ρ0(*g*),
whose estimation is computationally expensive. It also gives
results that are only valid in an average sense, and does not
necessarily represent the typical situation encountered in a specific
training scheme.

Next, we summarize a result that tells us about
the generalization ability of a deterministic feedforward neural
network in the "worst" case. Here, also, the case of
learning a binary-valued output function *f* : R*n* → {0, 1}
is treated. Consider a set of *m* labeled training example
pairs (**x**, *y*) selected randomly from some arbitrary
probability distribution *p*(**x**, *y*), with **x** ∈ R*n*
and *y* = *f*(**x**) ∈ {0, 1}. Also, consider a single hidden layer feedforward neural
net with *k* LTGs and *d* weights that has been trained
on the *m* examples so that at least a fraction 1 − ε/2
of the examples is correctly classified, where 0 < ε ≤ 1/8. Then, with a probability
approaching 1, this network will correctly classify the fraction
1 − ε of future random test examples drawn from *p*(**x**,
*y*), as long as (Baum and Haussler, 1989)

$$m \ge \frac{32d}{\epsilon} \ln\frac{32k}{\epsilon} \tag{4.8.9}$$

Ignoring the log term, we may write Equation (4.8.9),
to a first order approximation, as

$$m \ge \frac{d}{\epsilon} \tag{4.8.10}$$

which requires *m* ≫ *d* for good generalization (i.e., for small ε).
It is interesting to note that this is the same condition for
"good" generalization (low ambiguity) for a single LTG
derived by Cover (1965) (refer to Section 1.5) and obtained empirically
by Widrow (1987). Therefore, one may note that in the limit of
large *m* the architecture of the network is not important
in determining the worst case generalization behavior; what matters
is the ratio of the number of degrees of freedom (weights) to
the training set size. On the other hand, none of the above theories
may hold for the case of a small training set. In this latter
case, the size and architecture of the network and the learning
scheme all play a role in determining generalization quality (see
the next chapter for more details). It should also be noted that
the architecture of the net can play an important role in determining
the speed of convergence of a given class of learning methods,
as discussed later in Section 4.9.
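As a quick numerical sketch of the worst-case bound (4.8.9) against the first-order estimate (4.8.10); the network sizes chosen below are arbitrary:

```python
# Sample-size bound of Eq. (4.8.9): m >= (32 d / eps) * ln(32 k / eps) for a
# single-hidden-layer net with k LTGs and d weights, trained to error at
# most eps/2, to generalize with error at most eps (Baum and Haussler, 1989).
from math import ceil, log

def examples_needed(d, k, eps):
    """Worst-case training-set size from Eq. (4.8.9); eps in (0, 1/8]."""
    if not 0.0 < eps <= 1 / 8:
        raise ValueError("the bound is stated for 0 < eps <= 1/8")
    return ceil(32 * d / eps * log(32 * k / eps))

# A net with k = 10 hidden LTGs over n = 20 inputs: 10*(20+1) hidden weights
# plus 10+1 output weights gives d = 221.
d, k = 10 * 21 + 11, 10
print(examples_needed(d, k, 0.1))   # worst-case bound
print(d / 0.1)                      # first-order estimate of Eq. (4.8.10)
```

The log factor makes the worst-case bound far larger than *d*/ε, but both scale linearly in the number of weights *d*.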

Similar results for worst case generalization are
reported in Blumer et al. (1989). A more general learning curve
based on statistical physics and VC dimension theories (Vapnik
and Chervonenkis, 1971) which applies to a general class of networks
can be found in Haussler et al. (1992). For generalization results
with noisy target signals the reader is referred to Amari et al.
(1992).

**4.8.2 Generalization in Stochastic Networks**

This section deals with the asymptotic learning
behavior of a general stochastic learning dichotomy machine (classifier).
We desire a relation between the generalization error and the
training error in terms of the number of free parameters of the
machine (machine complexity) and the size of the training set.
The results in this section are based on the work of Amari and
Murata (1993). Consider a parametric family of stochastic machines
where a machine is specified by a *d*-dimensional parameter
vector **w** such that the probability of output *y*,
given an input **x**, is specified by *P*(*y* | **x**,
**w**). As an example, one may assume the machine to be a
stochastic multilayer neural network parameterized by a weight
vector **w** ∈ R*d*,
which, for a given input **x** ∈ R*n*,
emits a binary output with probability

$$P(y = 1 \mid \mathbf{x}, \mathbf{w}) = \varphi\big(g(\mathbf{x}, \mathbf{w})\big) \tag{4.8.11}$$

and

$$P(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - \varphi\big(g(\mathbf{x}, \mathbf{w})\big)$$

where φ is the logistic sigmoid

$$\varphi(u) = \frac{1}{1 + e^{-u}} \tag{4.8.12}$$

and *g*(**x**, **w**) may be considered
as a smooth deterministic function (e.g., a superposition of multivariate
sigmoid functions typically employed in layered neural nets).
Thus, in this example, the stochastic nature of the machine is
determined by its stochastic output unit.
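A minimal sketch of such a stochastic output unit, assuming a linear *g*(**x**, **w**) purely for illustration:

```python
# Stochastic machine in the spirit of Eqs. (4.8.11)-(4.8.12): a smooth
# deterministic g(x, w) feeding a stochastic binary output unit. The linear
# form of g below is an illustrative choice, not from the text.
import math, random

def phi(u):
    """Logistic squashing function of Eq. (4.8.12)."""
    return 1.0 / (1.0 + math.exp(-u))

def stochastic_output(x, w, rng):
    """Emit y = 1 with probability phi(g(x, w)), else y = 0."""
    g = sum(wi * xi for wi, xi in zip(w, x))   # illustrative g(x, w)
    return 1 if rng.random() < phi(g) else 0

rng = random.Random(0)
w = [1.5, -2.0]
x = [1.0, 0.5]
ys = [stochastic_output(x, w, rng) for _ in range(10_000)]
print(sum(ys) / len(ys))   # close to phi(0.5), i.e. about 0.62
```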

Assume that there exists a true machine that can
be faithfully represented by one of the above family of stochastic
machines with parameter **w**0.
The true machine receives inputs **x***k*,
*k* = 1, 2, ..., *m*, which are randomly generated according
to a fixed but unknown probability distribution *p*(**x**),
and emits *y**k*.
The machine characterized by the maximum likelihood estimator **ŵ**
(refer to Section 3.1.5 for the definition) will be our first candidate
machine. This machine predicts *y* for a given **x**
with probability *P*(*y* | **x**, **ŵ**). An entropic
loss function is used to evaluate the generalization of a trained
machine for a new example (**x***m*+1, *y**m*+1).

Let ε*gen*
be the average predictive entropy (also known as the average entropic
loss) of a trained machine parameterized by **ŵ** for a new example
(**x***m*+1, *y**m*+1):

$$\varepsilon_{gen}(m) = \big\langle -\ln P(y_{m+1} \mid \mathbf{x}_{m+1}, \hat{\mathbf{w}}) \big\rangle \tag{4.8.13}$$

Similarly, we define ε*train*
as the average entropic loss over the *m* training examples used to
obtain **ŵ**:

$$\varepsilon_{train}(m) = \left\langle -\frac{1}{m} \sum_{k=1}^{m} \ln P(y_k \mid \mathbf{x}_k, \hat{\mathbf{w}}) \right\rangle \tag{4.8.14}$$

Finally, let *H*0
be the average entropic error of the true machine:

$$H_0 = \big\langle -\ln P(y \mid \mathbf{x}, \mathbf{w}_0) \big\rangle \tag{4.8.15}$$

Amari and Murata proved the following theorem for
training and generalization error.

**Theorem 4.8.1 (Amari and Murata, 1993)**:
*The asymptotic learning curve for the entropic training error
is given by*

$$\varepsilon_{train}(m) = H_0 - \frac{d}{2m} \tag{4.8.16}$$

*and for the entropic generalization error by*

$$\varepsilon_{gen}(m) = H_0 + \frac{d}{2m} \tag{4.8.17}$$

The proof of Theorem 4.8.1 uses standard techniques
of asymptotic statistics and is omitted here. (The reader is
referred to the original paper by Amari and Murata (1993) for
the proof.)

In general, *H*0
is unknown, but it can be eliminated from Equation (4.8.17) by
substituting its value from Equation (4.8.16). This gives

$$\varepsilon_{gen}(m) = \varepsilon_{train}(m) + \frac{d}{m} \tag{4.8.18}$$

which shows that for a faithful stochastic machine,
and in the limit *m* ≫ *d*, the generalization error
approaches the error of the trained machine on the *m* training examples,
which from Equation (4.8.16) is asymptotically the classification error *H*0
of the true machine.

Again, the particular network architecture is of
no importance here as long as it allows for a faithful realization
of the true machine and *m* ≫ *d*.
It is interesting to note that this result is similar to the
worst case learning curve for deterministic machines [Equation
(4.8.10)] when the training error is zero. The result is also
in agreement with Cover's result on classifier ambiguity in Equation
(1.5.3), where the term *d*/*m* in Equation (4.8.18) may be viewed as
the probability of an ambiguous response on the (*m *+ 1)st
input. In fact, Amari (1993) also proved that the average predictive
entropy ε*gen*(*m*)
for a general deterministic dichotomy machine (e.g., a feedforward
neural net classifier) converges to 0 as *d*/*m* in the limit *m* → ∞.
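The gap of Equation (4.8.18) can be illustrated by simulation. The sketch below assumes a small (*d* = 2) logistic true machine, fits the maximum likelihood estimator by crude gradient ascent, and compares the entropic training and test losses; all particulars (**w**0, *p*(**x**), the fitting loop) are illustrative assumptions, not from the text.

```python
import math, random

def phi(u):
    return 1.0 / (1.0 + math.exp(-u))

def entropic_loss(data, w):
    """Average -ln P(y | x, w) for the logistic machine (clamped for safety)."""
    total = 0.0
    for x, y in data:
        p = phi(w[0] * x[0] + w[1] * x[1])
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

def fit_mle(data, steps=400, lr=0.5):
    """Crude batch gradient ascent on the log likelihood (adequate for d = 2)."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in data:
            err = y - phi(w[0] * x[0] + w[1] * x[1])
            g0 += err * x[0]
            g1 += err * x[1]
        w[0] += lr * g0 / len(data)
        w[1] += lr * g1 / len(data)
    return w

rng = random.Random(1)
w0, m, d = [1.0, -0.5], 50, 2          # true machine and training-set size

def sample(n_samples):
    out = []
    for _ in range(n_samples):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        y = 1 if rng.random() < phi(w0[0] * x[0] + w0[1] * x[1]) else 0
        out.append((x, y))
    return out

test_set = sample(20_000)              # proxy for the true distribution
gaps = []
for _ in range(50):
    train_set = sample(m)
    w_hat = fit_mle(train_set)
    gaps.append(entropic_loss(test_set, w_hat) - entropic_loss(train_set, w_hat))
print(sum(gaps) / len(gaps))           # on the order of d/m = 0.04
```

Averaged over many training sets, the test-minus-train gap hovers near *d*/*m*, as Equation (4.8.18) predicts.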
