Consider the training set {**x**1, **x**2, ..., **x***m*}
and a class of decision surfaces [e.g., φ-surfaces associated with
a PTG(*r*)] which separate this set. The classification
of a new pattern **y** is ambiguous, relative to the given
class of decision surfaces and with respect to the training set,
if there exist two decision surfaces that both correctly classify
the training set but yield different classifications of the new
pattern. In Figure 1.5.1, points **y**1
and **y**3 are unambiguous,
but point **y**2 is
ambiguous for the class of linear decision surfaces. In the context
of a single threshold gate, it can be shown (Cover, 1965) that
the number of training patterns must exceed the statistical capacity
of a threshold gate before ambiguity is eliminated.

Generalization is the ability of a network (here,
a PTG) to correctly classify new patterns. Consider the problem
of generalizing from a training set with respect to a given admissible
family of decision surfaces (e.g., the family of surfaces that
can be implemented by a PTG). During the training phase, a decision
(separating) surface from the admissible class is synthesized
which correctly assigns members of the training set to the desired
categories. A new pattern is then assigned to the category associated
with the side of the decision surface on which it falls. Clearly, for
some dichotomies of the training set, the assignment of category
will not be unique. However, it is generally known that if a
large number of training patterns are used, the decision surface
will be sufficiently constrained to yield a unique response to
a new pattern. Next, we show that the number of training patterns
must exceed the statistical capacity of the PTG before unique
response becomes probable.
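
The ambiguity just described can be sketched numerically. In the following illustration (the coordinates and separating lines are our own, not those of Figure 1.5.1), two lines both classify a two-point training set correctly yet disagree on a new pattern:

```python
# Two linear decision surfaces in the plane, written as
# sign(w1*x + w2*y + b), that both separate a tiny training set
# but disagree on a new point -- i.e., the new point is ambiguous.

def classify(w, point):
    """Return +1 or -1 according to the side of the line w = (w1, w2, b)."""
    w1, w2, b = w
    x, y = point
    return 1 if w1 * x + w2 * y + b > 0 else -1

training = [((0.0, 0.0), -1), ((2.0, 2.0), +1)]  # (pattern, label) pairs
line_a = (1.0, 1.0, -1.0)   # the line x + y - 1 = 0
line_b = (1.0, 1.0, -3.0)   # the line x + y - 3 = 0

# Both lines classify the training set correctly ...
for pattern, label in training:
    assert classify(line_a, pattern) == label
    assert classify(line_b, pattern) == label

# ... yet they disagree on the new pattern (1, 1): it is ambiguous.
new_pattern = (1.0, 1.0)
print(classify(line_a, new_pattern), classify(line_b, new_pattern))  # prints: 1 -1
```

The disagreement of the two admissible surfaces on the point (1, 1) is precisely what makes that pattern ambiguous relative to the class of lines in the plane.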

Figure 1.5.1. Ambiguous generalization. The points
**y**1 and **y**3
are uniquely classified regardless of which of the two linear
decision surfaces shown is used. The point **y**2
has an ambiguous classification.

**Theorem 1.5.1 (Cover, 1965):**
Let *X* ∪ {**y**} = {**x**1, **x**2, ..., **x***m*, **y**}
be in φ-general position (so that its φ-image is in general position
in *R*^(*d*−1)). Then **y** is ambiguous with respect to
*C*(*m*, *d* − 2) dichotomies of *X* relative to the class of all
φ-surfaces.

Proof of Theorem 1.5.1:

The point **y** is ambiguous with respect to
a dichotomy {*X*+, *X*−} of *X* = {**x**1, **x**2, ..., **x***m*}
if and only if there exists a φ-surface containing **y** which
separates {*X*+, *X*−}. This is because, when *X* is in φ-general
position, any φ-surface through **y** that realizes the dichotomy
{*X*+, *X*−} can be shifted infinitesimally to allow arbitrary
classification of **y** without affecting the separation of
{*X*+, *X*−}. Equivalently, since φ-general position of *X* ∪ {**y**}
implies that the set of points
*Z* ∪ {φ(**y**)} = {φ(**x**1), φ(**x**2), ..., φ(**x***m*), φ(**y**)}
is in general position in *R*^(*d*−1), the point φ(**y**) is ambiguous
with respect to the linear dichotomy {*Z*+, *Z*−}
(here, each linear dichotomy of *Z*, {*Z*+, *Z*−},
corresponds to a unique φ-separable dichotomy {*X*+, *X*−}
in the input space). Hence, the point φ(**y**) is ambiguous
with respect to *D* linear dichotomies {*Z*+, *Z*−}
which can be separated by a (*d* − 2)-dimensional hyperplane
constrained to pass through φ(**y**). Constraining the
hyperplane to pass through a point effectively reduces the dimension
of the space by one. Thus, by Theorem 1.3.2, *D* = *C*(*m*, *d* − 2).
Now, by noting the one-to-one correspondence between the linearly
separable dichotomies in φ-space and the φ-separable dichotomies
in the input space, we establish that **y** is ambiguous
with respect to *C*(*m*, *d* − 2) φ-separable
dichotomies of *X*.

Now, if each of the φ-separable dichotomies of *X*
has equal probability, then the probability, *A*(*m*, *d*),
that **y** is ambiguous with respect to a random φ-separable
dichotomy of *X* is

*A*(*m*, *d*) = *C*(*m*, *d* − 2) / *C*(*m*, *d* − 1)     (1.5.1)

**Example 1.5.1:** Let us
apply the above theorem to Figure 1.5.1, where *m* = 10 and
*d* = 3 (including the bias input). A new point **y**
is ambiguous with respect to *C*(10, 1) = 20 dichotomies
of the *m* points relative to the class of all lines in the
plane. Now, *C*(10, 2) = 92 dichotomies of the *m* points
are separable by the class of all lines in the plane. Thus, a
new point **y** is ambiguous with respect to a random, linearly
separable dichotomy of the *m* points with probability
*A*(10, 3) = 20/92 ≈ 0.22.
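
These counts can be reproduced directly in code, assuming the counting-function convention *C*(*m*, *n*) = 2 Σ_{k=0}^{n} (*m* − 1 choose *k*), which reproduces the values 20 and 92 quoted above (the function name `cover_count` is ours):

```python
from math import comb

def cover_count(m, n):
    """Cover's counting function C(m, n) = 2 * sum_{k=0}^{n} C(m-1, k):
    the number of dichotomies of m points in general position that are
    separable by a hyperplane with n + 1 free parameters."""
    return 2 * sum(comb(m - 1, k) for k in range(n + 1))

m, d = 10, 3  # ten points, lines in the plane (d includes the bias input)
ambiguous = cover_count(m, d - 2)   # C(10, 1) = 20
separable = cover_count(m, d - 1)   # C(10, 2) = 92
print(ambiguous, separable, ambiguous / separable)  # prints: 20 92 0.217...
```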

The behavior of *A*(*m*, *d*) for large
*d*, with λ = *m*/*d* held fixed, is given by (Cover, 1965)

*A*(*m*, *d*) → *A**(λ) = 1 for 0 < λ ≤ 2,  and  *A**(λ) = 1/(λ − 1) for λ > 2     (1.5.2)

The function *A**(λ)
is plotted in Figure 1.5.2. It is interesting to note that according
to Equation (1.5.2), ambiguous response is reduced only when λ > 2;
i.e., when *m* is larger than the statistical capacity of
the PTG (or, more generally, of the separating surface used).
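
The approach of *A*(*m*, *d*) to this large-*d* limit can be checked numerically, again assuming the counting-function convention *C*(*m*, *n*) = 2 Σ_{k=0}^{n} (*m* − 1 choose *k*) consistent with Example 1.5.1 (the helper names are ours):

```python
from math import comb

def cover_count(m, n):
    """C(m, n) = 2 * sum_{k=0}^{n} C(m-1, k)."""
    return 2 * sum(comb(m - 1, k) for k in range(n + 1))

def ambiguity(m, d):
    """Exact probability A(m, d) = C(m, d-2) / C(m, d-1)."""
    return cover_count(m, d - 2) / cover_count(m, d - 1)

d = 100
lam = 4.0                     # lambda = m / d, above the capacity ratio of 2
m = int(lam * d)
exact = ambiguity(m, d)       # exact ratio for m = 400, d = 100
limit = 1.0 / (lam - 1.0)     # large-d asymptote for lambda > 2
print(exact, limit)           # both close to 1/3
```

For λ below 2 the same routine returns values near 1, matching the plateau of *A**(λ) in Figure 1.5.2.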

To eliminate ambiguity, we need λ >> 2 or *m*
>> 2*d*; that is, a large number of training samples is
required. If we define 0 < ε < 1 as the probability of generalization
error, then we can determine
the number of samples required to achieve a desired ε.
Assume that we choose *m* points (patterns) from a given
distribution such that these points are in general position and
that they are classified independently with equal probability
into one of two categories. Then, the probability of generalization
error on a new pattern similarly chosen, conditioned on the separability
of the entire set, is equal to *A*(*m*,
*d*). Therefore, as *d* → ∞, one can employ Equation (1.5.2) and
show that *m* must satisfy

*m* ≥ (1 + 1/ε) *d*     (1.5.3)

in order to bound the probability of generalization
error below ε.
Thus, we find that the probability of generalization error for
a single threshold gate with *d* degrees of freedom approaches
zero as fast as *d*/*m* in the limit of *m* >> *d*.
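
The sample-size bound (1.5.3) can be sanity-checked against the exact ambiguity probability, again under the counting-function convention used in Example 1.5.1 (the function names and the choice ε = 0.1, *d* = 50 are ours):

```python
from math import comb, ceil

def cover_count(m, n):
    """C(m, n) = 2 * sum_{k=0}^{n} C(m-1, k)."""
    return 2 * sum(comb(m - 1, k) for k in range(n + 1))

def ambiguity(m, d):
    """Exact probability A(m, d) = C(m, d-2) / C(m, d-1)."""
    return cover_count(m, d - 2) / cover_count(m, d - 1)

eps = 0.1
d = 50
m = ceil((1.0 + 1.0 / eps) * d)   # bound (1.5.3): m >= (1 + 1/eps) * d
print(m, ambiguity(m, d))          # m = 550; exact ambiguity just below eps
```

With these numbers the exact probability comes out slightly under ε, as the asymptotic form 1/(λ − 1) predicts.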