1.5 Ambiguity and Generalization

Generalization is the ability of a network (here, a PTG) to correctly classify new patterns. Consider the problem of generalizing from a training set with respect to a given admissible family of decision surfaces (e.g., the family of surfaces that can be implemented by a PTG). During the training phase, a decision (separating) surface from the admissible class is synthesized which correctly assigns members of the training set to the desired categories. The new pattern will be assigned to the category lying on the same side of the decision surface. Clearly, for some dichotomies of the training set, the assignment of category will not be unique. However, it is generally known that if a large number of training patterns are used, the decision surface will be sufficiently constrained to yield a unique response to a new pattern. Next, we show that the number of training patterns must exceed the statistical capacity of the PTG before unique response becomes probable.

Figure 1.5.1. Ambiguous generalization. The points y1 and y3 are uniquely classified regardless of which of the two linear decision surfaces shown is used. The point y2 has an ambiguous classification.

Theorem 1.5.1 (Cover, 1965): Let X  {y} = {x1, x2, ..., xm, y} be in -general position in Rd1, then y is ambiguous with respect to C(m, d  2) dichotomies of X relative to the class of all -surfaces.

Proof of Theorem 1.5.1:

The point y is ambiguous with respect to a dichotomy {X+X} of X = {x1x2, ..., xm} if and only if there exists a -surface containing y which separates {X+X}. This is because, when X is in -general position, any -surface through y that realizes the dichotomy {X+X} can be shifted infinitesimally to allow arbitrary classification of y without affecting the separation of {X+X}. Equivalently, since -general position of X{y} implies that the set of points Z{(y)} = {(x1), (x2), ..., (xm), (y)} is in general position in Rd1, the point (y) is ambiguous with respect to the linear dichotomy {Z+Z} (here, each linear dichotomy of Z, {Z+Z}, corresponds to a unique -separable dichotomy {X+X} in the input space). Hence, the point (y) is ambiguous with respect to D linear dichotomies {Z+Z} which can be separated by a (d  2)-dimensional hyperplane constrained to pass through (y). Constraining the hyperplane to pass through a point effectively reduces the dimension of the space by 1. Thus, by Theorem 1.3.2, D = C(md  2). Now, by noting the one-to-one correspondence between the linear separable dichotomies in -space and the -separable dichotomies in the input space we establish that y is ambiguous with respect to C(md  2) -separable dichotomies of X.

Now, if each of the -separable dichotomies of X has equal probability, then the probability, A(m, d), that y is ambiguous with respect to a random -separable dichotomy of X is

Example 1.5.1: Let us apply the above theorem to Figure 1.5.1, where m = 10 and d = 3 (including the bias input). A new point y is ambiguous with respect to C(10, 1) = 20 dichotomies of the m points relative to the class of all lines in the plane. Now, C(10, 2) = 92 dichotomies of the m points are separable by the class of all lines in the plane. Thus, a new point y is ambiguous with respect to a random, linearly separable dichotomy of the m points with probability

The behavior of A(m, d) for large d is given by (Cover, 1965)

(1.5.2)

where = m/d. The function A*() is plotted in Figure 1.5.2. It is interesting to note that according to Equation (1.5.2), ambiguous response is reduced only when ; i.e., when m is larger than the statistical capacity of the PTG (or more generally, the separating surface used).

Figure 1.5.2. Asymptotic probability of ambiguous response.

To eliminate ambiguity, we need 2 or m 2d; that is, a large number of training samples is required. If we define 0 e 1 as the probability of generalization error, then we can determine the number of samples required to achieve a desired e. Assume that we choose m points (patterns) from a given distribution such that these points are in general position and that they are classified independently with equal probability into one of two categories. Then, the probability of generalization error on a new pattern similarly chosen, conditioned on the separability of the entire set, is equal to A(m, d). Therefore, as d, one can employ Equation (1.5.2) and show that m must satisfy

(1.5.3)

in order to bound the probability of generalization error below e. Thus, we find that the probability of generalization error for a single threshold gate with d degrees of freedom approaches zero as fast as in the limit of m >> d.

Goto [1.0][1.1] [1.2] [1.3] [1.4] [1.6] [1.7]