For any linear dichotomy {X+, X}
of a set of points, there exists a minimal sufficient subset of
extreme points such that any hyperplane correctly separating this
subset must separate the entire set correctly. From this definition
it follows that a point is an extreme point of the linear dichotomy
{X+, X} if
and only if it is ambiguous with respect to {X+, X}.
Thus, for a set of m points in general position in Rd,
each of the m points is ambiguous with respect to precisely
C(m 1, d 1) linear dichotomies of
the remaining m 1 points. Here, each of the
dichotomies is the restriction of two linearly separable dichotomies
of the original m points; the two dichotomies which differs
only in the classification of the remaining point. Therefore,
each of the m points is an extreme point with respect to
2C(m 1, d 1) dichotomies. Since
there are a total of C(m, d) linearly separable dichotomies,
the probability, Pe,
of a point to be an extreme point with respect to a randomly generated
dichotomy is (assuming equiprobable dichotomies)
(1.6.1)
Then, the expected number of extreme points in a
set of m points in general position in Rd,
is equal to the sum of the m probabilities that each point
is an extreme point. Since these probabilities are equal, the
expected number of extreme points can be written as
(1.6.2)
Figure 1.6.1 illustrates the effect of the size of
the training set m on the normalized expected number of
extreme points,
.
Figure 1.6.1. Normalized expected number of extreme
points for d = 5, 20 and as a function of
.
For large d, with = m/d, the normalized expected number of extreme points is given by (Cover, 1965)
(1.6.3)
which agrees with the simulation results in Figure
1.6.1 for d . The above limit on the expected
number of extreme points implies, roughly, that for a random separable
set of m patterns, the average number of patterns (information)
which need to be stored for a complete characterization of the
whole set is only twice the number of degrees of freedom of the
class of separating surfaces being used. The implication to pattern
recognition is that the essential information in an infinite training
set can be expected to be loaded into a network of finite capacity
(e.g., a PTG of finite order or a finite network of simple gates).