Learning in artificial neural networks is hard.
More precisely, loading an arbitrary mapping onto a "faithful"
neural network architecture requires exponential time, irrespective
of the learning algorithm used (batch or adaptive). Judd (1987,
1990) showed that the learning problem in neural networks is NP-complete,
even for approximate learning; i.e., in the worst case, we cannot
do much better than exhaustively searching all combinations of
weight settings to see if one happens to work. Therefore, as the
problem size increases (that is, as the input pattern dimension
or the number of input patterns increases), the training time
scales up exponentially.
Moreover, it has been shown (Blum and Rivest, 1989) that training
a simple *n*-input, three-unit, two-layer net of LTG's can
be NP-complete in the worst case when learning a given set of
examples. Consider the class of functions defined on a collection
of *m* arbitrary points in R^*n*.
It has been shown that the problem of deciding whether there exist
two hyperplanes that separate these points is NP-complete (Megiddo,
1986); i.e., the training of a net with two hidden *n*-input LTG's
and a single output LTG on examples of such functions requires
exponential time in the worst case (assuming P ≠ NP), even if a
solution exists. Blum and Rivest (1992) extended this result to
the case of Boolean functions. They also showed that learning
Boolean functions with a two-layer feedforward network of *k*
hidden units (*k* bounded by some polynomial in *n*) and one
output unit (which computes the AND function) is NP-complete.

However, these theoretical results do not rule out
the possibility of finding polynomial-time algorithms for loading
certain classes of problems onto certain carefully selected
architectures. Blum and Rivest (1992) gave an example
of two networks trained on the same task, such that training the
first is NP-complete while the second can be trained in polynomial
time. Also, the class of linearly separable mappings can be trained
in polynomial time if single-layer LTG nets are employed (only
a single unit is needed if the mapping has a single output).
This is easy to prove, since one can use linear programming (Karmarkar,
1984) to compute the weights and thresholds of such nets in polynomial
time. One can also use this fact to construct layered networks
that have polynomial learning time complexity for certain classes
of non-linearly separable mappings; this construction is illustrated
after the following sketch of the linear programming step.
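
To make the linear programming argument concrete, here is a minimal
sketch (our own illustration, not from the original text) that trains a
single LTG on a linearly separable task by solving a linear feasibility
problem with `scipy.optimize.linprog`; the function name `train_ltg_lp`
and the unit margin are arbitrary choices.

```python
import numpy as np
from scipy.optimize import linprog

def train_ltg_lp(X, y):
    """Find weights w and threshold t of a single LTG such that
    sign(w . x - t) = y for every training pair, via an LP
    feasibility problem. X: (m, n) inputs; y: (m,) labels in {-1, +1}.
    Returns (w, t), or None if the data are not linearly separable."""
    m, n = X.shape
    # Variables v = [w_1, ..., w_n, t]. Enforce y_i * (w . x_i - t) >= 1
    # (the margin 1 merely fixes the scale of w; any positive value works).
    # In linprog form A_ub @ v <= b_ub:  -y_i * (x_i, -1) . v <= -1.
    A_ub = -y[:, None] * np.hstack([X, -np.ones((m, 1))])
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)            # pure feasibility: nothing to minimize
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    if not res.success:
        return None                # no separating hyperplane exists
    return res.x[:n], res.x[n]

# Example: the linearly separable AND function on {0, 1}^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, +1])
w, t = train_ltg_lp(X, y)
print(np.sign(X @ w - t))          # matches y
```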

Consider a set *F* of non-linearly separable
functions *f* which has the following two properties: (1) there
exists at least one layered neural net architecture for which
loading *m* training pairs {**x**, *y*^d}
of *f*(**x**) is NP-complete; (2) there exists a fixed
dimensionality expansion process D that maps points **x** in
R^*n* to points **z**
in R^*d*, such that
*d* is bounded by some polynomial in *n* [e.g., *d* = O(*n*^2)], and
that the *m* training examples {**z**, *y*^d}
representing *f*(**x**) in the expanded space R^*d*
are linearly separable. This set *F* is not empty; Blum
and Rivest (1992) gave examples of functions in *F*. Figure
4.9.1 depicts a layered architecture which can realize any function
in *F*. Here, a fixed preprocessing layer, labeled D in
Figure 4.9.1, implements the above dimensionality expansion process.
The output node is a *d*-input LTG. It can easily be shown
that the learning complexity of this network for functions in
*F* is polynomial: the training of the trainable part of this
network (the output LTG) has polynomial complexity for *m*
linearly separable examples in R^*d*, and as *n* increases,
*d* remains polynomial in *n*.

Figure 4.9.1. A layered architecture consisting of
a fixed preprocessing layer *D* followed by an adaptive LTG.
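
As a toy illustration of this construction (our own example, not one
from Blum and Rivest), XOR is not linearly separable in R^2, but the
fixed expansion D(x1, x2) = (x1, x2, x1·x2) maps it into R^3, where it
becomes linearly separable and a single 3-input LTG suffices; here
*d* = 3 is polynomial in *n* = 2. A minimal sketch, assuming NumPy:

```python
import numpy as np

def expand(X):
    """Fixed preprocessing layer D: (x1, x2) -> (x1, x2, x1*x2).
    A hypothetical quadratic expansion; d = 3 is polynomial in n = 2."""
    return np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])

# XOR: not linearly separable in R^2 ...
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

Z = expand(X)
# ... but separable in the expanded space R^3: for example, the LTG
# with weights w = (1, 1, -2) and threshold t = 0.5 realizes XOR.
w, t = np.array([1.0, 1.0, -2.0]), 0.5
print(np.sign(Z @ w - t))   # [-1, +1, +1, -1] -- matches y
# In practice w and t would be found by the LP routine sketched above
# (train_ltg_lp(Z, y)), which runs in polynomial time.
```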

The efficiency of learning linearly separable classification
tasks with a single threshold gate should not be surprising. Recall
from Chapter One that the average amount of information necessary
and sufficient to characterize the set of separating surfaces for
a random, separable dichotomy of *m* points grows slowly with *m*
and asymptotically approaches 2*d* (twice the number of degrees of
freedom of the class of separating surfaces). This implies that for
a random set of linear inequalities in *d* unknowns, the expected
number of extreme inequalities that are necessary and sufficient to
cover the whole set tends to 2*d* as the number of consistent
inequalities tends to infinity, thus bounding the expected number
of training examples needed by learning algorithms on separable
problems. Moreover, this limit of 2*d* consistent inequalities
is within the learning capacity of a single *d*-input LTG.
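
This capacity figure can be checked numerically from Cover's
function-counting theorem, on which the above discussion rests: the
number of linearly separable dichotomies of *m* points in general
position, for a separating-surface class with *d* degrees of freedom,
is C(*m*, *d*) = 2 Σ_{k=0}^{d-1} C(*m*-1, k), so the fraction of the
2^*m* dichotomies that are separable passes through 1/2 exactly at
*m* = 2*d*. A minimal sketch:

```python
from math import comb

def frac_separable(m, dof):
    """Fraction of the 2^m dichotomies of m points (in general position)
    that are linearly separable, for a separating-surface class with
    `dof` degrees of freedom (Cover's function-counting theorem)."""
    count = 2 * sum(comb(m - 1, k) for k in range(dof))
    return count / 2 ** m

d = 25
for m in (d, 2 * d, 3 * d, 4 * d):
    print(f"m = {m:3d}: {frac_separable(m, d):.3f}")
# m <= dof is always separable (fraction 1.0); at m = 2*dof the
# fraction is exactly 0.5, and it collapses toward 0 beyond that.
```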

Another intuitive reason that the network in Figure
4.9.1 is easier to train than a fully adaptive two-layer feedforward
net is that we are giving it predefined nonlinearities: it does
not have to start from scratch, but instead is given
more powerful building blocks to work with. However, there is
a trade-off. By using the network in Figure 4.9.1, we gain in
a worst-case computational sense, but lose in that the number
of weights increases from *n* to O(*n*^2)
or higher. This increase in the number of weights implies that
the number of training examples must increase so that the network
can generalize meaningfully on new examples (recall the results
of the previous section).

The problem of NP-complete learning in multilayer
neural networks may be attributed to the use of fixed network
resources (Baum, 1989). Learning an arbitrary mapping can be
achieved in polynomial time by a network that allocates new computational
units as more patterns are learned. Mukhopadhyay et al. (1993)
gave a polynomial-time training algorithm for the general class
of classification problems (defined by mappings from R^*n* into
a finite set of classes), based on clustering and linear programming
models. This algorithm simultaneously designs and trains an
appropriate network for a given classification task. The basic idea
of the method is to cover class regions with a minimal number of
dynamically allocated hyperquadric volumes (e.g., hyperspheres) of
varying size. The resulting network has a layered structure consisting
of a simple fixed preprocessing layer, a hidden layer of LTG's, and
an output layer of logical OR gates. This and other efficiently
trainable nets are considered in detail in Section 6.3.
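
To give a rough feel for such region-covering networks, the toy sketch
below (a naive heuristic of our own, not Mukhopadhyay et al.'s actual
clustering and linear programming algorithm) dynamically allocates one
hypersphere, i.e., one hidden unit, per uncovered point of a class and
combines the sphere-membership bits with a logical OR output:

```python
import numpy as np

def cover_with_spheres(X_pos, X_neg, radius_shrink=0.99):
    """Toy covering heuristic (NOT the algorithm from the text):
    greedily place a sphere on each not-yet-covered positive point,
    with radius just under the distance to the nearest negative point.
    Each sphere corresponds to one hidden unit; the output unit is a
    logical OR over the sphere-membership bits."""
    spheres = []
    covered = np.zeros(len(X_pos), dtype=bool)
    while not covered.all():
        i = int(np.argmax(~covered))              # first uncovered point
        c = X_pos[i]
        r = radius_shrink * np.linalg.norm(X_neg - c, axis=1).min()
        spheres.append((c, r))                    # allocate a new unit
        covered |= np.linalg.norm(X_pos - c, axis=1) <= r
    return spheres

def predict(x, spheres):
    # OR gate over the hidden sphere-membership units
    return any(np.linalg.norm(x - c) <= r for c, r in spheres)

# Tiny example: XOR-like class regions in the plane.
X_pos = np.array([[0.0, 1.0], [1.0, 0.0]])
X_neg = np.array([[0.0, 0.0], [1.0, 1.0]])
spheres = cover_with_spheres(X_pos, X_neg)
print(len(spheres), [predict(x, spheres) for x in np.vstack([X_pos, X_neg])])
# -> 2 units; positives map to True, negatives to False
```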
