4.9 Complexity of Learning

This section deals with the computational complexity of learning: How much computation is required to learn (exactly or approximately to some "acceptable" degree) an arbitrary mapping in a multilayer neural network? In other words, is there an algorithm that is computationally "efficient" for training layered neural networks? Here, we will assume that the desired learning algorithm is a supervised one, which implies that the training set is labeled. Also, we will assume that the neural network has an arbitrary architecture but with no feedback connections.

Learning in artificial neural networks is hard. More precisely, loading an arbitrary mapping onto a "faithful" neural network architecture is believed to require exponential time, irrespective of the learning algorithm used (batch or adaptive). Judd (1987, 1990) showed that the learning problem in neural networks is NP-complete, even for approximate learning; i.e., in the worst case, we cannot expect to do much better than exhaustively trying combinations of weight settings to see if one happens to work. Therefore, as the problem size increases (that is, as the input pattern dimension or the number of input patterns increases), the training time is expected to scale up exponentially. Moreover, Blum and Rivest (1989) showed that training even a simple n-input, three-unit, two-layer net of LTG's on a given set of examples is NP-complete in the worst case. Consider the class of functions defined on a collection of m arbitrary points in Rn. Deciding whether there exist two hyperplanes that separate them is NP-complete (Megiddo, 1986); i.e., in the worst case, training a net with two hidden n-input LTG's and a single output LTG on examples of such functions takes exponential time, even if a solution exists. Blum and Rivest (1992) extended this result to the case of Boolean functions. They also showed that learning Boolean functions with a two-layer feedforward network of k hidden units (k bounded by some polynomial in n) and one output unit (which computes the AND function) is NP-complete.

However, these theoretical results do not rule out the possibility of finding polynomial-time algorithms for loading certain classes of problems onto certain carefully selected architectures. Blum and Rivest (1992) gave an example of two networks trained on the same task, such that training the first is NP-complete while the second can be trained in polynomial time. Also, the class of linearly separable mappings can be learned in polynomial time if single-layer LTG nets are employed (only a single unit is needed if the mapping has a single output). This is easy to prove, since one can use linear programming (Karmarkar, 1984) to compute the weights and thresholds of such nets in polynomial time, as sketched below. One can also use this fact to construct layered networks that have polynomial learning-time complexity for certain classes of non-linearly separable mappings. This is illustrated next.
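To make the linear-programming argument concrete, the following Python sketch poses the training of a single n-input LTG on m linearly separable examples as an LP feasibility problem. The routine name train_ltg_lp, the unit-margin formulation, and the use of scipy.optimize.linprog are our own illustrative choices, not part of the text; any polynomial-time LP solver (e.g., a Karmarkar-style interior-point method) could be substituted.

```python
import numpy as np
from scipy.optimize import linprog

def train_ltg_lp(X, y):
    """Find weights w and bias b of a single LTG separating the labeled
    examples (X, y), y in {0, 1}, by solving a linear-programming
    feasibility problem.  Returns (w, b), or None if the examples are
    not linearly separable."""
    m, n = X.shape
    s = np.where(y == 1, 1.0, -1.0)           # signed labels +/-1
    # Variables: [w_1 ... w_n, b].  Enforce a unit margin:
    #   s_i * (w . x_i + b) >= 1   <=>   -s_i * (w . x_i + b) <= -1
    A_ub = -s[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n + 1),           # pure feasibility: zero objective
                  A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1),
                  method="highs")
    if not res.success:
        return None
    return res.x[:n], res.x[n]

# Example: the (linearly separable) logical AND of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_ltg_lp(X, y)
print(np.all((X @ w + b >= 0).astype(int) == y))   # True
```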

Consider a set F of non-linearly separable functions f(x) which has the following two properties: (1) there exists at least one layered neural net architecture for which loading m training pairs {x, yd} of f(x) is NP-complete; (2) there exists a fixed dimensionality expansion process D that maps points x in Rn to points z in Rd, such that d is bounded by some polynomial in n [e.g., d = O(n2)], and such that the m training examples {z, yd} representing f(x) in the expanded space Rd are linearly separable. This set F is not empty; Blum and Rivest (1992) gave examples of functions in F. Figure 4.9.1 depicts a layered architecture which can realize any function in F. Here, a fixed preprocessing layer, labeled D in Figure 4.9.1, implements the above dimensionality expansion process. The output node is a d-input LTG. It can easily be shown that the learning complexity of this network for functions in F is polynomial: the trainable part of the network (the output LTG) can be trained with polynomial complexity on m linearly separable examples in Rd, and as n increases, d remains polynomial in n.


Figure 4.9.1. A layered architecture consisting of a fixed preprocessing layer D followed by an adaptive LTG.
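As an illustration of the preprocessing layer D, the sketch below uses a fixed quadratic expansion, which keeps d = O(n2); this is one possible choice of D, not the particular construction of Blum and Rivest. XOR serves here as a stand-in for a non-linearly separable task that becomes linearly separable in the expanded space, so the output LTG can be trained with the same linear-programming routine as before.

```python
import numpy as np
from itertools import combinations

def quadratic_expansion(X):
    """A fixed preprocessing layer D: maps x in R^n to z in R^d by
    appending all pairwise products x_i * x_j (i < j), so that
    d = n + n(n-1)/2 = O(n^2) remains polynomial in n."""
    n = X.shape[1]
    prods = [X[:, i] * X[:, j] for i, j in combinations(range(n), 2)]
    return np.hstack([X] + [p[:, None] for p in prods])

# XOR is not linearly separable in R^2, but its image under D is
# linearly separable in R^3 (columns: x1, x2, x1*x2).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
Z = quadratic_expansion(X)
w, b = train_ltg_lp(Z, y)               # LP routine from the previous sketch
print(np.all((Z @ w + b >= 0).astype(int) == y))   # True
```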

The efficiency of learning linearly separable classification tasks with a single threshold gate should not be surprising. Recall from Chapter One that the average amount of information necessary and sufficient to characterize the set of separating surfaces for a random, separable dichotomy of m points grows slowly with m and asymptotically approaches 2d (twice the number of degrees of freedom of the class of separating surfaces). This implies that for a random set of linear inequalities in d unknowns, the expected number of extreme inequalities that are necessary and sufficient to cover the whole set tends to 2d as the number of consistent inequalities tends to infinity, thus bounding the expected number of training examples needed by learning algorithms on separable problems. Moreover, this limit of 2d consistent inequalities is within the learning capacity of a single d-input LTG.
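The 2d figure can also be checked numerically from the closely related function-counting result recalled in Chapter One: of the 2m possible dichotomies of m points in general position, C(m, d) = 2 times the sum over k < d of binomial(m-1, k) are linearly separable when the separating surface has d degrees of freedom. The short script below (ours, for illustration only) tabulates the separable fraction and shows that it equals exactly one half at m = 2d.

```python
from math import comb

def separable_fraction(m, d):
    """Fraction of the 2^m dichotomies of m points in general position
    that are linearly separable by a surface with d degrees of freedom:
    C(m, d) = 2 * sum_{k < d} binom(m - 1, k), divided by 2^m."""
    C = 2 * sum(comb(m - 1, k) for k in range(d))
    return C / 2 ** m

d = 10
for m in (d, 2 * d, 3 * d, 4 * d):
    print(m, round(separable_fraction(m, d), 4))
# The fraction is 1.0 at m = d, exactly 0.5 at m = 2d (the capacity),
# and falls off rapidly beyond that.
```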

Another intuitive reason that the network in Figure 4.9.1 is easier to train than a fully adaptive two-layer feedforward net is that we are giving it predefined nonlinearities. The former net does not have to start from scratch, but instead is given more powerful building blocks to work with. However, there is a trade-off: by using the network in Figure 4.9.1, we gain in a worst-case computational sense, but lose in that the number of weights increases from n to O(n2) or higher. This increase in the number of weights implies that the number of training examples must increase so that the network can generalize meaningfully on new examples (recall the results of the previous section).

The problem of NP-complete learning in multilayer neural networks may be attributed to the use of fixed network resources (Baum, 1989). Learning an arbitrary mapping can be achieved in polynomial time by a network that allocates new computational units as more patterns are learned. Mukhopadhyay et al. (1993) gave a polynomial-time training algorithm for a general class of classification problems (defined by mappings from Rn into a discrete set of class labels), based on clustering and linear programming models. This algorithm simultaneously designs and trains an appropriate network for a given classification task. The basic idea of the method is to cover the class regions with a minimal number of dynamically allocated hyperquadratic volumes (e.g., hyperspheres) of varying size, as sketched below. The resulting network has a layered structure consisting of a simple fixed preprocessing layer, a hidden layer of LTG's, and an output layer of logical OR gates. This and other efficiently trainable nets are considered in detail in Section 6.3.
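As a toy illustration of the covering idea only (this greedy heuristic is not Mukhopadhyay et al.'s clustering/linear-programming algorithm), the sketch below encloses the positive examples in hyperspheres that exclude every negative example and classifies by the OR of the sphere-membership tests. Note that each test ||x - c||2 <= r2 is linear in the expanded inputs (x, ||x||2), which is exactly the role of the simple fixed preprocessing layer, so each hidden unit is an LTG.

```python
import numpy as np

def greedy_sphere_cover(X, y):
    """Toy covering heuristic: enclose the class-1 examples in a small
    set of hyperspheres, each chosen to exclude every class-0 example."""
    pos, neg = X[y == 1], X[y == 0]
    uncovered = list(range(len(pos)))
    spheres = []                              # list of (center, radius**2)
    while uncovered:
        c = pos[uncovered[0]]                 # center on an uncovered positive point
        # Largest squared radius that still excludes all class-0 points.
        r2 = np.min(np.sum((neg - c) ** 2, axis=1)) * 0.99
        spheres.append((c, r2))
        uncovered = [i for i in uncovered
                     if np.sum((pos[i] - c) ** 2) > r2]
    return spheres

def predict(X, spheres):
    """Hidden layer: one unit per sphere (the test ||x - c||^2 <= r^2 is
    linear in the expanded inputs (x, ||x||^2), hence an LTG).
    Output layer: a logical OR of the hidden-unit outputs."""
    inside = [np.sum((X - c) ** 2, axis=1) <= r2 for c, r2 in spheres]
    return np.any(inside, axis=0).astype(int)

# XOR again: two spheres, one around each positive example, suffice.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
spheres = greedy_sphere_cover(X, y)
print(len(spheres), np.all(predict(X, spheres) == y))   # 2 True
```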
