2.3 Approximation Capabilities of Feedforward Neural Networks for Continuous Functions

This section summarizes, in the form of theorems, some fundamental results on the capabilities of feedforward nets for approximating continuous functions. The main result is that a two-layer feedforward net with a sufficient number of hidden units of the sigmoidal activation type and a single linear output unit is capable of approximating any continuous function f: R^n → R to any desired accuracy. Before formally stating this result, we consider some early observations on the implications of a classical theorem on function approximation, Kolmogorov's theorem, which motivates the use of layered feedforward nets as function approximators.

2.3.1 Kolmogorov's Theorem

It has been suggested (Hecht-Nielsen, 1987 and 1990; Lippmann, 1987; Sprecher, 1993) that Kolmogorov's theorem, concerning the realization of arbitrary multivariate functions, provides theoretical support for neural networks that implement such functions.

Theorem 2.3.1 (Kolmogorov, 1957): Any continuous real-valued function f(x1, x2, ..., xn) defined on [0, 1]^n, n ≥ 2, can be represented in the form

f(x1, x2, ..., xn) = Σ_{j=1}^{2n+1} g_j( Σ_{i=1}^{n} ψ_ij(x_i) )    (2.3.1)

where the g_j's are properly chosen continuous functions of one variable, and the ψ_ij's are continuous monotonically increasing functions independent of f.

The basic idea in Kolmogorov's theorem is captured in the network architecture of Figure 2.3.1, where a universal transformation M maps R^n through a set of one-dimensional transformations. The theorem states that one can express a continuous multivariate function on a compact set in terms of sums and compositions of a finite number of single-variable functions.
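The layered structure suggested by the theorem can be sketched directly in code. The particular ψ_ij and g_j below are arbitrary smooth placeholders of our own choosing (the theorem's actual functions are highly non-smooth and are not given constructively); the point is only the two-stage, parallel architecture of Figure 2.3.1.

```python
import math

def kolmogorov_form(x, psi, g):
    """Evaluate sum_j g[j]( sum_i psi[i][j](x[i]) ) -- the layered
    composition of single-variable functions in Equation (2.3.1)."""
    n = len(x)
    total = 0.0
    for j, g_j in enumerate(g):
        # First stage: each branch j sums n single-variable transformations.
        inner = sum(psi[i][j](x[i]) for i in range(n))
        # Second stage: one single-variable function g_j per branch.
        total += g_j(inner)
    return total

# Example with n = 2 inputs and 2n + 1 = 5 parallel branches.
# These psi_ij and g_j are illustrative placeholders, NOT the theorem's functions.
n = 2
psi = [[(lambda xi, a=i, b=j: math.tanh(xi + 0.1 * a + 0.2 * b))
        for j in range(2 * n + 1)] for i in range(n)]
g = [(lambda s, c=j: math.sin(s + c)) for j in range(2 * n + 1)]

y = kolmogorov_form([0.3, 0.7], psi, g)
```

Since each g_j here is bounded by 1 and there are 2n + 1 = 5 branches, the output is bounded by 5 in magnitude; the structure, not the value, is what matters.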

Figure 2.3.1. Network representation of Kolmogorov's theorem.

Others, such as Girosi and Poggio (1989), have criticized this interpretation of Kolmogorov's theorem as irrelevant to neural networks, pointing out that the ψ_ij functions are highly non-smooth and the functions g_j are not parameterized. On the other hand, Kůrková (1992) supported the relevance of this theorem to neural nets by arguing that non-smooth functions can be approximated as sums of infinite series of smooth functions; thus one should be able to approximately implement ψ_ij and g_j with parameterized networks. More recently, Lin and Unbehauen (1993) argued that an "approximate" implementation of g_j does not, in general, deliver an approximate implementation of the original function f(x). As this debate continues, the importance of Kolmogorov's theorem may lie not in its direct application to proving the universality of neural nets as function approximators, but rather in its pointing to the feasibility of using parallel and layered network structures for multivariate function mappings.

2.3.2 Single Hidden Layer Neural Networks are Universal Approximators

Rigorous mathematical proofs for the universality of feedforward layered neural nets employing continuous sigmoid-type, as well as other more general, activation units were given, independently, by Cybenko (1989), Hornik et al. (1989), and Funahashi (1989). Cybenko's proof is distinguished by being mathematically concise and elegant [it is based on the Hahn-Banach theorem (Luenberger, 1969)]. The following is the statement of Cybenko's theorem; the reader is referred to the original paper by Cybenko (1989) for the proof.

Theorem 2.3.2 (Cybenko, 1989): Let σ be any continuous sigmoid-type function (e.g., the logistic function σ(z) = 1/(1 + e^{-z})). Then, given any continuous real-valued function f on [0, 1]^n (or any other compact subset of R^n) and ε > 0, there exist vectors w_1, w_2, ..., w_N, α, and θ, and a parameterized function G(·, w, α, θ): [0, 1]^n → R such that

| G(x, w, α, θ) − f(x) | < ε   for all x ∈ [0, 1]^n

where

G(x, w, α, θ) = Σ_{j=1}^{N} α_j σ(w_j^T x + θ_j)    (2.3.2)

and w_j ∈ R^n, α_j, θ_j ∈ R, w = (w_1, w_2, ..., w_N), α = (α_1, α_2, ..., α_N), and θ = (θ_1, θ_2, ..., θ_N).

Hornik et al. (1989) [employing the Stone-Weierstrass theorem (Rudin, 1964)] and Funahashi (1989) [using an integral formula presented by Irie and Miyake (1988)] independently proved similar theorems stating that a one-hidden-layer feedforward neural network is capable of approximating uniformly any continuous multivariate function to any desired degree of accuracy. This implies that any failure of a function mapping by a multilayer network must arise from an inadequate choice of parameters (i.e., poor choices for w_1, w_2, ..., w_N, α, and θ) or an insufficient number of hidden nodes. Hornik et al. (1990) proved another important result relating to the approximation capability of multilayer feedforward neural nets employing sigmoidal hidden unit activations: these networks can approximate not only an unknown function, but also its derivative. In fact, Hornik et al. (1990) also showed that these networks can approximate functions that are not differentiable in the classical sense, but which possess a generalized derivative, as in the case of piecewise differentiable functions.
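As an informal numerical sketch of Theorem 2.3.2, the following snippet fits a network of the form G(x) = Σ_j α_j σ(w_j x + θ_j) to a continuous target on [0, 1]. The hidden parameters are drawn at random and only the output weights are obtained by least squares; this training shortcut is purely an illustrative device of ours, not part of the theorem, which only asserts that suitable parameters exist.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    """Logistic sigmoid, the canonical sigmoid-type activation."""
    return 1.0 / (1.0 + np.exp(-z))

N = 100                              # number of hidden units
w = rng.uniform(-10, 10, size=N)     # random hidden weights (n = 1 here)
theta = rng.uniform(-10, 10, size=N) # random hidden thresholds

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)            # a continuous target on [0, 1]

# Hidden-layer outputs sigma(w_j * x + theta_j), shape (200, N).
H = sigma(np.outer(x, w) + theta)

# Fit only the linear output weights alpha by least squares.
alpha, *_ = np.linalg.lstsq(H, f, rcond=None)

G = H @ alpha                        # the network output G(x)
max_err = np.max(np.abs(G - f))      # uniform error on the sample grid
```

With only 100 sigmoidal hidden units the uniform error on the grid is already small, consistent with the qualitative content of the theorem.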

Using a theorem by Sun and Cheney (1992), Light (1992a) extended Cybenko's results to any continuous function f on R^n and showed that signed integer weights and thresholds are sufficient for accurate approximation. In another version of the theorem, Light shows that the sigmoid can be replaced by any continuous function on R satisfying certain integral conditions (see Light, 1992a for the precise statement), where n is odd and n ≥ 3; a similar result can be established for even n. Examples of activation functions satisfying these conditions are given by a family of functions involving the term cos(n cos^{-1} x), which is the Chebyshev polynomial of degree n. Figure 2.3.2 shows two plots of this activation function for n = 3 and n = 7, respectively.
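The Chebyshev identity behind this remark is easy to verify numerically: cos(n cos^{-1} x) coincides on [−1, 1] with the Chebyshev polynomial T_n(x); for degree 3, T_3(x) = 4x^3 − 3x.

```python
import numpy as np

# Verify cos(n * arccos(x)) == T_n(x) on [-1, 1] for n = 3,
# where T_3(x) = 4x^3 - 3x is the explicit degree-3 Chebyshev polynomial.
x = np.linspace(-1.0, 1.0, 101)
t3_cos = np.cos(3 * np.arccos(x))
t3_poly = 4 * x**3 - 3 * x
assert np.allclose(t3_cos, t3_poly)
```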


Figure 2.3.2. The activation function for the cases (a) n = 3 and (b) n = 7.

The universality of single-hidden-layer nets with units having non-sigmoid activation functions was formally proved by Stinchcombe and White (1989). Baldi (1991) showed that a wide class of continuous multivariate functions can be approximated by a weighted sum of bell-shaped functions (multivariate Bernstein polynomials); i.e., a single-hidden-layer net with bell-shaped activations for its hidden units and a single linear output unit is a possible approximator of functions f: R^n → R. Hornik (1991) proved that a sufficient condition for universal approximation is obtained by using continuous, bounded, and nonconstant hidden unit activation functions. Recently, Leshno et al. (1993; see also Hornik, 1993) extended these results by showing that the above neural network with locally bounded piecewise continuous activation functions for hidden units is a universal approximator if and only if the network's activation function is not a polynomial. Ito (1991) showed that any function belonging to the class of rapidly decreasing continuous functions in R^n (i.e., continuous functions f(x) that decay to zero faster than any power of ‖x‖ as ‖x‖ → ∞) can be approximated arbitrarily well by a two-layer architecture with a finite number of LTGs in the hidden layer. Here, the requirement of rapid decrease in f is not necessary and can be weakened.
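The non-polynomial condition of Leshno et al. can be illustrated numerically: a polynomial activation of degree d spans only polynomials of degree at most d in the input, so its approximation error hits a floor that no number of hidden units can remove, whereas a non-polynomial activation (tanh below) has no such ceiling. The random-feature, least-squares setup is again only an illustrative device of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100                              # hidden units (same for both activations)
w = rng.uniform(-10, 10, size=N)
theta = rng.uniform(-10, 10, size=N)

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)            # continuous target on [0, 1]

def fit_error(activation):
    """Uniform error of the best least-squares output weights for a
    hidden layer activation(w_j * x + theta_j) with fixed w, theta."""
    H = activation(np.outer(x, w) + theta)
    alpha, *_ = np.linalg.lstsq(H, f, rcond=None)
    return np.max(np.abs(H @ alpha - f))

# Degree-2 polynomial activation: every unit (w*x + theta)^2 is a quadratic
# in x, so the whole network can only realize quadratics in x.
err_poly = fit_error(lambda z: z**2)

# Non-polynomial activation: no degree ceiling.
err_tanh = fit_error(np.tanh)
```

The polynomial-activation network is stuck at a large error (the best quadratic fit to a full sine period), while the tanh network drives the error down, mirroring the "if and only if not a polynomial" characterization.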

2.3.3 Single Hidden Layer Neural Networks are Universal Classifiers

Theorem 2.3.2 may also be extended to classifier-type mappings (Cybenko, 1989) of the form

f(x) = j  iff  x ∈ P_j    (2.3.5)

where f: A_n → {1, 2, ..., k}, A_n is a compact (closed and bounded) subset of R^n, and P_1, P_2, ..., P_k partition A_n into k disjoint measurable subsets; i.e., P_1 ∪ P_2 ∪ ... ∪ P_k = A_n and P_i ∩ P_j is empty for i ≠ j.

Thus, a single hidden layer net with sigmoidal activation units and a single linear output unit is a universal classifier. This result was confirmed empirically by Huang and Lippmann (1988) on several examples including the ones shown in Figure 2.3.3 for n = 2 and k = 2 (i.e., two-class problems in a two-dimensional space).
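A numerical sketch of the universal-classifier claim, using the same illustrative random-hidden-layer shortcut as before: a single hidden layer of sigmoids with one linear output, fit to a two-class (k = 2) problem of separating a disk from its complement in [0, 1]^2. This toy region is our own example, in the spirit of the Huang and Lippmann experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigma(z):
    """Logistic sigmoid hidden-unit activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Two-class problem in [0, 1]^2: label 1 inside a disk of radius 0.3
# centered at (0.5, 0.5), label 0 outside (the sets P_1 and P_2).
X = rng.uniform(0.0, 1.0, size=(2000, 2))
labels = (np.sum((X - 0.5) ** 2, axis=1) < 0.3 ** 2).astype(float)

N = 300                                   # hidden sigmoid units
W = rng.uniform(-10, 10, size=(2, N))     # random hidden weights
theta = rng.uniform(-10, 10, size=N)      # random hidden thresholds

H = sigma(X @ W + theta)                  # hidden layer, shape (2000, N)
alpha, *_ = np.linalg.lstsq(H, labels, rcond=None)

# Quantize the single linear output to recover the class label.
pred = (H @ alpha > 0.5).astype(float)
accuracy = np.mean(pred == labels)
```

Misclassifications concentrate in a thin band around the disk boundary, so the accuracy on the sampled points is high, illustrating that the single-hidden-layer net carves out the curved decision region.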

Figure 2.3.3. Ten complex decision regions formed by a neural net classifier with a single hidden layer. (Adapted from W. Y. Huang and R. P. Lippmann, 1988, with permission of the American Institute of Physics.)

We conclude this section by noting that in Equation (2.3.5), the class label is explicitly generated as the output of a linear unit. This representation of class labels via quantized levels of a single linear output unit, although useful for theoretical considerations, is not practical; it imposes unnecessary constraints on the hidden layer, which, in turn, leads to a large number of hidden units for the realization of complex mappings. In practical implementations, a local encoding of classes is more commonly used (Rumelhart et al., 1986a). This relaxes the constraints on the hidden layer mapping by adding several output units, each of which is responsible for representing a unique class. Here, LTGs or sigmoid-type units may be used as output units.
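The local encoding described above amounts to mapping each class label j ∈ {1, ..., k} to a k-dimensional target vector with a 1 in position j and 0 elsewhere, one output unit per class:

```python
import numpy as np

def one_hot(labels, k):
    """Map integer class labels 1..k to k-dimensional local codes:
    output unit j is 1 exactly when the label is j."""
    labels = np.asarray(labels)
    codes = np.zeros((len(labels), k))
    codes[np.arange(len(labels)), labels - 1] = 1.0
    return codes

# Labels 1, 3, 2 with k = 3 classes: each row has a single active unit.
codes = one_hot([1, 3, 2], k=3)
```

Each output unit then only has to separate its own class from the rest, which is the relaxation of the hidden-layer constraints noted in the text.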
