**2.3 Approximation Capabilities of Feedforward Neural Networks for Continuous Functions**

This section summarizes some fundamental results, in the
form of theorems, on the continuous function approximation capabilities of
feedforward nets. The main result is that a two-layer feedforward net with
a sufficient number of hidden units of the sigmoidal activation type
and a single linear output unit is capable of approximating any continuous
function $f : \mathbb{R}^n \to \mathbb{R}$ to any desired accuracy. Before formally stating this result, we
consider some early observations on the implications of a classical theorem
on function approximation, Kolmogorov's theorem, which motivates the use
of layered feedforward nets as function approximators.

**2.3.1 Kolmogorov's Theorem**

It has been suggested (Hecht-Nielsen, 1987 and 1990; Lippmann, 1987; Sprecher, 1993) that Kolmogorov's theorem concerning the realization of arbitrary multivariate functions provides theoretical support for neural networks that implement such functions.

*Theorem 2.3.1 (Kolmogorov, 1957): Any continuous real-valued function $f(x_1, x_2, ..., x_n)$ defined on $[0, 1]^n$, $n \geq 2$, can be represented in the form*

$$f(x_1, x_2, ..., x_n) = \sum_{j=1}^{2n+1} g_j\left(\sum_{i=1}^{n} \varphi_{ij}(x_i)\right) \qquad (2.3.1)$$

*where the $g_j$'s are properly chosen continuous functions of one variable, and the $\varphi_{ij}$'s are continuous monotonically increasing functions independent of $f$.*

The basic idea in Kolmogorov's theorem is captured in
the network architecture of Figure 2.3.1, where a universal transformation
$M$ maps $\mathbb{R}^n$ into
several uni-dimensional transformations. The theorem states that one can
express a continuous multivariate function on a compact set in terms of
sums and compositions of a finite number of single-variable functions.

Figure 2.3.1. Network representation of Kolmogorov's
theorem.
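The flavor of this decomposition can be seen in a simple hand-constructed case (this is an illustration only, not Kolmogorov's actual construction): the two-variable product $x_1 x_2$ can be written entirely as sums and compositions of single-variable functions.

```python
# Illustration (not Kolmogorov's construction): the multivariate
# function f(x1, x2) = x1 * x2 expressed using only sums and the
# single-variable "outer" function g(u) = u^2 / 4.

def g(u):
    # continuous function of one variable
    return u * u / 4.0

def f(x1, x2):
    # x1*x2 = ((x1 + x2)^2 - (x1 - x2)^2) / 4
    return g(x1 + x2) - g(x1 - x2)

print(f(3.0, 5.0))  # 15.0
```

The point of the exercise is only that a genuinely two-variable mapping reduces to univariate building blocks combined by addition and composition, which is the structural idea Figure 2.3.1 depicts.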

Others, such as Girosi and Poggio (1989), have criticized
this interpretation of Kolmogorov's theorem as irrelevant to neural networks
by pointing out that the $\varphi_{ij}$ functions are highly non-smooth and the functions
$g_j$ are not parameterized.
On the other hand, Kůrková
(1992) supported the relevance of this theorem to neural nets by arguing
that non-smooth functions can be approximated as sums of infinite series
of smooth functions; thus one should be able to approximately implement
$\varphi_{ij}$ and $g_j$ with parameterized
networks. More recently, Lin and Unbehauen (1993) argued that an "approximate"
implementation of $g_j$
does not, in general, deliver an approximate implementation of the original
function $f(\mathbf{x})$. As this debate continues, the importance of
Kolmogorov's theorem may lie not in its direct application to proving
the universality of neural nets as function approximators, but rather in
pointing to the feasibility of using parallel and layered network structures
for multivariate function mappings.

**2.3.2 Single Hidden Layer Neural Networks are Universal
Approximators**

Rigorous mathematical proofs of the universality of feedforward
layered neural nets employing continuous sigmoid-type, as well as other
more general, activation units were given independently by Cybenko (1989),
Hornik et al. (1989), and Funahashi (1989). Cybenko's proof is distinguished
by being mathematically concise and elegant [it is based on the Hahn-Banach
theorem (Luenberger, 1969)]. The following is the statement of Cybenko's
theorem; the reader is referred to the original paper by Cybenko (1989)
for the proof.

*Theorem 2.3.2 (Cybenko, 1989): Let $\varphi$ be any continuous sigmoid-type function, e.g., $\varphi(\xi) = 1/(1 + e^{-\xi})$. Then, given any continuous real-valued function $f$ on $[0, 1]^n$ (or any other compact subset of $\mathbb{R}^n$) and $\varepsilon > 0$, there exist vectors $\mathbf{w}_1, \mathbf{w}_2, ..., \mathbf{w}_N$, $\boldsymbol{\alpha}$, and $\boldsymbol{\theta}$, and a parameterized function $G(\cdot, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta}) : [0, 1]^n \to \mathbb{R}$ such that*

$$|G(\mathbf{x}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta}) - f(\mathbf{x})| < \varepsilon \quad \text{for all } \mathbf{x} \in [0, 1]^n$$

*where*

$$G(\mathbf{x}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta}) = \sum_{j=1}^{N} \alpha_j\, \varphi(\mathbf{w}_j^{\mathrm{T}} \mathbf{x} + \theta_j) \qquad (2.3.2)$$

*and $\mathbf{w}_j \in \mathbb{R}^n$; $\alpha_j, \theta_j \in \mathbb{R}$; $\mathbf{w} = (\mathbf{w}_1, \mathbf{w}_2, ..., \mathbf{w}_N)$; $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, ..., \alpha_N)$; and $\boldsymbol{\theta} = (\theta_1, \theta_2, ..., \theta_N)$.*

Hornik et al. (1989) [employing the Stone-Weierstrass
theorem (Rudin, 1964)] and Funahashi (1989) [using an integral formula
presented by Irie and Miyake (1988)] independently proved similar theorems
stating that a one-hidden-layer feedforward neural network is capable of
approximating uniformly any continuous multivariate function, to any desired
degree of accuracy. This implies that any failure of a function mapping
by a multilayer network must arise from an inadequate choice of parameters
(i.e., poor choices for $\mathbf{w}_1, \mathbf{w}_2, ..., \mathbf{w}_N$, $\boldsymbol{\alpha}$, and $\boldsymbol{\theta}$) or from an insufficient number of hidden units.
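A minimal numerical sketch of the form $G$ in Equation (2.3.2) makes this concrete. Here the hidden-layer parameters $\mathbf{w}_j$ and $\theta_j$ are fixed at random and only the output weights $\alpha_j$ are fit by least squares; the target function, the number of hidden units $N$, and the weight scale are all illustrative choices, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # continuous sigmoid-type activation
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function on a compact set, here f(x) = sin(2*pi*x) on [0, 1].
x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)

# N hidden sigmoidal units with randomly drawn w_j and theta_j (illustrative);
# the output weights alpha_j are then obtained by linear least squares.
N = 30
w = rng.normal(scale=10.0, size=N)
theta = rng.normal(scale=10.0, size=N)
H = sigmoid(np.outer(x, w) + theta)          # hidden-layer outputs, shape (200, N)
alpha, *_ = np.linalg.lstsq(H, f, rcond=None)

G = H @ alpha                                 # G(x, w, alpha, theta) as in Eq. (2.3.2)
print(np.max(np.abs(G - f)))                  # approximation error on the grid
```

Increasing $N$ (and choosing the parameters well) drives the error toward zero, which is exactly what Theorem 2.3.2 guarantees is possible.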

Using a theorem by Sun and Cheney (1992), Light (1992a)
extended Cybenko's results to any continuous function $f$ on $\mathbb{R}^n$
and showed that signed integer weights and thresholds are sufficient for
accurate approximation. In another version of the theorem, Light shows
that the sigmoid can be replaced by any continuous function $\varphi$
on $\mathbb{R}$ satisfying:

(2.3.3)

where $n$ is odd and $n \geq 3$. A similar result
can be established for $n$ even. Examples of activation functions
satisfying the above conditions are given by the family of functions

(2.3.4)

Note that the cosine term is the Chebyshev polynomial
of degree *n*. Figure 2.3.2 shows two plots of this activation function
for *n *= 3 and *n *= 7, respectively.
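The identity behind this remark can be checked directly: on $[-1, 1]$, $\cos(n \arccos \xi)$ coincides with the Chebyshev polynomial of degree $n$, e.g., $T_3(\xi) = 4\xi^3 - 3\xi$ (a standard fact, shown here as a quick numerical check).

```python
import math

def chebyshev_cos(n, xi):
    # cos(n * arccos(xi)), defined for xi in [-1, 1]
    return math.cos(n * math.acos(xi))

def T3(xi):
    # Chebyshev polynomial of degree 3
    return 4 * xi**3 - 3 * xi

# The two expressions agree (up to floating-point error) across [-1, 1].
for xi in (-0.9, -0.25, 0.0, 0.5, 0.8):
    assert abs(chebyshev_cos(3, xi) - T3(xi)) < 1e-12
print("cos(3*arccos(xi)) matches T3(xi) on [-1, 1]")
```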


Figure 2.3.2. Activation function $\varphi_n(\xi)$
for the cases (a) $n$ = 3 and (b) $n$ = 7.

The universality of single hidden layer nets with units
having non-sigmoid activation functions was formally proved by Stinchcombe
and White (1989). Baldi (1991) showed that a wide class of continuous multivariate
functions can be approximated by a weighted sum of bell-shaped functions
(multivariate Bernstein polynomials); i.e., a single hidden layer net with
bell-shaped activations for its hidden units and a single linear output
unit is a possible approximator of functions $f : \mathbb{R}^n \to \mathbb{R}$.
Hornik (1991) proved that a sufficient condition for universal approximation
is the use of continuous, bounded, and nonconstant hidden unit
activation functions. Recently, Leshno et al. (1993; see also Hornik, 1993)
have extended these results by showing that the above neural network with
locally bounded piecewise continuous activation functions for its hidden units
is a universal approximator if and only if the network's activation function
is not a polynomial. Ito (1991) showed that any function belonging to the
class of rapidly decreasing continuous functions in $\mathbb{R}^n$
(i.e., functions $f(\mathbf{x})$ satisfying $\sup_{\mathbf{x} \in \mathbb{R}^n} |x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n} f(\mathbf{x})| < \infty$ for
any $k_j \geq 0$)
can be approximated arbitrarily well by a two-layer architecture with a
finite number of LTGs in the hidden layer. Here, the requirement of rapid
decrease in $f$ is not necessary and can be weakened.

**2.3.3 Single Hidden Layer Neural Networks are Universal
Classifiers**

Theorem 2.3.2 may also be extended to classifier-type
mappings (Cybenko, 1989) of the form

$$f(\mathbf{x}) = j \quad \text{iff} \quad \mathbf{x} \in P_j \qquad (2.3.5)$$

where $f : A^n \to \{1, 2, ..., k\}$; $A^n$
is a compact (closed and bounded) subset of $\mathbb{R}^n$;
and $P_1, P_2, ..., P_k$ partition
$A^n$ into $k$ disjoint
measurable subsets, i.e., $\bigcup_{j=1}^{k} P_j = A^n$ and
$P_i \cap P_j$ is empty for $i \neq j$.
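As a concrete (hypothetical) instance of the mapping in Equation (2.3.5), take $A^2 = [0, 1]^2$ partitioned into $k = 2$ regions by the line $x_1 = x_2$; the partition is an illustrative choice, not one from the text.

```python
def classify(x1, x2):
    # f : A^2 -> {1, 2} with P1 = {x in [0,1]^2 : x1 <= x2}
    # and P2 = {x in [0,1]^2 : x1 > x2} (hypothetical partition).
    return 1 if x1 <= x2 else 2

print(classify(0.2, 0.7))  # point in P1 -> class 1
print(classify(0.9, 0.1))  # point in P2 -> class 2
```

The theorem says a single hidden layer net with one linear output can approximate such piecewise-constant class-label mappings arbitrarily well (except near the partition boundaries).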

Thus, a single hidden layer net with sigmoidal activation
units and a single linear output unit is a universal classifier. This result
was confirmed empirically by Huang and Lippmann (1988) on several examples
including the ones shown in Figure 2.3.3 for *n *= 2 and *k *=
2 (i.e., two-class problems in a two-dimensional space).

Figure 2.3.3. Ten complex decision regions formed by a
neural net classifier with a single hidden layer. (Adapted from W. Y. Huang
and R. P. Lippmann, 1988, with permission of the American Institute of
Physics.)

We conclude this section by noting that in Equation (2.3.5),
the class label is explicitly generated as the output of a linear unit.
This representation of class labels via quantized levels of a single linear
output unit, although useful for theoretical considerations, is not practical;
it imposes unnecessary constraints on the hidden layer, which, in turn,
lead to a large number of hidden units for the realization of complex
mappings. In practical implementations, a local encoding of classes is
more commonly used (Rumelhart et al., 1986a). This relaxes the constraints
on the hidden layer mapping by adding several output units, each of which
is responsible for representing a unique class. Here, LTGs or sigmoid-type
units may be used as output units.
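Under the local encoding just described, a $k$-class label is represented by $k$ output units, exactly one of which is active; class $j$ activates the $j$-th unit. The sketch below shows this standard one-hot scheme (the function name is illustrative).

```python
def local_encoding(j, k):
    # One output unit per class: unit j is active (1), all others are 0.
    return [1 if i == j else 0 for i in range(1, k + 1)]

print(local_encoding(2, 4))  # class 2 of 4 -> [0, 1, 0, 0]
```

Compared with a single quantized linear output, each unit only needs to separate its own class from the rest, which relaxes the demands placed on the hidden-layer mapping.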