2.3 Approximation Capabilities of Feedforward Neural Networks for Continuous Functions
This section summarizes some fundamental results, in the
form of theorems, on continuous function approximation capabilities of
feedforward nets. The main result is that a two layer feedforward net with
a sufficient number of hidden units, of the sigmoidal activation type,
and a single linear output unit is capable of approximating any continuous
function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ to any desired accuracy. Before formally stating the above result, we
consider some early observations on the implications of a classical theorem
on function approximation, Kolmogorov's theorem, which motivates the use
of layered feedforward nets as function approximators.
2.3.1 Kolmogorov's Theorem
It has been suggested (Hecht-Nielsen, 1987 and 1990; Lippmann, 1987; Sprecher, 1993) that Kolmogorov's theorem, concerning the realization of arbitrary multivariate functions, provides theoretical support for neural networks that implement such functions.
Theorem 2.3.1 (Kolmogorov, 1957): Any continuous
real-valued function $f(x_1, x_2, \ldots, x_n)$
defined on $[0, 1]^n$, $n \geq 2$,
can be represented in the form

$f(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{2n+1} g_j\!\left( \sum_{i=1}^{n} \phi_{ij}(x_i) \right)$   (2.3.1)

where the $g_j$'s are properly chosen continuous functions of one variable, and the $\phi_{ij}$'s
are continuous monotonically increasing functions independent of $f$.
The basic idea in Kolmogorov's theorem is captured in
the network architecture of Figure 2.3.1, where a universal transformation
M maps $\mathbb{R}^n$ into
several uni-dimensional transformations. The theorem states that one can
express a continuous multivariate function on a compact set in terms of
sums and compositions of a finite number of single variable functions.
Figure 2.3.1. Network representation of Kolmogorov's
theorem.
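To make the superposition in Equation (2.3.1) concrete, the following Python sketch evaluates a network of the form shown in Figure 2.3.1. The particular choices of $\phi_{ij}$ and $g_j$ below are purely illustrative placeholders; Kolmogorov's theorem guarantees that suitable functions exist, but it does not provide them.

    import numpy as np

    def kolmogorov_superposition(x, phi, g):
        """Evaluate f(x) = sum_j g_j( sum_i phi_ij(x_i) ), as in Eq. (2.3.1).

        x   : array of shape (n,), the input in [0, 1]^n
        phi : list of lists, phi[i][j] is the monotone inner function phi_ij
        g   : list of outer functions g_j, one per branch (2n + 1 in the theorem)
        """
        n, m = len(x), len(g)
        total = 0.0
        for j in range(m):
            inner = sum(phi[i][j](x[i]) for i in range(n))  # f-independent inner map
            total += g[j](inner)                            # f-dependent outer map
        return total

    # Illustrative (not theorem-prescribed) choices for n = 2, m = 2n + 1 = 5:
    n, m = 2, 5
    phi = [[(lambda x, a=i + 1, b=j + 1: np.tanh(a * x + 0.1 * b)) for j in range(m)]
           for i in range(n)]
    g = [(lambda s, c=j + 1: np.sin(c * s)) for j in range(m)]

    print(kolmogorov_superposition(np.array([0.3, 0.7]), phi, g))
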
Others, such as Girosi and Poggio (1989), have criticized
this interpretation of Kolmogorov's theorem as irrelevant to neural networks
by pointing out that the $\phi_{ij}$ functions are highly non-smooth and the functions
$g_j$ are not parameterized.
On the other hand, Kůrková
(1992) supported the relevance of this theorem to neural nets by arguing
that non-smooth functions can be approximated as sums of infinite series
of smooth functions; thus one should be able to approximately implement
$\phi_{ij}$ and $g_j$ with parameterized
networks. More recently, Lin and Unbehauen (1993) argued that an "approximate"
implementation of $g_j$
does not, in general, deliver an approximate implementation of the original
function $f(\mathbf{x})$. As this debate continues, the importance of
Kolmogorov's theorem may lie not so much in its direct application to proving
the universality of neural nets as function approximators as in its pointing
to the feasibility of using parallel and layered network structures
for multivariate function mappings.
2.3.2 Single Hidden Layer Neural Networks are Universal Approximators
Rigorous mathematical proofs for the universality of feedforward
layered neural nets employing continuous sigmoid-type activation units, as well as
other more general activation units, were given independently by Cybenko (1989),
Hornik et al. (1989), and Funahashi (1989). Cybenko's proof is distinguished
by being mathematically concise and elegant [it is based on the Hahn-Banach
theorem (Luenberger, 1969)]. The following is the statement of Cybenko's
theorem (the reader is referred to the original paper by Cybenko (1989)
for the proof).
Theorem 2.3.2 (Cybenko, 1989): Let $\sigma$
be any continuous sigmoid-type function [e.g., $\sigma(z) = (1 + e^{-z})^{-1}$].
Then, given any continuous real-valued function $f$ on $[0, 1]^n$
(or any other compact subset of $\mathbb{R}^n$)
and $\varepsilon > 0$, there exist
vectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N$,
$\boldsymbol{\alpha}$, and $\boldsymbol{\theta}$, and a parameterized function
$G(\cdot, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta}): [0, 1]^n \rightarrow \mathbb{R}$
such that

$| G(\mathbf{x}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta}) - f(\mathbf{x}) | < \varepsilon$   for all $\mathbf{x} \in [0, 1]^n$

where

$G(\mathbf{x}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta}) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left(\mathbf{w}_j^{\mathrm{T}} \mathbf{x} + \theta_j\right)$   (2.3.2)

and $\mathbf{w}_j \in \mathbb{R}^n$, $\alpha_j, \theta_j \in \mathbb{R}$,
$\mathbf{w} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N)$,
$\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_N)$, and
$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_N)$.
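To illustrate the form of Equation (2.3.2), the following Python/NumPy sketch evaluates $G(\mathbf{x}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\theta})$ for a single hidden layer of $N$ sigmoidal units feeding one linear output unit. The parameter values are arbitrary placeholders; the theorem only asserts that suitable values exist for a given $f$ and $\varepsilon$.

    import numpy as np

    def sigmoid(z):
        """Continuous sigmoid-type activation, e.g. the logistic function."""
        return 1.0 / (1.0 + np.exp(-z))

    def G(x, w, alpha, theta):
        """Cybenko's form: G(x) = sum_j alpha_j * sigmoid(w_j . x + theta_j).

        x     : input vector of shape (n,)
        w     : hidden weight matrix of shape (N, n), rows are w_j
        alpha : output-layer weights of shape (N,)
        theta : hidden-unit thresholds of shape (N,)
        """
        return alpha @ sigmoid(w @ x + theta)

    # Placeholder parameters for n = 3 inputs and N = 10 hidden units.
    rng = np.random.default_rng(0)
    n, N = 3, 10
    w, alpha, theta = rng.normal(size=(N, n)), rng.normal(size=N), rng.normal(size=N)

    x = np.array([0.2, 0.5, 0.9])     # a point in [0, 1]^3
    print(G(x, w, alpha, theta))      # scalar output of the linear unit
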
Hornik et al. (1989) [employing the Stone-Weierstrass
Theorem (Rudin, 1964)] and Funahashi (1989) [using an integral formula
presented by Irie and Miyake (1988)] independently proved similar theorems
stating that a one hidden layer feedforward neural network is capable of
approximating uniformly any continuous multivariate function, to any desired
degree of accuracy. This implies that any failure of a function mapping
by a multilayer network must arise from inadequate choice of parameters
(i.e., poor choices for $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N$,
$\boldsymbol{\alpha}$, and $\boldsymbol{\theta}$) or an insufficient number of hidden nodes. Hornik et al. (1990)
proved another important result relating to the approximation capability
of multilayer feedforward neural nets employing sigmoidal hidden unit activations.
They showed that these networks can approximate not only an unknown function,
but also approximate its derivative. In fact, Hornik et al. (1990) also
showed that these networks can approximate functions that are not differentiable
in the classical sense, but possess a generalized derivative as in the
case of piecewise differentiable functions.
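The flavor of this derivative-approximation result can be checked numerically. The sketch below is a simplified illustration (not the construction used by Hornik et al.): only the output weights of a sigmoidal hidden layer are fitted by least squares, with randomly chosen and then fixed hidden weights and thresholds, and the network's exact derivative is compared with $f'$.

    import numpy as np

    # Simplified numerical check: fit only the output weights of a sigmoidal
    # hidden layer to f(x) = sin(2*pi*x) by least squares, then compare the
    # network's analytic derivative with f'(x) = 2*pi*cos(2*pi*x).
    rng = np.random.default_rng(1)
    N = 50                                    # number of hidden units
    w = rng.normal(scale=10.0, size=N)        # random hidden weights (kept fixed)
    theta = rng.uniform(-10.0, 0.0, size=N)   # random hidden thresholds (kept fixed)

    x = np.linspace(0.0, 1.0, 200)
    f, df = np.sin(2 * np.pi * x), 2 * np.pi * np.cos(2 * np.pi * x)

    sig = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + theta)))   # hidden activations
    alpha, *_ = np.linalg.lstsq(sig, f, rcond=None)         # fit output weights

    G = sig @ alpha                                          # network output
    dG = (sig * (1.0 - sig) * w) @ alpha                     # its exact derivative

    print("max |G - f|  :", np.max(np.abs(G - f)))
    print("max |G' - f'|:", np.max(np.abs(dG - df)))
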
Using a theorem by Sun and Cheney (1992), Light (1992a) extended Cybenko's results to any continuous function $f$ on $\mathbb{R}^n$ and showed that signed integer weights and thresholds are sufficient for accurate approximation. In another version of the theorem, Light shows that the sigmoid can be replaced by any continuous function on $\mathbb{R}$ satisfying:
(2.3.3)
where n is odd and $n \geq 3$. A similar result
can be established for n even. Examples of activation functions
satisfying the above conditions are given by the family of functions
(2.3.4)
Note that the cosine term is the Chebyshev polynomial
of degree n. Figure 2.3.2 shows two plots of this activation function
for n = 3 and n = 7, respectively.
Figure 2.3.2. Activation function of Equation (2.3.4) for the case (a) n = 3, and (b) n = 7.
The universality of single hidden layer nets with units
having non-sigmoid activation functions was formally proved by Stinchcombe
and White (1989). Baldi (1991) showed that a wide class of continuous multivariate
functions can be approximated by a weighted sum of bell-shaped functions
(multivariate Bernstein polynomials); i.e., a single hidden layer net with
bell-shaped activations for its hidden units and a single linear output
unit is a possible approximator of functions $f: \mathbb{R}^n \rightarrow \mathbb{R}$.
Hornik (1991) proved that a sufficient condition for universal approximation
can be obtained by using continuous, bounded, and nonconstant hidden unit
activation functions. Recently, Leshno et al. (1993, see also Hornik, 1993)
have extended these results by showing that the above neural network with
locally bounded piecewise continuous activation functions for hidden units
is a universal approximator if and only if the network's activation function
is not a polynomial. Ito (1991) showed that any function belonging to the
class of rapidly decreasing continuous functions in $\mathbb{R}^n$
(i.e., functions $f(\mathbf{x})$ satisfying $x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n} f(\mathbf{x}) \rightarrow 0$
as $\|\mathbf{x}\| \rightarrow \infty$ for any $k_j \geq 0$)
can be approximated arbitrarily well by a two layer architecture with a
finite number of LTGs in the hidden layer. Here, the requirement of rapid
decrease in f is not necessary and can be weakened.
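A simple numerical illustration of the Leshno et al. condition is sketched below (an assumed setup in which only the output weights are fitted over fixed random hidden units): with a polynomial activation, the network output is itself a polynomial of bounded degree no matter how many hidden units are used, whereas a non-polynomial activation such as tanh carries no such limitation.

    import numpy as np

    # Illustrative comparison: with a fixed polynomial activation,
    # sum_j alpha_j * act(w_j*x + theta_j) is itself a polynomial of the same
    # bounded degree, so it cannot be a universal approximator; a non-polynomial
    # activation (here tanh) has no such limitation.
    rng = np.random.default_rng(2)
    N = 100
    w = rng.normal(scale=5.0, size=N)          # fixed random hidden weights
    theta = rng.uniform(-5.0, 5.0, size=N)     # fixed random hidden thresholds

    x = np.linspace(-1.0, 1.0, 400)
    f = np.abs(x)                              # target: continuous, non-smooth

    def best_fit_error(activation):
        """Least-squares fit of the output weights over fixed hidden units."""
        H = activation(np.outer(x, w) + theta)          # hidden-layer outputs
        alpha, *_ = np.linalg.lstsq(H, f, rcond=None)
        return np.max(np.abs(H @ alpha - f))

    print("polynomial (z**2) activation:", best_fit_error(lambda z: z ** 2))
    print("non-polynomial (tanh):       ", best_fit_error(np.tanh))
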
2.3.3 Single Hidden Layer Neural Networks are Universal Classifiers
Theorem 2.3.2 may also be extended to classifier-type
mappings (Cybenko, 1989) of the form
$f(\mathbf{x}) = j$ iff $\mathbf{x} \in P_j$   (2.3.5)
where $f: A^n \rightarrow \{1, 2, \ldots, k\}$. $A^n$
is a compact (closed and bounded) subset of $\mathbb{R}^n$,
and $P_1, P_2, \ldots, P_k$ partition
$A^n$ into $k$ disjoint measurable subsets, i.e., $\bigcup_{j=1}^{k} P_j = A^n$ and
$P_i \cap P_j$ is empty for $i \neq j$.
Thus, a single hidden layer net with sigmoidal activation
units and a single linear output unit is a universal classifier. This result
was confirmed empirically by Huang and Lippmann (1988) on several examples
including the ones shown in Figure 2.3.3 for n = 2 and k =
2 (i.e., two-class problems in a two-dimensional space).
Figure 2.3.3. Ten complex decision regions formed by a
neural net classifier with a single hidden layer. (Adapted from W. Y. Huang
and R. P. Lippmann, 1988, with permission of the American Institute of
Physics.)
We conclude this section by noting that in Equation (2.3.5),
the class label is explicitly generated as the output of a linear unit.
This representation of class labels via quantized levels of a single linear
output unit, although useful for theoretical considerations, is not practical;
it imposes unnecessary constraints on the hidden layer, which, in turn,
leads to a large number of hidden units for the realization of complex
mappings. In practical implementations, a local encoding of classes is
more commonly used (Rumelhart et al., 1986a). This relaxes the constraints
on the hidden layer mapping by adding several output units, each of which
is responsible for representing a unique class. Here, LTGs or sigmoid-type
units may be used as output units.
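As a concrete contrast between the two output representations discussed above, the following Python sketch (with illustrative names) builds training targets both as quantized levels of a single output unit, as in Equation (2.3.5), and as a local (one-hot) code with one output unit per class.

    import numpy as np

    def single_output_targets(labels):
        """Class label as the quantized level of one linear output unit (Eq. 2.3.5)."""
        return np.asarray(labels, dtype=float)              # e.g. class j -> target j

    def local_encoding_targets(labels, k):
        """Local (one-hot) encoding: one output unit per class, as used in practice."""
        labels = np.asarray(labels)
        targets = np.zeros((labels.size, k))
        targets[np.arange(labels.size), labels - 1] = 1.0   # classes numbered 1..k
        return targets

    labels = [1, 3, 2, 3]                    # four patterns, k = 3 classes
    print(single_output_targets(labels))     # [1. 3. 2. 3.]
    print(local_encoding_targets(labels, 3)) # one row per pattern, one column per class
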