In unsupervised learning, there is no teacher signal.
We are given a training set {**x***i*;
*i* = 1, 2, ..., *m*} of unlabeled vectors in R*n*.
The objective is to categorize the data or to discover features or regularities
in it. In some cases the **x***i*'s
must be mapped into a lower dimensional set of patterns such that
any topological relations existing among the **x***i*'s
are preserved among the new set of patterns. Normally, the success
of unsupervised learning hinges on an appropriately designed
network that embodies a task-independent criterion of the
quality of the representation the network is required to learn.
The weights of the network are then optimized with respect
to this criterion.

Our interest here is in training networks of simple
units to perform the above tasks. In the remainder of this chapter,
some basic unsupervised learning rules for a single unit and for
simple networks are introduced. The following three classes of
unsupervised rules are considered: Hebbian learning, competitive
learning, and self-organizing feature map learning. Hebbian learning
is treated first. Then, competitive learning and self-organizing
feature map learning are covered in Sections 3.4 and 3.5, respectively.

The rules considered in this section are motivated
by the classical Hebbian synaptic modification hypothesis (Hebb,
1949). Hebb suggested that biological synaptic efficacies (**w**)
change in proportion to the correlation between the firing of
the pre- and post-synaptic neurons, **x** and *y*, respectively,
which may be stated formally as (Stent, 1973; Changeux and Danchin,
1976)

$$\Delta\mathbf{w}^{k} = \mathbf{w}^{k+1} - \mathbf{w}^{k} = \rho\, y^{k}\, \mathbf{x}^{k} \qquad (3.3.1)$$

where ρ > 0 is the learning rate and the unit's output is *y* = **w**T**x**.

Let us now assume that the input vectors are drawn
from an arbitrary probability distribution *p*(**x**).
Let us also assume that the network being trained consists of
a single unit. At each time *k*, we will present a vector
**x**, randomly drawn from *p*(**x**), to this unit.
We will employ the Hebbian rule in Equation (3.3.1) to update
the weight vector **w**. The expected weight change
can be evaluated by averaging Equation (3.3.1) over all inputs
**x**. This gives

$$\left\langle \Delta\mathbf{w}^{k} \right\rangle = \rho \left\langle y\,\mathbf{x} \right\rangle = \rho \left\langle \mathbf{x}\mathbf{x}^{\mathrm{T}}\mathbf{w} \right\rangle \qquad (3.3.2)$$

or, assuming **x** and **w** are statistically
independent,

$$\left\langle \Delta\mathbf{w}^{k} \right\rangle = \rho \left\langle \mathbf{x}\mathbf{x}^{\mathrm{T}} \right\rangle \mathbf{w} = \rho\,\mathbf{C}\mathbf{w} \qquad (3.3.3)$$

Since, at equilibrium, ⟨Δ**w**⟩ = **0**, Equation (3.3.3) leads to
**Cw*** = **0**, and thus **w*** = **0** is the only equilibrium state
(assuming **C** is nonsingular). Here, **C** = ⟨**xx**T⟩ is known
as the autocorrelation matrix and is given by

$$\mathbf{C} = \left\langle \mathbf{x}\mathbf{x}^{\mathrm{T}} \right\rangle = \begin{bmatrix} \langle x_1^2 \rangle & \langle x_1 x_2 \rangle & \cdots & \langle x_1 x_n \rangle \\ \langle x_2 x_1 \rangle & \langle x_2^2 \rangle & \cdots & \langle x_2 x_n \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle x_n x_1 \rangle & \langle x_n x_2 \rangle & \cdots & \langle x_n^2 \rangle \end{bmatrix} \qquad (3.3.4)$$

The terms on the main diagonal of **C** are the
mean squares of the input components, and the cross terms are
the cross correlations among the input components. **C** is
a Hermitian matrix (real and symmetric). Thus, its eigenvalues,
λ*i*, *i* = 1, 2, ..., *n*,
are positive real or zero, and it has orthogonal eigenvectors,
**c**(*i*). From
the definition of an eigenvector, each **c**(*i*)
satisfies the relation **Cc**(*i*) = λ*i***c**(*i*).

It can be shown that the solution **w*****
= **0** is not stable (see Section 4.3.1). Therefore, Equation
(3.3.3) is unstable and it drives **w** to infinite magnitude,
with a direction parallel to that of the eigenvector of **C**
with the largest eigenvalue. It will be shown, in Section 4.3.1,
that this learning rule tends to maximize the mean square of *y*,
⟨*y*2⟩. In other words, this rule is driven
to maximize the variance of the output of a linear unit, assuming
that *y* has zero mean. A zero mean *y* can be achieved
if the unit inputs are independent random variables with zero
mean. One way to prevent the divergence of the Hebbian learning
rule in Equation (3.3.1) is to normalize **w** to unit length
after each learning step (von der Malsburg, 1973; Rubner
and Tavan, 1989). This leads to the update rule:

$$\mathbf{w}^{k+1} = \frac{\mathbf{w}^{k} + \rho\, y^{k}\, \mathbf{x}^{k}}{\left\| \mathbf{w}^{k} + \rho\, y^{k}\, \mathbf{x}^{k} \right\|} \qquad (3.3.5)$$
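To make these updates concrete, the following minimal sketch (Python/NumPy, for a single linear unit) implements one step of the plain Hebbian rule of Equation (3.3.1) and of the normalized variant of Equation (3.3.5). The function and variable names are illustrative, not from the text.

```python
import numpy as np

def hebbian_step(w, x, rho=0.02):
    """One step of the plain Hebbian rule (3.3.1): w <- w + rho * y * x."""
    y = w @ x                       # linear unit output y = w^T x
    return w + rho * y * x          # the norm of w grows without bound over time

def normalized_hebbian_step(w, x, rho=0.02):
    """One step of the normalized rule (3.3.5): Hebbian update, then rescale to unit length."""
    y = w @ x
    w_new = w + rho * y * x
    return w_new / np.linalg.norm(w_new)
```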

In the following, we briefly describe additional
stable Hebbian-type learning rules. The detailed analysis of
these rules is deferred to Chapter 4.

An alternative approach for stabilizing the Hebbian
rule is to modify it by adding a weight decay proportional to
*y*2 (Oja, 1982).
This results in Oja's rule:

$$\Delta\mathbf{w}^{k} = \rho\, y^{k} \left( \mathbf{x}^{k} - y^{k}\, \mathbf{w}^{k} \right) \qquad (3.3.6)$$

Oja's rule converges in the mean to a state **w***
which maximizes the mean value of *y*2,
⟨*y*2⟩, subject to the constraint ||**w**|| = 1.
It can also be shown that the solution **w***
is the principal eigenvector (the one with the largest corresponding
eigenvalue) of **C** (Oja, 1982; Oja and Karhunen, 1985).
The analysis of this rule is covered in Chapter 4.
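A minimal training-loop sketch for Oja's rule (3.3.6) is given below (Python/NumPy; names and parameter values are illustrative). On zero-mean data, **w** should approach a unit-norm vector aligned with the principal eigenvector of **C**.

```python
import numpy as np

def train_oja(X, rho=0.02, epochs=100, seed=0):
    """Train a single linear unit with Oja's rule (3.3.6) on the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    w = rng.normal(size=n)
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += rho * y * (x - y * w)   # Hebbian term plus y^2-proportional weight decay
    return w                             # approximately +/- the principal eigenvector of C

# Sanity check (illustrative): compare with the principal eigenvector obtained from
# C = X.T @ X / len(X) via np.linalg.eigh(C).
```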

Other modifications to the Hebbian rule have been
proposed to prevent divergence. Yuille et al. (1989) proposed
the rule

$$\Delta\mathbf{w}^{k} = \rho \left( y^{k}\, \mathbf{x}^{k} - \left\| \mathbf{w}^{k} \right\|^{2} \mathbf{w}^{k} \right) \qquad (3.3.7)$$

It can be shown (see Problem 3.3.4) that, in an average
sense, the weight vector update is given by gradient descent on
the criterion function:

$$J(\mathbf{w}) = -\tfrac{1}{2}\, \mathbf{w}^{\mathrm{T}}\mathbf{C}\mathbf{w} + \tfrac{1}{4} \left\| \mathbf{w} \right\|^{4} \qquad (3.3.8)$$

In Section 4.3.4, it is shown that Equation (3.3.7)
converges to a vector **w***
that points in the same (or opposite) direction of the principal
eigenvector of **C** and whose norm is given by the square
root of the largest eigenvalue of **C**.
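For comparison with the Oja sketch above, one step of the Yuille et al. rule (3.3.7) may be sketched as follows (same illustrative conventions); at convergence ||**w**|| should approach the square root of the largest eigenvalue of **C**.

```python
import numpy as np

def yuille_step(w, x, rho=0.02):
    """One step of the Yuille et al. rule (3.3.7): w <- w + rho * (y*x - ||w||^2 * w)."""
    y = w @ x
    return w + rho * (y * x - (w @ w) * w)   # ||w|| tends to sqrt(largest eigenvalue of C)
```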

**Example 3.3.1:** In this
example, the convergence behavior of Oja's and Yuille et al. learning
rules are demonstrated on zero-mean random data. Figures 3.3.1
through 3.3.4 show two sets of simulation results for the evolution
of the weight vector of a single linear unit trained with Oja's
rule and with Yuille et al. rule. Figures 3.3.1 and 3.3.2 show
the behavior of the norm and the direction (cosine of the angle
between **w** and the eigenvector of **C** with the largest
eigenvalue) of the weight vector **w** as a function of iteration
number, *k*, respectively. In this simulation, the data
(training set) consists of sixty 15-dimensional vectors whose
components are generated randomly and independently from a uniform
distribution in the range [-0.5, +0.5].
In this particular case, the data set leads to a correlation
matrix having its two largest eigenvalues equal to 0.1578 and
0.1515, respectively. During learning, the training vectors are
presented in a fixed, cyclic order, and a learning rate of ρ = 0.02
is used. As can be seen from Figure 3.3.1, the length of **w***k*
(practically) converges after 3,000 iterations. Here, both the Oja
and Yuille et al. rules converge to the theoretically predicted
values of 1 and √λ1 ≈ 0.397, respectively. The two
rules exhibit an almost identical weight vector direction evolution
as depicted in Figure 3.3.2. The figure shows an initial low
overlap between the starting weight vector and the principal eigenvector
of the data correlation matrix. This overlap increases slowly
at first, but then increases rapidly towards 1. As the direction
of **w** approaches that of the principal eigenvector (i.e.,
cos θ approaches 1), the convergence becomes slow again. Note that
Figure 3.3.2 only shows the evolution of cos θ over the first 3,600
iterations. Beyond this point, the convergence becomes very slow.
This is due to the uniform nature of the data, which does not
allow for the principal eigenvector to dominate all other eigenvectors.
Thus, a strong competition emerges among several eigenvectors,
each attempting to align **w***k*
along its own direction; the end result being the slow convergence
of cos θ.

The second set of simulations involves a training
set of sixty 15-dimensional vectors drawn randomly from a normal
distribution with zero mean and variance of 1. This data leads
to a correlation matrix **C** with a dominating eigenvector,
relative to that of the previous data set. Here, the largest
two eigenvalues of **C** are equal to 2.1172 and 1.6299, respectively.
Again, Oja and Yuille et al. rules are used, but with a learning
rate of ρ = 0.01. Figure 3.3.3 shows a better-behaved convergence
for Oja's rule as compared to the Yuille et al. rule. The latter rule
exhibits an oscillatory behavior in ||**w***k*||
about its theoretical asymptotic value, √λ1 ≈ 1.455.
As for the convergence of the direction of **w**, Figure 3.3.4
shows a comparable behavior for both rules. Here, the existence
of a dominating eigenvector for **C** is responsible for the
relatively faster convergence of cos θ, as compared to the earlier
simulation in Figure 3.3.2. Finally, we note that the oscillatory
behavior in Figures 3.3.3 and 3.3.4 can be significantly reduced
by resorting to smaller constant learning rates or by using a
decaying learning rate of the form ρ*k* ∝ 1/*k*.
This, however, leads to slower convergence speeds for both ||**w***k*||
and cos θ.
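A short script reproducing the flavor of the first simulation in this example is sketched below (Python/NumPy, assuming the uniform data and ρ = 0.02 described above; the random draw, and hence the exact eigenvalues and convergence curves, will differ from those reported).

```python
import numpy as np

rng = np.random.default_rng(0)
rho, epochs = 0.02, 200
X = rng.uniform(-0.5, 0.5, size=(60, 15))            # sixty 15-dimensional training vectors

C = X.T @ X / len(X)                                  # sample autocorrelation matrix
eigvals, eigvecs = np.linalg.eigh(C)                  # ascending eigenvalues
c1, lam1 = eigvecs[:, -1], eigvals[-1]                # principal eigenvector and eigenvalue

w_oja = rng.normal(size=15); w_oja /= np.linalg.norm(w_oja)
w_yui = w_oja.copy()

for _ in range(epochs):
    for x in X:                                       # fixed, cyclic order of presentation
        y = w_oja @ x
        w_oja += rho * y * (x - y * w_oja)            # Oja's rule (3.3.6)
        y = w_yui @ x
        w_yui += rho * (y * x - (w_yui @ w_yui) * w_yui)   # Yuille et al. rule (3.3.7)

print("norms (Oja, Yuille):", np.linalg.norm(w_oja), np.linalg.norm(w_yui),
      "  expected: 1,", np.sqrt(lam1))
print("cos(theta) (Oja, Yuille):",
      abs(w_oja @ c1) / np.linalg.norm(w_oja),
      abs(w_yui @ c1) / np.linalg.norm(w_yui))
```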

Figure 3.3.1. Weight vector magnitude vs. time for
Oja's rule (solid curve) and for Yuille et al. rule (dashed curve)
with ρ = 0.02. The training set consists of sixty 15-dimensional
real-valued vectors whose components are generated according to
a uniform random distribution in the range [-0.5, +0.5].

Figure 3.3.2. Evolution of the cosine of the angle
between the weight vector and the principal eigenvector of the
correlation matrix for Oja's rule (solid curve) and for Yuille
et al. rule (dashed curve) with ρ = 0.02 (the dashed
line is hard to see because it overlaps with the solid line).
The training set is the same as for Figure 3.3.1.

Figure 3.3.3. Weight vector magnitude vs. time for
Oja's rule (solid curve) and for Yuille et al. rule (dashed curve)
with ρ = 0.01. The training set consists of sixty 15-dimensional
real-valued vectors whose components are generated according to
N(0, 1).

Figure 3.3.4. Evolution of the cosine of the angle
between the weight vector and the principal eigenvector of the
correlation matrix for Oja's rule (solid curve) and for Yuille
et al. rule (dashed curve) with ρ = 0.01. The training
set is the same as for Figure 3.3.3.

Linsker (1986, 1988) proposed and studied the general
unsupervised learning rule:

$$\Delta\mathbf{w}^{k} = \rho \left( y^{k}\, \mathbf{x}^{k} + a\, \mathbf{x}^{k} + y^{k}\, \mathbf{b} + \mathbf{d} \right) \qquad (3.3.9)$$

subject to bounding constraints on the weights, *w*- ≤ *w**i* ≤ *w*+,
with *y* = **w**T**x**
and ρ > 0. The average weight change in Equation
(3.3.9) is given by

$$\left\langle \Delta\mathbf{w} \right\rangle = \rho \left( \mathbf{C}\mathbf{w} + a \left\langle \mathbf{x} \right\rangle + \left\langle y \right\rangle \mathbf{b} + \mathbf{d} \right) \qquad (3.3.10)$$

where ⟨*y*⟩ = ⟨**x**⟩T**w**. If we set *a* > 0,
*b**i* = −*a*
for *i* = 1, 2, ..., *n*, and **d** = **0**
in Equation (3.3.10), we get

$$\left\langle \Delta\mathbf{w} \right\rangle = \rho \left[ \mathbf{C}\mathbf{w} + a \left\langle \mathbf{x} \right\rangle - a \left( \left\langle \mathbf{x} \right\rangle^{\mathrm{T}}\mathbf{w} \right) \mathbf{1} \right] \qquad (3.3.11)$$

where **1** is an *n*-dimensional column
vector of ones. Equation (3.3.11) gives the *i*th average
weight change:

$$\left\langle \Delta w_{i} \right\rangle = \rho \left[ \mathbf{C}_{i}\,\mathbf{w} + a \left\langle x_{i} \right\rangle - a \left\langle \mathbf{x} \right\rangle^{\mathrm{T}}\mathbf{w} \right] \qquad (3.3.12)$$

where **C***i*
is the *i*th row of **C**. If we assume that the components
of **x** are random and are drawn from the same probability
distribution with mean μ, then ⟨*x**i*⟩ = μ for all
*i*, and Equation (3.3.11) reduces to

$$\left\langle \Delta\mathbf{w} \right\rangle = \rho \left[ \mathbf{C}\mathbf{w} + a\mu \left( 1 - \mathbf{w}^{\mathrm{T}}\mathbf{1} \right) \mathbf{1} \right] \qquad (3.3.13)$$

where ⟨*y*⟩ = μ**w**T**1** was used. Here, it
can be easily shown that Equation (3.3.13) has the following associated
criterion function:

$$J(\mathbf{w}) = \tfrac{1}{2}\, \mathbf{w}^{\mathrm{T}}\mathbf{C}\mathbf{w} - \tfrac{a\mu}{2} \left( \mathbf{w}^{\mathrm{T}}\mathbf{1} - 1 \right)^{2} \qquad (3.3.14)$$

Therefore, by noting that **w**T**Cw**
is the mean square of the unit's output activity, ⟨*y*2⟩,
the simplified form of ⟨Δ**w**⟩ in Equation (3.3.13)
corresponds to maximizing ⟨*y*2⟩ subject to
the (approximately enforced) constraint **w**T**1** = 1. It can be shown that
this rule is unstable, but with the restriction *w*- ≤ *w**i* ≤ *w*+,
the final state will be clamped at a boundary value, *w*-
or *w*+. If *a* is sufficiently large, *w*- = −1,
*w*+ = +1,
and *n* is odd, then the above rule converges to a weight
vector with (*n* + 1)/2 weights equal to *w*+
and the remaining (*n* − 1)/2 weights equal to *w*-.
The weight vector configuration is such that ⟨*y*2⟩
is maximized. For the *n* even case, one of the weights
at *w*-
will be pushed towards zero so that **w**T**1** = 1 is
maintained.
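The clamping behavior described above can be illustrated with the following rough sketch, which iterates the simplified averaged dynamics of Equation (3.3.13) and clips the weights to [*w*-, *w*+] after each step. The data distribution, the value of *a*, and the use of the averaged (rather than per-sample) update are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, a = 5, 0.05, 10.0                        # odd n; large 'a' to hold sum(w) near 1
w_minus, w_plus = -1.0, 1.0

X = rng.uniform(0.0, 1.0, size=(500, n))         # inputs with a common positive mean
C = X.T @ X / len(X)                             # autocorrelation matrix <x x^T>
mu = X.mean()                                    # common component mean

w = rng.uniform(-0.1, 0.1, size=n)
for _ in range(5000):
    w += rho * (C @ w + a * mu * (1.0 - w.sum()) * np.ones(n))   # averaged update (3.3.13)
    w = np.clip(w, w_minus, w_plus)              # bounding constraint w- <= w_i <= w+

# Weights tend to be driven to the bounds while their sum stays near 1.
print("final weights:", np.round(w, 3), " sum:", round(w.sum(), 3))
```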

**3.3.5 Hebbian Learning in a Network Setting: Principal
Component Analysis (PCA)**

Amari (1977a) and later Linsker (1988) pointed out
that principal component analysis (PCA) is equivalent to maximizing
the information content in the outputs of a network of linear
units. The aim of PCA is to extract *m* normalized orthogonal
vectors, **u***i*,
*i* = 1, 2, ..., *m*, in the
input space that account for as much of the data's variance as
possible. Subsequently, the *n*-dimensional input data (vectors
**x**) may be transformed to a lower *m*-dimensional space
without losing essential intrinsic information. This can be done
by projecting the input vectors onto the *m*-dimensional
subspace spanned by the extracted orthogonal vectors, **u***i*,
according to the inner products **x**T**u***i*.
Since *m* is smaller than *n*, the data undergoes a
dimensionality reduction. This, in turn, makes subsequent processing
of the data (e.g., clustering or classification) much easier to
handle.

The following is an outline for a direct optimization-based
method for determining the **u***i*
vectors. Let **x** ∈ R*n* be an input vector generated
according to a zero-mean probability distribution *p*(**x**).
Let **u** denote a vector in R*n*,
onto which the input vectors are to be projected. The projection
**x**T**u**
is the linear sum of *n* zero-mean random variables, which
is itself a zero-mean random variable. Here, the objective is
to find the solution(s) **u***
which maximizes ⟨(**x**T**u**)2⟩, the variance of the
projection **x**T**u**
with respect to *p*(**x**), subject to ||**u**|| =
1. In other words, we are interested in finding the maxima **w***
of the criterion function

$$J(\mathbf{w}) = \frac{\left\langle \left( \mathbf{x}^{\mathrm{T}}\mathbf{w} \right)^{2} \right\rangle}{\left\| \mathbf{w} \right\|^{2}} \qquad (3.3.15)$$

from which the unity norm solution(s) **u***
can be computed as **u*** = **w***/||**w***||, with ||**w***|| ≠ 0.
Now, by noting that ⟨(**x**T**w**)2⟩ = **w**T⟨**xx**T⟩**w**, and recalling
that ⟨**xx**T⟩ is the autocorrelation matrix **C**,
Equation (3.3.15) may be expressed as

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\mathrm{T}}\mathbf{C}\mathbf{w}}{\mathbf{w}^{\mathrm{T}}\mathbf{w}} \qquad (3.3.16)$$

The extreme points of *J*(**w**) are the
solutions to ∇*J*(**w**) = **0**, which gives

$$\nabla J(\mathbf{w}) = \frac{2}{\mathbf{w}^{\mathrm{T}}\mathbf{w}} \left[ \mathbf{C}\mathbf{w} - \frac{\mathbf{w}^{\mathrm{T}}\mathbf{C}\mathbf{w}}{\mathbf{w}^{\mathrm{T}}\mathbf{w}}\, \mathbf{w} \right] = \mathbf{0}$$

or

$$\mathbf{C}\mathbf{w} = J(\mathbf{w})\, \mathbf{w} \qquad (3.3.17)$$

The solutions to Equation (3.3.17) are **w** = *a***c**(*i*),
*i* = 1, 2, ..., *n*, *a* ∈ R.
In other words, the maxima **w***
of *J*(**w**) must point in the same or opposite direction
of one of the eigenvectors of **C**. Upon careful examination
of the Hessian of *J*(**w**) [as in Section 4.3.2], we
find that the only maximum exists at **w*** = *a***c**(1)
for some finite real-valued *a* (in this case, *J*(**w***) = λ1).
Therefore, the variance of the projection **x**T**u**
is maximized for **u** = **u**1 = **w***/||**w***|| = **c**(1).
Next, we repeat the above maximization of *J*(**w**)
in Equation (3.3.15), but with the additional requirement that
the vector **w** be orthogonal to **c**(1).
Here, it can be readily seen that the maximum of *J*(**w**)
is equal to λ2, and occurs
at **w*** = *a***c**(2).
Thus, **u**2 = **c**(2).
Similarly, the solution **u**3 = **c**(3)
maximizes *J* under the constraint that **u**3
be orthogonal to **u**1
and **u**2, simultaneously.
Continuing this way, we arrive at the *m* principal directions
**u**1 through **u***m*.
Again, these vectors are ordered so that **u**1
points in the direction of the maximum data variance, and the
second vector **u**2
points in the direction of maximum variance in the subspace orthogonal
to **u**1, and so on.
Now, the projections **x**T**u***i*,
*i* = 1, 2, ..., *m*, are called
the principal components of the data; these projections are equivalent
to the ones obtained by the classical Karhunen-Loève transformation
of statistics (Karhunen, 1947; Loève, 1963). Note that
the previous Hebbian rules discussed in the single linear unit
setting all maximize ⟨*y*2⟩, the output signal
variance, and hence they extract the first principal component
of the zero-mean data. If the data has a non-zero mean, then
we subtract the mean from it before extracting the principal components.
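The direct (non-network) computation outlined above can be summarized in a few lines of NumPy: subtract the mean, form **C**, take the *m* leading eigenvectors, and project. This is an illustrative sketch; names are not from the text.

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X onto the m leading eigenvectors of C (zero-mean data)."""
    X = X - X.mean(axis=0)                  # subtract the mean, as noted above
    C = X.T @ X / len(X)                    # autocorrelation (here, covariance) matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :m]             # u_1, ..., u_m: leading principal directions
    return X @ U, U                         # principal components x^T u_i and the directions
```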

PCA in a Network of Interacting Units

Here, an *m*-output network is desired which
is capable of incrementally and efficiently computing the first
*m* principal components of a given set of vectors in R*n*.
Consider a network of *m* linear units, *m* < *n*.
Let **w***i*
be the weight vector of the *i*th unit. Oja (1989) extended
his learning rule in Equation (3.3.6) to the *m*-unit network
according to (we will drop the *k* superscript here for clarity):

$$\Delta w_{ij} = \rho\, y_{i} \left( x_{j} - \sum_{k=1}^{m} y_{k}\, w_{kj} \right) \qquad (3.3.18)$$

where *w**ij*
is the *j*th weight for unit *i* and *y**i*
is its output. Another rule proposed by Sanger (1989) is given
by:

$$\Delta w_{ij} = \rho\, y_{i} \left( x_{j} - \sum_{k=1}^{i} y_{k}\, w_{kj} \right) \qquad (3.3.19)$$

Equations (3.3.18) and (3.3.19) require communication
between the units in the network during learning. Equation (3.3.18)
requires the *j*th input signal *x**j*,
as well as the output signals *y**k*
of all units, to be available when adjusting the *j*th weight
of unit *i*. Each signal *y**k*
is modulated by the *j*th weight of unit *k* and is
fed back as an inhibitory input to unit *i*. Thus, unit
*i* can be viewed as employing the original Hebbian rule
of Equation (3.3.1) to update its *j*th weight, but with
an effective input signal whose *j*th component is given
by the term inside parentheses in Equation (3.3.18). Sanger's
rule employs similar feedback except that the *i*th unit
only receives modulated output signals generated by units with
index *k*, where *k* ≤ *i*.

The above two rules are identical to Oja's rule
in Equation (3.3.6) for *m* = 1. For *m* > 1,
they only differ by the upper limit on the summation. Both rules
converge to **w***i*'s
that are orthogonal. Oja's rule does not generally find the eigenvector
directions of **C**. However, in this case the *m* weight
vectors converge to span the same subspace as the first *m*
eigenvectors of **C**. Here, the weight vectors depend on
the initial conditions and on the order of presentation of the
input data, and therefore differ individually from trial to trial.
On the other hand, Sanger's rule is insensitive to initial conditions
and to the order of presentation of input data. It converges
to **w***i* = ±**c**(*i*), in order, where the first unit (*i* = 1)
has **w**1 = ±**c**(1), the eigenvector of **C** with the largest
eigenvalue. Some insights into the convergence
behavior of this rule/PCA net can be gained by exploring the analysis
in Problems 3.3.5 and 3.3.6. Additional analysis is given in
Section 4.4.
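A sketch of Sanger's rule (3.3.19) for an *m*-unit linear network is given below (Python/NumPy). The vectorized form using a lower-triangular mask is an implementation convenience, not part of the original formulation.

```python
import numpy as np

def train_sanger(X, m, rho=0.01, epochs=200, seed=0):
    """Train m linear units with Sanger's rule (3.3.19); rows of W are the weight vectors w_i."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(scale=0.1, size=(m, n))
    for _ in range(epochs):
        for x in X:
            y = W @ x                            # unit outputs y_i = w_i^T x
            # feedback term: row i equals sum_{k<=i} y_i * y_k * w_k (lower-triangular mask)
            fb = np.tril(np.outer(y, y)) @ W
            W += rho * (np.outer(y, x) - fb)     # Delta w_ij = rho*y_i*(x_j - sum_{k<=i} y_k w_kj)
    return W                                     # rows approach +/- c(1), ..., +/- c(m), in order
```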

Sanger (1989) gave a convergence proof of the above
PCA net employing Equation (3.3.19) under the assumption of a
time-varying learning rate ρ*k*, with ρ*k* → 0 as *k* → ∞ and Σ*k* ρ*k* = ∞. The
significance of this proof is that it guarantees the dynamics
in Equation (3.3.19) to find the first *m* eigenvectors of
the auto-correlation matrix **C** (assuming that the eigenvalues
λ1 through λ*m*
are distinct). An equally important property of Sanger's PCA
net is that there is no need to compute the full correlation matrix
**C**. Rather, the first eigenvectors of **C** are computed
by the net adaptively and directly from the input vectors. This
property can lead to significant savings in computational effort
if the dimension of the input vectors is very large compared to
the desired number of principal components to be extracted. For
an interesting application of PCA to image coding/compression,
the reader is referred to Gonzalez and Wintz (1987) and Sanger
(1989). See also Section 5.3.6.

PCA in a Single-Layer Network with Adaptive Lateral Connections

Another approach for PCA is to use the single layer
network with *m* linear units and trainable lateral connections
between units, as shown in Figure 3.3.5 (Rubner and Tavan, 1989).

The lateral connections *u**ij*
are present from unit *j* to unit *i* only if *i*
> *j*. The weights *w**ij*
connecting the inputs **x***k*
to the units are updated according to the
simple normalized Hebbian learning rule:

$$\mathbf{w}_{i}^{k+1} = \frac{\mathbf{w}_{i}^{k} + \rho\, y_{i}^{k}\, \mathbf{x}^{k}}{\left\| \mathbf{w}_{i}^{k} + \rho\, y_{i}^{k}\, \mathbf{x}^{k} \right\|} \qquad (3.3.20)$$

where **w***i* is the weight vector of unit *i*.

On the other hand, the lateral weights employ anti-Hebbian
learning, in the form:

$$\Delta u_{ij} = -\beta\, y_{i}\, y_{j} \qquad (3.3.21)$$

where β > 0. Note that the first unit
with index 1 extracts **c**(1).
The second unit tries to do the same except that the lateral
connection *u*21
from unit 1 decorrelates *y*2 from *y*1,
preventing **w**2 from approaching **c**(1);
hence, unit 2 is forced
to settle for the second principal direction, namely **c**(2),
and so on. Thus, this network extracts the first *m* principal
data directions in descending order, just as Sanger's network.
Since the principal directions are orthogonal, the correlations
⟨*y**i* *y**j*⟩
approach zero as convergence is approached, and thus the *u**ij*
weights are driven to zero.
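The following sketch illustrates the scheme: normalized Hebbian learning on the feedforward weights as in Equation (3.3.20), and anti-Hebbian learning on the lower-triangular lateral weights as in Equation (3.3.21). The learning-rate values and the sequential computation of the outputs are assumptions of the sketch.

```python
import numpy as np

def train_rubner_tavan(X, m, rho=0.05, beta=0.1, epochs=200, seed=0):
    """PCA net with adaptive lateral connections (Rubner and Tavan, 1989) -- a rough sketch."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(size=(m, n))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    U = np.zeros((m, m))                          # lateral weights u_ij, present only for i > j
    for _ in range(epochs):
        for x in X:
            y = np.zeros(m)
            for i in range(m):                    # outputs computed in order of unit index
                y[i] = W[i] @ x + U[i, :i] @ y[:i]
            for i in range(m):
                W[i] += rho * y[i] * x            # Hebbian step ...
                W[i] /= np.linalg.norm(W[i])      # ... followed by normalization (3.3.20)
                U[i, :i] -= beta * y[i] * y[:i]   # anti-Hebbian lateral update (3.3.21)
    return W, U                                   # rows of W -> principal directions; U -> 0
```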

Optimal PCA mappings based on more complex statistical
criteria are also possible if nonlinear units are used (Oja, 1991;
Taylor and Coombes, 1993). Here, the extracted principal components
can be thought of as the eigenvectors of the matrix of higher-order
statistics which is a generalization of the second order statistics
matrix (correlation matrix **C**). The nonlinearities implicitly
introduce higher-order moments into the optimal solution.

Two natural ways of introducing nonlinearities into
the PCA net are via higher-order units or via units with nonlinear
activations. In order to see how higher-order units lead to higher-order
statistics PCA, consider the case of a simple network of a single
quadratic unit of *n*-inputs *x**i*,
*i* = 1, 2, ..., *n*. The
input/output relation for this quadratic unit is given by:

$$y = \sum_{i=1}^{n} w_{i}\, x_{i} + \sum_{i=1}^{n} \sum_{j=i}^{n} w_{ij}\, x_{i}\, x_{j} \qquad (3.3.22)$$

Another way of interpreting Equation (3.3.22) is
to write it in the form of a linear unit, such as

$$y = \mathbf{w}^{\mathrm{T}}\mathbf{z} \qquad (3.3.23)$$

where

$$\mathbf{z} = \left[ x_{1} \;\; x_{2} \;\; \cdots \;\; x_{n} \;\; x_{1}^{2} \;\; x_{1}x_{2} \;\; \cdots \;\; x_{1}x_{n} \;\; x_{2}^{2} \;\; x_{2}x_{3} \;\; \cdots \;\; x_{n}^{2} \right]^{\mathrm{T}}$$

and

$$\mathbf{w} = \left[ w_{1} \;\; w_{2} \;\; \cdots \;\; w_{n} \;\; w_{11} \;\; w_{12} \;\; \cdots \;\; w_{1n} \;\; w_{22} \;\; w_{23} \;\; \cdots \;\; w_{nn} \right]^{\mathrm{T}}$$

is a vector of real valued parameters. Therefore,
the *n*-input quadratic unit is equivalent to a linear unit
receiving its inputs from a fixed preprocessing
layer. This preprocessing layer transforms the original input
vectors **x***k*
into higher dimensional vectors **z***k*,
as in a QTG [refer to Section (1.1.2)].

Now, if stable Hebbian learning is used to adapt
the **w** parameter vector, this vector will stabilize at the
principal eigenvector of the correlation matrix **C***z* = ⟨**zz**T⟩.
This matrix can be written in terms of the inputs *x**i*
as:

$$\mathbf{C}_{z} = \left\langle \mathbf{z}\mathbf{z}^{\mathrm{T}} \right\rangle = \begin{bmatrix} \left\langle \mathbf{x}\mathbf{x}^{\mathrm{T}} \right\rangle & \left\langle \mathbf{x}\mathbf{q}^{\mathrm{T}} \right\rangle \\ \left\langle \mathbf{q}\mathbf{x}^{\mathrm{T}} \right\rangle & \left\langle \mathbf{q}\mathbf{q}^{\mathrm{T}} \right\rangle \end{bmatrix} = \begin{bmatrix} \mathbf{C}_{1} & \mathbf{C}_{2} \\ \mathbf{C}_{3} & \mathbf{C}_{4} \end{bmatrix} \qquad (3.3.24)$$

where **q** denotes the subvector of quadratic terms *x**i* *x**j* in **z**.
Note that **C**1 = ⟨**xx**T⟩,
as in Equation (3.3.4), and that **C**2 = ⟨**xq**T⟩ = **C**3T.
The matrices **C**2
and **C**3 reflect
third order statistics and **C**4
reflects fourth order statistics. Yet higher-order statistics
can be realized by allowing for higher-order terms of the *x**i*'s
in **z**.
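As a small illustration of this preprocessing-layer view, the sketch below expands an input vector **x** into the vector **z** used in Equation (3.3.23) (linear terms followed by the quadratic terms *x**i* *x**j*, *j* ≥ *i*); applying a stable Hebbian rule such as Oja's rule to **z** then extracts the principal eigenvector of **C***z*. Names are illustrative.

```python
import numpy as np

def quadratic_expand(x):
    """Map x = (x_1, ..., x_n) to z = (x_1, ..., x_n, x_1^2, x_1 x_2, ..., x_n^2)."""
    n = len(x)
    quad = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, quad])

# Illustrative use: Z = np.array([quadratic_expand(x) for x in X]); training on the rows
# of Z as in the earlier Oja sketch extracts the principal eigenvector of C_z = <z z^T>,
# which reflects third- and fourth-order statistics of the original inputs x_i.
```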

Extraction of higher-order statistics is also possible
if units with nonlinear activation functions are used (e.g., sigmoidal
activation units). This can be seen by employing Taylor series
expansion of the output of a nonlinear unit *y* = *f *(**w**T**x**)
at **w**T**x** = 0.
Here, it is assumed that all derivatives of *f* exist.
This expansion allows us to write this unit's output as

$$y = f\!\left( \mathbf{w}^{\mathrm{T}}\mathbf{x} \right) = f(0) + f'(0)\, \mathbf{w}^{\mathrm{T}}\mathbf{x} + \frac{f''(0)}{2!} \left( \mathbf{w}^{\mathrm{T}}\mathbf{x} \right)^{2} + \frac{f'''(0)}{3!} \left( \mathbf{w}^{\mathrm{T}}\mathbf{x} \right)^{3} + \cdots \qquad (3.3.25)$$

which may be interpreted as the output of a high-order
(polynomial) unit (see Problem 3.3.7). Therefore, higher-order
principal component extraction is expected when Hebbian-type learning
is applied to this unit. For further exploration in nonlinear
PCA, the reader may consult Karhunen (1994), Sudjianto and Hassoun
(1994), and Xu (1994). Also, see Section 5.3.6.