3.3 Unsupervised Learning

In unsupervised learning, there is no teacher signal. We are given a training set {xi; i = 1, 2, ..., m} of unlabeled vectors in Rn. The objective is to categorize or discover features or regularities in the training data. In some cases, the xi's must be mapped into a lower-dimensional set of patterns such that any topological relations existing among the xi's are preserved among the new set of patterns. Normally, the success of unsupervised learning hinges on an appropriately designed network which embodies a task-independent criterion of the quality of the representation that the network is required to learn; the weights of the network are then optimized with respect to this criterion.

Our interest here is in training networks of simple units to perform the above tasks. In the remainder of this chapter, some basic unsupervised learning rules for a single unit and for simple networks are introduced. The following three classes of unsupervised rules are considered: Hebbian learning, competitive learning, and self-organizing feature map learning. Hebbian learning is treated first. Then, competitive learning and self-organizing feature map learning are covered in Sections 3.4 and 3.5, respectively.

3.3.1 Hebbian Learning

The rules considered in this section are motivated by the classical Hebbian synaptic modification hypothesis (Hebb, 1949). Hebb suggested that biological synaptic efficacies (w) change in proportion to the correlation between the firing of the pre- and post-synaptic neurons, x and y, respectively. This may be stated formally as (Stent, 1973; Changeux and Danchin, 1976)

Δw = η y x        (3.3.1)

where η > 0 is the learning rate and the unit's output is y = wTx.

Let us now assume that the input vectors are drawn from an arbitrary probability distribution p(x). Let us also assume that the network being trained consists of a single unit. At each time k, we will present a vector x, randomly drawn from p(x), to this unit. We will employ the Hebbian rule in Equation (3.3.1) to update the weight vector w. The expected weight change can be evaluated by averaging Equation (3.3.1) over all inputs x. This gives

E[Δw] = η E[y x] = η E[(x xT) w]        (3.3.2)

or, assuming x and w are statistically independent,

E[Δw] = η E[x xT] w = η C w        (3.3.3)

Since, at equilibrium, E[Δw] = 0, Equation (3.3.3) leads to Cw* = 0, and thus w* = 0 is the only equilibrium state (provided that C is nonsingular). Here, C = E[xxT] is known as the autocorrelation matrix and is given by

C = E[x xT];  that is, C is the n × n matrix whose (i, j) element is E[xi xj]        (3.3.4)

The terms on the main diagonal of C are the mean squares of the input components, and the off-diagonal terms are the cross-correlations among the input components. C is a real, symmetric (hence Hermitian) matrix. Thus, its eigenvalues λi, i = 1, 2, ..., n, are real and nonnegative, and it has orthogonal eigenvectors c(i). From the definition of an eigenvector, each c(i) satisfies the relation Cc(i) = λi c(i).

It can be shown that the solution w* = 0 is not stable (see Section 4.3.1). Therefore, Equation (3.3.3) is unstable, and it drives w to infinite magnitude, with a direction parallel to that of the eigenvector of C with the largest eigenvalue. It will be shown in Section 4.3.1 that this learning rule tends to maximize the mean square of y, E[y²]. In other words, this rule is driven to maximize the variance of the output of a linear unit, assuming that y has zero mean. A zero-mean y can be achieved if the unit inputs are independent random variables with zero mean. One way to prevent the divergence of the Hebbian learning rule in Equation (3.3.1) is to normalize ||w|| to 1 after each learning step (von der Malsburg, 1973; Rubner and Tavan, 1989). This leads to the update rule:

w^(k+1) = (w^k + η y^k x^k) / ||w^k + η y^k x^k||        (3.3.5)
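The following minimal Python sketch (the data set, learning rate, and iteration count are illustrative choices, not values from the text) contrasts the plain Hebbian rule of Equation (3.3.1), whose weight norm grows without bound, with the normalized rule of Equation (3.3.5):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, steps = 15, 0.02, 3000
X = rng.uniform(-0.5, 0.5, size=(60, n))     # illustrative zero-mean training data

w_plain = rng.normal(scale=0.1, size=n)      # trained with the plain Hebbian rule (3.3.1)
w_norm = w_plain.copy()                      # trained with the normalized rule (3.3.5)

for k in range(steps):
    x = X[k % len(X)]                        # cyclic presentation of the training set
    y = w_plain @ x
    w_plain += eta * y * x                   # (3.3.1): the norm grows without bound
    y = w_norm @ x
    w_norm = (w_norm + eta * y * x) / np.linalg.norm(w_norm + eta * y * x)   # (3.3.5)

print("plain Hebbian ||w|| =", np.linalg.norm(w_plain))
print("normalized    ||w|| =", np.linalg.norm(w_norm))   # stays equal to 1
```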

In the following, we briefly describe additional stable Hebbian-type learning rules. The detailed analysis of these rules is deferred to Chapter 4.

3.3.2 Oja's Rule

An alternative approach for stabilizing the Hebbian rule is to modify it by adding a weight-decay term proportional to y² (Oja, 1982). This results in Oja's rule:

Δw = η y (x − y w)        (3.3.6)

Oja's rule converges in the mean to a state w* which maximizes the mean value of y², E[y²], subject to the constraint ||w|| = 1. It can also be shown that the solution w* is the principal eigenvector of C (the one with the largest corresponding eigenvalue) (Oja, 1982; Oja and Karhunen, 1985). The analysis of this rule is covered in Chapter 4.
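A minimal sketch of Oja's rule in Equation (3.3.6) for a single linear unit follows; the data generator and parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta = 15, 0.02
X = rng.uniform(-0.5, 0.5, size=(60, n))     # illustrative zero-mean training data

w = rng.normal(scale=0.1, size=n)
for k in range(5000):
    x = X[k % len(X)]                        # cyclic presentation
    y = w @ x
    w += eta * y * (x - y * w)               # Oja's rule (3.3.6)

print("||w|| =", np.linalg.norm(w))          # converges toward 1
```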

3.3.3 Yuille et al. Rule

Other modifications to the Hebbian rule have been proposed to prevent divergence. Yuille et al. (1989) proposed the rule

Δw = η (y x − ||w||² w)        (3.3.7)

It can be shown (see Problem 3.3.4) that, in an average sense, the weight vector update is given by gradient descent on the criterion function:

J(w) = −(1/2) wTCw + (1/4) ||w||⁴        (3.3.8)

In Section 4.3.4, it is shown that Equation (3.3.7) converges to a vector w* that points in the same (or opposite) direction as the principal eigenvector of C and whose norm is equal to the square root of the largest eigenvalue of C.

Example 3.3.1: In this example, the convergence behavior of Oja's and Yuille et al. learning rules is demonstrated on zero-mean random data. Figures 3.3.1 through 3.3.4 show two sets of simulation results for the evolution of the weight vector of a single linear unit trained with Oja's rule and with the Yuille et al. rule. Figures 3.3.1 and 3.3.2 show the behavior of the norm and the direction (cosine of the angle θ between w and the eigenvector of C with the largest eigenvalue) of the weight vector w as a function of the iteration number k, respectively. In this simulation, the data (training set) consists of sixty 15-dimensional vectors whose components are generated randomly and independently from a uniform distribution in the range [-0.5, +0.5]. In this particular case, the data set leads to a correlation matrix whose two largest eigenvalues are 0.1578 and 0.1515, respectively. During learning, the training vectors are presented in a fixed, cyclic order, and a learning rate of η = 0.02 is used. As can be seen from Figure 3.3.1, the length of wk (practically) converges after 3,000 iterations. Here, the Oja and Yuille et al. rules converge to the theoretically predicted values of 1 and √λ1 = √0.1578 ≈ 0.397, respectively. The two rules exhibit an almost identical evolution of the weight vector direction, as depicted in Figure 3.3.2. The figure shows an initial low overlap between the starting weight vector and the principal eigenvector of the data correlation matrix. This overlap increases slowly at first, but then increases rapidly towards 1. As the direction of w approaches that of the principal eigenvector (i.e., as cos θ approaches 1), the convergence becomes slow again. Note that Figure 3.3.2 only shows the evolution of cos θ over the first 3,600 iterations. Beyond this point, the convergence becomes very slow. This is due to the uniform nature of the data, which does not allow the principal eigenvector to dominate all other eigenvectors. Thus, a strong competition emerges among several eigenvectors, each attempting to align wk along its own direction; the end result is the slow convergence of cos θ.

The second set of simulations involves a training set of sixty 15-dimensional vectors drawn randomly from a normal distribution with zero mean and a variance of 1. This data leads to a correlation matrix C whose principal eigenvector dominates more strongly than in the previous data set. Here, the largest two eigenvalues of C are 2.1172 and 1.6299, respectively. Again, the Oja and Yuille et al. rules are used, but with a learning rate of 0.01. Figure 3.3.3 shows a better-behaved convergence for Oja's rule as compared to the Yuille et al. rule. The latter rule exhibits an oscillatory behavior in ||wk|| about its theoretical asymptotic value, √λ1 = √2.1172 ≈ 1.455. As for the convergence of the direction of w, Figure 3.3.4 shows comparable behavior for both rules. Here, the existence of a dominating eigenvector for C is responsible for the relatively faster convergence of cos θ, as compared to the earlier simulation in Figure 3.3.2. Finally, we note that the oscillatory behavior in Figures 3.3.3 and 3.3.4 can be significantly reduced by resorting to smaller constant learning rates or by using a decaying learning rate (e.g., one of the form ηk = η0/k). This, however, leads to slower convergence of both ||wk|| and cos θ.
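The first simulation above can be approximated with a sketch such as the one below, which trains one unit with Oja's rule and one with the Yuille et al. rule on sixty 15-dimensional uniform vectors with η = 0.02 (as in Example 3.3.1) and compares ||w|| and cos θ with their predicted values. Since the random data, seed, and initial weights are not those used in the text, the printed numbers will only approximate the values reported in the figures.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, eta, steps = 15, 60, 0.02, 4000
X = rng.uniform(-0.5, 0.5, size=(m, n))       # sixty 15-dimensional uniform vectors

C = X.T @ X / m                               # sample autocorrelation matrix
lam, V = np.linalg.eigh(C)
c1, lam1 = V[:, -1], lam[-1]                  # principal eigenvector and largest eigenvalue

w_oja = rng.normal(scale=0.1, size=n)
w_yui = w_oja.copy()
for k in range(steps):
    x = X[k % m]                              # fixed cyclic presentation
    y = w_oja @ x
    w_oja += eta * y * (x - y * w_oja)        # Oja's rule (3.3.6)
    y = w_yui @ x
    w_yui += eta * (y * x - (w_yui @ w_yui) * w_yui)   # Yuille et al. rule (3.3.7)

print("target norms    : 1 and", np.sqrt(lam1))
print("final norms     :", np.linalg.norm(w_oja), np.linalg.norm(w_yui))
print("final cos(theta):", abs(w_oja @ c1) / np.linalg.norm(w_oja),
      abs(w_yui @ c1) / np.linalg.norm(w_yui))
```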

Figure 3.3.1. Weight vector magnitude vs. time for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.02. The training set consists of sixty 15-dimensional real-valued vectors whose components are generated according to a uniform random distribution in the range [-0.5, +0.5].

Figure 3.3.2. Evolution of the cosine of the angle between the weight vector and the principal eigenvector of the correlation matrix for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.02 (the dashed curve is hard to see because it overlaps with the solid curve). The training set is the same as for Figure 3.3.1.

Figure 3.3.3. Weight vector magnitude vs. time for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.01. The training set consists of sixty 15-dimensional real-valued vectors whose components are generated according to N(0, 1).

Figure 3.3.4. Evolution of the cosine of the angle between the weight vector and the principal eigenvector of the correlation matrix for Oja's rule (solid curve) and for the Yuille et al. rule (dashed curve) with η = 0.01. The training set is the same as for Figure 3.3.3.

3.3.4 Linsker's Rule

Linsker (1986, 1988) proposed and studied the general unsupervised learning rule:

(3.3.9)

subject to the bounding constraints w⁻ ≤ wi ≤ w⁺ on the weights, with y = wTx. The average weight change in Equation (3.3.9) is given by

(3.3.10)

where . If we set a > 0, bi = a for i = 1, 2, ..., n, and d = 0 in Equation (3.3.10), we get

(3.3.11)

where 1 is an n-dimensional column vector of ones. Equation (3.3.11) gives the ith average weight change:

(3.3.12)

where Ci is the ith row of C. If we assume that the components of x are random and are drawn from the same probability distribution with mean μ, then E[xi] = μ for all i, and Equation (3.3.11) reduces to

(3.3.13)

where was used. Here, it can be easily shown that Equation (3.3.13) has the following associated criterion function:

(3.3.14)

Therefore, by noting that wTCw is the mean square of the unit's output activity, E[y²], the simplified form of E[Δw] in Equation (3.3.13) corresponds to maximizing E[y²] subject to a constraint on the sum of the weights. It can be shown that this rule is unstable, but with the restriction w⁻ ≤ wi ≤ w⁺, the final state will be clamped at the boundary values wi = w⁻ or wi = w⁺. If a is large enough, w⁻ = −1, w⁺ = +1, and n is odd, then the above rule converges to a weight vector with some weights equal to w⁺ and the remaining weights equal to w⁻. The weight vector configuration is such that E[y²] is maximized. For the n-even case, one of the weights at w⁻ will be pushed towards zero so that the constraint on the weight sum is maintained.

3.3.5 Hebbian Learning in a Network Setting: Principal Component Analysis (PCA)

Hebbian learning can lead to more interesting computational capabilities when applied to a network of units. In this section, we apply unsupervised Hebbian learning in a simple network setting to extract the m principal directions of a given set of data (i.e., the leading eigenvector directions of the input vectors' auto-correlation matrix).

Amari (1977a) and later Linsker (1988) pointed out that principal component analysis (PCA) is equivalent to maximizing the information content in the outputs of a network of linear units. The aim of PCA is to extract m normalized orthogonal vectors, ui, i = 1, 2, ..., m, in the input space that account for as much of the data's variance as possible. Subsequently, the n-dimensional input data (vectors x) may be transformed to a lower m-dimensional space without losing essential intrinsic information. This can be done by projecting the input vectors onto the m-dimensional subspace spanned by the extracted orthogonal vectors, ui, according to the inner products xTui. Since m is smaller than n, the data undergoes a dimensionality reduction. This, in turn, makes subsequent processing of the data (e.g., clustering or classification) much easier to handle.

The following is an outline of a direct optimization-based method for determining the ui vectors. Let x ∈ Rn be an input vector generated according to a zero-mean probability distribution p(x). Let u denote a vector in Rn onto which the input vectors are to be projected. The projection xTu is a linear sum of n zero-mean random variables, which is itself a zero-mean random variable. Here, the objective is to find the solution(s) u* which maximize E[(xTu)²], the variance of the projection xTu with respect to p(x), subject to ||u|| = 1. In other words, we are interested in finding the maxima w* of the criterion function

J(w) = E[(xTw)²] / (wTw)        (3.3.15)

from which the unit-norm solution(s) u* can be computed as u* = w*/||w*||, with ||w*|| ≠ 0. Now, by noting that E[(xTw)²] = E[(wTx)(xTw)] = wT E[xxT] w, and recalling that E[xxT] is the autocorrelation matrix C, Equation (3.3.15) may be expressed as

J(w) = (wTCw) / (wTw)        (3.3.16)

The extreme points of J(w) are the solutions of ∇J(w) = 0, which gives

∇J(w) = (2 / (wTw)) [Cw − J(w) w] = 0

or

Cw = J(w) w        (3.3.17)

The solutions to Equation (3.3.17) are w = a c(i), i = 1, 2, ..., n, a ∈ R. In other words, the maxima w* of J(w) must point in the same or opposite direction of one of the eigenvectors of C. Upon careful examination of the Hessian of J(w) [as in Section 4.3.2], we find that the only maximum exists at w* = a c(1) for some finite real-valued a (in this case, J(w*) = λ1). Therefore, the variance of the projection xTu is maximized for u = u1 = w*/||w*|| = c(1). Next, we repeat the above maximization of J(w) in Equation (3.3.15), but with the additional requirement that the vector w be orthogonal to c(1). Here, it can be readily seen that the maximum of J(w) is equal to λ2 and occurs at w* = a c(2). Thus, u2 = c(2). Similarly, the solution u3 = c(3) maximizes J under the constraint that u3 be orthogonal to both u1 and u2. Continuing in this way, we arrive at the m principal directions u1 through um. Again, these vectors are ordered so that u1 points in the direction of maximum data variance, the second vector u2 points in the direction of maximum variance in the subspace orthogonal to u1, and so on. Now, the projections xTui, i = 1, 2, ..., m, are called the principal components of the data; these projections are equivalent to the ones obtained by the classical Karhunen-Loève transform of statistics (Karhunen, 1947; Loève, 1963). Note that the Hebbian rules discussed earlier in the single-linear-unit setting all maximize E[y²], the output signal variance, and hence they extract the first principal component of zero-mean data. If the data has a non-zero mean, the mean is subtracted from it before extracting the principal components.
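For reference, the direct (non-iterative) route described above reduces to an eigendecomposition of the sample autocorrelation matrix; a short sketch with an illustrative data matrix is:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))                 # illustrative data, n = 10
X -= X.mean(axis=0)                            # remove the sample mean first
C = X.T @ X / len(X)                           # sample autocorrelation matrix

lam, V = np.linalg.eigh(C)                     # eigenvalues ascending, orthonormal eigenvectors
order = np.argsort(lam)[::-1]                  # reorder by decreasing eigenvalue
lam, U = lam[order], V[:, order]               # columns of U are u_1 = c(1), u_2 = c(2), ...

m = 3
principal_components = X @ U[:, :m]            # projections x^T u_i, i = 1, ..., m
print("variances along u_1..u_m:", lam[:m])
print("reduced data shape:", principal_components.shape)
```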

PCA in a Network of Interacting Units

Here, an m-output network is desired which is capable of incrementally and efficiently computing the first m principal components of a given set of vectors in Rn. Consider a network of m linear units, m < n. Let wi be the weight vector of the ith unit. Oja (1989) extended his learning rule in Equation (3.3.6) to the m-unit network according to (we will drop the k superscript here for clarity):

Δwij = η yi ( xj − Σ_{k=1}^{m} yk wkj ),   i = 1, 2, ..., m;  j = 1, 2, ..., n        (3.3.18)

where wij is the jth weight for unit i and yi is its output. Another rule proposed by Sanger (1989) is given by:

Δwij = η yi ( xj − Σ_{k=1}^{i} yk wkj )        (3.3.19)

Equations (3.3.18) and (3.3.19) require communication between the units in the network during learning. Equation (3.3.18) requires the jth input signal xj, as well as the output signals yk of all units, to be available when adjusting the jth weight of unit i. Each signal yk is modulated by the jth weight of unit k and is fed back as an inhibitory input to unit i. Thus, unit i can be viewed as employing the original Hebbian rule of Equation (3.3.1) to update its jth weight, but with an effective input signal whose jth component is given by the term inside the parentheses in Equation (3.3.18). Sanger's rule employs similar feedback, except that the ith unit only receives modulated output signals generated by units with index k ≤ i.

The above two rules are identical to Oja's rule in Equation (3.3.6) for m = 1. For m > 1, they differ only in the upper limit on the summation. Both rules converge to wi's that are orthogonal. Oja's rule does not generally find the eigenvector directions of C; rather, the m weight vectors converge so as to span the same subspace as the first m eigenvectors of C. Here, the weight vectors depend on the initial conditions and on the order of presentation of the input data, and therefore differ individually from trial to trial. On the other hand, Sanger's rule is insensitive to initial conditions and to the order of presentation of the input data. It converges to wi = ±c(i), in order, where the first unit (i = 1) extracts the principal eigenvector c(1). Some insights into the convergence behavior of this rule/PCA net can be gained by exploring the analysis in Problems 3.3.5 and 3.3.6. Additional analysis is given in Section 4.4.

Sanger (1989) gave a convergence proof for the above PCA net employing Equation (3.3.19), under the assumptions that the learning rate ηk approaches zero as k → ∞, with Σk ηk = ∞ and Σk (ηk)² < ∞. The significance of this proof is that it guarantees the dynamics in Equation (3.3.19) to find the first m eigenvectors of the autocorrelation matrix C (assuming that the eigenvalues λ1 through λm are distinct). An equally important property of Sanger's PCA net is that there is no need to compute the full correlation matrix C. Rather, the first m eigenvectors of C are computed by the net adaptively and directly from the input vectors. This property can lead to significant savings in computational effort if the dimension of the input vectors is very large compared to the desired number of principal components to be extracted. For an interesting application of PCA to image coding/compression, the reader is referred to Gonzalez and Wintz (1987) and Sanger (1989). See also Section 5.3.6.
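A compact sketch of a PCA net trained with Sanger's rule (3.3.19) is given below; the explicit loop over units mirrors the per-weight form of the equation, and the data set and learning parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, eta, epochs = 10, 3, 0.01, 200
X = rng.normal(size=(400, n)) * np.linspace(3.0, 1.0, n)   # zero-mean data, distinct variances
X -= X.mean(axis=0)

W = rng.normal(scale=0.1, size=(m, n))      # row i holds the weight vector w_i

for _ in range(epochs):
    for x in X:
        y = W @ x                           # unit outputs y_i = w_i^T x
        for i in range(m):
            # Sanger's rule (3.3.19): feedback only from units k <= i
            feedback = y[:i + 1] @ W[:i + 1]
            W[i] += eta * y[i] * (x - feedback)

# The rows of W should align, in order, with the leading eigenvectors of C
C = X.T @ X / len(X)
V = np.linalg.eigh(C)[1]                    # eigenvectors (eigenvalues ascending)
for i in range(m):
    overlap = abs(W[i] @ V[:, -1 - i]) / np.linalg.norm(W[i])
    print(f"unit {i + 1}: |cos| with c({i + 1}) = {overlap:.3f}")
```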

PCA in a Single-Layer Network with Adaptive Lateral Connections

Another approach for PCA is to use a single-layer network with m linear units and trainable lateral connections between the units, as shown in Figure 3.3.5 (Rubner and Tavan, 1989).


Figure 3.3.5: PCA network with adaptive lateral connections.

The lateral connections uij are present from unit j to unit i only if i > j. The weights wij connecting the inputs xk to the units are updated according to the simple normalized Hebbian learning rule:

wi^(k+1) = (wi^k + η yi x) / ||wi^k + η yi x||,   i = 1, 2, ..., m        (3.3.20)

On the other hand, the lateral weights employ anti-Hebbian learning, in the form:

Δuij = −μ yi yj        (3.3.21)

where μ > 0. Note that the first unit (i = 1) extracts c(1). The second unit tries to do the same, except that the lateral connection u21 from unit 1 inhibits the weight vector of unit 2 from approaching c(1); hence, unit 2 is forced to settle for the second principal direction, namely c(2), and so on. Thus, this network extracts the first m principal data directions in descending order, just as Sanger's network does. Since the principal directions are orthogonal, the correlations yi yj approach zero as convergence is approached, and thus the uij weights are driven to zero.
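The sketch below illustrates the lateral-connection network in the spirit of Equations (3.3.20) and (3.3.21). The output equation yi = wiTx + Σ_{j<i} uij yj used here is one common way to realize the architecture of Figure 3.3.5 and is an assumption of this sketch, as are the data set and the learning rates:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, eta, mu, epochs = 10, 3, 0.01, 0.02, 300
X = rng.normal(size=(400, n)) * np.linspace(2.0, 0.5, n)   # zero-mean data, distinct variances
X -= X.mean(axis=0)

W = rng.normal(scale=0.1, size=(m, n))     # feedforward weights w_i (rows)
U = np.zeros((m, m))                       # lateral weights u_ij, used only for j < i

for _ in range(epochs):
    for x in X:
        y = np.zeros(m)
        for i in range(m):                 # assumed output: y_i = w_i.x + sum_{j<i} u_ij y_j
            y[i] = W[i] @ x + U[i, :i] @ y[:i]
        for i in range(m):
            W[i] += eta * y[i] * x                      # Hebbian step ...
            W[i] /= np.linalg.norm(W[i])                # ... with normalization, as in (3.3.20)
            U[i, :i] -= mu * y[i] * y[:i]               # anti-Hebbian lateral update (3.3.21)

# Compare each w_i with the corresponding eigenvector of the sample correlation matrix
C = X.T @ X / len(X)
V = np.linalg.eigh(C)[1]
for i in range(m):
    print(f"unit {i + 1}: |cos| with c({i + 1}) =", abs(W[i] @ V[:, -1 - i]))
```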

3.3.6 Nonlinear PCA

PCA networks, such as those discussed above, extract principal components which provide an optimal linear mapping from the original input space to a lower-dimensional output space whose dimension is determined by the number of linear units in the network. The optimality of this mapping is with respect to the second-order statistics of the training set {xk}.

Optimal PCA mappings based on more complex statistical criteria are also possible if nonlinear units are used (Oja, 1991; Taylor and Coombes, 1993). Here, the extracted principal components can be thought of as eigenvectors of a matrix of higher-order statistics, which is a generalization of the second-order statistics matrix (the correlation matrix C). The nonlinearities implicitly introduce higher-order moments into the optimal solution.

Two natural ways of introducing nonlinearities into the PCA net are via higher-order units or via units with nonlinear activations. In order to see how higher-order units lead to higher-order-statistics PCA, consider a simple network consisting of a single quadratic unit with n inputs xi, i = 1, 2, ..., n. The input/output relation for this quadratic unit is given by:

y = Σ_{i=1}^{n} wi xi + Σ_{i=1}^{n} Σ_{j=i}^{n} wij xi xj        (3.3.22)

Another way of interpreting Equation (3.3.22) is to write it in the form of a linear unit, such as

y = wTz        (3.3.23)

where

z = [x1  x2  x3  ...  xn  x1²  x1x2  ...  x1xn  x2²  x2x3  ...  xn²]T

and

w = [w1  w2  ...  wn  w11  w12  ...  w1n  w22  w23  ...  wnn]T

is a vector of real valued parameters. Therefore, the n-input quadratic unit is equivalent to a linear unit receiving its inputs from a fixed preprocessing layer. This preprocessing layer transforms the original input vectors xk into higher dimensional vectors zk, as in a QTG [refer to Section (1.1.2)].

Now, if stable Hebbian learning is used to adapt the parameter vector w, this vector will stabilize at the principal eigenvector of the correlation matrix Cz = E[zzT]. This matrix can be written in terms of the inputs xi as the block matrix

Cz = E[zzT] = | C1  C2 |
              | C3  C4 |        (3.3.24)

Note that C1 = E[xxT], as in Equation (3.3.4), and that C3 = C2T. The matrices C2 and C3 contain third-order moments of the form E[xi xj xk] and thus reflect third-order statistics, while C4 contains fourth-order moments of the form E[xi xj xk xl] and reflects fourth-order statistics. Yet higher-order statistics can be realized by allowing for higher-order terms of the xi's in z.
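As a sketch of the quadratic-unit construction, the fragment below expands each input x into the vector z of linear and quadratic terms and then applies Oja's (stable Hebbian) rule to z, so that w settles in the principal eigen-direction of Cz rather than of C; the data set and parameters are illustrative:

```python
import numpy as np

def expand(x):
    """Expanded input z: linear terms x_i plus all quadratic terms x_i x_j, j >= i."""
    n = len(x)
    quad = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate([x, quad])

rng = np.random.default_rng(7)
n, eta, steps = 5, 0.05, 40000
X = rng.uniform(-0.5, 0.5, size=(100, n)) * np.linspace(2.0, 0.5, n)   # illustrative data

Z = np.array([expand(x) for x in X])             # fixed preprocessing layer: x -> z
Cz = Z.T @ Z / len(Z)                            # correlation matrix of the expanded inputs

w = rng.normal(scale=0.1, size=Z.shape[1])
for k in range(steps):
    z = Z[k % len(Z)]
    y = w @ z
    w += eta * y * (z - y * w)                   # stable Hebbian (Oja) learning applied to z

# The output variance approaches the largest eigenvalue of Cz, and w aligns with its eigenvector
c1 = np.linalg.eigh(Cz)[1][:, -1]
print("E[y^2]          :", np.mean((Z @ w) ** 2))
print("lambda_1(Cz)    :", np.linalg.eigvalsh(Cz)[-1])
print("|cos(w, c1(Cz))|:", abs(w @ c1) / np.linalg.norm(w))
```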

Extraction of higher-order statistics is also possible if units with nonlinear activation functions are used (e.g., units with sigmoidal activations). This can be seen by employing a Taylor series expansion of the output of a nonlinear unit, y = f(wTx), about wTx = 0. Here, it is assumed that all derivatives of f exist. This expansion allows us to write the unit's output as

y = f(0) + f'(0) (wTx) + (f''(0)/2!) (wTx)² + (f'''(0)/3!) (wTx)³ + ...        (3.3.25)

which may be interpreted as the output of a high-order (polynomial) unit (see Problem 3.3.7). Therefore, higher-order principal component extraction is expected when Hebbian-type learning is applied to this unit. For further exploration in nonlinear PCA, the reader may consult Karhunen (1994), Sudjianto and Hassoun (1994), and Xu (1994). Also, see Section 5.3.6.
