Recall the simplified stochastic reinforcement learning rule of Equation (3.2.5). The continuous-time version of this rule is given by

    dw/dt = ρ r (y − ⟨y⟩) x    (4.5.1)

from which the average learning equation is given as

    d⟨w⟩/dt = ρ ⟨r (y − ⟨y⟩) x⟩    (4.5.2)

Now, employing Equation (4.2.6), we may think of Equation (4.5.2) as implementing a gradient search on an average instantaneous criterion function, ⟨J(w)⟩, whose gradient is given by

    ∇_w ⟨J(w)⟩ = −⟨r (y − ⟨y⟩) x⟩    (4.5.3)

In Equations (4.5.1) through (4.5.3), we have ρ > 0 and net = wᵀx, and the output y ∈ {−1, +1} is generated by a stochastic unit according to the probability function P(y | w, x), given by

    P(y | w, x) = f(y net),  with f(u) = 1/(1 + e^(−2u))    (4.5.4)

with the expected output

    ⟨y⟩ = (+1) P(+1 | w, x) + (−1) P(−1 | w, x) = 2f(net) − 1 = tanh(net)    (4.5.5)
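The stochastic unit of Equations (4.5.4) and (4.5.5) can be sketched in a few lines. The logistic form f(u) = 1/(1 + e^(−2u)) and the bipolar outputs y ∈ {−1, +1} are the assumptions adopted here; under them the expected output reduces to tanh(net):

```python
import math
import random

def f(u):
    """Logistic squashing function, f(u) = 1/(1 + exp(-2u))."""
    return 1.0 / (1.0 + math.exp(-2.0 * u))

def sample_output(w, x, rng=random):
    """Draw y in {-1, +1} from the stochastic unit of Equation (4.5.4):
    P(y = +1 | w, x) = f(net), with net = w . x."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if rng.random() < f(net) else -1

def expected_output(w, x):
    """Expected output of Equation (4.5.5): <y> = 2 f(net) - 1 = tanh(net)."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return math.tanh(net)
```

Averaging many samples of `sample_output` should approach `expected_output` for any fixed w and x.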

First, we express the expected (average) reinforcement signal, ⟨r^k⟩_y, for the kth input vector x^k with respect to all possible outputs y as

    ⟨r^k⟩_y = Σ_{y = ±1} r(y, x^k) P(y | w, x^k)    (4.5.6)

and then evaluate its gradient with respect to w. The gradient of P(y | w, x^k) is given by

    ∇_w P(y | w, x^k) = [∂P(y | w, x^k)/∂net^k] x^k    (4.5.7)

which follows from Equation (4.5.4). We also have

    ∂P(+1 | w, x^k)/∂net^k = f′(net^k)    (4.5.8)

and

    ∂P(−1 | w, x^k)/∂net^k = −f′(net^k)    (4.5.9)

which can be used in Equation (4.5.7) to give

    ∇_w P(y | w, x^k) = y f′(net^k) x^k    (4.5.10)
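Equation (4.5.10) is easy to spot-check numerically. The sketch below, assuming the same logistic f(u) = 1/(1 + e^(−2u)) and bipolar outputs as above, compares the closed-form gradient y f′(net) x against a central finite-difference gradient of P(y | w, x):

```python
import math

def f(u):
    # logistic squashing function assumed for Equation (4.5.4)
    return 1.0 / (1.0 + math.exp(-2.0 * u))

def fprime(u):
    # derivative of f; equals 2 f(u) (1 - f(u)), an even function of u
    return 2.0 * f(u) * (1.0 - f(u))

def P(y, w, x):
    # P(y | w, x) = f(y * net) for y in {-1, +1}
    net = sum(wi * xi for wi, xi in zip(w, x))
    return f(y * net)

def grad_P_analytic(y, w, x):
    # Equation (4.5.10): grad_w P(y | w, x) = y f'(net) x
    net = sum(wi * xi for wi, xi in zip(w, x))
    return [y * fprime(net) * xi for xi in x]

def grad_P_numeric(y, w, x, eps=1e-6):
    # central finite differences, one weight component at a time
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        g.append((P(y, wp, x) - P(y, wm, x)) / (2.0 * eps))
    return g
```

For both values of y, the two gradients agree to within finite-difference error, and P(+1 | w, x) + P(−1 | w, x) = 1 as required.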

If we now take the gradient of Equation (4.5.6) and use Equation (4.5.10), we arrive at

    ∇_w ⟨r^k⟩_y = Σ_{y = ±1} r(y, x^k) y f′(net^k) x^k = [r(+1, x^k) − r(−1, x^k)] f′(net^k) x^k    (4.5.11)

which, upon noting that f′(net^k) = 2 P(+1 | w, x^k) P(−1 | w, x^k) and that y − ⟨y⟩ equals 2P(−1 | w, x^k) for y = +1 and −2P(+1 | w, x^k) for y = −1, can also be written as

    ∇_w ⟨r^k⟩_y = ⟨r (y − ⟨y⟩)⟩_y x^k = ⟨r (y − ⟨y⟩) x^k⟩_y    (4.5.12)

Finally, by averaging Equation (4.5.12) over all inputs x^k, we get

    ∇_w ⟨r⟩ = ⟨r (y − ⟨y⟩) x⟩    (4.5.13)
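The chain from Equation (4.5.6) to Equation (4.5.13) can also be verified numerically. The sketch below, again assuming the bipolar stochastic unit with logistic f(u) = 1/(1 + e^(−2u)) and a hypothetical reinforcement function `r_hyp` (an illustrative choice, not from the text), enumerates the two outputs exactly and compares ⟨r (y − ⟨y⟩) x⟩_y against a finite-difference gradient of the expected reinforcement:

```python
import math

def f(u):
    # logistic squashing function assumed for Equation (4.5.4)
    return 1.0 / (1.0 + math.exp(-2.0 * u))

def expected_r(w, x, r):
    # Equation (4.5.6): <r> = sum over y of r(y, x) P(y | w, x)
    net = sum(wi * xi for wi, xi in zip(w, x))
    return r(+1, x) * f(net) + r(-1, x) * (1.0 - f(net))

def reinforce_form(w, x, r):
    # Right-hand side of Equation (4.5.12): <r (y - <y>) x>_y, enumerated exactly
    net = sum(wi * xi for wi, xi in zip(w, x))
    ymean = math.tanh(net)
    g = [0.0] * len(x)
    for y, p in ((+1, f(net)), (-1, 1.0 - f(net))):
        for i, xi in enumerate(x):
            g[i] += p * r(y, x) * (y - ymean) * xi
    return g

def grad_expected_r_numeric(w, x, r, eps=1e-6):
    # central finite-difference gradient of <r> with respect to w
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        g.append((expected_r(wp, x, r) - expected_r(wm, x, r)) / (2.0 * eps))
    return g
```

For any choice of reward values, the exact expectation and the finite-difference gradient agree, which is the content of Equations (4.5.12) and (4.5.13) for a single pattern.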

where now the averages are taken across all patterns and all outputs. Note that ∇_w ⟨r⟩ is proportional to ∇_w ⟨J(w)⟩ in Equation (4.5.3) and has the opposite sign. Thus, Equation (4.5.2) can be written in terms of ∇_w ⟨r⟩ as

    d⟨w⟩/dt = ρ ∇_w ⟨r⟩    (4.5.14)

which implies that the average weight vector performs gradient ascent on, and converges to a local maximum of, the average reinforcement ⟨r⟩; i.e., Equation (4.5.1) converges, on average, to a solution that locally maximizes the average reinforcement signal.
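This convergence behavior can be illustrated by simulating a discretized version of Equation (4.5.1), Δw = ρ r (y − ⟨y⟩) x, on a hypothetical one-pattern task where r = 1 whenever y = +1 and r = 0 otherwise, so that maximizing ⟨r⟩ means driving P(y = +1) toward 1. The pattern, learning rate, and reward function below are illustrative choices, not from the text:

```python
import math
import random

def f(u):
    # logistic squashing function assumed for Equation (4.5.4)
    return 1.0 / (1.0 + math.exp(-2.0 * u))

def train(steps=5000, rho=0.1, seed=0):
    """Discretized Equation (4.5.1): delta_w = rho * r * (y - <y>) * x.
    Hypothetical task: one fixed pattern x, reward r = 1 iff y = +1."""
    rng = random.Random(seed)
    x = [1.0, 0.5]          # hypothetical single input pattern
    w = [0.0, 0.0]          # start with P(y = +1) = f(0) = 0.5
    for _ in range(steps):
        net = sum(wi * xi for wi, xi in zip(w, x))
        y = 1 if rng.random() < f(net) else -1
        r = 1.0 if y == 1 else 0.0
        ymean = math.tanh(net)
        for i in range(len(w)):
            w[i] += rho * r * (y - ymean) * x[i]
    return w
```

Starting from P(y = +1) = 0.5, the rule only updates on rewarded (+1) outputs and pushes w along x, so P(y = +1 | w, x) climbs toward 1, i.e., the average weight vector ascends ⟨r⟩ as Equation (4.5.14) predicts.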

Extensions of these results to a wider class of reinforcement algorithms can be found in Williams (1987). The characterization of the associative reward-penalty algorithm in Equation (3.2.1) is more difficult, since that algorithm does not necessarily maximize ⟨r⟩. However, the above analysis should give some insight into the behavior of simple reinforcement learning.
