4.5 Theory of Reinforcement Learning

Recall the simplified stochastic reinforcement learning rule of Equation (3.2.5). The continuous-time version of this rule is given by

\frac{d\mathbf{w}}{dt} = \rho\, r \left( y - \langle y \rangle \right) \mathbf{x}^k \qquad (4.5.1)
from which the average learning equation follows as

\frac{d\langle \mathbf{w} \rangle}{dt} = \rho \left\langle r \left( y - \langle y \rangle \right) \mathbf{x}^k \right\rangle \qquad (4.5.2)
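In discrete time, the rule of Equation (4.5.1) amounts to the update \Delta\mathbf{w} = \rho\, r (y - \langle y \rangle)\, \mathbf{x}. The following sketch implements that update, generating y with the stochastic unit of Equations (4.5.4) and (4.5.5); the parameter values and the toy reward function are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, rho = 0.8, 0.05   # gain of the stochastic unit and learning rate (assumed values)

def step(w, x, reward):
    """One discrete update dw = rho * r * (y - <y>) * x."""
    y_mean = np.tanh(lam * (w @ x))   # expected output, cf. Eq. (4.5.5)
    p_plus = 0.5 * (1.0 + y_mean)     # P(y = +1 | w, x)
    y = 1.0 if rng.random() < p_plus else -1.0
    r = reward(x, y)                  # scalar reinforcement signal
    return w + rho * r * (y - y_mean) * x

# Toy task (illustrative): reward the unit only when it answers +1.
w = np.zeros(3)
x = np.array([1.0, -0.5, 0.25])
reward = lambda x, y: 1.0 if y > 0 else 0.0
for _ in range(2000):
    w = step(w, x, reward)
print(np.tanh(lam * (w @ x)))         # expected output drifts toward +1
```

Since only the y = +1 answer is rewarded here, the update fires exactly when that answer occurs, pushing \mathbf{w}^T\mathbf{x} up until \langle y \rangle saturates near +1.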
Now, employing Equation (4.2.6), we may think of Equation (4.5.2) as implementing a gradient search on an average instantaneous criterion function, \langle J(\mathbf{w}) \rangle, whose gradient is given by

\nabla_{\mathbf{w}} \langle J(\mathbf{w}) \rangle = -\left\langle r \left( y - \langle y \rangle \right) \mathbf{x}^k \right\rangle \qquad (4.5.3)
In Equations (4.5.1) through (4.5.3), r is the reinforcement signal received in response to the kth input \mathbf{x}^k, and the output y \in \{-1, +1\} is generated by a stochastic unit according to the probability function P(y \,|\, \mathbf{w}, \mathbf{x}^k), given by

P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) = f\!\left(y\, \mathbf{w}^{\mathrm{T}} \mathbf{x}^k\right), \qquad f(u) = \frac{1}{1 + e^{-2\lambda u}} \qquad (4.5.4)
with the expected output

\langle y \rangle = (+1)\, P\!\left(+1 \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) + (-1)\, P\!\left(-1 \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) = \tanh\!\left(\lambda\, \mathbf{w}^{\mathrm{T}} \mathbf{x}^k\right) \qquad (4.5.5)
as in Section 3.1.6. Next, it is shown that Equation (4.5.3) is proportional to the gradient of the expected reinforcement signal (Williams, 1987; Hertz et al., 1991).
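The stochastic unit of Equations (4.5.4) and (4.5.5) can be checked numerically: sampling y = \pm 1 with probability P(y \,|\, \mathbf{w}, \mathbf{x}) and comparing the sample mean against \tanh(\lambda\, \mathbf{w}^T \mathbf{x}). All parameter values below are arbitrary test choices:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.8                              # assumed gain of the stochastic unit
w = np.array([0.6, -0.4, 0.2])         # arbitrary test values
x = np.array([1.0, 0.5, -1.5])

u = w @ x
p_plus = 1.0 / (1.0 + np.exp(-2.0 * lam * u))   # P(y = +1 | w, x), Eq. (4.5.4)
samples = np.where(rng.random(200_000) < p_plus, 1.0, -1.0)

print(samples.mean(), np.tanh(lam * u))         # the two agree closely
```

The agreement reflects the identity \langle y \rangle = 2 P(+1 \,|\, \mathbf{w}, \mathbf{x}) - 1 = \tanh(\lambda\, \mathbf{w}^T \mathbf{x}) for a \pm 1-valued unit.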

First, we express the expected (average) reinforcement signal, \langle r \rangle^k, for the kth input vector with respect to all possible outputs y as

\langle r \rangle^k = \sum_{y = \pm 1} r\!\left(\mathbf{x}^k, y\right) P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) \qquad (4.5.6)
and then evaluate its gradient with respect to \mathbf{w}. The gradient of P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) is given by

\nabla_{\mathbf{w}} P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) = y\, f'\!\left(y\, \mathbf{w}^{\mathrm{T}} \mathbf{x}^k\right) \mathbf{x}^k \qquad (4.5.7)
which follows from Equation (4.5.4). We also have

f'(u) = 2\lambda\, f(u) \left[ 1 - f(u) \right] \qquad (4.5.8)

and

y - \langle y \rangle = 2y \left[ 1 - f\!\left( y\, \mathbf{w}^{\mathrm{T}} \mathbf{x}^k \right) \right] \qquad (4.5.9)

which can be used in Equation (4.5.7) to give

\nabla_{\mathbf{w}} P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) = \lambda \left( y - \langle y \rangle \right) P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) \mathbf{x}^k \qquad (4.5.10)
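The identity in Equation (4.5.10) can be verified against a finite-difference gradient of P(y \,|\, \mathbf{w}, \mathbf{x}); a minimal sketch, with arbitrary test values for \lambda, \mathbf{w}, and \mathbf{x}:

```python
import numpy as np

lam = 0.8                                  # assumed gain
w = np.array([0.3, -0.7, 0.5])             # arbitrary test values
x = np.array([1.2, 0.4, -0.9])

def P(y, w):                               # P(y | w, x) = f(y w.x), cf. Eq. (4.5.4)
    return 1.0 / (1.0 + np.exp(-2.0 * lam * y * (w @ x)))

eps = 1e-6
errs = []
for y in (1.0, -1.0):
    # central-difference gradient of P with respect to each weight
    num = np.array([(P(y, w + eps * e) - P(y, w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
    # closed form of Equation (4.5.10)
    closed = lam * (y - np.tanh(lam * (w @ x))) * P(y, w) * x
    errs.append(np.max(np.abs(num - closed)))
print(errs)                                # both discrepancies are tiny
```

The two gradients agree to numerical precision for both values of y, which is the content of Equation (4.5.10).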
If we now take the gradient of Equation (4.5.6) and use (4.5.10), we arrive at

\nabla_{\mathbf{w}} \langle r \rangle^k = \lambda \sum_{y = \pm 1} r\!\left(\mathbf{x}^k, y\right) \left( y - \langle y \rangle \right) P\!\left(y \,\middle|\, \mathbf{w}, \mathbf{x}^k\right) \mathbf{x}^k \qquad (4.5.11)
which can also be written as

\nabla_{\mathbf{w}} \langle r \rangle^k = \lambda \left\langle r \left( y - \langle y \rangle \right) \mathbf{x}^k \right\rangle^k \qquad (4.5.12)

where the average is taken over the outputs y for the fixed input \mathbf{x}^k.
Finally, by averaging Equation (4.5.12) over all inputs \mathbf{x}^k, we get

\nabla_{\mathbf{w}} \langle r \rangle = \lambda \left\langle r \left( y - \langle y \rangle \right) \mathbf{x}^k \right\rangle \qquad (4.5.13)
where now the averages are across all patterns and all outputs. Note that \nabla_{\mathbf{w}} \langle r \rangle is proportional to \nabla_{\mathbf{w}} \langle J \rangle in Equation (4.5.3) and has the opposite sign. Thus, Equation (4.5.2) can be written in terms of \nabla_{\mathbf{w}} \langle r \rangle as

\frac{d\langle \mathbf{w} \rangle}{dt} = \frac{\rho}{\lambda}\, \nabla_{\mathbf{w}} \langle r \rangle \qquad (4.5.14)

which implies that the average weight vector converges to a local maximum of \langle r \rangle; i.e., Equation (4.5.1) converges, on average, to a solution that locally maximizes the average reinforcement signal.
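This main result, Equation (4.5.14), can also be checked numerically: the exactly averaged update \rho \langle r (y - \langle y \rangle)\, \mathbf{x} \rangle should equal (\rho/\lambda)\, \nabla_{\mathbf{w}} \langle r \rangle. In the sketch below, the reward table, input patterns, and parameters are all arbitrary assumptions for the test:

```python
import numpy as np

lam, rho = 0.8, 0.1                        # assumed gain and learning rate
X = np.array([[1.0, 0.5], [-0.3, 1.0], [0.8, -0.6]])   # three input patterns

def reward(k, y):                          # arbitrary reinforcement r(x^k, y)
    return [0.9, -0.2, 0.4][k] * y

def P(y, w, x):                            # cf. Eq. (4.5.4)
    return 1.0 / (1.0 + np.exp(-2.0 * lam * y * (w @ x)))

def avg_r(w):                              # <r>: average over patterns and outputs
    return np.mean([sum(reward(k, y) * P(y, w, X[k]) for y in (1.0, -1.0))
                    for k in range(len(X))])

def avg_update(w):                         # rho * <r (y - <y>) x>, computed exactly
    total = np.zeros(2)
    for k, x in enumerate(X):
        y_mean = np.tanh(lam * (w @ x))
        for y in (1.0, -1.0):
            total += reward(k, y) * (y - y_mean) * P(y, w, x) * x
    return rho * total / len(X)

w = np.array([0.2, -0.5])
eps = 1e-6
grad = np.array([(avg_r(w + eps * e) - avg_r(w - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(avg_update(w), (rho / lam) * grad)   # the two directions coincide
```

The match confirms that the averaged rule performs gradient ascent on \langle r \rangle, up to the factor \rho/\lambda.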

Extensions of these results to a wider class of reinforcement algorithms can be found in Williams (1987). The characterization of the associative reward-penalty algorithm in Equation (3.2.1) is more difficult, since that algorithm does not necessarily maximize \langle r \rangle. However, the above analysis should give some insight into the behavior of simple reinforcement learning.
