**4.5 Theory of Reinforcement Learning**

from which the average learning equation, obtained by averaging over the two possible outputs for the *k*th input vector, is given as

$$\langle \Delta\mathbf{w} \rangle^{k} \;=\; \rho \, \bigl\langle \, r \, \bigl( y - \langle y \rangle^{k} \bigr) \bigr\rangle^{k} \, \mathbf{x}^{k} \qquad (4.5.5)$$

as in Section 3.1.6. Next, it is shown that Equation (4.5.3) is, on average, proportional to the gradient of the expected reinforcement
signal (Williams, 1987; Hertz et al., 1991).

First, we express the expected (average) reinforcement
signal, $\langle r \rangle^{k}$, for the *k*th input vector
with respect to all possible outputs *y* as

$$\langle r \rangle^{k} \;=\; \sum_{y=\pm 1} r\bigl(y,\mathbf{x}^{k}\bigr)\, P\bigl(y \mid \mathbf{x}^{k}\bigr) \qquad (4.5.6)$$

and then evaluate its gradient with respect to **w**.
The gradient of $P(y \mid \mathbf{x}^{k})$ is given by

$$\nabla_{\mathbf{w}} P\bigl(y \mid \mathbf{x}^{k}\bigr) \;=\; y \, f'\bigl(y\,\mathbf{w}^{T}\mathbf{x}^{k}\bigr)\, \mathbf{x}^{k} \qquad (4.5.7)$$

which follows from Equation (4.5.4). We also have, for the logistic function $f(u) = \bigl(1 + e^{-u}\bigr)^{-1}$,

$$f'(u) \;=\; f(u)\, f(-u) \qquad (4.5.8)$$

and

$$y - \langle y \rangle^{k} \;=\; 2\, y\, f\bigl(-y\,\mathbf{w}^{T}\mathbf{x}^{k}\bigr) \qquad (4.5.9)$$

which can be used in Equation (4.5.7) to give

$$\nabla_{\mathbf{w}} P\bigl(y \mid \mathbf{x}^{k}\bigr) \;=\; \tfrac{1}{2}\, P\bigl(y \mid \mathbf{x}^{k}\bigr)\, \bigl( y - \langle y \rangle^{k} \bigr)\, \mathbf{x}^{k} \qquad (4.5.10)$$
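Equation (4.5.9) can be checked by cases. Writing $h^{k} = \mathbf{w}^{T}\mathbf{x}^{k}$, and using $\langle y \rangle^{k} = f(h^{k}) - f(-h^{k})$ together with $f(u) + f(-u) = 1$ for the logistic function, a short verification under these assumptions is:

```latex
% Check of (4.5.9) for each output value, using f(u) + f(-u) = 1:
\begin{align*}
y = +1: &\quad 1 - \langle y \rangle^{k}
          = 1 - \bigl(f(h^{k}) - f(-h^{k})\bigr)
          = 2\, f(-h^{k}), \\
y = -1: &\quad -1 - \langle y \rangle^{k}
          = -1 - f(h^{k}) + f(-h^{k})
          = -2\, f(h^{k})
          = 2\,(-1)\, f\bigl(-(-1)\,h^{k}\bigr).
\end{align*}
```

Both cases agree with the single expression $2\,y\,f(-y\,h^{k})$.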

If we now take the gradient of Equation (4.5.6) and
use (4.5.10), we arrive at

$$\nabla_{\mathbf{w}} \langle r \rangle^{k} \;=\; \tfrac{1}{2} \sum_{y=\pm 1} r\bigl(y,\mathbf{x}^{k}\bigr)\, P\bigl(y \mid \mathbf{x}^{k}\bigr)\, \bigl( y - \langle y \rangle^{k} \bigr)\, \mathbf{x}^{k} \qquad (4.5.11)$$

which can also be written as

$$\nabla_{\mathbf{w}} \langle r \rangle^{k} \;=\; \tfrac{1}{2}\, \bigl\langle \, r \, \bigl( y - \langle y \rangle^{k} \bigr) \bigr\rangle^{k}\, \mathbf{x}^{k} \;=\; \frac{1}{2\rho}\, \langle \Delta\mathbf{w} \rangle^{k} \qquad (4.5.12)$$
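The per-pattern identity in Equation (4.5.12) can be checked numerically by enumerating the two outputs exactly and comparing the averaged update with a finite-difference gradient of $\langle r \rangle^{k}$. The sketch below assumes the setup above (logistic $f$, with $P(y \mid \mathbf{x}^{k}) = f(y\,\mathbf{w}^{T}\mathbf{x}^{k})$); the particular reward function `r`, the dimension, and the constants are arbitrary illustrative choices, not from the text:

```python
import numpy as np

def f(u):
    # logistic function; satisfies f(-u) = 1 - f(u)
    return 1.0 / (1.0 + np.exp(-u))

def r(y, x):
    # arbitrary bounded reward function, chosen only for this check
    return np.tanh(y * x[0] - 0.3 * x[1])

rng = np.random.default_rng(0)
w = rng.normal(size=3)          # weight vector
x = rng.normal(size=3)          # one input pattern x^k
rho = 0.1                       # learning rate
h = w @ x

# exact output-averaged update <dw>^k = rho <r (y - <y>^k)>^k x^k
mean_y = f(h) - f(-h)
avg_dw = rho * sum(f(y * h) * r(y, x) * (y - mean_y)
                   for y in (-1.0, 1.0)) * x

# gradient of <r>^k = sum_y r(y, x) f(y w.x) by central differences
def avg_r(wv):
    return sum(r(y, x) * f(y * (wv @ x)) for y in (-1.0, 1.0))

eps = 1e-6
grad = np.array([(avg_r(w + eps * e) - avg_r(w - eps * e)) / (2 * eps)
                 for e in np.eye(3)])

print(np.allclose(avg_dw, 2 * rho * grad, atol=1e-6))  # prints: True
```

The identity is exact, so the agreement is limited only by finite-difference error.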

Finally, by averaging Equation (4.5.12) over all
inputs $\mathbf{x}^{k}$,
we get

$$\nabla_{\mathbf{w}} \langle r \rangle \;=\; \frac{1}{2\rho}\, \langle \Delta\mathbf{w} \rangle \qquad (4.5.13)$$

where now the averages are across all patterns and
all outputs. Note that $\nabla_{\mathbf{w}} \langle r \rangle$ is proportional
to the criterion gradient in Equation (4.5.3) and has an opposite
sign; in other words, the rule effectively performs gradient descent on $-\langle r \rangle$. Thus, Equation (4.5.2) can be written in terms of
$\langle r \rangle$ as

$$\langle \Delta\mathbf{w} \rangle \;=\; 2\rho\, \nabla_{\mathbf{w}} \langle r \rangle \qquad (4.5.14)$$

which implies that the average weight vector converges
to a local maximum of $\langle r \rangle$; i.e., Equation
(4.5.1) converges, on average, to a solution that locally maximizes
the average reinforcement signal.
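To illustrate this convergence claim, the following sketch simulates the rule $\Delta\mathbf{w} = \rho\, r\,(y - \langle y \rangle)\,\mathbf{x}$ on a small task and tracks the exact expected reinforcement $\langle r \rangle$. The two-pattern task, the reward $r = d\,y$ (reward $+1$ for the desired output $d$, penalty $-1$ otherwise), and all constants are hypothetical choices for illustration, not from the text:

```python
import numpy as np

def f(u):
    # logistic function; P(y = 1 | x) = f(w.x)
    return 1.0 / (1.0 + np.exp(-u))

# two input patterns with desired outputs d = +1 / -1 (illustrative)
patterns = [(np.array([1.0, -1.0, 0.5]), +1.0),
            (np.array([-1.0, 0.5, 1.0]), -1.0)]

def expected_r(w):
    # exact <r> across patterns and outputs; with r = d*y, <r> = d*<y>
    return np.mean([d * (f(w @ x) - f(-(w @ x))) for x, d in patterns])

rng = np.random.default_rng(1)
w = np.zeros(3)
rho = 0.05
before = expected_r(w)              # 0.0 for w = 0

for _ in range(2000):
    x, d = patterns[rng.integers(2)]
    h = w @ x
    y = 1.0 if rng.random() < f(h) else -1.0   # stochastic bipolar output
    reward = d * y                             # +1 if correct, -1 if not
    w += rho * reward * (y - (f(h) - f(-h))) * x

after = expected_r(w)
print(before, "->", after)          # <r> rises from 0 toward 1
```

Since the average update ascends the gradient of $\langle r \rangle$, the expected reinforcement increases over training on this separable task.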

Extensions of these results to a wider class of
reinforcement algorithms can be found in Williams (1987). The
characterization of the associative reward-penalty algorithm in
Equation (3.2.1) is more difficult, since that algorithm does not necessarily
maximize $\langle r \rangle$. However, the above analysis
should give some insight into the behavior of simple reinforcement
learning.
