Reinforcement learning is a trial-and-error process designed
to maximize the expected value of a criterion
function known as a "reinforcement signal." The basic
idea of reinforcement learning has its origins in psychology in
connection with experimental studies of animal learning (Thorndike,
1911). In its simplest form, reinforcement learning is based
on the idea that if an action is followed by an "improvement"
in the state of affairs, then the tendency to produce that action
is strengthened, i.e., reinforced. Otherwise, the tendency of
the system to produce that action is weakened (Barto and Singh,
1991; Sutton et al., 1991).

Consider a training set of the form {**x***k*, *r**k*},
*k* = 1, 2, ..., *m*, where **x***k* ∈ ℝ*n* and *r**k* is
an evaluative signal (normally *r**k* ∈ {−1, +1}) which
is supplied by a "critic". The idea here is not to
associate **x***k*
with *r**k*
as in supervised learning. Rather, *r**k*
is a reinforcement signal which informs the unit being trained
about its performance on the input **x***k*.
So *r**k* evaluates
the "appropriateness" of the unit's output, *y**k*,
due to the input **x***k*.
Usually, *r**k*
gives no indication on what *y**k*
should be. It is therefore important for the unit to be stochastic
so that a mechanism of exploration of the output space is present.
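
Since *r**k* gives no hint of what the output should be, the unit must explore by sampling its output. A minimal sketch of such a bipolar stochastic unit follows (the tanh-based output probability is one common choice for stochastic units of this kind; treat the exact form as an assumption of this sketch):

```python
import math
import random

def stochastic_output(w, x):
    """Bipolar stochastic unit: P(y = +1 | x) = (1 + tanh(w.x)) / 2,
    so the mean output <y> equals tanh(w.x).  Sampling y, rather than
    emitting <y> deterministically, is what provides exploration."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    p_plus = 0.5 * (1.0 + math.tanh(net))
    return +1 if random.random() < p_plus else -1
```

With small weights the unit emits +1 and −1 almost equally often; as |**w**·**x**| grows, the output becomes nearly deterministic.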

One may view supervised learning in a stochastic
unit as an extreme case of reinforcement learning, where the output
of the unit is binary and there is one correct output for each
input. Here, *r**k*
becomes the desired target *d**k*.
Also, we may use the gradient descent learning rule of Equation
(3.1.83) to train the stochastic unit. In general, the reinforcement
signal itself may be stochastic such that the pair {**x***k*, *r**k*}
only provides the "probability" of positive reinforcement.
The most general, and most difficult, extreme of reinforcement
learning is the case where both the reinforcement signal and the
input patterns depend arbitrarily on the past history of the stochastic
unit's outputs. Examples include the stabilization of an unstable
dynamical system, or a game of chess in which the reinforcement
signal arrives only at the end of a sequence of player moves.

**3.2.1 Associative Reward-Penalty Reinforcement
Learning Rule**

We now present a reinforcement learning rule due
to Barto and Anandan (1985), which is known as the associative
reward-penalty (*A**rp*)
algorithm. We discuss it here in the context of a single stochastic
unit. Motivated by Equation (3.1.83), we may express the *A**rp*
reinforcement rule as

$$\Delta\mathbf{w}^k = \rho^k \left(d^k - \langle y^k \rangle\right) f'(\mathrm{net}^k)\,\mathbf{x}^k \tag{3.2.1}$$

where

$$d^k = \begin{cases} \;\;\,y^k & \text{if } r^k = +1 \\ -y^k & \text{if } r^k = -1 \end{cases} \tag{3.2.2}$$

and

$$\rho^k = \begin{cases} \rho^+ & \text{if } r^k = +1 \\ \rho^- & \text{if } r^k = -1 \end{cases} \tag{3.2.3}$$

with $\rho^+ \gg \rho^- > 0$.
The derivative term in Equation (3.2.1)
may be eliminated without affecting the general behavior of this
learning rule. In this case, the resulting learning rule corresponds
to steepest descent on the relative entropy criterion function.
The setting of *d**k*
according to Equation (3.2.2) guides the unit to do what it just
did if *y**k*
is "good" and to do the opposite if not (Widrow et al.,
1973). In general, this makes the dynamics of **w***k*
in Equation (3.2.1) substantially different from that of **w***k*
in the supervised stochastic learning rule in Equation (3.1.83).
When learning converges, the probabilities of the unit's output
states approach 0 or 1, making the unit effectively deterministic;
the unit's output approaches the state providing the largest average
reinforcement on the training set.
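
One *A**rp* update for a single bipolar stochastic unit can be sketched as follows (a minimal illustration, with the derivative term dropped as discussed above; the critic function, the tanh output probability, and the default learning rates are assumptions of this sketch, not prescriptions from the text):

```python
import math
import random

def a_rp_step(w, x, critic, rho_plus=0.1, rho_minus=0.01):
    """One associative reward-penalty update (Eqs. 3.2.1-3.2.3, derivative
    term dropped).  'critic' is a hypothetical stand-in for the environment:
    it returns r = +1 (reward) or r = -1 (penalty) for output y on input x."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    y_mean = math.tanh(net)                                     # <y^k>
    y = +1 if random.random() < 0.5 * (1.0 + y_mean) else -1    # sample output
    r = critic(x, y)                                            # evaluative signal
    d = y if r == +1 else -y                    # Eq. (3.2.2): mimic or flip
    rho = rho_plus if r == +1 else rho_minus    # Eq. (3.2.3): rho+ >> rho-
    return [wi + rho * (d - y_mean) * xi for wi, xi in zip(w, x)]
```

For example, with a critic that rewards y = sign(x), repeated updates drive the weight up until the unit responds nearly deterministically, consistent with the convergence behavior described above.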

One variation of *A**rp*
(Barto and Jordan, 1987) utilizes a continuous-valued or graded
reinforcement signal, $r^k \in [0, 1]$, and has the form:

$$\Delta\mathbf{w}^k = \rho\left[\, r^k \left(y^k - \langle y^k \rangle\right) + \lambda \left(1 - r^k\right)\left(-y^k - \langle y^k \rangle\right)\right]\mathbf{x}^k \tag{3.2.4}$$

Another variation uses *r**k*
= ±1 and has the simple form:

$$\Delta\mathbf{w}^k = \rho\, r^k \left(y^k - \langle y^k \rangle\right)\mathbf{x}^k \tag{3.2.5}$$

This latter rule is more amenable to theoretical analysis,
as shown in Section 4.5, where the rule is shown to
tend to maximize the average reinforcement signal, $\langle r \rangle$.
Reinforcement learning speed may be improved if batch mode training
is used (Barto and Jordan, 1987; Ackley and Littman, 1990). Here,
a given pattern **x***k*
is presented several times, and the accumulation of all the weight
changes is used to update **w***k*.
Then, pattern **x***k*+1
is presented several times, and so on. For an overview treatment
of the theory of reinforcement learning, see Barto (1985) and
Williams (1992). Also, see the Special Issue on Reinforcement
Learning, edited by Sutton (1992), for theoretical and practical
considerations.
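
As a closing illustration, the simplified update of Equation (3.2.5) can be sketched as follows (again assuming a tanh-based bipolar stochastic unit and a hypothetical critic returning r = ±1; the learning rate is an arbitrary choice for the sketch):

```python
import math
import random

def simple_rp_step(w, x, critic, rho=0.1):
    """One update with the simplified rule of Eq. (3.2.5):
    delta_w = rho * r * (y - <y>) * x, with r = +1 or -1.
    Rewarded outputs are reinforced; penalized outputs are pushed
    away, with a single symmetric learning rate."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    y_mean = math.tanh(net)                                     # <y^k>
    y = +1 if random.random() < 0.5 * (1.0 + y_mean) else -1    # sample output
    r = critic(x, y)                                            # evaluative signal
    return [wi + rho * r * (y - y_mean) * xi for wi, xi in zip(w, x)]
```

Tracking the running mean of *r* during training on a simple task (e.g., rewarding y = sign(x)) shows it rising toward its maximum, in line with the average-reinforcement-maximizing property cited above.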