3.2 Reinforcement Learning

Reinforcement learning is a process of trial and error which is designed to maximize the expected value of a criterion function known as a "reinforcement signal." The basic idea of reinforcement learning has its origins in psychology in connection with experimental studies of animal learning (Thorndike, 1911). In its simplest form, reinforcement learning is based on the idea that if an action is followed by an "improvement" in the state of affairs, then the tendency to produce that action is strengthened, i.e., reinforced. Otherwise, the tendency of the system to produce that action is weakened (Barto and Singh, 1991; Sutton et al., 1991).

Consider a training set of the form {x^k, r^k}, k = 1, 2, ..., m, where x^k ∈ R^n and r^k is an evaluative signal (normally r^k ∈ {−1, +1}) supplied by a "critic". The idea here is not to associate x^k with r^k as in supervised learning. Rather, r^k is a reinforcement signal that informs the unit being trained about its performance on the input x^k. That is, r^k evaluates the "appropriateness" of the unit's output, y^k, due to the input x^k. Usually, r^k gives no indication of what y^k should be. It is therefore important for the unit to be stochastic, so that a mechanism for exploring the output space is present.

One may view supervised learning in a stochastic unit as an extreme case of reinforcement learning, where the output of the unit is binary and there is one correct output for each input. Here, r^k becomes the desired target d^k, and we may use the gradient descent learning rule of Equation (3.1.83) to train the stochastic unit. In general, the reinforcement signal itself may be stochastic, such that the pair {x^k, r^k} only provides the "probability" of positive reinforcement. The most general (and most difficult) extreme of reinforcement learning is where both the reinforcement signal and the input patterns depend arbitrarily on the past history of the stochastic unit's output. An example would be the stabilization of an unstable dynamical system, or a game of chess where the reinforcement signal arrives only at the end of a sequence of player moves.
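As a concrete illustration of the kind of stochastic unit assumed above, the following sketch samples a binary output with a probability determined by the weighted input sum. The logistic activation and the bipolar {−1, +1} output convention are assumptions for illustration, not prescribed by the text; the point is that the randomness is what provides exploration of the output space.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_output(w, x, rng):
    """Bipolar stochastic unit: fires y = +1 with probability
    P(y = +1) = 1 / (1 + exp(-2 net)); otherwise y = -1."""
    net = float(w @ x)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * net))
    return 1 if rng.random() < p_plus else -1

w = np.array([0.3, -0.1])
x = np.array([1.0, 1.0])

# Repeated presentations of the same input yield different outputs;
# the empirical frequency approximates P(y = +1) = 1/(1 + e^(-0.4)).
samples = [stochastic_output(w, x, rng) for _ in range(10000)]
freq = samples.count(1) / len(samples)
```

Note that the same input x produces both outputs with nonzero probability, which is exactly the exploration mechanism a deterministic unit would lack.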

3.2.1 Associative Reward-Penalty Reinforcement Learning Rule

We now present a reinforcement learning rule due to Barto and Anandan (1985), which is known as the associative reward-penalty (A_rp) algorithm. We discuss it here in the context of a single stochastic unit. Motivated by Equation (3.1.83), we may express the A_rp reinforcement rule as

    Δw^k = ρ^k (d^k − ⟨y^k⟩) f'(net^k) x^k                          (3.2.1)

where ⟨y^k⟩ is the expected value of the unit's output for input x^k, and the "target" d^k and learning rate ρ^k are set according to the reinforcement signal:

    d^k = +y^k  and  ρ^k = ρ⁺   if r^k = +1 (reward)
    d^k = −y^k  and  ρ^k = ρ⁻   if r^k = −1 (penalty)               (3.2.2)

with ρ⁺ ≫ ρ⁻ > 0. The derivative term in Equation (3.2.1) may be eliminated without affecting the general behavior of this learning rule; the resulting rule corresponds to steepest descent on the relative entropy criterion function. Setting d^k according to Equation (3.2.2) guides the unit to do what it just did if y^k is "good" and to do the opposite if not (Widrow et al., 1973). In general, this makes the dynamics of w^k in Equation (3.2.1) substantially different from those of w^k in the supervised stochastic learning rule of Equation (3.1.83). When learning converges, ⟨y^k⟩ approaches ±1, making the unit effectively deterministic; the unit's output approaches the state providing the largest average reinforcement on the training set.
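The A_rp rule above can be sketched in code as follows. The logistic stochastic unit, the sign-matching critic, and the specific learning rates are illustrative assumptions; the derivative term is dropped, which the text notes does not change the rule's general behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(w, x, rng):
    """Bipolar stochastic unit with P(y = +1) = logistic(2 net),
    so the expected output is <y> = tanh(net) (an assumed activation)."""
    net = float(w @ x)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * net))
    y = 1 if rng.random() < p_plus else -1
    return y, np.tanh(net)

def arp_update(w, x, y, mean_y, r, rho_plus=0.1, rho_minus=0.01):
    """One A_rp step, Eqs. (3.2.1)-(3.2.2), derivative term dropped."""
    d = y if r == 1 else -y               # imitate if rewarded, flip if penalized
    rho = rho_plus if r == 1 else rho_minus
    return w + rho * (d - mean_y) * x

# Hypothetical critic: reward the unit when its output matches the
# sign of the first input component (x[1] acts as a distractor).
w = np.zeros(2)
patterns = [np.array([1.0, 1.0]), np.array([-1.0, 1.0])]
for _ in range(2000):
    x = patterns[rng.integers(2)]
    y, mean_y = unit(w, x, rng)
    r = 1 if y == np.sign(x[0]) else -1
    w = arp_update(w, x, y, mean_y, r)
```

The asymmetry ρ⁺ ≫ ρ⁻ means rewarded actions shape the weights much more strongly than penalized ones, reflecting the fact that a penalty only says the output was wrong, not what it should have been.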

One variation of A_rp (Barto and Jordan, 1987) utilizes a continuous-valued or graded reinforcement signal, r^k ∈ [0, 1], and has the form:

    Δw^k = ρ [ r^k (y^k − ⟨y^k⟩) + λ (1 − r^k) (−y^k − ⟨y^k⟩) ] x^k        (3.2.3)

where 0 < λ < 1 weights the penalty term.
Another variation uses r^k = ±1 and has the simple form:

    Δw^k = ρ r^k (y^k − ⟨y^k⟩) x^k                                  (3.2.4)
This latter rule is more amenable to theoretical analysis; in Section 4.5 it is shown that the rule tends to maximize the average reinforcement signal, ⟨r⟩. Reinforcement learning speed may be improved if batch-mode training is used (Barto and Jordan, 1987; Ackley and Littman, 1990). Here, a given pattern x^k is presented several times, and the accumulation of all the weight changes is used to update w^k. Then, pattern x^(k+1) is presented several times, and so on. For an overview of the theory of reinforcement learning, see Barto (1985) and Williams (1992). Also, see the Special Issue on Reinforcement Learning, edited by Sutton (1992), for theoretical and practical considerations.
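Batch-mode training with the simple r^k = ±1 variation can be sketched as follows. The critic, the pattern, and the logistic stochastic unit are illustrative assumptions; the key point is that the weight changes from repeated presentations of one pattern are accumulated and applied at once.

```python
import numpy as np

rng = np.random.default_rng(2)

def batch_update(w, x, critic, rho=0.05, repeats=20):
    """Present the same pattern `repeats` times, accumulate the changes
    from the simple rule dw = rho * r * (y - <y>) * x, then apply the
    accumulated change in a single weight update."""
    dw = np.zeros_like(w)
    for _ in range(repeats):
        net = float(w @ x)
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * net))
        y = 1 if rng.random() < p_plus else -1
        r = critic(x, y)                  # critic returns +1 or -1
        dw += rho * r * (y - np.tanh(net)) * x
    return w + dw

# Hypothetical critic that rewards output +1 for this pattern.
critic = lambda x, y: 1 if y == 1 else -1
w, x = np.zeros(2), np.array([1.0, 0.5])
for _ in range(50):
    w = batch_update(w, x, critic)
```

Averaging many stochastic trials of the same pattern before updating reduces the variance of each weight change, which is one plausible reading of why batch mode speeds up learning here.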
