# Solved – How does backpropagation work in the case of reinforcement learning for games

If we want a neural network to learn how to recognize e.g. digits, the backpropagation procedure is as follows:

1. Let the NN look at an image of a digit, and output a probability distribution over the possible digits.

2. Calculate the gradient of the loss function w.r.t. the parameters, and adjust the parameters.
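As a toy illustration of these two steps, here is a sketch using a single softmax layer trained on one synthetic "image"; the dimensions, data, and learning rate are all invented for illustration, not taken from any real digit dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_classes = 64, 10
W = rng.normal(0, 0.01, size=(n_classes, n_pixels))  # the parameters

x = rng.normal(size=n_pixels)   # stand-in for a flattened digit image
y = 3                           # its true label

def forward(W, x):
    logits = W @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()          # step 1: probabilities over the digits

def loss(W, x, y):
    return -np.log(forward(W, x)[y])   # cross-entropy loss

# Step 2: gradient of the loss w.r.t. the parameters (closed form for
# softmax + cross-entropy), followed by a gradient-descent update.
for _ in range(100):
    g = forward(W, x)
    g[y] -= 1.0                 # dL/dlogits = p - one_hot(y)
    W -= 0.1 * np.outer(g, x)   # dL/dW = dL/dlogits · x^T
```

After a few updates the network assigns almost all probability mass to the correct label — possible here precisely because the loss is a differentiable function of `W`.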

But now let's say we want the NN to learn how to play Pong.

Then the situation is different:

1. We can't really split the game up into independent "cycles", as we can for recognizing digits. Success in Pong is defined by (a) hitting the ball, and (b) getting the ball past the opponent. But whether this succeeds depends on a succession of choices. Let's say that at time \$t\$, the NN manages to hit the ball. This was only possible because it made the right decisions at times \$t-k, t-k+1, …, t\$, not just at time \$t\$. So how does the backpropagation procedure take this complex sequence of decisions into account when changing the parameters?

The fact that the NN makes a sequence of choices doesn't change the way backpropagation works. All backpropagation requires is that the loss be differentiable with respect to the weights, since that is what is needed to compute the gradients.

In fact, RNNs produce a sequence of outputs, and they too are trained with backpropagation (or "backpropagation through time", which is just ordinary backpropagation applied to the unrolled network).
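To see that backpropagation through time is just the chain rule on an unrolled sequence, here is a sketch on a toy linear RNN \$h_t = w \cdot h_{t-1} + x_t\$ with loss \$(h_T - y)^2\$; the inputs and target are made up, and the gradient is checked against a numerical derivative:

```python
# Toy linear RNN: h_t = w * h_{t-1} + x_t, loss L = (h_T - y)^2.
xs = [0.5, -1.0, 2.0]   # invented input sequence
y_target = 1.0
w = 0.3                  # the single recurrent weight

def run(w):
    h, hs = 0.0, [0.0]
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs

hs = run(w)

# BPTT: walk the unrolled sequence backwards, accumulating dL/dw.
delta = 2.0 * (hs[-1] - y_target)   # dL/dh_T
grad = 0.0
for t in reversed(range(len(xs))):
    grad += delta * hs[t]           # w's contribution at step t
    delta *= w                      # propagate dL/dh_t -> dL/dh_{t-1}

# Sanity check against a finite-difference derivative of the same loss:
eps = 1e-6
num = ((run(w + eps)[-1] - y_target) ** 2
       - (run(w - eps)[-1] - y_target) ** 2) / (2 * eps)
```

The backward pass and the numerical derivative agree, because the whole unrolled computation is differentiable in \$w\$.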

The difficulty with RL is not that the network produces a sequence of outputs rather than just one output. The difficulty is that often, the environment is non-differentiable with respect to the action taken. For example, consider a simple game where you must go left or right, getting a reward of 1 for going left, and a reward of -1 for going right.

The network must output a discrete action, left or right, which is most likely sampled from a sigmoid/softmax layer — and sampling immediately breaks the differentiability requirement. Even in continuous games, an RL environment is unlikely to be differentiable (suppose you output an action in [0,1], and you get 1 reward if your action \$a > 0.5\$ and -1 reward for \$a \leq 0.5\$ — this is nondifferentiable).
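To make that concrete, here is the thresholded reward from the example in plain Python; wherever its derivative exists it is exactly zero, so backpropagating through it would give no learning signal at all:

```python
def reward(a):
    # The thresholded reward from the example: +1 if a > 0.5, else -1.
    return 1.0 if a > 0.5 else -1.0

def finite_difference(f, a, eps=1e-6):
    # Numerical derivative; a stand-in for what backprop would compute.
    return (f(a + eps) - f(a - eps)) / (2 * eps)

# Away from the threshold the "gradient" is exactly zero, so a
# gradient-based learner gets no signal about which direction is better.
print(finite_difference(reward, 0.2))  # 0.0
print(finite_difference(reward, 0.8))  # 0.0
```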

Even if the environment happens to be differentiable, unless you know the dynamics (the physical laws) of the environment, you'd be unable to backpropagate through them.

Therefore in RL, we usually cannot train with vanilla backpropagation. Two popular classes of algorithms which work around this problem are:

1. Policy gradients: Use gradient estimation techniques which estimate the gradient of the expected reward w.r.t. the weights, even when the reward is not a differentiable function of them.
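A minimal sketch of this idea, using the REINFORCE estimator on the left/right game above (the learning rate and iteration count are arbitrary choices): the gradient of the expected reward is estimated as reward × ∇ log π(action), which never requires differentiating through the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0   # single policy parameter; pi(left) = sigmoid(theta)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p_left = sigmoid(theta)
    action = "left" if rng.random() < p_left else "right"
    r = 1.0 if action == "left" else -1.0      # the game's reward
    # Gradient of log pi(sampled action) w.r.t. theta:
    grad_logp = (1.0 - p_left) if action == "left" else -p_left
    # REINFORCE update: ascend the estimated reward gradient.
    theta += 0.1 * r * grad_logp
```

Because the environment only ever supplies a scalar reward for the sampled action, this works even though the reward is a non-differentiable (indeed, discrete) function of the action.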

2. Q-learning: Instead of training by maximizing reward directly, learn the expected reward of each state-action pair. This can be done with vanilla backpropagation. Then simply take the action with the highest expected reward to play the game.
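The same left/right game can sketch the Q-learning idea: the training target is the observed reward, so the squared error between the predicted and observed reward is differentiable in the parameters, and plain gradient descent (i.e. vanilla backpropagation in the neural-network case) applies. Here the "network" is just a table of two trainable Q-values, which keeps the sketch minimal:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros(2)                  # Q[0] = expected reward of left, Q[1] = right
rewards = {0: 1.0, 1: -1.0}      # the game: left pays +1, right pays -1

for _ in range(200):
    a = int(rng.integers(2))     # explore: pick a random action
    r = rewards[a]
    # Gradient of 0.5 * (Q[a] - r)^2 w.r.t. Q[a] is (Q[a] - r);
    # this is the differentiable regression loss that replaces
    # direct reward maximization.
    Q[a] -= 0.1 * (Q[a] - r)

best = int(np.argmax(Q))         # play greedily: pick the best action
```

The learned values converge to the true expected rewards, and playing greedily with respect to them recovers the optimal action.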

See here for an interesting case where vanilla BP can be used in RL.
