Let's imagine a Q-learning version of the supervised learning problem of guessing a digit from the MNIST database. In this game, the initial state is the 28×28 pixels of the image. You have 10 possible actions, labeled from 0 to 9. The reward you get is $R$ if you guess the digit correctly, and 0 otherwise. After each image/guess, the game ends.

As the second state is always terminal and assuming $\alpha = 1$, I simply update my $Q(s,a)$ with $Q(s,a) \leftarrow R$.
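For reference, here is how that simplification follows from the usual Q-learning update. The discount factor $\gamma$ and the symbol $r$ for the observed reward ($R$ on a correct guess, 0 otherwise) are my own notation here, not something fixed by the setup above:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Because $s'$ is terminal, $\max_{a'} Q(s',a') = 0$, and with $\alpha = 1$ the old estimate cancels:

$$Q(s,a) \leftarrow Q(s,a) + \left[ r - Q(s,a) \right] = r$$

So the target for the chosen action is $R$ for a correct guess and 0 for a wrong one, whatever $\gamma$ is.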

I noticed I get some very different learning curves when I change $R$, which I didn't expect. The model converges way faster when $R$ is big.

My guess is that it depends on the magnitude of the $Q(s,a)$ values. If the $Q(s,a)$ values are three-digit numbers, using $R = 1$ would be too "low" and would be equivalent to a 0. However, using $R = 100000$ would be so "big" compared to three-digit numbers that the updated $Q(s,a)$ would be similar to a one-hot encoded vector.

Is that a common issue?

#### Best Answer

Neural networks are sensitive to the scale of the input and the target. Rescaling the target implicitly requires rescaling the network weights, which in turn will influence the ability of the optimizer to find a good solution.
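
To make the scale effect concrete, here is a minimal sketch in plain Python. It assumes a toy linear Q-value trained with squared error and a fixed learning rate, which is not necessarily the asker's network or loss. The function name `sgd_step_size` and all its parameters are hypothetical. It measures how large a single SGD weight update is for two choices of $R$.

```python
import random


def sgd_step_size(reward, lr=0.01, n_features=8, seed=0):
    """Return the L2 norm of one SGD weight update for a toy linear
    Q-value trained with squared error toward `reward`.

    Assumption: a linear model stands in for the network, and the loss is
    0.5 * (q - reward)**2. Both are illustrative choices.
    """
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n_features)]  # toy "pixels"
    w = [rng.gauss(0.0, 0.01) for _ in range(n_features)]   # small initial weights
    q = sum(wi * xi for wi, xi in zip(w, x))                 # predicted Q(s, a)
    # dLoss/dw_i = (q - reward) * x_i for the squared-error loss above
    grad = [(q - reward) * xi for xi in x]
    update = [lr * g for g in grad]
    return sum(u * u for u in update) ** 0.5


small = sgd_step_size(reward=1.0)
large = sgd_step_size(reward=100000.0)
print(f"update norm with R=1:      {small:.6f}")
print(f"update norm with R=100000: {large:.6f}")
print(f"ratio: {large / small:.1f}")
```

In this sketch the error term `q - reward` is dominated by the reward while the weights are still small, so the update grows roughly in proportion to $R$. With plain SGD, changing $R$ therefore acts much like changing the learning rate, which may explain the different learning curves. Adaptive optimizers and other losses would behave differently.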