I am training a DQN for a task where an agent must reach a goal in 2D space, with four actions: up, down, left and right. The reward at each time step decays exponentially with the distance between the agent and the goal.

The problem I am having is that, at times, the q-values predicted by the DQN are far higher than they should be theoretically. For example, the maximum reward available at any time step is 0.1 (when the agent is exactly at the goal), yet the DQN is predicting q-values of over 20. After a while this overshoot dies down, but it causes significant problems while it lasts.
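To make "higher than they should be theoretically" concrete: if every reward is at most `r_max` and the discount factor is `gamma < 1`, no true q-value can exceed `r_max / (1 - gamma)`. The sketch below computes that bound. The discount factor is not stated in this post, so the values of `gamma` used here are assumptions for illustration only, and `q_upper_bound` is a hypothetical helper name.

```python
def q_upper_bound(r_max, gamma, horizon=None):
    """Upper bound on any true q-value when every reward is <= r_max.

    horizon=None means an infinite (or unknown) episode length;
    otherwise the bound is the finite geometric sum over `horizon` steps.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("this sketch assumes 0 <= gamma < 1")
    if horizon is None:
        return r_max / (1.0 - gamma)
    return r_max * (1.0 - gamma ** horizon) / (1.0 - gamma)


if __name__ == "__main__":
    # r_max = 0.1 is from the post; the gamma values are assumed.
    for gamma in (0.9, 0.99, 0.995):
        print(gamma, q_upper_bound(0.1, gamma))
```

With `r_max = 0.1`, a q-value of 20 would only be reachable if `gamma` were at least 0.995, so for smaller discount factors a prediction above 20 is certainly an overestimate.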

The graph below shows the average q-value plotted against the episode number:

One common cause of this is that increasing one q-value also increases neighbouring q-values (because the neural network output is smooth), so there is a compounding effect where the q-values spiral out of control. However, I am using a target network separate from the q-network, as proposed in DeepMind's original DQN paper, which is designed to mitigate this problem. I copy the q-network's weights into the target network every 500 steps.
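For reference, the periodic hard update described above can be sketched as follows. This is a framework-agnostic illustration, not my actual training code: the weights are stood in for by a plain dict, and `maybe_sync` is a hypothetical helper name. Only the interval of 500 steps comes from the setup described here.

```python
import copy

SYNC_EVERY = 500  # target network update interval, as described above


def maybe_sync(step, q_weights, target_weights, sync_every=SYNC_EVERY):
    """Return the weights the target network should hold after `step`.

    Every `sync_every` steps the target gets an independent copy of the
    q-network weights; otherwise it keeps its old (frozen) weights.
    """
    if step % sync_every == 0:
        return copy.deepcopy(q_weights)
    return target_weights


if __name__ == "__main__":
    q_weights = {"w": [0.0]}
    target_weights = copy.deepcopy(q_weights)
    for step in range(1, 1001):
        q_weights["w"][0] += 0.01  # stand-in for a gradient update
        target_weights = maybe_sync(step, q_weights, target_weights)
    print(q_weights, target_weights)
```

The deep copy matters: if the target merely aliased the q-network's weights, it would track every gradient update and no longer provide a fixed bootstrap target between syncs.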

What could be the other causes of this?


#### Best Answer

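This happens. It is a known weakness of DQN: the max over next-state q-values in the bootstrap target is biased upwards, so estimation errors tend to accumulate as overestimates. Try increasing the number of steps between target network updates, decreasing the discount factor, or using double DQN.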
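As a rough sketch of what double DQN changes, compare the two bootstrap targets below. In standard DQN the target network both picks and evaluates the next action; in double DQN the online q-network picks the action and the target network evaluates it. The function names and the example numbers are made up for illustration, and the q-values are plain lists with one entry per action (up, down, left, right).

```python
def dqn_target(reward, next_q_target, gamma, done):
    """Standard DQN target: max over the target network's q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Double DQN target: online net selects the action, target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    # Hypothetical q-values for the next state, one per action.
    next_q_online = [1.0, 2.0, 0.5, 0.1]
    next_q_target = [3.0, 1.5, 0.2, 0.4]  # action 0 is overestimated here
    print(dqn_target(0.1, next_q_target, 0.9, False))
    print(double_dqn_target(0.1, next_q_online, next_q_target, 0.9, False))
```

The double DQN target can never be larger than the standard one for the same inputs, because it evaluates a single action under the target network rather than taking that network's maximum. This is why it tends to dampen the kind of overshoot described in the question.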