I have a question concerning actor-critic methods in reinforcement learning.
In these slides (https://hadovanhasselt.files.wordpress.com/2016/01/pg1.pdf), different types of actor-critic methods are explained; advantage actor-critic and TD actor-critic are both mentioned on the last slide.
But when I look at the slide "Estimating the advantage function (2)", it says that the advantage function can be approximated by the TD error. The update rule then includes the TD error in the same way as in TD actor-critic.
So are advantage actor-critic and TD actor-critic actually the same? Or is there a difference I'm not seeing?
The advantage can be approximated by the TD error. This may be helpful especially if you want to update $\theta$ after each transition.
For batch approaches, you can compute $Q_w(S, A)$, e.g. by means of fitted Q-iteration, and subsequently $V(S)$. Using these, you have a general advantage function, and your policy-gradient update may be much more stable because it will be closer to the global/actual advantage function.
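To make the online case concrete, here is a minimal sketch of one-step actor-critic where the TD error $\delta = r + \gamma V(s') - V(s)$ is used as a one-sample estimate of the advantage $A(s, a)$. The 2-state toy MDP, the learning rates, and the tabular softmax policy are all assumptions for illustration, not anything from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma = 0.9
alpha_v, alpha_pi = 0.1, 0.05

V = np.zeros(n_states)                    # critic: tabular state values
theta = np.zeros((n_states, n_actions))   # actor: tabular policy logits

def policy(s):
    # softmax over the logits for state s
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    # hypothetical dynamics: action 0 keeps the state, action 1 flips it;
    # reward 1 whenever the next state is state 1
    s_next = s if a == 0 else 1 - s
    return s_next, float(s_next == 1)

s = 0
for _ in range(5000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    s_next, r = step(s, a)
    delta = r + gamma * V[s_next] - V[s]   # TD error ~ sampled advantage
    V[s] += alpha_v * delta                # critic update
    grad_log = -p                          # grad of log pi(a|s) for softmax:
    grad_log[a] += 1.0                     # one-hot(a) - pi(.|s)
    theta[s] += alpha_pi * delta * grad_log  # actor update, scaled by delta
    s = s_next
```

After training, the policy should prefer flipping to state 1 from state 0 and staying there afterwards. The point of the sketch is that the same scalar `delta` drives both the critic and the actor, which is exactly why the per-transition advantage actor-critic update looks identical to the TD actor-critic update.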