Suppose we have a Markov decision process with a finite state set and a finite action set. We maximize the expected discounted return with discount factor $\gamma \in [0,1]$.

In Section 3.8 of the book "Reinforcement Learning: An Introduction" (by Richard S. Sutton and Andrew Barto) it is stated that there always exists at least one optimal policy, but no proof is given.

I suppose all optimal policies yield the same optimal value function; at least that is what would make sense, and it is also what the book assumes.

Can someone give me a proof for the above statement or a link to a proof?


#### Best Answer

Assume there exist two optimal policies $\pi$ and $\pi'$ with respective value functions $V$ and $V'$, and that $V(x) \neq V'(x)$ for some state $x$. Without loss of generality, assume $V(x) < V'(x)$. Then $\pi$ is not optimal: starting from $x$, following $\pi'$ yields a strictly higher expected return. This contradicts the optimality of $\pi$, so by contradiction $V(x) = V'(x)$ for every state $x$, i.e. all optimal policies share the same optimal value function. (Existence of at least one optimal policy follows separately from the Bellman optimality equation: for $\gamma < 1$ the Bellman optimality operator is a contraction, its unique fixed point is $V^*$, and any policy that acts greedily with respect to $V^*$ attains it.)
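To illustrate the claim numerically, here is a minimal sketch (not from the book) of a hypothetical two-state MDP in which two distinct deterministic policies are both optimal; exact policy evaluation shows they induce the same value function.

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP: 2 states, 2 actions.
# In state 0, both actions lead to state 1 with reward 1, so the two
# deterministic policies differing only at state 0 are both optimal.
# State 1 is absorbing with reward 0.
# P[a][s, s'] = transition probability; R[a][s] = expected reward.
P = {0: np.array([[0.0, 1.0], [0.0, 1.0]]),
     1: np.array([[0.0, 1.0], [0.0, 1.0]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([1.0, 0.0])}

def evaluate(policy):
    """Solve V = R_pi + gamma * P_pi V exactly (finite MDP)."""
    n = len(policy)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    R_pi = np.array([R[policy[s]][s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

V1 = evaluate([0, 0])  # policy pi:  take action 0 in state 0
V2 = evaluate([1, 0])  # policy pi': take action 1 in state 0
print(V1, V2)          # both give V = [1. 0.]
```

Both policies are optimal here, and their value functions coincide, consistent with the uniqueness argument above.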

### Similar Posts:

- Solved – Why is the optimal policy non-stationary in the case finite-horizon problems, whereas it is stationary in the case of infinite-horizon problems
- Solved – Reinforcement Learning – difference between a Policy and a State transition matrix
- Solved – Proof of Bellman Optimality Equation