# Solved – Uniqueness of the optimal value function for an MDP

Suppose we have a Markov decision process with a finite state set and a finite action set, and we maximize the expected discounted reward with discount factor $$\gamma \in [0, 1]$$.
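For concreteness, these are the standard definitions of the discounted return and the state-value function of a policy $$\pi$$ (in the book's notation):

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
v_\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right].
```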

In Section 3.8 of the book "Reinforcement Learning: An Introduction" (by Richard S. Sutton and Andrew G. Barto) it is stated that there always exists at least one optimal policy, but no proof is given.

I suppose that all optimal policies yield the same optimal value function; at least, this is what would make sense, and it is also what the book assumes.

Can someone give me a proof for the above statement or a link to a proof?

## Answer

Assume that there exist two optimal policies $$\pi$$ and $$\pi'$$ with respective value functions $$V$$ and $$V'$$, and that for some state $$x$$ we have $$V(x) \neq V'(x)$$. Without loss of generality, assume $$V(x) < V'(x)$$. Then $$\pi$$ is not optimal: starting from $$x$$, following $$\pi'$$ yields a strictly higher expected return than following $$\pi$$, which contradicts the optimality of $$\pi$$. By contradiction, $$V(x) = V'(x)$$ must hold for every state $$x$$, so all optimal policies share the same optimal value function $$V^*$$.
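The uniqueness of $$V^*$$ can also be seen numerically: the Bellman optimality operator is a $$\gamma$$-contraction, so value iteration converges to the same fixed point regardless of initialization. Below is a minimal sketch on a hypothetical 2-state, 2-action MDP (all transition probabilities and rewards are made up for illustration):

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = transition probability, R[a, s] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([
    [1.0, 0.0],                 # action 0
    [0.5, 2.0],                 # action 1
])
gamma = 0.9

def value_iteration(V0):
    """Iterate the Bellman optimality operator V <- max_a (R_a + gamma * P_a V)."""
    V = V0.copy()
    for _ in range(10_000):
        Q = R + gamma * (P @ V)        # Q[a, s], action-values under current V
        V_new = Q.max(axis=0)          # greedy backup over actions
        done = np.max(np.abs(V_new - V)) < 1e-12
        V = V_new
        if done:
            break
    return V

# Two very different initial guesses converge to the same fixed point V*.
V_a = value_iteration(np.zeros(2))
V_b = value_iteration(np.full(2, 100.0))
print(np.allclose(V_a, V_b))  # True
```

This illustrates the contraction argument: since the operator has a unique fixed point, every optimal policy's value function must coincide with it.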