Following Sutton and Barto's "Reinforcement Learning: An Introduction", I am having trouble rigorously proving the Bellman optimality equation for finite MDPs.

Namely, why does $v_*(s) = \max\limits_{a \in A(s)} q_{\pi_*}(s, a)$?

My attempt to show that this is true:

Let $v_* := \max\limits_{a \in A(s)} q_{\pi_*}(s, a)$

$v_*(s) = \sum\limits_{a \in A(s)} \pi_*(a \mid s)\, q_{\pi_*}(s, a) \leq v_*$

However, I'm not sure I see why equality must hold. I'm thinking that we construct $\pi'$ such that $\pi'(s) \in \operatorname{argmax}\limits_{a \in A(s)} q_{\pi_*}(s, a)$ and $\pi'(s') = \pi_*(s')$ for all $s' \neq s$, and then show that $v_{\pi'}(s) = v_*$?

Intuitively, I see why this statement is true if we allow non-stationary policies, and the fact that the reward dynamics are stationary should mean that we can "just take" our policy to be stationary, but I don't see the clear reasoning behind this.

In a similar vein, why does $\pi_*(s) = \operatorname{argmax}\limits_{a \in A(s)} q_{\pi_*}(s, a)$ constitute an optimal policy?


#### Best Answer

Before answering your question, we need to introduce the following relationship. Let's first take a look at the definitions of the state-value function and the action-value function under a policy $\pi$:

$$\begin{align} v_{\pi}(s)&=\mathbb{E}\left[G_t \mid S_t=s\right] \\ q_{\pi}(s,a)&=\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right] \end{align}$$

where $G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ is the return at time $t$. The relationship between these two value functions can be derived as

$$\begin{align} v_{\pi}(s)&=\mathbb{E}\left[G_t \mid S_t=s\right] \\ &=\sum_{g_t} p(g_t \mid S_t=s)\,g_t \\ &= \sum_{g_t}\sum_{a}p(g_t, a \mid S_t=s)\,g_t \\ &= \sum_{a}p(a \mid S_t=s)\sum_{g_t}p(g_t \mid S_t=s, A_t=a)\,g_t \\ &= \sum_{a}p(a \mid S_t=s)\,\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right] \\ &= \sum_{a}p(a \mid S_t=s)\,q_{\pi}(s,a) \qquad (1) \end{align}$$

where $p(a \mid S_t=s)=\pi(a \mid s)$. The above equation is important: it describes the relationship between the two fundamental value functions in reinforcement learning, and it is valid for any policy. Moreover, if we have a deterministic policy, then $v_{\pi}(s)=q_{\pi}(s,\pi(s))$.
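Eq. (1) can also be checked numerically. Below is a minimal sketch in Python/NumPy on a made-up 2-state, 2-action MDP (the transition and reward numbers are hypothetical, chosen only for illustration): it evaluates an arbitrary stochastic policy exactly and verifies that $v_{\pi}(s)=\sum_{a}\pi(a \mid s)\,q_{\pi}(s,a)$ holds in every state.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP:
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, 0.5]])
gamma = 0.9

# An arbitrary stochastic policy pi(a|s).
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Exact policy evaluation: solve v_pi = r_pi + gamma * P_pi v_pi.
r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
P_pi = np.einsum('sa,sat->st', pi, P)   # state-transition matrix under pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Action values: q_pi(s,a) = R[s,a] + gamma * sum_{s'} P[s,a,s'] v_pi(s').
q_pi = R + gamma * P @ v_pi

# Check Eq. (1): v_pi(s) == sum_a pi(a|s) q_pi(s,a).
print(np.allclose(v_pi, (pi * q_pi).sum(axis=1)))  # → True
```

The check holds for any policy `pi` whose rows are valid probability distributions, which is exactly the "valid for any policy" claim above.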

Now let's start answering your question by recalling the definitions of optimal policy, optimal state-value function, and optimal action-value function:

Optimal policy: If $v_{\pi}(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$, then we say $\pi$ is better than or equal to $\pi'$. There is always at least one policy that is better than or equal to all other policies; this is an optimal policy.

Optimal state-value function: $\displaystyle v_*(s)=\max_{\pi}v_{\pi}(s)$, $\forall s \in \mathcal{S}$.

Optimal action-value function: $\displaystyle q_*(s, a)=\max_{\pi}q_{\pi}(s, a)$, $\forall s \in \mathcal{S}$ and $a \in A(s)$.

Now, let's assume we already know $q_*(s, a)$. Then the following deterministic policy, which acts greedily with respect to $q_*$, is an optimal policy:

$$\pi(a \mid s)=\begin{cases} 1, & a=\operatorname{argmax}_{a' \in A(s)}q_*(s, a')\\ 0, & \text{otherwise} \end{cases}$$

In practice, we can focus only on deterministic policies, since there always exists at least one deterministic policy that is optimal. By using this deterministic optimal policy in Eq. (1), we obtain the following important relationship:

$$v_{*}(s)=\max_{a \in A(s)}q_*(s, a)$$

This is the famous Bellman optimality equation.
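The whole argument can be seen end to end on a small example. The sketch below (Python/NumPy, on a hypothetical 2-state, 2-action MDP with made-up numbers) computes $q_*$ by value iteration, extracts the greedy deterministic policy, evaluates that policy exactly, and confirms it attains $v_*(s)=\max_{a}q_*(s,a)$.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] transitions, R[s, a] rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, 0.5]])
gamma = 0.9

# Value iteration on q: q(s,a) <- R[s,a] + gamma * sum_{s'} P[s,a,s'] max_{a'} q(s',a').
q = np.zeros((2, 2))
for _ in range(2000):
    q = R + gamma * P @ q.max(axis=1)

v_star = q.max(axis=1)        # Bellman optimality: v_*(s) = max_a q_*(s,a)
pi_star = q.argmax(axis=1)    # greedy deterministic policy

# Evaluate the greedy policy exactly: v_pi = (I - gamma * P_pi)^{-1} r_pi.
r_pi = R[np.arange(2), pi_star]
P_pi = P[np.arange(2), pi_star]
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The greedy policy attains the optimal state values.
print(np.allclose(v_pi, v_star))  # → True
```

Since the Bellman optimality operator is a $\gamma$-contraction, the iteration converges to $q_*$ regardless of the initialization; 2000 iterations are far more than enough at $\gamma = 0.9$.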