I was reading Andrew Ng's lecture notes on reinforcement learning and on page 3 he defines the value function:

$$V^{\pi}(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi\right]$$

This means the expected total payoff, given that we start in state $s$ and execute policy $\pi$. However, in his footnotes he says this notation is a little "sloppy," because $\pi$ isn't technically a random variable (even though the notation implies that both $\pi$ and $s$ are random variables; it does make sense that $s$ is a r.v., since we don't always know for sure what state we will end up in after taking some action $a$).
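To make the definition concrete, here is a minimal sketch of estimating $V^{\pi}(s)$ by Monte Carlo rollouts. The toy MDP, reward function, and policy below are made-up assumptions, not anything from the lecture notes; the point is just that the policy is a fixed deterministic map while the states along the trajectory are random:

```python
import random

# Hypothetical toy 2-state MDP. R(s) is the reward for being in state s.
R = {0: 0.0, 1: 1.0}
gamma = 0.9

# A deterministic policy pi: a fixed mapping from states to actions.
pi = {0: "go", 1: "stay"}

def step(s, a):
    """Stochastic transition: the next state is a random variable."""
    if a == "go":
        return 1 if random.random() < 0.8 else 0
    return s

def estimate_value(s, pi, n_rollouts=20000, horizon=50):
    """Monte Carlo estimate of V^pi(s) = E[sum_t gamma^t R(s_t) | s_0 = s, pi]."""
    total = 0.0
    for _ in range(n_rollouts):
        state, discount, ret = s, 1.0, 0.0
        for _ in range(horizon):
            ret += discount * R[state]
            state = step(state, pi[state])
            discount *= gamma
        total += ret
    return total / n_rollouts

print(estimate_value(0, pi))
```

Note that the expectation is taken over the random state trajectory only: $\pi$ is held fixed throughout, which is exactly why it sits after the conditioning bar as a parameter rather than as a random variable.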

My question is, **why isn't $\pi$ a random variable?** If it's not a random variable, does that mean that in reinforcement learning we are just looking for some policy that is the "best" and was chosen by "nature" somehow? Does it mean that we are not allowed to have a prior belief about which $\pi$ might be true? Is reinforcement learning, or at least the value function here, restricted to a frequentist point of view?

Would a better notation for that equation be:

$$V^{\pi}(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s; \pi\right]$$

These were my thoughts so far:

$\pi$ is the policy function: it maps states deterministically to actions, $\pi(s) = a$. However, I didn't really see why reinforcement learning had to be restricted to a frequentist interpretation. It seemed reasonable to me that $\pi$ could be a r.v. and that we could instead try to execute the expected policy over all policies, or something along those lines (I am not trying to make this idea too precise, but hopefully the concept makes sense). Is it just that Andrew Ng is introducing the concepts of reinforcement learning in a frequentist framing first, as it might be the easiest to understand?


#### Best Answer

This has nothing to do with frequentism. When the policy $\pi$ defines a distribution over actions, it is called a *stochastic policy*.

Originally, policies were not stochastic, since they were defined as mapping each state to the highest-value action. The actual policy that is followed in, say, an $\epsilon$-greedy approach is to disobey that deterministic policy and act randomly with probability $\epsilon$.
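The contrast between the two kinds of policy can be sketched in a few lines. This is a minimal illustration with made-up action values, not code from the lecture notes:

```python
import random

def greedy_policy(q_values, state):
    """Deterministic policy: always pick the highest-value action."""
    actions = q_values[state]
    return max(actions, key=actions.get)

def epsilon_greedy_policy(q_values, state, epsilon=0.1):
    """Stochastic policy: with probability epsilon, ignore the greedy
    choice and pick an action uniformly at random instead."""
    if random.random() < epsilon:
        return random.choice(list(q_values[state]))
    return greedy_policy(q_values, state)

# Hypothetical action values for a single state.
q = {"s": {"left": 0.2, "right": 0.7}}
print(greedy_policy(q, "s"))  # always "right"
```

With $\epsilon = 0$ the stochastic policy collapses to the deterministic one; for any $\epsilon > 0$ the action becomes a random variable even though the policy itself is still a fixed object we condition on.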
