I have a hard time grasping the need for policy optimization and say the log kernel trick/score function. Instead of using the score function, why do you not simply optimize for the highest reward and choose $$\pi^* = \max(\text{all actions with discounted rewards})?$$

I am learning about reinforcement learning and have grasped the basics of value and policy iteration. I would appreciate it if answers were intuitive (without math, if possible).

#### Best Answer

> Instead of using the score function, why do you not simply optimize for the highest reward and choose Policy*= Max(All actions with discounted rewards)?

You do not have the information needed to take that maximum at the start of learning. In order to know the *expected return*, or discounted sum of future rewards, you need to have measured it whilst using an already optimal policy.

You can iterate towards this goal: act with a policy based on the best estimates so far, refine those estimates given the current policy (by acting in that policy and sampling results), then refine the policy based on the better estimates. This is essentially how action-value-based methods work, such as Monte Carlo Control, SARSA or Q Learning. These are all RL solvers, but they are not always the most efficient for a given problem.
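As a rough illustration of that loop, here is a minimal sketch on a made-up 3-armed bandit (a single state, so the "return" of an action is just its noisy reward). The reward means, the exploration rate and all names are assumptions chosen for the example, not part of any particular algorithm's definition.

```python
import random

# Hypothetical 3-armed bandit; the means are made up for illustration.
TRUE_MEANS = [0.1, 0.5, 0.9]
rng = random.Random(0)

def pull(arm):
    """Sample a noisy reward for the chosen arm."""
    return rng.gauss(TRUE_MEANS[arm], 0.1)

q = [0.0, 0.0, 0.0]   # action-value estimates, uninformed at the start
counts = [0, 0, 0]
epsilon = 0.1         # exploration rate (an arbitrary choice)

for _ in range(2000):
    # Policy based on the best estimates so far, with a little exploration.
    if rng.random() < epsilon:
        arm = rng.randrange(3)
    else:
        arm = max(range(3), key=lambda a: q[a])
    # Act, sample a result, and refine the estimate for that action.
    reward = pull(arm)
    counts[arm] += 1
    q[arm] += (reward - q[arm]) / counts[arm]

greedy_arm = max(range(3), key=lambda a: q[a])
print("estimates:", [round(v, 2) for v in q], "greedy arm:", greedy_arm)
```

Note that the maximum over actions only becomes meaningful after the estimates have been refined by sampling. At the first step all three estimates are zero and the "max" tells you nothing.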

The *score* function helps to calculate a sampled estimate of the gradient of the expected return of a parametric policy with respect to its parameters. This means you can use it to perform stochastic gradient *ascent* directly on a policy, increasing its performance (on average) without necessarily needing to know the action values. The REINFORCE algorithm does not use action values at all. However, algorithms that do, such as Actor-Critic, can perform better, and still keep benefits compared to a pure action-value approach.
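To make the score function less abstract, here is a minimal REINFORCE-style sketch on a hypothetical 3-armed bandit with a softmax policy. For a softmax over per-action preferences, the score (the gradient of the log-probability of the sampled action) is simply "1 for the action taken, 0 otherwise, minus the current probabilities". The bandit, step size and names are assumptions for illustration only.

```python
import math
import random

TRUE_MEANS = [0.1, 0.5, 0.9]   # hypothetical bandit, made up for illustration
rng = random.Random(0)

def softmax(prefs):
    m = max(prefs)                         # subtract max for numerical safety
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.0, 0.0, 0.0]   # policy parameters: one preference per action
alpha = 0.1               # step size (an arbitrary choice)

for _ in range(5000):
    probs = softmax(theta)
    action = rng.choices(range(3), weights=probs)[0]
    reward = rng.gauss(TRUE_MEANS[action], 0.1)
    # Score function of a softmax policy:
    # d log pi(action) / d theta[i] = (1 if i == action else 0) - probs[i]
    for i in range(3):
        score = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += alpha * reward * score   # stochastic gradient ascent

policy_probs = softmax(theta)
print("learned action probabilities:", [round(p, 3) for p in policy_probs])
```

No action-value table appears anywhere in this loop. The policy parameters are nudged directly, using only the sampled reward and the score.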

Which is better? It depends on the problem. Sometimes it is more efficient to express a policy as a parametric function of the state. A common example of this is when there are many actions, or the action space is continuous. Getting action-value estimates for a large number of actions, and then finding the maximising action over them, is computationally expensive. In those scenarios, it will be more efficient to use a policy gradient method, and the score function is needed to estimate the gradient.
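For the continuous-action case, a minimal sketch might look like the following. It assumes a one-dimensional action, a Gaussian policy with a learnable mean and fixed spread, and a made-up reward that peaks at an action the learner does not know. None of these choices come from a specific library or paper; they are only there to show that no maximisation over actions is ever performed.

```python
import random

rng = random.Random(0)
TARGET = 2.0   # hypothetical best action, unknown to the learner

def reward_fn(action):
    """Made-up reward: highest (zero) when the action hits TARGET."""
    return -(action - TARGET) ** 2

mu = 0.0       # the only policy parameter: mean of a Gaussian over actions
SIGMA = 0.5    # fixed exploration noise (an arbitrary choice)
alpha = 0.01   # step size (an arbitrary choice)

for _ in range(5000):
    action = rng.gauss(mu, SIGMA)      # sample a continuous action
    reward = reward_fn(action)
    # Score function of a Gaussian policy with respect to its mean:
    # d log pi(action) / d mu = (action - mu) / sigma^2
    score = (action - mu) / SIGMA ** 2
    mu += alpha * reward * score       # stochastic gradient ascent

print("learned mean action:", round(mu, 2))
```

An action-value method would need an estimate for every possible real-valued action, plus an inner search for the best one. Here the policy mean just drifts towards actions that earned more reward.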

Another common scenario where direct policy refinement can be better is when the ideal policy is stochastic, e.g. in the scissor/paper/stone game. Expressing this as maximising over action values is not stable: the agent will pick one action until that is exploited against it, then pick another, and so on. In contrast, an agent using policy gradient and a softmax action choice could learn optimal ratios in an environment like scissor/paper/stone. Two such agents competing should, in theory, converge to the Nash equilibrium of equiprobable actions.
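The exploitability argument can be checked with a few lines of arithmetic. This sketch assumes the usual +1/0/-1 payoffs for win/draw/loss and an opponent who knows our policy and always plays the best counter-action. The function and variable names are made up for the example.

```python
# Payoff to "our" agent; rows are our action, columns the opponent's.
# Order of actions: stone, paper, scissors.
PAYOFF = [
    [0, -1, 1],    # stone: loses to paper, beats scissors
    [1, 0, -1],    # paper: beats stone, loses to scissors
    [-1, 1, 0],    # scissors: loses to stone, beats paper
]

def worst_case_payoff(policy):
    """Expected payoff per round against an opponent who knows our policy
    and picks the single best counter-action."""
    return min(
        sum(policy[a] * PAYOFF[a][b] for a in range(3))
        for b in range(3)
    )

greedy_policy = [1.0, 0.0, 0.0]        # always plays one "best" action
mixed_policy = [1 / 3, 1 / 3, 1 / 3]   # equiprobable actions

print("greedy worst case:", worst_case_payoff(greedy_policy))
print("mixed worst case:", worst_case_payoff(mixed_policy))
```

Any deterministic choice can be made to lose every round, whichever action it settles on. The equiprobable policy cannot be pushed below break-even, and that is something a softmax policy can represent while a plain maximum over action values cannot.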

Conversely, sometimes action-value methods will be the more efficient choice. There might be a simpler relationship between optimal action value and state than between policy and state. A good example of this might be a maze solver (with reward -1 per time step). The mapping between action value and state is just related to the distance to the exit. The policy, by contrast, has no obvious direct relation to the state, except when expressed as taking the action that minimises that distance.
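A small sketch of the maze case, under some simplifying assumptions: a tiny hand-written grid, deterministic moves, no discounting, and moves into walls simply not offered. Here the values are computed directly from distances rather than learned, just to show how simple the value structure is. The grid and all names are made up.

```python
from collections import deque

# Hypothetical maze: '#' is a wall, 'E' is the exit, '.' is open floor.
MAZE = [
    "..#",
    "..#",
    "#.E",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def is_open(cell):
    r, c = cell
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#"

def neighbours(cell):
    """Yield (move name, resulting cell) for every legal move."""
    for name, (dr, dc) in MOVES.items():
        nxt = (cell[0] + dr, cell[1] + dc)
        if is_open(nxt):
            yield name, nxt

# Breadth-first search outwards from the exit gives the step distance.
exit_cell = next((r, c) for r, row in enumerate(MAZE)
                 for c, ch in enumerate(row) if ch == "E")
distance = {exit_cell: 0}
queue = deque([exit_cell])
while queue:
    cell = queue.popleft()
    for _, nxt in neighbours(cell):
        if nxt not in distance:
            distance[nxt] = distance[cell] + 1
            queue.append(nxt)

# With reward -1 per step and no discounting, optimal state value = -distance.
value = {cell: -d for cell, d in distance.items()}

def action_values(cell):
    # Q(s, a) = reward for the step (-1) plus the value of where we land.
    return {name: -1 + value[nxt] for name, nxt in neighbours(cell)}

def greedy_action(cell):
    q = action_values(cell)
    return max(q, key=q.get)

print(action_values((1, 1)), "->", greedy_action((1, 1)))
```

The values follow one simple rule everywhere (minus the number of steps to the exit), while the greedy action flips between up, down, left and right from cell to cell with no pattern of its own.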
