Most examples of Q-learning I have seen are set in a deterministic world. In the traditional grid world, for instance, the agent can eventually find a path by exploring and exploiting the environment using a reward function, without knowing the transition probability function:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Now suppose the grid is a stochastic environment in which the agent moves up/left/right with probability 1/3 each. How can I program Q-learning for this? Does it mean that, in computing $\max_{a'} Q(s',a')$, I should use

$$\max_{a'} Q(s',a') = \max\left[ P(\text{up}) \, Q(s',\text{up}),\; P(\text{left}) \, Q(s',\text{left}),\; P(\text{right}) \, Q(s',\text{right}) \right]?$$


#### Best Answer

Q-learning permits an agent to choose its actions stochastically (according to some distribution). In that case, the reward is the expected reward under that distribution of actions. I think this fits your case above.

Q-learning also permits actions that may fail. Hence, taking the action behind $Q(s, \text{left})$ might lead you to a state $s'$ that is not to the left of $s$ (i.e. the action "fails" with some probability). In that case, the model (MDP, table of Q-values, automaton) encodes the possibility of failure directly, and no distributions or expected values are needed.
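To make this concrete, here is a minimal sketch of tabular Q-learning in a toy stochastic environment. The corridor layout, the `SLIP` probability, and all hyperparameter values are assumptions chosen for illustration, not part of the question. The key point is that the standard update is unchanged: the agent simply samples $s'$ from the environment, and that sampling implicitly averages over the transition probabilities, so no $P(\cdot)$ terms appear in the update.

```python
import random

# Hypothetical 1-D corridor gridworld (illustrative only):
# states 0..4, goal at state 4, reward 1 on reaching the goal.
# Stochasticity: with probability SLIP the chosen action is replaced
# by a uniformly random action, i.e. the action can "fail".
ACTIONS = ["left", "right"]
N_STATES, GOAL, SLIP = 5, 4, 1.0 / 3.0

def step(state, action):
    """Sample the next state; the stochasticity lives in the environment."""
    if random.random() < SLIP:
        action = random.choice(ACTIONS)  # action "fails"
    nxt = max(0, state - 1) if action == "left" else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(2000):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        a = (random.choice(ACTIONS) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        s2, r, done = step(s, a)
        # Standard Q-learning update: no transition probabilities needed;
        # sampling s' from the environment averages over them implicitly.
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# After training, moving toward the goal should be valued more highly,
# e.g. compare Q[(3, "right")] with Q[(3, "left")].
```

The weighted-max formula in the question would instead be part of a model-based method (e.g. value iteration, which computes $\sum_{s'} P(s'|s,a)\max_{a'} Q(s',a')$ over known transition probabilities); model-free Q-learning avoids needing $P$ at all.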
