I'm trying to solve the same problem with different algorithms (Travel max possible distance with a car). While using value iteration and policy iteration I was able to get the best results possible but with Q-learning it doesn't seem to go well.

My algorithm looks like this

`alpha = 0.2 # learning rate gamma = 0.9 # discount factor Q = np.zeros(shape=(70, 7)) # tank size X steps, actions space size theta = 0.01 # track Q-function updates # while Q is not converged for i in count(): delta = 0 # max Q update for the episode s = 0 # starting state while True: # running an episode # exploration vs exploitation a = get_next_action(Q, s, eps=eps) # random or follow policy # perform action chosen, receive reward and new state r, s_ = get_action_reward(a, s) max_q = maximize_Q(s_) # maximum Q value we can get from state s_ q_update = alpha*(r + gamma*max_q - Q[s, a]) Q[s, a] += q_update s = s_ # do to next state delta = max(delta, np.abs(q_update)) if s is None: # while s is not terminal break # check Q for convergence if delta < theta: print(f'Q-function converged after {i} iterations') break # exploit more with each iteration eps = min_eps + (max_eps - min_eps)*np.exp(-decay_rate*i) `

My reward is `distance_travelled`

or `-step_number`

(if we did not move).

Available actions are

`ACTIONS = ( (30, 0), # buy 30L without driving (20, 0), (10, 0), (0, 0), # have some rest (-10, 10), (-20, 20), (-30, 30), # burn 30L of fuel to drive 30 km ) `

And the size of a tank is limited, so you cannot have more than *60L* in it.

What might be wrong with Q-algorithm itself or its hyperparameters, so it does not give me `150km`

result which is the best for `10`

timesteps

**Contents**hide

#### Best Answer

There are a few places where this can go wrong, but this really stands out:

`# check Q for convergence if delta < theta: print(f'Q-function converged after {i} iterations') break `

It looks like a hold-over from value iteration, where it would be valid. For Q learning this is not correct at all. Under Q learning, you are sampling random episodes, with an emphasis on behaving "close to current best guess at optimal" and the corrections you need to apply to the Q table for any one episode might be close to zero purely by chance. This is very different to value iteration which cycles through all possible states and can assess the corrections it makes using a known transition model.

There is no convenient end point for Q learning where it can tell, by internal measure only during training, that it has converged to the action values for an optimal policy.

What you can do is every few hundred episodes, assess the agent by running the greedy policy (i.e. set $epsilon = 0$ in your action choices). As your environment is deterministic and always starts in the same state, you can just take a single run for this test. If the result is optimal *and* the Q table predicts this accurately, then it has converged.

Often with Q learning on more complex real problems, you will not know if the agent has converged to a true optimum. The best you can do is plot performance over time and show:

It is no longer improving

The Q values it predicts are close to the ones seen. This can be a normal loss function such as mean square error, perhaps averaged over a few hundred episodes in case of stochastic environments.

It is as good or better than other agents in the same environment

While using value iteration and policy iteration I was able to get the best results possible but with Q-learning it doesn't seem to go well.

When you only have a small numbers of states and actions, plus have access to the transition model (which Q-learning does not use), then this is a valid result.

A good analogy might be learning the probability distribution of rolling a number of dice and summing them:

Value iteration is like a recursive algorithm which you have told how the results from a single die work, so it can effectively build a probability tree and sum up all the options.

Q-learning is like rolling the dice multiple times and making a histogram. This is always going to be an approximation to the true result, and take longer than probability-based calculations.

However, value iteration reaches limits of applicability in terms of complexity much sooner than Q learning. It is possible to sample from environments where the underlying model is very complex, or unknown. It is also possible to sample from environments where the full state space is too large to loop through completely in reasonable time.

### Similar Posts:

- Solved – How to combine stochastic policy with Q-value Iteration
- Solved – How to correctly compute $rho$ in reinforcement learning with importance sampling
- Solved – Friend or Foe Q Learning Algorithm Q-Value Update
- Solved – Updating q-values in q-learning
- Solved – Reinforcement learning for continuous states, discrete actions. Algorithms