Intuitively I feel like it should be similar to that of a Hidden Markov Model, but RL training data isn't labeled.

What is the shape/architecture of the RL input data?


#### Best Answer

Reinforcement learning is a collection of different approaches/solutions to problems framed as Markov Decision Processes.

As such, although terminology might vary, there are a few named components that need to exist, some of which are data that drives the learning process. Using the terminology from Sutton & Barto's Reinforcement Learning: An Introduction, the basic premise could be explained like this:

A Reinforcement Learning problem consists of an Agent that exists within an Environment. On each Time Step the Agent can observe a current State and take Actions; as a consequence of those actions, it receives a Reward signal and can observe an updated State. The Agent's learning objective is to create a Policy (a mapping from State to Action) that maximises the expected Reward received over time.

The Agent and Environment are not data as such; although they might be modelled in a system that uses some other data source, that is not part of the RL problem.

The Policy results from the RL model, so it is not input data. There are many ways to attempt to learn optimal policies in RL. Well-known algorithms such as SARSA and Q-learning do so by creating a model that estimates *action-values*, or the expected long-term Reward from taking one Action from a given State and following the same Policy from then on. This estimate, if the algorithm has worked correctly, allows the agent to simply take a local decision based on current state, using the predicted total reward (sometimes called the Return, or the Utility) as a guide.

Getting back to the data that RL processes, you have States, Actions and Reward:

The State $S_t$ should summarise all observations (and maybe history of observations, if relevant) that influence the outcome of the next action at time step $t$. Getting the representation correct is critical to RL working correctly – the state must have the Markov property, or closely approximate it. If you have a small number of states, then you can just enumerate them. Otherwise, you typically build a feature vector that represents the state.
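To make the two options concrete, here is a minimal sketch in Python. The grid-world size and the cart-pole-style observation are illustrative assumptions, not from any particular environment:

```python
# Option 1: enumerated states. A small, discrete state space can just be
# indexed, e.g. a hypothetical 4x4 grid world with cells numbered 0..15.
n_states = 16
state = 5          # the agent is currently in cell 5

# Option 2: a feature vector. Larger or continuous problems describe the
# state numerically, e.g. a cart-pole-style observation of
# [position, velocity, pole angle, angular velocity].
state_vector = [0.02, -0.31, 0.045, 0.29]
```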

The Action choice $A_t$ represents what the agent does. If there is a discrete set of choices you can enumerate them, or perhaps turn them into a one-hot-encoded vector. Otherwise, if the actions set continuous values, then they can also be a feature vector like the State.
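A short sketch of the same three action encodings; the action meanings (up/down/left/right, throttle/steering) are made up for illustration:

```python
# Discrete actions, enumerated: e.g. 0=up, 1=down, 2=left, 3=right
n_actions = 4
action = 2                       # "left"

# The same action as a one-hot vector, often used as neural-net input
one_hot = [0.0] * n_actions
one_hot[action] = 1.0            # -> [0.0, 0.0, 1.0, 0.0]

# Continuous actions are just a feature vector, like the State:
continuous_action = [0.7, -0.2]  # e.g. throttle and steering settings
```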

The immediate Reward $R_t$ following each State/Action is simply a real value (i.e. $R_t \in \mathbb{R}$).

Convention has it that an action taken at time step $t$ results in a reward and next state at $t+1$, so the sequence of data is often written $S_t, A_t, R_{t+1}, S_{t+1}…$ or, looking at a single transition with the next action included, $S, A, R, S', A'…$ (this is how SARSA got its name).

In terms of what data an RL algorithm needs to learn, in general you need to store *trajectories*, so an ordered list of $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, S_3$ etc, with $S$ and $A$ values either as enumeration or vector values and rewards as scalar values. Clearly you can store this in a table, per time step, if you want, or in separate arrays of state, action and reward indexed by time step.
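The "separate arrays indexed by time step" option might look like the following sketch. The values are placeholders, not output from a real environment; note the offset convention that $R$ and $S$ at index $t+1$ follow the action at index $t$:

```python
# Parallel lists indexed by time step, storing one trajectory.
states  = [0, 1, 1, 2]               # S_0, S_1, S_2, S_3 (enumerated)
actions = [1, 0, 1, None]            # A_0, A_1, A_2; no action at final step
rewards = [None, -1.0, -1.0, 10.0]   # R_1, R_2, R_3; no reward at t=0

def transition(t):
    """Pick one (S, A, R, S') slice out of the stored trajectory."""
    return states[t], actions[t], rewards[t + 1], states[t + 1]
```

Storing the data this way makes the "slices" mentioned below trivial to extract for any algorithm.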

Different learning algorithms will want different access to this list, depending on how they do their calculations. For instance, Monte Carlo control algorithms process entire "episodes" at once, where there must be a final outcome. Basic SARSA(0) processes $S, A, R, S', A'$ subsets, and basic single-step Q-learning processes $S, A, R, S'$ only (it selects the best choice for $A'$ when it makes its learning updates). More complex versions of these algorithms, SARSA($\lambda$) and Q($\lambda$), need longer trajectories, although they have *online* versions that incorporate simple memory vectors, so you typically don't need to actually store those longer trajectories.
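To show exactly which data slice each single-step algorithm consumes, here is a hedged sketch of the two tabular update rules; `Q` is an action-value table, and `alpha` (learning rate) and `gamma` (discount factor) are standard hyperparameters with illustrative defaults:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA(0): consumes one (S, A, R, S', A') slice, updating Q in place."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning: consumes only (S, A, R, S'), maximising over A' itself."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```

The only difference between the two is whether the next action's value comes from the stored trajectory (SARSA) or from a max over the table (Q-learning).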

However, pretty much all RL algorithms work by taking some kind of "slice" out of the longer series of states, actions and rewards. Any data format that allows that access in a general sense – e.g. picking out arbitrary $S_t, A_t, R_{t+1}, S_{t+1}…S_{t+n}, A_{t+n}$ – will be compatible with almost any RL algorithm. In some cases it would be overkill to support this though, as many good RL algorithms are online and only need to access data from previous and current time steps (whilst updating predicted Return on the previous state and action).

Choosing between enumerating states/actions or representing by descriptive vectors has other consequences. The tabular versions of the algorithms work with enumerated states and actions, and are generally more robust – you can prove they will learn optimal behaviour. However, they have limited application because in most interesting problems the possible number of states is too high to use enumeration.

If you use vector descriptions of actions and states, then you must also use some form of function approximator (e.g. a neural net), similar to supervised learning, in order to estimate Return/Utility from a given $(S,A)$ pair. However, you don't need to store any supervised *label* to train this approximator, instead the RL algorithm will generate a target when you feed it the $S,A,R…$ data associated with the pair being learned. With these more complex RL systems, it is often a good idea to store all the samples and train the approximator on batches sampled randomly from this history – this is called experience replay.
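A minimal sketch of such an experience replay buffer, using only the standard library; the capacity and tuple layout are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (S, A, R, S') samples and serve random mini-batches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest samples drop off

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive
        # time steps, which helps stabilise function-approximator training.
        return random.sample(self.buffer, batch_size)
```

The approximator is then trained on each sampled batch, with the RL algorithm (e.g. the Q-learning target above) generating the training target for each tuple.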
