In much of the machine learning literature, the systems being modelled are instantaneous: `Inputs -> outputs`, with no notion of impact from past values.

In some systems, inputs from previous time-steps are relevant, e.g. because the system has internal states/storage. For example, in a hydrological model, you have inputs (rain, sun, wind), and outputs (streamflow), but you also have surface- and soil-storage at various depths. In a physically-based model, you might model those states as discrete buckets, with inflow, out-flow, evaporation, leakage, etc. all according to physical laws.

If you want to model streamflow in a purely empirical sense, e.g. with a neural network, you *could* just create an instantaneous model, and you'd get OK first-approximation results (and actually in land surface modelling, you could easily do better than a physically based model…). But you would be missing a lot of relevant information: streamflow is inherently lagged relative to rainfall, for instance.

One way to get around this would be to include lagged variants of input features, e.g. if your data is hourly, then include `rain over the last 2 days`, `rain over the last month`. These inputs do improve model results in my experience, but it's basically a matter of experience and trial-and-error as to how you choose the appropriate lags. There is a huge array of possible lagged variables to include (straight lagged data, lagged averages, exponential moving windows, etc.; multiple variables, with interactions, and often with high covariances). I guess theoretically a grid-search for the best model is possible, but this would be prohibitively expensive.
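For concreteness, here is a minimal sketch of what such lagged inputs might look like. The helper names (`lagged`, `trailing_sum`) and the tiny rainfall series are made up for illustration; with real hourly data the windows would be more like 48 steps (2 days) or 720 steps (roughly a month).

```python
def lagged(series, lag):
    """Value `lag` steps ago; None where there is not enough history."""
    return [series[t - lag] if t >= lag else None for t in range(len(series))]


def trailing_sum(series, window):
    """Sum of the last `window` values up to and including step t.

    Entries are None until a full window of history is available.
    """
    out, running = [], 0.0
    for t, x in enumerate(series):
        running += x
        if t >= window:
            running -= series[t - window]  # drop the value that left the window
        out.append(running if t >= window - 1 else None)
    return out


# Hypothetical hourly rainfall in mm (illustrative values only)
rain = [0.0, 2.0, 5.0, 1.0, 0.0, 0.0, 3.0, 0.0]

features = {
    "rain_lag1": lagged(rain, 1),
    "rain_lag3": lagged(rain, 3),
    "rain_sum3": trailing_sum(rain, 3),
}
```

Every choice of lag or window adds another candidate column, which is where the combinatorial explosion comes from.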

I'm wondering **a)** if there is a reasonable, cheapish, and relatively objective way to select the best lags to include from the almost infinite choices, or **b)** if there is a better way of representing storage pools in a purely empirical machine-learning model.


#### Best Answer

If we want to look at lags over a long period in the past (or features derived from them, like exponential moving averages or interactions between them), there will be a large number of candidate features. As you correctly mention, a grid search would be expensive, even if you only want to train a simple linear regression model. One approach that can be very helpful in this case is to use a fast, suboptimal feature selection method. For example, you can use greedy backward subset selection, greedy backward/forward selection, or Lasso feature selection. These can be much faster, so you can potentially consider a much larger number of features. Based on my personal experience, and also according to this paper, greedy backward/forward selection outperforms Lasso when the features are highly correlated: http://papers.nips.cc/paper/3586-adaptive-forward-backward-greedy-algorithm-for-sparse-learning-with-linear-models.pdf
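To give a feel for how cheap a greedy search can be, here is a minimal sketch of plain greedy forward selection with ordinary least squares. It is not the adaptive forward-backward algorithm from the linked paper (there are no backward steps), and the function name `forward_select` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def forward_select(X, y, max_features):
    """Greedy forward selection: at each step, add the column that gives the
    lowest residual sum of squares when fitted together with those already chosen."""
    n, p = X.shape
    selected = []
    for _ in range(min(max_features, p)):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # Least-squares fit with an intercept plus the candidate subset
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


# Synthetic check: only columns 2 and 7 actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=200)

chosen = forward_select(X, y, max_features=2)
```

Each step costs one small regression per remaining candidate, so the total work grows roughly with (number of candidates) × (number of features kept) rather than with the number of possible subsets. In practice you would choose when to stop using held-out error rather than a fixed `max_features`.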

Another intuition that can be helpful in many time series is that the further we look into the past, the less the exact timing matters. In your example, the impact of rainfall on streamflow, the rainfall today and the rainfall yesterday will probably have different coefficients in your model. But the rainfall 365 days ago and 366 days ago will probably have the same impact on the flow today. This makes it natural to apply transforms or feature engineering techniques that aggregate the data over time. For example, you can use a grid of exponential moving averages (AR(1) systems or IIR(1) filters in signal processing terms) as new features to model long-term memory, followed by a linear combination of lagged data (FIR filters) to model short-term memory. Note that you don't have to include all the generated features in your model, and it is a good idea to perform feature selection to pick just a few of the AR(1) systems and lags. I used a scheme similar to what I described above to extract features from multiple time series as shown in the following diagram.
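A minimal sketch of such a feature bank, assuming the exponential moving averages are parameterised by half-life (one convenient choice, not the only one). The names `ema` and `build_features`, the half-lives, and the rainfall values are all illustrative.

```python
def ema(series, alpha):
    """First-order recursive filter: s[t] = alpha * x[t] + (1 - alpha) * s[t-1].

    The state is initialised to the first value of the series.
    """
    out, s = [], series[0]
    for x in series:
        s = alpha * x + (1.0 - alpha) * s
        out.append(s)
    return out


def build_features(series, half_lives, n_lags):
    """One row per time step: the n_lags most recent values (short-term memory)
    followed by one exponential moving average per half-life (long-term memory)."""
    # A half-life of h steps means the weight of an old value halves every h steps
    emas = [ema(series, 1.0 - 0.5 ** (1.0 / h)) for h in half_lives]
    rows = []
    for t in range(n_lags - 1, len(series)):
        recent = [series[t - k] for k in range(n_lags)]  # x[t], x[t-1], ...
        rows.append(recent + [e[t] for e in emas])
    return rows


# Hypothetical hourly rainfall in mm (illustrative values only)
rain = [0.0, 4.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
rows = build_features(rain, half_lives=[1, 4, 16], n_lags=3)
```

A handful of half-lives spaced roughly geometrically covers a wide range of memory lengths with few columns, and the resulting columns can then go through whatever feature selection step you prefer.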

Another technique that is commonly used is to apply an unsupervised feature extraction method to the time series data. For example, you can use PCA and keep only the most dominant principal components, or you can use a discrete Fourier (or cosine) transform and keep only the strongest components. There are other transforms, like the Haar wavelet, that can be useful in certain domains for extracting features from time series.
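As a rough sketch of the PCA variant: stack sliding windows of the series into a matrix, centre it, and project each window onto the leading principal directions. The function name `pca_window_features`, the window length, and the synthetic series are assumptions chosen for illustration.

```python
import numpy as np


def pca_window_features(series, window, n_components):
    """Project sliding windows of a series onto their top principal components."""
    x = np.asarray(series, dtype=float)
    # Row i holds x[i], ..., x[i + window - 1]
    W = np.stack([x[i:i + window] for i in range(len(x) - window + 1)])
    centred = W - W.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value
    _, S, Vt = np.linalg.svd(centred, full_matrices=False)
    components = Vt[:n_components]
    scores = centred @ components.T          # one low-dimensional row per window
    explained = S ** 2 / np.sum(S ** 2)      # fraction of variance per component
    return scores, components, explained[:n_components]


# Synthetic series: a 24-step cycle plus a little noise
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 24) + 0.05 * rng.normal(size=200)

scores, components, explained = pca_window_features(series, window=48, n_components=2)
```

Here 48 raw lags are compressed into 2 columns of `scores`. Because the extraction is unsupervised, the retained components are the ones that explain the most variance in the inputs, which is not guaranteed to be what predicts the target best.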