I cannot be specific about the nature of the data as it is proprietary, but suppose we have data like this: Each month, some people sign up for a service. Then, in each subsequent month, those people may upgrade the service, discontinue the service or be denied the service (e.g. for failure to pay). For the earliest cohort in our data, we have about 2 years of data (24 months).
The number of people joining each month is large (in the 100,000 range) and the number doing any of the three things in a given month is in the thousands. However, we are not using the individual-level data (which would be millions of rows) but data aggregated by month and cohort (what proportion of each cohort does each thing each month).
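(For concreteness, here is a made-up sketch of the shape of such aggregated data; the column names and values are invented for illustration, not the real proprietary schema.)

```python
import pandas as pd

# Made-up illustration only -- not the real (proprietary) schema or values.
# One row per (cohort, month since signup), with event proportions.
df = pd.DataFrame({
    "cohort":        ["2012-01", "2012-01", "2012-02"],
    "month_on_book": [1, 2, 1],
    "n_signups":     [103_000, 103_000, 98_000],
    "p_upgrade":     [0.012, 0.018, 0.011],
    "p_discontinue": [0.020, 0.025, 0.019],
    "p_denied":      [0.005, 0.008, 0.004],
})
```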
We've been modelling the existing data using multivariate adaptive regression splines (MARS) and finding some interesting results. However, I am worried about using these models to extrapolate or predict into the future. My concern is that predictions into the future are necessarily outside the sample space (in terms of time), and splines can become unstable under extrapolation.
Is this a legitimate method? What concerns are there and can they be addressed?
Best Answer
As I interpret it, the underlying question you are asking is whether or not you can model time as a spline.
The first question I will attempt to answer is whether or not you can use splines to extrapolate your data. The short answer is that it depends, but the majority of the time splines are not that great for extrapolation. Splines are essentially an interpolation method: they partition the space your data lies in, and on each partition they fit a simple regressor. So let's look at the MARS method. The MARS model is defined as $$\hat f(x) = \sum_{i=1}^{n} \alpha_i B_i(x_{[i]})$$ where $\alpha_i$ is the coefficient of the $i$-th term in the MARS model, $B_i$ is the basis function of the $i$-th term, and $x_{[i]}$ is the feature selected from your feature vector for the $i$-th term. Each basis function is either a constant or a hinge function (rectifier). The hinge function is simply $$\max(0, x_{[i]} + c_i)$$ The hinge functions force the model to be a piecewise linear function (it is interesting to note that a neural network with a rectified linear activation function can be seen as a superset of the MARS model).
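If you want to play with this yourself, here is a minimal fitting sketch. I'm assuming the py-earth package here, which is just one MARS implementation, and the toy data is made up:

```python
import numpy as np
from pyearth import Earth  # assumes the py-earth package is installed

# Toy 1-D regression problem, just to show the fitting interface
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

model = Earth(max_terms=10)
model.fit(X, y)
print(model.summary())  # lists the fitted hinge terms and their coefficients
```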
So, to get back to why splines are usually not that great for extrapolation: once the point you need to extrapolate lies past the boundaries of the interpolation, either only a very small part of your model is "activated" or a very large part of it is, and the power of the model disappears (because of the lack of variation). To get a little more intuition about this, let us pretend that we are trying to fit a MARS model to a feature space lying in $\mathbb{R}$: given one number, we try to predict another. Suppose the MARS model comes up with a function that looks something like this: $$\hat f(x) = 5 + \max(0, x - 5) + 2\max(0, x - 10)$$
If the extrapolation occurs past the number $10$, the function becomes $$\hat f(x) = 5 + (x - 5) + 2(x - 10) = 3x - 20$$ The MARS model we had before now boils down to a single linear function, and the power of the MARS model disappears (this is the case where the majority of terms "activate"). The same thing happens for extrapolation before the number $5$: the output of the MARS model is then simply the constant $5$ (the case where no terms activate). This is why, the majority of the time, splines are not suited for extrapolation. It also explains the problem you mentioned in the comments of your post, about extrapolated predictions being "very off for new values" and tending to "continue in the same direction" for different time series.
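A quick numeric check of this collapse, coding up the toy model above directly:

```python
import numpy as np

def f_hat(x):
    # The toy MARS model from above: 5 + max(0, x - 5) + 2*max(0, x - 10)
    return 5 + np.maximum(0, x - 5) + 2 * np.maximum(0, x - 10)

x = np.array([0.0, 3.0, 12.0, 20.0, 100.0])
print(f_hat(x))        # [  5.   5.  16.  40. 280.] -- constant below x = 5
print(3 * x[2:] - 20)  # [ 16.  40. 280.] -- exactly linear past x = 10
```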
Now let's get back to time series. Time series are a pretty special case in machine learning. They tend to have some structure, whether it be partial invariance or one of many other kinds of substructure, and this structure can be exploited. But special algorithms are needed to exploit it, and unfortunately splines do not.
There are a couple of things I would recommend you try. The first is recurrent networks. If your time series is not that long (and does not have long-term dependencies), you should be able to get away with a simple vanilla recurrent network. If you want to understand what is happening, you could use a rectified linear unit with biases as the activation function; that is equivalent to doing MARS modelling on the subset of the time series plus the "memory" the recurrent net holds. It would be hard to interpret how the memory is managed by the net, but you should gain some idea of how the subspace is handled with respect to the piecewise linear function generated. Also, if you have static features that do not belong to the time series, it is relatively easy to still use them in the net.
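As a sketch of what that might look like (I'm using PyTorch purely for illustration, and the input/hidden sizes are placeholder guesses for your three event proportions):

```python
import torch
import torch.nn as nn

# Vanilla recurrent net with ReLU activation, as described above.
# 3 inputs per month (the three event proportions); hidden size is a guess.
rnn = nn.RNN(input_size=3, hidden_size=16, nonlinearity="relu", batch_first=True)
head = nn.Linear(16, 3)  # predict next month's three proportions

x = torch.randn(8, 24, 3)    # batch of 8 cohorts, 24 months, 3 proportions
out, h_n = rnn(x)            # out: (8, 24, 16) hidden state for each month
pred = head(out[:, -1, :])   # prediction from the last month's hidden state
print(pred.shape)            # torch.Size([8, 3])
```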
If the time series you have is very long and might have long-term dependencies, I recommend using one of the gated recurrent networks, such as GRU or LSTM.
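In most frameworks that is a one-line swap from the vanilla sketch above (again PyTorch, same placeholder sizes):

```python
import torch.nn as nn

# Same shape conventions as the vanilla RNN sketch above; the gating
# (and, for LSTM, a cell state) helps with long-term dependencies.
lstm = nn.LSTM(input_size=3, hidden_size=16, batch_first=True)
# gru = nn.GRU(input_size=3, hidden_size=16, batch_first=True)  # alternative
```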
On the more classical side of time series modelling, you could use hidden Markov models. I won't go further into these, because I am not as familiar with them.
In conclusion, I would not recommend using splines, for two reasons. One, they cannot handle complicated extrapolation problems, which seems to be exactly the problem you are describing. And two, splines do not exploit the substructure of time series, which can be very powerful in time series prediction.
Hope this helps.