Solved – In sample splitting for time-series data, do we randomly select data?

I'm having a hard time conceptually understanding how to do this. I would like to do my own sample splitting (not use the method built into a package).

Let's say you have 80 days of weather data, and you want to use the 3 prior days of data to predict the 4th day's weather. This gives you 77 observations in total. Say you want to keep 20 for validation and 17 for testing, leaving 40 for training. What do we generally do next?
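To make the setup concrete, here is a minimal sketch of how the 80 days turn into 77 supervised observations. The temperature series is made up for illustration; only the windowing logic matters:

```python
import numpy as np

# Hypothetical daily temperature series: 80 days of weather data.
rng = np.random.default_rng(0)
temps = rng.normal(15, 5, size=80)

# Build supervised pairs: each row uses 3 prior days as features,
# and the 4th day's value as the target.
window = 3
X = np.array([temps[i:i + window] for i in range(len(temps) - window)])
y = temps[window:]

print(X.shape, y.shape)  # (77, 3) (77,)
```

Each of the 77 rows is still ordered in time, which is exactly why the splitting question matters.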

Would we just randomly select 40 out of 77 and use it to train? And then randomly select 20 for validation (which will be used to tune our hyperparameters)?

Or do we usually use the first 40 observations to train, next 20 for validation, and final 17 for testing?

You don't randomly split time-series datasets, because shuffling doesn't respect the temporal order and causes data leakage, e.g. unintentionally inferring the trend from future samples.

One approach is as you suggested: the first 40 for training, the next 20 for validation, and the final 17 for testing. A similar alternative is time-series cross-validation, e.g. fold 1: 40 train + 5 validation, fold 2: 45 train + 5 validation, and so on, all respecting the temporal order. And while testing, you can partition the final 17 the same way and roll forward through them as you did in cross-validation, again respecting the temporal order.
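The two schemes above can be sketched with plain index arithmetic; the fold size of 5 matches the example in the answer, and everything else follows the question's 40/20/17 numbers:

```python
import numpy as np

n = 77  # windowed observations, already in temporal order
indices = np.arange(n)

# Chronological split: no shuffling, so every validation/test point
# comes strictly after every training point.
train, valid, test = indices[:40], indices[40:60], indices[60:]

# Expanding-window cross-validation over the first 60 observations:
# fold 1 trains on 0..39 and validates on 40..44,
# fold 2 trains on 0..44 and validates on 45..49, and so on.
folds = [(indices[:start], indices[start:start + 5])
         for start in range(40, 60, 5)]

for tr, va in folds:
    print(f"train 0..{tr[-1]}, validate {va[0]}..{va[-1]}")
```

Libraries offer the same idea ready-made (e.g. scikit-learn's `TimeSeriesSplit`), but the manual version makes the temporal ordering explicit.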
