I'm fitting some machine learning algorithms (e.g. an SVM) to my panel data. Fitting on the entire dataset takes too long, so I'm considering drawing smaller samples via bootstrapping and then fitting the SVM on them in parallel.

At first glance, it may seem that bootstrapping will destroy the time dimension of my data. However, if I include lagged variables and moving averages as features, can I just bootstrap as normal (i.e. treat each observation as independent)?
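To make the question concrete, here is a minimal sketch (assuming a hypothetical pandas DataFrame with an `entity` column and a feature `x`) of the kind of lag and moving-average features being described. Note that they must be computed within each panel unit so values do not leak across entities:

```python
import pandas as pd

# Hypothetical panel: one row per (entity, period) with a feature x.
df = pd.DataFrame({
    "entity": ["A"] * 5 + ["B"] * 5,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Lags and moving averages computed per entity, in time order.
g = df.groupby("entity")["x"]
df["x_lag1"] = g.shift(1)                                  # previous period's value
df["x_ma3"] = g.transform(lambda s: s.rolling(3).mean())   # 3-period moving average
```

With features like these, each row carries some of its own recent history, which is what motivates the question of whether rows can then be resampled independently.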


#### Best Answer

Generally, if your data have a complicated structure, bootstrap sampling should reflect how the real sampling was done. For example, with clustered or hierarchical data you should resample whole clusters rather than individual observations, because resampling individuals gives you a biased sample (e.g. Ren et al., 2010; Field and Welsh, 2007). For time series, two approaches have been suggested in the literature:

- Model-based resampling – fit a model to the data, construct residuals from the fit, and then generate new series by resampling those residuals, so that each draw contains "approximate disturbances" of the kind found in your data.
- Block resampling – split the series into blocks of consecutive observations (e.g. $x_{t-1}, x_t, x_{t+1}$) and resample whole blocks with replacement, so that the local time structure within each block is preserved.
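As a rough illustration of the second approach, here is a minimal sketch of a moving-block bootstrap for a single series (the function name and block length are my own choices, not a standard API):

```python
import numpy as np

def moving_block_bootstrap(x, block_len, seed=None):
    """One moving-block bootstrap replicate of a 1-D series.

    Draws overlapping blocks of length `block_len` uniformly with
    replacement and concatenates them until the replicate has the
    same length as the original series.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    # Each block starts at a random position where a full block fits.
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [x[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]

series = np.arange(100, dtype=float)
rep = moving_block_bootstrap(series, block_len=5, seed=0)
```

Inside each drawn block the original ordering is intact, so short-range dependence survives the resampling; the choice of `block_len` trades off how much of the time structure is preserved against how much the replicates vary.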

You can find nice examples of both resampling strategies in the classic book by Efron, or in the great handbook by Davison and Hinkley.

However, as Andrew M also mentioned, the bootstrap means sampling $N$ observations with replacement out of $N$ observations, so it won't help with making the dataset smaller – it will make things worse, because you would end up with $N \times R$ draws in total.
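A quick numerical sketch of this point (the sizes $N$ and $R$ here are arbitrary, chosen just for illustration): each replicate has the full sample size $N$, so $R$ replicates together mean strictly more rows to fit than the original data, not fewer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 1_000, 50
data = rng.normal(size=N)

# Each bootstrap replicate draws N rows with replacement, so the
# R replicates together contain N * R rows in total.
replicates = [rng.choice(data, size=N, replace=True) for _ in range(R)]
total_rows = sum(len(r) for r in replicates)
```

So fitting an SVM to every replicate, even in parallel, means doing $R$ full-size fits rather than one, which is the opposite of what the questioner wants.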