I am a software developer. I do not have formal training in time series. I have started reading Chatfield and Brockwell. I have enough wisdom to reach out to professional statisticians in your field for insightful commentary so I can avoid doing something wrong.
How can I apply leave one out and k-fold cross validation on my time series?
Technically, I have 10 independent time series, one for each of 10 participants. For each series, we have participant id, timestamp (data taken at one-second intervals), heart rate, GIS location, GIS zone (the zone is a GIS polygon of special interest for fatigue), and a binary variable indicating whether the user is fatigued. My goal is to do cross validation so I can build a model to detect fatigue.
My data looks something like this:
- participant id, timestamp, heartrate, lat, long, zone, fatigue
- 1, 10:30, 130, 70, 38, 39, 1, 0
- 1, 10:30, 130, 72, 38, 39, 1, 0
- 10, 10:30, 138, 72, 38, 39, 1, 0
where I can tell which time series I am in based on the participant.
Let me divide my time series by participant id. I have [1,2,3,4,5,6,7,8,9,10], where 1 represents all the data I have for participant 1, that is, my time series 1. We can consider the series independent of each other. So I can do something like:
Leave one out
- 1 Train: [2,3,4,5,6,7,8,9,10] Test: 1
- 2 Train: [1,3,4,5,6,7,8,9,10] Test: 2
- 3 Train: [1,2,4,5,6,7,8,9,10] Test: 3
- 4 Train: [1,2,3,5,6,7,8,9,10] Test: 4
- 5 Train: [1,2,3,4,6,7,8,9,10] Test: 5
- 6 Train: [1,2,3,4,5,7,8,9,10] Test: 6
- 7 Train: [1,2,3,4,5,6,8,9,10] Test: 7
- 8 Train: [1,2,3,4,5,6,7,9,10] Test: 8
- 9 Train: [1,2,3,4,5,6,7,8,10] Test: 9
- 10 Train: [1,2,3,4,5,6,7,8,9] Test: 10
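The leave-one-out listing above can be generated rather than written by hand. This is a minimal sketch, assuming participants are identified by the integers 1 to 10; the function name `leave_one_participant_out` is made up for illustration.

```python
def leave_one_participant_out(participant_ids):
    """Yield (train_ids, test_id) pairs, holding out one whole participant per fold."""
    ids = sorted(set(participant_ids))
    for held_out in ids:
        # Every other participant's full series goes into training.
        train = [p for p in ids if p != held_out]
        yield train, held_out


# Assumed participant ids 1..10, matching the listing in the question.
participants = list(range(1, 11))
loo_folds = list(leave_one_participant_out(participants))

for i, (train, test) in enumerate(loo_folds, start=1):
    print(f"- {i} Train: {train} Test: {test}")
```

If scikit-learn is available, its `LeaveOneGroupOut` splitter produces the same kind of split on row-level data when the participant id is passed as the `groups` argument.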
2-fold validation
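One way a participant-level k-fold split could look, by analogy with the leave-one-out listing, is sketched here. It is an assumption that whole participants, not individual rows, are assigned to folds; the interleaved assignment and the name `participant_k_fold` are illustrative choices only.

```python
def participant_k_fold(participant_ids, k):
    """Yield (train_ids, test_ids) pairs with whole participants assigned to k folds."""
    ids = sorted(set(participant_ids))
    if not 2 <= k <= len(ids):
        raise ValueError("k must be between 2 and the number of participants")
    # Interleaved assignment: participant i goes to fold i mod k.
    folds = [ids[i::k] for i in range(k)]
    for test in folds:
        train = [p for p in ids if p not in test]
        yield train, test


# Assumed participant ids 1..10, split into 2 folds.
two_fold = list(participant_k_fold(range(1, 11), 2))

for i, (train, test) in enumerate(two_fold, start=1):
    print(f"- {i} Train: {train} Test: {test}")
```

Setting `k` equal to the number of participants reduces this to leave-one-out.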
I really have confused myself. I was thinking about this approach, but I was told by a colleague that I had it all wrong because I was doing a "within time series" approach and I needed to do an "across time series" approach.
I also checked out this, which I think is again for the "within" time series approach, because you are taking 1 time series and dividing it into m parts. I have 10 independent time series that supposedly observe the same or a similar effect and are independent of each other. I am trying to detect
The biggest concern with this kind of thing would be having a data point from participant a at time x in the test set, and another point from participant a at time y in the training set, where y > x. In other words, predicting the past based on the future. In this case, even just having participant a's data split between the train and test set could be problematic, since the model might overfit to some peculiarities of that participant.
Your leave-one-out scheme avoids both these problems though.
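Both problems can be checked mechanically for any proposed split. The sketch below assumes each row is reduced to a `(participant_id, timestamp)` pair with timestamps as comparable numbers; the function name `find_leakage` and the sample rows are hypothetical.

```python
def find_leakage(train_rows, test_rows):
    """Return (shared_participants, future_leaks) for rows of (participant_id, timestamp)."""
    train_ids = {pid for pid, _ in train_rows}
    test_ids = {pid for pid, _ in test_rows}
    # Participants whose data is split between train and test.
    shared = train_ids & test_ids

    # Earliest test timestamp seen for each participant.
    first_test_time = {}
    for pid, ts in test_rows:
        if pid not in first_test_time or ts < first_test_time[pid]:
            first_test_time[pid] = ts

    # Training rows that come after a test row from the same participant,
    # i.e. cases where the model would predict the past from the future.
    future_leaks = [
        (pid, ts)
        for pid, ts in train_rows
        if pid in first_test_time and ts > first_test_time[pid]
    ]
    return shared, future_leaks


# Hypothetical rows: (participant_id, seconds since start of recording).
clean = find_leakage(train_rows=[(2, 0), (2, 1), (3, 0)], test_rows=[(1, 0), (1, 1)])
leaky = find_leakage(train_rows=[(1, 5), (2, 0)], test_rows=[(1, 3)])
```

A split that holds out whole participants returns an empty set and an empty list, as in the `clean` case.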
I'm assuming that your goal is to predict the presence of fatigue at a particular instant in time, without any context about what came before, in which case what you've described is great.
If the problem you're trying to solve involves seeing a sequence of observations and identifying the onset of fatigue, then the "forward chaining" procedure described in this answer is appropriate. You would still want to do the leave-one-out thing you described. But when evaluating performance on the last participant, you would feed your predictor each data point in order, and record its ability to predict the next one, given what it's seen so far.
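The evaluation loop for the held-out participant might look roughly like this. It is a sketch under simplifying assumptions: the observations are reduced to the binary fatigue label, and `persistence` is a stand-in baseline (predict that the next label equals the last one seen), not a real fatigue model. Both function names are hypothetical.

```python
def forward_chaining_accuracy(labels, predict_next):
    """Feed labels in time order; at each step, predict the next label from the history."""
    if len(labels) < 2:
        raise ValueError("need at least two observations")
    correct = 0
    for t in range(1, len(labels)):
        history = labels[:t]  # only what has been seen strictly before time t
        if predict_next(history) == labels[t]:
            correct += 1
    return correct / (len(labels) - 1)


def persistence(history):
    """Placeholder predictor: assume the next label equals the most recent one."""
    return history[-1]


# Hypothetical fatigue labels for the held-out participant, in time order.
held_out_fatigue = [0, 0, 0, 1, 1, 1, 0]
score = forward_chaining_accuracy(held_out_fatigue, persistence)
```

In a real setup, `predict_next` would be a model trained on the other participants, and the history would also carry heart rate and zone.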