I have a problem I'd like to apply machine learning (supervised classification) to; however, the data is sequential and each row has its own length, which means the number of features per row is not constant (think time-series data, for example day-by-day data). The order of the data has meaning, so we cannot simply pad with zeros to make all rows equal length, as that would introduce spurious signals which would confuse my classifier. At least that's my current opinion.
One possible approach is to use window functions and simply compute (for each day) running sums of things. But that means losing information on history, since each day would have to be represented as its own row in the matrix in order to give all rows a fixed number of columns, so that I could apply classical ML algorithms. I want to avoid this, as I believe it's a suboptimal approach, but I will listen to any arguments against my opinion.
I don't have a lot of experience with neural networks, but I believe there are architectures which support non-fixed-length sequence data, e.g., RNNs? Does anyone have any good links/resources I may consider?
I welcome thoughts and suggestions from practitioners on how to approach this modeling problem. Thank you!
Regards,
M
Best Answer
It seems you are asking two questions here:
- How to deal with the situation where different samples have different numbers of features, i.e. when some features are either not applicable to some samples or are not available
- How to perform supervised classification on time-series data
With regard to question 1, it depends on the model, although each sample does ultimately need the same number of features. Some models, e.g. decision-tree-based ones, can explicitly deal with missing/NA values. Others, like logistic regression, require complete numerical features and cannot handle missing values directly. In that case, it may be worth introducing additional binary features (representing whether feature X is present/applicable) and choosing an appropriate value to fill in for feature X when it is missing or not applicable. A good choice depends on the specific problem.
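As a minimal sketch of the indicator-plus-fill idea (using pandas; the column names and the median fill are placeholder choices for illustration, not a recommendation):

```python
import numpy as np
import pandas as pd

# Toy frame: feature "x" is missing / not applicable for some samples
df = pd.DataFrame({"x": [1.2, np.nan, 3.4, np.nan, 5.0],
                   "label": [0, 1, 0, 1, 1]})

# Binary indicator: was "x" observed/applicable for this sample?
df["x_present"] = df["x"].notna().astype(int)

# Fill the missing values; the median is only one possible choice,
# the right value depends on the specific problem
df["x"] = df["x"].fillna(df["x"].median())

print(df)
```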
Question 2: you have a choice of manually engineering features, or trying a model that can deal with the temporal structure of your data automatically. Most models assume that each sample is independent of the others; ideally, you would apply some feature engineering to make your time series stationary and use your domain knowledge to decide what historical data is important for each sample and how it should be represented. Z-scores, moving averages, variances etc. could all be useful here. You may also attempt to use RNNs, but in my experience they are only worth it if you have a lot of data and no intuition about which features may be useful.
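For instance, a rough sketch of that kind of feature engineering with pandas (the 7-day window and the particular features are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# Toy daily series indexed by date
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=60).cumsum(),
              index=pd.date_range("2020-01-01", periods=60, freq="D"),
              name="value")

feats = pd.DataFrame({
    "value": s,
    "diff_1": s.diff(),                                    # first difference, helps stationarity
    "ma_7": s.rolling(7).mean(),                            # 7-day moving average
    "std_7": s.rolling(7).std(),                            # 7-day rolling standard deviation
    "z_7": (s - s.rolling(7).mean()) / s.rolling(7).std(),  # rolling z-score
})
print(feats.tail())
```

Each row then becomes a fixed-length feature vector summarising the recent history of that day, so classical classifiers can be applied.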
Regardless of which model you choose, setting up appropriate validation and testing frameworks is absolutely crucial. With time series you need to be extra careful: you need to decide whether using data from the future to train your model is appropriate, whether you need to discard data immediately adjacent to your training set to avoid leakage, etc. Do not just blindly sample data at random into validation/test sets; that will likely give you wildly biased estimates that will not be useful.
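One common way to set this up is a forward-chaining split, where each validation fold lies strictly after its training data and a gap is left between them to reduce leakage from overlapping windows. A sketch using scikit-learn's TimeSeriesSplit (the number of splits and the gap below are arbitrary, and this assumes a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Samples must already be in chronological order
X = np.arange(100).reshape(-1, 1)
y = (X.ravel() % 2 == 0).astype(int)

# Train only on the past, validate on the future; `gap` drops samples
# between the train and validation folds to reduce leakage
tscv = TimeSeriesSplit(n_splits=5, gap=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train [0..{train_idx[-1]}], "
          f"validate [{val_idx[0]}..{val_idx[-1]}]")
```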
I would also recommend researching each question independently, both have been addressed on this stackexchange before. Good luck!