I have a huge dataset for a binary classification problem (about 1.5 million rows), and the feature space is quite large (145 dimensions).
Some of these features are factors (YES, NO), but there is missing data. So my question is:
1 – Should I drop the missing data and lose information (the response matrix is quite sparse)?
2 – Should I model these factors with 3 levels? I.e., instead of (YES, NO) they become (YES, NO, X), where X stands for missing data.
The problem with option 2 is that there are a lot of rows with missing information, so I would lose a lot of samples.
EDIT: Actually, the missing values occur when a specific condition holds, so they are informative; they are not 'missing' in the usual sense. Does this reinforce approach 2 above?
Thanks for any insight!
Option 1 is an alternative that must be considered, but there are other approaches, and combinations of approaches. Each column which possesses missing values must be treated individually.
The decision of how to deal with each column will depend on many factors: the meaning of the column, the proportion of missing values, the nature of the missing values (for a categorical variable, a missing value can itself be very informative for predicting the response variable), etc. There is no "default" treatment. We need specific information to give specific advice.
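To illustrate the point about informative missing values in a categorical variable, here is a minimal sketch in Python with pandas. The column name `flag`, the label `MISSING`, and the toy data are assumptions made up for the example; they do not come from the question's dataset.

```python
import pandas as pd

# Hypothetical toy data; "flag" stands in for one YES/NO factor with gaps.
df = pd.DataFrame({"flag": ["YES", "NO", None, "YES", None]})

# Keep every row and turn missingness into its own explicit level.
df["flag"] = df["flag"].fillna("MISSING")

# One-hot encode the three levels so a classifier can use them.
encoded = pd.get_dummies(df["flag"], prefix="flag")
print(encoded.sum())
```

No rows are dropped here: the two rows without a value simply end up in the `flag_MISSING` column, so the model can learn from the fact that the value was absent.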
You should deal with it as systematically as possible:
- List all columns which have missing values.
- Determine the proportion of missing values in each column.
- Choose standard candidate approaches for each column (list-wise deletion, mean imputation, regression imputation, etc.).
- Evaluate the candidate approaches (you could, for example, train your classifier with two different approaches and compare them on a validation set).
There are lots of advanced approaches. Everything said above applies to them anyway. Googling "missing data" will give you many more insights.
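The first steps of the list above could be sketched roughly as follows in Python with pandas. The frame, its column names (`a`, `b`, `c`) and the two candidate treatments are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" is numeric, "b" is a YES/NO factor, "c" is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["YES", None, None, "NO"],
    "c": [1, 2, 3, 4],
})

# List the columns with missing values and the proportion missing in each.
missing_share = df.isna().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)

# Candidate 1: list-wise deletion (drop every row with any missing value).
dropped = df.dropna()

# Candidate 2: mean imputation for the numeric column,
# and an explicit "MISSING" level for the factor.
imputed = df.assign(
    a=df["a"].fillna(df["a"].mean()),
    b=df["b"].fillna("MISSING"),
)
print(len(dropped), len(imputed))
```

Each candidate frame would then be used to train the same classifier and be scored on the same validation set. In a real pipeline the imputation mean should be computed on the training split only, so that no information leaks from the validation rows.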
[Edit: comment about "option 2" removed, because the original question was modified and the comment is not applicable anymore]