Say I am working on a binary classification problem and that I have a feature matrix $X$ where some entries are missing (NaN). The rest of the entries in $X$ are real numbers.
How can I apply SVMs on this data?
This largely depends on the nature of your data. Ideally, a domain expert could specify the missing values. When no prior knowledge about the data is available, the following procedures are commonly used:
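Replace the missing value with the mean of that feature (for continuous values) or with its mode, i.e. the most frequent value (for nominal values). The median is a robust alternative to the mean for continuous or ordinal values, but it is not defined for unordered categories.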
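As a minimal sketch of mean replacement for continuous features, assuming the data sits in a NumPy array with NaN marking the missing entries (the function name `impute_column_means` and the toy matrix are illustrative, not from the original answer):

```python
import numpy as np

def impute_column_means(X):
    """Return a copy of X with each NaN replaced by the mean of its column."""
    X = np.asarray(X, dtype=float)
    # nanmean ignores NaNs, so each mean uses the observed entries only.
    col_means = np.nanmean(X, axis=0)
    filled = X.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

# Toy data (made up for illustration): one missing entry per feature.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
X_filled = impute_column_means(X)
```

The filled matrix contains no NaN and can be passed to any SVM implementation. In practice the column means should be computed on the training set only and reused for the test set. If scikit-learn is available, `SimpleImputer(strategy="mean")` followed by `SVC` in a `Pipeline` is likely the more convenient route.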
Take the instance with missing value(s) as a query and search the data for its $K$ closest instances, using all features whose values are known. Each missing feature is then set to a value aggregated from those nearest neighbors (for example, their mean or median). Replacing a missing value with the global mean or median of the feature is the special case in which $K$ is set to the largest possible value, so that every instance counts as a neighbor. This procedure requires a suitable distance function on the feature space; Euclidean distance is often used.
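The nearest-neighbor procedure can be sketched as follows. This is a simple illustration under stated assumptions, not a reference implementation: the function name `knn_impute`, the default `k=2`, and the toy data are made up, and the distance is a root-mean-square Euclidean distance over the features observed in both rows, so that rows sharing more observed features are not penalized.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each row with the mean over its k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    n_rows = X.shape[0]
    for i in np.where(missing.any(axis=1))[0]:
        # Distance from row i to every other row, using only the
        # features that are observed in both rows.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = ~missing[i] & ~missing[j]
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for f in np.where(missing[i])[0]:
            # Donors: closest rows at finite distance that observe feature f.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not missing[j, f]][:k]
            if donors:
                filled[i, f] = X[donors, f].mean()
    return filled

# Toy data (made up): the last row is close to the first two rows.
X = np.array([[1.0, 1.0, 10.0],
              [1.1, 0.9, 12.0],
              [5.0, 5.0, 50.0],
              [1.0, 1.0, np.nan]])
X_knn = knn_impute(X, k=2)
```

Here the missing entry is filled from the two nearest rows rather than from all rows, so the distant third row has no influence. Because Euclidean distance is sensitive to scale, features are usually standardized first. scikit-learn's `KNNImputer` provides a ready-made variant of this idea.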