To build an SVM-based classifier, I have a training set of N data points, some of which are redundant. For instance, 50 data points are exactly the same, and another 100 data points are exactly the same. I have two choices: remove the duplicates and train on the reduced data set, or keep the original data set. Will the resulting classifier differ between these two choices?
If you are using hard margins, there is no difference: duplicated points add only redundant constraints, so the maximum-margin separator is the same either way.
If you are using soft margins, then duplicating a data point can matter, because the penalty is a sum of slack terms over the points that violate the margin; duplicating such a point multiplies its contribution to that sum.
[Figures omitted: $1$-dimensional examples of the best soft-margin classifiers without and with duplication.]
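You can check this numerically. Below is a small sketch (assuming scikit-learn; the data, the `boundary` helper, and the chosen `C` values are illustrative) that fits a linear SVM on a $1$-dimensional toy set, with and without duplicating an awkward point near the class boundary. With a very large `C` (effectively a hard margin) the duplicates change nothing; with a small `C` (soft margin) the duplicated point's slack is counted many times, so the optimum shifts.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D data: negatives at -2, -1, 0.5; positives at 1, 2.
# The negative point at 0.5 sits close to the positive class.
X = np.array([[-2.0], [-1.0], [0.5], [1.0], [2.0]])
y = np.array([-1, -1, -1, 1, 1])

# Duplicate the point at 0.5 fifty more times.
X_dup = np.vstack([X, np.repeat([[0.5]], 50, axis=0)])
y_dup = np.concatenate([y, -np.ones(50, dtype=int)])

def boundary(clf):
    # For a 1-D linear SVM f(x) = w*x + b, the decision boundary is x = -b/w.
    w = clf.coef_[0, 0]
    b = clf.intercept_[0]
    return -b / w

# Nearly hard margin (very large C): the data are separable, so both fits
# recover the same maximum-margin separator, midway between 0.5 and 1.
hard1 = SVC(kernel="linear", C=1e6).fit(X, y)
hard2 = SVC(kernel="linear", C=1e6).fit(X_dup, y_dup)

# Soft margin (small C): the slack of the point at 0.5 is now counted
# 51 times in the penalty, so the optimum shifts toward separating it.
soft1 = SVC(kernel="linear", C=0.1).fit(X, y)
soft2 = SVC(kernel="linear", C=0.1).fit(X_dup, y_dup)

print(boundary(hard1), boundary(hard2))  # essentially identical
print(boundary(soft1), boundary(soft2))  # noticeably different
```

The hard-margin boundaries agree because duplicates do not change the feasible set; the soft-margin boundaries disagree because the objective itself changes.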