Solved – Does collinearity of one-hot encoded features matter for SVM and LogReg?

Sometimes I encode categorical features as binary indicator values: one feature per possible category value, set to 1 when the original feature takes that value and 0 otherwise (i.e. the one-of-K, or one-hot, scheme).

These indicator values are linearly dependent, since by construction they sum to 1 across each row.
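For concreteness, a minimal sketch in Python (the color column and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical feature with K = 3 levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-of-K (one-hot) encoding: one indicator column per level.
onehot = pd.get_dummies(df["color"], dtype=float)
print(onehot)

# The columns are linearly dependent: every row sums to exactly 1,
# i.e. together they replicate the all-ones intercept column.
print(onehot.sum(axis=1))  # -> 1.0 for every row
```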

Does this linear dependence matter for linear SVM, kernel SVM, logistic regression, etc.?
For which methods does it matter enough that I need to remove one of the features? Does it cause problems for ordinary linear regression?
And for which methods does it make no difference?

Based on my understanding, collinearity affects the estimation of the weights: the design matrix becomes rank-deficient, so infinitely many weight vectors produce exactly the same fit. So if your goal is to inspect the feature weights and compute their significance, you should drop one dummy column and use only K-1 indicator values. The dropped category then becomes the reference level: its effect is absorbed into the intercept, and the remaining weights are offsets relative to it. Alternatively, you can keep all K values and fit without an intercept.
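A minimal sketch with NumPy/pandas (data made up for illustration) showing both points: the K-1 encoding yields identifiable weights with the dropped level absorbed into the intercept, while the full K encoding plus intercept admits many equivalent weight vectors with identical fitted values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 3 categories with different mean responses.
cat = rng.choice(["a", "b", "c"], size=300)
y = pd.Series(cat).map({"a": 1.0, "b": 2.0, "c": 3.0}).to_numpy()
y = y + rng.normal(scale=0.1, size=y.size)

# K-1 encoding: drop the first level so the design has full rank.
X_km1 = pd.get_dummies(cat, drop_first=True, dtype=float).to_numpy()
X_km1 = np.column_stack([np.ones(len(y)), X_km1])  # prepend intercept
beta_km1, *_ = np.linalg.lstsq(X_km1, y, rcond=None)
# Intercept ~ mean of the dropped level "a"; the other weights are
# offsets relative to that reference category.
print(beta_km1)  # approximately [1.0, 1.0, 2.0]

# Full K encoding plus intercept: rank-deficient, so the least-squares
# problem has infinitely many solutions; lstsq returns the minimum-norm one.
X_k = pd.get_dummies(cat, dtype=float).to_numpy()
X_k = np.column_stack([np.ones(len(y)), X_k])
beta_k, *_ = np.linalg.lstsq(X_k, y, rcond=None)
print(beta_k)  # one of many equivalent weight vectors

# The fitted values are identical either way.
print(np.allclose(X_km1 @ beta_km1, X_k @ beta_k))  # -> True
```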

But if your goal is a model with high predictive performance (i.e., what you really care about is the predictions rather than the coefficients), it does not matter much which encoding scheme you use. If you use all K values, you can either disable the intercept term or add a regularization term (e.g., an L2 penalty), which makes the optimum unique and eliminates the practical impact of the collinearity. This is also why SVMs, whose objective always includes an L2 penalty on the weights, are not troubled by the redundant column.
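For instance, a minimal sketch with scikit-learn (made-up data): the default L2-penalized logistic regression fits fine with all K columns, and the predicted probabilities from the K and K-1 encodings nearly coincide even though the learned weights differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical binary outcome driven by one categorical feature.
cat = rng.choice(["a", "b", "c"], size=500)
p = pd.Series(cat).map({"a": 0.2, "b": 0.5, "c": 0.8}).to_numpy()
y = rng.random(p.size) < p

X_k = pd.get_dummies(cat, dtype=float)                      # all K columns
X_km1 = pd.get_dummies(cat, drop_first=True, dtype=float)   # K-1 columns

# The default L2 penalty makes each objective strictly convex, so both
# fits are well defined despite the collinearity in X_k.
# (Alternatively one could keep all K columns and pass fit_intercept=False.)
clf_k = LogisticRegression().fit(X_k, y)
clf_km1 = LogisticRegression().fit(X_km1, y)

# The learned weights differ across encodings...
print(clf_k.coef_, clf_km1.coef_)
# ...but the predicted probabilities are almost the same.
diff = np.abs(clf_k.predict_proba(X_k) - clf_km1.predict_proba(X_km1))
print(diff.max())  # small
```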
