Solved – Do dummy variables need scaling in machine learning models

I have a data set with continuous variables and dummy variables (1/0). When using models such as neural networks, SVMs, linear models, etc., I was recommended to put the input variables on similar scales, such as mean=0, var=1. Do I also need to scale these dummy variables? My intuition is no, but I'm really unsure, so I wish to hear from others as well.

Firstly, the difference in scale between variables influences the regularization of the model. This is easiest to think about with Lasso or Ridge regression. The penalty is applied to the size of the coefficients, and a variable with a large range needs only a small coefficient to have the same effect on the prediction. So if a variable ranges between 0 and 1000 before scaling and between 0 and 1 afterwards, the penalty on its coefficient is effectively much smaller when the range is bigger. That influences the model, most likely in a bad way, because the regularization applied to different variables is uneven. SVMs have similar issues with unscaled variables. A completely unregularised neural network in theory wouldn't have those issues, and an unregularised linear model doesn't have them either, so in those cases scaling doesn't affect regularization. This ignores any precision problems with very big or very small numbers on the computer.
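To make the regularization point concrete, here is a minimal sketch using the closed-form ridge solution for a single feature with no intercept. The data, the penalty `lam = 1.0` and the helper names are made up for illustration. It compares how much the same penalty shrinks the coefficient when the feature ranges over 0 to 1000 versus 0 to 1.

```python
def ridge_coef(x, y, lam):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def shrinkage(x, y, lam):
    """Ratio of the ridge coefficient to the unpenalised (lam = 0) coefficient."""
    return ridge_coef(x, y, lam) / ridge_coef(x, y, 0.0)


# Hypothetical data: the same feature on two scales, same target.
x_raw = [0.0, 250.0, 500.0, 750.0, 1000.0]   # ranges over 0..1000
x_scaled = [xi / 1000.0 for xi in x_raw]     # ranges over 0..1
y = [0.002 * xi for xi in x_raw]

lam = 1.0  # assumed penalty strength
shrink_raw = shrinkage(x_raw, y, lam)        # very close to 1: almost no shrinkage
shrink_scaled = shrinkage(x_scaled, y, lam)  # about 0.65: clearly shrunk

print(f"raw scale:    coefficient kept at {shrink_raw:.6f} of its OLS value")
print(f"scaled (0-1): coefficient kept at {shrink_scaled:.6f} of its OLS value")
```

The penalty is identical in both runs, yet it barely touches the wide-range feature and shrinks the narrow-range one by roughly a third, which is the unevenness described above.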

Another reason to scale is that it can affect the fitting algorithm. Say a neural net has a very, very small input variable. Then the corresponding weight has to be large. If you grow the weights with gradient descent, the scale of the input variables makes a difference in how fast those weights converge. In the case of a neural net you never really converge completely, so the scaling may make a large difference.
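A rough sketch of the convergence effect, using plain gradient descent on a one-weight least-squares problem rather than an actual neural net. The learning rate, tolerance, step cap and data are all assumptions chosen for illustration. The target is the same in both runs; only the scale of the input differs.

```python
def gd_steps(x, y, lr=0.1, tol=1e-6, max_steps=20_000):
    """Count gradient-descent steps until the mean squared error drops below tol.

    Returns max_steps if the tolerance is not reached within the cap.
    """
    w = 0.0
    n = len(x)
    for step in range(max_steps):
        residuals = [w * xi - yi for xi, yi in zip(x, y)]
        mse = sum(r * r for r in residuals) / n
        if mse < tol:
            return step
        # Gradient of the mean squared error with respect to w.
        grad = 2.0 * sum(r * xi for r, xi in zip(residuals, x)) / n
        w -= lr * grad
    return max_steps


# Hypothetical data: same target, input on two different scales.
x_unit = [-1.0, -0.5, 0.0, 0.5, 1.0]
x_small = [0.01 * xi for xi in x_unit]   # a "very small" input variable
y = [2.0 * xi for xi in x_unit]          # needs w = 2 for x_unit, w = 200 for x_small

steps_unit = gd_steps(x_unit, y)
steps_small = gd_steps(x_small, y)

print("steps with unit-scale input: ", steps_unit)
print("steps with small-scale input:", steps_small, "(step cap reached)")
```

With the small input, the weight has to travel to 200 while each gradient step is tiny, so the run hits the step cap long before the error gets below the tolerance.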

A last reason for scaling is that it can make the coefficients easier to interpret.

Some algorithms, such as K-nearest neighbors, are greatly affected by scaling, because the scale of each input variable effectively weights how much that variable contributes to the distance. These methods don't have scaling built into the algorithm, so you have to do it yourself. In those cases, scaling the dummy variables could have quite a big effect.
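A small sketch of how scale decides the nearest neighbor. The points, the `income` feature and the assumed standard deviation of 10,000 are invented for illustration. Each point is a pair (continuous variable, dummy variable).

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest(query, candidates):
    """Return the label of the candidate closest to the query."""
    return min(candidates, key=lambda label: euclidean(query, candidates[label]))


def rescale(point, income_sd):
    """Divide the continuous variable by an assumed standard deviation."""
    income, dummy = point
    return (income / income_sd, dummy)


# Hypothetical rows: (income, dummy)
query = (50_000.0, 1.0)
candidates = {
    "a": (50_500.0, 1.0),  # same dummy, slightly different income
    "b": (50_010.0, 0.0),  # different dummy, almost the same income
}

nearest_raw = nearest(query, candidates)

income_sd = 10_000.0  # assumed standard deviation of income
scaled_candidates = {k: rescale(v, income_sd) for k, v in candidates.items()}
nearest_scaled = nearest(rescale(query, income_sd), scaled_candidates)

print("nearest on raw data:   ", nearest_raw)     # income dominates the distance
print("nearest after scaling: ", nearest_scaled)  # the dummy now matters
```

On the raw data the dummy is practically ignored, because a difference of 1 is tiny next to income differences in the hundreds. After rescaling income, a mismatch in the dummy outweighs a small income gap. Rescaling the dummy itself would shift this balance again.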

These are the reasons for scaling. It's never really a problem to scale your variables, so one approach is to always do it. But most people wouldn't scale their dummy variables, I guess.

A dummy variable has mean p and variance p(1 - p), where p is the proportion of 1's. If you normalise it and you have 1000 zeros and 1 one, then p = 1/1001, so you would subtract about 1/1000 and divide by the standard deviation sqrt(p(1 - p)) ~= 0.032, i.e. multiply by about 32. Afterwards the input variable looks like a lot of values that are almost zero (about -0.03) and one value of about 32. That won't improve things, but it wouldn't affect much, and the data is pretty ridiculous to begin with anyway. In most cases, where the variance isn't that small, the dummy is already on a scale comparable to a standardised variable: for p = 0.5 the mean is 0.5 and the standard deviation is 0.5.
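The arithmetic can be checked with a few lines of standard-library Python. This is only a sketch: `standardize` is a hypothetical helper that subtracts the mean and divides by the population standard deviation, sqrt(p(1 - p)) for a dummy.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


rare = [0.0] * 1000 + [1.0]            # 1000 zeros and a single one
balanced = [0.0] * 500 + [1.0] * 500   # p = 0.5

z_rare = standardize(rare)
z_balanced = standardize(balanced)

# Rare dummy: zeros land just below 0, the single one lands near sqrt(1000) ~= 31.6.
print(f"rare:     min={min(z_rare):.4f}  max={max(z_rare):.4f}")
# Balanced dummy: values become -1 and +1.
print(f"balanced: min={min(z_balanced):.4f}  max={max(z_balanced):.4f}")
```

For the balanced dummy the transformation only moves 0/1 to -1/+1, which is why standardizing usually changes little unless the dummy is very rare.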

So because the difference will be small in practice, regularisation and fitting will be only slightly affected. All in all, I think it's best to not scale the dummy variables because it shouldn't change much and it's an unnecessary operation.
