Solved – Scikit-learn: How to normalize Huber regressors

In scikit-learn, the Ridge regression estimator has a normalize parameter that normalizes the regressors. I found that setting this to True was necessary to get a reasonable fit to my data when using higher-degree polynomial features (it provided consistent regularization no matter how many samples I trained or predicted on).
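Roughly, the setup looks like this (the data, degree, and alpha here are hypothetical placeholders; note that normalize=True assumes an older scikit-learn release, since the parameter was later deprecated and removed):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # hypothetical 1-D data standing in for the real dataset
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

    # normalize=True rescales each expanded feature column before fitting,
    # which keeps the regularization consistent across sample counts
    model = make_pipeline(PolynomialFeatures(degree=8), Ridge(alpha=1.0, normalize=True))
    model.fit(X, y)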

I would like to use a more robust estimator such as Huber regression; however, it does not have this normalize parameter, so the fit is quite poor.

Sklearn has a preprocessing.Normalizer() transformer, which I tried adding to my pipeline, but it didn't help. When I instead created a preprocessing.FunctionTransformer() that calls preprocessing.normalize(), I found that setting axis=0 (i.e. normalizing over the features rather than the samples) gave me a good fit, much like setting normalize=True for the Ridge estimator.
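In other words, something like this (reusing the hypothetical X, y, and degree from the sketch above):

    from sklearn.linear_model import HuberRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, normalize

    # axis=0 makes preprocessing.normalize() scale each feature column to
    # unit norm; axis=1 (the Normalizer default) scales each sample row
    col_normalizer = FunctionTransformer(normalize, kw_args={"axis": 0})
    huber = make_pipeline(PolynomialFeatures(degree=8), col_normalizer, HuberRegressor())
    huber.fit(X, y)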

However, this only worked when I predicted on a sample of similar size to my training set. Depending on the number of inputs, the predicted values would change (this behavior does not occur with Ridge's normalize=True).
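A small demonstration of why this happens: preprocessing.normalize(..., axis=0) computes each column's norm over whatever rows are in the current batch, so the same value is scaled differently depending on how many samples are passed in:

    import numpy as np
    from sklearn.preprocessing import normalize

    X = np.ones((4, 1))
    print(normalize(X, axis=0).ravel())      # [0.5 0.5 0.5 0.5]  (norm over 4 rows)
    print(normalize(X[:2], axis=0).ravel())  # [0.7071 0.7071]    (norm over 2 rows)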

I've been reading through the Ridge estimator's source code, trying to find exactly how it implements its normalize parameter, but the solution there seems quite convoluted.

Is there a relatively straightforward way to properly normalize the regressors of a Huber estimator, in the same way that the normalize parameter does for the Ridge estimator?

IIUC, it seems like you've confused two different forms of normalization.

sklearn.preprocessing.Normalizer normalizes vectors to unit norm. Note that it is naturally used to scale rows (instances), not columns (features). In general, unit normalization depends on the vector's length: concatenate a vector to itself, and you will need to shrink the elements further to retain unit norm. For rows (instances), the length (the number of features) is constant.
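For example (outputs rounded):

    import numpy as np
    from sklearn.preprocessing import Normalizer

    v = np.array([[3.0, 4.0]])             # norm 5
    vv = np.array([[3.0, 4.0, 3.0, 4.0]])  # same vector concatenated to itself; norm sqrt(50)

    print(Normalizer().fit_transform(v))   # [[0.6    0.8   ]]
    print(Normalizer().fit_transform(vv))  # [[0.4243 0.5657 0.4243 0.5657]]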

sklearn.preprocessing.StandardScaler, conversely, removes the mean and scales to unit variance. It is naturally used to scale columns (features), and it is essentially independent of the number of samples.
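A quick illustration: the per-column statistics are learned once from the training data and then reused, so the transformed values do not depend on how many samples you pass at predict time:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])

    scaler = StandardScaler().fit(X_train)  # learns per-column mean and std
    print(scaler.transform(X_train))        # each column: mean 0, unit variance
    print(scaler.transform(X_train[:1]))    # a single row gets the same scaling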

In your case, it seems like you should use StandardScaler together with something like sklearn.linear_model.SGDRegressor with Huber loss (loss='huber') in a pipeline. You will need to tune the regularization parameters somehow (alpha and, with an elastic-net penalty, l1_ratio), preferably using some form of cross-validation.
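A sketch of that pipeline (the polynomial degree and the parameter grid are placeholders to tune for your data):

    from sklearn.linear_model import SGDRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    pipe = make_pipeline(
        PolynomialFeatures(degree=8),
        StandardScaler(),
        SGDRegressor(loss="huber", penalty="elasticnet", max_iter=10_000),
    )

    # cross-validate the regularization strength and the l1/l2 mix
    param_grid = {
        "sgdregressor__alpha": [1e-4, 1e-3, 1e-2, 1e-1],
        "sgdregressor__l1_ratio": [0.0, 0.15, 0.5, 1.0],
    }
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X, y)  # your training data

SGDRegressor also has an epsilon parameter that sets where the Huber loss switches from squared to linear, which may be worth tuning as well.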
