I read that normalization is not required when using gradient tree boosting (see e.g. https://stackoverflow.com/q/43359169/1551810 and https://github.com/dmlc/xgboost/issues/357).
And I think I understand that in principle there is no need for normalization when boosting regression trees.
Nevertheless, using xgboost for regression trees, I see that scaling the target can have a significant impact on the (in-sample) error of the prediction result. What is the reason for this?
Example for the Boston Housing dataset:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

boston = load_boston()
y = boston['target']
X = boston['data']

scales = pd.Index(np.logspace(-6, 6), name='scale')
data = {'reg:linear': [], 'reg:gamma': []}
for objective in ['reg:linear', 'reg:gamma']:
    for scale in scales:
        # fit on the scaled target, then undo the scaling on the predictions
        xgb_model = xgb.XGBRegressor(objective=objective).fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        data[objective].append(mean_squared_error(y, y_predicted))

pd.DataFrame(data, index=scales).plot(loglog=True, grid=True).set(ylabel='MSE')
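For contrast with the claim above that normalization of the features is not needed, here is a minimal sanity check (reusing X and y from the snippet above; the factor of 1000 is arbitrary). Since tree splits depend only on the ordering of feature values, the two models should give practically identical predictions:

# Scaling the *features* should not matter: tree splits only depend on the
# ordering of feature values, so the predictions should agree up to
# floating-point noise.
model_raw = xgb.XGBRegressor(objective='reg:linear').fit(X, y)
model_scaled = xgb.XGBRegressor(objective='reg:linear').fit(X * 1000, y)
print(np.allclose(model_raw.predict(X), model_scaled.predict(X * 1000)))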
Best Answer
A big part of the answer seems to be found in https://github.com/dmlc/xgboost/issues/799#issuecomment-181768076.
By default, base_score is set to 0.5, and this seems a bad choice for regression problems. When the average of the target is much higher or lower than base_score, the first trees are spent just catching up to the average, and fewer trees are left for the real task.
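This is easy to see on the Boston data, whose target mean is roughly 22.5 (a small sketch, reusing X and y from the question): with the default base_score=0.5 the mean prediction only creeps towards the target mean as trees are added, so the early trees are spent on the offset rather than on the structure of the data.

print('target mean:', y.mean())  # roughly 22.5 for the Boston housing data
for n in [1, 5, 20, 100]:
    # with the default base_score=0.5, the early trees mostly shift the
    # predictions towards the target mean instead of modelling the features
    model = xgb.XGBRegressor(objective='reg:linear', n_estimators=n).fit(X, y)
    print(n, 'trees -> mean prediction:', model.predict(X).mean())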
The solution thus seems simple: set base_score to the mean of the target, so that its scale no longer affects the regression result.
For the objective 'reg:gamma' this indeed seems to be the whole story, whereas for 'reg:linear' it brings only a partial improvement:
data = {'reg:linear': [], 'reg:gamma': [],
        'reg:linear - base_score': [], 'reg:gamma - base_score': []}

# default base_score (0.5)
for objective in ['reg:linear', 'reg:gamma']:
    for scale in scales:
        xgb_model = xgb.XGBRegressor(objective=objective).fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        data[objective].append(mean_squared_error(y, y_predicted))

# base_score set to the mean of the (scaled) target
for objective in ['reg:linear', 'reg:gamma']:
    for scale in scales:
        base_score = (y / scale).mean()
        xgb_model = xgb.XGBRegressor(objective=objective, base_score=base_score).fit(X, y / scale)
        y_predicted = xgb_model.predict(X) * scale
        data[objective + ' - base_score'].append(mean_squared_error(y, y_predicted))

styles = ['g-', 'r-', 'g--', 'r--']
pd.DataFrame(data, index=scales).plot(loglog=True, grid=True, style=styles).set(ylabel='MSE')
So the remaining question reduces to: why does scaling the target still sometimes have an impact with objective 'reg:linear', even after adjusting base_score to the mean of the (scaled) target?
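One way to probe that remainder (just a diagnostic sketch, not an answer; it reuses the variables from the snippet above) is to check whether the leftover effect at extreme scales is simply a matter of the booster not having converged within the default number of rounds:

# Diagnostic: does the leftover effect for 'reg:linear' shrink when more
# boosting rounds are allowed?
for scale in [1e-6, 1.0, 1e6]:
    for n in [100, 500]:
        base_score = (y / scale).mean()
        model = xgb.XGBRegressor(objective='reg:linear', n_estimators=n,
                                 base_score=base_score).fit(X, y / scale)
        mse = mean_squared_error(y, model.predict(X) * scale)
        print(f'scale={scale:g}, n_estimators={n}: MSE={mse:.3f}')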