Solved – Regression random forest and highly skewed response distribution

There is a great deal of information on how unbalanced data sets may impact predictive accuracy in classification problems. Several solutions have been proposed (see here). My questions are:

  1. Can a highly skewed target distribution (i.e. when the response variable is continuous and not categorical) create similar problems in a regression random forest? The response I am trying to predict is expressed as a percentage and 96% of the observations take on the value 0.

  2. I am using 5-fold cross-validation to estimate RMSE and $R^2$. Is either of these metrics influenced by the response distribution?

  3. If the skewed distribution is a problem, how should I deal with it?

One might argue that this is a classification problem with a small rounding error rather than a regression design. RF is often described as resilient to skewness, but it is not invincible. In this case, chances are that very few of the positive responses would make it into the sample used to grow each small tree, or into the OOB subset against which that tree is tested.
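To get a feel for how thinly the positive responses are spread, here is a minimal sketch that draws one bootstrap sample and counts the distinct non-zero rows that land in-bag versus out-of-bag. It assumes plain row-wise bootstrap sampling; the data (200 rows, 4% non-zero) and the function name `bootstrap_positive_counts` are hypothetical and not taken from the question.

```python
import random

def bootstrap_positive_counts(y, seed=0):
    """Draw one bootstrap sample of len(y) row indices (with replacement)
    and count the distinct non-zero responses in-bag and out-of-bag."""
    rng = random.Random(seed)
    n = len(y)
    # Set of distinct row indices that were drawn at least once.
    in_bag = {rng.randrange(n) for _ in range(n)}
    positives = {i for i, value in enumerate(y) if value != 0}
    in_bag_pos = len(positives & in_bag)
    oob_pos = len(positives - in_bag)
    return in_bag_pos, oob_pos

# Hypothetical response: 200 rows, 96% zeros and 8 non-zero percentages.
y = [0.0] * 192 + [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]

for seed in range(5):
    in_bag_pos, oob_pos = bootstrap_positive_counts(y, seed=seed)
    print(f"tree {seed}: {in_bag_pos} positives in-bag, {oob_pos} out-of-bag")
```

The absolute counts scale with the size of the data set, so a large sample may still give each tree a workable number of positives. If the forest additionally subsamples rows per tree, the counts shrink further.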

An inability to predict the responses of interest correctly is likely to be reflected in the overall $R^2$; however, $R^2$ would not be the most useful descriptor here. This is most easily visualized with the simple linear regression equivalent: the $R^2$ for a relationship between a cloud of points at zero and a few outliers. The solutions linked in the question may still help alleviate the problem; however, I would reconsider the design as 1) an unbalanced classification problem (zero versus positive) and 2) if the subset of positive responses is sufficiently large, a regression of the kind you are interested in, fitted to just the positive responses.
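As a rough illustration of how the response distribution affects both metrics, the sketch below scores a trivial model that always predicts 0 on a made-up response with 96% zeros. The data and the helper names `rmse` and `r_squared` are assumptions for illustration only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical response in percent: 96 zeros and 4 positive values.
y_true = [0.0] * 96 + [10.0, 20.0, 30.0, 40.0]
all_zero = [0.0] * len(y_true)  # a model that never predicts a positive

rmse_all = rmse(y_true, all_zero)       # about 5.48 on a 0-100 scale
r2_all = r_squared(y_true, all_zero)    # about -0.03

# The same predictions scored on the positive rows only.
pos_true = [t for t in y_true if t != 0]
rmse_pos = rmse(pos_true, [0.0] * len(pos_true))  # about 27.39

print(f"RMSE (all rows):      {rmse_all:.2f}")
print(f"R^2  (all rows):      {r2_all:.3f}")
print(f"RMSE (positive rows): {rmse_pos:.2f}")
```

On this toy data the overall RMSE looks small only because the zeros dominate the average, while $R^2$ near zero shows that the model explains none of the variance. Reporting the error on the positive rows separately makes the failure visible.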
