Heteroscedasticity in machine learning predictions

I am using a machine learning method (PLS) to predict a continuous variable, which currently does a pretty good job, with reasonable RMSE etc.

However, the residuals exhibit heteroscedasticity: the error variance increases with the dependent variable. I come from an econometrics background, where it is emphasised that heteroscedasticity in the residuals implies that the predictions obtained are non-optimal (intuitively, there is some information being left unused that could be incorporated into the model to improve the predictions). In econometrics, if we want to improve the performance of such models, we use weighted least squares or related methods to take this facet of the data into account.

How should I approach this in machine learning? In particular, how would I do it for PLS (via the caret package), if you have any experience with this?



The approach I developed (probably not the first to do so, as it is a fairly obvious idea!) was to jointly model the conditional variance of the target distribution as well as the conditional mean (in this case using kernel learning methods), with the parameters obtained by minimizing a penalized likelihood criterion. The benefit of this is that the residuals in high-variance regions of the attribute space are appropriately down-weighted. The paper is here: (pre-print here:
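To make the down-weighting concrete, here is one common form such a criterion can take. This is only a sketch assuming Gaussian noise; the symbols $\mu$, $\sigma^2$, $\lambda$ and $\Omega$ are my notation for the conditional mean, the conditional variance, a regularization strength and a generic penalty, and are not necessarily the exact form used in the paper:

$$
\mathcal{L} = \sum_{i=1}^{n} \left[ \frac{\left(y_i - \mu(x_i)\right)^2}{2\,\sigma^2(x_i)} + \frac{1}{2}\log \sigma^2(x_i) \right] + \lambda\,\Omega(\mu, \sigma^2)
$$

Each squared residual is divided by the estimated variance at that point, so samples in noisy regions contribute less to the fit of the mean, while the $\log \sigma^2$ term stops the variance estimate from growing without bound.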

If you don't want to write bespoke code, something more basic could be achieved using standard (weighted) least-squares regression tools. First fit a model to the data. Next fit a second model to the squared residuals of the first. This gives an estimate of the conditional variance of the target distribution. Then re-fit the first model, weighting each sample by the inverse of its estimated conditional variance, so that noisier samples count for less. Perhaps repeat this process a couple of times until the estimates stabilise. A similar approach was used by Nix and Weigend (see references in my paper).
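The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method from the paper: it assumes a single input and a straight-line model for the mean, and the function names `fit_wls` and `two_stage_fit` are made up for this example. One deliberate deviation: the variance model is fitted to the log of the squared residuals rather than to the squared residuals themselves, simply so that the predicted variance is always positive.

```python
import math

def fit_wls(x, y, w):
    """Weighted least squares for the line y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def two_stage_fit(x, y, n_iter=3):
    """Alternate between fitting the mean and fitting the variance."""
    ones = [1.0] * len(x)
    w = ones  # first pass is ordinary (unweighted) least squares
    for _ in range(n_iter):
        # Step 1: fit the mean model with the current weights.
        a, b = fit_wls(x, y, w)
        # Step 2: fit a second model to the log squared residuals
        # (assumption: log-linear variance; the small constant avoids log(0)).
        log_r2 = [math.log((yi - (a + b * xi)) ** 2 + 1e-12)
                  for xi, yi in zip(x, y)]
        c, d = fit_wls(x, log_r2, ones)
        # Step 3: weights are the inverse of the estimated variance.
        # The log transform biases the variance by a constant factor,
        # which rescales all weights equally and leaves the fit unchanged.
        w = [1.0 / math.exp(c + d * xi) for xi in x]
    # Final re-fit of the mean model with the last set of weights.
    a, b = fit_wls(x, y, w)
    return a, b, w
```

Calling `a, b, w = two_stage_fit(x, y)` on two equal-length lists of floats returns the intercept, the slope and the final per-sample weights. The same loop should carry over to any learner that accepts sample weights.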

A better approach fits the model of the conditional variance to the squared leave-one-out cross-validation residuals of the conditional mean model (which can be computed very cheaply for many linear-in-the-parameters models). Otherwise the estimated conditional variance will be somewhat biased (smaller than it should be), because in-sample residuals come from a model that has already seen the points it is being judged on.
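As an illustration of why this is cheap, for ordinary least squares the leave-one-out residual equals the in-sample residual divided by one minus the leverage of that point, so no refitting is needed. The sketch below assumes a single input and a straight-line model; `fit_ols` and `loo_residuals` are names invented for this example.

```python
def fit_ols(x, y):
    """Ordinary least squares for the line y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b

def loo_residuals(x, y):
    """Leave-one-out residuals from a single fit on all the data."""
    n = len(x)
    a, b = fit_ols(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    out = []
    for xi, yi in zip(x, y):
        # Leverage of this point: the diagonal of the hat matrix for a
        # straight-line fit with an intercept.
        h = 1.0 / n + (xi - xbar) ** 2 / sxx
        # In-sample residual inflated by 1 / (1 - h).
        out.append((yi - (a + b * xi)) / (1.0 - h))
    return out
```

Squaring these values gives targets for the variance model that are never smaller than the squared in-sample residuals, which is what counteracts the downward bias.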

