How to optimize a regression by removing 10% “worst” data points

I would like to remove 10% of my data points (I consider them as outliers) to maximize the R squared. Is there a way to do so efficiently?

I know many people advise against removing outliers. But in this case, I would just like the regression model to represent 90% of the population.

You could try LTS (least trimmed squares) regression. It does not first reject outliers and then fit a regression; it effectively does both at once, with the outliers defined as the points that fit the regression model worst. There is an implementation in the R package MASS (on CRAN): the `lqs` function with `method = "lts"`. The estimators have no closed-form solution, so an approximate randomized search is used to find a close-to-optimal solution.
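To make the "both at once" idea concrete, here is the usual way the LTS criterion is written. The symbols $\beta$, $r_i$, $n$ and $h$ are notation introduced here for illustration and are not taken from the question; for the 10% trimming asked about, $h$ would be roughly $0.9\,n$.

```latex
\hat{\beta}_{\mathrm{LTS}}
  = \arg\min_{\beta} \sum_{i=1}^{h} r_{(i)}^{2}(\beta),
\qquad
r_i(\beta) = y_i - x_i^{\top}\beta,
\qquad
r_{(1)}^{2}(\beta) \le r_{(2)}^{2}(\beta) \le \dots \le r_{(n)}^{2}(\beta)
```

Only the $h$ smallest squared residuals enter the sum, so which points count as "trimmed" depends on $\beta$ itself. That coupling is why there is no closed-form solution.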

I think this is the closest fit for what you are asking for.
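As a rough illustration of how such a fit can be searched for, here is a minimal pure-Python sketch for a single predictor. It is not the algorithm used by the MASS package; it assumes a simple scheme of random two-point starts followed by "concentration" steps (refit on the best-fitting points, re-rank, repeat). The names `ols_fit`, `squared_residual` and `lts_fit`, and the parameters `trim_fraction`, `n_starts`, `max_csteps` and `seed`, are hypothetical and chosen only for this example.

```python
import random


def ols_fit(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0  # flat line if all x are equal
    return mean_y - slope * mean_x, slope


def squared_residual(point, intercept, slope):
    x, y = point
    return (y - intercept - slope * x) ** 2


def lts_fit(points, trim_fraction=0.10, n_starts=50, max_csteps=20, seed=0):
    """Approximate least trimmed squares fit (illustrative sketch only).

    Returns (objective, intercept, slope, kept_points), where the objective
    is the sum of the h smallest squared residuals and h is the number of
    points left after trimming `trim_fraction` of the data.
    """
    rng = random.Random(seed)
    n = len(points)
    h = n - int(round(trim_fraction * n))  # number of points to keep
    best = None
    for _ in range(n_starts):
        # Assumption: start each search from a random pair of points.
        subset = rng.sample(points, 2)
        for _ in range(max_csteps):
            a, b = ols_fit(subset)
            # Rank all points by how well the current line fits them.
            ranked = sorted(points, key=lambda p: squared_residual(p, a, b))
            if ranked[:h] == subset:
                break  # the kept set no longer changes
            subset = ranked[:h]
        a, b = ols_fit(subset)
        kept = sorted(points, key=lambda p: squared_residual(p, a, b))[:h]
        objective = sum(squared_residual(p, a, b) for p in kept)
        if best is None or objective < best[0]:
            best = (objective, a, b, kept)
    return best
```

Calling `lts_fit(data, trim_fraction=0.10)` on a list of `(x, y)` tuples returns the fitted line together with the 90% of points it was fitted to. The result is only a close-to-optimal solution: more random starts make it more likely that the best subset is found, at the cost of run time.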

Note that robustness can mean many different things, and that my answer emphasizes robustness with respect to the $X$-space. That is, LTS can be useful with low-quality data that may contain some observations that really do not belong there and do not correspond to the linear model you want to fit. Harrell's answer covers robustness with respect to the $Y$-space, which is a different case. From your post we cannot tell which of the two fits your situation; you must decide!
