I've been told not to trim the dependent variable in a regression, but I don't know why. It makes sense that I shouldn't select my sample based on the outcome, but what assumption does this violate? Is there a theoretical reason why I shouldn't do this? Thanks!
Update: By "trim" I mean discard outliers. On the right-hand side it is common (at least in financial economics) to discard or trim observations at 0.5% to 1% in either tail. I've been told that doing the same on the left-hand side is taboo, but I'm not sure exactly why.
I don't have a specific problem in mind. I just realized that I don't know the real reason, other than that you shouldn't pick your sample based on the outcome.
There is a simulated data set called outliers in the TeachingDemos package for R. If you remove the "outliers" using a common rule of thumb, then look at the data again and remove the points that are now outliers, and continue until you have no "outliers", you end up throwing away 75% of the data as "outliers". Are they really unusual if they are the majority of the data? The examples on the help page also show using this data for a regression model, and how throwing away half of the data as outliers does not make much difference.
This is intended as an illustration against using automated rules for throwing away data.
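To make the "keep removing until nothing is flagged" idea concrete, here is a minimal sketch in Python (standard library only). It does not use the TeachingDemos data. The skewed lognormal sample, the 1.5 * IQR boxplot rule, and the function names `keep_inliers` and `trim_until_stable` are all my own assumptions for illustration, so the fraction discarded will not match the 75% figure above.

```python
import random
import statistics

def keep_inliers(values, k=1.5):
    """Keep the values inside the boxplot fences (quartiles +/- k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

def trim_until_stable(values, k=1.5):
    """Reapply the rule until a pass flags nothing; return (kept, passes)."""
    kept = list(values)
    passes = 0
    while len(kept) >= 4:
        new_kept = keep_inliers(kept, k)
        if len(new_kept) == len(kept):
            break  # nothing was flagged on this pass
        kept = new_kept
        passes += 1
    return kept, passes

# Hypothetical skewed sample: every point comes from the same distribution,
# so none of them is an "error" in any meaningful sense.
rng = random.Random(0)
data = [rng.lognormvariate(0, 1.5) for _ in range(1000)]

one_pass = keep_inliers(data)
final, passes = trim_until_stable(data)

print("discarded after one pass:", len(data) - len(one_pass))
print("discarded after", passes, "passes:", len(data) - len(final))
```

Each pass shrinks the quartiles, which moves the fences inward and flags points that looked fine a moment ago. That is the mechanism behind the automated rule eating into perfectly ordinary data.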
The discovery of penicillin was, in effect, an outlier. Consider what the world would be like if that data point had been discarded instead of investigated.
There are more acceptable routines such as M-estimation or other robust regression techniques that downweight unusual observations rather than throwing them out.
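As a rough illustration of what downweighting looks like, here is a sketch of a Huber M-estimate for a straight-line fit, computed by iteratively reweighted least squares in plain Python. The simulated data, the tuning constant 1.345, the MAD-based scale, and the helper names `weighted_line` and `huber_line` are assumptions on my part. In practice you would use an existing robust regression routine rather than this hand-rolled version.

```python
import random
import statistics

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def huber_line(x, y, c=1.345, iters=50):
    """Huber M-estimate via iteratively reweighted least squares."""
    w = [1.0] * len(x)
    a, b = weighted_line(x, y, w)  # start from ordinary least squares
    for _ in range(iters):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median([abs(r) for r in resid]) / 0.6745
        if scale == 0:
            break
        cut = c * scale
        # Weight 1 for small residuals, shrinking as cut/|r| for large ones.
        w = [1.0 if abs(r) <= cut else cut / abs(r) for r in resid]
        a, b = weighted_line(x, y, w)
    return a, b, w

# Hypothetical data: true line y = 1 + 2x, with the 10 largest-x points
# shifted upward by 40 to play the role of unusual observations.
rng = random.Random(1)
n = 200
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
for i in sorted(range(n), key=lambda j: x[j])[-10:]:
    y[i] += 40

a_ols, b_ols = weighted_line(x, y, [1.0] * n)
a_hub, b_hub, w = huber_line(x, y)
print("OLS slope:  ", round(b_ols, 3))
print("Huber slope:", round(b_hub, 3))
print("smallest weight:", round(min(w), 3))
```

No observation is thrown away here. The shifted points stay in the fit with small weights, so they can still be inspected afterwards, and no threshold on the dependent variable decides who is in the sample.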