Solved – Top-coding and regression

I intend to run a multiple linear regression on data from a large US health survey. However, having looked at one of the income variables, it seems that it has been top-coded at the 95th percentile.

As such, I was wondering what the impact on the regression would be if I were to delete the observations that have been top-coded, and indeed whether this would be the best thing to do.

My understanding is that the top-coded data (which comprise 1.3% of the total data) would introduce a bias in the data and therefore in the regression model. So, as I mentioned, should they be removed?

Your help is much appreciated!

Top-coding (called capping below) can be thought of as a general strategy for treating outliers. Imagine that an extremely rich person (for example, Bill Gates) were part of the data. The income value for that person would be very high, but we know that the person is not representative of the general population.

There are two ways to handle it.

Capping/Top-coding Few Affected Variables

Capping is introduced so that the model does not learn to associate extremely high incomes with the outcome variable. At the same time, there could be other variables (say, height) on which Bill Gates is not an outlier. In that case it is better to keep the record for the person and cap only the outlying variables, such as income.
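As a rough illustration of capping one variable while keeping the row, here is a minimal Python sketch. The data, the column names (income, height_cm), the helper names and the nearest-rank percentile rule are all assumptions made for the example; they do not come from the survey in the question.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def cap_column(rows, column, q=95):
    """Return (new_rows, cap): copies of rows with `column` capped at its q-th percentile."""
    cap = percentile([r[column] for r in rows], q)
    capped_rows = [{**r, column: min(r[column], cap)} for r in rows]
    return capped_rows, cap

# Hypothetical data: 19 ordinary records plus one extreme income.
rows = [{"income": 30_000 + 1_000 * i, "height_cm": 160 + i} for i in range(19)]
rows.append({"income": 10_000_000_000, "height_cm": 177})

capped, cap = cap_column(rows, "income")

# The extreme record is still present; only its income has been pulled down to the cap.
print(cap, capped[-1])
```

The record keeps its height value and stays in the sample, so the other columns still contribute to the regression.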

Eliminating Rows

If you suspect that most of the variables/columns in the data are outliers for that person or record, you can eliminate the entire row. For example, because of his high income, Gates may always be treated at the best hospitals, and so on. So if the other variables, such as health indicators, are also likely to be outliers for the person, you eliminate the record. This case is expected to be rarer, though, since we hope nature does not bestow special powers on people with more money 🙂

Bottom line: check what the other variables are and then make the call.
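One possible way to turn this rule into code is sketched below. It is only an illustration under stated assumptions: a value counts as an outlier when it exceeds the column's 90th percentile, a row is dropped when more than half of the checked columns are outliers, and the data and column names (income, visits, score) are made up.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def drop_mostly_outlying_rows(rows, columns, q=90, max_share=0.5):
    """Keep rows whose share of outlying columns is at most max_share."""
    thresholds = {c: percentile([r[c] for r in rows], q) for c in columns}
    kept = []
    for r in rows:
        n_outlying = sum(r[c] > thresholds[c] for c in columns)
        if n_outlying / len(columns) <= max_share:
            kept.append(r)
    return kept

# Hypothetical data: 18 ordinary records and two high-income records.
rows = [
    {"id": i, "income": 30_000 + 1_000 * i, "visits": i, "score": 90 - i}
    for i in range(18)
]
# "A" is extreme on income only; "B" is extreme on every column.
rows.append({"id": "A", "income": 5_000_000, "visits": 5, "score": 80})
rows.append({"id": "B", "income": 9_000_000, "visits": 400, "score": 999})

kept = drop_mostly_outlying_rows(rows, ["income", "visits", "score"])
kept_ids = [r["id"] for r in kept]
print(len(kept), "A" in kept_ids, "B" in kept_ids)
```

Record A, which is unusual on income alone, survives and would be a candidate for capping instead. Only record B, which is unusual on every column, is removed.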
