Using age categories in regression

I used multiple age categories (under 18, 18-35, 36-44, etc.) in my survey, so I recoded each interval as a dummy variable. I know it's not 100% correct to treat something like age as a categorical variable. Is there any literature that argues for this method and justifies the way I did it?


Those categories were just an example I wrote to illustrate my issue. I actually binned age differently. How would you justify it? Do you have any literature to back this up?

It's true that categorizing a continuous variable can cause problems, but it can also make a complex relationship easier to approximate. For the same number of degrees of freedom, though, it may be preferable to fit a flexible linear model (e.g., a spline or polynomial model). That said, there is some (older) literature that justifies splitting into categories, as long as one splits into enough of them. Cochran (1968) and a follow-up by Becher (1992) both recommend splitting into at least 5 categories in order to minimize residual confounding. In general, more categories are better, but with too many you risk over-fitting and imprecise estimates, because each category ends up with too few individuals.
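To make the residual-confounding point concrete, here is a minimal simulation sketch. It is not taken from Cochran or Becher; the data-generating model, the sample size, and the function names `simulate` and `stratified_effect` are all assumptions chosen for illustration. Age drives both the exposure and the outcome, the true exposure effect is zero, and age is adjusted for by splitting it into k equal-frequency categories. Whatever effect is still estimated after adjustment is residual confounding.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Hypothetical data: age confounds the exposure-outcome relationship."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 80)
        # Probability of exposure rises linearly with age (from 0.2 to 0.8).
        p_exposed = 0.2 + 0.6 * (age - 18) / 62
        exposed = 1 if rng.random() < p_exposed else 0
        # The outcome depends on age only: the true exposure effect is 0.
        y = 0.1 * age + rng.gauss(0, 1)
        rows.append((age, exposed, y))
    return rows


def stratified_effect(rows, k):
    """Exposure effect adjusted for age via k equal-frequency age categories.

    Within each category, take the difference in mean outcome between
    exposed and unexposed, then average the differences weighted by
    category size.
    """
    rows = sorted(rows, key=lambda r: r[0])  # order by age
    n = len(rows)
    total, weight = 0.0, 0
    for i in range(k):
        stratum = rows[i * n // k:(i + 1) * n // k]
        y1 = [y for _, e, y in stratum if e == 1]
        y0 = [y for _, e, y in stratum if e == 0]
        if y1 and y0:  # skip a category with no exposed/unexposed contrast
            total += len(stratum) * (mean(y1) - mean(y0))
            weight += len(stratum)
    return total / weight


rows = simulate()
# Since the true effect is 0, each estimate is pure (residual) confounding bias.
bias = {k: stratified_effect(rows, k) for k in (1, 2, 5, 10)}
for k, b in bias.items():
    print(f"{k:>2} age categories: estimated exposure effect = {b:.3f}")
```

With k = 1 there is no adjustment at all, so the estimate reflects the full confounding by age. Under these assumptions the bias shrinks quickly as categories are added, which is the pattern behind the "at least 5 categories" advice. How fast it shrinks in real data depends on the shape of the age effect and on how age is distributed, so treat the numbers as illustrative only.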

Becher, H. (1992). The concept of residual confounding in regression models and some applications. Statistics in Medicine, 11(13), 1747–1758.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24(2), 295–313.
