I used multiple age categories (under 18, 18-35, 36-44, etc.) in my survey, so I recoded each interval as a dummy variable. I know it's not 100% correct to treat something like age as a categorical variable. Is there any literature that argues for this method and justifies the way I did it?

Best

Kevin

That was just an example I wrote to illustrate my issue. I actually binned it differently. How would you justify it? Do you have any literature to underpin this?


#### Best Answer

It's true that categorizing continuous variables can lead to some problems, but it can also help approximate a complex model more easily. For the same number of degrees of freedom, though, it might be preferable to fit a flexible linear model (e.g., a spline or polynomial model). That said, there is some (older) literature that justifies splitting into categories, as long as one splits into enough categories. Cochran (1968) and a follow-up by Becher (1992) both recommend splitting into at least 5 categories in order to minimize the amount of residual confounding. In general, more categories are better, but with too many you risk over-fitting and imprecise estimates, because too few individuals end up in each category.
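To make the "same number of degrees of freedom" comparison concrete, here is a minimal simulation sketch using only NumPy. Everything in it is an assumption for illustration: the ages and outcome are made up, the helper names `fit_rss` and `dummy_design` are hypothetical, and the bin edges are simply sample quintiles rather than the bins from the question. It fits the same outcome once with five age dummies and once with a quartic polynomial in age, so both models use five parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey data: ages 18-80 and an outcome that varies smoothly with age.
n = 2000
age = rng.uniform(18, 80, size=n)
y = 0.005 * (age - 45) ** 2 + rng.normal(0, 0.5, size=n)


def fit_rss(X, y):
    """Fit ordinary least squares and return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def dummy_design(age, edges):
    """One indicator column per age bin (no separate intercept column)."""
    idx = np.digitize(age, edges)  # bin index in 0..len(edges)
    return np.eye(len(edges) + 1)[idx]


# (a) Five age categories, cut at the sample quintiles: 5 parameters.
edges = np.quantile(age, [0.2, 0.4, 0.6, 0.8])
X_bins = dummy_design(age, edges)

# (b) Quartic polynomial in standardized age (columns z^0..z^4): also 5 parameters.
z = (age - age.mean()) / age.std()
X_poly = np.vander(z, N=5, increasing=True)

rss_bins = fit_rss(X_bins, y)
rss_poly = fit_rss(X_poly, y)
print(f"RSS, 5 age bins:         {rss_bins:.1f}")
print(f"RSS, quartic polynomial: {rss_poly:.1f}")
```

Because the simulated relationship is smooth by construction, the polynomial has the advantage here; the dummies treat everyone within a bin as identical, and the leftover within-bin trend is the kind of residual variation the references below discuss. With real survey data the gap could be smaller or larger, so this is an illustration of the trade-off rather than evidence for either choice.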

Becher, H. (1992). The concept of residual confounding in regression models and some applications. *Statistics in Medicine*, 11(13), 1747–1758. https://doi.org/10.1002/sim.4780111308

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. *Biometrics*, 24(2), 295–313.
