I used multiple age categories (under 18, 18-35, 36-44, etc.) in my survey, so I recoded each interval as a dummy variable. I know it's not 100% correct to treat something like age as a categorical variable. Is there any literature that argues for this method and would legitimize the way I did it?
Best
Kevin
Those categories were just an example to illustrate my issue. I actually binned age differently. How would you justify it? Do you have any literature to underpin this?
Best Answer
It's true that categorizing continuous variables can lead to some problems, but it can also help approximate a complex model more easily. For the same number of degrees of freedom, though, it might be preferable to fit a flexible linear model (e.g., a spline or polynomial model). That said, there is some (older) literature that justifies splitting into categories, as long as one splits into enough categories. Cochran (1968) and a follow-up by Becher (1992) both recommend splitting into at least 5 categories in order to minimize the amount of residual confounding. In general, more categories are better, but with too many you risk over-fitting and imprecise estimates, because each category ends up with too few individuals.
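As a rough illustration of the dummy coding discussed here, the sketch below bins age into six categories and builds 0/1 indicators with one category left out as the reference. Only the first three category labels come from the question; the remaining bin edges, the labels, the function names, and the choice of reference category are assumptions made up for the example.

```python
from bisect import bisect_right

# Hypothetical bin edges: the lower bound of every category after the first.
# Only "under18", "18-35" and "36-44" appear in the question; the rest are assumed.
EDGES = [18, 36, 45, 55, 65]
LABELS = ["under18", "18-35", "36-44", "45-54", "55-64", "65plus"]


def age_category(age):
    """Return the label of the bin that contains `age`."""
    return LABELS[bisect_right(EDGES, age)]


def dummy_code(age, reference="under18"):
    """Return one 0/1 indicator per category, omitting the reference category."""
    cat = age_category(age)
    return {label: int(label == cat) for label in LABELS if label != reference}


ages = [17, 18, 35, 36, 44, 70]
categories = [age_category(a) for a in ages]
rows = [dummy_code(a) for a in ages]

for age, cat, row in zip(ages, categories, rows):
    print(age, cat, row)
```

Six categories yield five dummy variables, so this uses five degrees of freedom for age. A spline or polynomial term with the same number of parameters would be the continuous alternative mentioned above.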
Becher, H. (1992). The concept of residual confounding in regression models and some applications. Statistics in Medicine, 11(13), 1747–1758. https://doi.org/10.1002/sim.4780111308
Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24(2), 295–313.