I am working with a dataset of 1000 individuals, 200 of whom are disease positive. I have run a logistic regression with 25 predictors to identify which variables are significantly predictive overall. Straightforward…
However, I also want to identify which variables account for the greatest amount of variability for males vs. females, and see if there are differences in which variables pop. I considered modeling gender x predictor interaction terms, but that essentially doubles my number of predictors. I proceeded with a forward logistic regression, and what I noticed was that by the last iteration the model correctly identified a high percentage of the non-disease group (>95%) but was very poor at correctly identifying the disease group. If anything, I would prefer a model that errs toward false positives (for clinical reasons)!
So I played around and took a random sample of 200 from the non-disease group and ran analyses with those individuals and found that the final iteration of the forward LR correctly predicted a high percentage of both groups. Therefore it seemed that using the whole sample yielded a model biased toward the larger group.
In reading through these pages and other sources, it seems that sub-sampling isn't viewed positively regarding LR, but I could not find anything about using it in an iterative, stepwise procedure.
So my questions are:
1) Is sub-sampling acceptable for a stepwise LR when the dichotomous outcome is this unbalanced?
2) If not, what other procedure(s) should I consider? (e.g., exact logistic regression?)
The simple answer is no: subsampling will not help.
I take subsampling to mean drawing a balanced sample, so that the proportion of events changes from 200/1000 to 200/400. That technique is used only in classification models and is (generally) of no use in maximum-likelihood / probability models.
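To see why the balanced sample looked better at classifying the disease group, here is a minimal sketch. It assumes the non-events are dropped at random (independently of the predictors) and that the logistic model is otherwise correctly specified. Under those assumptions, keeping all events and only a fraction of the non-events shifts the intercept by minus the log of that fraction and leaves the slopes alone, so every predicted probability moves up. The function name `population_probability` is made up for this illustration.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def population_probability(p_subsample, keep_fraction):
    """Map a predicted probability from a model fitted on all events plus a
    random fraction of the non-events back to the full-sample scale.

    Assumption: non-events were kept at random with probability keep_fraction,
    so the subsample odds equal the full-sample odds divided by keep_fraction.
    """
    return expit(logit(p_subsample) + math.log(keep_fraction))


# 200 of the 800 non-events were kept in the balanced sample.
keep_fraction = 200 / 800

for p in (0.3, 0.5, 0.7):
    print(p, round(population_probability(p, keep_fraction), 3))
```

In this sketch a predicted probability of 0.5 in the balanced sample corresponds to 0.2 in the full sample. So the balanced model with the usual 0.5 cut-off behaves roughly like the full-sample model with a 0.2 cut-off: what changed is the cut-off, not the quality of the model, and you threw away 600 observations to get there.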
What the comments are trying to suggest is that your question raises several larger issues, each of which could be a textbook chapter by itself:
- The effective sample size of a logistic model is measured by the number of events, and for model building by events per variable (EPV). With 200 events and 25 predictors you have 8 EPV (assuming all predictors are continuous; fewer otherwise), so your sample is small relative to your number of predictors, which is going to cause issues.
- Detecting interactions is notoriously difficult.
- Forward, backward, or combined variable selection methods have major issues. This is a popular topic on Cross Validated. The problem is likely overemphasized, and the issues are limited when EPV > 50. But this is the classic situation in which you are going to get misled by automated variable selection methods (Austin: "Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality").
- Variable selection is hard.
- Prediction models and descriptive models often require different methods. Not always, and the differences are often overemphasized. But from the limited information available, it seems these two goals are going to be difficult to combine in this case.
- When evaluating logistic models, avoid using the percentage correctly classified. Classification in logistic regression is often based on an arbitrary probability cut-off. For a fair review, see Steyerberg: "Assessing the performance…" http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184/
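Two of the points above are simple arithmetic, so here is a rough sketch of both. The first part computes EPV for the numbers in the question; the 51-parameter case assumes 25 main effects, gender, and 25 interactions, which is my reading of "essentially doubles". The second part shows how sensitivity and specificity move with the cut-off. The probabilities and outcomes in it are invented toy values, not results from any real model, and the function names are made up for the illustration.

```python
def events_per_variable(n_events, n_parameters):
    """Events per candidate parameter (EPV)."""
    return n_events / n_parameters


print(events_per_variable(200, 25))  # 25 main effects, as in the question
print(events_per_variable(200, 51))  # assumed: 25 main + gender + 25 interactions


def sensitivity_specificity(probs, outcomes, cutoff):
    """Classify as positive when prob >= cutoff; return (sensitivity, specificity)."""
    tp = sum(1 for p, y in zip(probs, outcomes) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, outcomes) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, outcomes) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, outcomes) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy predictions (hypothetical, for illustration only).
toy_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.45, 0.60, 0.80]
toy_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

for cutoff in (0.5, 0.2):
    sens, spec = sensitivity_specificity(toy_probs, toy_outcomes, cutoff)
    print(cutoff, sens, spec)
```

With these toy values, lowering the cut-off from 0.5 to 0.2 raises sensitivity from 0.5 to 1.0 and lowers specificity from 1.0 to 0.75, with the same fitted probabilities throughout. If false negatives are clinically worse than false positives, that preference belongs in the choice of cut-off, not in the sampling.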