I'm doing clinical research in medicine and have taken several statistics courses. I've never published a paper using linear/logistic regression and would like to do variable selection correctly. Interpretability is important, so no fancy machine learning techniques. I've summarized my understanding of variable selection – would someone mind shedding light on any misconceptions? I found two (1) similar (2) CV posts to this one, but they didn't quite fully answer my concerns. Any thoughts would be much appreciated! I have 3 primary questions at the end.
Problem and Discussion
My typical regression/classification problem has 200-300 observations, an adverse event rate of 15% (if classification), and info on 25 out of 40 variables that have been claimed to have a "statistically significant" effect in the literature or make plausible sense by domain knowledge.
I put "statistically significant" in quotes, because it seems like everyone and their mother uses stepwise regression, but Harrell (3) and Flom (4) don't appear to like it for a number of good reasons. This is further supported by a Gelman blog post discussion (5). It seems like the only real time that stepwise is acceptable is if this is truly exploratory analysis, or one is interested in prediction and has a cross-validation scheme involved. Especially since many medical comorbidities suffer from collinearity AND studies suffer from small sample size, my understanding is that there will be a lot of false positives in the literature; this also makes me less likely to trust the literature for potential variables to include.
Another popular approach is to use a series of univariate regressions/associations between each predictor and the dependent variable as a starting point, keeping those below a particular threshold (say, p < 0.2). This seems incorrect or at least misleading for the reasons outlined in this StackExchange post (6).
Lastly, an automated approach that appears popular in machine learning is to use penalization like L1 (Lasso), L2 (Ridge), or L1+L2 combo (Elastic Net). My understanding is that these do not have the same easy interpretations as OLS or logistic regression.
Gelman + Hill propose the following:
In my Stats course, I also recall using F tests or Analysis of Deviance to compare full and nested models to do model/variable selection variable by variable. This seems reasonable, but fitting sequential nested models systematically to find variables that cause largest drop in deviance per df seems like it could be easily automated (so I'm a bit concerned) and also seems like it suffers from problems of the order in which you test variable inclusion. My understanding is that this should also be supplemented by investigating multicollinearity and residual plots (residual vs. predicted).
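For concreteness, the nested-model comparisons described above are usually written as follows. The notation is introduced here purely for illustration and is an assumption, not something from the course or the cited posts: RSS_0 and D_0 are the residual sum of squares and deviance of the reduced model with p_0 parameters, RSS_1 and D_1 those of the full model with p_1 parameters (intercept included in both counts), and n is the number of observations.

```latex
% Partial F test for nested linear models (reduced model inside full model)
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(n - p_1)}
  \sim F_{\,p_1 - p_0,\; n - p_1} \quad \text{under } H_0

% Analysis of deviance for nested GLMs such as logistic regression
D_0 - D_1 \;\overset{\text{approx.}}{\sim}\; \chi^2_{\,p_1 - p_0} \quad \text{under } H_0
```

Both statistics compare one pre-specified pair of models. Applying them repeatedly, variable by variable, reuses the same data for every test, which is where the ordering and automation concerns come from.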
Is the Gelman summary the way to go? What would you add or change in his proposed strategy?
Aside from purely thinking about potential interactions and transformations (which seems very bias/error/omission prone), is there another way to discover potential ones? Multivariate adaptive regression spline (MARS) was recommended to me, but I was informed that the nonlinearities/transformations don't translate into the same variables in a standard regression model.
Suppose my goal is very simple: say, "I'd like to estimate the association of X1 with Y, only accounting for X2". Is it adequate to simply regress Y ~ X1 + X2 and report the outcome, without reference to actual predictive ability (as might be measured by cross-validation RMSE or accuracy measures)? Does this change depending on event rate or sample size or if R^2 is super low (I'm aware that R^2 is not good because you can always increase it by overfitting)? I am generally more interested in inference/interpretability than optimizing predictive power.
- "Controlling for X2, X1 was not statistically significantly associated with Y relative to X1's reference level." (logistic regression coefficient)
- "X1 was not a statistically significant predictor of Y since in the model drop in deviance was not enough relative to the change in df." (Analysis of Deviance)
Is cross-validation always necessary? If so, one might also want to balance the classes via SMOTE, resampling, etc.
Andrew Gelman is definitely a respected name in the statistical world. His principles closely align with some of the causal modeling research that has been done by other "big names" in the field. But I think given your interest in clinical research, you should be consulting other sources.
I am using the word "causal" loosely (as do others) because there is a fine line we must draw between performing "causal inference" from observational data, and asserting causal relations between variables. We all agree RCTs are the main way of assessing causality. We rarely adjust for anything in such trials per the randomization assumption, with few exceptions (Senn, 2004). Observational studies have their importance and utility (Weiss, 1989) and the counterfactual based approach to making inference from observational data is accepted as a philosophically sound approach to doing so (Höfler, 2005). It often approximates very closely the use-efficacy measured in RCTs (Anglemyer, 2014).
Therefore, I'll focus on studies from observational data. My point of contention with Gelman's recommendations is: all predictors in a model, and their posited causal relationships to a single exposure of interest and a single outcome of interest, should be specified a priori. Throwing in and excluding covariates based on their relationship to a set of main findings actually induces a special case of 'Munchausen's statistical grid' (Martin, 1984). Some journals (and the trend is catching on) will summarily reject any article which uses stepwise regression to identify a final model (Babyak, 2004), and I think the problem is seen in similar ways here.
The rationale for inclusion and exclusion of covariates in a model is discussed in Judea Pearl's Causality (Pearl, 2002). It is perhaps one of the best texts around for understanding the principles of statistical inference, regression, and multivariate adjustment. Also practically anything by Sander Greenland is illuminating, in particular his discussion of confounding, which is regrettably omitted from this list of recommendations (Greenland et al. 1999). Specific covariates can be assigned labels based on a graphical relation with a causal model. Designations such as prognostic, confounder, or precision variables warrant inclusion as covariates in statistical models. Mediators, colliders, or variables beyond the causal pathway should be omitted. The definitions of these terms are made rigorous with plenty of examples in Causality.
Given this little background I'll address the points one-by-one.
This is generally a sound approach with one MAJOR caveat: these variables must NOT be mediators of the outcome. If, for instance, you are inspecting the relationship between smoking and physical fitness, and you adjust for lung function, you attenuate the effect of smoking, because its direct impact on fitness is that of reducing lung function. This should NOT be confused with confounding, where the third variable is causal of the predictor of interest AND the outcome of interest. Confounders must be included in models. Additionally, overadjustment can cause multiple forms of bias in analyses. Mediators and confounders are deemed as such NOT because of what is found in analyses, but because of what is BELIEVED by YOU as the subject-matter-expert (SME). If you have 20 observations per variable or fewer, or 20 observations per event in time-to-event or logistic analyses, you should consider conditional methods instead.
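The mediator/confounder distinction can be made concrete with a small simulation. This is a minimal sketch on made-up data, assuming linear effects and standard normal noise. The variable names, effect sizes, and the helper `slope_of_first` are all hypothetical choices for illustration, not estimates from any real study.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000  # large n so sampling noise does not obscure the point


def slope_of_first(y, *cols):
    """Fit OLS with an intercept and return the coefficient on the first predictor."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])


# Scenario 1 (mediator): smoking -> lung function -> fitness, no other path.
# Assumed effects: smoking lowers lung function by 1, lung function raises fitness by 2,
# so the total effect of smoking on fitness is -2.
smoking = rng.normal(size=n)
lung = -1.0 * smoking + rng.normal(size=n)
fitness = 2.0 * lung + rng.normal(size=n)

total_effect = slope_of_first(fitness, smoking)             # close to -2
mediator_adjusted = slope_of_first(fitness, smoking, lung)  # close to 0: effect adjusted away

# Scenario 2 (confounder): age -> exposure and age -> outcome.
# Assumed true effect of exposure on outcome is 0.5.
age = rng.normal(size=n)
exposure = 0.7 * age + rng.normal(size=n)
outcome = 0.5 * exposure + 1.5 * age + rng.normal(size=n)

unadjusted = slope_of_first(outcome, exposure)               # biased upward by the open path via age
confounder_adjusted = slope_of_first(outcome, exposure, age)  # close to 0.5

print(f"mediator:   total={total_effect:.2f}, adjusted for mediator={mediator_adjusted:.2f}")
print(f"confounder: unadjusted={unadjusted:.2f}, adjusted={confounder_adjusted:.2f}")
```

The regression code is identical in both scenarios. Only the assumed causal structure says whether adjusting helps or hurts, which is why the labels have to come from subject-matter belief rather than from the fitted output.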
This is an excellent power saving approach that is not so complicated as propensity score adjustment or SEM or factor analysis. I would definitely recommend doing this whenever possible.
I disagree wholeheartedly. The point of adjusting for other variables in analyses is to create strata for which comparisons are possible. Misspecifying confounder relations does not generally lead to heavily biased analyses, so residual confounding from omitted interaction terms is, in my experience, not a big issue. You might, however, consider interaction terms between the predictor of interest and other variables as a post-hoc analysis. This is a hypothesis-generating procedure that is meant to refine any possible findings (or lack thereof) as (a) potentially belonging to a subgroup or (b) involving a mechanistic interaction between two environmental and/or genetic factors.
I also disagree with this wholeheartedly. It does not coincide with the confirmatory analysis based approach to regression. You are the SME. The analyses should be informed by the QUESTION and not the DATA. State with confidence what you believe to be happening, based on a pictorial depiction of the causal model (using a DAG and related principles from Pearl et al.), then choose the predictors for your model of interest, fit, and discuss. Only as a secondary analysis should you consider this approach, if at all.
The role of machine learning in all of this is highly debatable. In general, machine learning is focused on prediction rather than inference, and these are distinct approaches to data analysis. You are right that effects from penalized regression are not easily interpreted by a non-statistical community, unlike estimates from an OLS model, where 95% CIs and coefficient estimates provide a measure of association.
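One reason penalized coefficients are harder to read is that they are deliberately shrunk toward zero, so they no longer estimate the same quantity as an OLS slope. The sketch below illustrates this with closed-form ridge regression on simulated data. The data, the true coefficients, the penalty value of 50, and the helper `ridge` are all assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 250, 5

# Simulated predictors; column 1 is made strongly collinear with column 0.
X = rng.normal(size=(n, p))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=n)
true_beta = np.array([1.0, 1.0, 0.5, 0.0, 0.0])  # hypothetical true effects
y = X @ true_beta + rng.normal(size=n)

# Center predictors and outcome so the intercept is not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def ridge(Xc, yc, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y; lam = 0 gives OLS."""
    k = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)


beta_ols = ridge(Xc, yc, 0.0)
beta_ridge = ridge(Xc, yc, 50.0)  # penalty strength chosen arbitrarily

print("OLS:  ", np.round(beta_ols, 2))
print("ridge:", np.round(beta_ridge, 2))
print("norms:", np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The ridge coefficient vector always has a smaller norm than the OLS one for a positive penalty. The price of the reduced variance is bias, so the usual "expected difference in Y per unit of X" reading and the standard confidence intervals do not carry over directly.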
The interpretation of the coefficient from an OLS model Y~X is straightforward: it is a slope, an expected difference in Y comparing groups differing by 1 unit in X. In a multivariate adjusted model Y~X1+X2 we modify this as a conditional slope: it is an expected difference in Y comparing groups differing by 1 unit in X1 who have the same value of X2. Geometrically, adjusting for X2 leads to distinct strata or "cross sections" of the three space where we compare X1 to Y, then we average up the findings over each of those strata. In R, the coplot function is very useful for visualizing such relations.
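The "strata, then average" picture can be checked numerically. The following is a rough sketch on simulated data, with assumed coefficients (1 for X1, 2 for X2) and an arbitrary choice of 20 quantile strata of X2. It compares the marginal slope, the adjusted slope from Y~X1+X2, and the average of within-stratum slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: X1 is correlated with X2, and both affect Y.
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)


def ols(y, *cols):
    """OLS with intercept; returns [intercept, coefficients...]."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


b_marginal = float(ols(y, x1)[1])         # Y ~ X1: mixes in the effect of X2
b_conditional = float(ols(y, x1, x2)[1])  # Y ~ X1 + X2: slope at fixed X2

# Cut X2 into 20 quantile strata, fit Y ~ X1 inside each, then average the slopes.
edges = np.quantile(x2, np.linspace(0, 1, 21))
stratum = np.digitize(x2, edges[1:-1])    # stratum labels 0..19
slopes = [ols(y[stratum == k], x1[stratum == k])[1] for k in range(20)]
b_stratified = float(np.mean(slopes))

print(f"marginal slope:         {b_marginal:.2f}")
print(f"conditional slope:      {b_conditional:.2f}")
print(f"mean within-strata fit: {b_stratified:.2f}")
```

The stratified average lands near the adjusted coefficient rather than the marginal one. It is only approximate because X2 still varies a little within each stratum, most noticeably in the two tail strata.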