I have a binary classification problem with 5K records and 60+ features/columns/variables. The dataset is slightly imbalanced (if 33:67 even counts as imbalanced), with a 33:67 class proportion.

What I did was

**1st**) Run a logistic regression (statsmodels) with all 60+ columns as input (thereby controlling for confounders) and identify the significant risk factors (p < 0.05) from the summary output. Through this approach I don't have to worry about confounders, because they are controlled via the multivariable regression. I need to know that my risk factors are significant, meaning I build the predictive model on the basis of significant features. I say this because in a field like medical science / clinical studies, I believe it is also important to know the causal effect. If you wish to publish in a journal, do you think we can just list the variables based on a feature-importance approach (the results of which differ for each feature-selection method)? Of course, I do find some features common across all feature-selection algorithms, but is this enough to justify that these are meaningful predictors? Hence I was hoping that p-values would convince people that these are significant predictors.

**2nd**) Use the identified 7 significant risk factors to build a classification ML model

**3rd**) It yielded an AUC of around 82%

Now my question is

**1**) Out of the 7 significant factors identified, 5 are already known risk factors from domain experience and the literature. So we consider the remaining 2 as new factors that we found, perhaps because we had a very good data-collection strategy (we collected data for new variables that previous studies didn't have).

**2**) But when I build a model with only the 5 already-known features, it produces an AUC of `82.1`. When I include all 7 significant features, it still produces an AUC of `82.1-82.3`, or sometimes it even goes down to `81.8-81.9`. Not much improvement. Why is this happening?

**3**) If they are of no use, how did the statsmodels logistic regression identify them as significant features (p < 0.05)?

**4**) I guess we could look at any metric, but as my data is slightly imbalanced (33:67 class proportion), I am using only metrics like AUC and F1 score. Should I be looking at accuracy only?

**5**) Should I balance the dataset before using statsmodels logistic regression to identify the risk factors from the summary output? I use tree-based models later for the classification, and those can handle imbalance well, so I didn't balance. Basically, what I am trying to find out is: even for significant-factor identification with statsmodels logistic regression, should I balance the dataset?

**6**) Can you let me know what the problem is here and how I can address it?

**7**) How much of an improvement in performance is considered valid/meaningful enough to count as a new finding?


#### Best Answer

A few general points before answering the individual questions.

First, in logistic regression (unlike in linear regression) coefficient estimates will be biased if you omit *any* predictor associated with outcome whether or not it is correlated with the included predictors. This page gives an analytic demonstration for the related probit regression.

Second, it's not necessary (even if it's desirable) to know the mechanism through which a predictor is related to outcome. If it improves outcome prediction (either on its own or as a control for other predictors) it can be useful. Answering the question "does [this] new feature really affect/explain the outcome behavior?" generally can't be done by statistical modeling; modeling like yours can point the way to the more detailed experimental studies needed to get at the mechanism.

Third, class imbalance problems typically arise from using an improper scoring rule or from just not having enough members of the minority class to get good estimates. See this page among many on this site. Your nicely designed study has over 1500 in the minority class, so the latter is certainly not a problem. Accuracy and F1 score are not strictly proper scoring rules, and the AUC (equivalent to the concordance or C-index) is not very sensitive for detecting differences among models (note that these issues are essentially the same in survival modeling or in logistic regression). So concentrate on using a correct and sensitive measure of model quality.
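To see why a strictly proper scoring rule is more sensitive than AUC, consider two sets of predicted probabilities with identical rankings (hence identical AUC) but different calibration — a hypothetical construction on synthetic data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

rng = np.random.default_rng(0)

# ~33:67 class proportion, as in the question.
y = rng.binomial(1, 0.33, size=5000)

# Two predictors with the SAME ordering of cases: p_shrunk is a strictly
# monotone transform of p_good, so their AUCs are identical, but p_shrunk
# is badly calibrated (all probabilities squeezed toward 0.5).
p_good = np.where(y == 1, 0.6, 0.3) + rng.uniform(-0.2, 0.2, size=5000)
p_shrunk = 0.5 + (p_good - 0.5) * 0.2

print("AUC:     ", roc_auc_score(y, p_good), "vs", roc_auc_score(y, p_shrunk))
print("Brier:   ", brier_score_loss(y, p_good), "vs", brier_score_loss(y, p_shrunk))
print("Log-loss:", log_loss(y, p_good), "vs", log_loss(y, p_shrunk))
```

AUC cannot distinguish the two models at all; the Brier score and log-loss (both strictly proper) immediately penalize the miscalibrated one.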

Fourth, even with your sample size using a single test/train split instead of modeling-process validation by bootstrapping might be leading you astray. See this page and its links. *With bootstrapping you take several hundred samples of the same size as your data set, but with replacement, after you have built your model on the entire data set. You do not set aside separate training, validation, and test sets; you use all of the data for the model building and evaluation process. Bootstrapping mimics the process of taking your original sample from the underlying population. You repeat the entire model building process (including feature selection steps) on each bootstrap sample and test, with appropriate metrics, the performance of each model on the full original data set. Then pool the results over all the models from the bootstraps. You can evaluate bias and optimism/overfitting with this approach, and if you are doing feature selection you can compare among the hundreds of models to see the variability among the selected features.*
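The optimism-corrected bootstrap described above can be sketched in Python as follows (the rms package in R automates this; the function name, synthetic data, and the choice of AUC as the illustrated metric are my own, and the same loop works for any metric):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Harrell-style optimism correction: fit on the full data, then
    estimate optimism by refitting on bootstrap resamples and testing
    each bootstrap model against the original data."""
    rng = np.random.default_rng(seed)
    # Large C makes sklearn's logistic regression essentially unpenalized.
    full = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, full.predict_proba(X)[:, 1])
    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample WITH replacement
        Xb, yb = X[idx], y[idx]
        if yb.min() == yb.max():             # skip degenerate resamples
            continue
        m = LogisticRegression(C=1e6, max_iter=1000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, m.predict_proba(Xb)[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)  # how much the resample flatters
    return apparent - np.mean(optimism)

# Hypothetical illustration on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
beta = np.zeros(10); beta[:3] = 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 0.7))))
print(optimism_corrected_auc(X, y, n_boot=50))
```

If the modeling process includes feature selection, that selection must be repeated inside the loop on each resample, exactly as the text above says.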

Fifth, with respect to feature selection, predictors in clinical data are often highly inter-correlated in practice. In such cases the specific features selected by any method will tend to depend on the particular sample you have in hand. You can check this for yourself with the bootstrapping approach described above. That will be true of any modeling method you choose. That is one of many reasons why you will find little support on this site for automated model selection. In any case, the initial choice of features to evaluate should be based on your knowledge of the subject matter.

So with respect to the questions:

**1**) Congratulations on identifying 2 new risk factors associated with outcome. A predictive model certainly should include them if they are going to be generally available to others in your field. Under the first and second general points above, however, you might want to reconsider removing from your model *any* predictors that might, based on your knowledge of the subject matter, be associated with outcome. With over 1500 in the minority class you are unlikely to be overfitting with 60 features (if they are all continuous or binary categorical). The usual rule of thumb of 15 minority-class members per evaluated predictor would allow you up to 100 predictors (including levels of categorical variables beyond the second and including interaction terms). If any predictor is going to be available in practice and is expected to be related to outcome based on your knowledge of the subject matter, there's no reason to remove it just because it's not "statistically significant."

**2**) The third and fourth general points above might account for this finding. AUC is not a very sensitive measure for comparing models, and using a fixed test/train split could lead to split-dependent imbalances that would be avoided if you did bootstrap-based model validation, as for example with the rms package in R. That leads to:

**3**) A logistic regression model optimizes a log-loss, effectively a strictly proper scoring rule that would be expected to be more sensitive than AUC. Note that the size of your study will make it possible to detect "significance" at *p* < 0.05 for smaller effects than would be possible with a smaller study. Use your knowledge of the subject matter to decide if these statistically significant findings are likely to be clinically significant.

**4**) Avoid accuracy. Avoid F1. Be cautious in using AUC. Use a strictly proper scoring rule.

**5**) See the third general point above. If your ultimate goal is to use something like boosted classification trees then there is probably no need to do this preliminary logistic regression. Note, however, that a well calibrated logistic regression model can be much easier to interpret than any but the simplest (and potentially most unreliable) tree models. And make sure that your optimization criterion in a tree model provides a proper scoring rule; once again, avoid accuracy as a criterion.
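As a hypothetical illustration of that comparison (synthetic data with a purely linear signal on the log-odds scale; scikit-learn's gradient boosting uses the log-loss as its objective by default):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data whose true signal is linear in the log-odds: a plain
# logistic regression, judged by a proper scoring rule, should be
# competitive with a boosted tree model here.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1] - 0.7))))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)  # log-loss objective

print("logistic log-loss:", log_loss(y_te, lr.predict_proba(X_te)[:, 1]))
print("boosting log-loss:", log_loss(y_te, gb.predict_proba(X_te)[:, 1]))
```

Compare models on held-out log-loss (or, better, with the bootstrap validation described above) rather than on accuracy.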

**6**) There really is no problem. Bootstrap-based logistic model validation and calibration instead of the single fixed test/train split could provide a much better sense of how your model will perform on new data. If your model is well calibrated (e.g., linearity assumptions hold) then you could use the logistic regression model directly instead of going on to a tree-based model. If you need to make a yes/no decision based solely on the model, choose a probability cutoff that represents the tradeoff between false-negative and false-positive findings.
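Choosing that cutoff can be made explicit by assigning (hypothetical) costs to the two kinds of error; with well-calibrated probabilities the cost-minimizing cutoff is approximately cost_fp / (cost_fp + cost_fn). A sketch, with made-up cost values:

```python
import numpy as np

def cost_minimizing_cutoff(y_true, p_hat, cost_fn=5.0, cost_fp=1.0):
    """Scan cutoffs and pick the one minimizing total misclassification
    cost, where a false negative costs `cost_fn` and a false positive
    costs `cost_fp`.  The costs here are hypothetical -- set them from
    the clinical consequences of each kind of error."""
    cutoffs = np.linspace(0.01, 0.99, 99)
    costs = [cost_fn * np.sum((p_hat < c) & (y_true == 1))
             + cost_fp * np.sum((p_hat >= c) & (y_true == 0))
             for c in cutoffs]
    return float(cutoffs[int(np.argmin(costs))])

# With well-calibrated probabilities the empirical minimum should land
# near the theoretical optimum cost_fp / (cost_fp + cost_fn) = 1/6.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 20000)
y = rng.binomial(1, p)   # outcomes drawn at exactly the stated probabilities
print(cost_minimizing_cutoff(y, p))
```

Note that a cutoff is only meaningful if the probabilities are calibrated, which is another reason to validate calibration rather than just discrimination.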

**7**) The answer to your last question depends on your knowledge of the subject matter. Again, this is the issue of statistical significance versus clinical significance. Only you and your colleagues in the field can make that determination.
