I have seen two approaches in binary logistic regression for handling categorical independent variables (IVs) with more than two levels. In the first, a reference category is defined for the IV and each of the other categories is tested against it, which yields one p-value per category (this is what I typically do). However, I have also seen logistic regression output that reports a single overall (or global) significance for a categorical IV, i.e. only one p-value. I don't understand the second approach. I have read similar threads, but they do not resolve my specific questions:

- What additional information does the second approach really provide? If the overall test is significant, wouldn't there be differences between some of the categories?
- Does the second approach assume that the IV is continuous (providing an estimate by unit of change in X)?
- Could it happen that there were differences between the categories of an IV, but the overall test was not significant?

Perhaps they are basic questions, but I would appreciate your help.


#### Best Answer

I think you're referring to a likelihood ratio test.

> Could it happen that there were differences between the categories of an IV, but the overall test was not significant?

The null hypothesis of the LRT is that *all coefficients for the categorical variable are 0*, with the alternative being that *at least one coefficient is not 0*.
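In symbols, this can be sketched as follows. The notation is mine, not the question's: I assume dummy (treatment) coding with category 1 as the reference, so a variable with $K$ levels contributes the coefficients $\beta_2, \dots, \beta_K$, and I write $D$ for the residual deviance and $\ell$ for the maximized log-likelihood of the models with ("full") and without ("reduced") the categorical variable.

$$
H_0:\ \beta_2 = \beta_3 = \dots = \beta_K = 0
\qquad \text{vs.} \qquad
H_1:\ \beta_j \neq 0 \ \text{for at least one } j
$$

$$
G = D_{\text{reduced}} - D_{\text{full}} = -2\left(\ell_{\text{reduced}} - \ell_{\text{full}}\right),
\qquad
G \ \overset{\text{approx.}}{\sim}\ \chi^2_{K-1} \ \text{under } H_0
$$

Note that the test has $K-1$ degrees of freedom, one per non-reference category, so the variable is still treated as categorical and not as a continuous predictor with a single slope.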

It could: you can fail to reject the null of the LRT and still find a significant difference for an individual category. The two results aren't mutually exclusive, as the example below shows.

> What additional information does the second approach really provide? If the overall test is significant, wouldn't there be differences between some of the categories?

Looking at the p-values of the individual dummy coefficients only tells us whether each category differs significantly from the reference category. It does not tell us about the categorical variable as a whole.

Here is an example in R:

```r
set.seed(0)
N = 100
cat = factor(sample(1:5, N, replace = TRUE))
x = rnorm(N)

# True model: intercept 1, slope 2 for x, and all four cat coefficients equal to 0
eta = model.matrix(~x+cat) %*% c(1,2,0,0,0,0)
p = 1/(1+exp(-eta))
y = rbinom(length(p), 1, p)

model = glm(y~x+cat, family = binomial())
summary(model)
```

```
Call:
glm(formula = y ~ x + cat, family = binomial())

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0345  -0.7243   0.2921   0.6635   1.8355

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.6631     0.5648   1.174   0.2404
x             1.9981     0.4647   4.299 1.71e-05 ***
cat2          0.8766     0.8555   1.025   0.3056
cat3          0.3210     0.8327   0.386   0.6998
cat4          1.1713     0.8468   1.383   0.1666
cat5          1.8251     0.8722   2.093   0.0364 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 122.173  on 99  degrees of freedom
Residual deviance:  86.275  on 94  degrees of freedom
AIC: 98.275

Number of Fisher Scoring iterations: 5
```

A priori, we know that the categories have no effect on the outcome, and yet `cat5` comes out as significant. So if we did not have access to the true data generating mechanism, we may be tempted to say that the `cat` variable has an impact on the outcome.

But that would be erroneous, since we are basing our decision on only one category of the variable. To determine if a model with the `cat` variable does better than a model without the `cat` variable, we can do a likelihood ratio test.

```r
# Reduced model without the categorical variable
model0 = glm(y~x, family = binomial())
anova(model0, model, test = 'LRT')
```

```
Analysis of Deviance Table

Model 1: y ~ x
Model 2: y ~ x + cat
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        98     92.167
2        94     86.275  4   5.8923   0.2073
```
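The p-value in this table can be checked by hand, which also makes clear what the test compares: the drop in residual deviance between the two models is referred to a chi-squared distribution whose degrees of freedom equal the number of extra coefficients. A small sketch using only the numbers printed above (the variable names are mine):

```r
# Residual deviances and degrees of freedom copied from the table above
dev_reduced <- 92.167   # model 1: y ~ x
dev_full    <- 86.275   # model 2: y ~ x + cat
df_diff     <- 98 - 94  # 4 extra coefficients: cat2, cat3, cat4, cat5

# LRT statistic = drop in deviance when cat is added
lrt_stat <- dev_reduced - dev_full

# Upper-tail chi-squared probability with df_diff degrees of freedom
p_value <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)
print(round(c(statistic = lrt_stat, p = p_value), 4))
```

This should agree with the `Pr(>Chi)` column up to rounding of the printed deviances.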

We fail to reject the null from this test. That means that, from our data, we cannot say that at least one of the coefficients of the `cat` variable is different from 0. And that would be correct. That `cat5` is significant is just an artifact of sampling and random error.
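To get a feel for how often such an artifact shows up, one could repeat the experiment many times under the same true model, where `cat` has no effect, and count how often each approach raises a false alarm. The following is only a rough sketch: the number of repetitions, the seed, the 5% threshold and the helper name `one_run` are my own choices, not part of the example above.

```r
set.seed(1)
n_sims <- 200   # number of simulated datasets (arbitrary)
N <- 100        # observations per dataset, as in the example

one_run <- function() {
  cat <- factor(sample(1:5, N, replace = TRUE))
  x <- rnorm(N)
  p <- plogis(1 + 2 * x)            # same true model: cat has no effect
  y <- rbinom(N, 1, p)

  full    <- glm(y ~ x + cat, family = binomial())
  reduced <- glm(y ~ x, family = binomial())

  # Wald p-values of the dummy coefficients (each category vs. the reference)
  coefs  <- coef(summary(full))
  wald_p <- coefs[grepl("^cat", rownames(coefs)), "Pr(>|z|)"]

  # Single p-value of the overall likelihood ratio test
  lrt_p <- anova(reduced, full, test = "LRT")[2, "Pr(>Chi)"]

  c(any_wald = any(wald_p < 0.05), lrt = lrt_p < 0.05)
}

results <- replicate(n_sims, one_run())
rates <- rowMeans(results)   # share of datasets with a false alarm, per approach
print(rates)
```

Because the first approach makes several tests per dataset, one would expect its false alarm rate to sit above the nominal 5%, while the overall test should stay close to it. The exact figures depend on the seed and the number of repetitions.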
