I've been doing some work on regression, and one paper in particular caught my attention because it used two categorical variables in a logistic regression. From my understanding, if both variables are categorical, doesn't the problem just come down to conditional probabilities? Out of interest I wrote some code and was surprised to see that the method returned a p-value (reproducible code below). I'm struggling to understand what the null and alternative hypotheses are in this context, so that I can interpret what the p-value is telling me. Any assistance in dissecting the resulting model summary would be greatly appreciated.
Example interpretation:
P(Dep = Class1 | Pred = C) = 20/40 = 1/2
Is the model summary telling me this?
Code
foo <- data.frame(Pred = c(rep("A", 80), rep("B", 20), rep("C", 40), rep("D", 60)),
                  Dep  = c(rep("Class1", 120), rep("Class2", 80)))
fit <- glm(Dep ~ Pred, family = binomial(link = 'logit'), data = foo)
summary(fit)
Output
Call:
glm(formula = Dep ~ Pred, family = binomial(link = "logit"), data = foo)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.17741  -0.00003  -0.00003   0.00003   1.17741  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.157e+01  3.268e+03  -0.007    0.995
PredB       -7.168e-11  7.308e+03   0.000    1.000
PredC        2.157e+01  3.268e+03   0.007    0.995
PredD        4.313e+01  4.992e+03   0.009    0.993

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 269.205  on 199  degrees of freedom
Residual deviance:  55.452  on 196  degrees of freedom
AIC: 63.452

Number of Fisher Scoring iterations: 20
Best Answer
Your intuition is right.
The logistic regression model gives the same inference as a Pearson chi-squared test of independence for categorical data (asymptotically). In both cases the null hypothesis is that the conditional probabilities are equal to the marginal probabilities, and a little algebra shows that this necessarily implies the odds ratio is 1.
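To spell that algebra out (a quick sketch in the model's own parametrisation, with A as the reference level and $\pi_j = P(\mathrm{Dep} = \mathrm{Class1} \mid \mathrm{Pred} = j)$):

$$
\mathrm{H}_0:\; \pi_A=\pi_B=\pi_C=\pi_D=P(\mathrm{Dep}=\mathrm{Class1})
\;\Longrightarrow\;
\log\frac{\pi_j}{1-\pi_j}=\beta_0 \text{ for every } j
\;\Longrightarrow\;
\beta_B=\beta_C=\beta_D=0,\;\; e^{\beta_j}=1 .
$$

That is, "the conditional probabilities equal the marginal probability" is exactly the statement that every dummy coefficient is zero, i.e. every odds ratio is 1.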
The minute differences between the actual logistic regression and Pearson test statistics owe to how they're computed. The $p$-values you get in R from calling summary.glm come from a Wald test, whereas the Pearson test is a closely related score test.
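To see concretely that the fitted model is just the conditional probabilities, here is a minimal sketch reusing foo and fit from the question (note that glm treats Class2, the second factor level, as the modelled event; newdat and p_Class2 are just illustrative names):

## observed conditional probabilities P(Dep | Pred), one row per Pred level
prop.table(table(foo$Pred, foo$Dep), margin = 1)

## fitted P(Dep = "Class2" | Pred) from the logistic model; up to numerical
## tolerance these match the observed proportions above
newdat <- data.frame(Pred = c("A", "B", "C", "D"))
cbind(newdat, p_Class2 = predict(fit, newdata = newdat, type = "response"))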
The problem with your example is that your data are quasi-separated (levels A and B contain only Class1, and level D only Class2), so the maximum-likelihood estimates and their standard errors blow up, which is why the coefficients and Wald $p$-values above look so strange. One advantage of the score test is that it can still provide a sensible test statistic for models like this one which "explode". For a saner example, consider the following:
set.seed(123)
foo2 <- as.data.frame(sapply(foo, sample))  ## permute each column to break the separation
fit <- glm(Dep ~ Pred, data = foo2, family = binomial)

library(lmtest)
waldtest(fit, test = 'Chisq')
chisq.test(table(foo2))
Gives us:
> waldtest(fit, test = 'Chisq')
Wald test

Model 1: Dep ~ Pred
Model 2: Dep ~ 1
  Res.Df Df  Chisq Pr(>Chisq)
1    196                     
2    199 -3 1.7774     0.6199

> chisq.test(table(foo2))

	Pearson's Chi-squared test

data:  table(foo2)
X-squared = 1.7882, df = 3, p-value = 0.6175
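As a side note (a sketch, not part of the original answer): base R's anova.glm can report the Rao score test directly via test = "Rao", and because the score test is evaluated under the null it does not rely on the exploding estimates, so it can be applied even to the original separated fit on foo. For a single categorical predictor it should reproduce the Pearson chi-squared statistic (chisq.test applies no continuity correction for tables larger than 2x2, so the two numbers should agree). fit_sep below is just a name for the refit.

## score (Rao) test from the original, separated model on foo
fit_sep <- glm(Dep ~ Pred, family = binomial, data = foo)  ## same model as in the question
anova(fit_sep, test = "Rao")   ## the "Rao" column holds the score statistic

## Pearson test on the same contingency table, for comparison
chisq.test(table(foo$Pred, foo$Dep))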