# Solved – Logistic Regression using two categorical variables

I've been doing some work on regression, and one paper in particular caught my attention: it used two categorical variables in a logistic regression. From my understanding, if both variables are categorical, doesn't the problem just come down to conditional probabilities? Out of interest, I wrote some code and was surprised to see that the method returned a p-value (reproducible code below). I'm struggling to understand what the null and alternative hypotheses are in this context, so that I can interpret what the p-value is telling me. Any assistance in dissecting the resulting model summary would be greatly appreciated.

Example interpretation:

P(Dep = Class1 | Pred = C) = 20/120 = 1/6

Is the model summary telling me this?

Code

```r
foo <- data.frame(Pred = c(rep("A", 80), rep("B", 20),
                           rep("C", 40), rep("D", 60)),
                  Dep  = c(rep("Class1", 120),
                           rep("Class2", 80)))

fit <- glm(Dep ~ Pred, family = binomial(link = "logit"), data = foo)
summary(fit)
```

Output

```
Call:
glm(formula = Dep ~ Pred, family = binomial(link = "logit"),
    data = foo)

Deviance Residuals:
      Min         1Q     Median         3Q        Max
 -1.17741   -0.00003   -0.00003    0.00003    1.17741

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.157e+01  3.268e+03  -0.007    0.995
PredB       -7.168e-11  7.308e+03   0.000    1.000
PredC        2.157e+01  3.268e+03   0.007    0.995
PredD        4.313e+01  4.992e+03   0.009    0.993

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 269.205  on 199  degrees of freedom
Residual deviance:  55.452  on 196  degrees of freedom
AIC: 63.452

Number of Fisher Scoring iterations: 20
```
## Answer

The logistic regression model gives the same inference as a Pearson chi-squared test of independence for categorical data (asymptotically). In both cases, the null hypothesis is that the conditional probabilities equal the marginal probabilities, i.e. that `Dep` and `Pred` are independent. A little algebra shows that this necessarily implies every odds ratio is 1.
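As a quick check of the "conditional probabilities" intuition, reusing the `foo` and `fit` objects from the question: the fitted probabilities from the logistic model match the conditional sample proportions (up to numerical tolerance, since this particular fit is separated). Note that `glm` models the probability of the second factor level, here `"Class2"`.

```r
## Conditional sample proportions of Dep within each level of Pred
prop.table(table(foo$Pred, foo$Dep), margin = 1)

## Fitted P(Dep = "Class2" | Pred) from the logistic model;
## these match the proportions above (row for "Class2")
predict(fit, newdata = data.frame(Pred = c("A", "B", "C", "D")),
        type = "response")
```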

The minute differences between the actual logistic regression and Pearson test statistics owe to how they're computed. The $p$-values you get in R from calling `summary.glm` come from Wald tests, whereas the Pearson test is the related score test.
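In fact, R can compute the score test directly from the fitted `glm` via `anova` with `test = "Rao"` (available in modern R versions); its statistic should closely match Pearson's chi-squared on the same table. A sketch, reusing `foo` and `fit` from above:

```r
## Score (Rao) test of the Pred effect from the logistic model
anova(fit, test = "Rao")

## Pearson chi-squared test on the corresponding contingency table
chisq.test(table(foo$Pred, foo$Dep))
```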

The problem with your example is that your logistic model suffers from complete separation: the levels A, B, and D of `Pred` perfectly predict the class, so the coefficient estimates "explode" toward $\pm\infty$ (hence the huge standard errors and useless Wald $p$-values). One advantage of the score test is that it still provides a finite test statistic for models like this one. For a better-behaved example, consider the following:

```r
set.seed(123)
foo2 <- as.data.frame(sapply(foo, sample))  ## permute each column to avoid separation
fit  <- glm(Dep ~ Pred, data = foo2, family = binomial)

library(lmtest)
waldtest(fit, test = "Chisq")
chisq.test(table(foo2))
```

Gives us:

```
> waldtest(fit, test='Chisq')
Wald test

Model 1: Dep ~ Pred
Model 2: Dep ~ 1
  Res.Df Df  Chisq Pr(>Chisq)
1    196
2    199 -3 1.7774     0.6199

> chisq.test(table(foo2))

	Pearson's Chi-squared test

data:  table(foo2)
X-squared = 1.7882, df = 3, p-value = 0.6175
```
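For completeness, a third asymptotically equivalent option is the likelihood-ratio (deviance) test, also available through `anova` on the non-separated fit just computed:

```r
## Likelihood-ratio test of the Pred effect; asymptotically
## equivalent to the Wald and score tests above
anova(fit, test = "Chisq")
```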
