How do you correctly plot results from a GLM used to test a categorical variable? Here is a reproducible example in R (the data are listed below the code):
# run the following code if you don't have the libraries installed: # install.packages("dplyr","ggplot2", "MASS") # loading libraries library(dplyr) library(ggplot2) library(MASS) # making fake data - including response (rs) from a negative binomial distribution # and a two-level categorical variable (type) set.seed(246) df <- data.frame(rs = c(rnegbin(n=100, mu=2, theta=1), rnegbin(n=30, mu=3, theta=10)), type = c(rep("A", times=100), rep("B", times=30))) # now doing the stats m1 <- glm.nb(rs~type, df) anova(m1) summary(m1) # make summary table containing means, standard deviation (sd), # count (n), and standard error (se) for each Type (A and B) # going to use to graph results df1 <- df %>% group_by(type) %>% summarise(means = mean(rs), sd = sd(rs), n = n(), se = sd/sqrt(n)) df1 # now making the graph with df1 ggplot(df1, aes(x=type,y=means)) + geom_bar(stat="identity", color="black",fill="grey") + geom_errorbar(aes(ymax = means + se, ymin = means - se), width = 0.2) + labs(y ="Mean response",x= "Categorical variable") + theme_bw() + theme(axis.text = element_text(size=22), axis.title = element_text(size=24)) # the generated data: rs type 2 A 0 A 1 A 1 A 2 A 0 A 0 A 2 A 0 A 3 A 9 A 0 A 11 A 0 A 0 A 4 A 4 A 3 A 1 A 5 A 1 A 0 A 0 A 1 A 0 A 6 A 13 A 3 A 2 A 5 A 1 A 0 A 4 A 1 A 4 A 1 A 1 A 3 A 3 A 3 A 3 A 1 A 2 A 0 A 2 A 6 A 2 A 3 A 0 A 2 A 1 A 2 A 1 A 0 A 0 A 0 A 0 A 0 A 0 A 0 A 1 A 0 A 3 A 1 A 1 A 1 A 1 A 4 A 0 A 3 A 4 A 0 A 1 A 9 A 6 A 0 A 0 A 0 A 1 A 8 A 3 A 0 A 1 A 0 A 4 A 3 A 0 A 5 A 1 A 1 A 0 A 1 A 1 A 13 A 1 A 0 A 6 A 3 A 0 A 1 A 1 B 4 B 4 B 5 B 6 B 2 B 4 B 2 B 3 B 1 B 2 B 5 B 2 B 2 B 4 B 1 B 3 B 4 B 2 B 5 B 5 B 4 B 6 B 1 B 5 B 2 B 1 B 5 B 3 B 2 B
The GLM says there is no statistical difference between Type A and Type B. I want to plot this result. I made a summary table that I use to plot the mean response for each Type and include error bars. I think I must be calculating standard error incorrectly. Do you have any advice for how to properly plot my results?
Best Answer
Let's calculate predicted values and confidence intervals from your model, with built-in functions
pdat <- expand.grid(type = c(levels(df$type))) pred <- predict(m1, newdata = pdat, se.fit = T) cumlut <- 2 # could use 1.96 predframe <- data.frame(type = pdat, fit = pred$fit, ci.lower = pred$fit - cmult*pred$se.fit, ci.upper = pred$fit + cmult*pred$se.fit) head(predframe) type fit ci.lower ci.upper 1 A 0.756122 0.5439703 0.9682737 2 B 1.163151 0.8034228 1.5228788
or by hand
pdat <- expand.grid(type = c(levels(df$type))) X <- model.matrix(formula(m1)[-2], pdat) fit <- X %*% coef(m1) pvar <- diag(X %*% vcov(m1) %*% t(X)) predframe <- data.frame(type = pdat, fit, ci.lower = fit - cmult*sqrt(pvar), ci.upper = fit + cmult*sqrt(pvar)) head(predframe) type fit ci.lower ci.upper 1 A 0.756122 0.5439703 0.9682737 2 B 1.163151 0.8034228 1.5228788
They are equivalent. Now, using your code:
ggplot(predframe, aes(x=type,y=fit))+ geom_bar(stat="identity", color="black",fill="grey")+ geom_errorbar(aes(ymax = ci.upper, ymin = ci.lower), width = 0.2)+ labs(y ="Mean response",x= "Categorical variable")+ theme_bw()+ theme(axis.text = element_text(size=22), axis.title = element_text(size=24))
You should look at confidence intervals rather than standard error when you make any inference using your model. The differnce between this and your plot is that the standard errors you have calculated are from your dataset (raw data), while the one you should use to plot confidence intervals are the one you get from the model.