# What do p-values for levels of a categorical variable represent in Poisson regression?

I have a Poisson model with varying densities:

```r
set.seed(1)
df = data.frame(density = 1:5, events = rpois(2000, 1:5))
```

If I regress on this, I get that the intercept is approximately `log(3)`, which makes sense because 3 is the mean of 1:5.

```r
glm(events ~ 1, df, family = poisson)  # returns 1.089
```

But now suppose I want to read back the coefficients of density:

```r
glm(events ~ as.factor(density), df, family = poisson)
```

(For simplicity I've used `density` as both the ID of the field and its density.) I would expect the coefficient of `density[i]` to be `log(3-i)`, because the intercept would still be 3. However, the intercept doesn't remain 3: in this case, the intercept is set to `log(1)`. Playing around with this, it seems that `glm` sets the intercept to the coefficient of the first factor level.

Now I'm starting to wonder what the p-values in a `glm` regression indicate. Is the null hypothesis that `density[i]` is the same as the intercept (i.e. the first level)? Or is it that `density[i] = mean(density)`?

## Answer

The model coefficients are estimated contrasts, determined by how R generates contrasts for the levels of the factor `density` (treatment contrasts by default). Take a look at this:

```r
fit <- glm(events ~ as.factor(density), df, family = poisson)
model.matrix(fit)
```

To see how these contrasts are estimated, store the GLM as an object in the workspace and inspect its model matrix, as above. The intercept is now the average log rate when `density` is equal to 1 (close to `log(1) = 0`). Each of the remaining parameters, such as the first, labeled `as.factor(density)2`, is the log relative rate comparing events when density equals 2 versus density equals 1.
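A quick check of this interpretation, re-fitting the model from the question (same seed and data frame as above): with treatment contrasts, the intercept equals the log of the observed mean event count at `density = 1`, and each `as.factor(density)k` coefficient equals the difference in log means between level `k` and level 1.

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))
fit <- glm(events ~ as.factor(density), df, family = poisson)

## log of the observed mean events within each density group
logMeans <- log(tapply(df$events, df$density, mean))

## the intercept reproduces the log mean at density = 1 ...
coef(fit)[1]    # same as logMeans[1]
## ... and each factor coefficient is a difference of log means
coef(fit)[-1]   # same as logMeans[-1] - logMeans[1]
```

Because the factor model is saturated in the group means, the fitted rates match the observed group means exactly, not just approximately.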

Each of these model parameters has a known limiting asymptotic distribution due to the central limit theorem. The theory on this is well understood, but a bit advanced; consult McCullagh & Nelder, "Generalized Linear Models" for a statement of the result. Basically, as with linear regression, the maximum likelihood estimates of the parameters in a generalized linear model converge to a normal distribution under replications of the study. Thus, we can calculate the limiting distribution under the null hypothesis and compute the probability of observing model coefficients at least as extreme as what was experimentally obtained. This is very similar to the usual interpretation of a p-value as obtained from OLS model parameters, simple Pearson tests of contingency tables, or the t-test.
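As a sketch of what `summary()` is doing behind the scenes: the reported p-values are Wald tests of the null hypothesis that each contrast equals zero, computed from the estimate and its asymptotic standard error. The code below recomputes them by hand for the factor model from the question.

```r
set.seed(1)
df  <- data.frame(density = 1:5, events = rpois(2000, 1:5))
fit <- glm(events ~ as.factor(density), df, family = poisson)

tab <- coef(summary(fit))
z   <- tab[, "Estimate"] / tab[, "Std. Error"]  # Wald z statistic
p   <- 2 * pnorm(-abs(z))                       # two-sided normal tail probability

## matches the "Pr(>|z|)" column reported by summary(fit)
cbind(byHand = p, reported = tab[, "Pr(>|z|)"])
```

So, to answer the question directly: for `as.factor(density)2`, the null hypothesis is that the log rate at density 2 equals the log rate at density 1 (the contrast is zero), not that it equals some overall mean.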

Note that, had you removed the `as.factor` coding of `density`, you would have estimated an average log relative rate comparing values of `density` differing by 1 unit, and the intercept would have been extrapolated to the log event rate when `density = 0`, which may or may not be a useful quantity. The log relative rates in the data you generated are not constant, so the model effect would represent an "averaged" effect.

For instance:

```r
## the actual relative rates comparing subsequent density values
relRates <- exp(diff(log(1:5)))

modelFit <- glm(events ~ density, data = df, family = poisson)

## model-based relative rate, weighted by the random data
exp(coef(modelFit))

## the approximate average log relative rate, converted to a relative rate
exp(mean(log(relRates)))
```
