What do p-values for levels of a categorical variable represent in Poisson regression?

I have a Poisson model with varying densities:

set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))

If I regress on this, I get that the intercept is approximately log(3), which makes sense because 3 is the mean of 1:5.

glm(events ~ 1, df, family = poisson)  # returns 1.089 

But now suppose I want to read back the coefficients of density:

glm(events ~ as.factor(density), df, family = poisson) 

(For simplicity I've used density as both the ID of the field and its density.) I would expect the coefficient of density[i] to be log(i) - log(3), because the intercept would still be log(3). However, the intercept does not remain log(3) – in this case it comes out as log(1). In playing around with this, it seems like glm sets the intercept to the coefficient of the first factor level.
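To see this numerically, here is a quick sketch (reconstructing df as above): with treatment coding, the intercept is the log mean at the reference level density = 1, and intercept + coefficient recovers each other level's log mean.

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))

fit <- glm(events ~ as.factor(density), df, family = poisson)

# Intercept: log mean rate at the reference level (density = 1),
# which should be close to log(1) = 0
coef(fit)[1]

# Intercept + level coefficient: log mean rate at each remaining level
coef(fit)[1] + coef(fit)[-1]

# Compare with the observed per-level log means
log(tapply(df$events, df$density, mean))
```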

Now I'm starting to wonder what the p-values in a glm regression indicate. Is the null hypothesis that density[i] is the same as the intercept (i.e. density[1])? Or is it that density[i] = mean(density)?

The model coefficients are estimated contrasts, based on how R generates contrasts for the levels of the factor density (by default, treatment contrasts against the first level). Take a look at this:

fit <- glm(events ~ as.factor(density), df, family = poisson)
model.matrix(fit)

Storing the GLM as an object in the workspace lets you inspect how these contrasts are estimated. The intercept is now the average log rate when density equals 1 (the log of 1, i.e. 0, so the estimate is close to 0). Each of the other parameters, such as the first one, labeled as.factor(density)2, is the log relative rate comparing events when density equals 2 versus density equals 1.
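As a sketch of that interpretation (reconstructing df from the question): exponentiating a coefficient gives the rate ratio against the reference level, and since a single-factor Poisson model reproduces the group means, this matches the ratio of observed means.

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))
fit <- glm(events ~ as.factor(density), df, family = poisson)

# Rate ratio: events at density = 2 versus density = 1
exp(coef(fit)["as.factor(density)2"])

# The factor model fits the group means exactly, so this equals
# the ratio of the observed per-level means
grp <- tapply(df$events, df$density, mean)
grp[2] / grp[1]
```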

Each of these model parameters has a known asymptotic distribution due to the central limit theorem. The theory on this is well understood, but a bit advanced; consult McCullagh & Nelder, "Generalized Linear Models" for a statement of the result. Basically, as with linear regression, the natural parameters in generalized linear models converge to a normal distribution under replications of the study. Thus, we can calculate the limiting distribution under the null hypothesis and calculate the probability of observing model coefficients at least as inconsistent with the null as what was experimentally obtained. This is very similar to the usual interpretation of a $p$-value as obtained from OLS model parameters, from simple Pearson tests of contingency tables, or from the t-test.
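Concretely, the p-values reported by summary() are Wald tests of the null hypothesis that a given contrast is zero. A minimal sketch of that computation, using df from the question:

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))
fit <- glm(events ~ as.factor(density), df, family = poisson)
est <- summary(fit)$coefficients

# Wald z statistic: estimate / standard error, referred to N(0, 1)
z <- est[, "Estimate"] / est[, "Std. Error"]
p <- 2 * pnorm(-abs(z))

# These reproduce the "z value" and "Pr(>|z|)" columns of summary(fit)
cbind(z, p)
```

So the null for as.factor(density)2 is that the log rate at density = 2 equals the log rate at the reference level density = 1, not that it equals the overall mean.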

Note that, had you removed the as.factor coding of density, you would have estimated an averaged log relative rate comparing values of density differing by 1 unit, and the intercept would have been extrapolated to the log event rate when density = 0, which may or may not be a useful quantity. The log relative rates in the data you generated are not constant, so the model effect would represent an "averaged effect".

For instance:

## the actual relative rates comparing subsequent density values
relRates <- exp(diff(log(1:5)))

modelFit <- glm(events ~ density, data = df, family = poisson)

## model-based relative rate, weighted by the random data
exp(coef(modelFit))[2]

## the approximate average log relative rate, converted to a relative rate
exp(mean(log(relRates)))
