While tinkering with a multiple regression model, I noticed a small but noticeable multicollinearity effect, as measured by variance inflation factors (VIFs), among the dummy variables coding a categorical predictor (after excluding the reference category, of course).
For example, say we have a dataset with a continuous variable $y$ and one nominal categorical variable $x$ which has $k$ possible mutually exclusive values. We code those $k$ possible values as 0/1 dummy variables $x_1, x_2, \dots, x_k$. Then we run the regression model $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_{k-1} x_{k-1}$. The VIF scores for the $k-1$ dummy variables turn out to be greater than $1$. In fact, as the number of categories increases, the VIFs increase. Centering the dummy variables doesn't appear to change the VIFs.
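For concreteness, here is a minimal sketch of that setup (the variable names, the choice of $k$ and sample size, and the use of car::vif() are illustrative assumptions, not part of the original description):

library(car)    # assuming vif() from the car package is the VIF implementation in use

set.seed(1)
k <- 5                                                # number of categories (illustrative)
n <- 200                                              # sample size (illustrative)
x <- factor(sample(letters[1:k], n, replace = TRUE))  # nominal predictor with k levels
y <- rnorm(n)                                         # continuous response

X <- model.matrix(~ x)[, -1]                # k - 1 explicit 0/1 dummies; reference level dropped
fit <- lm(y ~ ., data = data.frame(y, X))   # regress y on the k - 1 dummies
vif(fit)                                    # VIF for each dummy variable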
The intuitive explanation seems to be that the mutual exclusivity of the categories within the categorical variable causes this slight multicollinearity. Is this a trivial finding, or is it an issue to consider when building regression models with categorical variables?
Best Answer
I cannot reproduce this phenomenon exactly, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between the number of categories and multicollinearity.
Here is an R function to create categorical datasets with a specifiable number of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)    # provides vif()

trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)                        # one row per combination of categories
  df <- do.call(rbind, lapply(1:n, function(i) df))    # replicate every combination n times
  df$y <- rnorm(k1*k2*n)                               # independent standard-normal response
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial)                          # two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3))    # 10 x 3 categories (30 combinations), 1-5 replicates
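To see the orthogonality directly, here is a small check of my own (not part of the original answer): expand a balanced $10 \times 3$ design, as in trial(5, 10, 3), into dummy columns and confirm that every Var1 column is uncorrelated with every Var2 column.

df <- expand.grid(Var1 = factor(1:10), Var2 = factor(1:3))
df <- do.call(rbind, lapply(1:5, function(i) df))     # 5 replicates per combination
X  <- model.matrix(~ Var1 + Var2, data = df)[, -1]    # dummy columns, intercept dropped
max(abs(cor(X[, 1:9], X[, 10:11])))                   # zero (up to rounding): the Var1 and Var2 blocks are orthogonal after centering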
This suggests that the multicollinearity the question describes may arise from imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
just before the line that computes fit inside trial. This removes about half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer exactly $1$ (though they remain close to it, varying randomly). They still do not increase with more categories:

sapply(1:5, function(i) trial(i, 10, 10))

produces comparable values.
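For convenience, here is a sketch of the modified function with that line inserted (the name trial_unbalanced is mine; everything else follows the code above):

trial_unbalanced <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  df <- subset(df, subset=(y < 0))      # discard roughly half the rows, unbalancing the design
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)                              # vif() from the car package, loaded above
}

sapply(1:5, function(i) trial_unbalanced(i, 10, 3))   # values close to, but no longer exactly, 1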