Penalized regression can be used to fit models where the number of parameters equals or even exceeds the sample size. This situation can arise in log-linear models of large sparse tables of categorical or count data. In these settings, it is often also desirable or helpful to collapse tables by combining levels of a factor where those levels are not distinguishable in terms of how they interact with other factors.
Two questions:
- Is there a way to use penalized models such as LASSO or elastic net to test for the collapsibility of levels within each factor?
- If the answer to the first question is yes, can, and should, this be set up so that the collapsing of levels and the estimation of model coefficients occur in a single step?
Best Answer
It is possible. We can use a variant of the fused lasso to accomplish this.
We can use the estimator $$\hat{\beta} = \operatorname*{argmin}_{\beta} \; \frac{-1}{n} \sum_{i=1}^n \left(y_i \beta^T x_i - e^{\beta^T x_i} \right) + \sum_{\textrm{factors } g} \lambda_g \left(\sum_{j \in g} |\beta_j| + \frac{1}{2} \sum_{j,k \in g} |\beta_j - \beta_k| \right).$$
Note that $\frac{-1}{n} \sum_{i=1}^n \left(y_i \beta^T x_i - e^{\beta^T x_i} \right)$ is the loss function for log-linear (Poisson) models.
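To see why, recall that the Poisson log-likelihood for counts $y_i$ with mean $\mu_i = e^{\beta^T x_i}$ is $$\ell(\beta) = \sum_{i=1}^n \left( y_i \log \mu_i - \mu_i - \log y_i! \right) = \sum_{i=1}^n \left( y_i \beta^T x_i - e^{\beta^T x_i} \right) - \sum_{i=1}^n \log y_i!,$$ so minimizing the loss above is equivalent to maximizing the likelihood, because the $\log y_i!$ term does not depend on $\beta$.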
This penalty encourages the coefficients within a group to be equal, and equality $\hat{\beta}_j = \hat{\beta}_k$ is equivalent to collapsing the $j^{th}$ and $k^{th}$ levels of the factor together. In the case where $\hat{\beta}_j = 0$, it's equivalent to collapsing the $j^{th}$ level into the reference level. The tuning parameters $\lambda_g$ can all be set to a common constant, but if there are only a few factors, it could be better to tune them separately.
The estimator is the minimizer of a convex function, so it can be computed efficiently with off-the-shelf convex solvers. If a factor has very many levels, the number of pairwise differences grows quadratically in the number of levels and can get out of hand; in that case, knowing more structure about the plausible patterns of collapse (e.g., fusing only adjacent levels of an ordinal factor) will be necessary.
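Here is a minimal sketch of how one might compute this estimator with CVXPY. The function name, group encoding, and solver defaults are my assumptions, not a canonical implementation:

```python
import cvxpy as cp

def fit_fused_loglinear(X, y, groups, lams):
    """Sketch: fused-lasso-penalized Poisson (log-linear) regression.

    X      : (n, p) dummy-coded design matrix
    y      : (n,) vector of counts
    groups : list of index lists, one per factor's dummy columns (assumed encoding)
    lams   : one tuning parameter lambda_g per factor
    """
    n, p = X.shape
    beta = cp.Variable(p)
    eta = X @ beta
    # Negative Poisson log-likelihood, up to a constant that does not depend on beta
    loss = -cp.sum(cp.multiply(y, eta) - cp.exp(eta)) / n
    penalty = 0
    for g, lam in zip(groups, lams):
        # Sparsity term: shrinks levels toward the reference level
        penalty += lam * cp.norm1(beta[g])
        # Fusion term: summing over j < k equals the (1/2)-weighted sum
        # over all ordered pairs in the display above
        penalty += lam * sum(
            cp.abs(beta[j] - beta[k]) for j in g for k in g if j < k
        )
    cp.Problem(cp.Minimize(loss + penalty)).solve()
    return beta.value
```

For example, with a single three-level factor dummy-coded against a reference level, `fit_fused_loglinear(X, y, groups=[[0, 1]], lams=[0.1])` would fuse the two non-reference levels with each other, or with the reference, depending on the data and the tuning parameter.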
Note that this is all accomplished in one step! This is part of what makes lasso-type estimators so cool!
Another interesting approach is to use the OSCAR estimator, which is like the above except the pairwise penalty $\left| [-1, 1] \cdot [\beta_j, \beta_k]' \right|_1 = |\beta_j - \beta_k|$ is replaced by $\left\| [\beta_j, \beta_k] \right\|_\infty = \max(|\beta_j|, |\beta_k|)$.
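In the CVXPY sketch above, that swap would amount to replacing only the fusion term (again a rough sketch, not a canonical OSCAR implementation):

```python
# Hypothetical OSCAR-style fusion term: the pairwise infinity norm
# max(|beta_j|, |beta_k|) replaces the |beta_j - beta_k| sum above.
penalty += lam * sum(
    cp.maximum(cp.abs(beta[j]), cp.abs(beta[k]))
    for j in g for k in g if j < k
)
```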