Solved – Feature Importance for Multinomial Logistic Regression

I have trained a logistic regression model with 4 possible output labels. I want to determine the overall importance of each feature, irrespective of any specific output label. For binary classification, we can simply infer feature importance from the feature coefficients. However, when there are more than 2 output labels, things get a bit tricky. For multinomial logistic regression, multiple one-vs-rest classifiers are trained; for example, with 4 possible output labels, 3 one-vs-rest classifiers will be trained, and each classifier has its own set of feature coefficients. So when calculating feature importance, we have 3 coefficients for each feature, each corresponding to a specific output label. Is there a way to aggregate these coefficients into a single feature importance value?
Can we just take the mean or weighted mean of these coefficients to get a single importance value per feature?
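For context, here is a minimal sketch of one such aggregation (the mean of the absolute per-class coefficients) using scikit-learn. The dataset is synthetic and purely illustrative, and mean-of-absolute-values is just one of several plausible aggregation choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has shape (n_classes, n_features): one row of coefficients per class
print(clf.coef_.shape)  # (4, 6)

# One possible aggregation: mean absolute coefficient across classes
importance = np.abs(clf.coef_).mean(axis=0)
print(importance)
```

Note that averaging signed coefficients directly can cancel out effects that point in opposite directions for different classes, which is why absolute values (or squared values) are often used instead.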

The most relevant question to this problem that I found is the following.
However, that question has no answers yet, and it uses a log-linear model instead of logistic regression.

You can also fit one multinomial logistic model directly rather than fitting three one-vs-rest binary regressions.

To do so, let $y_i$ be a categorical response coded as a vector of three $0$s and one $1$, whose position indicates the category, and let $\pi_i$ be the vector of probabilities associated with $y_i$. You can then directly minimize the cross entropy: $$H = -\sum_i \sum_{j = 1}^{4} \left[ y_{ij} \log(\pi_{ij}) + (1 - y_{ij})\log(1 - \pi_{ij}) \right]$$ (this is also the negative log-likelihood of the model).
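As a quick numerical illustration of the formula above, here is the cross entropy computed directly with NumPy for a few toy observations (the one-hot responses and probability vectors below are made up):

```python
import numpy as np

# One-hot responses y_i and predicted probability vectors pi_i
# (toy values: 3 observations, 4 categories; each row of pi sums to 1)
y = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0]], dtype=float)
pi = np.array([[0.7, 0.1, 0.1, 0.1],
               [0.2, 0.2, 0.5, 0.1],
               [0.1, 0.6, 0.2, 0.1]])

# Cross entropy H as defined in the formula above
H = -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
print(H)
```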

The parameter of your multinomial logistic regression is a matrix $\Gamma$ with $4 - 1 = 3$ rows (because one category serves as the reference category) and $p$ columns, where $p$ is the number of features you have (or $p + 1$ columns if you add an intercept). Each column corresponds to a feature, so to assess the importance of the $j$-th feature you can, for instance, perform a test (e.g. a likelihood ratio test or a Wald-type test) of $\mathcal{H}_0 : \Gamma_{\cdot j} = 0$, where $\Gamma_{\cdot j}$ denotes the $j$-th column of $\Gamma$. The $p$-value you get tells you the significance of the feature.

In this case, the likelihood ratio test amounts to looking at twice the increase in cross entropy you get by removing a feature, and comparing that statistic to a $\chi^2_k$ distribution, where $k$ is the dimension of the removed feature's column of $\Gamma$ (here $k = 3$).

Hope this helps.
