Does anyone know how to find the most informative features with Naive Bayes when the attributes are normally distributed?

My understanding, at least for binary attributes, is that you compute max(Pr(feature=1 | classLabel)) / min(Pr(feature=1 | classLabel)), with the max and min taken over the class labels. This gives you the informativeness of feature=1 between two class labels.

But how would you compute the most informative features using a Gaussian Naive Bayes classifier? Any sources would be much appreciated also.

#### Best Answer

You could estimate the mutual information of each feature $i$ with the class label (also known as the *expected information gain*),

$$I[C, X_i] = H[C] - H[C \mid X_i].$$

The most informative feature is the one which on average produces the least uncertainty in $C$, which is measured by the entropy $H[C \mid X_i]$. We can estimate the entropy by averaging over data points:

$$H[C \mid X_i] \approx -\frac{1}{N} \sum_n \sum_c p(c \mid x_{ni}) \log p(c \mid x_{ni}).$$

In your case, presumably (taking the class priors to be equal; otherwise multiply each Gaussian by its class prior $p(c)$),

$$p(c \mid x_{ni}) = \frac{\mathcal{N}(x_{ni}; \mu_{ci}, \sigma_{ci}^2)}{\sum_{c'} \mathcal{N}(x_{ni}; \mu_{c'i}, \sigma_{c'i}^2)}.$$
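As a rough illustration, the per-feature estimate could be sketched in NumPy as follows. The function name `feature_information` and the `eps` variance floor are hypothetical choices of mine, not part of any library. One deliberate difference from the formula above: the sketch weights each Gaussian by the empirical class prior, which reduces to the formula when the classes are balanced. Results are in nats.

```python
import numpy as np

def feature_information(X, y, eps=1e-9):
    """Estimate I[C, X_i] = H[C] - H[C | X_i] for every feature i (in nats).

    X: (N, M) array of continuous features, y: (N,) array of class labels.
    Assumes each feature is Gaussian within each class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / counts.sum()                       # empirical p(c)

    # Per-class mean and variance of every feature, shape (K, M).
    mu = np.stack([X[y == c].mean(axis=0) for c in classes])
    var = np.stack([X[y == c].var(axis=0) for c in classes]) + eps

    # log p(c) + log N(x_ni; mu_ci, var_ci), shape (N, K, M).
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None]
                      + (X[:, None, :] - mu[None]) ** 2 / var[None])
    log_joint = np.log(priors)[None, :, None] + log_lik

    # Normalise over classes (log-sum-exp) to get log p(c | x_ni).
    m = log_joint.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    log_post = log_joint - log_norm
    post = np.exp(log_post)

    # H[C | X_i]: average posterior entropy over the N data points.
    h_cond = -(post * log_post).sum(axis=1).mean(axis=0)  # shape (M,)
    h_prior = -(priors * np.log(priors)).sum()            # H[C]
    return h_prior - h_cond
```

Sorting the returned array in descending order (for example with `np.argsort(-feature_information(X, y))`) then ranks the features from most to least informative under these assumptions.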

Finding the most informative combination of features is a bit trickier. Say you want to find the three most informative features, then you would have to estimate $H[C \mid (X_i, X_j, X_k)]$ for all $\binom{M}{3}$ possible combinations of three features.

For choosing many features, you could try a greedy approach where you first pick the most informative feature $i$. Then you choose the second feature based on $H[C \mid (X_i, X_j)]$, by fixing $i$ and testing all $M - 1$ remaining choices for $j$, and so on for further features.
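A minimal sketch of that greedy procedure, assuming the Naive Bayes factorisation so that the posterior given a set of features is proportional to the class prior times the product of the per-feature Gaussians. The name `greedy_select` and the `eps` variance floor are hypothetical, and the empirical class priors are my own assumption. If the features are strongly correlated, this factorised posterior will be overconfident, so treat the ranking as approximate.

```python
import numpy as np

def greedy_select(X, y, k, eps=1e-9):
    """Greedily pick up to k features that minimise the estimated H[C | X_S]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    classes, counts = np.unique(y, return_counts=True)
    log_prior = np.log(counts / counts.sum())

    # Per-class Gaussian parameters, shape (K, M).
    mu = np.stack([X[y == c].mean(axis=0) for c in classes])
    var = np.stack([X[y == c].var(axis=0) for c in classes]) + eps

    # log N(x_ni; mu_ci, var_ci) for every point, class and feature: (N, K, M).
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None]
                      + (X[:, None, :] - mu[None]) ** 2 / var[None])

    def cond_entropy(log_joint):
        # log_joint has shape (N, K); normalise over classes, then average.
        m = log_joint.max(axis=1, keepdims=True)
        log_post = log_joint - m - np.log(
            np.exp(log_joint - m).sum(axis=1, keepdims=True))
        return -(np.exp(log_post) * log_post).sum(axis=1).mean()

    selected = []
    base = np.tile(log_prior, (n_samples, 1))   # log p(c) for each data point
    for _ in range(min(k, n_features)):
        remaining = [j for j in range(n_features) if j not in selected]
        # Score each candidate j by H[C | X_selected, X_j].
        scores = {j: cond_entropy(base + log_lik[:, :, j]) for j in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        base = base + log_lik[:, :, best]       # fix the chosen feature
    return selected
```

Each round costs one entropy estimate per remaining feature, so selecting $k$ features needs on the order of $kM$ estimates rather than $\binom{M}{k}$.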