I have one categorical variable that can assume 5 possible values: 1, 2, 3, 4, 5. I observe the value of the variable in two samples of different sizes. Now I want to compare the frequency distribution of my variable in the two samples.

Is there any statistic that measures the concentration of the distribution (something like the Gini coefficient, but not cumulative), so that it takes the value 1 if all observations of the sample are in one category and 0 if they are equally distributed across the 5 categories? Looking at the statistic, I could then say that one sample is more *heterogeneous* than the other.


#### Best Answer

Here's one such suggestion:

Assume *nominal* categories labelled $1, 2, \ldots, k$, and let $p_i$ be the proportion of observations in category $i$.

Consider the sum of the squares of the proportions in each category,

$$\sum_{i=1}^k p_i^2$$

If all values are in one category it takes the value 1, and if they're uniformly spread across categories it takes the value $k \cdot (1/k)^2 = 1/k$.

So subtract $\frac{1}{k}$ and divide by $1-\frac{1}{k}$ to give it the right range.

That leaves us with the concentration coefficient

$$\frac{\left(\sum_{i=1}^k p_i^2\right)-\frac{1}{k}}{1-\frac{1}{k}}=\frac{\left(k\sum_{i=1}^k p_i^2\right)-1}{k-1}$$
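As a rough illustration, the coefficient can be computed directly from raw observations. This is a minimal Python sketch, not a reference implementation: the function name `concentration_coefficient` and the two example samples are made up for illustration, and it assumes the full list of possible categories is known in advance (here 1 to 5, as in the question).

```python
from collections import Counter


def concentration_coefficient(observations, categories):
    """Return (k * sum(p_i^2) - 1) / (k - 1) for the given sample.

    `categories` lists all k possible values, including any that do not
    occur in the sample (hypothetical helper, for illustration only).
    """
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    if not set(observations) <= set(categories):
        raise ValueError("observation outside the declared categories")

    counts = Counter(observations)
    # Sum of squared proportions; unobserved categories contribute 0.
    sum_sq = sum((counts[c] / n) ** 2 for c in categories)
    return (k * sum_sq - 1) / (k - 1)


categories = [1, 2, 3, 4, 5]

# Two made-up samples of different sizes.
sample_a = [1, 1, 1, 1, 2, 3]
sample_b = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

coef_a = concentration_coefficient(sample_a, categories)
coef_b = concentration_coefficient(sample_b, categories)
print(coef_a, coef_b)
```

Because the coefficient depends only on proportions, the two samples can be compared even though their sizes differ; the sample with the larger value is the more concentrated (less heterogeneous) one.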

Note that this is a linear rescaling of the Simpson diversity index to make it a concentration (i.e. flipped around) and with the desired endpoints.

Edit: Note added later – the Simpson index is also called the Herfindahl index (one of many cases where the same thing is called different names in different areas), and the above concentration measure is the normalized Herfindahl index $H^*$.

There are other such indexes which you might prefer to similarly modify.

As another possibility from economics, there's also the Gini coefficient.

For ordered categories (where further apart is more heterogeneous), you might want to consider subtracting the *second* of the two polarization indexes discussed here from 1.

Alternatively, if you need uniformity to be zero, and 50-50 polarization to be negative, a simple rescaling can achieve that.

For the ordered case, further clarification is needed.