# Solved – Mahalanobis distance on non-normal data

Mahalanobis distance, when used for classification purposes, typically assumes a multivariate normal distribution, and the squared distances from the centroid should then follow a \$\chi^2\$ distribution with \$d\$ degrees of freedom, where \$d\$ is the number of dimensions (features). We can then calculate the probability that a new data point belongs to the set from its Mahalanobis distance.
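To make the normal-theory setup concrete, here is a minimal sketch of what I mean, using NumPy and SciPy. The synthetic training data, the function name `mahalanobis_sq`, and the choice of 5 features are only illustrative assumptions; the mean and covariance are plain sample estimates.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from a class with the given mean and covariance."""
    diff = np.asarray(x, dtype=float) - mean
    # Solve cov @ z = diff rather than forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))


# Illustrative data only: 2000 draws from a 5-dimensional standard normal.
rng = np.random.default_rng(0)
d = 5
train = rng.normal(size=(2000, d))

mean = train.mean(axis=0)
cov = np.cov(train, rowvar=False)

x_new = rng.normal(size=d)
d2 = mahalanobis_sq(x_new, mean, cov)

# Under multivariate normality, d2 is approximately chi-squared with d degrees
# of freedom, so the upper tail gives the probability of a distance this large.
p_value = chi2.sf(d2, df=d)
print(f"squared distance = {d2:.3f}, tail probability = {p_value:.3f}")
```

A small tail probability suggests the point is unlikely to belong to the class, but only to the extent that the normality assumption holds.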

I have data sets (\$d \approx 1000\$) that do not follow a multivariate normal distribution. In theory, each feature should follow a Poisson distribution. Empirically this seems to hold for many (\$\approx 200\$) features; those for which it does not are in the noise and can be removed from the analysis. How can I classify new points on this data?
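For illustration, one simple way to screen features for Poisson-like behaviour is the dispersion index (sample variance divided by sample mean), which should be close to 1 for a Poisson variable. This is only a sketch of the idea: the simulated counts, the helper `dispersion_index`, and the tolerance of 0.2 are hypothetical choices, not necessarily how the screening would be done on real data.

```python
import numpy as np


def dispersion_index(counts):
    """Per-feature variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    # Features with zero mean get NaN instead of a division warning.
    return np.divide(var, mean, out=np.full_like(mean, np.nan), where=mean > 0)


# Illustrative data only: three Poisson features and two overdispersed ones.
rng = np.random.default_rng(2)
poisson_like = rng.poisson(lam=4.0, size=(5000, 3))
overdispersed = rng.negative_binomial(n=2, p=0.2, size=(5000, 2))
counts = np.hstack([poisson_like, overdispersed])

index = dispersion_index(counts)

# Hypothetical tolerance: keep features whose index is within 0.2 of 1.
keep = np.abs(index - 1.0) < 0.2
print(index.round(2), keep)
```

A dispersion index near 1 is necessary but not sufficient for a Poisson fit, so this is a coarse filter rather than a formal goodness-of-fit test.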

I guess there are two components:

1. What is an appropriate "Mahalanobis distance" formula on this data (i.e. multivariate Poisson distribution)? Is there a generalization of the distance to other distributions?
2. Whether I use the normal Mahalanobis distance or another formulation, what should the distribution of these distances be? Is there a different way to do the hypothesis test?

Alternatively…

The number of known data points \$n\$ in each class varies widely, from \$n=1\$ (too few; I'll determine a minimum empirically) to around \$n=6000\$. The Mahalanobis distance scales with \$n\$, so distances from one model/class to the next cannot be directly compared. When the data is distributed normally, the chi-squared test provides a way to compare distances from different models (in addition to providing critical values or probabilities). If there is another way to directly compare the "Mahalanobis-like" distances, even if it does not provide probabilities, I could work with that.
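As a sketch of the comparison I have in mind in the normal case: fit one mean and covariance per class, convert each squared distance to a chi-squared tail probability, and compare those probabilities instead of the raw distances. The two synthetic classes, their sizes, and the helper names `fit_class` and `tail_probability` are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import chi2


def fit_class(samples):
    """Sample mean and covariance for one class."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=0), np.cov(samples, rowvar=False)


def tail_probability(x, mean, cov):
    """Chi-squared upper-tail probability of the squared Mahalanobis distance."""
    diff = np.asarray(x, dtype=float) - mean
    d2 = float(diff @ np.linalg.solve(cov, diff))
    return float(chi2.sf(d2, df=len(mean)))


# Illustrative data only: a small class and a large class with different spreads.
rng = np.random.default_rng(1)
d = 4
class_a = rng.normal(loc=0.0, scale=1.0, size=(50, d))
class_b = rng.normal(loc=3.0, scale=2.0, size=(6000, d))
models = {"a": fit_class(class_a), "b": fit_class(class_b)}

x_new = np.zeros(d)
scores = {name: tail_probability(x_new, m, c) for name, (m, c) in models.items()}

# The tail probabilities are on a common scale, so they can be compared directly.
best = max(scores, key=scores.get)
print(scores, best)
```

With few samples per class the estimated covariance is noisy (and singular when \$n \le d\$), so the chi-squared calibration is only approximate there.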
