# Solved – Does variance only work on normally distributed data (as a measure of dispersion)

Wikipedia says:

> The role of the normal distribution in the central limit theorem is in part responsible for the prevalence of the variance in probability and statistics.

I understand this as follows: when we use the variance/SD as a measure of dispersion, we are actually looking for the "scaling parameter" of a normal distribution, since by the CLT a random variable is likely to follow an approximately normal distribution.

In the case that the data is not normally distributed, is variance/SD still a reasonable measure of dispersion?

Say the data is uniformly distributed. The average absolute deviation then seems to be a better measure of dispersion than the variance, because it can be seen as the "scaling parameter" of the uniform distribution. Am I right?

Update
I mean, say I have two sets of samples: one is `{1,1,1,-1,-1,-1}` and the other is drawn from a normal distribution \$N(0,1)\$. Their variances are both 1, so the two sets will be considered to have the same degree of dispersion if we use variance as the measure.

But it feels like we are forcefully treating them both as Gaussian, then working out the distribution parameters and saying "yeah, they're equal in terms of dispersion".
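The comparison in the update can be checked numerically. The following is a minimal sketch, assuming a simulated sample of 100,000 draws as a stand-in for "drawn from \$N(0,1)\$"; the helper `mean_abs_dev`, the seed, and the sample size are illustrative choices, not part of the question.

```python
import math
import random
import statistics


def mean_abs_dev(xs):
    """Average absolute deviation of the values from their own mean."""
    m = statistics.fmean(xs)
    return statistics.fmean([abs(x - m) for x in xs])


# The two-point sample from the question.
two_point = [1, 1, 1, -1, -1, -1]

# Assumption: a large simulated sample stands in for N(0, 1).
rng = random.Random(0)
normal = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

two_point_var = statistics.pvariance(two_point)  # population variance
two_point_mad = mean_abs_dev(two_point)
normal_var = statistics.pvariance(normal)
normal_mad = mean_abs_dev(normal)

print("two-point: variance", two_point_var, "mean abs dev", two_point_mad)
print("normal:    variance", round(normal_var, 3), "mean abs dev", round(normal_mad, 3))
print("theoretical mean abs dev of N(0,1):", round(math.sqrt(2 / math.pi), 3))
```

Both samples have a variance of about 1, but their average absolute deviations differ (1 for the two-point sample, roughly \$\sqrt{2/\pi}\$ for the normal one). Neither number is tied to a Gaussian assumption; they simply summarize spread in different ways.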


Your question is a little vague, but no, variance isn't used because of its association with the normal distribution. Most distributions have at least a mean and a variance. Some have a mean but no variance. Some have a variance for certain parameter values but not for others. Some have no mean and so do not have a variance either.

Just to clarify: if a distribution has a mean, then \$\bar{x}\approx\mu\$, but if it does not, then \$\bar{x}\approx\text{nothing}\$. That is, the sample mean gravitates nowhere, and any calculation just floats around the real number line. It doesn't mean anything. The same is true if you calculate a standard deviation for a distribution that does not have one. It has no meaning.
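A small simulation can illustrate this. The sketch below assumes the standard Cauchy distribution as the example of a distribution without a mean (the answer does not name one), and the function names, seed, and checkpoints are hypothetical choices for illustration.

```python
import math
import random


def running_means(draw, n, checkpoints):
    """Return {i: mean of the first i draws} for each i in checkpoints."""
    total = 0.0
    means = {}
    for i in range(1, n + 1):
        total += draw()
        if i in checkpoints:
            means[i] = total / i
    return means


rng = random.Random(1)


def draw_normal():
    # Standard normal: has a mean (0), so the sample mean settles down.
    return rng.gauss(0.0, 1.0)


def draw_cauchy():
    # Standard Cauchy via the inverse CDF: has no mean at all.
    return math.tan(math.pi * (rng.random() - 0.5))


checkpoints = {100, 1_000, 10_000, 100_000}
normal_means = running_means(draw_normal, 100_000, checkpoints)
cauchy_means = running_means(draw_cauchy, 100_000, checkpoints)

for i in sorted(checkpoints):
    print(i, round(normal_means[i], 4), round(cauchy_means[i], 4))
```

The normal column should shrink toward 0 as the sample grows, while the Cauchy column typically keeps jumping around with no tendency to settle, no matter how many draws are added.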

The variance is a property of a distribution. You are correct in that it can be used to scale the problem, but it is deeper than that. In some theoretical frameworks, it is a measure of our ignorance, or more precisely, uncertainty. In others, it measures how large of an effect chance can have on outcomes.

Although variance is a conceptualization of dispersion, it is an incomplete conceptualization. Both skew and kurtosis further explain how the dispersion operates on a problem.

For many problems in a null hypothesis framework of thinking, the Central Limit Theorem makes the discussion simpler, so it doesn't hurt that there is a linkage between the normal distribution, with its very well-defined distributional properties, and the use of the standard deviation. However, this is more true for simple problems than for complex ones. It is also less true for Bayesian methods, which do not use a null hypothesis and do not depend on the sampling distribution of the estimator.

The average absolute deviation is a valuable tool in parameter-free and distribution-free methods, but it is less valuable for the uniform distribution. If you actually had a uniform distribution with known bounds, then the mean and the variance would already be determined by those bounds.
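For reference, these are the standard results for a continuous uniform distribution on \$[a, b]\$ (the symbols \$a\$, \$b\$ and \$X\$ are notation introduced here, not taken from the question):

```latex
\begin{aligned}
\operatorname{E}[X] &= \frac{a+b}{2}, \\
\operatorname{Var}(X) &= \frac{(b-a)^2}{12}, \qquad \operatorname{sd}(X) = \frac{b-a}{2\sqrt{3}}, \\
\operatorname{E}\,\lvert X - \operatorname{E}[X] \rvert &= \frac{b-a}{4}.
\end{aligned}
```

Both the standard deviation and the average absolute deviation are fixed multiples of the width \$b-a\$, so for a uniform distribution they carry the same information about scale and differ only by the constant factor \$2/\sqrt{3}\approx 1.155\$.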

Let me give you a uniform distribution problem that may not be as simple as you think. Suppose a new enemy battle tank has appeared on the battlefield. Until now you did not even know it existed, let alone how many there are. You want to estimate the total number of tanks.

Tanks have serial numbers on their engines, or used to before someone figured this out. The probability of capturing any one specific serial number is \$1/N\$, where \$N\$ is the total number of tanks. Of course, you do not know \$N\$, which is what makes this an interesting problem: \$N\$ is exactly what you need to estimate. You can only see the distribution of captured serial numbers, and you cannot know whether the largest number captured is also the last tank built. It probably is not.

In that case, the mean and standard deviation provide the most powerful tools to solve the problem, despite the intuition that the standard deviation is a bad estimator.
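As a rough illustration of how the mean and standard deviation of the captured serial numbers can each be turned into an estimate of \$N\$, here is a minimal sketch. It assumes serial numbers run from 1 to \$N\$ without gaps and treats the captures as a simple random sample; the function names, the true value of 1000, and the sample size of 20 are hypothetical. The estimator based on the largest serial number is the textbook one for this kind of problem and is included only for comparison; it is not taken from the answer.

```python
import math
import random
import statistics


def estimate_from_mean(serials):
    """Moment matching: a uniform on 1..N has mean (N + 1) / 2."""
    return 2 * statistics.fmean(serials) - 1


def estimate_from_sd(serials):
    """Moment matching: a uniform on 1..N has variance (N^2 - 1) / 12.

    Assumption: draws are treated as approximately independent, so the
    finite-population correction is ignored.
    """
    return math.sqrt(12 * statistics.variance(serials) + 1)


def estimate_from_max(serials):
    """Textbook estimator based on the largest observed serial number."""
    m, k = max(serials), len(serials)
    return m + m / k - 1


# Hypothetical scenario: 1000 tanks exist and 20 are captured.
rng = random.Random(42)
true_n = 1000
captured = rng.sample(range(1, true_n + 1), 20)

print("largest captured serial:", max(captured))
print("estimate from mean:", round(estimate_from_mean(captured)))
print("estimate from sd:  ", round(estimate_from_sd(captured)))
print("estimate from max: ", round(estimate_from_max(captured)))
```

Rerunning with different seeds shows how much each estimate varies from sample to sample, which is the kind of comparison that decides which tool suits a given problem.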

It is true that it is a bad estimator for certain problems, but you need to learn which ones on a case-by-case basis.

Statistical tools are chosen based on needs, the rules of math, and trade-offs between real-world costs and limitations and the demands of the problem. Sometimes the right tool is the variance, but sometimes it is not. The best thing to do is to learn why the rules are designed the way they are, and that is too long for a posting here.

I would recommend a good practitioner's book on non-parametric statistics and, if you have had calculus, a good introductory practitioner's book on Bayesian methods.
