Solved – Why not use modulus for variance?

I am trying to wrap my mind around the variance definition.

Given a set of values S and n = #(S), the variance is defined as:

$$
operatorname{var}(S) = frac{sum_{i=1}^n( S_i – operatorname{mean}(S) )^2} n
$$

And the square root of that (standard deviation) measures how far away the values are on average from the mean.

However, there is a simpler formula that also measures how far away the values are from the mean:

$$
operatorname{another Possible Def For Var}(S) = frac{sum_{i=1}^n|S_i – operatorname{mean}(S)|}{n}
$$

I am trying to understand the reasoning behind the fact we use square root instead of a simpler modules function there. Is there a real reason why variance was defined the first way and not the second way?

# EDIT #

Ok, looks like the given reasons so far are way more advanced than what I was expecting.

The argument of squaring it as opposed to taking the modulus saying the modulus make the math more complicated is valid, but more of a consequence of the definition rather than a reason for it being defined as it is IMHO. Same thing goes for the Central Limit Theorem.

I ended up finding the exact same question at Khan Academy. There, the following reasons were also given:

  1. "Squaring emphasizes larger differences (think of the effect
    outliers have)."
    Another comment also points out: "In addition to
    amplifying large differences from the mean, squaring also MINIMIZES
    tiny differences from the mean"
    .

These are the most convincing reasons I found so far. The modulus will not emphasize large values, neither will it minimize small values. HOWEVER, the same argument goes to any even power. A power of 4 will also amplify large differences and minimize tiny differences (it will actually do a better job at those). So why not take the power of 4 then? (or any other even number for that matter).

  1. "(…) you can also view the equation as being the Euclidean distance between all the points and the mean of the points"

That's more of a "nice-to-have" than a reason to me. If anything, the modules would give the Manhattan distance. So what?

Having said all that, I am not 100% convinced yet. I believe this question is way deeper than it looks at first glance and judging from the Khan Academy number of upvotes, I am not the only one confused about it.

Best Answer

Let $mu=operatorname{E}(X).$

The main reason for using $sqrt{operatorname{var}(X)} = sqrt{operatorname{E}((X-mu)^2)}$ as a measure of dispersion, rather that using the mean absolute deviation $operatorname{E}(|X-mu|),$ is that if $X_1,ldots,X_n$ are independent, then $$ operatorname{var}(X_1+cdots+X_n) = operatorname{var}(X_1)+cdots+operatorname{var}(X_n). tag 1 $$ Nothing like that works with the mean absolute deviation. For example, try it with $X_1,X_2,X_3,simoperatorname{i.i.d.} operatorname{Bernoulli}(1/2).$

In any problem where you use the central limit theorem, you need this.

For example: What is the standard deviation of the number of heads that appear when a coin is tossed $900$ times? That's easy to find because of $(1).$

Similar Posts:

Rate this post

Leave a Comment