I am trying to wrap my mind around the variance definition.
Given a set of values S and n = #(S), the variance is defined as:
$$
\operatorname{var}(S) = \frac{\sum_{i=1}^n \bigl( S_i - \operatorname{mean}(S) \bigr)^2}{n}
$$
The square root of that (the standard deviation) measures how far, on average, the values are from the mean.
However, there is a simpler formula that also measures how far away the values are from the mean:
$$
\operatorname{anotherPossibleDefForVar}(S) = \frac{\sum_{i=1}^n \bigl| S_i - \operatorname{mean}(S) \bigr|}{n}
$$
I am trying to understand the reasoning behind squaring the deviations instead of using the simpler modulus (absolute value) there. Is there a real reason why variance was defined the first way and not the second way?
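For concreteness, here is a tiny worked example of my own (the values are chosen arbitrarily) showing the two definitions side by side:
$$
S = \{1, 2, 3, 6\}, \qquad \operatorname{mean}(S) = 3,
$$
$$
\operatorname{var}(S) = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (6-3)^2}{4} = \frac{4+1+0+9}{4} = 3.5, \qquad \sqrt{3.5} \approx 1.87,
$$
$$
\frac{|1-3| + |2-3| + |3-3| + |6-3|}{4} = \frac{2+1+0+3}{4} = 1.5.
$$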
# EDIT #
Ok, it looks like the reasons given so far are way more advanced than what I was expecting.
The argument for squaring as opposed to taking the modulus (that the modulus makes the math more complicated) is valid, but IMHO that is more a consequence of the definition than a reason for defining it as it is. The same goes for the Central Limit Theorem.
I ended up finding the exact same question at Khan Academy. There, the following reasons were also given:
- "Squaring emphasizes larger differences (think of the effect
outliers have)." Another comment also points out: "In addition to
amplifying large differences from the mean, squaring also MINIMIZES
tiny differences from the mean".
These are the most convincing reasons I have found so far. The modulus will not emphasize large deviations, nor will it minimize small ones. HOWEVER, the same argument applies to any even power. A power of 4 will also amplify large differences and minimize tiny differences (it will actually do a better job at both). So why not take the power of 4 then (or any other even power, for that matter)?
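A quick numeric check of that claim, with numbers I picked arbitrarily: a small deviation shrinks even more under a fourth power, and a large one grows even more,
$$
0.1^2 = 0.01, \quad 0.1^4 = 0.0001; \qquad 10^2 = 100, \quad 10^4 = 10000.
$$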
- "(…) you can also view the equation as being the Euclidean distance between all the points and the mean of the points"
That's more of a "nice-to-have" than a reason to me. If anything, the modulus would give the Manhattan distance. So what?
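Spelling out that identification as I understand it, with $\bar S = \operatorname{mean}(S)$: the standard deviation is the Euclidean ($\ell_2$) norm of the deviation vector scaled by $1/\sqrt n$, and the mean absolute deviation is its Manhattan ($\ell_1$) norm scaled by $1/n$,
$$
\sqrt{\operatorname{var}(S)} = \frac{1}{\sqrt n}\,\bigl\|(S_1-\bar S,\ldots,S_n-\bar S)\bigr\|_2,
\qquad
\frac{\sum_{i=1}^n |S_i-\bar S|}{n} = \frac{1}{n}\,\bigl\|(S_1-\bar S,\ldots,S_n-\bar S)\bigr\|_1.
$$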
Having said all that, I am not 100% convinced yet. I believe this question is deeper than it looks at first glance, and judging by the number of upvotes at Khan Academy, I am not the only one confused by it.
Best Answer
Let $\mu=\operatorname{E}(X).$
The main reason for using $\sqrt{\operatorname{var}(X)} = \sqrt{\operatorname{E}((X-\mu)^2)}$ as a measure of dispersion, rather than using the mean absolute deviation $\operatorname{E}(|X-\mu|),$ is that if $X_1,\ldots,X_n$ are independent, then $$ \operatorname{var}(X_1+\cdots+X_n) = \operatorname{var}(X_1)+\cdots+\operatorname{var}(X_n). \tag 1 $$ Nothing like that works with the mean absolute deviation. For example, try it with $X_1,X_2,X_3 \sim \operatorname{i.i.d.} \operatorname{Bernoulli}(1/2).$
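Working out that Bernoulli example (my own computation): each $X_i$ has mean $\tfrac12$ and $|X_i-\tfrac12|=\tfrac12$ always, so the individual mean absolute deviations sum to $\tfrac32.$ But $X_1+X_2+X_3 \sim \operatorname{Binomial}(3,\tfrac12),$ taking the values $0,1,2,3$ with probabilities $\tfrac18,\tfrac38,\tfrac38,\tfrac18,$ so
$$
\operatorname{E}\bigl|X_1+X_2+X_3-\tfrac32\bigr|
= \tfrac18\cdot\tfrac32 + \tfrac38\cdot\tfrac12 + \tfrac38\cdot\tfrac12 + \tfrac18\cdot\tfrac32
= \tfrac34 \ne \tfrac32.
$$
The variances, by contrast, do add: each is $\tfrac14,$ and indeed $\operatorname{var}\bigl(\operatorname{Binomial}(3,\tfrac12)\bigr) = 3\cdot\tfrac12\cdot\tfrac12 = \tfrac34.$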
In any problem where you use the central limit theorem, you need this.
For example: What is the standard deviation of the number of heads that appear when a coin is tossed $900$ times? That's easy to find because of $(1).$
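Spelling out that computation: each toss is $\operatorname{Bernoulli}(\tfrac12)$ with variance $\tfrac14,$ so by $(1)$
$$
\operatorname{var}(X_1+\cdots+X_{900}) = 900\cdot\tfrac14 = 225, \qquad \sqrt{225} = 15,
$$
and the standard deviation of the number of heads is $15.$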