I am trying to wrap my mind around the variance definition.

Given a set of values ** S** and

**, the variance is defined as:**

*n = #(S)*$$

operatorname{var}(S) = frac{sum_{i=1}^n( S_i – operatorname{mean}(S) )^2} n

$$

And the square root of that (standard deviation) measures how far away the values are on average from the mean.

However, there is a simpler formula that also measures how far away the values are from the mean:

$$

operatorname{another Possible Def For Var}(S) = frac{sum_{i=1}^n|S_i – operatorname{mean}(S)|}{n}

$$

I am trying to understand the reasoning behind the fact we use square root instead of a simpler modules function there. Is there a real reason why variance was defined the first way and not the second way?

**Contents**hide

## # EDIT #

Ok, looks like the given reasons so far are way more advanced than what I was expecting.

The argument of squaring it as opposed to taking the modulus saying the modulus make the math more complicated is valid, but more of a consequence of the definition rather than a reason for it being defined as it is IMHO. Same thing goes for the Central Limit Theorem.

I ended up finding the exact same question at Khan Academy. There, the following reasons were also given:

*"Squaring emphasizes larger differences (think of the effect*Another comment also points out:

outliers have)."*"In addition to*.

amplifying large differences from the mean, squaring also MINIMIZES

tiny differences from the mean"

These are the most convincing reasons I found so far. The modulus will not emphasize large values, neither will it minimize small values. HOWEVER, the same argument goes to any even power. A power of 4 will also amplify large differences and minimize tiny differences (it will actually do a better job at those). So why not take the power of 4 then? (or any other even number for that matter).

*"(…) you can also view the equation as being the Euclidean distance between all the points and the mean of the points"*

That's more of a "nice-to-have" than a reason to me. If anything, the modules would give the Manhattan distance. So what?

Having said all that, I am not 100% convinced yet. I believe this question is way deeper than it looks at first glance and judging from the Khan Academy number of upvotes, I am not the only one confused about it.

#### Best Answer

Let $mu=operatorname{E}(X).$

The main reason for using $sqrt{operatorname{var}(X)} = sqrt{operatorname{E}((X-mu)^2)}$ as a measure of dispersion, rather that using the mean absolute deviation $operatorname{E}(|X-mu|),$ is that if $X_1,ldots,X_n$ are independent, then $$ operatorname{var}(X_1+cdots+X_n) = operatorname{var}(X_1)+cdots+operatorname{var}(X_n). tag 1 $$ Nothing like that works with the mean absolute deviation. For example, try it with $X_1,X_2,X_3,simoperatorname{i.i.d.} operatorname{Bernoulli}(1/2).$

In any problem where you use the central limit theorem, you need this.

For example: What is the standard deviation of the number of heads that appear when a coin is tossed $900$ times? That's easy to find because of $(1).$