Variance can be combined as
$$v=\frac{1}{n-1}\left(\sum_{i = 1}^{numGroups}n_{i}(m_{i}-m)^2+ \sum_{i = 1}^{numGroups}(n_{i}-1)v_{i}\right)$$
where $v$ is the combined variance, $n$ is the total sample size, $n_i$ is the number of points in group $i$, $numGroups$ is the total number of groups, $m_i$ is the mean of group $i$, $m$ is the combined mean, and $v_i$ is the variance of the $i$-th group.
Is there a name for this formula or any reference to it?
Best Answer
Let $x_{i,j}$ denote the $j$-th data point in the $i$-th group, which has $n_i$ data points. There are $N$ such groups and thus a total of $\sum_{i=1}^N n_i = n$ data points.
If the sample mean and sample variance of the $i$-th group are $m_i$ and $v_i$ respectively, then we have
$$n_i\cdot m_i = \sum_{j=1}^{n_i} x_{i,j}\quad \text{and} \quad (n_i-1)v_i = \sum_{j=1}^{n_i} \left(x_{i,j} - m_i\right)^2.$$
It follows that $\displaystyle \sum_{i=1}^N \sum_{j=1}^{n_i} x_{i,j} = \sum_{i=1}^N n_i\cdot m_i = n\cdot m$ where $m$ is the overall mean of the $n$ data points. Similarly, the sum $\displaystyle \sum_{i=1}^N (n_i-1)v_i = \sum_{i=1}^N \sum_{j=1}^{n_i}\left(x_{i,j} - m_i\right)^2$ can be recognized as the sum of the squared deviations of the data points from the means of their respective groups.

This is not quite what we want for calculating the variance of the $n$ data points: we need the sum of the squared deviations from $m$. Fortunately, all that is needed is a little algebra. We have that
$$\begin{align}
\sum_{i=1}^N\sum_{j=1}^{n_i} \left(x_{i,j} - m\right)^2 &= \sum_{i=1}^N \left[\sum_{j=1}^{n_i}\left(x_{i,j}^2 -2x_{i,j}m + m^2\right)\right]\\
&= \sum_{i=1}^N \left[\left(\sum_{j=1}^{n_i}x_{i,j}^2\right) -2n_im_im + n_im^2\right]\\
&= \sum_{i=1}^N \left[\left(\sum_{j=1}^{n_i}x_{i,j}^2\right) + n_i(m^2 -2m_im + m_i^2) - n_im_i^2\right]\\
&= \sum_{i=1}^N \left[n_i(m_i-m)^2 + \sum_{j=1}^{n_i}\left(x_{i,j}^2-m_i^2\right) \right]\\
&= \sum_{i=1}^N \left[n_i(m_i-m)^2 + \sum_{j=1}^{n_i}\left(x_{i,j}^2-2x_{i,j}m_i + m_i^2\right) \right]\\
&= \sum_{i=1}^N \left[n_i(m_i-m)^2 + \sum_{j=1}^{n_i}\left(x_{i,j}-m_i\right)^2 \right]\\
&= \sum_{i=1}^N \left[n_i(m_i-m)^2 + (n_i-1)v_i \right].
\end{align}$$
The fifth line uses $\sum_{j=1}^{n_i} x_{i,j} = n_i m_i$, so that $\sum_{j=1}^{n_i}\left(-2x_{i,j}m_i + m_i^2\right) = -n_im_i^2 = \sum_{j=1}^{n_i}\left(-m_i^2\right)$.

All that remains is to divide both sides by $n-1$ and we are done.
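As a quick numerical sanity check of the identity derived above, here is a minimal Python sketch. The function name `combined_variance` and the randomly generated test groups are illustrative assumptions, not part of the original answer. It pools per-group sizes, means, and sample variances, then compares the result with the sample variance computed directly from all data points.

```python
import math
import random
import statistics


def combined_variance(sizes, means, variances):
    """Pool per-group sample statistics into the overall sample variance.

    Hypothetical helper: sizes are n_i, means are m_i, and variances are
    the sample variances v_i (computed with an n_i - 1 denominator).
    """
    n = sum(sizes)
    # Overall mean m, from n * m = sum_i n_i * m_i
    m = sum(n_i * m_i for n_i, m_i in zip(sizes, means)) / n
    # Squared deviations of the group means from the overall mean
    between = sum(n_i * (m_i - m) ** 2 for n_i, m_i in zip(sizes, means))
    # Squared deviations of the points from their own group means
    within = sum((n_i - 1) * v_i for n_i, v_i in zip(sizes, variances))
    return (between + within) / (n - 1)


# Made-up example data: three groups of different sizes and centers.
random.seed(0)
groups = [[random.gauss(mu, 2.0) for _ in range(size)]
          for mu, size in [(0.0, 5), (3.0, 8), (-1.0, 12)]]

sizes = [len(g) for g in groups]
means = [statistics.mean(g) for g in groups]
variances = [statistics.variance(g) for g in groups]

pooled = combined_variance(sizes, means, variances)
direct = statistics.variance([x for g in groups for x in g])
assert math.isclose(pooled, direct, rel_tol=1e-9)
```

The sketch only needs the three summary statistics per group, which is the practical appeal of the formula: the raw data points never have to be kept around once each group has been summarized.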