Solved – How to combine subsets consisting of mean, variance, confidence, and number of sampled points used

I have a data set that has been divided into $n$ data subsets.

I am sampling from each of these subsets and getting a tuple consisting of mean, variance, confidence and number of sampled points used.

How can I combine these results?

I do not know how to do this other than with a simple function of the number of points and their averages. This won't take into account either the variance or the confidence of the score.

Let $n_i$, $m_i$, $v_i$ be the number of observations, observed mean, and variance in sample $i$. Let $n$, $m$, $v$ be the corresponding figures for the combined data, so $n = \sum_i n_i$ (sorry, I redefined $n$ here).

$$m = \frac{1}{n}\sum_i n_i m_i$$.

Now for the variance:

$$v = \frac{1}{n-1}\sum_{i,j} (x_{i,j} - m)^2$$

with $x_{i,j}$ the $j^{th}$ observation of sample $i$ and $j=1,2,\ldots, n_i$.

Play around a little:

$$(x_{i,j} - m)^2 = (x_{i,j} - m_i + m_i - m)^2 = (x_{i,j} - m_i)^2 + (m_i - m)^2 + 2(x_{i,j} - m_i)(m_i - m)$$.

Terms $(m_i-m)$ can be factored out of the summation over $j$:

$$v = \frac{1}{n-1}\left[\sum_i n_i(m_i-m)^2 + 2\sum_i(m_i-m)\sum_j(x_{i,j}-m_i) + \sum_{i,j} (x_{i,j} - m_i)^2\right]$$.

Since $\sum_j (x_{i,j}-m_i)=0$, the middle term vanishes, and $\sum_j (x_{i,j}-m_i)^2 = (n_i-1)v_i$ by the definition of the sample variance. So you're left with:

$$v=\frac{1}{n-1}\left[\sum_i n_i(m_i-m)^2 + \sum_i(n_i-1)v_i\right]$$
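As a quick sanity check, the combined mean and variance can be sketched in Python. This is only an illustration: the function name `combine_subsets` and the toy data are made up for the example, and it assumes each $v_i$ is the unbiased sample variance (divisor $n_i - 1$), as in the derivation.

```python
from statistics import mean, variance


def combine_subsets(stats):
    """Combine (n_i, m_i, v_i) tuples into (n, m, v) for the pooled data.

    Assumes every v_i is the unbiased sample variance (divisor n_i - 1).
    """
    n = sum(n_i for n_i, _, _ in stats)
    # Weighted mean of the subset means.
    m = sum(n_i * m_i for n_i, m_i, _ in stats) / n
    # Between-subset term: sum_i n_i (m_i - m)^2.
    between = sum(n_i * (m_i - m) ** 2 for n_i, m_i, _ in stats)
    # Within-subset term: sum_i (n_i - 1) v_i.
    within = sum((n_i - 1) * v_i for n_i, _, v_i in stats)
    v = (between + within) / (n - 1)
    return n, m, v


# Hypothetical toy data, used only to compare against a direct computation.
subsets = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 7.0, 9.0, 11.0]]
stats = [(len(s), mean(s), variance(s)) for s in subsets]
n, m, v = combine_subsets(stats)

# Direct computation on the concatenated data for comparison.
pooled = [x for s in subsets for x in s]
print(n, m, v)
print(len(pooled), mean(pooled), variance(pooled))
```

Both printed lines should agree up to floating-point rounding, since the formula is exact and only the summary statistics are needed, not the raw observations.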

Confidence intervals can then be computed from $m$ and $v$. Is that what you were looking for?
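For instance, one common way to turn $n$, $m$ and $v$ into an interval for the mean is the normal approximation $m \pm z\sqrt{v/n}$. The sketch below assumes that approximation is acceptable (independent observations, $n$ reasonably large); the function name `confidence_interval` and the default level are choices made for this example, not something stated in the question.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(n, m, v, level=0.95):
    """Normal-approximation confidence interval for the mean.

    Assumes independent observations and n large enough for the
    normal approximation; for small n a Student t quantile would
    be the usual substitute.
    """
    # Two-sided quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(v / n)
    return m - half_width, m + half_width


# Hypothetical combined figures: n = 100, m = 10, v = 4.
low, high = confidence_interval(100, 10.0, 4.0)
print(low, high)
```

The interval is centred on $m$ and shrinks like $1/\sqrt{n}$ as more points are pooled.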
