I have a data set that has been divided into $n$ data subsets.
I am sampling from each of these subsets and getting a tuple consisting of mean, variance, confidence and number of sampled points used.
How can I combine these results?
The only approach I know is a simple function of the number of points and their averages. This won't take into account either the variance or the confidence of the score.
Best Answer
Let $n_i, m_i, v_i$ be the number of observations, observed mean, and sample variance (with the $n_i - 1$ denominator) in sample $i$. Let $n, m, v$ be the corresponding figures for the combined data, so $n = \sum_i n_i$ (sorry, I redefined $n$ here).
$$m = \frac{1}{n}\sum_i n_i m_i.$$
Now for the variance:
$$v = \frac{1}{n-1}\sum_{i,j} (x_{i,j} - m)^2$$
with $x_{i,j}$ the $j^{\text{th}}$ observation of sample $i$ and $j=1,2,\ldots, n_i$.
Play around a little:
$$(x_{i,j} - m)^2 = (x_{i,j} - m_i + m_i - m)^2 = (x_{i,j} - m_i)^2 + (m_i - m)^2 + 2(x_{i,j} - m_i)(m_i - m).$$
Terms $(m_i-m)$ can be factored out of the summation over $j$:
$$v = \frac{1}{n-1}\left[\sum_i n_i(m_i - m)^2 + 2\sum_i (m_i - m)\sum_j (x_{i,j} - m_i) + \sum_{i,j} (x_{i,j} - m_i)^2\right].$$
Since $\sum_j (x_{i,j} - m_i) = 0$ by definition of $m_i$, the middle term vanishes. Using $\sum_j (x_{i,j} - m_i)^2 = (n_i - 1)v_i$ for the last term, you're left with:
$$v = \frac{1}{n-1}\left[\sum_i n_i(m_i - m)^2 + \sum_i (n_i - 1)v_i\right]$$
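To make the two formulas concrete, here is a minimal Python sketch. The function name `combine_stats` and the example data are my own illustration, not part of the question. It assumes each $v_i$ is the sample variance with the $n_i - 1$ denominator, and it compares the combined result with the mean and variance computed directly from the pooled observations.

```python
from statistics import mean, variance


def combine_stats(groups):
    """Combine per-sample summaries into overall (n, m, v).

    groups: iterable of (n_i, m_i, v_i), where v_i is the sample
    variance of sample i computed with the n_i - 1 denominator.
    """
    groups = list(groups)
    n = sum(n_i for n_i, _, _ in groups)
    if n < 2:
        raise ValueError("need at least two observations in total")
    # Combined mean: average of the sample means weighted by sample size.
    m = sum(n_i * m_i for n_i, m_i, _ in groups) / n
    # Between-sample part: n_i * (m_i - m)^2.
    between = sum(n_i * (m_i - m) ** 2 for n_i, m_i, _ in groups)
    # Within-sample part: (n_i - 1) * v_i.
    within = sum((n_i - 1) * v_i for n_i, _, v_i in groups)
    v = (between + within) / (n - 1)
    return n, m, v


# Hypothetical data: three samples of different sizes.
samples = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 7.0, 9.0, 11.0]]
summaries = [(len(s), mean(s), variance(s)) for s in samples]
n, m, v = combine_stats(summaries)

# Direct computation on the pooled observations, for comparison.
pooled = [x for s in samples for x in s]
print(n, m, v)
print(len(pooled), mean(pooled), variance(pooled))
```

Both printed lines should agree up to floating-point rounding, since the combined formula is an exact algebraic identity rather than an approximation.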
Confidence intervals can then be obtained from $m$ and $v$. Is that what you were looking for?
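As a rough sketch of that last step, one common choice is the normal-approximation interval $m \pm z\sqrt{v/n}$. This is an assumption on my part: it treats the pooled data as a single simple random sample and $n$ as reasonably large. For small $n$ a Student-t quantile would be more appropriate. The function name `mean_confidence_interval` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(n, m, v, level=0.95):
    """Normal-approximation confidence interval for the mean.

    n, m, v: combined count, mean, and sample variance.
    level: two-sided confidence level, e.g. 0.95.
    """
    # Standard normal quantile for the two-sided level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Standard error of the mean is sqrt(v / n).
    half_width = z * sqrt(v / n)
    return m - half_width, m + half_width


# Hypothetical combined figures.
low, high = mean_confidence_interval(n=9, m=5.0, v=4.0)
print(low, high)
```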