Solved – use a paired t-test on data that are averages

I am comparing 15 pairs of averages, paired across two categories, with a paired t test. The sample sizes behind the two averages in a pair are not equal. For example, if pair 1 consists of 95.5 in category A and 83.6 in category B, the 95.5 is the average of a sample of 120 while the 83.6 is the average of a sample of 90.

Is this still an acceptable use of the paired t test?

It's probably not a problem at all, provided the sample sizes are similar to each other.

There could be complications with small sample sizes, though. Intuitively, an average of a small sample is more variable than an average of a large sample. If both averages in some pair are based on small samples, that pair could produce an unusual, outlying difference. It's well known that the Student $t$ test does not work well in such cases.


Let's pursue this a little further with a model and a simulation. Because this is meant only to illustrate a phenomenon, I'll propose only the simplest possible model (one for which the $t$ test ordinarily has no problems at all) and use only the simplest form of the $t$ test to avoid technical complications.

The model is that for each pair of averages, $(\bar x_A, \bar x_B)$, the data contributing to the first average are a sample of some Normal distribution with mean $\mu_A$ and variance $\sigma^2$; and the data contributing to the second average are a sample of some Normal distribution with mean $\mu_B$ and variance $\sigma^2$. The null hypothesis asserts that in every pair $\mu_A=\mu_B$. The alternative hypothesis is that there is some systematic difference $\delta$ and that in each pair $\mu_A = \mu_B + \delta$.

If all the sample sizes were the same, say equal to $m$, then the averages would all behave as independent Normal random variables with variance $\sigma^2/m$. This is where a paired $t$ test is ideal: under the null hypothesis the differences $\bar x_A - \bar x_B$ have Normal distributions with mean $0$ and variance $2\sigma^2/m$, and they are all independent. The Student $t$ statistic (the mean of the $\bar x_A$ minus the mean of the $\bar x_B$, divided by the estimated standard error of that mean difference) then has exactly a $t_{n-1}$ distribution, where $n$ is the number of pairs.

However, when the sample sizes vary, the resulting distribution is not exactly a Student $t$ distribution. The numerator of the $t$ statistic, being a linear combination of independent Normal variables, is still Normal; but the denominator is the square root of a sum of squares of Normal variables having different variances, and that sum no longer has a (scaled) $\chi^2$ distribution. We therefore have no right to expect the ratio to have a Student $t$ distribution.
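To spell the argument out, here is the same reasoning in symbols. The notation $d_i$, $m_{A,i}$, and $m_{B,i}$ (the difference and the two sample sizes in pair $i$) is introduced here only for illustration; it is not part of the model statement above.

$$
d_i = \bar x_{A,i} - \bar x_{B,i} \sim \mathcal{N}\!\left(\delta,\ \sigma^2\left(\frac{1}{m_{A,i}} + \frac{1}{m_{B,i}}\right)\right),
\qquad
t = \frac{\bar d}{s_d/\sqrt{n}},
\qquad
s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar d\right)^2 .
$$

When every sample size equals $m$, each $d_i$ has variance $2\sigma^2/m$, so $(n-1)s_d^2$ divided by $2\sigma^2/m$ has a $\chi^2_{n-1}$ distribution and $t$ is exactly $t_{n-1}$ under the null hypothesis ($\delta = 0$). When the sizes differ, the $d_i$ have different variances, no single scale factor turns $(n-1)s_d^2$ into a $\chi^2$ variable, and the exact result is lost.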


To see whether this might be a practical issue, I simulated $10{,}000$ paired t-tests for $20$ pairs. First I computed each value in each pair as the average of two independent observations (that is, every sample size was equal to $2$). All data were independently drawn from the same Normal distribution. (You may easily vary the code to simulate your particular circumstances.) I collected the t-statistic for each iteration. Here they are, displayed as a histogram. On it is drawn (as a red curve) the Student $t$ distribution with $20-1=19$ degrees of freedom: it is supposed to describe this histogram well, especially in the tails, which correspond to significant results.

Figure 1

It's a nice agreement between theory and simulation, confirming the appropriateness of the t-test (and, incidentally, showing the code is likely working as intended).

To create an extreme case (but not the most extreme), I supposed that one pair was based on samples of size $2$ while the others were based on samples of size $200$, but otherwise all data were independently drawn from the same Normal distribution.

Figure 2

Something went very wrong. The single pair based on a small sample has caused the $t$ statistics to be less extreme than we might otherwise suppose, but rarely near zero. This is due, as previously suggested, to its effect on the estimated standard deviation: the inflated SD pulls in the tails (it's hard to get a large fraction when its denominator is large), but the deviation of the single pair also dominates the numerator. Accordingly, numerator and denominator tend to be comparable in size (although the numerator can have either sign). That's why the histogram bunches up near $\pm 1$.
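The bunching can be checked with a limiting case: if one difference is nonzero and all the others are exactly zero, the paired t statistic equals the sign of that one difference, whatever its size. The following is a minimal sketch; the helper name paired.t and the example vectors are made up for illustration.

```r
# Paired t statistic computed from a vector of within-pair differences.
paired.t <- function(z) mean(z) / sd(z) * sqrt(length(z))

# One dominant difference, nineteen differences of exactly zero.
z.pos <- c(5, rep(0, 19))
z.neg <- c(-3, rep(0, 19))

t.pos <- paired.t(z.pos)
t.neg <- paired.t(z.neg)
print(c(t.pos, t.neg))
```

With a single nonzero difference $d$ among $n$, the mean is $d/n$ and the SD is $|d|/\sqrt{n}$, so the statistic reduces to $d/|d|$. In the simulation the other differences are small rather than zero, which spreads the values around $\pm 1$ instead of placing them exactly there.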

As a result, the t-test will have a harder time detecting a departure from the null hypothesis: it will be less powerful than we think.
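To put rough numbers on that loss, one can estimate how often the ordinary paired t-test rejects at the 5% level, both under the null hypothesis and under a shifted alternative. This is only a sketch under the simple model above: the function name reject.rate, the matrices balanced and extreme, the shift of 0.1, and the seed are all choices made here for illustration.

```r
set.seed(1)
n.sim <- 1e4  # Simulation size

# Estimated rejection rate of the two-sided paired t-test at level alpha.
# n.obs: matrix of sample sizes, one row per pair, columns A and B.
# delta: true difference between the A and B means.
reject.rate <- function(n.obs, delta = 0, alpha = 0.05) {
  n <- nrow(n.obs)
  mu <- c(A = delta, B = 0)
  # Each simulated average has standard deviation 1/sqrt(sample size).
  x <- array(rnorm(n.sim * length(n.obs), rep(mu, each = n),
                   as.vector(1 / sqrt(n.obs))),
             dim = c(n, 2, n.sim))
  t.stat <- apply(x, 3, function(y) {
    z <- y[, 1] - y[, 2]
    mean(z) / sd(z) * sqrt(length(z))
  })
  mean(abs(t.stat) > qt(1 - alpha / 2, n - 1))
}

balanced <- cbind(A = rep(200, 20), B = rep(200, 20))
extreme  <- cbind(A = c(2, rep(200, 19)), B = c(2, rep(200, 19)))

rates <- c(null.balanced  = reject.rate(balanced),
           null.extreme   = reject.rate(extreme),
           power.balanced = reject.rate(balanced, delta = 0.1),
           power.extreme  = reject.rate(extreme, delta = 0.1))
print(round(rates, 3))
```

Comparing the balanced and extreme columns shows how much a single badly undersized pair costs; substituting your own sample sizes for the two matrices gives the corresponding figures for your data.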


In practice, we can expect there to be a certain amount of this behavior in your data. A deeper analysis of variance estimates and $\chi^2$ distributions indicates it really won't be much of a problem unless there are indeed some pairs with radically smaller sample sizes than others.

Although I haven't fully analyzed any alternatives (the question did not ask for a solution, only for whether a paired t-test would work!), I believe that this analysis could readily be extended to study an obvious weighted version of the t-test, weighting each difference by the reciprocal of its variance (so that pairs based on larger samples count for more), and that the weighted version would have superior performance. It could take some effort to figure out the appropriate degrees of freedom to use in general.
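As a rough illustration of what such a weighted test might look like, here is a sketch that uses inverse-variance weights. It assumes the simple model above holds exactly (every underlying observation has the same variance), in which case the variance of a difference is proportional to the sum of the reciprocals of its two sample sizes and the statistic coincides with the intercept t statistic from a weighted least squares fit. The names n.obs, w, and weighted.t are made up for this example, and the choice of $n-1$ degrees of freedom is justified only under that assumption.

```r
set.seed(17)
# Hypothetical sample sizes: one tiny pair, nineteen large ones.
n.obs <- cbind(A = c(2, rep(200, 19)), B = c(2, rep(200, 19)))
n <- nrow(n.obs)
n.sim <- 1e4

# Inverse-variance weights (up to the unknown common factor sigma^2).
w <- 1 / (1 / n.obs[, "A"] + 1 / n.obs[, "B"])

# Weighted t statistic for a vector of differences z with weights w.
weighted.t <- function(z, w) {
  z.bar <- sum(w * z) / sum(w)                    # Weighted mean difference
  s2 <- sum(w * (z - z.bar)^2) / (length(z) - 1)  # Estimate of sigma^2
  z.bar / sqrt(s2 / sum(w))
}

# Simulate under the null hypothesis and apply the weighted test.
x <- array(rnorm(n.sim * length(n.obs), 0, as.vector(1 / sqrt(n.obs))),
           dim = c(n, 2, n.sim))
t.w <- apply(x, 3, function(y) weighted.t(y[, 1] - y[, 2], w))

# Rejection rate at the 5% level, referred to Student t with n-1 df.
reject.w <- mean(abs(t.w) > qt(0.975, n - 1))
print(reject.w)
```

If the variances within the samples differ from pair to pair, or if the pairs themselves vary beyond sampling error, these weights are no longer the right ones and the degrees of freedom question raised above remains open.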


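This is the R code that created the figures. By suitably changing the matrix n.obs you can specify the sample sizes in your pairs and re-run it to study the extent to which using a paired t-test might be problematic. (The name of that matrix is missing from this copy of the code; n.obs is a stand-in.)

    # Sample sizes for each pair: one row per pair, columns A and B.
    # NOTE: the variable name was lost in the source; n.obs is a stand-in.
    n.obs <- cbind(A=c(2, rep(200, 19)), B=c(2, rep(200, 19)))  # One tiny pair (Figure 2)
    n.obs <- cbind(A=rep(2,20), B=rep(2,20))  # All sizes 2 (Figure 1); overrides the line above
    n <- nrow(n.obs)   # Number of pairs
    mu <- c(A=0, B=0)  # The underlying group means
    n.sim <- 1e4       # Simulation size

    # Create the data: each average has standard deviation 1/sqrt(sample size).
    x <- array(rnorm(n.sim*length(n.obs), rep(mu, each=n), 1/sqrt(n.obs)),
               dim=c(n, 2, n.sim))

    # Run the t-tests.
    t.stat <- apply(x, 3, function(y) {
      z <- y[,1]-y[,2]
      mean(z) / sd(z) * sqrt(length(z))
    })

    # Display the results and compare to the Student t distribution.
    hist(t.stat, freq=FALSE, breaks=50)
    curve(dt(x, n-1), add=TRUE, col="Red", lwd=2)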
