Reading through CV all-time classics I came across a statement that I would like to clarify. This is the post and my question refers to the closing remarks: "I have to note that all of the knowledge I just imparted is somewhat obsolete; now that we have computers, we can do better than t-tests. As Frank notes, you probably want to use Wilcoxon tests anywhere you were taught to run a t-test."
The lack of worries about whether it is sound to assume that the distribution of the sample means is normal enough to run the t-test is obviously a huge advantage. And I see that computers can rank long lists of differences between two vectors of data in a breeze… I remember doing it manually many years ago, but I digress…
So, is the t-test truly a thing of the past? What about permutation tests? Are they too ad hoc in the sense of typically entailing writing a few lines of code?
I wouldn't say the classic one sample (including paired) and two-sample equal variance t-tests are exactly obsolete, but there's a plethora of alternatives that have excellent properties and in many cases they should be used.
Nor would I say the ability to rapidly perform Wilcoxon-Mann-Whitney tests on large samples – or even permutation tests – is recent: I was doing both routinely more than 30 years ago as a student, and the capability to do so had been available for a long time at that point.
While it's vastly easier to code a permutation test – even from scratch – than it once was$^\dagger$, it wasn't difficult even then (if you had code to do it once, modifications to do it under different circumstances – different statistics, different data, etc. – were straightforward, generally not requiring a background in programming).
So here are some alternatives, and why they can help:
- Welch-Satterthwaite – when you're not confident variances will be close to equal (if sample sizes are the same, the equal variance assumption is not critical).
- Wilcoxon-Mann-Whitney – excellent if tails are normal or heavier than normal, particularly in cases that are close to symmetric. If tails tend to be close to normal, a permutation test on the means will offer slightly more power.
- Robustified t-tests – there are a variety of these that have good power at the normal but also work well (and retain good power) under heavier-tailed or somewhat skewed alternatives.
- GLMs – useful for counts or continuous right-skewed cases (e.g. gamma); designed to deal with situations where the variance is related to the mean.
- Random effects or time-series models – may be useful in cases where there are particular forms of dependence.
- Bayesian approaches, bootstrapping and a plethora of other important techniques, which can offer similar advantages to the above ideas. For example, with a Bayesian approach it's quite possible to have a model that can account for a contaminating process, deal with counts or skewed data, and handle particular forms of dependence, all at the same time.
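As a rough illustration of how little effort the first two alternatives take, here is a minimal sketch in base R that runs the equal-variance t-test, the Welch-Satterthwaite t-test and the Wilcoxon-Mann-Whitney test on the small samples used in the footnote's permutation example. The names `t_equal`, `t_welch`, `wmw` and `p_values` are just illustrative.

```r
# Data from the permutation-test example in the footnote
x <- c(53.4, 59.0, 40.4, 51.9, 43.8, 43.0, 57.6)
y <- c(49.1, 57.9, 74.8, 46.8, 48.8, 43.7)

# Classic equal-variance two-sample t-test
t_equal <- t.test(x, y, var.equal = TRUE)

# Welch-Satterthwaite t-test (the default behaviour of t.test)
t_welch <- t.test(x, y)

# Wilcoxon-Mann-Whitney test (exact p-value here: small samples, no ties)
wmw <- wilcox.test(x, y)

# Collect the two-sided p-values side by side
p_values <- c(equal_var_t = t_equal$p.value,
              welch_t     = t_welch$p.value,
              wmw         = wmw$p.value)
print(round(p_values, 3))
```

Switching between these tests is a one-argument change, which is part of why there is so little cost in dropping an assumption you are unsure of.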
While a plethora of handy alternatives exist, the old stock standard equal variance two-sample t-test can often perform well in large, equal-size samples as long as the population isn't very far from normal (such as being very heavy tailed/skew) and we have near-independence.
The alternatives are useful in a host of situations where we might not be as confident with the plain t-test… and nevertheless generally perform well when the assumptions of the t-test are met or close to being met.
The Welch test is a sensible default if the distribution tends not to stray too far from normal (with larger samples allowing more leeway).
While the permutation test is excellent, with no loss of power compared to the t-test when its assumptions hold (and the useful benefit of giving inference directly about the quantity of interest), the Wilcoxon-Mann-Whitney is arguably a better choice if tails may be heavy; with a minor additional assumption, the WMW can give conclusions that relate to mean-shift. (There are other reasons one might prefer it to the permutation test.)
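To make the shift interpretation concrete, here is a small sketch using `wilcox.test` with `conf.int = TRUE`. It assumes the two distributions differ only by a location shift (that is the additional assumption), and it reuses the footnote's data; `wmw_shift` and `hl_by_hand` are illustrative names.

```r
# Data from the permutation-test example in the footnote
x <- c(53.4, 59.0, 40.4, 51.9, 43.8, 43.0, 57.6)
y <- c(49.1, 57.9, 74.8, 46.8, 48.8, 43.7)

# conf.int = TRUE adds an estimate of the location shift (x relative to y)
# and a confidence interval for it, valid under a pure-shift model
wmw_shift <- wilcox.test(x, y, conf.int = TRUE)
print(wmw_shift$estimate)
print(wmw_shift$conf.int)

# In the exact case the estimate is the Hodges-Lehmann estimator:
# the median of all pairwise differences x[i] - y[j]
hl_by_hand <- median(outer(x, y, "-"))
print(hl_by_hand)
```

If the shift model holds and the means exist, a shift in location is also a shift in the mean, which is how the WMW result can be related back to the question the t-test addresses.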
[If you know you're dealing with say counts, or waiting times or similar kinds of data, the GLM route is often sensible. If you know a little about potential forms of dependence, that, too is readily handled, and the potential for dependence should be considered.]
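As a sketch of the GLM route for waiting-time data, the following fits a gamma GLM with a log link to two groups. The data are simulated purely for illustration (they are not from any real study), and the group labels, sample size and means are arbitrary assumptions.

```r
set.seed(1)  # simulated data, purely illustrative

n <- 50
group <- factor(rep(c("a", "b"), each = n))

# Gamma waiting times with common shape 2 and means 10 (group a) and 13 (group b)
wait <- c(rgamma(n, shape = 2, rate = 2 / 10),
          rgamma(n, shape = 2, rate = 2 / 13))

# Gamma GLM with log link: variance proportional to the squared mean
fit <- glm(wait ~ group, family = Gamma(link = "log"))
print(summary(fit)$coefficients)

# With a log link, exp(coefficient) is the estimated ratio of group means
mean_ratio <- exp(coef(fit)[["groupb"]])
print(mean_ratio)
```

The test on the `groupb` coefficient plays the role of the two-sample comparison, but on a ratio-of-means scale that suits right-skewed data with spread that grows with the mean.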
So while the t-test surely won't be a thing of the past, you can nearly always do just as well or almost as well when it applies, and potentially gain a great deal when it doesn't by enlisting one of the alternatives. Which is to say, I broadly agree with the sentiment in that post relating to the t-test… much of the time you should probably think about your assumptions before even collecting the data, and if any of them may not be really expected to hold up, with the t-test there's usually almost nothing to lose in simply not making that assumption since the alternatives usually work very well.
If you are going to the great trouble of collecting data, there's certainly no reason not to invest a little time sincerely considering the best way to approach your inferences.
Note that I generally advise against explicit testing of assumptions. Not only does it answer the wrong question, but testing an assumption and then choosing an analysis based on whether it was rejected changes the properties of both candidate tests. Unless you can reasonably safely make the assumption (either because you know the process well enough to assume it, or because the procedure is not sensitive to it in your circumstances), you're generally better off using the procedure that doesn't assume it.
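One way to see the effect of choosing a test after checking an assumption is a small simulation. The sketch below is only illustrative: the scenario (normal data, unequal sample sizes, the smaller sample having the larger spread, an F test for equal variances as the pre-test) is an assumption chosen for the example, not something taken from the discussion above.

```r
set.seed(123)

n_sim <- 2000
alpha <- 0.05
n1 <- 10; n2 <- 30     # unequal sample sizes (assumed scenario)
sd1 <- 2; sd2 <- 1     # the smaller sample has the larger spread

one_run <- function() {
  # The null hypothesis is true: both means are 0
  a <- rnorm(n1, mean = 0, sd = sd1)
  b <- rnorm(n2, mean = 0, sd = sd2)
  p_welch <- t.test(a, b)$p.value
  p_equal <- t.test(a, b, var.equal = TRUE)$p.value
  # Two-stage procedure: pre-test the variances, then pick the t-test
  p_two_stage <- if (var.test(a, b)$p.value < alpha) p_welch else p_equal
  c(always_welch = p_welch, always_equal = p_equal, two_stage = p_two_stage)
}

p_mat <- replicate(n_sim, one_run())       # 3 x n_sim matrix of p-values
rejection_rates <- rowMeans(p_mat < alpha) # estimated type I error rates
print(rejection_rates)
```

Comparing each estimated rejection rate with the nominal 0.05 shows how the strategy as a whole behaves, which is the relevant property when the choice of test depends on the data.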
$\dagger$ Nowadays, it's so simple as to be trivial. Here's a complete-enumeration permutation test and also a test based on sampling the permutation distribution (with replacement) for a two-sample comparison of means in R:
    # set up some data
    x <- c(53.4, 59.0, 40.4, 51.9, 43.8, 43.0, 57.6)
    y <- c(49.1, 57.9, 74.8, 46.8, 48.8, 43.7)
    xyv <- stack(list(x=x,y=y))$values   # pooled values, x first then y
    nx  <- length(x)

    # do sample-x mean for all combinations for permutation test
    permmean <- combn(xyv,nx,mean)

    # do the equivalent resampling for a randomization test
    randmean <- replicate(100000,mean(sample(xyv,nx)))

    # find p-value for permutation test
    left  <- mean(permmean<=mean(x))
    # for the other tail, "at least as extreme" being as far above as the
    # sample was below
    right <- mean(permmean>=(mean(xyv)*2-mean(x)))
    pvalue_perm <- left+right
    "Permutation test p-value"; pvalue_perm

    # this is easier:
    #   pvalue = mean(abs(permmean-mean(xyv))>=abs(mean(x)-mean(xyv)))
    # but I'd keep left and right above for adapting to other tests

    # find p-value for randomization test
    left  <- mean(randmean<=mean(x))
    right <- mean(randmean>=(mean(xyv)*2-mean(x)))
    pvalue_rand <- left+right
    "Randomization test p-value"; pvalue_rand
(The resulting p-values are 0.538 and 0.539 respectively; the corresponding ordinary two sample t-test has a p-value of 0.504 and the Welch-Satterthwaite t-test has a p-value of 0.522.)
Note that the core calculation is a single line in each case (`combn` for the permutation test, `replicate` for the randomization test), and the p-value could also be computed in one line.
Adapting this to a function which carried out a permutation test or randomization test and produced output rather like a t-test would be a trivial matter.
Here's a display of the results:
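    # Draw a display to show distn & p-value region for both
    # par() returns the previous values of the settings it changes,
    # so they can be restored cleanly afterwards
    opar <- par(mfrow=c(2,1))

    hist(permmean, breaks=100, xlim=c(45,58))
    abline(v=mean(x), col=3)                        # observed mean of x
    abline(v=mean(xyv)*2-mean(x), col=3, lty=2)     # equally extreme point in the other tail
    abline(v=mean(xyv), col=4)                      # pooled mean

    hist(randmean, breaks=100, xlim=c(45,58))
    abline(v=mean(x), col=3)
    abline(v=mean(xyv)*2-mean(x), col=3, lty=2)
    abline(v=mean(xyv), col=4)

    par(opar)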
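Wrapping the permutation code above in a function, as suggested earlier, might look something like the following. This is only a sketch: the function name `perm_test_means`, its `n_resamples` argument and the shape of the returned list are all hypothetical choices, and it uses the one-line absolute-deviation form of the two-sided p-value.

```r
# Two-sample permutation / randomization test for a difference in means.
# n_resamples = NULL  -> complete enumeration of all splits
# n_resamples = k     -> k randomly sampled splits
perm_test_means <- function(x, y, n_resamples = NULL) {
  pooled   <- c(x, y)
  nx       <- length(x)
  centre   <- mean(pooled)
  observed <- mean(x)

  if (is.null(n_resamples)) {
    # mean of the "x" group for every possible way of choosing nx values
    null_means <- combn(pooled, nx, mean)
    method <- "Permutation test (complete enumeration)"
  } else {
    null_means <- replicate(n_resamples, mean(sample(pooled, nx)))
    method <- "Randomization test (sampled permutations)"
  }

  # two-sided: at least as far from the pooled mean as the observed mean of x
  p_value <- mean(abs(null_means - centre) >= abs(observed - centre))

  list(method    = method,
       mean_x    = observed,
       mean_y    = mean(y),
       mean_diff = observed - mean(y),
       n_null    = length(null_means),
       p.value   = p_value)
}

x <- c(53.4, 59.0, 40.4, 51.9, 43.8, 43.0, 57.6)
y <- c(49.1, 57.9, 74.8, 46.8, 48.8, 43.7)

res_exact <- perm_test_means(x, y)
set.seed(42)
res_rand  <- perm_test_means(x, y, n_resamples = 10000)

print(res_exact$p.value)
print(res_rand$p.value)
```

Complete enumeration is only feasible for small samples, since the number of splits is `choose(length(x) + length(y), length(x))`; for anything larger, passing `n_resamples` is the practical option.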