Solved – Are parametric tests on rank transformed data equivalent to non-parametric test on raw data

Many non-parametric tests are identical to their parametric equivalent on ranked data. At least, that's what I learned from this blog post on Friedman's test and from skimming this 1981 article. This seems immensely practical, especially for pedagogical purposes. But I couldn't find any demonstrations of this equivalence, so I decided to try it out myself.

However, although the results match closely, they don't match exactly, and for paired samples the difference is large. Am I missing something or is this "equivalence" imperfect? Here are a few examples:

```r
# Generate two dependent samples.
set.seed(42)
x1 = rnorm(20)
x2 = x1 + rnorm(20, 1, 4)
x = data.frame(score=c(x1,x2), time=rep(c('pre', 'post'), each=20))

# Correlation of ranks. Exact correlation.
# p_spearman=0.0074, p_pearson=0.0064
cor.test(x1, x2, method='spearman')
cor.test(rank(x1), rank(x2), method='pearson')

# Unpaired samples between-subjects-difference test.
# p_wilcox=0.718, p_t-test=0.711
wilcox.test(x$score ~ x$time)
t.test(rank(x$score) ~ x$time)

# Paired samples within-subject-difference test. Bad p-value?
# p_mann-whitney=0.927, p_t-test=1.00
wilcox.test(x1, x2, paired=T)
t.test(rank(x1), rank(x2), paired=T)
```

I think it's important to clearly distinguish between

a. using a parametric statistic on the ranks as the basis for a nonparametric test

b. using a parametric test as is, on the ranks

(we might also consider a third option — like "b." but in some way scaling or adjusting the statistic to get a better approximation to the "true" p-value from ordinary tables. I'll ignore this possibility for now, but it may be a fruitful endeavour.)

In the first case, we would compute the statistic as usual, but when finding the p-value we'd look at the distribution of that test statistic under the null. In particular, the non-parametric rank-based tests are permutation tests (which – because the set of ranks is fixed for each sample size, for continuous distributions – don't depend on the specific observed values). So we would compute the permutation distribution of the parametric test applied to the ranks.

When we do that, we do indeed sometimes get a test that's equivalent to a well-known non-parametric test (equivalent in this case means that it "orders" the set of possible samples in the same way, so it will always give the same p-values).
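As a minimal sketch of case "a.", the code below enumerates every way of splitting the pooled ranks into two groups, computes the equal-variance t statistic on the ranks for each split, and uses that permutation distribution to get a two-sided p-value. The sample sizes, the seed, the simulated data and the helper name `t_stat` are all assumptions chosen for illustration; with no ties, the result should agree with the exact p-value from `wilcox.test`.

```r
# Illustrative data only: sizes, seed and shift are arbitrary assumptions.
set.seed(1)
n1 <- 5; n2 <- 5; n <- n1 + n2
x <- rnorm(n1)
y <- rnorm(n2, mean = 1)

# Pooled ranks; the first n1 entries belong to x.
r <- rank(c(x, y))

# Hypothetical helper: equal-variance t statistic on the ranks,
# with the positions in idx treated as group 1.
t_stat <- function(idx) {
  unname(t.test(r[idx], r[-idx], var.equal = TRUE)$statistic)
}

t_obs <- t_stat(seq_len(n1))

# Permutation distribution: every possible assignment of n1 ranks to group 1.
splits <- combn(n, n1)
t_perm <- apply(splits, 2, t_stat)

# Two-sided permutation p-value (small tolerance guards against
# floating-point differences between equal-magnitude statistics).
p_perm <- mean(abs(t_perm) >= abs(t_obs) - 1e-9)

# Exact Wilcoxon-Mann-Whitney p-value on the raw data.
p_wmw <- wilcox.test(x, y, exact = TRUE)$p.value

c(permutation_t_on_ranks = p_perm, exact_wmw = p_wmw)
```

Full enumeration is only practical for small samples; for larger ones, a random sample of permutations would be the usual substitute.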

In the second case, we simply ignore that we have ranks and treat the ranks as if they were independent samples from whatever the assumed distribution was. That won't give the same p-values as the nonparametric test. Indeed, in small samples the distribution can't be right. However, for some tests, at larger sample sizes it can become fairly close, and then the tests will have about the right significance levels. When that happens, p-values may be quite similar to what they were in the first case.

We can see this with the ordinary equal variance two-sample t-test vs the Wilcoxon test:

The first plot shows that in this example the p-values for each of the samples are indeed in the same order (the monotonicity indicates that the "equivalent tests" claim under part a holds up, as is already known for this pair of tests). It is also encouraging because the p-value pairs look quite close to the $y=x$ line. The second plot shows the difference in p-values. Now we can see that the t-test applied directly to ranks as if they were i.i.d. normal data gives p-values that are nearly always lower than the Wilcoxon-Mann-Whitney ones (and indeed, typically too low).

[Other sample sizes show similar patterns – at equal sample sizes the broad shape of the pattern of differences remains, but the scale on the y-axis of the second plot gets smaller as sample size goes up; at unequal sample sizes the shape of the second plot changes but the p-values for the t-test remain lower.]

So if we use the test as in "b.", we reject too often at any significance level.

However, since that difference grows smaller as sample size increases, if both samples are large, this may not bother us much.
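One way to check the "rejects too often" claim in a small case is to enumerate the null distribution exactly rather than simulate it. The sketch below assumes two groups of 5 and a 5% level (both arbitrary choices for illustration), applies each test to every possible split of the ranks 1 to 10, and reports the proportion of splits each test rejects.

```r
# Assumed illustration settings: two groups of 5, nominal level 0.05.
n1 <- 5; n <- 2 * n1
alpha <- 0.05
v <- seq_len(n)  # the ranks 1..n (no ties)

# Under the null, every split of the ranks into two groups is equally likely.
splits <- combn(n, n1)

# For each split: p-value from the t-test used "as is" on the ranks,
# and the exact Wilcoxon-Mann-Whitney p-value.
p_vals <- apply(splits, 2, function(idx) {
  c(t   = t.test(v[idx], v[-idx], var.equal = TRUE)$p.value,
    wmw = wilcox.test(v[idx], v[-idx])$p.value)
})

# Actual rejection rates at the nominal level.
size_t   <- mean(p_vals["t", ] < alpha)
size_wmw <- mean(p_vals["wmw", ] < alpha)

c(t_on_ranks = size_t, exact_wmw = size_wmw, nominal = alpha)
```

Because the exact test is discrete, its rejection rate sits at or below the nominal level, while the t-test on ranks ends up above it at this sample size.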

(Note that this discussion hasn't investigated power yet, nor any other tests than this simple comparison, but many of the points I made will carry over to other tests.)

Oh, I guess people will want code. I did that in R:

```r
n1=40;n=n1+n1
res=replicate(1000,{v=sample(n);
                    c(t.test(v[1:n1],v[(n1+1):n],var.equal=TRUE)$p.value,
                      wilcox.test(v[1:n1],v[(n1+1):n])$p.value)
                    })
```

Note that `v` contains the current random permutation of ranks under the null.

Takes about a second on my laptop. Note that the t-test p-values are in the first row of `res` and the WMW p-values are in the second row.
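The two plots discussed earlier could be drawn from `res` roughly as follows. This is only a sketch: the seed, the axis assignments, the labels and the reference lines are assumptions, not necessarily how the original figures were made.

```r
# Re-run the simulation so this block is self-contained (seed is an assumption).
set.seed(1)
n1 <- 40; n <- 2 * n1
res <- replicate(1000, {
  v <- sample(n)  # random permutation of the ranks under the null
  c(t.test(v[1:n1], v[(n1 + 1):n], var.equal = TRUE)$p.value,
    wilcox.test(v[1:n1], v[(n1 + 1):n])$p.value)
})
p_t   <- res[1, ]  # t-test on ranks
p_wmw <- res[2, ]  # Wilcoxon-Mann-Whitney

op <- par(mfrow = c(1, 2))

# Plot 1: p-value pairs, with the y = x line for reference.
plot(p_wmw, p_t, xlab = "WMW p-value", ylab = "t-test-on-ranks p-value")
abline(0, 1, col = "grey")

# Plot 2: difference in p-values against the WMW p-value.
plot(p_wmw, p_t - p_wmw, xlab = "WMW p-value",
     ylab = "t p-value minus WMW p-value")
abline(h = 0, col = "grey")

par(op)
```

Changing `n1` (or using unequal group sizes) is the natural way to explore how the pattern in the second plot varies with sample size.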
