# Solved – Confusion related to the Kruskal-Wallis test

I have some confusion about the Kruskal-Wallis test. Take this example:

``X=[2 2 35 10 9 8 11 12]; Y=[1 1 1 2 2 2 2 2]; ``

`Y` is the grouping variable.

When I ran the Kruskal-Wallis test

``p = kruskalwallis(X,Y,'off') ``

I got a p-value of around 0.4. I was assuming the Kruskal-Wallis test takes the median, so it should have been robust when I added an outlier with value 35 in the third position. Why isn't it robust to that? Is it because I have very few samples? Can anyone explain?


If `Y` is meant to be a grouping variable, the p-value in R is around 0.45:

``> kruskal.test(x~y)
Kruskal-Wallis rank sum test
data:  x by y
Kruskal-Wallis chi-squared = 0.5622, df = 1, p-value = 0.4534 ``

But it makes no difference whether that 35 is set to 13, 35, or 1300: the p-value is exactly the same, because any value above 12 receives the same (largest) rank. The test is clearly robust to outliers.
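To check this without R or MATLAB, here is a minimal Python sketch of the tie-corrected Kruskal-Wallis statistic using only the standard library. It is not the code behind `kruskal.test` or `kruskalwallis`; the helper names `midranks` and `kruskal_wallis_2groups` are made up for illustration, and the p-value shortcut via `math.erfc` assumes exactly two groups (a chi-square approximation with df = 1).

```python
import math
from collections import Counter

def midranks(values):
    # Rank of v = (number of smaller values) + average position among its ties.
    return [sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
            for v in values]

def kruskal_wallis_2groups(values, groups):
    """Tie-corrected Kruskal-Wallis H and chi-square p-value (two groups only)."""
    assert len(set(groups)) == 2, "the erfc shortcut below assumes df = 1"
    n = len(values)
    sizes = Counter(groups)
    rank_sums = Counter()
    for r, g in zip(midranks(values), groups):
        rank_sums[g] += r
    h = 12 / (n * (n + 1)) * sum(rank_sums[g] ** 2 / sizes[g] for g in sizes) - 3 * (n + 1)
    tie_correction = 1 - sum(t ** 3 - t for t in Counter(values).values()) / (n ** 3 - n)
    h /= tie_correction
    p = math.erfc(math.sqrt(h / 2))  # chi-square upper tail for df = 1
    return h, p

y = [1, 1, 1, 2, 2, 2, 2, 2]
p_values = {}
for third in (13, 35, 1300):
    x = [2, 2, third, 10, 9, 8, 11, 12]
    h, p = kruskal_wallis_2groups(x, y)
    p_values[third] = p
    print(third, round(h, 4), round(p, 4))
```

All three replacement values produce the same ranks, so the statistic and p-value cannot differ between them.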

With a continuity correction (as applied by `wilcox.test`), the p-value is somewhat higher, about 0.55.

Edit:

Here's an illustration of how the Kruskal-Wallis p-value responds as you move the third observation around; that is, an empirical influence curve for the p-value as `x` is moved (taking the various values of delta). We see that the Kruskal-Wallis p-value is highly insensitive to all but a small range of values for `x` (it is constant to the left of $[1,2]$ and constant to the right of it).

The grey line is the p-value with x omitted. As you see, no value for `x` will allow the Kruskal-Wallis to attain that p-value, though making `x=2` comes closest.

> I was assuming the Kruskal-Wallis test takes the median.

It's a rank-based ANOVA. It doesn't actually 'use' the median for anything.
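As a rough illustration of what "rank-based" means here, the Python sketch below (standard library only; `midranks` is an illustrative helper, not a library function) puts the group medians next to the mean ranks. The medians never enter the test statistic; only the ranks of the pooled data do.

```python
from statistics import median

x = [2, 2, 35, 10, 9, 8, 11, 12]
y = [1, 1, 1, 2, 2, 2, 2, 2]

def midranks(values):
    # Rank of v = (number of smaller values) + average position among its ties.
    return [sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
            for v in values]

ranks = midranks(x)  # ranks of the pooled sample, ties share an average rank
group_median = {}
group_mean_rank = {}
for g in sorted(set(y)):
    vals = [v for v, gi in zip(x, y) if gi == g]
    rks = [r for r, gi in zip(ranks, y) if gi == g]
    group_median[g] = median(vals)
    group_mean_rank[g] = sum(rks) / len(rks)
    print("group", g, "median", group_median[g], "mean rank", group_mean_rank[g])
```

The outlier 35 sits at rank 8 no matter how large it is, which pulls the mean rank of group 1 toward that of group 2 even though the two medians are far apart.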

The measure of location-shift that corresponds to the Wilcoxon-Mann-Whitney (and hence to the Kruskal-Wallis) is the median of pairwise differences between the samples.

``> median(outer(x[y==1],x[y==2],"-"))  -7 ``

Compare:

``> wilcox.test(x~y,conf.int=TRUE)
Wilcoxon rank sum test with continuity correction
data:  x by y
W = 5, p-value = 0.5486
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -10   5
sample estimates:
difference in location
             -6.999992    #<------------------------------- ``

(I'm not sure why it doesn't have better accuracy there)

If you change the 35 to 13 or 1300, you get the same estimate of shift.
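The same check can be sketched in Python with the standard library, mirroring the `median(outer(...))` call above. The function name `shift_estimate` is an illustrative choice, not part of any package.

```python
from statistics import median

group2 = [10, 9, 8, 11, 12]

def shift_estimate(a, b):
    # Median of all pairwise differences a_i - b_j between the two samples.
    return median(ai - bj for ai in a for bj in b)

estimates = {}
for third in (13, 35, 1300):
    estimates[third] = shift_estimate([2, 2, third], group2)
    print(third, estimates[third])
```

With 15 pairwise differences the median is the 8th smallest, and the five differences involving the third observation are all positive in each case, so they stay at the top of the ordering and never reach the middle.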

If, on the other hand, you add a whole new observation (that is, if the first group was originally just (2, 2) and the 35 is an additional data point), then the p-value changes. (This would be the case even if the median were the estimate of location shift.)
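A hedged Python sketch of that comparison, again standard library only: it computes the tie-corrected Kruskal-Wallis p-value with and without the third observation. The helpers `midranks` and `kruskal_wallis_2groups` are illustrative names, and the `math.erfc` shortcut assumes two groups (df = 1); this is not how R or MATLAB implement the test.

```python
import math
from collections import Counter

def midranks(values):
    # Rank of v = (number of smaller values) + average position among its ties.
    return [sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
            for v in values]

def kruskal_wallis_2groups(values, groups):
    """Tie-corrected Kruskal-Wallis H and chi-square p-value (two groups only)."""
    assert len(set(groups)) == 2, "the erfc shortcut below assumes df = 1"
    n = len(values)
    sizes = Counter(groups)
    rank_sums = Counter()
    for r, g in zip(midranks(values), groups):
        rank_sums[g] += r
    h = 12 / (n * (n + 1)) * sum(rank_sums[g] ** 2 / sizes[g] for g in sizes) - 3 * (n + 1)
    tie_correction = 1 - sum(t ** 3 - t for t in Counter(values).values()) / (n ** 3 - n)
    h /= tie_correction
    return h, math.erfc(math.sqrt(h / 2))  # chi-square upper tail for df = 1

# First group is just (2, 2): the third observation is absent.
_, p_without = kruskal_wallis_2groups([2, 2, 10, 9, 8, 11, 12], [1, 1, 2, 2, 2, 2, 2])
# First group is (2, 2, 35): the extra observation has been added.
_, p_with = kruskal_wallis_2groups([2, 2, 35, 10, 9, 8, 11, 12], [1, 1, 1, 2, 2, 2, 2, 2])
print(round(p_without, 4), round(p_with, 4))
```

Adding a point changes both the sample size and the rank sums, so the two p-values differ even though the size of the added value, once it exceeds 12, does not matter.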
