I have some confusion about the Kruskal-Wallis test. As an example, let's say
X=[2 2 35 10 9 8 11 12]; Y=[1 1 1 2 2 2 2 2];
Y is the group variable
Now when I ran the kruskalwallis test
p = kruskalwallis(X,Y,'off')
I got a p-value of around 0.4. I was assuming the Kruskal-Wallis test takes the median, so it should have been robust when I added an outlier with value 35 in the third position. Why isn't it robust to that? Is it because I have very few samples? Can anyone explain?
Best Answer
If Y is meant to be a grouping variable, the p-value in R is around 0.45:
> kruskal.test(x~y)
Kruskal-Wallis rank sum test
data: x by y
Kruskal-Wallis chi-squared = 0.5622, df = 1, p-value = 0.4534
But it makes no difference whether that 35 is set to 13 or 35 or 1300 – the p-value is exactly the same. It is clearly robust to outliers.
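To see why the size of the outlier cannot matter, it may help to look at the ranks directly, since the test only ever sees those. The following is a minimal Python sketch (standard library only; the helper name `midranks` is my own, not something from MATLAB or R) that assigns average ranks to ties and prints the rank vector for each choice of the third value.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


others = [10, 9, 8, 11, 12]  # the second group, unchanged throughout

for third in (13, 35, 1300):
    x = [2, 2, third] + others
    print(third, midranks(x))
```

In all three cases the third observation is simply the largest value and gets rank 8, so the rank vectors are identical and any statistic computed from them must be identical too.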
With continuity correction (as applied by wilcox.test in the output further down), the p-value is somewhat higher, at 0.5486.
Edit:
Here's an illustration of just how the Kruskal-Wallis p-value responds as you move the third observation around – that is, this is an empirical influence curve for the p-value as x[3] is moved (takes the various values of delta).
We see that the Kruskal-Wallis is highly insensitive to all but a small range of values for x[3] (it is constant to the left of $[1,2]$ and constant to the right of it). It's really insensitive.
The grey line is the p-value with x[3] omitted. As you see, no value for x[3] will allow the Kruskal-Wallis to attain that p-value, though making x[3]=2 comes closest.
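The flat parts of that curve can be reproduced without any statistics package. Below is a rough Python sketch, not R's or MATLAB's actual implementation: `midranks` and `kw_two_groups` are names I made up, the statistic is the usual tie-corrected H, and the p-value uses the chi-squared approximation, which for one degree of freedom reduces to `erfc(sqrt(H/2))`. It only handles the two-group case and assumes the data are not all identical.

```python
import math
from collections import Counter, defaultdict


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def kw_two_groups(x, groups):
    """Tie-corrected Kruskal-Wallis H and chi-squared p-value (two groups, df = 1)."""
    n = len(x)
    size = Counter(groups)
    if len(size) != 2:
        raise ValueError("this sketch only handles exactly two groups")
    rank_sum = defaultdict(float)
    for r, label in zip(midranks(x), groups):
        rank_sum[label] += r
    h = 12.0 / (n * (n + 1)) * sum(rank_sum[k] ** 2 / size[k] for k in size) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    ties = Counter(x)
    correction = 1.0 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error
    p = math.erfc(math.sqrt(h / 2.0))  # chi-squared upper tail for df = 1
    return h, p


G = [1, 1, 1, 2, 2, 2, 2, 2]
OTHERS = [10, 9, 8, 11, 12]

for third in (-100, 0, 1, 13, 35, 1300):
    h, p = kw_two_groups([2, 2, third] + OTHERS, G)
    print(f"x[3] = {third:>5}: H = {h:.4f}, p = {p:.4f}")
```

The rows for 13, 35 and 1300 should agree with the `kruskal.test` output above (H = 0.5622, p = 0.4534), and the rows for values below every other observation should agree with each other: the p-value can only move while x[3] is passing other data points.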
"I was assuming the Kruskal-Wallis test takes the median."
It's a rank-based ANOVA. It doesn't actually 'use' the median for anything.
The measure of location-shift that corresponds to the Wilcoxon-Mann-Whitney (and hence to the Kruskal-Wallis) is the median of pairwise differences between the samples.
> median(outer(x[y==1],x[y==2],"-"))
[1] -7
Compare:
> wilcox.test(x~y,conf.int=TRUE)
Wilcoxon rank sum test with continuity correction
data: x by y
W = 5, p-value = 0.5486
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -10 5
sample estimates:
difference in location
 -6.999992 #<-------------------------------
(I'm not sure why it doesn't have better accuracy there)
If you change the 35 to 13 or 1300, you get the same estimate of shift.
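The same check is easy to run outside R. This is a small Python sketch of the median of pairwise differences (usually called the Hodges-Lehmann estimate); the function name `pairwise_shift` is hypothetical and the code uses only the standard library.

```python
from statistics import median


def pairwise_shift(a, b):
    """Median of all differences a_i - b_j between the two samples."""
    return median([ai - bj for ai in a for bj in b])


group2 = [10, 9, 8, 11, 12]

for third in (13, 35, 1300):
    group1 = [2, 2, third]
    print(third, pairwise_shift(group1, group2))
```

Each line should report a shift of -7: the five differences involving the third observation are always the five largest of the fifteen, so how large they are never reaches the middle of the sorted list.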
Adding a whole new observation is a different matter: if the original data in the first group were just (2, 2), then adding a third observation changes the p-value. (This would be the case even if the median were the estimate of location shift.)