I would like to ask whether it is possible to perform an analysis of variance on data that are not normally distributed and have unequal variances, given that my sample size is large enough.
I have read that ANOVA can be used for normally distributed data with equal variances. The same source says that these assumptions do not necessarily need to be met if the sample size is large enough (is this statement true for both assumptions, equal variances and normality?).
An alternative to ANOVA might be Welch's ANOVA (for unequal variances), but it is said to require normality. Unfortunately, I cannot find out whether the normality assumption can be violated for Welch's ANOVA when the sample size is large enough.
Another alternative might be the Kruskal–Wallis H test, since it does not require normally distributed data, but some articles say that 'roughly' equal variances between groups are required.
The problem is that I am not sure what 'roughly' means exactly. In my case the values are whole numbers in the interval [-6, 6]. The largest difference between group standard deviations is 1, which I think is not large since the range of values is 12. If I run, for example, Levene's test for equality of variances, it gives a p-value less than 0.05. Does that mean the data have unequal variances? And can I ignore the result of the test, since equality of variances only needs to be 'roughly' met?
To conclude, I would like to know which test I can use when I have a large enough sample size, a non-normal distribution, and unequal variances (can I use one of the tests mentioned above, or is there another alternative for my scenario)?
Best Answer
Generally speaking, a one-way ANOVA is reasonably robust against non-normality as long as skewness is slight and there are no far outliers. If your observations are integers between $\pm 6$, there is no chance for far outliers, and I suppose group means of moderate-sized samples will be nearly normal.
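As a rough check on the near-normality of group means, here is a minimal simulation sketch. It assumes the scores follow a shifted binomial distribution, Binomial(12, 0.3) minus 6, which is only a stand-in for real data; the helper `skewness` and the values of `n` and `reps` are illustrative choices.

```r
# Sketch: how skewed is the mean of n = 20 integer scores in [-6, 6]?
# Assumption: scores follow a shifted binomial, Binomial(12, 0.3) - 6.
set.seed(1)
n <- 20          # observations per group
reps <- 10000    # number of simulated groups

# Sample skewness: third central moment divided by (second central moment)^1.5;
# it is 0 for a perfectly symmetric distribution.
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

raw <- matrix(rbinom(n * reps, 12, 0.3) - 6, nrow = reps)  # one simulated group per row
group_means <- rowMeans(raw)

skew_raw <- skewness(as.vector(raw))   # skewness of individual scores
skew_means <- skewness(group_means)    # skewness of the group means

c(raw = skew_raw, means = skew_means)
```

If the skewness of the group means is much closer to 0 than that of the raw scores, that is consistent with the means being approximately normal under this assumed distribution. Data with a different shape may behave differently.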
However, inequality of variances can easily give misleading results in a one-way ANOVA. So I think it is especially worthwhile to protect against effects of heteroskedasticity.
I suggest you use the version of a one-way ANOVA implemented in the `oneway.test` procedure in R. This ANOVA does not assume equal variances.
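To illustrate why unequal variances matter, here is a small simulation sketch comparing the classical F test (`var.equal = TRUE`) with the default Welch version of `oneway.test` when all group means are truly equal. The group sizes, standard deviations, normal data, and number of replications are assumptions chosen for illustration, not taken from the question.

```r
# Sketch: false-rejection rates of the classical F test vs. the Welch version
# when all group means are equal but variances and group sizes differ.
# Assumptions (illustrative only): normal data, 3 groups,
# sizes 40/20/10 paired with SDs 1/2/3 (the smallest group has the largest SD).
set.seed(1)
sizes <- c(40, 20, 10)
sds <- c(1, 2, 3)
g <- factor(rep(seq_along(sizes), times = sizes))
reps <- 2000

p_vals <- replicate(reps, {
  y <- rnorm(sum(sizes), mean = 0, sd = rep(sds, times = sizes))
  c(classical = oneway.test(y ~ g, var.equal = TRUE)$p.value,
    welch = oneway.test(y ~ g)$p.value)
})

# Proportion of rejections at the 5% level; the nominal rate is 0.05
reject_rate <- rowMeans(p_vals < 0.05)
reject_rate
```

In a setup like this, where the smallest group has the largest variance, the classical test would be expected to reject a true null hypothesis more often than the nominal 5%, while the Welch version should stay closer to it.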
Here is an example with simulated data for 4 levels of the factor (groups) and $r = 20$ replications per level. Of course, my simulated data may not imitate your data well, but you can see how `oneway.test` works.
    set.seed(2020)
    n = 20;  k = 4
    x1 = rbinom(n, 12, .3) - 6
    x2 = rbinom(n, 12, .35) - 6
    x3 = rbinom(n, 12, .4) - 6
    x4 = rbinom(n, 12, .4) - 6
    x = c(x1, x2, x3, x4)
    g = as.factor(rep(1:k, each=n))
    var(x1);  var(x2);  var(x3);  var(x4)
    [1] 2.042105
    [1] 4.642105
    [1] 3.628947
    [1] 2.515789

    boxplot(x ~ g, col="skyblue2", pch=20, horizontal=TRUE)

    stripchart(x ~ g, pch=20, method="stack")

    oneway.test(x ~ g)

            One-way analysis of means (not assuming equal variances)

    data:  x and g
    F = 4.4883, num df = 3.000, denom df = 41.779, p-value = 0.008076
There are significant differences among group means. Still avoiding the assumption of equal variances, you can use Welch two-sample t tests for post hoc comparisons, using Bonferroni (or some other method) to protect against false discovery.
There is a significant difference between Groups 1 and 3:
    t.test(x1, x3)

            Welch Two Sample t-test

    data:  x1 and x3
    t = -3.0986, df = 35.241, p-value = 0.003806
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -2.7307616 -0.5692384
    sample estimates:
    mean of x mean of y
        -2.60     -0.95
But there is no significant difference between Groups 3 and 4 (not surprising, because they were simulated from the same distribution).
    t.test(x3, x4)$p.val
    [1] 0.7881982
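The pairwise comparisons above can also be run in one call. The following is a minimal sketch using `pairwise.t.test` from base R's stats package, with the simulated data regenerated so the block runs on its own. It is one reasonable way to do this, not the only one, and Bonferroni is just one of the available adjustment methods.

```r
# Sketch: all pairwise Welch t tests with Bonferroni-adjusted p-values.
# The data are regenerated with the same seed and calls as in the example above.
set.seed(2020)
n <- 20; k <- 4
x <- c(rbinom(n, 12, 0.30) - 6, rbinom(n, 12, 0.35) - 6,
       rbinom(n, 12, 0.40) - 6, rbinom(n, 12, 0.40) - 6)
g <- factor(rep(1:k, each = n))

# pool.sd = FALSE runs a separate two-sample t.test for each pair,
# which defaults to the Welch (unequal-variance) version.
pw <- pairwise.t.test(x, g, p.adjust.method = "bonferroni", pool.sd = FALSE)
pw$p.value   # lower-triangular matrix of adjusted p-values (NA above the diagonal)
```

Each entry of `pw$p.value` is already adjusted for the six comparisons, so it can be compared directly with the chosen significance level.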