Solved – Statistical Significance with large data sets

When I was a Ph.D. student, I was trained in no uncertain terms that when we had a large number of data points, the results HAD to be, in and of themselves, SIGNIFICANT!

I am referring to the use and abuse of statistical significance in large data sets analyzed with different predictive-analytics methodologies.

I would appreciate it if anyone could offer constructive comments on this issue.

Thank you

It is certainly possible to have a very large sample and a non-significant result. This is perhaps simplest to demonstrate with categorical data. Suppose you have two groups, e.g. men and women, and one binary condition, e.g. whether the last name ends in a vowel or a consonant. Suppose you find, among 1,000,000 people, that exactly equal proportions of men and women have last names ending in vowels. Then the p value will be 1.00. In R:

vowel <- c(rep('Y', 100000), rep('N', 900000))
sex <- c(rep('F', 50000), rep('M', 50000),
         rep('F', 450000), rep('M', 450000))

table(vowel, sex)
chisq.test(vowel, sex)

What is true is that, when N is very large, even trivial differences from the exact null hypothesis will be significant. For example, in the above, if even 1% more men than women had a last name ending in a vowel, the p value would have nine leading zeros:

vowel <- c(rep('Y', 100000), rep('N', 900000))
sex <- c(rep('F', 49000), rep('M', 51000),
         rep('F', 450000), rep('M', 450000))

table(vowel, sex)
chisq.test(vowel, sex)
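To see why N matters so much, note that Pearson's chi-square statistic scales linearly with N when the cell proportions are held fixed. Below is a minimal sketch of that arithmetic — in Python with a hand-rolled helper (`chisq_2x2`, an illustrative name, not part of the R code above), and without the continuity correction that R's `chisq.test` applies by default — comparing the same imbalance at two sample sizes:

```python
def chisq_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            # Expected count under independence: row total * column total / N
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# The same imbalance in the vowel group at two sample sizes: the cell
# proportions are identical, but with 100x the data the statistic is 100x larger.
small = [[490, 510], [4500, 4500]]          # N = 10,000
large = [[49000, 51000], [450000, 450000]]  # N = 1,000,000
print(chisq_2x2(small))   # ~ 0.36  -> nowhere near significant
print(chisq_2x2(large))   # ~ 36.0  -> vanishingly small p value
```

With 1 degree of freedom, a statistic around 36 corresponds to p on the order of 2e-9, which is where the nine leading zeros come from; at a hundredth of the sample size, the identical proportions are nowhere near significant.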
