Solved – Uniform Distribution Test

I've got a data set which I assume is uniformly distributed. Say I've got N=20000 samples and four options, each with a suspected probability of p=0.25. This means that I would expect each option to show up roughly 5000 times.

How do I calculate an interval [5000 - x, 5000 + x] such that, at a given confidence level, I can say the data set is probably NOT uniformly distributed when the number of times an option shows up falls outside the interval?

EDIT
Here's some sample data:

ABCDBCDADBCDA
BDCAADBCDADBA
ADCDBDACDBDAD
CDBDACDBDACDA

Each sample is one cookie string. For each position in the cookie string, I want to determine whether a character is too rare or too common at that position. So I count, over all samples, the number of A's at position 0, then the number of B's, C's and D's. Suppose I get a count of 5 A's at position 0 where I would expect roughly 50; then the character A is too rare at position 0. That's what I want to do for each character position.

You might try assuming, as your null hypothesis, that the distribution is discrete uniform, independent of string position. Then tabulate the frequencies of each letter by position in a 4 x 13 contingency table. You can then test for non-independence with a simple chi-square test, which has (4 - 1)(13 - 1) = 36 degrees of freedom here; with n=20,000 observations in your one sample, you shouldn't have any sparse table problems. You can also eyeball this with a stacked bar chart, one 4-color ABCD bar for each string position. This is especially useful if you reject the null with the chi-square test, since it shows which positions are responsible.
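As a rough sketch of that tabulation and test, here is one way it could look in plain Python. The function names are made up for illustration, the alphabet "ABCD" is assumed from the sample strings, and the p-value uses the Wilson-Hilferty normal approximation rather than the exact chi-square tail (a routine such as scipy.stats.chi2_contingency would give the exact value). The four strings from the question are far too few for a real test; they only show the mechanics.

```python
from collections import Counter
from statistics import NormalDist

ALPHABET = "ABCD"  # assumption: the only characters that occur


def position_table(samples, alphabet=ALPHABET):
    """Return a table of counts with one row per letter, one column per position."""
    length = len(samples[0])
    per_position = [Counter() for _ in range(length)]
    for s in samples:
        for pos, ch in enumerate(s):
            per_position[pos][ch] += 1
    return [[per_position[pos][ch] for pos in range(length)] for ch in alphabet]


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def approx_p_value(stat, df):
    """Approximate upper-tail p-value (Wilson-Hilferty normal approximation)."""
    z = ((stat / df) ** (1 / 3) - (1 - 2 / (9 * df))) / (2 / (9 * df)) ** 0.5
    return 1 - NormalDist().cdf(z)


# Sample strings from the question; a real run would use all 20,000.
samples = ["ABCDBCDADBCDA", "BDCAADBCDADBA", "ADCDBDACDBDAD", "CDBDACDBDACDA"]
table = position_table(samples)
stat, df = chi_square_independence(table)
print(f"chi-square = {stat:.2f}, df = {df}, approx p = {approx_p_value(stat, df):.3f}")
```

A small p-value would suggest that the letter frequencies differ between positions; the table itself is also what you would feed into the stacked bar chart.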

Just to be sure, you might also want to check your data overall to see if it actually fits a discrete uniform distribution using a chi-square goodness of fit test. After all, the distribution of characters could be independent of position without being uniformly distributed.
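A minimal sketch of the goodness-of-fit check, assuming four equally likely letters. The function name and the example counts are invented for illustration; the constant 7.815 is the usual tabulated upper 5% point of the chi-square distribution with 3 degrees of freedom.

```python
CRITICAL_5PCT_DF3 = 7.815  # upper 5% point of chi-square with 3 degrees of freedom


def uniform_gof_statistic(counts):
    """Pearson chi-square statistic against equal expected counts."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Hypothetical overall counts of A, B, C, D (made-up numbers, not real data).
counts = [5100, 4900, 5050, 4950]
stat = uniform_gof_statistic(counts)
print(stat, "reject uniformity" if stat > CRITICAL_5PCT_DF3 else "no evidence against uniformity")
```

For these made-up counts the statistic is 5.0, which is below 7.815, so uniformity would not be rejected at the 5% level.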

If you want to estimate confidence intervals, treat the ABCD distribution as a multinomial distribution. You can estimate standard errors of the counts from the variance-covariance matrix, which has diagonal (variance) entries n*p[i]*(1 - p[i]) and off-diagonal (covariance) entries -n*p[i]*p[j].
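To tie this back to the [5000 - x, 5000 + x] interval in the question, here is a sketch that builds the covariance matrix and uses a normal approximation to a single count, so that x = z * sqrt(n*p*(1 - p)). The function names are hypothetical, and the approximation is an assumption that should be reasonable for n this large. Note that it covers one letter at one position; checking many letters and positions at once would call for a stricter confidence level.

```python
from math import sqrt
from statistics import NormalDist


def multinomial_cov(n, p):
    """Variance-covariance matrix of multinomial counts."""
    k = len(p)
    return [
        [n * p[i] * (1 - p[i]) if i == j else -n * p[i] * p[j] for j in range(k)]
        for i in range(k)
    ]


def count_interval(n, p_i, confidence=0.95):
    """Normal-approximation interval for one category's count under the null."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    x = z * sqrt(n * p_i * (1 - p_i))
    return n * p_i - x, n * p_i + x


cov = multinomial_cov(20000, [0.25] * 4)
print(cov[0][0], cov[0][1])  # variance of one count, covariance between two counts
low, high = count_interval(20000, 0.25)
print(f"[{low:.0f}, {high:.0f}]")
```

With N=20000 and p=0.25 the variance is 3750, the standard deviation is about 61.2, and the 95% interval comes out at roughly 5000 ± 120.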

A good reference for all this is Agresti's Categorical Data Analysis.

If the distribution of letters seems independent of position, you can do further exploration using the runs test. Treat the letters as ordinal data, A < B < C < D; the runs test will check for runs that are too long (so there are too few of them) or too short (so there are too many). This type of test, and many others, are described in Knuth's Seminumerical Algorithms, where he discusses tests for random number generators.

Looks like you have a lot of tabulating to do. Enjoy!
