I've got a data set which I assume is uniformly distributed. Say I've got N=20000 samples and a suspected p=0.25. This means that I would expect each option to show up roughly 5000 times.
How do I calculate an interval [5000 - x, 5000 + x] such that, when the number of times an option shows up falls outside the interval, I can say with a certain confidence that the data set is probably NOT uniformly distributed?
Here is some sample data: ABCDBCDADBCDA, BDCAADBCDADBA, ADCDBDACDBDAD, CDBDACDBDACDA. Each sample is one cookie string. For each position in the cookie string, I want to determine whether a character is too rare or too common at that position. So I count, across all samples, the number of A's at position 0, and likewise the number of B's, C's and D's. Suppose I count 5 A's at position 0 where I would expect roughly 50; then the character A is too rare at position 0. That's what I want to do for each character position.
You might try assuming, as your null hypothesis, that the distribution is discrete uniform and independent of string position. Then tabulate the frequencies of each letter by position in a 4 x 13 contingency table. You can then test for non-independence with a simple chi-square test; with n=20,000 observations in your one sample, you shouldn't have any sparse table problems. You can also eyeball this with a stacked bar chart, one 4-color ABCD bar for each string position. The chart is most useful if you reject the null with the chi-square test, since it shows which positions and letters deviate.
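As a rough sketch of the tabulation and test described above, the following uses only the Python standard library. The function names (`position_table`, `chi_square_independence`, `chi_square_critical`) are made up for illustration, and the critical value comes from the Wilson-Hilferty approximation rather than an exact chi-square table, so treat it as approximate. The four strings from the question are far too few for a real test; they only show the mechanics.

```python
from collections import Counter
from statistics import NormalDist

LETTERS = "ABCD"

def position_table(samples):
    """Return one Counter of letter counts per string position."""
    table = [Counter() for _ in range(len(samples[0]))]
    for s in samples:
        for pos, ch in enumerate(s):
            table[pos][ch] += 1
    return table

def chi_square_independence(table, letters=LETTERS):
    """Pearson chi-square statistic and degrees of freedom (letter x position)."""
    total = sum(sum(col.values()) for col in table)
    letter_totals = {ch: sum(col[ch] for col in table) for ch in letters}
    stat = 0.0
    for col in table:
        col_total = sum(col.values())
        for ch in letters:
            expected = letter_totals[ch] * col_total / total
            if expected > 0:  # skip letters that never occur at all
                stat += (col[ch] - expected) ** 2 / expected
    df = (len(letters) - 1) * (len(table) - 1)
    return stat, df

def chi_square_critical(df, alpha=0.05):
    """Approximate upper critical value (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(1 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1 - c + z * c ** 0.5) ** 3

samples = ["ABCDBCDADBCDA", "BDCAADBCDADBA", "ADCDBDACDBDAD", "CDBDACDBDACDA"]
table = position_table(samples)
stat, df = chi_square_independence(table)
print(f"chi-square = {stat:.2f}, df = {df}, "
      f"approx. 5% critical value = {chi_square_critical(df):.2f}")
```

A statistic above the critical value would suggest that the letter frequencies depend on position. If a statistics package is available, its exact chi-square distribution is preferable to the approximation used here.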
Just to be sure, you might also want to check your data overall to see if it actually fits a discrete uniform distribution using a chi-square goodness of fit test. After all, the distribution of characters could be independent of position without being uniformly distributed.
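A minimal sketch of that goodness-of-fit check, pooling all characters regardless of position. The function name `chi_square_gof_uniform` is hypothetical, and the constant 7.815 is the usual tabulated 5% critical value for 3 degrees of freedom (four letters minus one).

```python
from collections import Counter

def chi_square_gof_uniform(text, letters="ABCD"):
    """Chi-square statistic for H0: every letter is equally likely."""
    counts = Counter(text)
    n = sum(counts[ch] for ch in letters)
    expected = n / len(letters)  # same expected count for each letter
    stat = sum((counts[ch] - expected) ** 2 / expected for ch in letters)
    return stat, len(letters) - 1

CRITICAL_5_PERCENT_DF3 = 7.815  # tabulated chi-square value, df = 3

pooled = "".join(["ABCDBCDADBCDA", "BDCAADBCDADBA",
                  "ADCDBDACDBDAD", "CDBDACDBDACDA"])
stat, df = chi_square_gof_uniform(pooled)
print(f"chi-square = {stat:.2f}, df = {df}, "
      f"reject uniformity at 5%: {stat > CRITICAL_5_PERCENT_DF3}")
```

The same function can be applied to the characters at a single position to check that position on its own.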
If you want to estimate confidence intervals, treat the ABCD distribution as a multinomial distribution. You can estimate standard errors from the variance-covariance matrix, which has diagonal (variance) entries np[i](1-p[i]) and off-diagonal (covariance) entries -np[i]p[j].
A good reference for all this is Agresti's Categorical Data Analysis.
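To connect this to the interval [5000 - x, 5000 + x] in the question, here is a sketch that takes the variance np[i](1-p[i]) of a single count and applies a normal approximation, which should be reasonable for n=20000 and p=0.25. The function names are illustrative, and the covariance helper simply writes out the standard multinomial variance-covariance matrix.

```python
from math import sqrt
from statistics import NormalDist

def count_interval(n, p, confidence=0.95):
    """Two-sided interval for one category count, normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(n * p * (1 - p))  # z times the standard error
    return n * p - half_width, n * p + half_width

def multinomial_covariance(n, probs):
    """Variance-covariance matrix of multinomial counts."""
    k = len(probs)
    return [[n * probs[i] * (1 - probs[i]) if i == j
             else -n * probs[i] * probs[j]
             for j in range(k)]
            for i in range(k)]

low, high = count_interval(20000, 0.25)
print(f"95% interval: [{low:.1f}, {high:.1f}], x = {(high - low) / 2:.1f}")
```

With these numbers x comes out at roughly 120. Keep in mind that when many counts are checked at once (four letters at each of 13 positions), about one in twenty will fall outside a 95% interval by chance even if the data really are uniform.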
If the distribution of letters seems independent of position, you can do further exploration using the runs test. Treat the letters as ordinal data, A < B < C < D; the runs test will check for runs that are too long (too few runs) or too short (too many runs). This type of test, and many others, are described in Knuth's Seminumerical Algorithms (volume 2 of The Art of Computer Programming), where he discusses tests for random number generators.
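As one possible illustration, the sketch below is not Knuth's run test but the simpler Wald-Wolfowitz runs test, applied after splitting the ordered letters into a low group (A, B) and a high group (C, D). That split and the function name `runs_test` are assumptions made for the example, and the normal approximation needs a reasonably long sequence.

```python
from math import sqrt
from statistics import NormalDist

def runs_test(sequence, low="AB"):
    """Wald-Wolfowitz runs test on letters split into low/high groups."""
    bits = [ch in low for ch in sequence]
    n1 = sum(bits)            # number of low letters
    n2 = len(bits) - n1       # number of high letters
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 3:
        raise ValueError("need both groups present and at least 3 letters")
    # A new run starts wherever neighbouring letters fall in different groups.
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return runs, z, p_value

runs, z, p = runs_test("ABCDBCDADBCDA")
print(f"runs = {runs}, z = {z:.2f}, p = {p:.3f}")
```

A strongly negative z means fewer, longer runs than expected, and a strongly positive z means more, shorter runs. The input could be a single cookie string or the letters observed at one position taken in sample order.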
Looks like you have a lot of tabulating to do. Enjoy!