Solved – Does a high likelihood of duplicate samples invalidate the data or certain operations on it?

We have data from random polling within a group of people, sampled over a long period of time. Because the pool is relatively small and anonymity is crucial, there is a high likelihood that we have duplicate samples (i.e., the same person responding more than once), though we can't tell for certain because we'd expect responses to vary over time.

Does this somehow invalidate the data, or mean that certain operations on it will not be meaningful? Or is it OK to proceed as normal so long as I state this likelihood?


I'm happy to provide more detail if needed – just say what would help in a comment.

Also, I'm making this community wiki (CW). Please feel free to edit the question if there are other relevant implications of duplicate data that would be worth specifying.

You would normally assume independence of observations in your modelling.

Alternatively, if you expected correlation between observations, it would be good to model it and estimate that correlation. Here you can't do that, because you don't know which observations are likely to be correlated (i.e., which responses came from the same person). A hypothetical sketch of what that would look like follows.
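For context only, if respondent identifiers were available, the within-person correlation could be estimated directly, for example with a random-intercept mixed model. The sketch below is purely illustrative and not possible with your actual data: the column names (respondent_id, wave, response), the simulated table, and all parameter values are assumptions, not anything from the question.

```python
# Hypothetical sketch (not possible with the actual anonymous data): if
# respondent IDs were known, the within-person correlation could be modelled
# with a random intercept per person. All names and numbers are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_people, n_responses = 30, 120

# Simulate a long-format polling table: repeat respondents share a personal
# baseline, so their responses are positively correlated.
person = rng.integers(0, n_people, size=n_responses)
baseline = rng.normal(0.0, 1.0, size=n_people)
df = pd.DataFrame({
    "respondent_id": person,
    "wave": rng.integers(1, 11, size=n_responses),  # survey wave / time point
    "response": baseline[person] + rng.normal(0.0, 0.5, size=n_responses),
})

# Random-intercept model: the grouping term absorbs the within-person
# correlation instead of letting it leak into spuriously small standard errors.
result = smf.mixedlm("response ~ wave", data=df, groups=df["respondent_id"]).fit()
print(result.summary())
```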

If you assume independence when some observations are in fact positively correlated, you will underestimate the between-subject variance. This means you are more likely to find "significant" differences than statistical theory would suggest. You can think of it as appearing to have more samples than you actually do, since some are almost repeats of each other.
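To make that last point concrete, here is a small simulation of my own (not from the answer above, and all parameter values are arbitrary assumptions): two groups with the same true mean, where some respondents are sampled more than once and so contribute near-duplicate responses. Treating every response as independent in a plain t-test rejects the true null hypothesis noticeably more often than the nominal 5%.

```python
# Simulation: duplicate (correlated) respondents inflate the false positive
# rate of a test that assumes independent observations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sample_group(n_people, n_total, within_sd=0.2, between_sd=1.0):
    """Draw n_total responses from n_people; repeat respondents give
    correlated (near-duplicate) responses around their personal mean."""
    person_means = rng.normal(0.0, between_sd, size=n_people)
    who = rng.integers(0, n_people, size=n_total)  # people sampled with replacement
    return person_means[who] + rng.normal(0.0, within_sd, size=n_total)

n_sims, alpha = 5000, 0.05
false_positives = 0
for _ in range(n_sims):
    # Both groups have the same true mean (0), so any "significant"
    # difference found by the t-test is a false positive.
    a = sample_group(n_people=15, n_total=40)
    b = sample_group(n_people=15, n_total=40)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha

print(f"False positive rate: {false_positives / n_sims:.3f} (nominal {alpha})")
```

The printed rate comes out well above 0.05 because the effective number of independent observations per group is closer to the number of distinct people than to the number of responses.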
