I have a dataset of TV viewing data (channel, time, # viewers) and want to get some confidence in its quality.
What are some standard ways to do this?
Best Answer
This is a pretty broad question (it would depend on your exact definition of 'quality'), but if I were you I would start with the following:
Clean obvious errors (e.g. negative numbers of viewers).
Check that you have enough samples for all your channels.
Check the distribution of viewers by time in your dataset. You should have more data in peak hours than, say, at 3 a.m.
Search for outliers in the time series of #viewers for a given channel. You might want to flag or exclude sudden spikes caused by, for example, a breaking news announcement.
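If it helps, here is a minimal Python sketch of those four checks using only the standard library. It assumes each record is a (channel, timestamp, viewers) tuple with a `datetime` timestamp; that layout, the function names, and the `min_samples` and `threshold` parameters are my own illustrative choices, not anything from your dataset. For the outlier step I picked a modified z-score based on the median and the median absolute deviation, which is just one reasonable option among several.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import median

# Assumed record layout (hypothetical): (channel, timestamp, viewers)


def clean(rows):
    """Drop rows with obviously invalid viewer counts (missing or negative)."""
    return [r for r in rows if r[2] is not None and r[2] >= 0]


def under_sampled(rows, min_samples):
    """Return channels that have fewer than min_samples records."""
    counts = Counter(channel for channel, _, _ in rows)
    return sorted(ch for ch, n in counts.items() if n < min_samples)


def viewers_by_hour(rows):
    """Sum viewers per hour of day, to compare peak hours with the night."""
    totals = defaultdict(int)
    for _, ts, viewers in rows:
        totals[ts.hour] += viewers
    return dict(totals)


def outliers(rows, channel, threshold=3.5):
    """Flag points of one channel whose modified z-score exceeds threshold.

    The score is 0.6745 * |x - median| / MAD. If the MAD is zero
    (at least half the values are identical) nothing is flagged.
    """
    series = [(ts, v) for ch, ts, v in rows if ch == channel]
    values = [v for _, v in series]
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [(ts, v) for ts, v in series
            if 0.6745 * abs(v - med) / mad > threshold]


# Made-up sample data, only to show how the functions fit together.
rows = [
    ("A", datetime(2024, 1, 1, 20, 0), 100),
    ("A", datetime(2024, 1, 1, 20, 30), 110),
    ("A", datetime(2024, 1, 1, 21, 0), 105),
    ("A", datetime(2024, 1, 1, 21, 30), 95),
    ("A", datetime(2024, 1, 1, 22, 0), 1000),  # sudden spike
    ("A", datetime(2024, 1, 1, 3, 0), -5),     # obvious error
    ("B", datetime(2024, 1, 1, 20, 0), 50),
]

cleaned = clean(rows)
print(under_sampled(cleaned, min_samples=2))
print(viewers_by_hour(cleaned))
print(outliers(cleaned, "A"))
```

Whether a flagged spike is a data error or a real event (like that breaking news announcement) is something you would still have to judge by looking at it.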
From there, it really depends on what you want to do with the data, but those checks should give you a starting point.