Solved – standard ways to check data quality?

I have a dataset of TV viewing data (channel, time, # viewers) and want to get some confidence in its quality.

What are some standard ways to do this?

This is a pretty broad question (it would depend on your exact definition of 'quality'), but if I were you I would start with the following:

  • Clean obvious errors (e.g. negative numbers of viewers).

  • Check that you have enough samples for all your channels.

  • Check the distribution of viewers by time in your dataset. You should have more data in peak hours than, say, at 3 a.m.

  • Search for outliers in the time series of # viewers for a given channel. You might want to flag or exclude sudden spikes caused by, for example, a breaking news announcement.
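The four checks above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample rows, the `MIN_SAMPLES` threshold, the `PEAK_HOURS` window, and the `SPIKE_FACTOR` rule (flag anything above three times the channel median) are all made up for the example and would need tuning for a real dataset.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical records: (channel, hour of day 0-23, number of viewers).
rows = [
    ("news", 3, 40), ("news", 19, 900), ("news", 20, 950), ("news", 21, 5000),
    ("news", 22, 880), ("sport", 20, 1200), ("sport", 21, -5), ("film", 20, 700),
]

MIN_SAMPLES = 3             # assumed minimum number of rows per channel
PEAK_HOURS = range(18, 23)  # assumed peak window, 18:00 to 22:59
SPIKE_FACTOR = 3            # assumed rule: spike = more than 3x the channel median

# 1. Separate obvious errors such as negative viewer counts.
clean = [r for r in rows if r[2] >= 0]
rejected = [r for r in rows if r[2] < 0]

# 2. Count samples per channel and list the under-sampled channels.
counts = Counter(channel for channel, _, _ in clean)
sparse = sorted(c for c, n in counts.items() if n < MIN_SAMPLES)

# 3. Compare total viewers in peak hours against off-peak hours.
peak = sum(v for _, h, v in clean if h in PEAK_HOURS)
off_peak = sum(v for _, h, v in clean if h not in PEAK_HOURS)

# 4. Flag spikes: values far above the median of their own channel.
by_channel = defaultdict(list)
for channel, hour, viewers in clean:
    by_channel[channel].append((hour, viewers))

spikes = []
for channel, series in by_channel.items():
    med = median(v for _, v in series)
    spikes += [(channel, h, v) for h, v in series
               if med > 0 and v > SPIKE_FACTOR * med]

print("rejected rows:", rejected)
print("under-sampled channels:", sparse)
print("peak vs off-peak viewers:", peak, off_peak)
print("possible spikes:", spikes)
```

The median is used instead of the mean for the spike check because a single large value would otherwise pull the baseline up and hide itself.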

From there, it really depends on what you want to do with the data, but these checks should give you a starting point.
