Solved – How to measure how “predictable” a dataset is

Is there a way to measure how "predictable" a dataset is based on some of its inherent attributes such as its entropy level or its amount of self-similarity?
If so, how is that "predictability score" related to your confidence level of any possible prediction in the future?

Allow me to ask this question with an example:

Suppose you have the first 50 data points of a straight line, y = x (but you don't know for sure that y = x is the formula that produced them). You're asked to predict the next n data points, where n is as far into the future as you feel comfortable going.

Certainly, you would feel very confident predicting that the next few data points, or even the next 50 or more, continue in the same vein: y = x. But at some point you would ask, "Well, what are the chances that at x = 10^100 my prediction of y = 10^100 is right?" Presumably those chances are lower than the chances that y = 51 at x = 51.

So even with a straight line as the dataset, you may not know whether that straight line was a blip in an otherwise curved structure or whether it stays straight forever. You therefore have to discount your confidence as you extrapolate what you think you know further into the future.

When the data are more chaotic, of course, your confidence in being able to predict their future drops off faster.

The question is: how do you calculate how quickly your confidence diminishes to chance for a given dataset, before ever actually making a prediction? That is, how do I get the prior likelihood that I can't do better than chance Z data points into the future?

In a regression setting, this comes down to the level of observation noise and how much entropy is in the posterior distribution of the parameters of your chosen model.

In particular, the uncertainty of the model parameters will automatically result in larger variance in predictions the further away you move from the available data.
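As a rough illustration of that effect on the y = x example from the question, here is a minimal sketch of Bayesian linear regression with a Gaussian prior on the weights. The values of `noise_var` and `prior_var` are assumptions picked for illustration (noise-free data like y = x cannot tell you the noise level), and the function name `predictive_mean_std` is hypothetical.

```python
import numpy as np

def predictive_mean_std(x_train, y_train, x_new, noise_var=0.01, prior_var=10.0):
    """Posterior predictive mean and std for the model y = w0 + w1 * x + noise.

    noise_var and prior_var are assumed values, not estimated from the data.
    """
    X = np.column_stack([np.ones_like(x_train), x_train])
    # Posterior covariance of the weights: (X^T X / noise_var + I / prior_var)^-1
    S = np.linalg.inv(X.T @ X / noise_var + np.eye(2) / prior_var)
    w_mean = S @ X.T @ y_train / noise_var
    Xn = np.column_stack([np.ones_like(x_new), x_new])
    # Predictive variance = observation noise + parameter uncertainty x^T S x
    var = noise_var + np.einsum("ij,jk,ik->i", Xn, S, Xn)
    return Xn @ w_mean, np.sqrt(var)

# The first 50 points of y = x, then predictions further and further out
x = np.arange(1.0, 51.0)
y = x.copy()
x_new = np.array([51.0, 100.0, 1000.0, 1e6])
mean, std = predictive_mean_std(x, y, x_new)
for xi, m, s in zip(x_new, mean, std):
    print(f"x = {xi:>9.0f}   predicted y = {m:.3f}   predictive std = {s:.3f}")
```

The predicted mean stays on the line, but the predictive standard deviation keeps growing with distance from the data, because the uncertainty in the slope is multiplied by x. Note that this only captures uncertainty within the straight-line model; it says nothing about the chance that the line is a blip in a curved structure.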

If you are not familiar with Gaussian processes, I think you would get a lot out of studying them with this question in mind. One of the best texts on the subject is free to download here:

http://www.gaussianprocess.org/gpml/

or check out these nice videos:

http://videolectures.net/mlss09uk_rasmussen_gp/
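To make the Gaussian process suggestion a little more concrete, here is a minimal NumPy sketch of the GP posterior predictive standard deviation under a squared-exponential (RBF) kernel. The kernel choice and the values of `length_scale`, `signal_var` and `noise_var` are assumptions for illustration, and the function names are hypothetical. With the hyperparameters held fixed, this variance depends only on where the inputs are, not on the observed y values, so it can be computed before making any prediction.

```python
import numpy as np

def rbf(a, b, length_scale=10.0, signal_var=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predictive_std(x_train, x_new, noise_var=1e-2, **kernel_args):
    """Std of the latent function at x_new given noisy observations at x_train."""
    K = rbf(x_train, x_train, **kernel_args) + noise_var * np.eye(len(x_train))
    Ks = rbf(x_train, x_new, **kernel_args)
    prior_var = np.diag(rbf(x_new, x_new, **kernel_args))
    # Posterior variance: k(x*, x*) - k*^T K^-1 k*
    var = prior_var - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return np.sqrt(np.maximum(var, 0.0))  # clip tiny negatives from round-off

x_train = np.arange(1.0, 51.0)               # inputs of the 50 observed points
x_new = np.array([51.0, 60.0, 80.0, 150.0])  # increasingly far from the data
gp_std = gp_predictive_std(x_train, x_new)
for xi, s in zip(x_new, gp_std):
    print(f"x = {xi:>5.0f}   predictive std = {s:.4f}")
```

Far from the data the standard deviation returns to the prior value `sqrt(signal_var)`, which is the GP's way of saying it can no longer do better than its prior. Under these assumptions, the length scale controls how many steps ahead that happens; in practice the hyperparameters are usually fitted to the data, which is where properties of the dataset itself enter.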
