I'm creating a program that uses Twitter metrics to identify trending music artists. I have data in a histogram format that represents the frequency of Twitter @mentions of an artist over the last 30 days. I need my program to recognize a "significant" change (a value that is much greater than the rest of the sample) in the frequency of @mentions. Here is a data sample:
In this scenario, 438 is the number of significance. It's fairly obvious in this data set because it is twice as large as any other value.
I need to build an equation that recognizes this number of significance across various data sets. In my first attempt at writing this equation, I will calculate the average of the 30 values and compare the average to the highest value in the data set. After doing some sampling, I will come up with a percentage difference that justifies significance.
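The comparison described above could be sketched roughly as follows. This is only an illustration: the function names `max_to_mean_ratio` and `is_significant`, the default `threshold` of 3.0, and the counts in the usage lines are all made up, since the actual cutoff would have to come from the sampling mentioned above.

```python
def max_to_mean_ratio(counts):
    """Return the largest value divided by the mean of all values."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0  # no mentions at all, so nothing can stand out
    return max(counts) / mean


def is_significant(counts, threshold=3.0):
    """Flag a data set whose peak is at least `threshold` times its mean.

    `threshold` is a hypothetical placeholder, not a recommended value.
    """
    return max_to_mean_ratio(counts) >= threshold


# Made-up counts for illustration only (not the real 30-day sample).
example_counts = [10, 10, 10, 10, 60]
print(max_to_mean_ratio(example_counts))
print(is_significant(example_counts, threshold=2.5))
```

Note that the peak itself is included in the mean here, which pulls the ratio down when the spike is very large.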
This is a simple approach, but I'm worried it's too subjective. Are there any other ways of doing this?
There is no universal definition of an outlier. One criterion amongst others is what I call the "boxplot criterion". It is in fact the criterion that is frequently used to compute the length of the whiskers in a boxplot. This criterion has been put forward in Tukey's book on Exploratory Data Analysis.
Anything outside the following interval could be considered as an outlier:
$$ [Q_1 - 1.5\,(Q_3 - Q_1),\; Q_3 + 1.5\,(Q_3 - Q_1)] $$
where $Q_1$ is the first quartile and $Q_3$ is the third quartile of the variable of interest. If you apply this rule to the data above, you will see that the value 438 comes out as an outlier.
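As a minimal sketch of the interval above, assuming Python 3.8+ and its standard `statistics.quantiles` function: the helper name `tukey_outliers`, the parameter `k`, and the sample counts are illustrative assumptions. Quartile conventions differ between tools, so the fences may shift slightly depending on the method chosen.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points.
    # method="inclusive" is one convention among several.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up counts for illustration only (not the data from the question).
example_counts = [10, 12, 11, 13, 12, 14, 11, 100]
print(tukey_outliers(example_counts))
```

Because quartiles are barely affected by a single extreme value, this rule is less sensitive to the spike itself than a comparison against the mean.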
For what it's worth, the hardest part is not the detection of an outlier (once you have agreed on a definition), but what you should do about it. You can have a look at the questions under the outlier tag.