Solved – Calculating the optimal number of bins for severely skewed data

I have a data set of over three million numeric values. Close to 20% of them are either 0 or 1, while the maximum is nearly 18500, so the data is clearly quite heavily positively skewed.

I am trying to categorize some of this data by putting it into bins of equal width, so I decided to try to find the optimal number of bins. The Freedman-Diaconis rule gave me a value of 126044.0262335108, which is clearly a ridiculously large number of bins for the data.
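For reference, this is roughly how I computed it with NumPy. The Pareto sample below is only a hypothetical stand-in for my data (I obviously can't post three million values), but it reproduces the problem: a small IQR combined with a huge range blows up the bin count.

```python
import numpy as np

# Hypothetical heavily right-skewed sample standing in for my data.
rng = np.random.default_rng(0)
data = rng.pareto(1.5, size=3_000_000)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25
width = 2 * iqr / len(data) ** (1 / 3)

# Number of equal-width bins needed to cover the full range
n_bins = (data.max() - data.min()) / width
print(n_bins)  # enormous, because the range dwarfs the IQR
```

NumPy's `np.histogram_bin_edges(data, bins='fd')` applies the same rule internally, with the same result.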

Splitting the set at the deciles also proved fruitless, giving me boundaries of [0, 1, 1, 2, 3, 5, 8, 17, 47].

Reading elsewhere, I saw the square root of the sample size suggested; this gives 1732.05081, which is more reasonable. However, the method is quite crude.
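(As a sanity check, the quoted value corresponds to a sample size of exactly 3,000,000, which is an assumption on my part since the actual count is a bit over three million:)

```python
import math

# Square-root choice: number of bins k = sqrt(n)
n = 3_000_000  # assumed round sample size
k = math.sqrt(n)
print(k)
```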

I also looked into Doane's formula, given here. But reading up on this method, it seems to have been based on an incorrect hypothesis.
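For completeness, here is a sketch of Doane's formula as I understand it, k = 1 + log2(n) + log2(1 + |g1|/sigma_g1), where g1 is the sample skewness; the `doane_bins` helper and the test sample are my own, not from the linked page.

```python
import numpy as np

def doane_bins(x):
    """Doane's rule: k = 1 + log2(n) + log2(1 + |g1| / sigma_g1),
    where g1 is the sample skewness of x."""
    n = len(x)
    g1 = np.mean((x - x.mean()) ** 3) / x.std() ** 3   # sample skewness
    sigma_g1 = np.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + abs(g1) / sigma_g1)))

# Hypothetical skewed sample standing in for my data
rng = np.random.default_rng(0)
skewed = rng.pareto(1.5, size=100_000)
print(doane_bins(skewed))
```

The extra log2(1 + |g1|/sigma_g1) term is what adapts the count to skew: for symmetric data it is near zero and the rule collapses to Sturges' formula.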

How should I deal with this level of skew in the data?

What is the best way to categorize this data?
