I have a data set of just over three million numeric values. Close to 20% are either 0 or 1, and the maximum is nearly 18500, so the data are clearly heavily positively skewed.
I am trying to categorize this data by putting it into bins of equal width, so I tried to find the optimal number of bins. The Freedman-Diaconis rule gave me a value of 126044.0262335108, which is clearly a ridiculously large number of bins for this data.
Breaking the set up by deciles also proved fruitless, giving me the cut points [0, 1, 1, 2, 3, 5, 8, 17, 47].
Elsewhere I read the suggestion to use the square root of the sample size, which gave 1732.05081. That is more reasonable, but the method is quite crude.
I also looked into Doane's formula, given here, but from what I have read, this method seems to have been based on an incorrect hypothesis.
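For reference, here is a minimal sketch of how the two bin counts mentioned so far can be computed. The function names and the synthetic lognormal sample are my own illustrative assumptions, not the actual data. The Freedman-Diaconis rule sets the bin width to 2·IQR/n^(1/3), so a small IQR combined with a very large range produces a huge number of bins.

```python
import numpy as np

def freedman_diaconis_bins(x):
    """Number of equal-width bins implied by the Freedman-Diaconis rule."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    iqr = q75 - q25
    if iqr == 0:
        raise ValueError("IQR is zero, so the rule gives no usable bin width")
    width = 2.0 * iqr / x.size ** (1.0 / 3.0)  # bin width h = 2 * IQR / n^(1/3)
    return int(np.ceil((x.max() - x.min()) / width))

def sqrt_bins(x):
    """Number of bins from the square-root rule: ceil(sqrt(n))."""
    return int(np.ceil(np.sqrt(np.asarray(x).size)))

# Hypothetical heavily right-skewed sample standing in for the real data.
rng = np.random.default_rng(0)
sample = np.floor(rng.lognormal(mean=1.0, sigma=1.5, size=100_000))

print("Freedman-Diaconis bins:", freedman_diaconis_bins(sample))
print("Square-root bins:", sqrt_bins(sample))
```

NumPy also exposes these rules through `numpy.histogram_bin_edges(x, bins="fd")` and `bins="sqrt"`, which may be more convenient than hand-rolled versions.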
How should I deal with this level of skew in the data?
What is the best way to categorize this data?
Assuming that your goal is to visualise your data, no equal-width binning on the original scale will let you appreciate both the distribution in the range 0-47 and the remaining cases up to 18500. Even if you fit the 0-47 range into a single cm of paper, the maximum (18500) will lie over 3 meters away. If you have a very large surface, like a corridor wall, you can follow Nick Cox's comment and draw spikes at every value to convey a sense of the real scale of your data; in fact, astronomers and science museums often do this to convey the real scale of the Solar System.
If your histogram needs to fit on a piece of paper or a screen, I would transform the variable, as PtrZlnk suggests in his answer. The transformation log10(X + k), with a small constant k > 0 so that zeros are defined, could work; a square root, cube root, or asinh may work too.
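The scale argument can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the question (47 as the 90th percentile, 18500 as the approximate maximum); the variable names are illustrative.

```python
# Allot 1 cm of paper to the range 0-47, then see where the maximum lands.
bulk_max = 47        # upper end of the range holding 90% of the data
data_max = 18500     # approximate maximum from the question
cm_for_bulk = 1.0    # paper width given to the 0-47 range

cm_per_unit = cm_for_bulk / bulk_max
max_position_m = data_max * cm_per_unit / 100.0  # convert cm to metres

print(f"Maximum lies about {max_position_m:.2f} m from the origin")
```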
Once your variable is transformed, you could apply any standard binning strategy to the transformed variable, but don't forget to label your axes in the original variable.
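As a rough sketch of that workflow, assuming a log10(X + k) transform with k = 1: bin on the transformed scale, then map the bin edges back so they can be reported in the original units. The function name `log_binned_counts` and its defaults are hypothetical choices for illustration.

```python
import numpy as np

def log_binned_counts(x, n_bins=30, k=1.0):
    """Equal-width bins on log10(x + k), with edges returned in original units."""
    x = np.asarray(x, dtype=float)
    t = np.log10(x + k)                          # transformed variable
    edges_t = np.linspace(t.min(), t.max(), n_bins + 1)
    counts, _ = np.histogram(t, bins=edges_t)    # equal width on the log scale
    edges_x = 10.0 ** edges_t - k                # back-transform for labelling
    return counts, edges_x

# Small made-up example: five values spanning three orders of magnitude.
counts, edges_x = log_binned_counts([0, 1, 20, 200, 999], n_bins=3, k=1.0)
print(counts)
print(np.round(edges_x, 2))
```

The bins are equal in width only on the transformed scale; in original units they widen as the values grow, which is why the back-transformed edges are the ones to use for tick labels.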