Solved – When to use equal-frequency binning and when to use equal-width binning

When transforming numerical variables into categorical variables, I'm not sure when I should use equal-frequency binning and when I should use equal-width binning. It seems that each has its own advantages, but I can't tell them apart.

Binning is something I would rarely do myself on data. Many algorithms bin continuous data for performance (XGBoost, LightGBM, …), but the way they bin to build histograms is not as trivial as equal width or equal frequency.

In general, however, equal-width binning is better for graphical representations (histograms) and is more intuitive, but it can run into problems if the data is unevenly distributed, sparse, or has outliers, because you end up with many empty, useless bins. Equal-frequency binning instead guarantees that every bin contains roughly the same amount of data, which is usually preferable if you then have to feed the data into a model or algorithm, since the bins are more meaningful in representing the underlying distribution.
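As a minimal sketch of the difference (assuming pandas, where `cut` does equal-width binning and `qcut` does equal-frequency binning; the lognormal sample here is just an illustrative skewed dataset, not from the question):

```python
import numpy as np
import pandas as pd

# Skewed sample data: most values are small, a few are large outliers.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1_000))

# Equal-width binning: pd.cut splits the range [min, max] into 5 equally wide intervals.
# With skewed data, most observations land in the first bin and the upper bins stay nearly empty.
equal_width = pd.cut(values, bins=5)
print(equal_width.value_counts().sort_index())

# Equal-frequency binning: pd.qcut splits at the quantiles,
# so each of the 5 bins holds roughly 200 observations.
equal_freq = pd.qcut(values, q=5)
print(equal_freq.value_counts().sort_index())
```

Printing the counts per bin makes the trade-off visible: the equal-width bins vary wildly in size, while the equal-frequency bins are balanced but have very uneven widths.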
