Solved – How to detect the number of distributions in a set of data

I'm trying to detect subpopulations in a data set. The problem is that the data may contain more than one subpopulation, and the subpopulations may not be normally distributed. For example, the data could look like this:

[Image: histogram of the example data]

The Y axis is the count and the X axis is the value. What I want is to determine the number of subpopulations and their peak values. The programming language I'm most familiar with is Python, but my attempts so far have not worked well. The difficulty is that in the example I showed, one subpopulation has two peaks. My method looks for peaks and valleys, so in this example it would identify three subpopulations even though there are actually two. I'm new to statistics, so I was wondering if anyone here could suggest a way to detect subpopulations in data.
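For illustration, here is a minimal sketch of the kind of peak counting described above. It is not the original code: the function name `local_peaks` and the sample `counts` are made up for this example. It shows how a subpopulation with two peaks inflates the count.

```python
def local_peaks(counts, min_height=1):
    """Return indices of strict local maxima in a list of histogram bin counts.

    Hypothetical helper for illustration only. Flat-topped peaks (plateaus)
    are not detected because the comparison on both sides is strict.
    """
    peaks = []
    for i in range(1, len(counts) - 1):
        is_peak = counts[i - 1] < counts[i] > counts[i + 1]
        if is_peak and counts[i] >= min_height:
            peaks.append(i)
    return peaks


# Made-up bin counts: the first cluster has two bumps (bins 2 and 4),
# the second cluster has one (bin 9).
counts = [1, 5, 9, 6, 8, 4, 1, 0, 2, 7, 3, 1]
peaks = local_peaks(counts)
print(peaks)  # three peaks are reported although there are only two clusters
```

Counting local maxima treats every bump as its own group, which is why a method based purely on peaks and valleys reports three subpopulations here.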

I have successfully detected level shifts in series like this using Tsay's procedures (http://www.unc.edu/~jbhill/tsay.pdf). Although his procedures were originally aimed at time series, they are general enough that when the ARIMA structure is (0,0,0)(0,0,0) they detect the significant level shift(s) in the mean. The concept here is akin to cluster analysis on a single dimension (characteristic). If you post your data, I will give it a try. I don't know about detecting peaks, but they might show up as pulses (outliers).

EDIT: 5 minutes after receipt of the 6851 ordered values.

Following is a plot of the ordered values [image: plot of the ordered values], with 3 points in time denoted as "L" for Level Shift. The equation developed gives us a clue as to the groupings [image: fitted equation]. It contains ARIMA structure reflecting the orderliness of the data (low to high) and is just a filter that allows us to detect the structural breaks. Remember, all models are wrong, but this model (equation) seems useful. Does it approximately match your "human judgement"?

Group 1: 1-1748; Group 2: 1749-3688; Group 3: 3689-6166; Group 4: 6167-6851

Detecting level shifts in this case amounts to detecting a change of intercept. Level-shift detection is an "on or about" test, so the break locations are approximate. It seems to have picked out the primary contrast at 3689, which suggests rerunning the analysis on just the first 3688 values and then again on just the remainder. This novel application gives some good results, but it warrants further research. Visually, there is a change in trend at 1749, suggesting that a new group has formed where the slope of the line changes.
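To make the idea of a level shift in the mean concrete, here is a rough Python sketch of a much simpler substitute: a least-squares search for the single split point that best separates a series into two segments with different means. This is not Tsay's procedure. It has no ARIMA filter and no significance test, and the function name `best_level_shift` and its `min_size` parameter are assumptions made for this example.

```python
from itertools import accumulate


def best_level_shift(values, min_size=2):
    """Return (k, gain) for the single best mean-shift split.

    values[:k] and values[k:] each get their own mean; gain is the
    reduction in the sum of squared errors compared with one overall mean.
    Illustrative only: no ARIMA filtering and no significance test.
    """
    n = len(values)
    if n < 2 * min_size:
        raise ValueError("series too short for the requested min_size")

    # Prefix sums give the sum of any leading segment in O(1).
    prefix = [0.0] + list(accumulate(float(v) for v in values))
    total = prefix[-1]

    best_k, best_gain = None, -1.0
    for k in range(min_size, n - min_size + 1):
        left_sum = prefix[k]
        right_sum = total - left_sum
        # SSE(one mean) - SSE(two means)
        #   = left_sum^2 / k + right_sum^2 / (n - k) - total^2 / n
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


# Toy series with an obvious shift after the fifth value.
series = [1.0] * 5 + [10.0] * 7
k, gain = best_level_shift(series)
print(k, gain)
```

Rerunning the same search separately on `values[:k]` and `values[k:]` mirrors the rerun suggested above (first 3688 values, then the remainder). Deciding when to stop splitting needs a proper test, which this sketch leaves out.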
