I'm trying to detect subpopulations within a data set. The problem is that the data may contain more than one subpopulation, and the subpopulations may not be normally distributed. For example, the data could look like this…
The Y axis is the count and the X axis is the value. What I want is to determine the number of subpopulations and their peak values. The programming language I'm most familiar with is Python; however, my methods have been less than successful. The difficulty is that, in the example I showed, one subpopulation has two peaks. My method looks for peaks and valleys, so in this example it would identify three subpopulations even though there are actually two. I'm new to statistics, so I was wondering if anyone here could suggest a way to detect subpopulations of data.
Best Answer
I have successfully detected a Level Shift in series like this using Tsay's procedures (http://www.unc.edu/~jbhill/tsay.pdf). His procedures, although initially directed at time series, are general enough that when the ARIMA structure is (0,0,0)(0,0,0) they will detect the significant level shift(s) in the mean. The concept here is akin to single-dimension (single-characteristic) cluster analysis. If you post your data, I will give it a try. I don't know about detecting peaks, but they might show up as Pulses (outliers).
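To make the idea of a level shift in the mean concrete, here is a minimal Python sketch. It is not Tsay's procedure: it is a much simpler stand-in (recursive binary segmentation on the squared error around the mean), and the function names and the `min_size` and `min_gain_ratio` parameters are assumptions chosen for illustration only.

```python
def best_split(values, min_size):
    """Return (index, gain, sse_all) for the split that most reduces squared error."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    sse_all = total_sq - total * total / n  # squared error around one common mean
    best_idx, best_gain = None, 0.0
    left_sum = left_sq = 0.0
    for i in range(1, n):
        v = values[i - 1]
        left_sum += v
        left_sq += v * v
        if i < min_size or n - i < min_size:
            continue  # keep both segments at least min_size long
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        gain = sse_all - sse
        if gain > best_gain:
            best_idx, best_gain = i, gain
    return best_idx, best_gain, sse_all


def level_shifts(values, min_size=5, min_gain_ratio=0.5, offset=0):
    """Recursively report 0-based indices where a new mean level starts.

    min_gain_ratio is an arbitrary illustrative threshold: a split is kept
    only if it removes at least that fraction of the segment's squared error.
    """
    if len(values) < 2 * min_size:
        return []
    idx, gain, sse_all = best_split(values, min_size)
    if idx is None or sse_all <= 0 or gain / sse_all < min_gain_ratio:
        return []
    left = level_shifts(values[:idx], min_size, min_gain_ratio, offset)
    right = level_shifts(values[idx:], min_size, min_gain_ratio, offset + idx)
    return left + [offset + idx] + right


# Synthetic example: two levels (about 1 and about 5) with small alternating noise.
series = [1.0 + 0.1 * (-1) ** i for i in range(20)]
series += [5.0 + 0.1 * (-1) ** i for i in range(20)]
print(level_shifts(series))
```

The threshold is deliberately crude; a proper procedure would test each candidate shift for statistical significance rather than use a fixed ratio.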
EDIT: 5 minutes after receipt of the 6851 ordered values.
Following is a plot of the ordered values, with 3 points in time denoted as "L" for Level Shift. The equation developed gives us a clue as to the groupings.
It contains ARIMA structure reflecting the orderliness of the data (low to high) and is just a filter that allows us to detect the structural breaks. Remember, all models are wrong, but this model (equation) seems useful. Does it approximately match your "human judgement"?
- group1: 1-1748
- group2: 1749-3688
- group3: 3689-6166
- group4: 6167-6851
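Given break positions like these, the ordered values can be cut into groups and each group summarized. The sketch below is only an illustration: `summarize_groups` is a hypothetical helper, the placeholder data stands in for the real 6851 values (which are not reproduced here), and the median is used as a rough location summary rather than as the peak of each subpopulation.

```python
import statistics


def summarize_groups(ordered_values, last_positions):
    """Split sorted values at 1-based group end positions and summarize each group."""
    groups = []
    start = 0
    for end in last_positions:
        chunk = ordered_values[start:end]
        groups.append({
            "first": start + 1,          # 1-based position of first member
            "last": end,                 # 1-based position of last member
            "count": len(chunk),
            "min": chunk[0],             # values are sorted, so ends are min/max
            "max": chunk[-1],
            "median": statistics.median(chunk),
        })
        start = end
    return groups


# Placeholder data: the integers 1..6851 stand in for the real ordered values.
ordered_values = list(range(1, 6852))
# Group end positions taken from the groupings listed above.
last_positions = [1748, 3688, 6166, 6851]

for g in summarize_groups(ordered_values, last_positions):
    print(g)
```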
Detecting level shifts in this case means detecting a change of intercept. The detection of level shifts is an on-or-about test, and it seems to have picked out the primary contrast at 3689, perhaps suggesting a rerun using just the first 3688 values and a second run using just the remainder. This innovative application provides some good results, but it seems to warrant further research. Visually, there is a change in trend at 1749, suggesting that a new group has formed, as the slope of the line has changed.