Solved – Creating clusters for binary data

I have a set of data with patients and their diseases. I would like to use hierarchical clustering or some kind of cluster analysis to make a dendrogram to see which diseases cluster together in this population. This is basically what it looks like, except with more diseases and more patients.

Moya    Hypothyroid    Hyperthyroid    Celiac
1       1              0               0
1       1              0               0
0       0              1               1
0       0              0               0
1       1              0               0
1       0              1               0
1       1              0               0
1       1              0               0
0       0              1               1
0       0              1               1

How would I go about making a dendrogram considering this is all binary data? Should I use Hierarchical clustering or UPGMA or something else?

Latent class modeling would be one approach to finding underlying, "hidden" partitions or groupings of diseases. LC is a very flexible method with two broad approaches: replications based on repeated measures across subjects vs replications based on cross-classifying a set of categorical variables with no repeated measures. Your data would fit the second type.

All LC models have 2 stages: in stage 1, a dependent or target variable is identified and a regression model is built. In stage 2, the residual (a single "latent" vector) from the stage 1 model is analyzed and partitions are created capturing the variability (or heterogeneity) in that vector — these are the "latent classes."
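To make the idea of latent classes concrete for 0/1 indicators like yours, here is a minimal sketch of one common textbook formulation: a mixture of independent Bernoulli variables fitted by EM. This is my own illustration, not the internals of any package mentioned in this answer. The function name fit_latent_classes and the choice of two classes are assumptions made for the example, and the matrix is just the ten rows from the question.

```python
import numpy as np

# Rows are patients; columns are Moya, Hypothyroid, Hyperthyroid, Celiac
# (the ten example rows from the question).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])


def fit_latent_classes(X, n_classes=2, n_iter=200, seed=0):
    """EM for a mixture of independent Bernoulli variables (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_patients, n_diseases = X.shape
    weights = np.full(n_classes, 1.0 / n_classes)  # estimated class sizes
    # P(disease present | class), started at random values away from 0 and 1
    probs = rng.uniform(0.25, 0.75, size=(n_classes, n_diseases))
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each patient
        log_lik = X @ np.log(probs).T + (1 - X) @ np.log(1 - probs).T
        log_post = np.log(weights) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class sizes and per-class disease probabilities
        weights = post.mean(axis=0)
        probs = (post.T @ X) / post.sum(axis=0)[:, None]
        probs = np.clip(probs, 1e-6, 1 - 1e-6)  # keep the logs finite
    return weights, probs, post


weights, probs, post = fit_latent_classes(X)
print("class sizes:", np.round(weights, 2))
print("P(disease | class):")
print(np.round(probs, 2))
```

Diseases that have high probability within the same class are the ones that tend to occur together. With only ten rows the result depends on the random start, so in practice you would try several seeds and several values of n_classes and compare the fits.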

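Free software that would probably work well for you is available for download. One option is an R package called poLCA, available here. Note that poLCA handles only categorical indicators, which includes binary data such as yours, and it expects each variable coded as consecutive integers starting at 1, so your 0/1 columns would need recoding to 1/2: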

http://www.jstatsoft.org/article/view/v042i10

If you have about $1,000 to spend on a commercial product, Latent Gold is available from www.statisticalinnovations.com. Having used Latent Gold for years, I'm a big fan of the product for its analytic power and range of solutions. For instance, poLCA is only useful for LC models with categorical information, whereas LG also works for true mixtures. Plus, its developers are always adding new modules; the most recent addition builds LC models using hidden Markov chains. Bear in mind that LG is not an "end-to-end" data platform, i.e., it is not good for heavy data manipulation or lifting.

Mplus is another commercially available product for this class of models with pricing similar to LG.
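If you still want the dendrogram you asked about, here is a minimal sketch using SciPy. It clusters the diseases (columns) rather than the patients, using Jaccard distance and average linkage, which is what UPGMA refers to. The choice of Jaccard distance and the cut into two groups are my assumptions for the example, not something the packages above produce, and the matrix is just the ten rows from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

diseases = ["Moya", "Hypothyroid", "Hyperthyroid", "Celiac"]

# Rows are patients, columns are diseases (the ten example rows from the question).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=bool)

# Transpose so that each row is a disease, then compute pairwise Jaccard
# distances: 1 - (patients with both) / (patients with either).
dist = pdist(X.T, metric="jaccard")

# Average linkage on the condensed distance matrix is UPGMA.
Z = linkage(dist, method="average")

# Cut the tree into two groups (an arbitrary choice for this example).
groups = fcluster(Z, t=2, criterion="maxclust")
for name, group in zip(diseases, groups):
    print(name, group)

# no_plot=True returns the tree layout without needing matplotlib;
# drop it (with matplotlib installed) to draw the dendrogram.
tree = dendrogram(Z, labels=diseases, no_plot=True)
print(tree["ivl"])  # leaf order along the dendrogram axis
```

Jaccard distance ignores patients who have neither disease, which is usually what you want for sparse presence/absence data. If joint absences should count as similarity, metric="hamming" is the simple-matching alternative.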
