I'm trying to figure out how to group elements of a binary matrix based on a given Jaccard distance cutoff. For example, suppose that I have information on the types of food carried by various grocery stores, and I want to group the stores in such a way that each store in the group is at least 30% similar to at least one other store in the same group.
The input looks like this:
> groceries food1 food2 food3 food4 food5 food6 food7 store1 1 1 0 0 0 0 0 store2 1 0 1 1 0 0 0 store3 1 0 0 1 0 0 0 store4 1 1 0 0 0 0 0 store5 0 0 0 0 1 1 1 store6 0 0 0 0 1 0 1
And I'm able to calculate the Jaccard distances like so:
> d=dist((groceries), method="binary") > d store1 store2 store3 store4 store5 store2 0.7500000 store3 0.6666667 0.3333333 store4 0.0000000 0.7500000 0.6666667 store5 1.0000000 1.0000000 1.0000000 1.0000000 store6 1.0000000 1.0000000 1.0000000 1.0000000 0.3333333
I can eyeball the distance object and see that some of the stores meet my 30% similarity cutoff (Jaccard distance <= 0.7):
store1 & store3 store1 & store4 store2 & store3 store3 & store4 store5 & store6
So I'd like there to be two groups: one made up of stores 1, 2, 3, and 4, and another made up of stores 5 and 6. Eyeballing the distance matrix obviously doesn't scale, though – is there a way to do this automatically?
[I'm particularly interested in being able to do it in R, so if there's an R function that can do it automatically, that would be most useful.]
Best Answer
This is the components you get with single linkage clustering at cutoff 0.7, which is the exact same thing as constructing a graph with edge weight cutoff 0.7 and computing the connected components of that graph.