Solved – “Updating” hierarchical clustering

Assume there is an n x n distance matrix d, on which we can do hierarchical clustering (say, using average linkage). Call the resulting dendrogram/clustering c.

Are there any algorithms that start from the already computed c and d, and then handle incremental updates to d and their effect on c instead of recalculating everything from scratch? They need not give exactly the same results as full hierarchical clustering, but they should follow more or less the same logic.


To provide more details:
Imagine a matrix D and the hierarchical tree (not just the cluster labels) h. This matrix D is now updated to a matrix D' with new distances (no new objects, just updated distances) that are "relatively close to" the old ones. We can also look at the simpler case where just one point is updated (i.e. one column in the matrix) and all other distances stay the same. In principle one could then update the entire matrix stepwise. (The question is whether at that point you still gain anything.)

Is there a way to go from h to h' without recalculating everything from scratch? That is, can we simplify the clustering procedure by knowing the previous clustering, and the fact that the distances haven't changed much?

Adding/removing points would of course also be interesting. The case of substituting one column could basically be written as removing one column and then adding one, or the other way around…
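To make the setup concrete, here is a small sketch of the scenario in the question using SciPy (the random points, the perturbation scale, and average linkage are arbitrary choices for illustration): build a dendrogram from D, perturb one point's row/column to get D', and recompute from scratch, which is exactly the baseline the question hopes to avoid.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))           # 20 illustrative objects

D = squareform(pdist(X))               # n x n distance matrix
h = linkage(squareform(D), method="average")   # dendrogram h built from D

# Update the distances of a single point (one row/column of D),
# keeping the matrix symmetric -- the D' case from the question.
D2 = D.copy()
noise = np.abs(rng.normal(scale=0.01, size=20))
D2[0, :] += noise
D2[:, 0] += noise
np.fill_diagonal(D2, 0.0)

# The naive approach: recompute h' from scratch on D'.
h2 = linkage(squareform(D2), method="average")
```

Since the perturbation is small, h2 will usually be close to h, but `linkage` offers no way to exploit that; it always pays the full recomputation cost.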

Hierarchical clustering results are not very well updateable.

If the nearest neighbor of a new point (similarly for a disappearing point) is at height h, then you should be able to keep everything below this height.
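For single linkage this preservation is exact, since the merges are determined by the minimum spanning tree and all edges incident to the new point have length at least its nearest-neighbor distance. A small sketch demonstrating it (random data and seeds are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))           # old data
x_new = rng.normal(size=(1, 2))        # one arriving point

Z_old = linkage(pdist(X), method="single")
Z_new = linkage(pdist(np.vstack([X, x_new])), method="single")

# Height at which the new point can first merge: its nearest-neighbor distance.
h_nn = cdist(x_new, X).min()

# All merge heights strictly below h_nn are unchanged by the insertion.
old_below = np.sort(Z_old[Z_old[:, 2] < h_nn, 2])
new_below = np.sort(Z_new[Z_new[:, 2] < h_nn, 2])
```

For other linkages (average, complete, Ward) this only holds as a heuristic, because inserting a point changes cluster sizes and hence the inter-cluster distances of later merges.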

This is inherent to the design of hierarchical clustering; so if you try to find an approximation for more efficient updating (I believe I have seen such a paper), your result quality can drop to a "completely wrong result" pretty much instantly.

There is a paper on updating an MST (= single-linkage clustering) in O(n) when a new node arrives, but it won't transfer to other linkages.

Here's a simple proof:

Theorem: Updating hierarchical clustering takes at least O(n) time per update for linkages with total runtime O(n^2) (e.g. single-link) and at least O(n^2) per update for linkages with runtime O(n^3) (all other popular ones).

Proof: Otherwise, we would have a more efficient algorithm for hierarchical clustering: build the result by repeated insertion of points, which takes O(n · update cost).

So in the general case, an update will cost at least O(n^2) (assuming the worst-case O(n^3) bound for dendrogram construction is tight).

In many cases it may be better to label your old data with the cluster ids, train a classifier on those labels, and use the classifier for new points rather than trying to update the dendrogram tree.
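One way to realize this suggestion as a sketch (the cut into 3 flat clusters and the 1-nearest-neighbor classifier are arbitrary illustrative choices, not part of the original answer):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X_old = rng.normal(size=(50, 2))

# Cluster the old data once and flatten the dendrogram into labels.
Z = linkage(pdist(X_old), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# Simplest possible classifier: assign each new point the label of
# its nearest old point (1-NN), instead of updating the dendrogram.
X_new = rng.normal(size=(5, 2))
nearest = cdist(X_new, X_old).argmin(axis=1)
new_labels = labels[nearest]
```

This keeps the old dendrogram untouched; the trade-off is that new points never change the cluster structure itself, which matches the answer's point that the structure is expensive to revise anyway.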
