I'm implementing a clustering task over a precomputed distance matrix.
There are several distances I could use to build the pairwise distance matrix, and some of them are not metrics (for example, they are not symmetric).
Can anyone tell me what are the disadvantages of using a non-metric distance when clustering?
Best Answer
Full metric properties are rarely required unless you need strong theoretical results. In particular, the identity of indiscernibles, $d(x,y)=0 \Rightarrow x=y$, is not realistic on natural data: just because two observed values are identical does not imply they are the same observation. (For theoretical results, this property can easily be restored by working on equivalence classes, provided you have the triangle inequality.) The triangle inequality is mostly used for acceleration. Before using an algorithm, check whether it makes such an assumption. Usually algorithms will still work, but the results may be worse if the assumption does not hold. (There may also be cases where convergence relies on the triangle inequality.)
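As a quick sanity check before feeding a precomputed matrix to a clustering implementation, you can test symmetry and the triangle inequality directly. A sketch using NumPy, with a hypothetical toy matrix (any concrete values here are illustrative, not from the original post):

```python
import numpy as np

# Hypothetical asymmetric pairwise "distance" matrix (e.g. from a
# divergence-like measure). Diagonal is zero as usual.
D = np.array([[0.0, 1.0, 4.0],
              [1.2, 0.0, 1.5],
              [4.1, 1.4, 0.0]])

# Symmetry check: d(x, y) == d(y, x) for all pairs.
is_symmetric = bool(np.allclose(D, D.T))

# Triangle inequality check: d(i, k) <= d(i, j) + d(j, k) for all i, j, k.
# D[:, :, None] + D[None, :, :] broadcasts to shape (n, n, n) holding
# D[i, j] + D[j, k]; taking the min over j gives the tightest two-hop
# bound, which must not undercut the direct distance D[i, k].
two_hop = (D[:, :, None] + D[None, :, :]).min(axis=1)
satisfies_triangle = bool(np.all(D <= two_hop + 1e-9))

print(is_symmetric, satisfies_triangle)  # this toy matrix fails both
```

Here `D[0, 2] = 4.0` exceeds `D[0, 1] + D[1, 2] = 2.5`, so the triangle inequality is violated; an accelerated algorithm that prunes candidates using that inequality could silently skip the true nearest neighbor for such data.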
Asymmetry is much harder to deal with. Few implementations will accept an asymmetric matrix, even though a number of algorithms could in principle support it (e.g. DBSCAN). A lot of code assumes symmetry in many places…
Consider this simple but easy-to-understand transformation: $$s(x,y) := \min\{\, d(x,y),\ d(y,x) \,\}$$ which restores symmetry.
Similar Posts:
- Solved – How to choose the right distance matrix for clustering
- Solved – A valid distance metric for high dimensional data
- Solved – How to get a valid distance metric
- Solved – Are Mutual Information and Kullback–Leibler divergence equivalent