Solved – Any suggestion for clustering method for unknown number of clusters and non-Euclidean distance

I need suggestions for a clustering (unsupervised classification) method for a consulting project. I am looking for a method that ideally has the following properties:

  1. The subject of my study has three properties. One is represented by
    a (non-Euclidean) distance matrix and the other two are in the form
    of vectors in Euclidean space. The distance matrix comes from
    sequences and can be expressed as percent dissimilarity or as
    another measure of distance between sequences. The algorithm should
    be able to take both vectors in Euclidean space and non-Euclidean
    distances as input. For example, K-medoids can work with a distance
    matrix but K-means cannot.

  2. I would like the algorithm to select the number of clusters and the
    weights for the three properties automatically (subject to prior
    knowledge and constraints).

  3. I have information about previously identified “centers of clusters”.
    I would like to incorporate it as a prior or as initial values.

  4. As a statistician, I would prefer the method to have a clear
    likelihood or loss function.
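
To make point 1 concrete, here is a minimal sketch of how a distance matrix and two sets of Euclidean vectors could be merged into one weighted dissimilarity and clustered with K-medoids, started from previously known centers (point 3). Everything in it is an assumption for illustration: the helper names `combined_distance` and `k_medoids`, the rescaling of each part to a maximum of 1, the fixed weights, and the toy data. It uses the simple alternating (Voronoi-iteration) variant of K-medoids rather than PAM, and it does not choose the number of clusters or the weights automatically, so it does not address point 2. The loss being decreased is the sum of distances from each observation to its assigned medoid.

```python
import numpy as np

def combined_distance(D_seq, X1, X2, weights):
    """Weighted sum of a given distance matrix and two Euclidean distance matrices."""
    def pairwise(X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    parts = [np.asarray(D_seq, dtype=float), pairwise(X1), pairwise(X2)]
    # Assumption: rescale each part to a maximum of 1 so the weights are comparable.
    parts = [P / P.max() if P.max() > 0 else P for P in parts]
    return sum(w * P for w, P in zip(weights, parts))

def k_medoids(D, init_medoids, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix D."""
    medoids = np.array(init_medoids)
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for k in range(len(medoids)):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the old medoid if a cluster becomes empty
            # Update step: the member with the smallest total distance to the others.
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy data (hypothetical): observations 0-2 form one group, 3-5 another.
D_seq = np.full((6, 6), 0.9)
D_seq[:3, :3] = 0.1
D_seq[3:, 3:] = 0.1
np.fill_diagonal(D_seq, 0.0)
X1 = np.array([[0.0], [0.2], [0.1], [5.0], [5.1], [5.3]])
X2 = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [3.0, 3.0], [3.1, 3.0], [3.0, 3.2]])

D = combined_distance(D_seq, X1, X2, weights=(0.5, 0.25, 0.25))
# Observations 0 and 3 play the role of previously identified centers.
medoids, labels = k_medoids(D, init_medoids=[0, 3])
```

Because the medoids are always actual observations, nothing here requires coordinates for the sequence-based property; only the pairwise distances are used.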

The closest thing I can think of is fitting a mixture model in a Bayesian framework, using reversible jump MCMC to determine the number of clusters. The vectors in R^d can easily be given a normal likelihood, but how to deal with the distance matrix is unclear to me. I could restrict the mean of each normal component to lie at one of the observations to get the MCMC running, but that does not have a clear mathematical or statistical meaning.

Does anyone have experience with a similar problem? Pointers to references would be highly appreciated!

I think that using a MAP/Bayesian criterion in combination with a mixture of Gaussians is a sensible choice. Points

You will of course object that MOGs require Euclidean input data. The answer is to find a set of points that give rise to the distance matrix you are given. An example technique for this is multidimensional scaling: $\text{argmin}_{\lbrace x_i \rbrace} \sum_{i, j}(\|x_i - x_j\|_2 - D_{ij})^2$ where $D_{ij}$ is the distance of point $i$ to point $j$.
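
As a rough sketch of the embedding step, the snippet below uses classical (Torgerson) MDS, which has a closed-form solution via an eigendecomposition. Note the swap: classical MDS minimizes a related "strain" criterion rather than the stress objective written above, though the two agree when the distances are exactly Euclidean. The function names `classical_mds` and `stress`, the choice of two dimensions, and the toy data are assumptions for illustration only.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points in R^n_components from a symmetric distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances: B = -1/2 * J D^2 J.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues indicate non-Euclidean structure; clip them to zero.
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

def stress(X, D):
    """The objective from the text: sum over i, j of (||x_i - x_j||_2 - D_ij)^2."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return float(((dist - D) ** 2).sum())

# Toy check (hypothetical data): distances computed from known 2-D points.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 2))
D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))

X = classical_mds(D, n_components=2)
final_stress = stress(X, D)  # close to zero when D is exactly Euclidean
```

For a genuinely non-Euclidean matrix such as sequence dissimilarities, the stress will generally not reach zero, and the size of the clipped negative eigenvalues gives a feel for how much is lost. The resulting coordinates could then be concatenated with the two Euclidean properties and passed to a Gaussian mixture fit, with the previously identified centers used as initial means.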
