I need some suggestions for a clustering (unsupervised classification) method for a consulting project. I am looking for a method that ideally has the following properties:

- The subjects of my study have three properties. One is represented by a (non-Euclidean) distance matrix, and the other two are vectors in Euclidean space. The distance matrix comes from sequences and can be expressed as percent dissimilarity or another measure of distance between sequences. The algorithm should be able to take both vectors in Euclidean space and non-Euclidean distances as input. For example, K-medoids can work with a distance matrix, but K-means cannot.
- I would like the algorithm to select the number of clusters and the weights for the three properties automatically (subject to prior knowledge and constraints).
- I have information on previously identified “centers of clusters”. I would like to incorporate it as a prior or as initial values.
- As a statistician, I would prefer the method to have a clear likelihood or loss function.

The closest thing I can think of is fitting a mixture model in a Bayesian framework, using reversible jump MCMC to determine the number of clusters. The vectors in R^d can easily be formulated into a normal likelihood, but how to deal with the distance matrix is unclear to me. I can restrict the mean of the normal likelihood to be at one of the observations to get the MCMC running, but that does not have a clear mathematical or statistical meaning.

Does anyone have experience with a similar problem? Suggestions for references would be highly appreciated!


#### Best Answer

I think that using a MAP/Bayesian criterion in combination with a mixture of Gaussians (MOG) is a sensible choice.

You will of course object that **MOGs require Euclidean input data**. The answer is to find a set of points that give rise to the distance matrix you are given. An example technique for this is multidimensional scaling: $\operatorname{argmin}_{\lbrace x_i \rbrace} \sum_{i, j}(\|x_i - x_j\|_2 - D_{ij})^2$, where $D_{ij}$ is the distance of point $i$ to point $j$.
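To make the multidimensional scaling step concrete, here is a minimal sketch that minimizes the stress objective above by plain gradient descent, using only the Python standard library. The function names `stress` and `mds_embed`, the step size, the iteration count, and the toy matrix `toy_D` are all assumptions made for illustration; they are not from the question. For a symmetric $D$, summing over pairs $i < j$ gives the same minimizer as summing over all $i, j$.

```python
import math
import random


def stress(points, dist):
    """Raw stress: sum over pairs i < j of (||x_i - x_j||_2 - D_ij)^2."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(points[i], points[j]) - dist[i][j]) ** 2
    return total


def mds_embed(dist, dim=2, steps=3000, lr=0.02, seed=0):
    """Find points whose Euclidean distances approximate `dist`.

    Gradient descent from a random start; it can stop in a local
    minimum, so this is an illustration rather than a robust solver.
    """
    rng = random.Random(seed)
    n = len(dist)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(points[i], points[j])
                if d < 1e-12:
                    continue  # coincident points: direction is undefined
                # d/dx_i of (d - D_ij)^2 = 2 (d - D_ij) (x_i - x_j) / d
                coeff = 2.0 * (d - dist[i][j]) / d
                for k in range(dim):
                    grads[i][k] += coeff * (points[i][k] - points[j][k])
        for i in range(n):
            for k in range(dim):
                points[i][k] -= lr * grads[i][k]
    return points


# Hypothetical sequence dissimilarities for four items (two tight pairs).
toy_D = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
]

embedding = mds_embed(toy_D)
print("stress before:", stress(mds_embed(toy_D, steps=0), toy_D))
print("stress after: ", stress(embedding, toy_D))
```

The resulting coordinates could then be concatenated with the two Euclidean feature vectors and passed to a Gaussian mixture model. How to weight the three blocks relative to each other is left open here, as it is in the question.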
