I need some suggestions for a clustering (unsupervised classification) method for a consulting project. I am looking for a method that ideally has the following properties:
- The subject of my study has three properties. One is represented by a (non-Euclidean) distance matrix and the other two are vectors in Euclidean space. The distance matrix comes from sequences and can be expressed as percent dissimilarity or some other measure of sequence distance. The algorithm should be able to take both vectors in Euclidean space and non-Euclidean distances as input. For example, K-medoids can work with a distance matrix but K-means cannot.
- I would like the algorithm to select the number of clusters and the weights for the three properties automatically (subject to prior knowledge and constraints).
- I have information on previously identified "centers of clusters". I would like to incorporate it as a prior or as initial values.
- As a statistician, I would prefer the method to have a clear likelihood or loss function.
The closest thing I can think of is fitting a mixture model in a Bayesian framework, using reversible jump MCMC to determine the number of clusters. The vectors in R^d can easily be given a normal likelihood, but how to deal with the distance matrix is unclear to me. I can restrict the mean of each normal component to be at one of the observations to get the MCMC running, but that does not have a clear mathematical or statistical meaning.
Does anyone have experience with a similar problem? Suggestions for references would be highly appreciated!
Best Answer
I think that using a MAP/Bayesian criterion in combination with a mixture of Gaussians is a sensible choice.
You will of course object that MOGs require Euclidean input data. The answer is to find a set of points that give rise to the distance matrix you are given. An example technique for this is multidimensional scaling: $\operatorname{argmin}_{\lbrace x_i \rbrace} \sum_{i, j}(\|x_i - x_j\|_2 - D_{ij})^2$ where $D_{ij}$ is the distance of point $i$ to point $j$.
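As a rough illustration of the embedding step, here is a minimal NumPy sketch. It uses classical (Torgerson) MDS, which has a closed-form solution via an eigendecomposition. That is a related technique, not the same as minimizing the stress objective above, so the `stress` helper is included only to check how well the embedding reproduces $D_{ij}$. The function names, the toy data, and the weight `w` are all assumptions made for the example.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points in R^n_components from a symmetric distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances: B = -1/2 * J D^2 J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # eigh returns eigenvalues in ascending order; keep the largest ones
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Non-Euclidean distances can give negative eigenvalues; clip them to zero
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

def stress(X, D):
    """Sum over i, j of (||x_i - x_j||_2 - D_ij)^2, the objective in the text."""
    diff = X[:, None, :] - X[None, :, :]
    E = np.sqrt((diff ** 2).sum(axis=-1))
    return float(((E - D) ** 2).sum())

# Toy data (hypothetical): 6 points whose distances are exactly Euclidean in 2-D
rng = np.random.default_rng(0)
true_pts = rng.normal(size=(6, 2))
D = np.sqrt(((true_pts[:, None, :] - true_pts[None, :, :]) ** 2).sum(axis=-1))

X_seq = classical_mds(D, n_components=2)

# Hypothetical Euclidean properties and a hypothetical weight for the embedding
other = rng.normal(size=(6, 3))
w = 1.0
features = np.hstack([w * X_seq, other])
```

The stacked `features` array could then be handed to any Gaussian mixture implementation. For a genuinely non-Euclidean distance matrix the embedding is only approximate, so it is worth inspecting `stress(X_seq, D)` and the clipped eigenvalues before trusting the result.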