I have data that look like this:

| amount | creator | accounts |
|---|---|---|
| 100 | john | cash, accounts payable |
| 325 | jane | accounts receivable, cash |
| 200 | john | tax account, accounts payable, cash |

How should these data points be clustered?

Thoughts so far:

- The popular, consensus answer seems to be to one-hot encode the categorical and multivalue categorical fields, and then scale the numeric field to [0,1]. This causes two primary problems: extremely sparse, high-dimensional data (4,000 dimensions in my case), and a numeric column that is perhaps not weighted appropriately.
- Apply a different algorithm to each data type and combine the results somehow. This could involve market-basket type analysis for the multivalue categorical field, k-modes for the categorical field, and k-means for the numeric field (or k-prototypes for the categorical and numeric fields together).

Is there any method/implementation that would allow these three types of data to be clustered without one-hot encoding the categorical and multivalue categorical fields? I have looked into self-organizing maps (SOM) as an unsupervised neural network that performs clustering, but I haven't seen evidence that they can handle multivalue categorical data.


#### Best Answer

With mixed data types, the basic answer is to use Gower's distance (see @ttnphns' thorough explainer here: Hierarchical clustering with mixed type data – what distance/similarity to use?). The gist is that you compute the distance measure of your choice for each variable individually, then average them. You can also take a weighted average of the constituent distances if you think some variables should be given more weight than others.
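Written out, the weighted average described above looks like the following. This is a sketch in notation introduced here for illustration: $d_k(i,j)$ is the normalized distance between rows $i$ and $j$ on variable $k$, $w_k$ is the weight you choose for that variable, and $p$ is the number of variables. Gower's original coefficient also has a term for skipping missing values, which is left out here.

$$d(i,j) = \frac{\sum_{k=1}^{p} w_k \, d_k(i,j)}{\sum_{k=1}^{p} w_k}, \qquad 0 \le d_k(i,j) \le 1$$

With all $w_k = 1$ this reduces to the plain average of the per-variable distances.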

For your continuous variable, the absolute difference should be fine. Simple matching is presumably fine for your categorical variable, creator. That is, a similarity of $1$ if two rows have the same creator, and $0$ otherwise (as a distance, that is reversed: $0$ for the same creator, $1$ otherwise). Then you just need to find a metric for your multivalue categorical variable. I think it is fine for you to think of this as a single variable, but I suspect it is ultimately better to think of it as a set of binary variables, one for each possible option. If an option is listed in a row, that row has a $1$ in the corresponding column, and a $0$ otherwise. Thus, you have a high-dimensional binary space. There have been *lots* of measures defined for binary data (see Choi, Cha, & Tappert, *A Survey of Binary Similarity and Distance Measures*, for a list of **76**!). You need to decide which one makes sense for your data. Each constituent distance measure gets normalized, and then you use whichever clustering algorithm you like that can work with a distance matrix instead of the raw data (see, e.g., my answer here: How to use both binary and continuous variables together in clustering?).
