I have a dataset with 15 variables. Some variables are numeric and continuous. Others are boolean (dichotomous, true/false). There is also one categorical, nominal variable.

    str(df)
    'data.frame': 30 obs. of 15 variables:
    nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
    X1  : logi FALSE TRUE FALSE TRUE TRUE FALSE ...
    X3  : logi TRUE TRUE TRUE TRUE FALSE FALSE ...
    X3  : logi TRUE FALSE FALSE FALSE TRUE FALSE ...
    X4  : logi FALSE TRUE FALSE TRUE FALSE FALSE ...
    X5  : logi TRUE FALSE FALSE FALSE FALSE TRUE ...
    X1.1: num 1.026 -0.285 -1.221 0.181 -0.139 ...
    X2.1: num -0.045 -0.785 -1.668 -0.38 0.919 ...
    X3.1: num 1.13 -1.46 0.74 1.91 -1.44 ...
    X4.1: num 0.298 0.637 -0.484 0.517 0.369 ...
    X5.1: num 1.997 0.601 -1.251 -0.611 -1.185 ...
    X6  : num 0.0597 -0.7046 -0.7172 0.8847 -1.0156 ...
    X7  : num -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
    X8  : num 0.134 0.221 1.641 -0.219 0.168 ...
    X9  : num 0.704 -0.106 -1.259 1.684 0.911 ..
    X10 : android android OS windows7 windows8...
    [...]

I would like to cluster **the variables** (not data cases) `x1, x2, ..., x9` (probably omitting the nominal `X10`) into clusters or subsets of correlated variables, for example `(x1,x2,x6),(x3,x5), ...`.

As the variables have mixed types, I think it is impossible to use `cor()`. It is also impossible to use the Gower similarity coefficient, because it measures similarity between data *cases*.

Can you help me to find an idea to process this, please? I would prefer a solution in R.


#### Best Answer

Traditional factor analysis (FA) and clustering algorithms were designed for use with continuous (i.e., Gaussian) variables. Applied to mixtures of continuous and qualitative variables, they invariably give erroneous results. In particular, in my experience, the categorical information will dominate the solution.

A better approach would be to employ a variant of finite mixture models (FMMs), which are often intended for use with mixtures of continuous and categorical information. Latent class mixture models (which are FMMs) have a huge literature built up around them. Much of that literature is focused in the field of marketing science, where these methods see wide use for, e.g., consumer segmentation, but that is not the only field where they are used.
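To make the idea concrete, here is a minimal base-R sketch of a two-class finite mixture fitted by EM, with one Gaussian and one binary indicator that are assumed independent within each class. It is only an illustration of the general technique on simulated data, not the algorithm used by any particular package; all names (`x`, `b`, `pi_k`, `mu`, `sigma`, `p`) and the simulation settings are assumptions made for the example. Note that it groups observations into latent classes.

```r
set.seed(42)

# Simulated data (hypothetical): two hidden classes, one continuous
# and one binary indicator per observation.
n <- 200
z <- rbinom(n, 1, 0.5)                        # hidden labels, used only to simulate
x <- rnorm(n, mean = ifelse(z == 1, 2, -2))   # continuous indicator
b <- rbinom(n, 1, ifelse(z == 1, 0.9, 0.1))   # binary indicator

K <- 2
# Starting values for the class weights and class-specific parameters.
pi_k  <- rep(1 / K, K)
mu    <- as.numeric(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), K)
p     <- c(0.4, 0.6)

for (iter in 1:200) {
  # E-step: posterior class probabilities, assuming the two indicators
  # are independent given the class (local independence).
  dens <- sapply(1:K, function(k) {
    pi_k[k] * dnorm(x, mu[k], sigma[k]) * dbinom(b, 1, p[k])
  })
  post <- dens / rowSums(dens)

  # M-step: weighted updates of every parameter.
  nk    <- colSums(post)
  pi_k  <- nk / n
  mu    <- colSums(post * x) / nk
  sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / nk)
  p     <- colSums(post * b) / nk
}

# Hard class assignment for each observation.
pred_class <- max.col(post)
table(pred_class)
round(rbind(weight = pi_k, mean = mu, sd = sigma, prob_true = p), 2)
```

The fixed number of iterations keeps the sketch short; a real implementation would monitor the log-likelihood for convergence and try several random starts.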

The software I know and recommend for latent class modeling is neither free nor R-based but, in terms of proprietary software, it's not *that* expensive. It's called *Latent Gold*, is sold by Statistical Innovations and costs about $1,000 for a perpetual license. If your project has a budget, it could easily be expensed. *LG* offers a wide suite of tools including FA for mixtures, clustering of mixtures, longitudinal Markov chain-based clustering, and more.

Otherwise, the only R-based freeware I know about (poLCA, https://www.jstatsoft.org/article/view/v042i10) is intended for use with multi-way contingency tables. I'm not aware that this tool can accept anything other than categorical information. There may be others. If you poke around, maybe you can find some alternatives.
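For the categorical part of such data, a latent class fit with poLCA might look like the sketch below. It assumes the package is installed, uses a simulated stand-in (`toy`) for the five logical columns rather than the real data, and picks two classes arbitrarily; those choices are illustrative only. poLCA expects manifest variables coded as integers starting at 1, so the logicals are recoded first.

```r
# Sketch only: assumes install.packages("poLCA") has been run.
library(poLCA)

set.seed(1)
n <- 30

# Hypothetical stand-in for the logical columns X1..X5 of the data frame.
toy <- data.frame(
  X1 = runif(n) < 0.5,
  X2 = runif(n) < 0.5,
  X3 = runif(n) < 0.5,
  X4 = runif(n) < 0.5,
  X5 = runif(n) < 0.5
)

# Recode FALSE/TRUE to 1/2, the integer coding poLCA requires.
recoded <- as.data.frame(lapply(toy, function(v) as.integer(v) + 1L))

# Latent class model with no covariates (~ 1) and two classes;
# nrep restarts the estimation from several random starting points.
fit <- poLCA(cbind(X1, X2, X3, X4, X5) ~ 1,
             data = recoded, nclass = 2, nrep = 5, verbose = FALSE)

table(fit$predclass)   # modal class assignment per observation
fit$bic                # can be compared across different nclass values
```

Continuous columns would have to be left out or discretized before being passed to this function, which is a real loss of information and a reason to look at mixture models that handle mixed indicators directly.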
