I am computing a distance matrix for categorical data. I am using the Jaccard distance because, as far as I understand, it is suited to this kind of data. My data contain BOTH binary and non-binary variables.
My question is: can I use the Jaccard method to compute distances for data that include BOTH binary and non-binary variables (as in Mydata in the example below) WITHOUT transforming the non-binary variables into binary ones? If the answer is no, is there an alternative, or do I have to transform every attribute into a (0,1) variable? A Jaccard implementation in R (function vegdist in package vegan) gives me results, but I am not able to reproduce them by hand when I include both the binary and the non-binary attributes.
I provide an example of the data I have:

    a <- c(1,1,0,0)
    b <- c(0,1,0,1)
    c <- c(3,2,1,0)
    Mydata <- as.data.frame(cbind(a,b,c))
    > Mydata
      a b c
    1 1 0 3
    2 1 1 2
    3 0 0 1
    4 0 1 0
where attribute c is the non-binary one, with possible values within (0,4). The R function gives me the following distance matrix for Mydata, but I am not able to reproduce it manually (for instance, the first element, 0.40, is the distance between observations 1 and 2 across the 3 attributes):

         1    2    3
    2 0.40
    3 0.75 0.75
    4 1.00 0.75 1.00
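
For reference, the values above are consistent with the quantitative (non-binary) form of the Jaccard dissimilarity, which the vegan documentation describes as 2B/(1+B), where B is the Bray-Curtis dissimilarity. The following is a minimal base R sketch under that assumption (i.e. that vegdist was called with method = "jaccard" and its default binary = FALSE, which the question does not state). The helper name jaccard_quant is hypothetical and not part of vegan.

```r
# Toy data from the question
Mydata <- data.frame(a = c(1, 1, 0, 0),
                     b = c(0, 1, 0, 1),
                     c = c(3, 2, 1, 0))

# Quantitative Jaccard between two rows, computed from Bray-Curtis (B):
#   B = sum(|x - y|) / sum(x + y),  J = 2B / (1 + B)
# jaccard_quant is a hypothetical helper name, not a vegan function.
# Note: undefined (0/0) if both rows are entirely zero.
jaccard_quant <- function(x, y) {
  bray <- sum(abs(x - y)) / sum(x + y)
  2 * bray / (1 + bray)
}

X <- as.matrix(Mydata)
n <- nrow(X)
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) D[i, j] <- jaccard_quant(X[i, ], X[j, ])
  }
}

# Lower triangle, rounded to two decimals for comparison with the matrix above
print(round(as.dist(D), 2))
```

For observations 1 and 2, B = 2/8 = 0.25, so J = 0.5/1.25 = 0.40, which matches the first element. Under this reading the counts in c are used as quantities rather than as category labels, which is why a presence/absence calculation by hand does not give the same numbers.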
Best Answer
If you are willing to treat c as a continuous variable, you can use Gower's dissimilarity coefficient on a mixture of binary and continuous data. This can sometimes be done with ordered categorical variables with no ill effects.
For your toy data, this would look like:

         obs1       obs2       obs3       obs4
    obs1 0
    obs2 .44444444  0
    obs3 .55555556  .77777778  0
    obs4 1          .55555556  .44444444  0
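
The matrix above can be reproduced by hand with a short base R sketch. It rests on two assumptions that are not spelled out in the answer: every variable (including the binary ones) is scored as an absolute difference divided by that variable's observed range in the data (so c is scaled by 3, not by its nominal maximum of 4), and the per-variable scores are averaged with equal weights. The helper name gower_pair is hypothetical. In practice, a packaged implementation such as daisy in the cluster package with metric = "gower" is one option; it is left out here so the block stays in base R.

```r
# Toy data from the question
Mydata <- data.frame(a = c(1, 1, 0, 0),
                     b = c(0, 1, 0, 1),
                     c = c(3, 2, 1, 0))

X <- as.matrix(Mydata)

# Observed range of each column, used to scale differences into [0, 1]
ranges <- apply(X, 2, function(col) max(col) - min(col))

# Gower dissimilarity between two rows: mean of range-scaled absolute differences
# gower_pair is a hypothetical helper name used only for this illustration.
gower_pair <- function(x, y) mean(abs(x - y) / ranges)

n <- nrow(X)
G <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    G[i, j] <- gower_pair(X[i, ], X[j, ])
  }
}

# Lower triangle of the dissimilarity matrix
print(round(as.dist(G), 8))
```

For observations 1 and 2 the per-variable scores are 0 (a), 1 (b) and 1/3 (c), giving (0 + 1 + 1/3)/3 = 0.4444, as in the matrix above.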