Solved – Distances for binary and non-binary categorical data

I am computing a distance matrix for categorical data. I am using the Jaccard distance because, as far as I understand, it is suited to this kind of data. My data contain both binary and non-binary variables.

My question is: can I use the Jaccard method to compute distances for data that include both binary and non-binary variables (as in Mydata in the example below) without transforming the non-binary variables into binary ones? If the answer is no, is there an alternative, or do I have to transform every attribute into a (0,1) variable? The Jaccard implementation in R (function vegdist in package vegan) gives me results, but I am not able to reproduce them by hand when I include both the binary and non-binary attributes.

Here is an example of my data:

    a <- c(1,1,0,0)
    b <- c(0,1,0,1)
    c <- c(3,2,1,0)
    Mydata <- as.data.frame(cbind(a,b,c))

    > Mydata
      a b c
    1 1 0 3
    2 1 1 2
    3 0 0 1
    4 0 1 0

where attribute c is the non-binary one, with possible values within (0,4). The R function gives me the following distance matrix for Mydata, but I am not able to reproduce it manually. (For instance, the first element, 0.40, is the distance between observations 1 and 2 across the 3 attributes.)

         1    2    3
    2 0.40
    3 0.75 0.75
    4 1.00 0.75 1.00
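The matrix above can be reproduced by hand under one assumption: that vegdist, with its default binary = FALSE, treats all three columns as quantities rather than as categories, and computes Jaccard as 2B/(1+B), where B is the Bray-Curtis dissimilarity sum(|x - y|) / sum(x + y). The base-R sketch below checks this on the toy data. jaccard_quant is a hypothetical helper name used for illustration, not a vegan function.

```r
# Toy data from the question
Mydata <- data.frame(a = c(1, 1, 0, 0),
                     b = c(0, 1, 0, 1),
                     c = c(3, 2, 1, 0))

# Hypothetical helper: quantitative Jaccard between two numeric vectors,
# written as 2B / (1 + B), with B the Bray-Curtis dissimilarity.
# Assumes the two vectors are not both all zero (B would be 0/0).
jaccard_quant <- function(x, y) {
  B <- sum(abs(x - y)) / sum(x + y)
  2 * B / (1 + B)
}

X <- as.matrix(Mydata)
n <- nrow(X)
D <- matrix(0, n, n, dimnames = list(1:n, 1:n))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # Diagonal entries stay at 0
    if (i != j) D[i, j] <- jaccard_quant(X[i, ], X[j, ])
  }
}

# Lower triangle, rounded for comparison with the matrix above
print(round(as.dist(D), 2))
```

For observations 1 and 2 this gives B = 2/8 = 0.25 and therefore 2(0.25)/(1.25) = 0.40. Read this way, c is handled as an amount, so levels 3 and 2 count as partially overlapping. If c is a nominal variable, that is probably not the intended behaviour.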

If you are willing to treat c as a continuous variable, you can use Gower's dissimilarity coefficient on a mixture of binary and continuous data. This can sometimes be done with ordered categorical variables with no ill effects.

For your toy data, this would look like:

              obs1      obs2      obs3 obs4
    obs1         0
    obs2 .44444444         0
    obs3 .55555556 .77777778         0
    obs4         1 .55555556 .44444444    0
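The values above can be checked with a minimal base-R sketch. It assumes the usual range-normalised form of Gower's coefficient, with every variable treated as numeric and weighted equally: each difference |x - y| is divided by that variable's observed range, and the results are averaged over the variables. gower_num is a hypothetical helper name. For real work, an existing implementation such as daisy(..., metric = "gower") in package cluster is likely the better choice.

```r
# Toy data from the question
Mydata <- data.frame(a = c(1, 1, 0, 0),
                     b = c(0, 1, 0, 1),
                     c = c(3, 2, 1, 0))

# Hypothetical helper: Gower dissimilarity for all-numeric columns,
# equal weights, each column scaled by its observed range.
gower_num <- function(X) {
  X <- as.matrix(X)
  # Observed range of each column (assumed non-zero for every column)
  rng <- apply(X, 2, max) - apply(X, 2, min)
  n <- nrow(X)
  labels <- paste0("obs", seq_len(n))
  D <- matrix(0, n, n, dimnames = list(labels, labels))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Mean of the range-scaled absolute differences
      D[i, j] <- mean(abs(X[i, ] - X[j, ]) / rng)
    }
  }
  D
}

G <- gower_num(Mydata)
print(round(as.dist(G), 8))
```

For observations 1 and 2 the per-variable contributions are 0 (a), 1 (b) and 1/3 (c, whose range is 3), so the dissimilarity is (0 + 1 + 1/3)/3 = 0.4444.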
