Solved – Differences in Spearman coefficient between R and pandas

I've noticed a small difference between pandas and R in how they calculate Spearman coefficients. It seems as if some rounding occurs. I see no such difference when calculating Kendall or Pearson coefficients. Does anyone know what might cause this? A simple example illustrating the difference is provided below.

#------------  R CODE --------------
x <- seq(-1,1,length=100)
y <- x^2

( cor(cbind(x,y), method="spearman") )
( cor(cbind(x,y), method="kendall") )

#------------  OUTPUT FROM R -------------
> ( cor(cbind(x,y), method="spearman") )
           x          y
x 1.00000000 0.01310547
y 0.01310547 1.00000000
> ( cor(cbind(x,y), method="kendall") )
            x           y
x 1.000000000 0.009296686
y 0.009296686 1.000000000

#------------  PYTHON --------------
import pandas as pd
import numpy as np

x = np.linspace(-1,1,100)
y = x**2

df = pd.DataFrame({"x":x, "y":y})

print(df.corr(method="spearman"))
print(df.corr(method="kendall"))

#--------  OUTPUT FROM PYTHON --------------
   x  y
x  1  0
y  0  1
          x         y
x  1.000000  0.009297
y  0.009297  1.000000

This almost certainly has to do with how the two programs determine ties. The way you have constructed your data, y[i] and y[100 - i + 1] should be equal (using R indexing), but because of the way computers handle floating point numbers, they are not. For example, in R:

> y[1] == y[100]
[1] TRUE
> y[2] == y[99]
[1] FALSE
> y[2] - y[99]
[1] -2.220446e-16
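The same effect can be checked on the numpy side. Exactly which mirrored pairs end up bitwise equal depends on how linspace rounds, but any differences that do appear are on the order of machine epsilon (~1e-16); the snippet below is just an illustrative check, not part of the original answer.

import numpy as np

x = np.linspace(-1, 1, 100)
y = x**2

# y[i] and its mirror y[99 - i] (0-based) are equal mathematically, but
# floating point rounding can leave some pairs differing by roughly one ulp.
mirror = y[::-1]
not_equal = np.flatnonzero(y != mirror)
print("pairs not bitwise equal:", len(not_equal))

# Relative differences, where any exist, are around 1e-16 -- far below the
# ~1e-7 tolerance discussed below, but enough to break exact ties.
rel_diff = np.abs(y - mirror) / np.abs(mirror)
print("largest relative difference:", rel_diff.max())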

In digging through the pandas code, it appears they use a relative tolerance of 1e-7 to determine whether two floating point numbers differ (see the function float64_are_diff in this file). So, in the pandas implementation, these elements are tied.
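To see why that tolerance turns these values into ties: the real float64_are_diff lives in pandas' Cython internals, but a comparison of that general shape behaves like the hypothetical roughly_equal sketch below, with the 1e-7 figure taken from the description above.

import numpy as np

def roughly_equal(a, b, rel_tol=1e-7):
    """Sketch of a relative-tolerance comparison: treat a and b as equal
    if they differ by less than rel_tol relative to their magnitude.
    (Illustrative only -- not the actual pandas implementation.)"""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

x = np.linspace(-1, 1, 100)
y = x**2

# y[1] and y[98] here correspond to R's y[2] and y[99]: mathematically equal,
# but possibly differing by ~1e-16 in floating point.
a, b = y[1], y[98]
print(a == b)               # possibly False under exact comparison
print(roughly_equal(a, b))  # True under a 1e-7 relative tolerance

A ranking built on that kind of comparison treats the mirrored values as tied, which is what pushes the pandas Spearman estimate to 0 here.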

Note that if you do a similar example in R using whole numbers (so the squared values are exact and the ties are genuine), it arrives at the same answer as pandas:

> x <- seq(-50,50,length=101)
> y <- x^2
> cor(x,y, method="spearman")
[1] 0
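For completeness, the analogous check in pandas (again, not from the original answer): with whole-number input, x and x**2 are exactly representable, the ties are real, and both libraries should agree on a Spearman coefficient of 0.

import numpy as np
import pandas as pd

# Whole-number grid: mirrored values of y are genuinely tied rather than
# differing by rounding error.
x = np.arange(-50, 51)
y = x**2

df = pd.DataFrame({"x": x, "y": y})
print(df.corr(method="spearman"))  # off-diagonal entries should be 0, matching R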
