I've noticed a small difference between pandas and R in how they calculate Spearman correlation coefficients. It looks as if some rounding occurs in pandas. I see no such difference when calculating Kendall or Pearson coefficients. Does anyone know what might cause this? A simple example illustrating the difference is provided below.
#------------ R CODE --------------
x <- seq(-1,1,length=100)
y <- x^2
( cor(cbind(x,y), method="spearman") )
( cor(cbind(x,y), method="kendall") )

#------------ OUTPUT FROM R -------------
> ( cor(cbind(x,y), method="spearman") )
           x          y
x 1.00000000 0.01310547
y 0.01310547 1.00000000
> ( cor(cbind(x,y), method="kendall") )
            x           y
x 1.000000000 0.009296686
y 0.009296686 1.000000000

#------ PYTHON --------------
import pandas as pd
import numpy as np
x = np.linspace(-1,1,100)
y = x**2
df = pd.DataFrame({"x":x, "y":y})
print(df.corr(method="spearman"))
print(df.corr(method="kendall"))

#-------- OUTPUT FROM PYTHON --------------
   x  y
x  1  0
y  0  1
          x         y
x  1.000000  0.009297
y  0.009297  1.000000
Best Answer
This almost certainly has to do with how the two programs determine ties. The way you have constructed your data, y[i] and y[100 - i + 1] should be equal (using R's 1-based indexing), but because of the way computers handle floating-point numbers, they are not always exactly equal. For example, in R:
> y[1] == y[100]
[1] TRUE
> y[2] == y[99]
[1] FALSE
> y[2] - y[99]
[1] -2.220446e-16
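The same check can be sketched in plain Python. This is only an illustration: the grid is built with a simple formula, so its lowest bits may differ from what R's seq or numpy's linspace produce, and the 1e-7 relative tolerance is the figure quoted in this answer, not something verified against a particular pandas release.

```python
import math

# 100 evenly spaced points on [-1, 1], as in the question
# (an assumption: this formula need not match R or numpy bit for bit).
n = 100
x = [-1 + 2 * i / (n - 1) for i in range(n)]
y = [v * v for v in x]

# Mathematically y[i] == y[n - 1 - i] (0-based), giving 50 mirrored pairs.
pairs = [(y[i], y[n - 1 - i]) for i in range(n // 2)]

# Count ties under exact comparison and under a relative tolerance.
exact_ties = sum(a == b for a, b in pairs)
tolerant_ties = sum(
    math.isclose(a, b, rel_tol=1e-7, abs_tol=0.0) for a, b in pairs
)
largest_gap = max(abs(a - b) for a, b in pairs)

print("pairs tied under exact ==    :", exact_ties, "of", len(pairs))
print("pairs tied under rel_tol=1e-7:", tolerant_ties, "of", len(pairs))
print("largest gap within a pair    :", largest_gap)
```

Any pair that is not counted as an exact tie differs only by rounding error, which is why a tolerance-based comparison treats all 50 pairs as tied.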
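In digging through the pandas code, it appears they are using a relative tolerance of 1e-7 to determine whether two floating-point numbers differ (see the function float64_are_diff in the pandas source). So, in the pandas implementation, these elements are tied. R compares the values exactly, so it ranks them as distinct, which is enough to move the Spearman coefficient slightly away from zero.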
Note that if you run a similar example in R using integer-valued data, where the squares are computed exactly and the ties are therefore genuine, R arrives at the same answer as pandas:
> x <- seq(-50,50,length=101)
> y <- x^2
> cor(x,y, method="spearman")
[1] 0
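To see how the tie rule alone can change the coefficient, here is a minimal pure-Python sketch. The helpers average_ranks and spearman are hypothetical names written for this illustration; they are not the pandas or R implementations, and the rel_tol argument merely mimics the tolerance idea described above.

```python
import math

def average_ranks(values, rel_tol=0.0):
    """Average ranks; sorted neighbours within rel_tol of the group's
    first value count as tied (rel_tol=0.0 means exact equality)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start + 1
        # Grow the tie group while the next value matches the group's first.
        while end < len(order) and math.isclose(
            values[order[end]], values[order[start]],
            rel_tol=rel_tol, abs_tol=0.0,
        ):
            end += 1
        mean_rank = (start + 1 + end) / 2  # mean of 1-based ranks start+1..end
        for k in range(start, end):
            ranks[order[k]] = mean_rank
        start = end
    return ranks

def spearman(x, y, rel_tol=0.0):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x, rel_tol), average_ranks(y, rel_tol)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Floating-point grid from the question (the formula is an assumption).
xf = [-1 + 2 * i / 99 for i in range(100)]
yf = [v * v for v in xf]
rho_exact = spearman(xf, yf)                   # ties need exact equality
rho_tolerant = spearman(xf, yf, rel_tol=1e-7)  # near-equal values are tied

# Integer data: the squares are exact, so the ties are genuine.
xi = list(range(-50, 51))
yi = [v * v for v in xi]
rho_integer = spearman(xi, yi)

print("float data, exact ties   :", rho_exact)
print("float data, tolerant ties:", rho_tolerant)
print("integer data, exact ties :", rho_integer)
```

With every mirrored pair tied, the ranks of y are symmetric around the centre while the ranks of x are antisymmetric, so the covariance of the ranks cancels to zero. Under exact comparison on the floating-point grid, any pair split by rounding breaks that symmetry, and the result depends on which pairs happen to round differently.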