Here's the code snippet:
    import pandas as pd

    df = pd.DataFrame(data=[1, 1, 2, 2, 3, 3, 3], columns=list('A'))

    # Map each value of A to a value of B (1 -> 2, 2 -> 3, 3 -> 1)
    def m(x):
        if x == 1:
            return 2
        if x == 2:
            return 3
        if x == 3:
            return 1
        return -1

    df['B'] = df['A'].map(m)
    print(df.head(n=10))

       A  B
    0  1  2
    1  1  2
    2  2  3
    3  2  3
    4  3  1
    5  3  1
    6  3  1
As we can see, column B is created by mapping the values of column A, so I expected the two columns to have a correlation of 1, but none of the results below are satisfying. Could anyone give me some idea of how to calculate the correlation of discrete data for two columns? Many thanks!
    df['A'].cov(df['B'])
    -0.47619047619047611

    df['A'].corr(df['B'], method='spearman')
    -0.68000000000000016

    df['A'].corr(df['B'], method='kendall')
    -0.50000000000000011

    df['A'].corr(df['B'])
    -0.58823529411764708
Best Answer
There is nothing wrong with your calculation. However, your mapping is not linear, so the correlation between your variables is neither 1 nor -1.
I suggest mapping 3 to 4 instead of 1 and computing the correlation again. You should then get a correlation of 1.
As a different test, mapping 1 to 3, 2 to 2, and 3 to 1 should produce a correlation of -1.
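Both suggestions can be checked quickly. The following is a minimal sketch that reuses column A from the question; the variable names (`increasing`, `decreasing`, `r_increasing`, `r_decreasing`) are my own, and the mappings are written as dictionaries instead of the function `m` purely for brevity.

```python
import pandas as pd

# Same column A as in the question.
a = pd.Series([1, 1, 2, 2, 3, 3, 3])

# First suggestion: map 3 to 4 instead of 1, which makes B = A + 1.
increasing = a.map({1: 2, 2: 3, 3: 4})

# Second suggestion: 1 -> 3, 2 -> 2, 3 -> 1, which makes B = 4 - A.
decreasing = a.map({1: 3, 2: 2, 3: 1})

# Series.corr defaults to the Pearson coefficient.
r_increasing = a.corr(increasing)
r_decreasing = a.corr(decreasing)

print(round(r_increasing, 6))
print(round(r_decreasing, 6))
```

Up to floating-point rounding, the first value should be 1 and the second -1, because both mappings are exact linear functions of A.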
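Please also note that the (Pearson) correlation only measures how strongly the variables are linearly related; Spearman and Kendall measure monotonic association instead. If the variables are related by a deterministic mapping that is neither linear nor monotonic, as in your example, the coefficient can be far from 1 or -1.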
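To see where the value of about -0.588 in the question comes from, the Pearson coefficient can be computed directly from its definition: the covariance divided by the product of the standard deviations. This is only a sketch in plain Python; the helper `pearson` is a hypothetical name of mine, not a pandas function.

```python
from fractions import Fraction
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation from its definition (hypothetical helper, not a pandas API)."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    # Sums of centered products; the 1/(n - 1) factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return float(sxy) / sqrt(float(sxx) * float(syy))

a = [1, 1, 2, 2, 3, 3, 3]
b = [2, 2, 3, 3, 1, 1, 1]  # the question's mapping: 1 -> 2, 2 -> 3, 3 -> 1

r = pearson(a, b)
print(r)
```

The result agrees with `df['A'].corr(df['B'])` from the question. The three rows with A = 3 are paired with the smallest value of B, so no straight line fits all the points and the coefficient ends up at -10/17 rather than 1.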