I am trying to calculate the Pearson correlation coefficient according to this formula over a large dataset:

Mostly, my values are between -1 and 1, but sometimes I get weird numbers like:

`1.0000000002`, `-3`

And so on. Is it possible to have weird data that would result in this, or does this mean that I have an error in calculation?

For example, I notice that sometimes my summation of X is 1, and hence the summation of X^2 is also 1. This results in a value like 1.00000002. Other times the summation of XY is 0, and the resulting calculation comes out as -3. Is this statistically possible, or is there an error in my calculations?


#### Best Answer

The formulas you're using have *long* been known to be numerically unstable. If the squared means are large compared to the variances and/or products-of-means are large compared to the covariances, then the difference in the numerator and in the bracketed terms in the denominator can have problems with catastrophic cancellation.
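For reference, the usual one-pass "computational" form of Pearson's $r$ (the question's formula did not survive, but this is presumably the one being used) is:

$$ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}} $$

The numerator and each bracketed factor in the denominator are exactly the differences that suffer cancellation.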

This can sometimes lead to calculated variances or covariances that don't even retain a single digit of precision (i.e. that are worse than useless).
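To see this concretely, here is a small Python sketch (my illustration, not from the original post) comparing the naive one-pass population variance with a stable two-pass computation on data that have a large common offset:

```python
# Data with a large common offset and a small spread: the classic
# trap for the one-pass "computational" variance formula.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
n = len(data)

# Naive one-pass (population) variance: (sum x^2 - (sum x)^2 / n) / n.
# The two huge terms agree in their leading digits, so the subtraction
# destroys essentially all precision (it can even come out negative).
s = sum(data)
ss = sum(x * x for x in data)
naive_var = (ss - s * s / n) / n

# Stable two-pass variance: center on the mean first, then square.
mean = s / n
twopass_var = sum((x - mean) ** 2 for x in data) / n

# The true population variance of (4, 7, 13, 16) is 22.5, and a
# constant offset does not change the variance; only the two-pass
# version recovers that value here.
```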

Don't use these formulas. They made some sense when people calculated *by hand*, where you could see, and deal with, such loss of precision as it happened — e.g. use of these formulas was normally preceded by eliminating the common digits, so numbers like this:

`8901234.567...`, `8901234.575...`, `8901234.412...`

would first have 8901234 subtracted (at least) — which would save a lot of time in the working as well as avoid the cancellation issue. Means (and similar quantities) would then be adjusted back at the end, while variances and covariances could be used as-is.

Similar ideas (and other ideas) can be used with computers, but really you need to use them all the time, rather than trying to guess when you might need them.
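As a sketch of how the hand-calculation trick carries over to a computer (again my illustration): shift by a provisional value, such as the first observation, before accumulating, so the running sums stay small and the final subtraction is benign.

```python
def shifted_onepass_var(xs):
    """One-pass population variance of a non-empty sequence, accumulating
    relative to a shift (the first value) so the running sums stay small
    and the final subtraction loses little precision."""
    it = iter(xs)
    shift = next(it)           # any value near the data works as the shift
    n, s, ss = 1, 0.0, 0.0     # the first point contributes d = 0
    for x in it:
        d = x - shift
        n += 1
        s += d
        ss += d * d
    return (ss - s * s / n) / n   # variance is unchanged by the shift

data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
print(shifted_onepass_var(data))  # → 22.5, despite the 1e9 offset
```

Note that only the accumulation changes; the algebra is the same one-pass formula, applied to shifted values.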

Efficient ways to deal with this issue have been known for over half a century — e.g. see Welford's 1962 paper [1] (which gives one-pass variance and covariance algorithms; stable two-pass algorithms were already well known). Chan et al. [2] (1983) compare a number of variance algorithms and offer a way to decide when to use which (though in practice most implementations use only one algorithm).

See Wikipedia's discussion of this issue in its article on algorithms for calculating variance.

Similar comments apply to covariance.
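A minimal sketch of a Welford-style one-pass correlation (the function name and structure are mine; the update rules are the standard online mean, variance, and covariance recurrences):

```python
import math

def pearson_welford(pairs):
    """One-pass Pearson correlation via Welford-style updates: running
    means plus centered sums of squares and cross-products. Assumes at
    least two pairs and non-constant x and y (else the denominator is 0)."""
    n = 0
    mean_x = mean_y = 0.0
    m2x = m2y = cxy = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mean_x
        mean_x += dx / n
        dy = y - mean_y
        mean_y += dy / n
        m2x += dx * (x - mean_x)   # centered sum of squares of x
        m2y += dy * (y - mean_y)   # centered sum of squares of y
        cxy += dx * (y - mean_y)   # centered sum of cross-products
    return cxy / math.sqrt(m2x * m2y)

print(pearson_welford([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]))  # → 1.0
```

Because every quantity is centered as it is accumulated, there is no subtraction of two huge nearly-equal sums anywhere, which is precisely what the naive formula gets wrong.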

[1] B. P. Welford (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", *Technometrics*, Vol. 4, No. 3, pp. 419–420.

[2] T. F. Chan, G. H. Golub and R. J. LeVeque (1983), "Algorithms for Computing the Sample Variance: Analysis and Recommendations", *The American Statistician*, Vol. 37, No. 3, pp. 242–247.
