# Solved – Is it allowed to use averages on a dataset to improve correlation

I have a dataset with a dependent and an independent variable. Neither is a time series. I have 120 observations, and the correlation coefficient is 0.43.

After this calculation, I added a column for each variable containing the average of every 12 observations, resulting in 2 new columns with 108 observations (pairs). The correlation coefficient of these columns is 0.77.

It seems I improved the correlation this way. Is this allowed? Did I increase the explanatory power of the independent variable by using averages?


Let's have a look at two vectors, the first, ``a``, being

``2 6 2 6 2 6 2 6 2 6 2 6``

and the second, ``b``, being

``6 2 6 2 6 2 6 2 6 2 6 2``

Calculating the Pearson correlation with ``cor(a, b)``, you'll get

``[1] -1``

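However, if you take the average of each successive pair of values, the two vectors become identical:

``4 4 4 4 4 4``

The perfect negative relationship has disappeared entirely. Identical non-constant vectors have correlation 1; here the averaged vectors are constant, so strictly speaking the coefficient is undefined (zero variance).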

This simple example illustrates a downside of your method.
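
As a quick check, the example can be reproduced in R, which is where the ``cor(a,b)`` output above appears to come from. This is only a minimal sketch: ``pair_means`` is a hypothetical helper name introduced here, and it assumes non-overlapping pairs of successive values.

```r
# The two vectors from the example
a <- rep(c(2, 6), times = 6)
b <- rep(c(6, 2), times = 6)

cor(a, b)  # -1

# Hypothetical helper: mean of each non-overlapping pair of successive values.
# matrix() fills column by column, so each column holds one pair.
pair_means <- function(v) colMeans(matrix(v, nrow = 2))

a_avg <- pair_means(a)
b_avg <- pair_means(b)

a_avg                     # 4 4 4 4 4 4
identical(a_avg, b_avg)   # TRUE

# Both averaged vectors are constant, so their standard deviation is zero
# and R returns NA (with a warning) instead of a correlation.
cor(a_avg, b_avg)
```

Whatever value one assigns to the averaged pair, the original correlation of -1 is no longer recoverable from the averaged data.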

Edit: To explain it more generally, the correlation coefficient is computed as follows:

$$\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

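Averaging groups of $X$ values and groups of $Y$ values changes the deviations $X-\mu_X$ and $Y-\mu_Y$. That alters both the covariance in the numerator and the standard deviations in the denominator, so the coefficient of the averaged data can differ substantially from that of the raw data.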
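
To see the individual terms of the formula at work, here is a rough R sketch that evaluates it step by step on the 2/6 vectors from the example. ``pearson_by_hand`` is a hypothetical helper written for this illustration. It uses population (divide-by-n) moments, which is an assumption about how the expectation is estimated, although the scaling cancels in the ratio.

```r
# Hypothetical helper: Pearson correlation written out term by term
pearson_by_hand <- function(x, y) {
  dx <- x - mean(x)              # deviations X - mu_X
  dy <- y - mean(y)              # deviations Y - mu_Y
  covariance <- mean(dx * dy)    # E[(X - mu_X)(Y - mu_Y)]
  sigma_x <- sqrt(mean(dx^2))    # population standard deviation of X
  sigma_y <- sqrt(mean(dy^2))    # population standard deviation of Y
  covariance / (sigma_x * sigma_y)
}

a <- rep(c(2, 6), times = 6)
b <- rep(c(6, 2), times = 6)

# Raw data: every product of deviations is -4 and both sigmas are 2
pearson_by_hand(a, b)            # -1

# Averages of non-overlapping successive pairs: all values equal 4
a_avg <- colMeans(matrix(a, nrow = 2))
b_avg <- colMeans(matrix(b, nrow = 2))

# All deviations are now 0, so the formula evaluates 0 / 0
pearson_by_hand(a_avg, b_avg)    # NaN
```

In the raw data the deviations are all of size 2 with opposite signs; after averaging they are all 0, which is exactly the change in $X-\mu_X$ and $Y-\mu_Y$ described above.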
