Solved – Is it allowed to use averages on a dataset to improve correlation?

I have a dataset with one dependent and one independent variable. Neither is a time series. I have 120 observations, and the correlation coefficient is 0.43.

After this calculation, I added a column for each variable containing the average of every 12 observations, resulting in 2 new columns with 108 observations (pairs). The correlation coefficient of these columns is 0.77.

It seems I improved the correlation this way. Is this a legitimate thing to do? Did I increase the explanatory power of the independent variable by using averages?

Let's have a look at two vectors, the first being

    2 6 2 6 2 6 2 6 2 6 2 6 

and the second vector being

    6 2 6 2 6 2 6 2 6 2 6 2

Calculating the Pearson correlation you'll get

    > cor(a, b)
    [1] -1

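However, if you average successive pairs of values, both vectors become identical:

    4 4 4 4 4 4

The perfect negative correlation has vanished. (Strictly speaking, the Pearson correlation of these averaged vectors is undefined, because constant vectors have zero standard deviation; identical vectors that are not constant would have correlation 1.)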

This simple example illustrates a downside of your method.
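To check the example by hand, here is a minimal sketch in plain Python (not the R session quoted above). The helper names `pearson` and `pair_means` are made up for this illustration, and returning `None` for a zero-variance input is my own choice for signalling that the coefficient is undefined.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation; returns None if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # zero variance: the coefficient is undefined
    return sxy / sqrt(sxx * syy)

def pair_means(v):
    """Average successive non-overlapping pairs: (v[0]+v[1])/2, (v[2]+v[3])/2, ..."""
    return [(v[i] + v[i + 1]) / 2 for i in range(0, len(v) - 1, 2)]

a = [2, 6] * 6  # 2 6 2 6 ... (12 values)
b = [6, 2] * 6  # 6 2 6 2 ... (12 values)

r_raw = pearson(a, b)              # -1.0: perfect negative correlation
a_avg, b_avg = pair_means(a), pair_means(b)
r_avg = pearson(a_avg, b_avg)      # None: both averaged vectors are constant

print(r_raw, a_avg, b_avg, r_avg)
```

All the variation that produced the correlation of -1 lived inside the pairs, so averaging removes it completely.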

Edit: To explain it more generally: The correlation coefficient is computed in the following way.

$\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$

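Averaging some $X$s and some $Y$s changes the differences between $X$ and $\mu_X$ as well as the differences between $Y$ and $\mu_Y$. That alters both the numerator and the denominator of this ratio, so the correlation of the averaged data can differ substantially from that of the original data.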
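The same effect shows up with overlapping (moving) averages, which is closer to what the question describes. The sketch below uses made-up data, not the asker's: a shared upward trend plus an alternating term that enters `x` and `y` with opposite signs. The window length of 2 and the helper names `pearson` and `rolling_mean` are assumptions chosen only to keep the illustration small.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rolling_mean(v, window):
    """Mean of every run of `window` consecutive values (overlapping windows)."""
    return [sum(v[i:i + window]) / window for i in range(len(v) - window + 1)]

# Illustrative data: common trend i, alternating term +5/-5 with opposite sign in x and y
x = [i + 5 * (-1) ** i for i in range(12)]
y = [i - 5 * (-1) ** i for i in range(12)]

r_raw = pearson(x, y)                      # about -0.36 on the raw values
x_roll, y_roll = rolling_mean(x, 2), rolling_mean(y, 2)
r_roll = pearson(x_roll, y_roll)           # 1.0: the alternating term cancels in each window

print(round(r_raw, 2), round(r_roll, 2))
```

Here the averaged columns correlate perfectly although the raw values are negatively correlated. The averages describe the smoothed series, so a high correlation between them says little about how well one raw observation predicts the other.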
