Solved – How to rearrange 2D data to get given correlation

I have the following simple dataset with two continuous variables:

    d = data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))
    plot(d$x, d$y)
    abline(lm(y ~ x, d), col = "red")
    cor(d$x, d$y) # = 0.2135273

[Figure: base distribution, scatter plot of x against y with the fitted regression line in red]

I need to rearrange the data so that the correlation between the variables is about 0.6, while keeping the means and other descriptive statistics (sd, min, max, etc.) of both variables unchanged.

I know that almost any correlation can be obtained with the given data. For example, sorting both variables gives a correlation close to 1:

    d2 = with(d, data.frame(x = sort(x), y = sort(y)))
    plot(d2$x, d2$y)
    abline(lm(y ~ x, d2), col = "red")
    cor(d2$x, d2$y) # i.e. 0.9965585


If I try to use the sample function for this task:

    cor.results = c()
    for (i in 1:1000) {
        set.seed(i)
        d3 = with(d, data.frame(x = sample(x), y = sample(y)))
        cor.results = c(cor.results, cor(d3$x, d3$y))
    }

I get quite a wide range of correlations:

    > summary(cor.results)
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    -0.281600 -0.038330 -0.002498 -0.001506  0.034380  0.288800

but this range depends on the number of rows in the data frame and narrows as the size increases:

    > d = data.frame(x = runif(1000, 0, 100), y = runif(1000, 0, 100))
    > cor.results = c()
    > for (i in 1:1000) {
    +     set.seed(i)
    +     d3 = with(d, data.frame(x = sample(x), y = sample(y)))
    +     cor.results = c(cor.results, cor(d3$x, d3$y))
    + }
    > summary(cor.results)
          Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
    -0.1030000 -0.0231300 -0.0005248 -0.0005547  0.0207000  0.1095000
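The narrowing with sample size is expected: when one variable is shuffled at random, the sample correlation has mean 0 and variance 1/(n - 1), so its spread shrinks roughly like 1/sqrt(n). The sketch below checks this by simulation; the helper name `perm_cor_sd` is made up for illustration, and shuffling only `y` is used because it is equivalent to shuffling both columns.

```r
set.seed(1)

# Hypothetical helper: standard deviation of the correlation obtained
# when y is randomly permuted against a fixed x.
perm_cor_sd <- function(n, reps = 1000) {
  x <- runif(n, 0, 100)
  y <- runif(n, 0, 100)
  sd(replicate(reps, cor(x, sample(y))))
}

sizes <- c(100, 1000)
res <- data.frame(
  n            = sizes,
  simulated_sd = sapply(sizes, perm_cor_sd),
  theory_sd    = 1 / sqrt(sizes - 1)   # permutation-distribution value
)
print(res)
```

Random shuffling therefore cannot be relied on to reach a correlation such as 0.7 unless the data set is very small.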

My question is:

How can I rearrange such a dataset to get a given correlation (e.g. 0.7)?
(It would also be good if the method did not depend on the dataset size.)

Here is one way to rearrange the data that is based on generating additional random numbers.

We draw samples from a bivariate normal distribution with the specified correlation. Next, we compute the ranks of the $x$ and $y$ values we obtain. These ranks are then used to reorder the original values. For this approach, we have to sort both the original $x$ and $y$ values.

First, we create the actual data set (like in your example).

    set.seed(1)
    d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))

    cor(d$x, d$y)
    # [1] 0.01703215

Now, we specify a correlation matrix.

    corr <- 0.7  # target correlation
    corr_mat <- matrix(corr, ncol = 2, nrow = 2)
    diag(corr_mat) <- 1
    corr_mat
    #      [,1] [,2]
    # [1,]  1.0  0.7
    # [2,]  0.7  1.0

We generate random data from a bivariate normal distribution with $\mu = 0$ and $\sigma = 1$ (for both variables) and the specified correlation. In R, this can be done with the mvrnorm function from the MASS package. Setting empirical = TRUE makes mu and Sigma describe the empirical (sample) mean and covariance rather than the population values, so the generated sample has exactly the specified correlation.

    library(MASS)
    mvdat <- mvrnorm(n = nrow(d), mu = c(0, 0), Sigma = corr_mat, empirical = TRUE)

    cor(mvdat)
    #      [,1] [,2]
    # [1,]  1.0  0.7
    # [2,]  0.7  1.0

The random data perfectly matches the specified correlation.
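To see what empirical = TRUE changes, the following minimal sketch draws one sample with the default setting and one with empirical = TRUE. The seed and the object names `pop` and `emp` are arbitrary choices for illustration.

```r
library(MASS)  # provides mvrnorm()

set.seed(42)
sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)

# Default (empirical = FALSE): 0.7 is the population correlation,
# so the sample correlation fluctuates around it.
pop <- mvrnorm(n = 100, mu = c(0, 0), Sigma = sigma)

# empirical = TRUE: the sample itself has mean 0 and correlation 0.7.
emp <- mvrnorm(n = 100, mu = c(0, 0), Sigma = sigma, empirical = TRUE)

cor(pop)[1, 2]  # near 0.7, differs from draw to draw
cor(emp)[1, 2]  # 0.7 up to floating-point error
colMeans(emp)   # zero up to floating-point error
```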

Next, we compute the ranks of the random data.

    rx <- rank(mvdat[ , 1], ties.method = "first")
    ry <- rank(mvdat[ , 2], ties.method = "first")

To use the ranks for the original data in d, we have to sort the original data.

    dx_sorted <- sort(d$x)
    dy_sorted <- sort(d$y)
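Why sorting is needed can be seen on a toy example with three made-up values: indexing the sorted values by the ranks of a template vector puts the smallest value where the template is smallest, the second smallest where the template is second smallest, and so on.

```r
values <- c(30, 10, 20)      # data we want to reorder
z      <- c(0.5, -1.0, 2.0)  # template whose ordering we want to copy

r <- rank(z, ties.method = "first")  # 2 1 3
reordered <- sort(values)[r]         # 20 10 30

# The reordered values follow the same ordering as z ...
identical(order(reordered), order(z))
# ... and are still exactly the original values.
identical(sort(reordered), sort(values))
```

Without the sort step, the ranks would index the values in their arbitrary original order and the ordering of the template would not be carried over.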

Now, we can use the ranks to specify the order of the sorted data.

    cor(dx_sorted[rx], dy_sorted[ry])
    # [1] 0.6868986

The obtained correlation does not perfectly match the specified one, but the difference is relatively small.

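Here, dx_sorted[rx] and dy_sorted[ry] are reordered versions of the original data in d. Each contains exactly the original values, so the mean, standard deviation, minimum and maximum of each variable are unchanged.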
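The steps above can be collected into one function. This is only a sketch: the name `reorder_to_cor` and its argument `latent_corr` are made up for illustration. The second call adds a refinement that is not part of the procedure above: for roughly uniform data, the Pearson correlation after reordering is close to the rank correlation of the normal sample, which for a bivariate normal with correlation rho is (6/pi) * asin(rho/2). Using 2 * sin(pi * target / 6) as the latent correlation should therefore land closer to the target. That is an assumption that fits the uniform example here and need not hold for other distributions.

```r
library(MASS)  # provides mvrnorm()

# Hypothetical helper: reorder x and y using the ranks of a bivariate
# normal sample whose empirical correlation is `latent_corr`.
reorder_to_cor <- function(x, y, latent_corr) {
  stopifnot(length(x) == length(y), abs(latent_corr) < 1)
  sigma <- matrix(c(1, latent_corr, latent_corr, 1), nrow = 2)
  z <- mvrnorm(n = length(x), mu = c(0, 0), Sigma = sigma, empirical = TRUE)
  data.frame(
    x = sort(x)[rank(z[, 1], ties.method = "first")],
    y = sort(y)[rank(z[, 2], ties.method = "first")]
  )
}

set.seed(1)
d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))
target <- 0.7

# Plain version: latent correlation equal to the target.
d_plain <- reorder_to_cor(d$x, d$y, latent_corr = target)

# Adjusted version (assumes roughly uniform data, see above).
d_adj <- reorder_to_cor(d$x, d$y, latent_corr = 2 * sin(pi * target / 6))

cor(d_plain$x, d_plain$y)
cor(d_adj$x, d_adj$y)

# Both columns are permutations of the original values, so the
# descriptive statistics of each variable are preserved.
all.equal(sort(d_adj$x), sort(d$x))
all.equal(sort(d_adj$y), sort(d$y))
```

Because the ranks come from a sample with a fixed empirical correlation, the result does not drift towards zero as the number of rows grows, unlike random shuffling.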
