# Solved – How to rearrange 2D data to get given correlation

I have the following simple dataset with two continuous variables:

```r
d = data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))
plot(d$x, d$y)
abline(lm(y ~ x, d), col = "red")
cor(d$x, d$y) # = 0.2135273
```

I need to rearrange the data so that the correlation between the variables is about 0.6, while keeping the means and other descriptive statistics (sd, min, max, etc.) of both variables constant.

I know it is possible to obtain almost any correlation with the given data, e.g.:

```r
d2 = with(d, data.frame(x = sort(x), y = sort(y)))
plot(d2$x, d2$y)
abline(lm(y ~ x, d2), col = "red")
cor(d2$x, d2$y) # i.e. 0.9965585
```

If I try to use the `sample` function for this task:

```r
cor.results = c()
for (i in 1:1000) {
    set.seed(i)
    d3 = with(d, data.frame(x = sample(x), y = sample(y)))
    cor.results = c(cor.results, cor(d3$x, d3$y))
}
```

I get quite a wide range of correlations:

```
> summary(cor.results)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.281600 -0.038330 -0.002498 -0.001506  0.034380  0.288800
```

but this range depends on the number of rows in the data frame and narrows as the dataset grows:

```
> d = data.frame(x = runif(1000, 0, 100), y = runif(1000, 0, 100))
> cor.results = c()
> for (i in 1:1000) {
+     set.seed(i)
+     d3 = with(d, data.frame(x = sample(x), y = sample(y)))
+     cor.results = c(cor.results, cor(d3$x, d3$y))
+ }
> summary(cor.results)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-0.1030000 -0.0231300 -0.0005248 -0.0005547  0.0207000  0.1095000
```
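The narrowing with sample size is expected: when one column is shuffled independently of the other, the permutation distribution of Pearson's correlation has mean 0 and variance 1/(n - 1). A minimal sketch that checks this by simulation is below; `perm_cor_sd` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: standard deviation of cor(x, shuffled y) over many shuffles.
perm_cor_sd <- function(n, reps = 2000) {
  x <- runif(n, 0, 100)
  y <- runif(n, 0, 100)
  sd(replicate(reps, cor(x, sample(y))))
}

set.seed(42)
sizes <- c(100, 1000)
simulated <- sapply(sizes, perm_cor_sd)
theoretical <- 1 / sqrt(sizes - 1)

# Simulated and theoretical spread side by side
data.frame(n = sizes, simulated = simulated, theoretical = theoretical)
```

So random shuffling alone will essentially never reach a correlation such as 0.7 for datasets of this size, which is why a constructive method is needed.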

My question is:

How can I rearrange such a dataset to get a given correlation (e.g. 0.7)?
(It would also be good if the method removed the dependence on dataset size.)


Here is one way to rearrange the data that is based on generating additional random numbers.

We draw samples from a bivariate normal distribution with the specified correlation. Next, we compute the ranks of the $x$ and $y$ values we obtain. These ranks are then used to order the original values. For this approach, we have to sort both the original $x$ and $y$ values.

First, we create the actual data set (like in your example).

```r
set.seed(1)
d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))

cor(d$x, d$y)
# [1] 0.01703215
```

Now, we specify a correlation matrix.

```r
corr <- 0.7  # target correlation
corr_mat <- matrix(corr, ncol = 2, nrow = 2)
diag(corr_mat) <- 1
corr_mat
#      [,1] [,2]
# [1,]  1.0  0.7
# [2,]  0.7  1.0
```

We generate random data following a bivariate normal distribution with $\mu = 0$, $\sigma = 1$ (for both variables) and the specified correlation. In R, this can be done with the `mvrnorm` function from the `MASS` package. We use `empirical = TRUE` so that `mu` and `Sigma` specify the empirical (sample) mean and covariance rather than the population values.

```r
library(MASS)
mvdat <- mvrnorm(n = nrow(d), mu = c(0, 0), Sigma = corr_mat, empirical = TRUE)

cor(mvdat)
#      [,1] [,2]
# [1,]  1.0  0.7
# [2,]  0.7  1.0
```

The random data perfectly matches the specified correlation.
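To see what `empirical = TRUE` changes, here is a small side-by-side sketch. The variable names `pop_draw` and `emp_draw` are introduced only for this illustration; the `mvrnorm` arguments are the same ones used above.

```r
library(MASS)

set.seed(1)
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)

# empirical = FALSE (the default): 0.7 is the population correlation,
# so the sample correlation only scatters around it.
pop_draw <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)
cor(pop_draw)[1, 2]

# empirical = TRUE: the sample itself is transformed so that its
# mean is 0 and its covariance matrix equals Sigma.
emp_draw <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)
cor(emp_draw)[1, 2]
```

With the default setting the starting point for the ranks would already be off target by some random amount, so `empirical = TRUE` removes one source of noise.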

Next, we compute the ranks of the random data.

```r
rx <- rank(mvdat[ , 1], ties.method = "first")
ry <- rank(mvdat[ , 2], ties.method = "first")
```

To use the ranks for the original data in `d`, we have to sort the original data.

```r
dx_sorted <- sort(d$x)
dy_sorted <- sort(d$y)
```

Now, we can use the ranks to specify the order of the sorted data.

```r
cor(dx_sorted[rx], dy_sorted[ry])
# [1] 0.6868986
```

The obtained correlation does not perfectly match the specified one, but the difference is relatively small.
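The gap is not only sampling noise. Only the ranks of the normal draws are carried over, and for a bivariate normal with correlation $\rho$ the population rank (Spearman) correlation is $(6/\pi)\arcsin(\rho/2)$, which is about 0.683 for $\rho = 0.7$. Because the values here are roughly uniform, the Pearson correlation of the reordered data behaves much like that rank correlation. One possible refinement, which is not part of the method described above, is to feed `mvrnorm` the inflated value $2\sin(\pi \cdot \text{target}/6)$. The sketch below assumes roughly uniform data; `target`, `rho_normal`, `x_new`, `y_new` and `achieved` are names introduced here.

```r
library(MASS)

set.seed(1)
d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))

target <- 0.7
# Assumption: the data are roughly uniform, so the Pearson correlation of the
# reordered values behaves like the rank correlation of the normal draws.
rho_normal <- 2 * sin(pi * target / 6)  # about 0.717

Sigma <- matrix(c(1, rho_normal, rho_normal, 1), nrow = 2)
mvdat <- mvrnorm(n = nrow(d), mu = c(0, 0), Sigma = Sigma, empirical = TRUE)

# Reorder the sorted original values by the ranks of the normal draws
x_new <- sort(d$x)[rank(mvdat[, 1], ties.method = "first")]
y_new <- sort(d$y)[rank(mvdat[, 2], ties.method = "first")]

achieved <- cor(x_new, y_new)
achieved  # scatters around the target; the exact value depends on the draw
```

For strongly skewed data this adjustment is less appropriate, since the link between rank correlation and Pearson correlation then depends on the shape of the marginal distributions.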

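Note that `dx_sorted[rx]` and `dy_sorted[ry]` are reordered versions (permutations) of the original data in `d`. They contain exactly the same values, so the mean, sd, min, max and all other univariate statistics of each variable are unchanged.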
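The steps above can be collected into one function. This is a minimal sketch rather than a polished implementation; `rearrange_to_cor` is a hypothetical name, and the function simply repeats the sort-and-rank procedure with `target` as the correlation passed to `mvrnorm`.

```r
library(MASS)

# Hypothetical helper: reorder x and y so their correlation is near `target`.
rearrange_to_cor <- function(x, y, target) {
  stopifnot(length(x) == length(y), abs(target) < 1)
  Sigma <- matrix(c(1, target, target, 1), nrow = 2)
  z <- mvrnorm(n = length(x), mu = c(0, 0), Sigma = Sigma, empirical = TRUE)
  data.frame(
    x = sort(x)[rank(z[, 1], ties.method = "first")],
    y = sort(y)[rank(z[, 2], ties.method = "first")]
  )
}

set.seed(1)
d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))
d_new <- rearrange_to_cor(d$x, d$y, target = 0.7)

cor(d_new$x, d_new$y)                # close to 0.7, not exact
all.equal(sort(d_new$x), sort(d$x))  # same values, so same mean, sd, min, max
all.equal(sort(d_new$y), sort(d$y))
```

Because the target is imposed through the ranks of a sample of the same size, the result does not get harder to reach as the dataset grows, unlike random shuffling.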
