I have the following simple dataset with two continuous variables; i.e.:
d = data.frame(x=runif(100,0,100),y = runif(100,0,100)) plot(d$x,d$y) abline(lm(y~x,d), col="red") cor(d$x,d$y) # = 0.2135273
I need to rearrange the data in the way to have correlation between variables to be ~0.6. I need to keep means and other descriptive statistics (sd,min,max,etc.) of both variables constant.
I know it is possible to make almost any correlation with the given data i.e.:
d2 = with(d,data.frame(x=sort(x),y=sort(y))) plot(d2$x,d2$y) abline(lm(y~x,d2), col="red") cor(d2$x,d2$y) # i.e. 0.9965585
If I try to use sample
function for this task:
cor.results = c() for(i in 1:1000){ set.seed(i) d3 = with(d,data.frame(x=sample(x),y=sample(y))) cor.results = c(cor.results,cor(d3$x,d3$y)) }
I get quite wide range of correlations:
> summary(cor.results) Min. 1st Qu. Median Mean 3rd Qu. Max. -0.281600 -0.038330 -0.002498 -0.001506 0.034380 0.288800
but this range depends on number of rows in data frame and decreasing with increase of size.
> d = data.frame(x=runif(1000,0,100),y = runif(1000,0,100)) > cor.results = c() > for(i in 1:1000){ + set.seed(i) + d3 = with(d,data.frame(x=sample(x),y=sample(y))) + cor.results = c(cor.results,cor(d3$x,d3$y)) + } > summary(cor.results) Min. 1st Qu. Median Mean 3rd Qu. Max. -0.1030000 -0.0231300 -0.0005248 -0.0005547 0.0207000 0.1095000
My question is:
How to rearrange such dataset to get given correlation (i.e. 0.7)?
(It will be also good if method will remove dependence on dataset size)
Best Answer
Here is one way to rearrange the data that is based on generating additional random numbers.
We draw samples from a bivariate normal distribution with specified correlation. Next, we compute the ranks of the $x$ and $y$ values we obtain. These ranks are used to order the original values. For this approach, we have top sort both the original $x$ and $y$ values.
First, we create the actual data set (like in your example).
set.seed(1) d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100)) cor(d$x, d$y) # [1] 0.01703215
Now, we specify a correlation matrix.
corr <- 0.7 # target correlation corr_mat <- matrix(corr, ncol = 2, nrow = 2) diag(corr_mat) <- 1 corr_mat # [,1] [,2] # [1,] 1.0 0.7 # [2,] 0.7 1.0
We generate random data following a bivariate normal distribution with $mu = 0$, $sigma = 1$ (for both variables) and the specified correlation. In R, this can be done with the mvrnorm
function from the MASS
package. We use empirical = TRUE
to indicate that the correlation is the empirical correlation (not the population correlation).
library(MASS) mvdat <- mvrnorm(n = nrow(d), mu = c(0, 0), Sigma = corr_mat, empirical = TRUE) cor(mvdat) # [,1] [,2] # [1,] 1.0 0.7 # [2,] 0.7 1.0
The random data perfectly matches the specified correlation.
Next, we compute the ranks of the random data.
rx <- rank(mvdat[ , 1], ties.method = "first") ry <- rank(mvdat[ , 2], ties.method = "first")
To use the ranks for the original data in d
, we have to sort the original data.
dx_sorted <- sort(d$x) dy_sorted <- sort(d$y)
Now, we can use the ranks to specify the order of the sorted data.
cor(dx_sorted[rx], dy_sorted[ry]) # [1] 0.6868986
The obtained correlation does not perfectly match the specified one, but the difference is relatively small.
Here, dx_sorted[rx]
and dy_sorted[ry]
are resampled versions of the original data in d
.
Similar Posts:
- Solved – Generating correlated uniform random variable with R
- Solved – Correlation between four (more) variables
- Solved – Correlation between four (more) variables
- Solved – invariance of correlation to linear transformation: $text{corr}(aX+b, cY+d) = text{corr}(X,Y)$
- Solved – invariance of correlation to linear transformation: $text{corr}(aX+b, cY+d) = text{corr}(X,Y)$