Solved – Correlation in R with imbalanced data

Let's say I have two vectors of different lengths. Of course, cor.test fails when I run it on them in R.

x = c(1,3,5,7,2,12,13,14,5,16)
y = c(1,4,5,6,7)
> cor.test(x,y)
Error in cor.test.default(x, y) : 'x' and 'y' must have the same length

Obviously, regression also fails in this scenario given the length imbalance.

> lm(y~x)
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
  variable lengths differ (found for 'x')

What is the proper statistical solution to get some sort of "correlation" or "association" metric when two variables are imbalanced? Any explanation of the stats behind it would be greatly appreciated.

Wikipedia writes "Correlation is any of a broad class of statistical relationships involving dependence" ( https://en.wikipedia.org/wiki/Correlation_and_dependence ). You can only have it in dependent data, that is, data where each value of one variable is paired with a value of the other. Usually, when you write R code like this:

x <- c(3, 4, 5, 6)
y <- c(1, 6, 7, 8)
cor(x, y, method="...")

this implies that the first element in x has something in common with the first element in y, the second with the second, the third with the third, and so on. Correlation asks questions like "If some element in x is large, does the corresponding element in y tend to be large as well?". Now, in something like

x = c(1,3,5,7,2,12,13,14,5,16)
y = c(1,4,5,6,7)

there is no information about which y values should be large or small when x is 12, 13, 14, 5 or 16. Therefore, R throws an error so that you can think about what went wrong.
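To see that it is the pairing, and not just the set of values, that determines the correlation, here is a minimal sketch using the four-element vectors from above. Reversing y with rev() is only an illustrative choice: the values stay the same, but each y is now matched with a different x.

```r
# Paired observations: element i of x belongs with element i of y.
x <- c(3, 4, 5, 6)
y <- c(1, 6, 7, 8)

# Pearson correlation with the original pairing.
r_paired <- cor(x, y)

# Same y values, different pairing: large x now meets small y.
r_reversed <- cor(x, rev(y))

print(r_paired)
print(r_reversed)
```

The two results differ in sign even though neither vector gained or lost a value, which is why a correlation cannot be defined without knowing which elements belong together.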

Either your data has no clear definition of what a correlation between the two variables should be, because it is not dependent data, or there are data points missing. In the second case, make the pairing explicit by inserting NA wherever a value is missing.

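This would work (assuming these are the positions where y is missing):

x = c(1, 3, 5, 7, 2, 12, 13, 14, 5, 16)
y = c(1, 4, 5, 6, NA, NA, NA, 7, NA, NA)
cor(x, y, use = "complete.obs")

Note that cor() has no na.rm argument; missing values are handled through use. Now it is clear which element in x belongs to which element in y.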
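As a sketch of what happens to the NA-padded vectors, the snippet below assumes the same NA positions as the example above (they are illustrative, not known). cor() handles missing values through its use argument, and the result matches a manual computation on the complete pairs only. The names keep, r_na, r_manual and ct are introduced here for illustration.

```r
x <- c(1, 3, 5, 7, 2, 12, 13, 14, 5, 16)
y <- c(1, 4, 5, 6, NA, NA, NA, 7, NA, NA)

# Drop every position where either value is missing, then correlate.
r_na <- cor(x, y, use = "complete.obs")

# The same thing by hand: keep only positions with both values present.
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

# cor.test() removes incomplete pairs on its own before testing.
ct <- cor.test(x, y)

print(sum(keep))   # number of usable pairs
print(r_na)
print(ct$p.value)
```

Only the complete pairs contribute, so the effective sample size is the number of positions where both x and y are observed, not the length of x.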

For a more visual and maybe more intuitive approach, let's for the moment treat Pearson correlation as an answer to the question of how well a regression line fits the data points in a scatter plot (this is a simplification). Take your example of 10 x values and 5 y values and see whether you can draw it as a scatter plot.
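The link between Pearson correlation and the regression line can be sketched as follows, using the paired four-element vectors from earlier: for a simple linear regression, the squared correlation equals the R-squared of the fit. The second half tries the scatter plot with the unequal-length vectors; the tryCatch wrapper and the name plot_attempt are just illustrative scaffolding.

```r
# Paired data: a scatter plot and a regression line are well defined.
x <- c(3, 4, 5, 6)
y <- c(1, 6, 7, 8)

fit <- lm(y ~ x)
r <- cor(x, y)
r2 <- summary(fit)$r.squared

# For simple linear regression, r^2 and R-squared agree.
print(c(from_cor = r^2, from_lm = r2))

# Draw the picture only in an interactive session.
if (interactive()) {
  plot(x, y)
  abline(fit)
}

# Unpaired data: there are no (x, y) points to draw.
x10 <- c(1, 3, 5, 7, 2, 12, 13, 14, 5, 16)
y5 <- c(1, 4, 5, 6, 7)
plot_attempt <- tryCatch({
  plot(x10, y5)
  "ok"
}, error = function(e) "error")
print(plot_attempt)
```

The plot fails for the same reason cor.test and lm do: without a pairing there is no point cloud for a line to fit.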
