Serious problem with dropping observations with missing values when computing correlation matrix

I have a large data set with about 2,500 variables and 142 observations.

I want to run a correlation between Variable X and the rest of the variables. But for many columns, there are entries missing.

I tried to do this in R using the "pairwise-complete" option (use = "pairwise.complete.obs") and it output a bunch of correlations. But then someone on StackOverflow posted a link to this article, and it makes the "pairwise-complete" method in R look unusable.

My Question: How do I know when it is appropriate to use the "pairwise-complete" option?

When I tried use = "complete.obs" instead, R returned "no complete element pairs", so if you could explain what that means too, that would be great.

The issue with correlations on pairwise complete observations

In the case you describe, the main issue is interpretation. Because you're using pairwise complete observations, you are actually analyzing slightly different datasets for each of the correlations, depending on which observations are missing.

Consider the following example:

    a <- c(NA, NA, NA, 5, 6, 3, 7, 8, 3)
    b <- c(2, 8, 3, NA, NA, NA, 6, 9, 5)
    c <- c(2, 9, 6, 3, 2, 3, NA, NA, NA)

The three variables in the dataset, a, b, and c, each have some missing values. If you calculate correlations on pairs of variables here, you'll only be able to use cases that have no missing values on either of the two variables in question. In this case, that means you'll be analyzing just the last three cases for the correlation between a and b, just the first three cases for the correlation between b and c, and just the middle three cases for the correlation between a and c.
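To make the "different cases for each pair" point concrete, here is a minimal sketch that counts how many cases each pair of variables shares, and how many cases are complete on all three. The helper names (d, observed, pairwise_n, n_complete, res) are just illustrative. It also tries use = "complete.obs" on the same data, which is presumably the situation behind the "no complete element pairs" message in the question.

```r
a <- c(NA, NA, NA, 5, 6, 3, 7, 8, 3)
b <- c(2, 8, 3, NA, NA, NA, 6, 9, 5)
c <- c(2, 9, 6, 3, 2, 3, NA, NA, NA)
d <- cbind(a, b, c)

# 1 where a value is observed, 0 where it is missing
observed <- 1 * (!is.na(d))

# Entry [i, j] is the number of cases observed on both column i and column j
pairwise_n <- crossprod(observed)
pairwise_n

# Number of cases with no missing values on any variable;
# these are the only cases use = "complete.obs" would keep
n_complete <- sum(complete.cases(d))
n_complete

# With no complete cases, use = "complete.obs" stops with an error.
# Capture the message instead of halting the script.
res <- tryCatch(
  cor(d, use = "complete.obs"),
  error = function(e) conditionMessage(e)
)
res
```

Each pair shares only three cases, and no case is complete on all three variables, so listwise deletion ("complete.obs") leaves nothing to correlate. With 2,500 variables and 142 observations, it would not take much missingness for every row to have at least one NA somewhere.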

The fact that you're analyzing completely different cases when you calculate each correlation means that the resulting pattern of correlations can look nonsensical. See:

    > cor(a, b, use = "pairwise.complete.obs")
    [1] 0.8170572
    > cor(b, c, use = "pairwise.complete.obs")
    [1] 0.9005714
    > cor(a, c, use = "pairwise.complete.obs")
    [1] -0.7559289

This looks like a logical contradiction: a and b are strongly positively correlated, and b and c are also strongly positively correlated, so you would expect a and c to be positively correlated as well, but there's actually a strong association in the opposite direction. You can see why a lot of analysts don't like that.

Edit to include useful clarification from whuber:

Note that part of the argument depends on what "strong" correlation might mean. It is quite possible for a and b as well as b and c to be "strongly positively correlated" while there exists a "strong association in the opposite direction" between a and c, but not quite as extreme as in this example. The crux of the matter is that the estimated correlation (or covariance) matrix might not be positive-definite: that's how one should quantify "strong".
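One way to check this on the toy example is to look at the eigenvalues of the pairwise-complete correlation matrix: a correlation matrix computed from a single complete dataset can never have a negative eigenvalue. The sketch below also uses the standard three-variable bound, which says that given r_ab and r_bc, r_ac has to fall within r_ab * r_bc plus or minus sqrt((1 - r_ab^2) * (1 - r_bc^2)). The variable names (R, eigenvalues, lower, upper) are my own.

```r
a <- c(NA, NA, NA, 5, 6, 3, 7, 8, 3)
b <- c(2, 8, 3, NA, NA, NA, 6, 9, 5)
c <- c(2, 9, 6, 3, 2, 3, NA, NA, NA)

# Correlation matrix assembled from pairwise-complete observations
R <- cor(cbind(a, b, c), use = "pairwise.complete.obs")

# A negative eigenvalue means R cannot be the correlation matrix
# of any single complete dataset
eigenvalues <- eigen(R, symmetric = TRUE)$values
eigenvalues
min(eigenvalues) < 0

# Range of r_ac that is consistent with the observed r_ab and r_bc
half_width <- sqrt((1 - R["a", "b"]^2) * (1 - R["b", "c"]^2))
lower <- R["a", "b"] * R["b", "c"] - half_width
upper <- R["a", "b"] * R["b", "c"] + half_width
c(lower = lower, upper = upper, r_ac = R["a", "c"])
```

If the smallest eigenvalue is negative, the pairwise estimates are mutually inconsistent in exactly the sense whuber describes, and anything downstream that needs a valid correlation matrix (factor analysis, regression from the matrix, and so on) is likely to complain or misbehave.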

The issue with the type of missingness

You may be thinking to yourself, "Well, isn't it okay to just assume that the subset of cases I have available for each correlation follows more or less the same pattern I would get if I had complete data?" And yes, that's true: there's nothing fundamentally wrong with calculating a correlation on a subset of your data (although you lose precision and power, of course, because of the smaller sample size), as long as the available data are a random sample of all of the data that would have been there if you didn't have any missingness.

When the missingness is purely random, that's called MCAR (missing completely at random). In that case, analyzing the subset of the data that doesn't have missingness won't systematically bias your results, and it would be unlikely (but not impossible) to get the kind of nutsy correlation pattern I showed in the example above.

When your missingness is systematic in some way (often abbreviated MAR, for "missing at random", or NI, for "non-ignorable", delineating two different kinds of systematic missingness), then you have much more serious issues, both in terms of potentially introducing bias in your calculations and in terms of your ability to generalize your results to the population of interest (because the sample you're analyzing is not a random sample from the population, even if your full dataset would have been).
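A small simulation can illustrate the difference between the two situations. This is only a sketch on made-up data: the sample size, the true correlation of 0.5, the 50% deletion rate, and the rule "y is missing whenever it is above 0" are all assumptions I picked for illustration, and that last rule is an example of the non-ignorable kind, since missingness depends on the unobserved value itself.

```r
set.seed(1)
n <- 10000

# Two standard-normal variables with a population correlation of 0.5
x <- rnorm(n)
y <- 0.5 * x + sqrt(1 - 0.5^2) * rnorm(n)
r_full <- cor(x, y)

# MCAR: each y value is deleted by an independent coin flip
y_mcar <- y
y_mcar[runif(n) < 0.5] <- NA
r_mcar <- cor(x, y_mcar, use = "pairwise.complete.obs")

# Systematic: y is missing whenever its own value is above 0
y_sys <- y
y_sys[y > 0] <- NA
r_sys <- cor(x, y_sys, use = "pairwise.complete.obs")

round(c(full = r_full, mcar = r_mcar, systematic = r_sys), 3)
```

Under the coin-flip deletion, the estimate should land close to the full-data correlation, just with less precision. Under the systematic rule, the remaining cases cover only part of the range of y, and the correlation is noticeably attenuated no matter how large the sample is.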

There are a lot of great resources available to learn about missing data and how to deal with it, but my recommendation is Rubin: a classic, and a more recent article.
