Solved – Methods for imputation of missing values with spatially autocorrelated data

I am looking at the spatial patterns of turnover in aquatic assemblages using gradient forest and generalized dissimilarity models in R. I have species and environmental data for more than 400 sites sampled from streams across the country. However, I also have many missing values in my predictor (environmental) variables, up to 30% for some variables, so removing the incomplete cases or replacing the missing values with the mean would result in bias and loss of information.

As for the missingness pattern, my data are really an assembly of data collected by different county administrations, and the decisions about which variables to sample are made at the county level. For example, some counties have monitored total and dissolved nutrients, but others have routinely monitored only total nutrients. The missingness pattern is thus affected by those decisions. The following matrix is an example of what my data look like:

          County V1  V2  V3  V4  V5  V6
     [1,] 10     52  6   35  294 48  25
     [2,] 10     22  7   41  53  42  NA
     [3,] 10     118 NA  55  82  59  NA
     [4,] 10     150 8   13  91  63  15
     [5,] 10     500 NA  NA  NA  102 9
     [6,] 9      58  7   NA  22  73  7
     [7,] 9      9   6.5 NA  38  152 17
     [8,] 9      9   7   NA  14  224 11
     [9,] 9      142 5.5 NA  57  64  11
    [10,] 9      90  6   NA  102 66  NA
    [11,] 6      30  7   9   NA  NA  11
    [12,] 6      420 4.5 8   NA  NA  NA
    [13,] 6      43  4.5 3.5 NA  NA  NA
    [14,] 6      50  6.5 116 NA  NA  14
    [15,] 6      10  NA  13  NA  NA  8

where "County" identifies the county administration that provided the data, and V1 to V6 are the environmental variables.
In county 10 all variables are sampled, but some values are missing at random.
In county 9 variable V3 is not routinely monitored.
In county 6 variables V4 and V5 are not monitored; in addition, there are missing values in the routinely measured variables due to, e.g., failure of the sampling device.
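One way to check this pattern directly is to tabulate, per county, the fraction of missing values in each variable. Below is a minimal Python sketch of that tabulation, using the fifteen example rows from the matrix above; `None` stands in for R's `NA`, and the function names are made up for illustration. A fraction of 1.0 flags a variable that a county never sampled, as opposed to sporadic gaps.

```python
from collections import defaultdict

VARS = ["V1", "V2", "V3", "V4", "V5", "V6"]
NA = None  # stand-in for R's NA

# (County, V1, V2, V3, V4, V5, V6), copied from the example matrix.
ROWS = [
    (10, 52, 6, 35, 294, 48, 25),
    (10, 22, 7, 41, 53, 42, NA),
    (10, 118, NA, 55, 82, 59, NA),
    (10, 150, 8, 13, 91, 63, 15),
    (10, 500, NA, NA, NA, 102, 9),
    (9, 58, 7, NA, 22, 73, 7),
    (9, 9, 6.5, NA, 38, 152, 17),
    (9, 9, 7, NA, 14, 224, 11),
    (9, 142, 5.5, NA, 57, 64, 11),
    (9, 90, 6, NA, 102, 66, NA),
    (6, 30, 7, 9, NA, NA, 11),
    (6, 420, 4.5, 8, NA, NA, NA),
    (6, 43, 4.5, 3.5, NA, NA, NA),
    (6, 50, 6.5, 116, NA, NA, 14),
    (6, 10, NA, 13, NA, NA, 8),
]


def missing_fraction_by_county(rows):
    """Return {county: {variable: fraction of that county's rows that are missing}}."""
    n_rows = defaultdict(int)
    n_missing = defaultdict(lambda: defaultdict(int))
    for county, *values in rows:
        n_rows[county] += 1
        for name, value in zip(VARS, values):
            if value is NA:
                n_missing[county][name] += 1
    return {
        county: {name: n_missing[county][name] / n for name in VARS}
        for county, n in n_rows.items()
    }


def never_sampled(fractions):
    """Variables with no observed value at all in a county."""
    return {
        county: [name for name, frac in per_var.items() if frac == 1.0]
        for county, per_var in fractions.items()
    }


fractions = missing_fraction_by_county(ROWS)
structural = never_sampled(fractions)
print(structural)
```

Variables that a county never sampled cannot be filled in from that county's own records, so any estimate for them has to borrow from other counties or from other variables.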

I should also mention that my data are spatially autocorrelated at small spatial scales (<10 to 100 km).
I would like to estimate those missing values, but I don’t know which method is the most appropriate, or which R packages are recommended.

If you got all the missing data experts in a room, they probably wouldn't agree on which method is best, although they might agree that what is best depends very much on the precise goal and on what underlies your data.

While you are waiting for a missing data expert to answer, here is one amateur answer, although it draws on some experience with environmental data.

With your data you could try (a) to fill in missings spatially or you could try (b) to fill in using relationships between variables or you could try (c) a combination. In principle, using all the information you have sounds a good idea; in practice even (a) or (b) could be a lot of work and (c) could be a nightmare.
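To make option (b) concrete, here is a minimal Python sketch that fills gaps in one variable from a straight-line fit on another variable, using only the sites where both were observed. The variable names and numbers are made up for illustration (the question mentions total and dissolved nutrients, but gives no values), and a straight line is an assumption, not something the data are known to support.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_from_predictor(xs, ys):
    """Fill None entries of ys from xs, fitting on the complete pairs only."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)            # keep observed values untouched
        elif x is not None:
            filled.append(slope * x + intercept)
        else:
            filled.append(None)         # nothing to predict from
    return filled


# Hypothetical values: total nutrient measured everywhere, dissolved only at some sites.
total = [1.0, 2.0, 3.0, 4.0]
dissolved = [0.5, 1.0, None, 2.0]
print(impute_from_predictor(total, dissolved))
```

Every filled-in value sits exactly on the fitted line, so this approach also makes the completed variable look less variable than it really is.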

My own tendency would be to consider estimating missings as weighted combinations of neighbouring values, weights depending on distances between sampling points. Although it is not obvious from your listing, I guess you have a latitude and longitude or some equivalent somewhere. In principle that is a one-line equation for each variable; in practice it could still be a lot of work, dependent possibly on access to, and skills with, GIS software.
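The one-line equation can be read as inverse distance weighting: estimate = sum(w_i * x_i) / sum(w_i), with w_i = 1 / d_i^p, where d_i is the distance from the site with the gap to neighbour i. Below is a minimal Python sketch under stated assumptions: sites carry latitude and longitude in degrees, distances are great-circle (haversine) distances, the power `p = 2` is a conventional default rather than anything the answer specifies, and the 100 km cut-off is borrowed from the autocorrelation range quoted in the question. The function names and coordinates are made up for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def idw_estimate(target, neighbours, power=2.0, max_km=100.0):
    """Inverse-distance-weighted estimate at target = (lat, lon).

    neighbours is a list of (lat, lon, value); entries whose value is None
    or that lie farther than max_km are ignored. Returns None if no usable
    neighbour remains.
    """
    num = den = 0.0
    for lat, lon, value in neighbours:
        if value is None:
            continue
        d = haversine_km(target[0], target[1], lat, lon)
        if d == 0.0:
            return value  # a co-located observation is used as is
        if d > max_km:
            continue
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den if den > 0 else None


# Hypothetical sites: two usable neighbours, one too far away, one with a gap.
sites = [(0.0, 0.5, 10.0), (0.0, -0.5, 20.0), (5.0, 5.0, 1000.0), (0.0, 0.1, None)]
print(idw_estimate((0.0, 0.0), sites))
```

For streams, straight-line distance is itself a simplification; distance along the river network may be the more meaningful measure, which is where GIS software comes in.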

The set-up of counties seems, possibly, a red herring here. The streams don't know which county they are in. It's possible that being in a county is meaningful if there is something dependent on policy or efficiency, e.g. how lax or tough a county is in controlling pollution or other abuse of the rivers. Otherwise being in a county is just a proxy for spatial variation for other reasons (climate, soils, biota, geology, topography, land use, population pressure, etc., etc.) and distance-weighting will capture that as well as anything else.

You'd still need to compare anything you got with a run on the dataset trimmed of missings. The problem with imputing weighted averages is necessarily that you understate the real variability.
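One way to see the understated variability, and to get a rough idea of how well distance weighting predicts, is a leave-one-out check: hide each observed value in turn, estimate it from the remaining sites, and compare. The Python sketch below assumes planar x/y coordinates for brevity and uses made-up sites and values; it is an illustration of the check, not a recommended workflow.

```python
import math
from statistics import pvariance


def idw(x, y, sites, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (x, y, value) sites."""
    num = den = 0.0
    for sx, sy, value in sites:
        d = math.hypot(sx - x, sy - y)
        if d == 0.0:
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


def leave_one_out(sites, power=2.0):
    """Predict each observed value from all the other sites."""
    predictions = []
    for i, (x, y, _) in enumerate(sites):
        others = sites[:i] + sites[i + 1:]
        predictions.append(idw(x, y, others, power))
    return predictions


# Hypothetical sites on a plane: (x, y, observed value).
SITES = [
    (0.0, 0.0, 10.0),
    (1.0, 0.0, 14.0),
    (0.0, 1.0, 9.0),
    (1.0, 1.0, 20.0),
    (2.0, 1.0, 17.0),
]

observed = [value for _, _, value in SITES]
predicted = leave_one_out(SITES)
rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

print("variance of observed values: ", pvariance(observed))
print("variance of predicted values:", pvariance(predicted))
print("leave-one-out RMSE:          ", rmse)
```

In this toy example the predictions vary far less than the observations, because each prediction is a weighted average and so can never fall outside the range of the values it is built from.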

There are non-spatial methods (multiple imputation etc.) that people are more likely to recommend. I thought I'd speak up for a spatial interpolation method.
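The answer only names multiple imputation, so the following is a sketch of one piece of it rather than of the whole method: the pooling step usually called Rubin's rules. The idea is to impute several times, fit the model on each completed dataset, and combine the results so that the disagreement between imputations is added to the reported uncertainty. The function name and numbers are made up for illustration.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the parameter estimate from each completed dataset
    variances: the squared standard error of that estimate in each dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term is what a single filled-in dataset leaves out, which is the understated variability described above.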

It is interesting sociologically, if immaterial statistically, that you talk about "the country" as if we knew what it was. That leads to a guess that you are from the United States.
