I'm currently working with a data set that has numerous samples collected over time at different sites in a study area, and I'm interested in detecting a trend over time for that area. I know that in an ideal experimental or balanced situation, using a random slope and intercept model is a great way to get at the overall trend within the study area. With our data, however, many of the sites are missing samples and a handful of the sites only have one data point.
I'm curious if there's a way to intuitively understand how the sample imbalance will affect the estimate of the overall slope? To put it differently, are there ways to know if sample imbalances are causing problems, or are there things I can look for in my model output that would indicate I shouldn't trust what the model is estimating?
I created a contrived example with 20 data points to look at this. I put 10 data points with a slope of 1 into one site (a), and put the other 10 data points with a slope of -1 into unique sites (b through l). I had assumed that when I looked at both a random intercept and random slope and intercept model that they would be somewhat similar, or that at least the latter would give more weight to the site with good data over time.
> library(lme4) > set.seed(9999) > x = c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9) + rnorm(20,mean=0,sd=0.1) > y = c(0,1,2,3,4,5,6,7,8,9,9,8,7,6,5,4,3,2,1,0) + rnorm(20,mean=0,sd=0.1) > z = c(rep('a',10),'b','c','d','e','f','h','i','j','k','l') > z = factor(z) > m0 = lm(y~x) > m1 = lmer(y~x+(1|z)) > m2 = lmer(y~x+(1+x|z)) > summary(m0) > summary(m1) > summary(m2) > anova(m1,m2)
As expected, the slope of the linear model was near zero, but the results for the two mixed effects models were nearly opposite. Even though sites b through l only have one data point, it seems like they contribute more towards the slope because the trend is occurring over so many sites. The random slope and intercept model was also preferred to using model selection criteria.
> summary(m0)$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 4.53784796 1.2586990 3.6051890 0.002023703 x -0.01178748 0.2335094 -0.0504797 0.960296079 > summary(m1) Linear mixed model fit by REML ['lmerMod'] Formula: y ~ x + (1 | z) REML criterion at convergence: 62.0877 Random effects: Groups Name Variance Std.Dev. z (Intercept) 33.30788 5.7713 Residual 0.01583 0.1258 Number of obs: 20, groups: z, 11 Fixed effects: Estimate Std. Error t value (Intercept) -0.03597 1.74163 -0.02 x 0.99332 0.01386 71.66 Correlation of Fixed Effects: (Intr) x -0.036 > summary(m2) Linear mixed model fit by REML ['lmerMod'] Formula: y ~ x + (1 + x | z) REML criterion at convergence: 31.0386 Random effects: Groups Name Variance Std.Dev. Corr z (Intercept) 7.78818 2.7907 x 0.37691 0.6139 -1.00 Residual 0.01524 0.1234 Number of obs: 20, groups: z, 11 Fixed effects: Estimate Std. Error t value (Intercept) 8.2121 0.8566 9.587 x -0.8201 0.1882 -4.358 Correlation of Fixed Effects: (Intr) x -0.999 > anova(m1,m2) Data: Models: m1: y ~ x + (1 | z) m2: y ~ x + (1 + x | z) Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq) m1 4 66.206 70.189 -29.103 58.206 m2 6 36.745 42.719 -12.372 24.745 33.462 2 5.419e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I see that under this extreme example, the random slope and intercept have an almost perfect correlation. Is what I can pull from this is that, in a sense, the model gives more value to the sites with only one data point because the overall trend is so strong but across multiple sites, but that I should view the slope estimate this model produces as suspect with such a high correlation? Is there anything else that should look for? For my specific study, I could also set some sort of criteria for what level of replication I thought was necessary to make proper inferences, e.g. eliminate all the sites that less than five samples.
Many thanks for your thoughts.
With 10 positive slopes on the same site whereas the slopes are negative on all other sites (although with only one observation and thus much more uncertainty for each individual site), no need of statistical models to think of a site effect that would bias the estimate. The first site would definitely look like an outlier and we would expect the actual slope to be negative and close to -1.
That's exactly what statistically tells you model 2, and given the strong difference in slope you gave it as input, no wonder model 2 has a higher log-likelihood than model 1. Here model 2 is the correct one, as is avoid your estimates being biased by the large number of outlying observations at site a.
The negative correlation between random effects comes from the fact that the relationships in site a and in other sites do not only have different slopes, they also have different intercepts: 0 in site a vs. 9 in the other sites. There is thus a strong negative correlation between intercept and slope. Here again, using model 2 gives you a better estimate of the intercept across sites, unbiased by the large number of observations in the outlying site a.
Including sites with few events (even only one) is still better than omitting them, as it improves the estimation of the residual variance and fixed effects (see Martin et al. 2011). So you should not set a threshold to eliminate sites with few observations.
- Solved – How to setup a model with hierarchical structure using lmer in R
- Solved – Variable slopes in a fixed effects model
- Solved – mixed models: extract slopes
- Solved – How to treat year variable in observational longitudinal data analysis
- Solved – Problem understanding the interaction term of mixed effects model