# Solved – Interpreting coefficients for random effects models with extremely unbalanced data

I'm currently working with a data set that has numerous samples collected over time at different sites in a study area, and I'm interested in detecting a trend over time for that area. I know that in an ideal experimental or balanced situation, using a random slope and intercept model is a great way to get at the overall trend within the study area. With our data, however, many of the sites are missing samples and a handful of the sites only have one data point.

I'm curious if there's a way to intuitively understand how the sample imbalance will affect the estimate of the overall slope? To put it differently, are there ways to know if sample imbalances are causing problems, or are there things I can look for in my model output that would indicate I shouldn't trust what the model is estimating?

I created a contrived example with 20 data points to look at this. I put 10 data points with a slope of 1 into one site (a), and put the other 10 data points with a slope of -1 into unique sites (b through l). I had assumed that when I looked at both a random intercept and random slope and intercept model that they would be somewhat similar, or that at least the latter would give more weight to the site with good data over time.

``> library(lme4) > set.seed(9999)  > x = c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9) + rnorm(20,mean=0,sd=0.1) > y = c(0,1,2,3,4,5,6,7,8,9,9,8,7,6,5,4,3,2,1,0) + rnorm(20,mean=0,sd=0.1) > z = c(rep('a',10),'b','c','d','e','f','h','i','j','k','l') > z = factor(z)  > m0 = lm(y~x) > m1 = lmer(y~x+(1|z)) > m2 = lmer(y~x+(1+x|z))  > summary(m0) > summary(m1) > summary(m2) > anova(m1,m2) ``

As expected, the slope of the linear model was near zero, but the results for the two mixed effects models were nearly opposite. Even though sites b through l only have one data point, it seems like they contribute more towards the slope because the trend is occurring over so many sites. The random slope and intercept model was also preferred to using model selection criteria.

`` > summary(m0)\$coefficients                 Estimate Std. Error    t value    Pr(>|t|)  (Intercept)  4.53784796  1.2586990  3.6051890 0.002023703  x           -0.01178748  0.2335094 -0.0504797 0.960296079   > summary(m1)  Linear mixed model fit by REML ['lmerMod']  Formula: y ~ x + (1 | z)    REML criterion at convergence: 62.0877    Random effects:   Groups   Name        Variance Std.Dev.   z        (Intercept) 33.30788 5.7713     Residual              0.01583 0.1258    Number of obs: 20, groups: z, 11   Fixed effects:              Estimate Std. Error t value  (Intercept) -0.03597    1.74163   -0.02  x            0.99332    0.01386   71.66   Correlation of Fixed Effects:    (Intr)  x -0.036   > summary(m2)  Linear mixed model fit by REML ['lmerMod']  Formula: y ~ x + (1 + x | z)    REML criterion at convergence: 31.0386    Random effects:   Groups   Name        Variance Std.Dev. Corr    z        (Intercept) 7.78818  2.7907               x           0.37691  0.6139   -1.00   Residual             0.01524  0.1234          Number of obs: 20, groups: z, 11   Fixed effects:              Estimate Std. Error t value  (Intercept)   8.2121     0.8566   9.587  x            -0.8201     0.1882  -4.358   Correlation of Fixed Effects:    (Intr)  x -0.999   > anova(m1,m2)  Data:   Models:  m1: y ~ x + (1 | z)  m2: y ~ x + (1 + x | z)     Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)      m1  4 66.206 70.189 -29.103   58.206                               m2  6 36.745 42.719 -12.372   24.745 33.462      2  5.419e-08 ***  ---  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``

I see that under this extreme example, the random slope and intercept have an almost perfect correlation. Is what I can pull from this is that, in a sense, the model gives more value to the sites with only one data point because the overall trend is so strong but across multiple sites, but that I should view the slope estimate this model produces as suspect with such a high correlation? Is there anything else that should look for? For my specific study, I could also set some sort of criteria for what level of replication I thought was necessary to make proper inferences, e.g. eliminate all the sites that less than five samples.

Contents