I need to compare total concentrations of a chemical across different tissues of a plant species. I sampled multiple tissues from each individual, and sampled at multiple sites. I do not know how to handle the individual and site covariates. There is good reason to expect a site effect on chemical concentrations. I need to control for the fact that concentrations will be correlated across tissues within individuals, but am only interested in whether concentrations differ consistently across tissues. I.e. are leaf concentrations always the lowest, etc.
I have two questions –
Question 1 what it the most appropriate analysis to use?
I do not think I have enough data for a mixed effects model, and would like to avoid using one if possible.
So far, I've used: (1) a three way ANOVA with tissue, individual, and site as factors.
3wy_mod <- (lm(total_conc ~ tissue+indv+site, data=data))
However, I feel like that isn't correct. I think that the model structure should have tissue nested within individual nested within site.
(2) to address the nesting issue, I've also used this model:
nest <- anova(lm(total_conc ~ tissue/indv/site, data=data))
The nested model seems correct. However, I'm not sure how to validate this model. This brings me to
How do I assess model fit, check assumptions, etc with whichever model structure turns out to be appropriate?
Here is an attempt at a reproducible example:
data <- structure(list(indv = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 26L, 26L, 26L, 26L, 26L, 27L, 27L, 27L, 27L, 28L, 28L, 28L, 28L), tissue = structure(c(3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 5L, 1L, 4L, 3L, 1L, 5L, 2L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 2L, 4L, 3L, 1L, 5L, 4L, 3L, 1L, 5L, 4L ), .Label = c("flower", "fruit", "leaf", "pollen", "stem"), class = "factor"), site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"), total_conc = c(8251.26, 3845.59, 6681.23, 504.47, 21494.77, 13514.46, 14244.08, 1091, 4156.25, 8301.19, 12968.21, 5536.14, 9034.05, 11401.89, 12010.11, 2013.21, 14997.58, 12266.25, 23068.28, 4784.39, 9326.1, 5522.93, 9814.8, 3467.77, 7107.22, 6129.18, 7330.3, 7242.1, 8668.38, 5214.82, 1891.02, 5918.55, 11268.32, 3267.63, 3648.9, 13406.13, 15847.31, 6775.97, 1781.21, 10943.45, 14477.16, 7890.34, 1738.36, 14467.25, 11777.99, 5551.34, 1136, 5537.86, 18114.18, 2895.41, 999.91, 3633.19, 7243.96, 4856.73, 8317.76, 15885.53, 6811.57, 15818.11, 1316.81, 15347.71, 8068.19, 14688.62, 2247.32, 17762.4, 6628.74, 16494.26, 11255.5, 19459.68, 5960.92, 20607.73, 3226.06, 11572.8, 7924.07, 12091.07, 2934.25, 12734.17, 5158.48, 16520.8, 1431.31, 14007.17, 4689.99, 23774.68, 5799.91, 12903.34, 18982.81, 9125.87, 14869.07, 20022.23, 28254.47, 9459.41, 1977.36, 22562.52, 22195.51, 12376.81, 2284.3, 1358.63, 24866.6, 15789.27, 9544.07, 275.32, 17669.95, 18935.79, 9357.06, 18818.68, 1676.3, 23224.24, 25432.66, 12328.59, 1367.51, 24176.56, 26710.84, 21184.79, 1642.89)), .Names = c("indv", "tissue", "site", "total_conc"), class = "data.frame", row.names = c(NA, -113L)) 3wy_mod <- anova(lm(total_conc ~ tissue+indv+site, data=data)) nest <- anova(lm(total_conc ~ tissue/indv/site, data=data))
The crossed, non-nested model is incorrect, as you suspect, and seems to be over-specified for your posted data. It's important to get the nesting correct, which can be confusing. This page is a good guide. With multiple individuals per site and each individual restricted to one site, the individuals are nested within site. If you were to use your
anova(lm()) approach you should be writing:
total_conc ~ tissue/site/indiv
which is expanded to:
total_conc ~ 1 + tissue + tissue:site + tissue:site:indiv
So this model examines the main effect of
tissue plus its interactions with
site and with
site:indiv, without main effects for
indiv. The result of your
anova(lm()) on your posted data* would then be:
Analysis of Variance Table Response: total_conc Df Sum Sq Mean Sq F value Pr(>F) tissue 4 3023661191 755915298 1288.243 0.02089 * tissue:site 12 1280847591 106737299 181.903 0.05788 . tissue:site:indv 95 1325836886 13956178 23.784 0.16203 Residuals 1 586780 586780
You only have 1 residual degree of freedom because most cases have only one measurement on each tissue from an individual. So the three-way interaction is fairly meaningless. Although results from the Type I sums of squares used by
anova(), when applied to unbalanced data, can depend on the order of entry of variables into the model formula, in this case there really is only one order of entry that makes sense, so that's not such an issue.
These data are highly unbalanced, with observations on
fruit restricted to 2
indiv at one particular
site. If that's characteristic of your real data, then you probably should consider removing such tissues with low numbers of observations from your analyses. Would you (or a skeptical reviewer) really trust analyses based on 2 observations from a single
site? Removing the
fruit observations from these data, at least, would leave a reasonably balanced design that would probably be good enough for standard nested ANOVA.
Note that both the nested ANOVA approach and mixed models with intercept-only terms like
(1|indiv) are making an implicit assumption that all effects of individuals and sites on
total_conc are additive without regard to the
tissue being evaluated. You must use your knowledge of the subject matter as to the validity of that assumption. With the raw means of
total_conc ranging from 2804 (
pollen) to 16844 (
fruit) I would be a bit worried about that assumption. In principle the formalism of a mixed model could incorporate random effects that differ among
tissue values, but you don't seem to have enough data for that. The suggestion from @MarkWhite to try a Bayesian multilevel model might help, but I have no experience with such models.
If there is some irreducible imbalance in the data then I think that many would argue for a mixed-model approach. My guess is that the errors that you discover with mixed models due to small numbers of observations would also be seen in the
anova(lm()) approach if you dug more deeply. For example, that approach on the crossed model applied to your data provides what seems to be a perfectly reasonable ANOVA table.
> anova(lm(total_conc ~ tissue+site+indv, data=data)) Analysis of Variance Table Response: total_conc Df Sum Sq Mean Sq F value Pr(>F) tissue 4 3023661191 755915298 47.967 < 2.2e-16 *** site 3 746832764 248944255 15.797 3.552e-08 *** indv 24 583953341 24331389 1.544 0.07733 . Residuals 81 1276485152 15759076
Nevertheless, the underlying crossed model can't define 3 regression coefficients because of singularities:
> summary(lm(total_conc ~ tissue+site+indv, data=data)) Call: lm(formula = total_conc ~ tissue + site + indv, data = data) Residuals: Min 1Q Median 3Q Max -9199.3 -2226.1 -84.8 1890.3 9891.5 Coefficients: (3 not defined because of singularities)
*Note that the
indiv values in the posted data need to be converted to factors first.
- Solved – Multilevel model with nested repeated measures design
- Solved – Using AIC to select between models that use nested and non-nested variables
- Solved – do an nested ANOVA but the variances are very unequal
- Solved – Which test to find out best concentration (the one having maximum effect)
- Solved – R chi square test with degrees of freedom