Hi everyone: I'm hoping for some conceptual and practical advice when it comes to statistical analysis using linear models.

The overview is this: I have values measured from 4 trials, each of which included two treatments. In two of the trials the treatment had no effect; in the other two it did. I'd like to use an LM to show this.

I've simulated some data that approximate my real data:

```r
testdf <- data.frame(trial = c(rep("1", 20), rep("2", 20), rep("3", 20), rep("4", 20)),
                     treatment = rep(c(rep("A", 10), rep("B", 10)), 4),
                     value = NA)    # create a data frame of 80 observations

trueval      <- 3.7   # the mean value overall
treatmentval <- 1.2   # the effect of treatment

set.seed(6)   # so we get the same answer each time

testdf$value <- rnorm(80, mean = trueval)   # values drawn from a normal dist

testdf$value[testdf$trial == "3" & testdf$treatment == "B"] <-
  rnorm(10, mean = trueval + treatmentval)  # values for trial 3 treatment B are different
testdf$value[testdf$trial == "4" & testdf$treatment == "B"] <-
  rnorm(10, mean = trueval + treatmentval)  # values for trial 4 treatment B are different

head(testdf)
```

So in my made-up data I know that there's a treatment x trial interaction, but how to test this? An ANOVA gives me the expected result:

```r
> summary(aov(value ~ trial * treatment, data = testdf))
                Df Sum Sq Mean Sq F value  Pr(>F)
trial            3   4.12   1.375   1.127 0.34386
treatment        1   8.69   8.687   7.124 0.00939 **
trial:treatment  3  11.20   3.733   3.061 0.03351 *
```

But I don't want to use an ANOVA (in my real data I have some random effects to deal with).

**My problem with using an LM is that with a factor variable (trial) the model output compares each level of the factor to a reference level, instead of giving me a P-value for the variable as a whole.**

```r
> summary(lm(value ~ trial * treatment, data = testdf))

Call:
lm(formula = value ~ trial * treatment, data = testdf)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7489 -0.7443 -0.0995  0.6794  2.4066

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         3.8054     0.3492  10.897   <2e-16 ***
trial2              0.1057     0.4939   0.214   0.8312
trial3             -0.1710     0.4939  -0.346   0.7301
trial4             -0.4477     0.4939  -0.907   0.3676
treatmentB          0.1767     0.4939   0.358   0.7216
trial2:treatmentB  -0.4871     0.6984  -0.697   0.4878
trial3:treatmentB   1.2905     0.6984   1.848   0.0687 .
trial4:treatmentB   1.1261     0.6984   1.612   0.1113
```

This is annoying because

1. I have to decide which trial (if not 1) should be the "reference"

2. All subsequent trials are compared to that particular reference

3. Doing all these trial-by-trial comparisons makes the interaction hard to pick up.

My goal is to come up with some statistical justification for the statement "The treatment had a significant effect in Trials C and D, but not A and B."

I figure I could run an `lm` for each trial separately and apply some sort of Holm-Bonferroni p-value correction, but I'm trying to limit the number of models I run (to avoid complaints of data dredging). Does anyone have another way to approach dealing with factor variables in linear models?
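(For what it's worth, one single-p-value approach I've come across is comparing nested models with `anova()`, which tests the whole 3-df interaction at once on the `testdf` data above; I'm not sure whether this is the standard way:)

```r
# Compare the model with and without the interaction term.
# anova(reduced, full) performs an F-test on the whole 3-df interaction,
# giving one p-value for the entire trial:treatment term.
full    <- lm(value ~ trial * treatment, data = testdf)
reduced <- lm(value ~ trial + treatment, data = testdf)
anova(reduced, full)

# drop1(full, test = "F") gives the same whole-term test
```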

Thanks a million


#### Best Answer

It is appropriate to use the `lm` function to construct general linear models, including traditional anovas. For routine use to generate anova tables, I recommend the `Anova` function in the `car` package. This gives you type-II sums of squares, which are preferable in some situations to the type-I SS reported by the `anova` function. You will also probably want to use the `emmeans` or `multcomp` packages for comparisons among treatments.

```r
### Uses testdf data frame from question

if(!require(car)){install.packages("car")}
if(!require(emmeans)){install.packages("emmeans")}

model = lm(value ~ trial*treatment, data = testdf)

library(car)

Anova(model)

### Anova Table (Type II tests)
###
### Response: value
###                 Sum Sq Df F value   Pr(>F)
### trial            4.124  3  1.1272 0.343864
### treatment        8.687  1  7.1240 0.009393 **
### trial:treatment 11.200  3  3.0614 0.033513 *
### Residuals       87.800 72

library(emmeans)

marginal = emmeans(model, ~ treatment)
cld(marginal, Letters=letters)

###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.677087 0.1746028 72 3.329023 4.025152  a
###  B         4.336150 0.1746028 72 3.988086 4.684215   b
###
### Results are averaged over the levels of: trial
### Confidence level used: 0.95
### significance level used: alpha = 0.05

marginal = emmeans(model, ~ trial:treatment)
cld(marginal, Letters=letters)

###  trial treatment   emmean        SE df lower.CL upper.CL .group
###  4     A         3.357615 0.3492057 72 2.661486 4.053744  a
###  2     B         3.600667 0.3492057 72 2.904538 4.296796  ab
###  3     A         3.634336 0.3492057 72 2.938207 4.330465  ab
###  1     A         3.805358 0.3492057 72 3.109229 4.501487  ab
###  2     A         3.911040 0.3492057 72 3.214911 4.607169  ab
###  1     B         3.982038 0.3492057 72 3.285909 4.678167  ab
###  4     B         4.660384 0.3492057 72 3.964256 5.356513  ab
###  3     B         5.101512 0.3492057 72 4.405383 5.797641   b
###
### Confidence level used: 0.95
### P value adjustment: tukey method for comparing a family of 8 estimates
### significance level used: alpha = 0.05

hist(residuals(model), col = "darkgray")

plot(predict(model), residuals(model))

### I think that these are the comparisons you want: compare
### treatments within trials

marginal = emmeans(model, ~ treatment|trial)
cld(marginal, Letters=letters)

### trial = 1:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.805358 0.3492057 72 3.109229 4.501487  a
###  B         3.982038 0.3492057 72 3.285909 4.678167  a
###
### trial = 2:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  B         3.600667 0.3492057 72 2.904538 4.296796  a
###  A         3.911040 0.3492057 72 3.214911 4.607169  a
###
### trial = 3:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.634336 0.3492057 72 2.938207 4.330465  a
###  B         5.101512 0.3492057 72 4.405383 5.797641   b
###
### trial = 4:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.357615 0.3492057 72 2.661486 4.053744  a
###  B         4.660384 0.3492057 72 3.964256 5.356513   b
###
### Confidence level used: 0.95
### significance level used: alpha = 0.05
```
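Since you mention random effects in your real data, the same workflow carries over to mixed models fitted with `lmer`. Here is a rough sketch using the `lmerTest` package; note that the `block` grouping below is invented purely so the example runs on `testdf` — substitute your real random-effects structure:

```r
if(!require(lmerTest)){install.packages("lmerTest")}  # loads lme4 as well

library(lmerTest)
library(emmeans)

### Invented grouping, only so this sketch runs on testdf;
### replace with your actual random-effect variable
testdf$block <- factor(rep(1:5, 16))

mixed <- lmer(value ~ trial * treatment + (1 | block), data = testdf)

anova(mixed)   # whole-factor F-tests with Satterthwaite denominator df

emmeans(mixed, pairwise ~ treatment | trial)   # same emmeans workflow as above
```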
