Solved – Interpreting R lm() output with a factor variable

Hi everyone: I'm hoping for some conceptual and practical advice when it comes to statistical analysis using linear models.

Here's the overview: I have values measured in 4 trials, each of which included two treatments. In two of the trials the treatment had no effect; in the other two it did. I'd like to use a linear model (LM) to show this.

I've simulated some data that approximate my real data:

testdf <- data.frame(trial = c(rep("1", 20), rep("2", 20), rep("3", 20), rep("4", 20)),
                     treatment = rep(c(rep("A", 10), rep("B", 10)), 4),
                     value = NA)                    # a data frame of 80 observations

trueval <- 3.7        # the overall mean value
treatmentval <- 1.2   # the effect of treatment

set.seed(6)           # so we get the same answer each time

testdf$value <- rnorm(80, mean = trueval)           # values drawn from a normal distribution

testdf$value[testdf$trial == "3" & testdf$treatment == "B"] <- rnorm(10, mean = trueval + treatmentval)  # trial 3, treatment B differs
testdf$value[testdf$trial == "4" & testdf$treatment == "B"] <- rnorm(10, mean = trueval + treatmentval)  # trial 4, treatment B differs

head(testdf)

So in my made-up data I know there's a treatment × trial interaction, but how do I test for it? An ANOVA gives me the expected result:

> summary(aov(value ~ trial*treatment, data = testdf))
                Df Sum Sq Mean Sq F value  Pr(>F)   
trial            3   4.12   1.375   1.127 0.34386   
treatment        1   8.69   8.687   7.124 0.00939 **
trial:treatment  3  11.20   3.733   3.061 0.03351 * 

But I don't want to use an ANOVA (in my real data I have some random effects to deal with).

My problem with using an LM is that with a factor variable (trial), the model output compares each level of the factor to the reference level, instead of giving me a single p-value for the variable as a whole.

> summary(lm(value ~ trial*treatment, data = testdf))

Call:
lm(formula = value ~ trial * treatment, data = testdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7489 -0.7443 -0.0995  0.6794  2.4066 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         3.8054     0.3492  10.897   <2e-16 ***
trial2              0.1057     0.4939   0.214   0.8312    
trial3             -0.1710     0.4939  -0.346   0.7301    
trial4             -0.4477     0.4939  -0.907   0.3676    
treatmentB          0.1767     0.4939   0.358   0.7216    
trial2:treatmentB  -0.4871     0.6984  -0.697   0.4878    
trial3:treatmentB   1.2905     0.6984   1.848   0.0687 .  
trial4:treatmentB   1.1261     0.6984   1.612   0.1113    

This is annoying because
1. I have to decide which trial (if not trial 1) should be the "reference" (see the relevel() sketch after this list).
2. All the other trials are then compared only to that particular reference.
3. All these trial-by-trial comparisons make the interaction hard to pick up.
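
To illustrate point 1: I can switch the reference level with relevel(), but as far as I can tell that only changes which comparisons summary() reports, not the underlying fit. A minimal sketch with the simulated data above (the copy to tmp is just to avoid touching testdf):

tmp <- testdf
tmp$trial <- relevel(factor(tmp$trial), ref = "3")   # make trial 3 the reference level
summary(lm(value ~ trial * treatment, data = tmp))   # trials 1, 2 and 4 are now each compared to trial 3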

My goal is to come up with some statistical justification for the statement "The treatment had a significant effect in Trials 3 and 4, but not in Trials 1 and 2."

I've thought about running an lm for each trial separately and doing some sort of Holm-Bonferroni p-value correction, but I'm trying to limit the number of models I run (to avoid complaints of data dredging). Does anyone have another way of dealing with factor variables in linear models?
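
For completeness, this is the per-trial approach I mean, sketched with the simulated data (hypothetical code; the row name "treatmentB" is simply what lm() calls the B-vs-A contrast here):

# one lm per trial, then Holm-adjust the per-trial treatment p-values
pvals <- sapply(levels(factor(testdf$trial)), function(tr) {
  fit <- lm(value ~ treatment, data = subset(testdf, trial == tr))
  summary(fit)$coefficients["treatmentB", "Pr(>|t|)"]
})
p.adjust(pvals, method = "holm")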

Thanks a million

It is appropriate to use the lm function to construct general linear models, including traditional ANOVAs. For routine use to generate ANOVA tables, I recommend the Anova function in the car package. This gives you type-II sums of squares, which are preferable in some situations to the type-I SS reported by the base anova function. You will also probably want to use the emmeans or multcomp packages for comparisons among treatments.

#### Uses testdf data frame from question

if(!require(car)){install.packages("car")}
if(!require(emmeans)){install.packages("emmeans")}

model = lm(value ~ trial*treatment, data = testdf)

library(car)

Anova(model)

### Anova Table (Type II tests)
###
### Response: value
###                 Sum Sq Df F value   Pr(>F)   
### trial            4.124  3  1.1272 0.343864   
### treatment        8.687  1  7.1240 0.009393 **
### trial:treatment 11.200  3  3.0614 0.033513 * 
### Residuals       87.800 72

library(emmeans)

marginal = emmeans(model, ~ treatment)

cld(marginal, Letters = letters)

###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.677087 0.1746028 72 3.329023 4.025152  a    
###  B         4.336150 0.1746028 72 3.988086 4.684215   b   
###
### Results are averaged over the levels of: trial 
### Confidence level used: 0.95 
### significance level used: alpha = 0.05

marginal = emmeans(model, ~ trial:treatment)

cld(marginal, Letters = letters)

###  trial treatment   emmean        SE df lower.CL upper.CL .group
###  4     A         3.357615 0.3492057 72 2.661486 4.053744  a    
###  2     B         3.600667 0.3492057 72 2.904538 4.296796  ab   
###  3     A         3.634336 0.3492057 72 2.938207 4.330465  ab   
###  1     A         3.805358 0.3492057 72 3.109229 4.501487  ab   
###  2     A         3.911040 0.3492057 72 3.214911 4.607169  ab   
###  1     B         3.982038 0.3492057 72 3.285909 4.678167  ab   
###  4     B         4.660384 0.3492057 72 3.964256 5.356513  ab   
###  3     B         5.101512 0.3492057 72 4.405383 5.797641   b   
###
### Confidence level used: 0.95 
### P value adjustment: tukey method for comparing a family of 8 estimates 
### significance level used: alpha = 0.05

hist(residuals(model), col = "darkgray")
plot(predict(model), residuals(model))

### I think that these are the comparisons you want:
### treatments within trials

marginal = emmeans(model, ~ treatment | trial)

cld(marginal, Letters = letters)

### trial = 1:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.805358 0.3492057 72 3.109229 4.501487  a    
###  B         3.982038 0.3492057 72 3.285909 4.678167  a    
###
### trial = 2:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  B         3.600667 0.3492057 72 2.904538 4.296796  a    
###  A         3.911040 0.3492057 72 3.214911 4.607169  a    
###
### trial = 3:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.634336 0.3492057 72 2.938207 4.330465  a    
###  B         5.101512 0.3492057 72 4.405383 5.797641   b   
###
### trial = 4:
###  treatment   emmean        SE df lower.CL upper.CL .group
###  A         3.357615 0.3492057 72 2.661486 4.053744  a    
###  B         4.660384 0.3492057 72 3.964256 5.356513   b   
###
### Confidence level used: 0.95 
### significance level used: alpha = 0.05
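
Since you mention random effects in your real data: the same workflow should carry over to mixed models. A rough sketch (untested, and "block" and "yourdata" are placeholders for whatever your real grouping factor and data frame are) would be to fit with lme4::lmer, get a whole-term test from car::Anova, and do the within-trial treatment comparisons with emmeans:

if(!require(lme4)){install.packages("lme4")}

library(lme4)

### "block" is a hypothetical random-effect grouping factor; substitute your own
mixed_model = lmer(value ~ trial*treatment + (1|block), data = yourdata)

Anova(mixed_model)                       # car: type-II Wald chi-square test for each term

marginal = emmeans(mixed_model, ~ treatment | trial)
cld(marginal, Letters = letters)         # treatment comparisons within each trial, as above

For the lmer fit, emmeans will use Kenward-Roger degrees of freedom when the pbkrtest package is installed.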
