I don't know if a similar problem has been asked before so if it has been, please provide me a link to the related/duplicate questions. I am sorry if I seem to be asking too much. But I really like to learn this stuff and this seems to be a good place to start asking.
I have been teaching myself statistics through self-study and I found Logan's Biostatistical Design and Analysis Using R very helpful in that it shows how the actual computations are done (in R) and how the results are interpreted. I particularly like the part about multiple regression (Chapter 9). I use R since it is the most accessible (and free) software that I can get my hands into.
Right now, I am trying to learn multivariate multiple regression. But unfortunately, I can't find a good resource. My specific problem is finding the best linear model for each response variable for the following morphometric data of a plant species (that some of my high school biology students are investigating), where Leaves
, CorL
, CorD
, FilL
, AntL
, AntW
, StaL
, StiW
, and HeiP
are the response variables and pH
, OM
, P
, K
(nutrient variables), Elev
, SoilTemp
, and AirTemp
(environment variables) are the independent variables.
I don't know if it is okay to proceed as in the case of only one dependent variable, but I went through the steps of Example 9B of Logan anyhow.
lily.csv
Leaves,CorL,CorD,FilL,AntL,AntW,StaL,StiW,HeiP,Elev,pH,OM,P,K,SoilTemp,AirTemp 55,213.4,114.6,170.3,10.6,2.35,210,6.7,0.93,1431,6.37,1,3,170,29,26 44,192.15,95.25,160.6,7.1,2.25,176.4,6.55,0.79,1471,6.02,1,0,180,25,23 38,156.75,95.5,155.2,5.65,1.8,170.9,4.4,0.78,1471,6.02,1,0,180,25,23 29,191.8,88.35,155.2,10,2.5,178.25,5.9,0.75,1464,5.99,1,3,150,25,22 36,200.85,99.4,161.9,6.5,1.55,187.4,6.15,0.8,1464,5.99,1,3,150,25,22 43,210.2,74,147,7,1,170,5,0.8,1464,5.99,1,3,150,25,22 34,183.2,97.3,149.5,6.9,1.85,168.8,5.45,0.71,1464,5.99,1,3,150,25,22 52,233.3,107.7,179.6,9.2,3.05,210,6.45,0.82,1464,5.99,1,3,150,25,22 43,205.7,108.8,164.6,9.4,2,190.9,5.15,0.66,1464,5.99,1,3,150,25,22 28,203.15,119.35,160.6,8.9,2.3,180,6.85,0.77,1503,5.98,3,2,240,29.5,25.5 45,188.85,100.5,150.6,6.4,2.3,174.85,7.7,0.84,1503,5.98,3,2,240,29.5,25.5 49,205.2,126.85,150.8,10.1,2.8,177.5,9,0.84,1487,6.09,4,4,180,26,25 35,187.7,102.35,142.1,5.55,1.85,175.35,5.75,0.56,1485,6.17,3.5,1,220,24,23 23,181.05,94.6,136.6,6.9,1.8,169.3,5.8,0.59,1485,6.17,3.5,1,220,24,23 31,172.5,63.7,113.6,5.2,1.5,151.2,4.7,0.57,1482,6.29,5,2,280,24,23 34,190.5,93.1,151.9,5.65,1.85,172.5,5.25,0.68,1482,6.29,5,2,280,24,23 41,185.85,85.2,148.6,5.9,1.05,169.6,5.9,0.62,1472,6.48,0.5,3,170,25.22,22.89 29,195,159.2,159.3,15,4,185,6.3,0.59,1472,6.48,0.5,3,170,25.22,22.89 31,115.6,108.6,165.8,8.5,3,200.5,7.5,0.83,1454,5.53,5,14,350,25.22,22.89 27,176.65,93.1,128.65,6.65,2.85,180.5,6.65,0.53,1454,5.53,5,14,350,25.22,22.89 33,210,119,148,7,3,193,6,0.62,1454,5.53,5,14,350,25.22,22.89 42,200,93,166,18.3,4.55,177,8,1.12,1454,5.53,5,14,350,25.22,22.89 42,205,101.4,166.8,9,2.5,190,8.2,1.12,1454,5.53,5,14,350,25.22,22.89 25,192.9,94.15,147.8,6.45,2.3,167.65,7.15,0.61,1445,5.59,4,7,260,25.22,22.89 36,187.95,65.05,150.2,6.55,2.7,177.5,6.55,0.52,1445,5.59,4,7,260,25.22,22.89 32,110.4,11.6,168.15,7.6,2,197.95,7.85,0.73,1481,6.29,1.5,1,80,25.22,22.89 29,185,80,143,9,2,179,7.5,0.69,1481,6.29,1.5,1,80,25.22,22.89 29,179.8,70.6,134.8,11.15,3.2,165.65,5.3,0.6,1481,6.29,1.5,1,80,25.22,22.89
Firstly, I tried to investigate for possible collinearity among the variables.
library(car) lily = read.csv("lily.csv",header=T) scatterplotMatrix(lily,diag="boxplot")
FilL
, AntL
, AntW
, and HeiP
seem to be non-normal so I made log10
transformations. This seems to work fine. (And it is fine for you to educate me at this point if I am doing it wrong. I'd appreciate it very much.)
scatterplotMatrix(~Elev + pH + OM + P + K + SoilTemp + AirTemp + AirTemp + Leaves + CorL + CorD + log10(FilL) + log10(log10(log10(AntL)+0.1)+0.1) + log10(AntW) + StaL + StiW + log10(HeiP),data=lily,diag="boxplot")
I check for multicolinearity among the independent variables.
> cor(lily[,10:16]) Elev pH OM P K Elev 1.00000000 0.48252995 -0.06601928 -0.56726786 -0.28159580 pH 0.48252995 1.00000000 -0.58587694 -0.81673123 -0.70434283 OM -0.06601928 -0.58587694 1.00000000 0.65931857 0.86478172 P -0.56726786 -0.81673123 0.65931857 1.00000000 0.79782480 K -0.28159580 -0.70434283 0.86478172 0.79782480 1.00000000 SoilTemp 0.14558365 0.01543524 -0.10436250 -0.05023853 -0.01041523 AirTemp 0.26450883 0.15711849 0.16862694 -0.09735977 0.11655030 SoilTemp AirTemp Elev 0.14558365 0.26450883 pH 0.01543524 0.15711849 OM -0.10436250 0.16862694 P -0.05023853 -0.09735977 K -0.01041523 0.11655030 SoilTemp 1.00000000 0.83202496 AirTemp 0.83202496 1.00000000
Among the independent variables, pairs P
and pH
, K
and pH
, P
and K
, OM
and K
, and SoilTemp
and AirTemp
have strong collinearity.
I also checked for collinearity among the dependent variables although I don't have an idea if this is a alright.
> cor(lily[,1:9]) Leaves CorL CorD FilL AntL AntW Leaves 1.000000000 0.44495257 0.17903019 0.5222644 0.1495016 0.004680606 CorL 0.444952572 1.00000000 0.51084625 0.1319070 0.2101801 0.097530007 CorD 0.179030187 0.51084625 1.00000000 0.2368117 0.3297344 0.376806953 FilL 0.522264352 0.13190704 0.23681171 1.0000000 0.3932006 0.284738542 AntL 0.149501570 0.21018008 0.32973443 0.3932006 1.0000000 0.796401542 AntW 0.004680606 0.09753001 0.37680695 0.2847385 0.7964015 1.000000000 StaL 0.416083096 0.06574503 0.23272070 0.7762797 0.2701401 0.318744025 StiW 0.194927129 -0.05594094 0.08322138 0.3752195 0.3755628 0.445964273 HeiP 0.577737137 0.17603412 0.13911530 0.6348948 0.4583508 0.254173681 StaL StiW HeiP Leaves 0.41608310 0.19492713 0.5777371 CorL 0.06574503 -0.05594094 0.1760341 CorD 0.23272070 0.08322138 0.1391153 FilL 0.77627970 0.37521953 0.6348948 AntL 0.27014013 0.37556279 0.4583508 AntW 0.31874403 0.44596427 0.2541737 StaL 1.00000000 0.38306631 0.3794643 StiW 0.38306631 1.00000000 0.5039679 HeiP 0.37946433 0.50396793 1.0000000
From here, I can check for variance inflation and their inverses and possibly investigate interactions but I am really not sure now how to proceed or if it is alright at all to do these things in the multivariate case. And it seems to be a long way still to assessing the best multivariate model. In the case of the one dependent variable case, I can use the MuMIn
package to automate the determination of the best fit but it doesn't work in the multiple response case.
How do I proceed from this point? I will also appreciate it very much if you can point me to a good book or online material (preferably with applications in R).
Best Answer
To start with I will advice you fit different models with different endpoints. Leaves CorL CorD FilL AntL AntW StaL StiW HeiP
. Why will you need a multiple endpoint model in the first place?. Among independent variables there is quite some multicollinearity like you have rightly said, is this something of concern?.Like you have indicated the VIF will help us with that info. Use the following R code to do VIF(variance inflation) regression before the with the selected variable you can use them for regression. Chose an appriopriate threshold for the VIF. I have used 5.
`
vif_func<-function(in_frame,thresh=10,trace=T){ require(fmsb) if(class(in_frame) != "data.frame") in_frame<-data.frame(in_frame) #get initial vif value for all comparisons of variables vif_init<-NULL for(val in names(in_frame)){ form_in<-formula(paste(val," ~ .")) vif_init<-rbind(vif_init,c(val,VIF(lm(form_in,data=in_frame)))) } vif_max<-max(as.numeric(vif_init[,2])) if(vif_max < thresh){ if(trace==T){ #print output of each iteration prmatrix(vif_init,collab=c("var","vif"),rowlab=rep("",nrow(vif_init)),quote=F) cat("n") cat(paste("All variables have VIF < ", thresh,", max VIF ",round(vif_max,2), sep=""),"nn") } return(names(in_frame)) } else{ in_dat<-in_frame #backwards selection of explanatory variables, stops when all VIF values are below "thresh" while(vif_max >= thresh){ vif_vals<-NULL for(val in names(in_dat)){ form_in<-formula(paste(val," ~ .")) vif_add<-VIF(lm(form_in,data=in_dat)) vif_vals<-rbind(vif_vals,c(val,vif_add)) } max_row<-which(vif_vals[,2] == max(as.numeric(vif_vals[,2])))[1] vif_max<-as.numeric(vif_vals[max_row,2]) if(vif_max<thresh) break if(trace==T){ #print output of each iteration prmatrix(vif_vals,collab=c("var","vif"),rowlab=rep("",nrow(vif_vals)),quote=F) cat("n") cat("removed: ",vif_vals[max_row,1],vif_max,"nn") flush.console() } in_dat<-in_dat[,!names(in_dat) %in% vif_vals[max_row,1]] } return(names(in_dat)) } }`
I called you data set dep
you can run it as follows
keep.dat<-vif_func(in_frame=dep,thresh=5,trace=T)
thresh
is the threshold for VIF you can use 10 it depends on what you want. Here are the result of the good variables to use for regression independent of dependent variables you want to use.
var vif Elev 2.34467892681204 pH 4.82456111736694 OM 9.60381685354609 P 6.09927871325235 K 6.55185481475336 SoilTemp 9.76265101226991 AirTemp 10.5786139945657 removed: AirTemp 10.57861 var vif Elev 2.10463008453203 pH 3.12149022393123 OM 5.09603402410369 P 5.56333186256743 K 6.54812601407721 SoilTemp 1.11028184053362 removed: K 6.548126
This should be a good place to start.
Similar Posts:
- Solved – two-way multiple comparisons in R
- Solved – R-Metafor: Meta-analysis and mixed effect model
- Solved – Meta-analysis in R with standard errors instead of standard deviations {metafor}
- Solved – Meta-analysis in R with standard errors instead of standard deviations {metafor}
- Solved – Interpreting random slopes equal to 0 using lmer in R