I am planning to do a multigroup SEM analysis. I gathered survey data and calculated a survey weight. Some of my variables have item nonresponse (mostly around 5% missing values).
I've decided to use multiple imputation to handle the missing data. First, I used the LittleMCAR() test to check the missingness mechanism. I also used TestMCARNormality() from Jamshidian et al., which includes a nonparametric test of MCAR based on homogeneity of covariances. The latter didn't reject MCAR; the LittleMCAR test did (p = 8.3%). Because I assume my data to be MAR, I split the data into men and women and applied the LittleMCAR() test to each subgroup. This time MCAR was not rejected in either subgroup.
I've read (see: Enders, C., & Gottschall, A. (2011). Multiple Imputation Strategies for Multiple Group Structural Equation Models. Structural Equation Modeling: A Multidisciplinary Journal, 35-54.) that if I plan to do a multigroup SEM analysis, I should run a separate multiple imputation for each group (in this case: men/women). I will use the R package MICE for the imputation.
Now my questions:
1.) Should I use the default "massive imputation" predictor matrix from MICE, predictorMatrix = (1 - diag(1, ncol(data))), which uses all variables in the dataset as predictors in the imputation model, or should I use quickpred() to generate a predictor matrix? quickpred() uses criteria such as the correlation between predictor and target variable to select a set of predictors for each variable to be imputed:
quickpred(datensatz_gender_0, include=c("weight_trunc"),exclude=c("ID","X","gender"),mincor = 0.1)
2.) Should I include the survey weight in the predictor matrix?
After imputation, the list of imputed datasets will be passed to the survey package (for weighting purposes). I will then use lavaan to specify my model, which will use the survey object built from the imputed data. This lavaan model will then be passed to lavaan.survey(), so I can use the survey weights together with the imputed data. As far as I've understood, lavaan.survey will then pool the results…
It would be great if somebody could answer these questions.
(I'm the creator of lavaan.survey)
As Stas already indicated, the combination (multiple imputation * complex sampling) can be tricky business. The main papers are Kott (1995) and Kim, Brick & Fuller (2006).
Here are some considerations:
As mentioned by Stas, all the usual best practices of MI apply. Given the points below, I would probably not use quickpred() initially: there is a risk that it will discard predictors you actually need (such as the weights). Making a reasonable subselection by hand might still help, though.
If you have weights, these need to be included in the imputation model as a covariate (Kim et al. 2006, p. 518). Since you are doing multiple group analysis ("domain estimation"), you also need to include the interaction between the group dummies and the weights in the imputation model (p. 519).
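As a rough illustration of the two points above, the following sketch starts from the full predictor matrix and adds the weight and a weight-by-group product term as predictors. The toy data and all column names (ID, gender, weight, y1 to y3, weight_x_gender) are made up for the example and are not from the question. It assumes a pooled imputation over both groups; if you impute men and women separately, the weight coefficient already differs by group, so the product term is only needed in the pooled case.

```r
library(mice)

# Toy data; every column name is a hypothetical stand-in for the real survey variables.
set.seed(1)
n <- 200
dat <- data.frame(
  ID     = seq_len(n),
  gender = rbinom(n, 1, 0.5),
  weight = runif(n, 0.5, 3),
  y1     = rnorm(n),
  y2     = rnorm(n),
  y3     = rnorm(n)
)
dat$y1[sample.int(n, 10)] <- NA
dat$y2[sample.int(n, 10)] <- NA

# Product of group dummy and weight; it is fully observed, so it is never imputed itself.
dat$weight_x_gender <- dat$weight * dat$gender

# Full predictor matrix (rows = variables to impute, columns = predictors).
pred <- make.predictorMatrix(dat)
pred[, "ID"] <- 0  # the ID column carries no information, so drop it as a predictor

# Default methods (pmm for numeric columns); m and the seed are arbitrary choices.
imp <- mice(dat, m = 5, predictorMatrix = pred, seed = 123, printFlag = FALSE)
```

The same matrix could be thinned out by hand afterwards by setting individual entries to 0, which keeps the weight columns in the model for certain.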
If you have strata and clusters, things become more complicated. The imputation model needs to account for the resulting correlation between the observations. If it does not, you will get the wrong standard errors (Kim et al. 2006: p. 514). A model-based way of doing this might be to include strata as fixed effects and clusters as random effects in a Bayesian imputation model. A more survey-like approach would be to follow Stas' suggestion and use a resampling procedure that respects the strata and clusters. For example, in bootstrapping and with just the clusters, you would sample clusters (PSUs) with replacement and then individuals (second-stage units) with replacement within the sampled clusters.
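The cluster-only bootstrap described above could look roughly like this in base R. This is a minimal sketch, not a full survey bootstrap: it ignores strata and weights, and the function name cluster_bootstrap and the column names psu and boot_psu are invented for the example.

```r
# Two-stage bootstrap: draw PSUs with replacement, then draw individuals
# with replacement inside each drawn PSU.
cluster_bootstrap <- function(data, cluster_col) {
  clusters <- unique(data[[cluster_col]])
  # Stage 1: resample cluster labels (index-based to avoid sample()'s scalar pitfall).
  drawn <- clusters[sample.int(length(clusters), replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(i) {
    members <- data[data[[cluster_col]] == drawn[i], , drop = FALSE]
    # Stage 2: resample individuals within the drawn cluster.
    rows <- sample.int(nrow(members), replace = TRUE)
    resampled <- members[rows, , drop = FALSE]
    # A PSU drawn twice is treated as two distinct clusters in the replicate.
    resampled$boot_psu <- i
    resampled
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy data: 5 clusters of 4 individuals each.
set.seed(42)
toy <- data.frame(psu = rep(1:5, each = 4), y = rnorm(20))
boot <- cluster_bootstrap(toy, "psu")
```

Each bootstrap replicate would then be imputed and analysed in turn, so that the variation across replicates reflects both the sampling design and the imputation.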
Another advantage of Stas' resampling suggestion, even without strata and clusters, is that you will account for the uncertainty about the parameters of the imputation model including that caused by the weights. I am not sure if mice does this accurately by default. This is usually a relatively small additional term in the variance but it might make a difference.
Once you have the multiply imputed datasets, you can just pass these as an imputationList to lavaan.survey (see the JSS lavaan.survey paper). lavaan.survey will then do all the usual MI pooling calculations for you. So you don't need to manually fit a model separately for each imputation!
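A minimal sketch of that workflow, assuming mice, mitools, survey, lavaan and lavaan.survey are all installed. The toy data, the variable names (weight, y1 to y4) and the one-factor model are made up. To stay small it fits a single-group model and uses weights only (no strata or clusters); for the multigroup case the lavaan fit would carry lavaan's group argument.

```r
library(mice)
library(mitools)
library(survey)
library(lavaan)
library(lavaan.survey)

# Toy one-factor data with a hypothetical survey weight.
set.seed(2)
n <- 300
f <- rnorm(n)
dat <- data.frame(
  weight = runif(n, 0.5, 3),
  y1 = 0.8 * f + rnorm(n, sd = 0.6),
  y2 = 0.8 * f + rnorm(n, sd = 0.6),
  y3 = 0.8 * f + rnorm(n, sd = 0.6),
  y4 = 0.8 * f + rnorm(n, sd = 0.6)
)
dat$y1[sample.int(n, 15)] <- NA
dat$y2[sample.int(n, 15)] <- NA

# Multiple imputation; the weight is a predictor because it is a column of dat.
imp <- mice(dat, m = 5, seed = 123, printFlag = FALSE)
imp_list <- lapply(seq_len(imp$m), function(i) mice::complete(imp, i))

# Survey design object built on the list of imputed datasets.
des <- svydesign(ids = ~1, weights = ~weight, data = imputationList(imp_list))

# The lavaan fit only supplies the model structure; any one imputed dataset will do.
model <- "f =~ y1 + y2 + y3 + y4"
fit <- cfa(model, data = imp_list[[1]])

# Refit under the survey design; estimates are pooled over the imputations.
fit_svy <- lavaan.survey(lavaan.fit = fit, survey.design = des)
summary(fit_svy)
```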
Hope this helps,
All the best, Daniel
P.S. Thanks to Stas and @Gaming_dude who brought this post to my attention. I would be happy to continue the conversation (here, lavaan Google discussion group, twitter, email..)!