I'm wondering what the optimal sampling strategy would be for my dissertation research. I have four data sources (two meta-repositories of open source software projects and two global startup databases). I'd like to perform EFA to discover (or, rather, to confirm my theory-based assumptions about) the factor structure of the study's constructs. Then I plan to perform CFA to assess the validity and reliability of the measurement model. Finally, I plan to perform SEM analysis to test the study's structural model and hypotheses. I plan to perform the data analysis (at least SEM; I'm not sure about EFA/CFA) on two data sets: pilot and main. I believe that the pilot analysis will allow me to modify the model if the fit indices are inadequate. Then I plan to perform the main SEM analysis of the modified model (and, possibly, alternative models) on the main data set. In addition, I plan to perform both covariance-based (CB) and partial least squares (PLS) SEM analyses in order to compare them for my study. What would be the **optimal approach and its steps** in terms of the following:

1. Sampling technique. I was thinking about randomized sampling of data from each OSS meta-repo or from a merged data set; then selecting corresponding data from startup databases.
2. Strategy on dividing the sample data set into pilot and main data sets.
3. Any special steps for sampling due to multiple methods (EFA/CFA/SEM).
4. Any special steps for sampling due to alternative models.
5. Any special steps for sampling due to analyzing via both CB-SEM and PLS-SEM.

Bonus question: 🙂 I plan my study as cross-sectional, but the data in the meta-repositories do not cover exactly the same time frames. For example, the data from seven OSS repositories fall within the range from September 2012 to December 2013. I think that in the OSS world the variation in projects' characteristics over such a period should not be dramatic, as the OSS ecosystem is not very dynamic on average. **The question is whether this semi-cross-sectional approach will allow me to retain statistical validity, and what statistical tests exist to confirm that.**

Your help and advice on this are greatly appreciated!


#### Best Answer

I'm personally not aware of sampling considerations specific to FA and SEM, and I doubt that an optimal approach exists. However, you might want to keep sample *size* in mind. I found some sources on the subject that might help:

- Sample Size Considerations in Factor Analysis and Latent Class Analysis (slides)
- Sample Size requirements for Structural Equation Models
- Lower Bounds on Sample Size in Structural Equation Modeling

With regard to your EFA -> CFA methodology, that's not a good idea if both analyses are run on the same data. See the related question: "Why is it wrong to discover factors using EFA then use CFA on the same data to confirm that factor model?"
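If you do want to keep the exploratory and confirmatory steps on separate data, one common way is to randomly partition the sample into disjoint subsets before any modelling. Below is a minimal sketch of that idea, not a recommendation of specific proportions: the function name `split_sample`, the subset names, the 20/40/40 shares, and the 1000 project IDs are all hypothetical and chosen only for illustration.

```python
import random


def split_sample(ids, fractions, seed=0):
    """Randomly partition ids into disjoint, named subsets.

    fractions maps a subset name to its share of the sample;
    the shares must sum to 1. (Hypothetical helper for illustration.)
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)

    names = list(fractions)
    subsets, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last subset takes the remainder so no record is dropped.
            end = len(shuffled)
        else:
            end = start + round(fractions[name] * len(shuffled))
        subsets[name] = shuffled[start:end]
        start = end
    return subsets


# Assumed example: 1000 project IDs split into pilot, EFA, and CFA/SEM parts.
parts = split_sample(
    range(1000),
    {"pilot": 0.2, "efa": 0.4, "cfa_sem": 0.4},
    seed=42,
)
```

Each record ends up in exactly one subset, so the factor structure found on the `efa` part is tested on data it has not seen. Keep in mind that every subset still has to be large enough on its own, which is where the sample size sources above come in.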