I have a data set with repeated measurements on subjects. The total sample size is $n=118$ and the number of clusters (i.e. subjects) is $m=49$. The smallest cluster is of size 2 and the largest cluster is of size 4. In fact, about 60% of the clusters are of size 2, i.e. only two observations per subject. The outcome variable is continuous and there are 5 covariates of interest.

My plan is to fit a linear mixed model (LMM) with a random subject effect, i.e. a random intercept term. If the normality assumption of the residuals should not be satisfied and transformations do not solve the issue, I would then use GEE with identity link (i.e. a marginal model) to model the data since it does not require the normality assumption.
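For concreteness, the two candidate models can be written roughly as follows. The notation ($y_{ij}$, $x_{ij}$, $b_i$, $n_i$, $R(\alpha)$) is introduced here only for illustration. Subject $i = 1, \dots, m$ has observations $j = 1, \dots, n_i$, with $\sum_i n_i = n$.

$$
\text{LMM (random intercept):}\quad y_{ij} = x_{ij}^\top \beta + b_i + \varepsilon_{ij}, \qquad b_i \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

$$
\text{GEE (identity link):}\quad E[y_{ij} \mid x_{ij}] = x_{ij}^\top \beta, \qquad \sum_{i=1}^{m} X_i^\top V_i^{-1} \left( y_i - X_i \beta \right) = 0
$$

Here $V_i$ is the working covariance matrix of subject $i$, built from a working correlation $R(\alpha)$. The random-intercept LMM implies an exchangeable (compound-symmetry) correlation within a subject, so an exchangeable working correlation would be the natural GEE counterpart, although that choice is my assumption rather than something fixed by the plan above.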

However, before embarking on this adventure, I have some concerns about the asymptotic properties of LMMs and GEEs. I know that for GEE the asymptotic behaviour depends on the number of clusters $m$ (e.g. Li and McKeague, Statistica Sinica, 2013).

Are there any guidelines or recommendations for the number of clusters $m$, the number of observations $n$, and the minimum/maximum cluster size for the two methods?

#### Best Answer

Recommendations for the number of groups and units per group are useful at the study design phase. At this point in your research, you can only hope to produce decent estimates with the data that you have at hand, so the literature on doing that is probably what you should be studying, and those are the questions that you should be asking.

The behavior of GEEs is asymptotic in $m$, so the more clusters the merrier. I would say that $m=50$ is a moderate sample size (Teerenstra et al. (2010) give references, behind paywalls, that in turn say that fewer than $m=40$ clusters is insufficient); $m=200$ will probably be large enough for approximately symmetric data (though not necessarily for low proportions, such as single-digit percentages, or for badly overdispersed count data), while $m=10$ will be too small to really trust the asymptotics.

Hox (1998) suggests that the number of groups should be $m \ge 50$ and the group sizes $n/m \ge 20$ for the asymptotics of mixed models (which social scientists call multilevel models) to work well. (Think of this recommendation as a modest-sized educational study: enough classrooms/teachers, with 20 students per class.) Maas and Hox (2005) updated these recommendations and studied smaller $m$, down to $m=10$, but they never thought of dropping the number of units per group anywhere below 5. Some studies with a low number of units per group come from Bell, Ferron and Kromrey (2008, 2010), although they concentrated on the proportion of singletons (groups of size 1) rather than on group size per se.
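To see where a given design stands relative to the thresholds quoted above, a minimal sketch like the following may help. The `Design` class, the function name and the rule labels are hypothetical, introduced here only for illustration. The cut-offs are just the numbers cited in this answer, treated as rough rules of thumb rather than hard limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical container for the design quantities discussed above."""
    n_obs: int       # total number of observations, n
    n_clusters: int  # number of clusters (subjects), m

    @property
    def mean_cluster_size(self) -> float:
        # Average number of observations per cluster, n / m.
        return self.n_obs / self.n_clusters


def check_rules_of_thumb(design: Design) -> dict:
    """Compare a design with the rough thresholds cited in the text.

    The thresholds are rules of thumb from the cited literature,
    not guarantees that the asymptotics hold or fail.
    """
    m = design.n_clusters
    size = design.mean_cluster_size
    return {
        "GEE: m >= 40 (references in Teerenstra et al., 2010)": m >= 40,
        "LMM: m >= 50 (Hox, 1998)": m >= 50,
        "LMM: n/m >= 20 (Hox, 1998)": size >= 20,
        "LMM: n/m >= 5 (smallest group size in Maas and Hox, 2005)": size >= 5,
    }


# The design from the question: n = 118 observations on m = 49 subjects.
design = Design(n_obs=118, n_clusters=49)
print(f"mean cluster size: {design.mean_cluster_size:.2f}")
for rule, met in check_rules_of_thumb(design).items():
    print(f"{rule}: {'met' if met else 'not met'}")
```

For the data in the question, only the $m \ge 40$ rule for GEE is met. The mean cluster size is about 2.4, well below the group sizes that the mixed-model simulation studies considered.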
