I am modeling some data where I think I have two crossed random effects. But the data set is not balanced, and I'm not sure what needs to be done to account for it.
My data is a set of events. An event occurs when a client meets with a provider to perform a task, which is either successful or not. There are thousands of clients and providers, and each client & provider participates in varying numbers of events (roughly 5 to 500). Each client and provider has a level of skill, and the chance that the task is successful is a function of the skills of both participants. There is no overlap between clients and providers.
I am interested in the respective variances of the population of clients and providers, so we can know which source has a bigger effect on the success rate. I also want to know the specific values of the skills among the client and providers we actually have data for, to identify best/worst clients or providers.
Initially, I want to assume that the probability of success is driven solely by the combined skill levels of the client and provider, with no other fixed effects. So, assuming that x is a factor for the client and y is a factor for provider, then in R (using package lme4) I have a model specified as:
glmer( success ~ (1 | x) + (1 | y), family=binomial(), data=events)
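To make the assumed data-generating process concrete, here is a minimal base-R simulation of it (sample sizes and variance values are invented for illustration): each client and provider draws a latent skill from a normal distribution, and the success probability of an event is the inverse-logit of the sum of the two skills. The `glmer` call above is exactly the model that matches this structure.

```r
set.seed(1)

# illustrative sizes, not taken from the real data
n_clients   <- 200
n_providers <- 200
n_events    <- 5000

# latent skills: the two variances are what glmer's VarCorr() would estimate
skill_x <- rnorm(n_clients,   sd = 1.0)  # client skills
skill_y <- rnorm(n_providers, sd = 0.5)  # provider skills

# random (i.e. balanced-on-average) matching of clients to providers
x <- sample(n_clients,   n_events, replace = TRUE)
y <- sample(n_providers, n_events, replace = TRUE)

# success probability = inverse-logit of combined skill
eta     <- skill_x[x] + skill_y[y]
p       <- 1 / (1 + exp(-eta))
success <- rbinom(n_events, size = 1, prob = p)

events <- data.frame(success = success, x = factor(x), y = factor(y))

# then fit, e.g.:
# library(lme4)
# m <- glmer(success ~ (1 | x) + (1 | y), family = binomial(), data = events)
```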
One problem is that clients are not evenly distributed across providers: higher-skill clients are more likely to be matched up with higher-skill providers. My understanding is that random effects should be uncorrelated with the other predictors in the model, but I'm not sure how to account for this correlation.
Also, some clients and providers have very few events (fewer than 10), while others have many (up to 500), so there's a wide spread in the amount of data we have on each participant. Ideally this would be reflected in a "confidence interval" around each participant's skill estimate (although I think "confidence interval" isn't quite the right term for a predicted random effect).
Are crossed random effects going to be problematic because of the unbalanced data? If so, what are some other approaches I should consider?
As for unbalanced data: glmer handles unbalanced groups without any special adjustment. That was in fact one of the main motivations for developing mixed-model approaches, as compared to repeated-measures ANOVAs, which are restricted to balanced designs. Including clients or providers with few events (even only one) is still better than omitting them, as it improves the estimation of the residual variance (see Martin et al. 2011).
If you want to use the BLUPs (`ranef(model)`) as a proxy for skill, you will indeed have to estimate the uncertainty around these point predictions. In a frequentist framework this can be done with `ranef(model, condVar = TRUE)` (the argument was called `postVar` in older versions of lme4), or through the posterior distribution in a Bayesian framework. You should, however, not use BLUPs as the response variable in further regression models: see Hadfield et al. (2010) for examples of misuses of BLUPs and for methods that adequately take their uncertainty into account.
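A sketch of how those conditional uncertainties can be pulled out of a fitted lme4 model (here `m` is assumed to be the `glmer` fit from the question, and the ±1.96 interval is only an approximate normal interval on the logit scale):

```r
library(lme4)

# conditional modes (BLUPs) together with their conditional SDs
re <- as.data.frame(ranef(m, condVar = TRUE))
# columns: grpvar (x or y), term, grp, condval, condsd

# approximate 95% interval for each participant's skill, on the logit scale
re$lower <- re$condval - 1.96 * re$condsd
re$upper <- re$condval + 1.96 * re$condsd

# e.g. the five clients with the highest predicted skill
clients <- re[re$grpvar == "x", ]
head(clients[order(clients$condval, decreasing = TRUE), ], 5)
```

Participants with few events will show visibly wider intervals, which directly addresses the concern about the spread in events per participant.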
As for the correlation of skills between clients and providers: this imbalance could be problematic if it is very strong, as it would prevent correctly estimating the variance attributable to each random effect. There does not seem to be a mixed-models framework that easily handles correlation between the random intercepts of two crossed factors (see here for a formal expression of your problem). Could you say more precisely how correlated the average success rates of clients and providers are?
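One quick way to gauge that correlation from the raw data, before worrying about the model, is to compare the observed average success rate of each event's client with that of its provider (base R only; the tiny `events` table here is invented for illustration):

```r
# toy events table: replace with the real data
events <- data.frame(
  x       = factor(c(1, 1, 2, 2, 3, 3, 4, 4)),  # client id
  y       = factor(c(1, 1, 1, 2, 2, 3, 3, 3)),  # provider id
  success = c(1, 1, 1, 0, 1, 0, 0, 1)
)

# raw average success per client and per provider
client_rate   <- tapply(events$success, events$x, mean)
provider_rate <- tapply(events$success, events$y, mean)

# attach both rates to each event, then correlate across events
cx <- client_rate[as.character(events$x)]
py <- provider_rate[as.character(events$y)]
cor(cx, py)
```

Keep in mind that these raw per-participant rates are very noisy for participants with few events, so this is only a rough diagnostic, not a substitute for the model-based variance estimates.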