**The CV Question**

I'm trying to give (a) detailed and concise mathematical representation(s) of a mixed effects model. I am using the `lme4`

package in R. What is the correct mathematical representation for my model?

**The Data, Science Question, and R Code**

My data set consists of species in different regions. I'm testing if a species' prevalence changes in the time leading up to an extinction (extinctions aren't necessarily permanent; it can recolonize), or following a colonization.

`lmer(prevalence ~ time + time:type + (1 + time + type:time | reg) + (1 + time + type:time | reg:spp))`

**Prevalence**is the proportion of strata occupied by a species in a region-year**Time**is a continuous variable that indicates the time to either extinction or colonization; it is always positive**Type**is a categorical variable with two levels. These two levels are “-” and “+”. When type is -, it’s a colonization (default level). When type is +, it’s an extinction.**Reg**is a categorical variable with nine levels, indicating the region**Spp**is a categorical variable; the number of levels varies among regions, and varies between 48 levels and 144 levels.

In words: response variable is prevalence (proportion of strata occupied). Fixed effects included 1) and intercept, 2) time from event, and 3) the interaction between time to event and the type of event (colonization or extinction). Each of these 3 fixed effects varied randomly among regions. Within a region, each of the effects varied randomly among species.

**I'm trying to figure out how to write the mathematical equation for the model.** I think I understand what's going on in the R code (although, I'm sure I have some knowledge gaps, and hopefully writing out the formal mathematical expression will improve my understanding).

I have searched through the web and through these forums quite a bit. I found tons of useful information, to be sure (and maybe I'll link to some of these in an edit to this question). However, I couldn't quite find that "Rosetta Stone" of R-code translated to math (I'm more comfortable with code) that would really help me confirm I've got these equations right. In fact, I know there are some gaps already, but we'll get to that.

**My Attempt**

The basic form of a mixed effects model, in matrix notation is (to my understanding): $$ Y = X beta + Z gamma + epsilon$$

$$ X = begin{bmatrix} 1 & Delta t & Delta t_{+} \ vdots & vdots & vdots \ 1 & Delta t_n & Delta t_{+,n} end{bmatrix} $$

$$ beta^{'} = begin{bmatrix} beta_0 & beta_1 & beta_2 end{bmatrix}$$

$$ Z = begin{bmatrix} 1 I(r_1) & Delta t I(r_1) & Delta t_{+} I(r_1) & dots & 1 I(r_9) & Delta t I(r_9) & Delta t_{+} I(r_9) \ vdots & vdots & vdots & ddots & vdots & vdots & vdots \ 1 I(r_{1,n}) & Delta t_n I(r_{1,n}) & Delta t_{+,n} I(r_{1,n}) & dots & 1 I(r_{9,n}) & Delta t I(r_{9,n}) & Delta t_{+,n} I(r_{9,n}) \ end{bmatrix}$$

$$ gamma^{'} = begin{bmatrix} gamma_{0,1} & gamma_{1,1} &gamma_{2,1} & dots & gamma_{0,9} & gamma_{1,9} &gamma_{2,9} end{bmatrix}$$

$$ epsilon sim mathcal{N}(0,Sigma)$$

- $X$ is the design matrix for the fixed effects, $Delta t$ is the time after colonization (
`time`

), and $Delta t_{+}$ is the time after extinction (`time:type`

) - $Z$ is the design matrix for the random effects (level 1?), I() is the indicator function giving 1 if the sample belongs to the designated region and 0 otherwise, r is indexed to indicates one of the nine regions.
- $beta$ and $gamma$ contain parameters
- $epsilon$ is errors; I'm not entirely sure how to explain $Sigma$, though I realize one of these variance/covariance matrices will express covariances among slopes and intercepts, e.g.

Assuming things so far are ~correct, that means I'm good at the top level. However, explaining the species-specific variation on the parameters, which is nested within each region, stumped me even more.

But I took a crack at something that maybe makes sense …

Each of the parameters in $gamma$ is derived from a linear combination of species-specific predictors and parameters within a region. For each region , there are 3 rows of , corresponding to the 3 predictor variables. Each $gamma$ can be individually expressed as

- $ gamma_{p,r} = U_{p,r} b_{p,r} + eta_{p,r} $
- where $U_{p,r}$ is a design matrix specific to region $r$ and predictor $p$, $b_{p,r}$ is a 1 by S matrix of parameters for the region (richness in the region = $S$, e.g. 48 or 144), and $eta_{p,r}$ is a matrix of error terms

Specifically, for a given region, each of the $gamma_{p,r}$ would be:

$$ gamma_{0,r} = U_{0,r} b_{0,r} + eta_{0,r} $$

$$ gamma_{0,r} = begin{bmatrix} 1 I(s_1) dots 1 I(s_S) end{bmatrix} + begin{bmatrix} b_{0,1}\ vdots \ b_{0,S} end{bmatrix} + eta_{0,r} $$

$$ gamma_{1,r} = U_{1,r} b_{1,r} + eta_{1,r} $$

$$ gamma_{1,r} = begin{bmatrix} Delta t I(s_1) dots Delta t I(s_S) end{bmatrix} + begin{bmatrix} b_{1,1}\ vdots \ b_{1,S} end{bmatrix} + eta_{1,r} $$

$$ gamma_{2,r} = U_{2,r} b_{2,r} + eta_{2,r} $$

$$ gamma_{2,r} = begin{bmatrix} Delta t_+ I(s_1) dots Delta t_+ I(s_S) end{bmatrix} + begin{bmatrix} b_{2,1}\ vdots \ b_{2,S} end{bmatrix} + eta_{2,r} $$

That would be repeated for each region. Then, $ eta sim mathcal{N}(0,Sigma_{eta}) $, like $epsilon$. Although, perhaps instead of $Sigma$, there is another letter, like $G$, that is commonly used.

**Edit: other Q/A's that were somewhat helpful**

**Contents**hide

#### Best Answer

If I understood the code correctly, why not simply writing something like

$$y_{i} = Big(alpha + nu_{j[i]}^{(alpha)} + eta_{k[i]}^{(alpha)}Big) + Big(beta + nu_{j[i]}^{(beta)} + eta_{k[i]}^{(beta)}Big)T_{i} + Big(delta + nu_{j[i]}^{(delta)} + eta_{k[i]}^{(delta)}Big)(T_{i} * Z_{i}) + epsilon_i$$ with $$begin{aligned} Big[nu_{j}^{(alpha)}, nu_j^{(beta)}, nu_j^{(delta)}Big] &sim text{Multi-Normal}(mathbf 0, boldsymbol Sigma_nu) \ Big[eta_{j}^{(alpha)}, eta_j^{(beta)}, eta_j^{(delta)}Big] &sim text{Multi-Normal}(mathbf 0, boldsymbol Sigma_eta)\ epsilon_i & sim text{Normal}(0, sigma_epsilon) end{aligned} $$ or, if the first equation is too long, something like $$y_{i} = alpha_{j[i],k[i]} + beta_{j[i],k[i]}T_{i} + delta_{j[i],k[i]}(T_i * Z_i) + epsilon_i$$ and $$begin{aligned} alpha_{j[i],k[i]} &= alpha + nu_{j}^{(alpha)} + eta_{k}^{(alpha)} \ beta_{j[i],k[i]}&=beta + nu_{j}^{(beta)} + eta_{k}^{(beta)}\ delta_{j[i],k[i]}&=delta + nu_{j}^{(delta)} + eta_{k}^{(delta)}\ end{aligned}$$ with the same covariance structure as above? It shows the nested structure of the data as well as which coefficients vary across which levels.