What are LS means useful for?

I have recently learned about LS means (estimated marginal means, predicted marginal means) and I am trying to understand what they could be used for and under what circumstances.

For concreteness, consider a dependent variable $y$ and two categorical independent variables, $x_1$ with two categories and $x_2$ with three categories. One could create dummy variables corresponding to these categories and call them $d_{1,1}, d_{1,2}$ and $d_{2,1}, d_{2,2}, d_{2,3}$. One could then have a linear model (without interaction terms)
$$y = \beta_0 + \beta_{1,2} d_{1,2} + \beta_{2,2} d_{2,2} + \beta_{2,3} d_{2,3} + \varepsilon$$
where $d_{1,1}$ and $d_{2,1}$ are the reference categories. LS means for $x_1$ would be
$$\begin{aligned}
\bar y_{1,1} &= \beta_0 + \frac{1}{3}(\beta_{2,2} + \beta_{2,3}), \\
\bar y_{1,2} &= \beta_0 + \beta_{1,2} + \frac{1}{3}(\beta_{2,2} + \beta_{2,3}).
\end{aligned}$$
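
As far as I understand it, the factor $\frac{1}{3}$ comes from averaging the model's predictions over the three levels of $x_2$ with equal weights. A short derivation for $x_1$ in its first category, where $\hat y(j,k)$ is my own shorthand (not standard notation) for the model prediction at $x_1 = j$, $x_2 = k$:

$$\begin{aligned}
\bar y_{1,1} &= \tfrac{1}{3}\bigl[\hat y(1,1) + \hat y(1,2) + \hat y(1,3)\bigr] \\
&= \tfrac{1}{3}\bigl[\beta_0 + (\beta_0 + \beta_{2,2}) + (\beta_0 + \beta_{2,3})\bigr] \\
&= \beta_0 + \tfrac{1}{3}(\beta_{2,2} + \beta_{2,3}).
\end{aligned}$$

The same steps for the second category add $\beta_{1,2}$ to every term, so the two LS means differ by exactly $\beta_{1,2}$.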

Uses I can think of
Given $x_1$ and $x_2$, the best (in MSE sense) prediction of $y$ is $\beta_0 + \beta_{1,2} d_{1,2} + \beta_{2,2} d_{2,2} + \beta_{2,3} d_{2,3}$. This is also the expected result after treatment if $x_1$ and/or $x_2$ are interpreted as levels of treatment.
Given $x_1$ alone, the best prediction of $y$ is the sample mean within the category, $\frac{1}{n_j}\sum_{i=1}^n y_i \mathbb{1}_{d_{1,j}=1}$, for $x_1$ being in category $j$, where $n_j$ is the number of observations in that category. This is also the expected result after treatment if $x_1$ is interpreted as levels of treatment.
Neither of these coincides with $\bar y_{1,1}$ or $\bar y_{1,2}$.
I get that

Least-squares means [are] predictions from a model over a regular grid, averaged over zero or more dimensions

(which is the Wiki excerpt for the tag), but what is the practical use of that?
So far I can see only one situation in which this could be useful: when we know that, in the population, the proportion of observations that have $d_{i,j}=1$ and $d_{k,l}=1$ is the same for all combinations of $i,j,k,l$. Is that the intended use of LS means? Or can they be useful for description or hypothesis testing?
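
To spell out that one situation: write $\mu_{j,k}$ for the expected value of $y$ in the cell $x_1 = j$, $x_2 = k$, and $p_{k \mid j}$ for the population share of level $k$ of $x_2$ among observations with $x_1 = j$ (both symbols are my own notation). Then, as I understand it,

$$E[y \mid x_1 = j] = \sum_{k=1}^{3} p_{k \mid j}\,\mu_{j,k}, \qquad \bar y_{1,j} = \sum_{k=1}^{3} \tfrac{1}{3}\,\mu_{j,k},$$

so the LS mean equals the population mean for category $j$ whenever $p_{k \mid j} = \frac{1}{3}$ for every $k$, which holds in particular when all cells are equally frequent.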

I disagree strongly with the "only situation" in the OP. EMMs (estimated marginal means, more restrictively known as least-squares means) are very useful for heading off a Simpson's paradox situation in evaluating the effects of a factor. In your example, consider a scenario where these three things are true:

  • When $x_2$ is held at any fixed level, the lowest mean response occurs at $x_1=1$.
  • For $x_1$ held fixed at either level, the highest mean response occurs when $x_2=3$.
  • The combination $(x_1=1, x_2=3)$ has a disproportionately large sample size, while $(x_1=1,x_2=1)$ and $(x_1=1,x_2=2)$ have small sample sizes.

Then it is possible that the marginal mean for $x_1=1$ is higher than that for $x_1=2$, even though the mean for $x_1=1$ is less than that for $x_1=2$ for each $x_2$.

If one instead computes EMMs, the observed means at $x_1=1$ and $x_2=1,2,3$ receive equal weight, so that the EMM for $x_1=1$ is less than that for $x_1=2$.
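
Here is a minimal sketch of that scenario in Python. The cell means and sample sizes are made up purely for illustration (they are not taken from the emmeans vignette), and the cell means are exactly additive, so a no-interaction model would reproduce them as its predictions. The function names are hypothetical helpers, not part of any package.

```python
# Hypothetical cell means and sample sizes, keyed by (x1, x2).
# Within every level of x2, the mean at x1=1 is lower than at x1=2 (by 1).
cell_mean = {
    (1, 1): 10.0, (1, 2): 12.0, (1, 3): 20.0,
    (2, 1): 11.0, (2, 2): 13.0, (2, 3): 21.0,
}
# x1=1 is concentrated in the high-response level x2=3.
cell_n = {
    (1, 1): 5, (1, 2): 5, (1, 3): 90,
    (2, 1): 40, (2, 2): 40, (2, 3): 20,
}
X2_LEVELS = (1, 2, 3)


def weighted_marginal_mean(x1):
    """Ordinary marginal mean: cells weighted by their sample sizes."""
    total = sum(cell_n[(x1, k)] * cell_mean[(x1, k)] for k in X2_LEVELS)
    return total / sum(cell_n[(x1, k)] for k in X2_LEVELS)


def equal_weight_mean(x1):
    """EMM-style mean: every level of x2 gets the same weight."""
    return sum(cell_mean[(x1, k)] for k in X2_LEVELS) / len(X2_LEVELS)


for x1 in (1, 2):
    print(x1, round(weighted_marginal_mean(x1), 2), round(equal_weight_mean(x1), 2))
```

With these numbers the sample-size-weighted marginal means are 19.1 for $x_1=1$ and 13.8 for $x_1=2$, which reverses the within-cell ordering, while the equally weighted means are 14.0 and 15.0, which preserves it.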

EMMs are comparable to what is termed "unweighted means analysis" in old experimental design texts. The idea was useful many decades ago, and it still is.

The "basics" vignette for the R package emmeans has a concrete illustration and some discussion of such issues.


I have spent the last 5 years or so developing/refining R packages for such purposes, so I'm not exactly an objective observer. I hope to hear other perspectives.
