I was going through the Stan documentation, which can be downloaded from here. I was particularly interested in their implementation of the Gelman-Rubin diagnostic. The original paper, Gelman & Rubin (1992), defines the potential scale reduction factor (PSRF) as follows:

Let $X_{i,1}, \dots, X_{i,N}$ be the $i$th Markov chain sampled, and let there be $M$ independent chains sampled overall. Let $\bar{X}_{i\cdot}$ be the mean of the $i$th chain, and $\bar{X}_{\cdot\cdot}$ the overall mean. Define

$$W = \dfrac{1}{M} \sum_{m=1}^{M} s^2_m,$$

where

$$s^2_m = \dfrac{1}{N-1} \sum_{t=1}^{N} (X_{mt} - \bar{X}_{m\cdot})^2.$$

And define $B$ by

$$B = \dfrac{N}{M-1} \sum_{m=1}^{M} (\bar{X}_{m\cdot} - \bar{X}_{\cdot\cdot})^2.$$

Define $$\hat{V} = \left(\dfrac{N-1}{N}\right)W + \left(\dfrac{M+1}{MN}\right)B.$$

The PSRF is estimated by $\sqrt{\hat{R}}$, where

$$\hat{R} = \dfrac{\hat{V}}{W} \cdot \dfrac{df+3}{df+1},$$

where $df = 2\hat{V}^2/\mathrm{Var}(\hat{V})$.
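
To make the definitions concrete, here is a small NumPy sketch (the function name `psrf` is mine, not from the paper) computing $W$, $B$, $\hat{V}$ and the uncorrected $\hat{R} = \hat{V}/W$ from an $M \times N$ array of draws; the $(df+3)/(df+1)$ factor is omitted since estimating $\mathrm{Var}(\hat{V})$ needs extra bookkeeping:

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor per Gelman & Rubin (1992).

    chains: array of shape (M, N) -- M independent chains of N draws each.
    Returns sqrt(V_hat / W), without the (df+3)/(df+1) correction.
    """
    M, N = chains.shape
    chain_means = chains.mean(axis=1)        # X-bar_{m.}
    s2 = chains.var(axis=1, ddof=1)          # s^2_m, with 1/(N-1) normalization
    W = s2.mean()                            # within-chain variance
    B = N * chain_means.var(ddof=1)          # (N/(M-1)) * sum of squared deviations
    V_hat = (N - 1) / N * W + (M + 1) / (M * N) * B
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(0)
print(psrf(rng.normal(size=(4, 1000))))  # 4 well-mixed chains: value near 1
```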

The Stan documentation, on page 349, ignores the $df$ term and also removes the $(M+1)/M$ multiplicative factor. This is their formula:

The variance estimator is

$$\widehat{\text{var}}^{+}(\theta \,|\, y) = \frac{N-1}{N} W + \frac{1}{N} B.$$

Finally, the potential scale reduction statistic is defined by

$$\hat{R} = \sqrt{\frac{\widehat{\text{var}}^{+}(\theta \,|\, y)}{W}}.$$
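
For comparison, Stan's simplified statistic can be sketched the same way (again, the function name is mine):

```python
import numpy as np

def rhat_stan(chains):
    """Simplified potential scale reduction, sqrt(var_hat_plus / W),
    as given in the Stan documentation / BDA2 (no (M+1)/M factor, no df term)."""
    M, N = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_plus = (N - 1) / N * W + B / N         # var_hat^+(theta | y)
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
print(rhat_stan(rng.normal(size=(4, 1000))))  # near 1 for well-mixed chains
```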

From what I could see, they don't provide a reference for this change of formula, nor do they discuss it. Usually $M$ is not large, and can often be as low as $2$, so $(M+1)/M$ should not be ignored, even if the $df$ term can be approximated by $1$.

So where does this formula come from?

**EDIT:**

I have found a partial answer to the question "*where does this formula come from?*": the Bayesian Data Analysis book by Gelman, Carlin, Stern, and Rubin (second edition) has exactly the same formula. However, the book does not explain how or why it is justifiable to ignore those terms.


#### Best Answer

I followed the specific link given for Gelman & Rubin (1992), and it has $$\hat{\sigma} = \frac{n-1}{n}W + \frac{1}{n}B$$ as in the later versions, although $\hat{\sigma}$ is replaced with $\hat{\sigma}_+$ in Brooks & Gelman (1998) and with $\widehat{\mathrm{var}}^+$ in BDA2 (Gelman et al., 2003) and BDA3 (Gelman et al., 2013).

BDA2 and BDA3 (I couldn't check BDA1 just now) have an exercise with hints for showing that $\widehat{\mathrm{var}}^+$ is an unbiased estimate of the desired quantity.

Brooks & Gelman (1998) has equation 1.1, $$\hat{R} = \frac{m+1}{m}\frac{\hat{\sigma}_+}{W} - \frac{n-1}{mn},$$ which can be rearranged as $$\hat{R} = \frac{\hat{\sigma}_+}{W} + \frac{\hat{\sigma}_+}{Wm} - \frac{n-1}{mn}.$$ We can see that the effect of the second and third terms is negligible for decision making when $n$ is large. See also the discussion in the paragraph before Section 3.1 in Brooks & Gelman (1998).
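
A quick numeric illustration of that rearrangement (the $\hat{\sigma}_+/W$ ratios below are assumed values I chose for illustration, shrinking toward 1 as chains converge with growing $n$; they are not from the paper):

```python
# The second and third terms of the rearranged eq. 1.1 collapse to
#   sigma_plus/W / m - (n-1)/(m*n) = (sigma_plus/W - (n-1)/n) / m,
# which vanishes as sigma_plus/W -> 1, i.e. as the chains converge.
m = 2  # even with only two chains
for n, ratio in [(100, 1.5), (1000, 1.05), (10000, 1.005)]:  # assumed ratios
    extra = ratio / m - (n - 1) / (m * n)
    print(f"n={n:5d}  sigma_plus/W={ratio}  extra terms={extra:.5f}")
```

So although $(m+1)/m$ looks large for $m=2$, the net correction it brings is on the order of $(\hat{R}_{\text{simple}}-1)/m$, which is small precisely in the regime where $\hat{R}$ is used to declare convergence.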

Gelman & Rubin (1992) also had the term with $df$, as $df/(df-2)$. Brooks & Gelman (1998) have a section describing why this $df$ correction is incorrect, and define $(df+3)/(df+1)$ instead. The paragraph before Section 3.1 in Brooks & Gelman (1998) explains why $(df+3)/(df+1)$ can be dropped.

It seems your source for the equations was something post Brooks & Gelman (1998), as you had $(df+3)/(df+1)$ there while Gelman & Rubin (1992) had $df/(df-2)$. Otherwise Gelman & Rubin (1992) and Brooks & Gelman (1998) have equivalent equations (with slightly different notation, and some terms arranged differently). BDA2 (Gelman et al., 2003) no longer has the terms $\frac{\hat{\sigma}_+}{Wm} - \frac{n-1}{mn}$. BDA3 (Gelman et al., 2013) and Stan introduced the split-chains version.

My interpretation of the papers, and my experience using different versions of $\hat{R}$, is that the terms which were eventually dropped can be ignored when $n$ is large, even when $m$ is not. I also vaguely remember discussing this with Andrew Gelman years ago, but if you want to be certain of the history, you should ask him.

> Usually $M$ is not too large, and can often be as low as $2$

I really do hope that this is not often the case. If you want to use the split-$\hat{R}$ convergence diagnostic, you should use at least 4 chains, which after splitting gives $M=8$. You may use fewer chains if you already know that in your specific case the convergence and mixing are fast.
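
The split version simply halves each chain before computing the statistic, so that within-chain trends show up as between-chain disagreement. A sketch (function name mine, reusing the simplified formula from the Stan documentation):

```python
import numpy as np

def split_rhat(chains):
    """Split-R-hat: halve each chain, then apply the simplified
    sqrt(var_hat_plus / W) statistic to the 2*M half-chains."""
    M, N = chains.shape
    # Drop a trailing draw if N is odd, then split each chain into two halves.
    halves = chains[:, : (N // 2) * 2].reshape(2 * M, N // 2)
    m, n = halves.shape
    W = halves.var(axis=1, ddof=1).mean()
    B = n * halves.mean(axis=1).var(ddof=1)
    return np.sqrt(((n - 1) / n * W + B / n) / W)

rng = np.random.default_rng(2)
print(split_rhat(rng.normal(size=(4, 1000))))  # 4 stationary chains: near 1
```

A chain with a slow drift can look fine to the unsplit statistic (all chain means agree) while split-$\hat{R}$ flags it, because the first and second halves of each chain have different means.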

Additional reference:

- Brooks, S. P., and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. *Journal of Computational and Graphical Statistics*, 7(4), 434-455.
