# Solved – Proof of Point-Biserial Correlation being a special case of Pearson Correlation

I have been examining the point-biserial correlation as a statistic for measuring the relationship between a dichotomous variable and a continuous one. Wikipedia et al. seem to concur that the point-biserial correlation is a special case of the Pearson correlation, but I cannot find a proof of this, algebraic or otherwise, and that makes me wary of using it in my research (I need to do some statistical confidence testing afterwards). I have tried deriving the result myself, but I keep going round in circles.


Let the \$n\$ data consist of \$n_0 \gt 0\$ \$(x, 0)\$ pairs and \$n_1 \gt 0\$ \$(x, 1)\$ pairs. Their Pearson correlation coefficient is the same as that of the reversed data, consisting of the corresponding \$(0,x)\$ and \$(1,x)\$ pairs, because correlation is symmetric in its two arguments. Because the first coordinates take exactly two distinct values, a least-squares line can fit both group means exactly, so the regression line of the reversed data must pass through the mean points \$(0,M_0)\$ and \$(1,M_1)\$, whence it has slope \$(M_1-M_0)/(1-0) = M_1-M_0\$. The correlation coefficient is obtained by standardizing this slope: it must be multiplied by the standard deviation of the first coordinates and divided by the standard deviation of the second coordinates (the original \$x\$ values, with the standard deviation computed using divisor \$n\$), written \$s_n\$. The standard deviation of the first coordinates is readily computed from the fact that they consist of \$n_0\$ zeros and \$n_1\$ ones; it equals

\$\$\sqrt{\frac{n_1}{n}\left(1-\frac{n_1}{n}\right)} = \sqrt{\frac{n_0 n_1}{n^2}}.\$\$

Consequently the Pearson correlation coefficient is

\$\$r = \frac{M_1-M_0}{s_n}\sqrt{\frac{n_0 n_1}{n^2}},\$\$

which is precisely the Wikipedia formula for the point-biserial coefficient.
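For anyone who prefers a purely algebraic check over the regression argument, the same result can be sketched directly from the definition of covariance. The notation here is my own assumption, not part of the original answer: \$y_i \in \{0,1\}\$ is the group indicator, \$\bar{x}\$ and \$\bar{y}\$ are the overall means, \$s_y\$ is the standard deviation of the indicator, and all variances and covariances use divisor \$n\$. Since \$\sum_i x_i y_i = n_1 M_1\$, \$\bar{y} = n_1/n\$ and \$\bar{x} = (n_0 M_0 + n_1 M_1)/n\$,

\$\$\begin{aligned}
\operatorname{cov}(x,y) &= \frac{1}{n}\sum_i x_i y_i - \bar{x}\,\bar{y} = \frac{n_1 M_1}{n} - \frac{n_0 M_0 + n_1 M_1}{n}\cdot\frac{n_1}{n} \\
&= \frac{n_1\left(n M_1 - n_0 M_0 - n_1 M_1\right)}{n^2} = \frac{n_0 n_1\,(M_1 - M_0)}{n^2},
\end{aligned}\$\$

where the last step uses \$n - n_1 = n_0\$. Dividing by the two standard deviations then gives

\$\$r = \frac{\operatorname{cov}(x,y)}{s_n\, s_y} = \frac{n_0 n_1\,(M_1-M_0)/n^2}{s_n\sqrt{n_0 n_1/n^2}} = \frac{M_1-M_0}{s_n}\sqrt{\frac{n_0 n_1}{n^2}}.\$\$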

In the accompanying scatterplot of the reversed data, the heights of the red dots depict the mean values \$M_0\$ and \$M_1\$ of each vertical strip of points. The dashed gray line is the regression line.
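A quick numerical check can make the identity more concrete. The sketch below is only an illustration under stated assumptions: it uses NumPy, simulated data, and a hypothetical helper `point_biserial` that is not from the original answer. It computes the point-biserial formula with the divisor-\$n\$ standard deviation and compares it with the ordinary Pearson coefficient and the regression slope.

```python
import numpy as np


def point_biserial(x, g):
    """Point-biserial coefficient via (M1 - M0) / s_n * sqrt(n0 * n1 / n^2).

    Hypothetical helper for illustration; g must contain only 0s and 1s,
    with at least one of each.
    """
    x = np.asarray(x, dtype=float)
    g = np.asarray(g)
    x0, x1 = x[g == 0], x[g == 1]
    n0, n1, n = x0.size, x1.size, x.size
    s_n = x.std()  # NumPy's default ddof=0, i.e. divisor n
    return (x1.mean() - x0.mean()) / s_n * np.sqrt(n0 * n1 / n**2)


# Simulated data (an assumption for the demo): 80 zeros and 120 ones,
# with the continuous variable shifted upward in group 1.
rng = np.random.default_rng(0)
g = np.repeat([0, 1], [80, 120])
x = rng.normal(loc=1.5 * g, scale=1.0)

r_pb = point_biserial(x, g)
r_pearson = np.corrcoef(x, g)[0, 1]  # ordinary Pearson correlation

# Slope of the least-squares line for the reversed data (g on the x-axis).
slope = np.polyfit(g, x, 1)[0]
mean_diff = x[g == 1].mean() - x[g == 0].mean()

print(r_pb, r_pearson)   # the two coefficients agree up to rounding
print(slope, mean_diff)  # the slope equals M1 - M0 up to rounding
```

Note that the agreement depends on using the divisor-\$n\$ standard deviation for \$s_n\$; with the divisor-\$(n-1)\$ version the square-root factor has to change accordingly.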
