# Solved – why in logistic regression the predicted probability mass equals the observed count

It's said that logistic regression is well calibrated and preserves marginal probability. What does that mean? Thanks.


The score equation for logistic regression, found by differentiating the log-likelihood with respect to the coefficients and setting the derivative to zero, is

$$ X^t (y - p) = 0 $$
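For readers who want to see where this comes from, the log-likelihood of logistic regression is

$$ \ell(\beta) = \sum_{i} \left[ y_i x_i^t \beta - \log\left(1 + e^{x_i^t \beta}\right) \right] $$

and differentiating with respect to $\beta$, using $p_i = 1 / (1 + e^{-x_i^t \beta})$, gives

$$ \frac{\partial \ell}{\partial \beta} = \sum_{i} x_i (y_i - p_i) = X^t (y - p) $$

Setting this gradient to zero yields the equation above.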

Suppose you have a binary variable in your regression, reflected as a column of $1$s and $0$s in the design matrix $X$. This becomes a row in $X^t$, and hence a single linear equation in the matrix equation above. Calling this the $j$th row in $X^t$, and focusing on this single equation, we get:

$$ \sum_{i} x_{ij} ( y_i - p_i ) = 0 $$

but $x_{ij}$ is a binary column, so we may drop from the sum the terms where it is zero:

$$ \sum_{i \mid x_{ij} = 1} ( y_i - p_i ) = 0 $$

Or, rearranging:

$$ \sum_{i \mid x_{ij} = 1} y_i = \sum_{i \mid x_{ij} = 1} p_i $$

So, for each binary variable in our regression, if we subset down to the observations where this binary variable is on, the sum of the predicted probabilities for these observations equals the sum of the response. This is what is meant by "preserves marginal probability".
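This identity is easy to verify numerically. The sketch below fits an unpenalized logistic regression by Newton's method (IRLS) on hypothetical simulated data, then checks that on the subset where the binary feature is on, the predicted probabilities sum to the observed count of successes. The dataset and feature names are illustrative, not from the original post.

```python
import numpy as np

# Hypothetical data: intercept, one binary feature, one continuous feature.
rng = np.random.default_rng(0)
n = 200
binary = rng.integers(0, 2, n)
cont = rng.normal(size=n)
X = np.column_stack([np.ones(n), binary, cont])
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.7]))))
y = rng.binomial(1, p_true)

# Unpenalized fit by Newton's method: beta <- beta + H^{-1} X^t (y - p).
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                       # the score X^t (y - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])  # Fisher information
    beta += np.linalg.solve(hess, grad)

p = 1 / (1 + np.exp(-X @ beta))

# Marginal preservation: restricted to observations with the binary
# feature on, predicted probabilities sum to the observed successes.
mask = binary == 1
print(y[mask].sum(), p[mask].sum())
```

Note that the identity holds exactly only at the maximum-likelihood solution; a penalized (e.g. ridge) fit perturbs the score equation and breaks it.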

As for "well calibrated", a special case applies to the intercept column, where every observation receives a \$1\$. Our equation in this case becomes:

$$ \sum_{i} y_i = \sum_{i} p_i $$

This is more enlightening if we divide by the number of observations $n$:

$$ \frac{1}{n} \sum_{i} y_i = \frac{1}{n} \sum_{i} p_i $$

So the average response rate in the data equals the average predicted probability. This is not true of every probability model, but it is true of (unpenalized) logistic regression fit with an intercept.
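The intercept case can be checked the same way. The sketch below (again with hypothetical simulated data) fits an unpenalized logistic regression with an intercept by Newton's method and confirms the average predicted probability matches the average response:

```python
import numpy as np

# Hypothetical data: intercept plus one continuous feature.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = 1 / (1 + np.exp(-(X @ np.array([0.3, -1.2]))))
y = rng.binomial(1, p_true)

# Unpenalized Newton's-method fit, as in the previous sketch.
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]),
                            X.T @ (y - p))

p = 1 / (1 + np.exp(-X @ beta))

# Calibration in the large: mean prediction equals mean response.
print(y.mean(), p.mean())
```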
