It's said that logistic regression is well calibrated and preserves marginal probability. What does that mean? Thanks.


#### Best Answer

The estimating equation for logistic regression, found by differentiating the loss function and setting the result to zero, is

$$ X^t (y - p) = 0 $$

Suppose you have a binary variable in your regression, reflected as a column of $1$s and $0$s in the design matrix $X$. This becomes a row in $X^t$, and hence a single linear equation in the matrix equation above. Calling this the $j$th row in $X^t$, and focusing on this single equation, we get:

$$ \sum_{i} x_{ij} ( y_i - p_i ) = 0 $$

but the $j$th column is binary, so we may drop from the sum the terms where $x_{ij} = 0$:

$$ \sum_{i \mid x_{ij} = 1} ( y_i - p_i ) = 0 $$

Or, rearranging:

$$ \sum_{i \mid x_{ij} = 1} y_i = \sum_{i \mid x_{ij} = 1} p_i $$

So, for each binary variable in our regression, if we subset down to the observations where this binary variable is *on*, the sum of the predicted probabilities for these observations equals the sum of the responses. This is what is meant by "preserves marginal probability".

As for "well calibrated", consider the special case of the intercept column, where every observation receives a $1$. Our equation in this case becomes:

$$ \sum_{i} y_i = \sum_{i} p_i $$

This is more enlightening if we divide by the number of observations $n$:

$$ \frac{1}{n} \sum_{i} y_i = \frac{1}{n} \sum_{i} p_i $$

So the average response rate in the data is equal to the average predicted probability. This is not true for all probability models, but is true of logistic regression.
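Both identities are easy to check numerically. Here is a small NumPy sketch (the simulated data and the Newton-method fit are illustrative, not from the original answer): we fit an unpenalized logistic regression and verify that the sums of $y_i$ and $p_i$ agree on the binary-column subset, and that the average response equals the average predicted probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.integers(0, 2, size=n)           # one binary predictor
X = np.column_stack([np.ones(n), x])     # intercept column + binary column
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = (rng.random(n) < p_true).astype(float)

# Fit by Newton's method (equivalent to IRLS), which drives the
# score X^t (y - p) to zero.
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                       # the score equation's left side
    H = X.T @ (X * (p * (1 - p))[:, None])     # Fisher information matrix
    beta += np.linalg.solve(H, grad)

p = 1 / (1 + np.exp(-X @ beta))

# Marginal-probability identity on the binary column:
print(y[x == 1].sum(), p[x == 1].sum())   # equal up to convergence tolerance
# Calibration identity from the intercept column:
print(y.mean(), p.mean())                 # equal up to convergence tolerance
```

Note that a ridge- or lasso-penalized fit would break these identities, since the penalty perturbs the score equation away from $X^t(y - p) = 0$.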
