I am using sklearn.linear_model.LogisticRegression for a binary classification problem. My classes are unbalanced: the positive class comprises about 20% of the training set. When fitting the model I use:

```python
# class_weight="balanced" is the current name for the old "auto" option
logreg = LogisticRegression(class_weight="balanced")
logreg.fit(X_trn, y_trn)
```

which lets sklearn give greater weight to the infrequent positive class during training. But now I want to re-balance the class weights for test time. My first intuition is to adjust the `logreg.intercept_` member of the fitted model. Would this be the correct approach?


#### Best Answer

For any joint distribution over a binary variable $C$ and a continuous variable $x$, Bayes' rule gives

$$\begin{align}
p(C_1|x) &= \frac{p(x|C_1)\,p(C_1)}{p(x)}\\
&= \frac{p(x|C_1)\,p(C_1)}{p(x|C_1)\,p(C_1) + p(x|C_2)\,p(C_2)}\\
&= \frac{1}{1 + \frac{p(x|C_2)\,p(C_2)}{p(x|C_1)\,p(C_1)}}\\
&= \frac{1}{1 + \exp\left(\ln\frac{p(x|C_2)\,p(C_2)}{p(x|C_1)\,p(C_1)}\right)}\\
&= \frac{1}{1 + \exp\left(-\ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}\right)}\\
&= \frac{1}{1 + \exp\left(-(w^Tx + b)\right)},
\end{align}$$

where $C_1$ denotes the event $C=1$ and $C_2$ the event $C=0$. Notice this is exactly the hypothesis assumed in binary logistic regression. From the above, we have

$$w^Tx + b = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)} = \ln\frac{p(x|C_1)}{p(x|C_2)} + \ln\frac{p(C_1)}{p(C_2)}.$$

If, during training, we balance the dataset or weight the examples inversely to their class prior probabilities, we effectively impose $p(C_1) = p(C_2)$, and the above reduces to

$$w^Tx + b = \ln\frac{p(x|C_1)}{p(x|C_2)}.$$

The parameters $w$ and $b$ are therefore estimated under the assumption that the class priors are equal. We can re-introduce the prior log odds:

$$\begin{align}
w^Tx + b + \ln\frac{p(C_1)}{p(C_2)} &= \ln\frac{p(x|C_1)}{p(x|C_2)} + \ln\frac{p(C_1)}{p(C_2)}\\
w^Tx + b' &= \ln\frac{p(x|C_1)}{p(x|C_2)} + \ln\frac{p(C_1)}{p(C_2)},
\end{align}$$

where $b' = b + \ln\frac{p(C_1)}{p(C_2)}$. So yes: a simple adjustment to the bias (intercept) term re-introduces unbalanced priors in the test/application setting. A similar argument holds for multi-class logistic regression.
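The intercept adjustment above can be sketched in code. This is a minimal illustration, not the only way to do it: the synthetic dataset from `make_classification` and the variable names are assumptions made for the example, and the empirical training frequency `y.mean()` is used as a stand-in for the deployment-time prior $p(C_1)$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced data: positive class ~20%, mirroring the question.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

# Train with balanced class weights ("balanced" replaces the old "auto"),
# which effectively fits the model as if p(C_1) = p(C_2).
logreg = LogisticRegression(class_weight="balanced")
logreg.fit(X, y)

# Re-introduce the unbalanced priors at test time by shifting the bias:
# b' = b + ln(p(C_1) / p(C_2)).
p1 = y.mean()      # prior of the positive class (here: training frequency)
p0 = 1.0 - p1      # prior of the negative class
logreg.intercept_ = logreg.intercept_ + np.log(p1 / p0)

# predict_proba now reflects the adjusted prior log odds.
probs = logreg.predict_proba(X[:5])
```

Since $p(C_1) < p(C_2)$ here, the shift $\ln(p_1/p_0)$ is negative, pulling predicted positive-class probabilities down relative to the balanced-prior model.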
