I have a dataset of users with a gender and a value "fr" (in the range [0, 1]), which I hope can be used to predict the gender.
I tried fitting this data, but contrary to what I expect and what makes sense, the model predicts that a higher fr means a lower chance of being female. I assume I must be misunderstanding something; this is my first time using logistic regression and the statsmodels package.
>>> print(df.head(3))
   gender        fr is_female
0  female  0.438898      True
1    male  0.285226     False
2    male  0.157895     False
>>> print(df.describe())
                 fr
count  64900.000000
mean       0.304351
std        0.160970
min        0.000000
25%        0.200000
50%        0.285714
75%        0.392857
max        1.000000
>>> g = sns.FacetGrid(df, col="gender")
>>> g.map(plt.hist, "fr", bins=25)
>>> sns.lmplot(x="fr", y="is_female", data=df.sample(1000), logistic=True, y_jitter=.05)
These two plots show (I think) that it should be possible to use Logit to predict the gender. However, when I fit the model with statsmodels, it returns a negative coefficient:
>>> import statsmodels.api as sm
>>> logit = sm.Logit(df["is_female"], df["fr"])
>>> result = logit.fit()
Optimization terminated successfully.
         Current function value: 0.682087
         Iterations 4
>>> print(result.summary())
                           Logit Regression Results
==============================================================================
Dep. Variable:              is_female   No. Observations:                64900
Model:                          Logit   Df Residuals:                    64899
Method:                           MLE   Df Model:                            0
Date:                Fri, 23 Dec 2016   Pseudo R-squ.:                -0.08770
Time:                        18:06:44   Log-Likelihood:                -44267.
converged:                       True   LL-Null:                       -40698.
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
fr            -0.8741      0.023    -37.457      0.000        -0.920    -0.828
==============================================================================
Plotting the predictions also shows that the fit is not correct:
>>> df_ = pd.DataFrame({"fr": np.linspace(0, 1, 11)})
>>> df_["female_predict"] = result.predict(df_[["fr"]])
>>> df_.plot(x="fr", y="female_predict")
Best Answer
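Answered in comments: you need to add a constant to the regression. sm.Logit does not include an intercept unless you add one yourself (for example with sm.add_constant), so the model above is forced to predict P(female) = 0.5 at fr = 0. The summary shows the symptoms: Df Model is 0, and the Pseudo R-squ. is negative because the fitted log-likelihood (-44267) is worse than that of the intercept-only null model (-40698).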