Solved – Logit in statsmodels gives negative coefficients

I have a dataset of users with a gender and a value "fr" (in the range [0, 1]) which I hope can be used to predict the gender.

I tried fitting this data, but contrary to what I expect and what makes sense, the model predicts that a higher fr means a lower chance of being female. I assume I must be misunderstanding something; this is my first time using logistic regression and the statsmodels package.

>>> print(df.head(3))
   gender        fr is_female
0  female  0.438898      True
1    male  0.285226     False
2    male  0.157895     False

>>> print(df.describe())
                 fr
count  64900.000000
mean       0.304351
std        0.160970
min        0.000000
25%        0.200000
50%        0.285714
75%        0.392857
max        1.000000

>>> g = sns.FacetGrid(df, col="gender")
>>> g.map(plt.hist, "fr", bins=25)

[Figure: histograms of fr, one panel per gender]

>>> sns.lmplot(x="fr", y="is_female", data=df.sample(1000), logistic=True, y_jitter=.05) 

[Figure: seaborn logistic regression fit of is_female on fr]

These two plots (I think) show that it should be possible to use Logit to predict the gender. However, when I run it with statsmodels it returns a negative coefficient:

>>> import statsmodels.api as sm
>>> logit = sm.Logit(df["is_female"], df["fr"])
>>> result = logit.fit()
Optimization terminated successfully.
         Current function value: 0.682087
         Iterations 4
>>> print(result.summary())
                           Logit Regression Results
==============================================================================
Dep. Variable:              is_female   No. Observations:                64900
Model:                          Logit   Df Residuals:                    64899
Method:                           MLE   Df Model:                            0
Date:                Fri, 23 Dec 2016   Pseudo R-squ.:                -0.08770
Time:                        18:06:44   Log-Likelihood:                -44267.
converged:                       True   LL-Null:                       -40698.
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
fr            -0.8741      0.023    -37.457      0.000        -0.920    -0.828
==============================================================================

Plotting the result also shows it's not correct:

>>> df_ = pd.DataFrame({"fr": np.linspace(0, 1, 11)})
>>> df_["female_predict"] = result.predict(df_["fr"])
>>> df_.plot(x="fr", y="female_predict")

[Figure: predicted probability from statsmodels Logit, decreasing in fr]

Best Answer

Answered in comments: you need to add a constant (intercept) to the regression. Unlike R's glm, statsmodels' Logit does not add an intercept automatically, so the model being fit is P(is_female) = sigmoid(b · fr), which is pinned to 0.5 at fr = 0; the slope then has to absorb the overall base rate of is_female, which can flip its sign.
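A minimal sketch of the fix, assuming the same df as in the question (sm.add_constant prepends a column of ones, giving the model a free intercept):

>>> import statsmodels.api as sm
>>> X = sm.add_constant(df["fr"])   # DataFrame with columns "const" and "fr"
>>> result = sm.Logit(df["is_female"], X).fit()
>>> print(result.params)            # the fr coefficient should now come out positive

Note that the same constant has to be added at prediction time, e.g. result.predict(sm.add_constant(df_["fr"])); otherwise the number of columns won't match the fitted model.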
