I am surprised that R’s `glm` will “break” (fail to converge with default settings) on the following “toy” example (binary classification with ~50k rows and ~10 features), while `glmnet` returns results in seconds.

Am I using `glm` incorrectly (for example, should I set the maximum number of iterations?), or is R’s `glm` not suited to big-data settings? Does adding regularization make the problem easier to solve?

```r
d = ggplot2::diamonds
d$price_c = d$price > 2500
d = d[, !names(d) %in% c("price")]

lg_glm_fit = glm(price_c ~ ., data = d, family = binomial())

library(glmnet)
x = model.matrix(price_c ~ ., d)
y = d$price_c
lg_glmnet_fit = glmnet(x = x, y = y, family = "binomial", alpha = 0)
```

```
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

EDIT:

Thanks to Matthew Drury and Jake Westfall for their answers. I understand the perfect separation issue, which is already addressed in: How to deal with perfect separation in logistic regression?

Also, my original code does include the third line, which drops the column from which the label is derived.

The reason I mention "big data" is that in many "big data" / "machine learning" settings, people may not carefully test assumptions or know whether the data can be perfectly separated. But `glm` seems to break easily, with "unfriendly" messages, and there is no easy way to add regularization to fix it.


#### Best Answer

The unregularized model is suffering from complete separation because you are trying to predict the dichotomized variable `price_c` from the continuous variable `price` from which it is derived.

The regularized model avoids the problem of complete separation by imposing a penalty that keeps the coefficient for the `price` predictor from going off to $\infty$ or $-\infty$. So it manages to converge fine and work well.
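To make the effect of the penalty concrete, here is a minimal base-R sketch on made-up data (not the diamonds data). The six-point data set, the helper `ridge_nll`, and the choice `lambda = 1` are all assumptions for illustration, and the optimizer is plain `optim()`, not the algorithm `glmnet` uses.

```r
# Toy data (assumed for illustration): y is perfectly separated by x,
# since y = 1 exactly when x > 0.
x <- c(-3, -2, -1, 1, 2, 3)
y <- as.numeric(x > 0)

# Unregularized fit: the likelihood keeps increasing as the slope grows,
# so glm() stops at a very large slope and emits warnings.
fit_glm <- suppressWarnings(glm(y ~ x, family = binomial()))
slope_glm <- unname(coef(fit_glm)["x"])

# Hypothetical helper: negative log-likelihood plus a ridge penalty on the
# slope only (the intercept beta[1] is left unpenalized).
ridge_nll <- function(beta, x, y, lambda) {
  eta <- beta[1] + beta[2] * x
  # log(1 + exp(eta)), written in a numerically stable form
  log1pexp <- pmax(eta, 0) + log1p(exp(-abs(eta)))
  -sum(y * eta - log1pexp) + lambda * beta[2]^2 / 2
}

# Penalized fit: the penalty makes the objective have a finite minimizer.
fit_ridge <- optim(c(0, 0), ridge_nll, x = x, y = y, lambda = 1,
                   method = "BFGS")
slope_ridge <- fit_ridge$par[2]

print(c(glm = slope_glm, ridge = slope_ridge))
```

The unpenalized slope is only limited by when the iterations stop, while the penalized slope settles at a moderate value that depends on `lambda`.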

You should remove the continuous `price` predictor from the design matrix in this toy example.

**Edit:** As @Erik points out, the continuous `price` predictor *is* already removed from the design matrix, which I somehow missed. So the complete separation arises from some other predictor or combination of predictors.
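One rough way to check whether separation comes from a single predictor or from a combination is sketched below in base R. The grid data, the column names `a` and `b`, and the two checks are assumptions for illustration; this is not a general-purpose separation detector, and it is not run on the diamonds data here.

```r
# Made-up data: the label is separated by the combination a + b,
# but by neither a nor b alone.
d <- expand.grid(a = -3:3, b = -3:3)
d <- d[d$a + d$b != 0, ]          # leave a gap around the boundary
d$y <- d$a + d$b > 0

# Unregularized fit (warnings are expected on separated data).
fit <- suppressWarnings(glm(y ~ a + b, data = d, family = binomial()))
eta <- predict(fit, type = "link")

# Check 1: does the fitted linear predictor rank every y = TRUE row above
# every y = FALSE row? If so, some linear combination separates the classes.
separated_jointly <- min(eta[d$y]) > max(eta[!d$y])

# Check 2: for each predictor on its own, do the two classes occupy
# non-overlapping ranges?
separates_alone <- sapply(d[c("a", "b")], function(v) {
  max(v[!d$y]) < min(v[d$y]) || max(v[d$y]) < min(v[!d$y])
})

print(separated_jointly)
print(separates_alone)
```

If the first check is true while the second is false for every column, the separation comes from a combination of predictors rather than from any one of them.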

It's also worth adding that, of course, none of these issues have anything to do with the particular implementation of logistic regression in R's `glm()` function. It is simply about regularized vs. unregularized logistic regression.
