I'm starting my first Machine Learning project to classify some entities and I decided to use Logistic Regression for the task.
Initially I started with around 10 features, and I can see that my model is underfitting the data (F-score around 0.63).
A likely explanation is that all of my features are first order, so my hypothesis is a first-order polynomial.
I would like to add higher-order features, but I quickly realized that I don't have a good intuition for how to do that. I could take each of my features $X_n$ and add new ones $X_n^2$, $X_n^3$, etc. I could also start adding more complex features like $X_1 * X_2$, etc.
Immediately I noticed that there are countless possibilities. How do I start? What are good practices for adding more features? How can I avoid overfitting the data?
Best Answer
If you really want to add higher-order features to a logistic regression model, then I would suggest expanding your feature set with interactions between features, such as $X_1 * X_2$, and nonlinear transforms, such as $\log(X_1)$ and $X_1^2$, exactly as you proposed.
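To make the expansion concrete, here is a minimal sketch in plain Python, assuming each example is a list of numeric features. The function name `expand_features` and its parameters are hypothetical and chosen only for illustration; in practice a library routine would usually do this job.

```python
import math
from itertools import combinations_with_replacement


def expand_features(row, degree=2, add_log=False):
    """Return the original features plus every monomial up to `degree`.

    For row=[x1, x2] and degree=2 the result is
    [x1, x2, x1*x1, x1*x2, x2*x2].
    """
    expanded = []
    for d in range(1, degree + 1):
        # Each index tuple picks the features multiplied together,
        # e.g. (0, 1) -> x1 * x2 and (0, 0) -> x1 ** 2.
        for idx in combinations_with_replacement(range(len(row)), d):
            expanded.append(math.prod(row[i] for i in idx))
    if add_log:
        # Assumption: every feature is strictly positive;
        # math.log raises ValueError otherwise.
        expanded.extend(math.log(x) for x in row)
    return expanded


print(expand_features([2.0, 3.0], degree=2))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

The number of columns grows quickly: with 10 original features, degree 2 already yields 65 columns and degree 3 yields 285, which is why some form of feature selection or regularization is needed afterwards.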
Finally, to avoid overfitting and perform variable selection at the same time, apply a LASSO (L1) regularizer: it both penalizes model complexity and induces sparsity. Only the subset of features, including higher-order ones, that matter most will be kept by the model.
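As an illustration of how an L1 penalty drives some weights to exactly zero, here is a minimal sketch of L1-regularized logistic regression fitted by proximal gradient descent, in plain Python. The function names, the toy data, and the values of `lam`, `lr`, and `n_iter` are all assumptions made for this example; a real project would normally rely on a library implementation and standardize the features first, since the L1 penalty is sensitive to feature scale.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.1, n_iter=500):
    """Minimize mean log-loss + lam * ||w||_1 (intercept b unpenalized)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step on the log-loss, then shrinkage from the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data: feature 0 determines the label, feature 1 carries no information.
X = [[x0, x1] for x0 in (-2.0, -1.0, 1.0, 2.0) for x1 in (-1.0, 1.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_l1_logistic(X, y, lam=0.1)
print(w)  # the weight of the uninformative feature stays exactly 0.0
```

The penalty strength `lam` controls how many features survive: larger values zero out more weights, so it is usually chosen by cross-validation on held-out data.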
You might also want to consider nonlinear models, which try to learn a suitable nonlinearity from the data by themselves (e.g. neural networks).