Solved – If I use regularization (e.g. L2), can I skip early stopping?

I've seen that early stopping is a form of regularization that limits how far the parameters $\theta$ can move, much as L2 regularization penalizes $\theta$ for moving away from the origin.

Does that mean I can avoid overfitting by using a form of regularization other than early stopping, and then train for more epochs than early stopping would have allowed?

It would be rather typical to combine an L2 penalty (or the closely related weight decay, plus possibly other regularization techniques such as dropout) with early stopping when training neural networks or gradient boosted decision trees (e.g. LightGBM, XGBoost etc.). This is very commonly done, especially when training models that have the potential to massively overfit (like the ones I mentioned) and that are more costly to train than some generalized linear model. Different regularization techniques have different effects, and models may benefit from several of them.

For example, early stopping is commonly used when you cannot work out (or don't have the time to work out) how to set all the other regularization parameters so that you can train to convergence without overfitting. Other forms of regularization also tend to reduce overfitting and will, as you suggested, let you train for more epochs/iterations before early stopping is needed. These include L1 and L2 penalties as well as dropout in neural networks, which has been suggested to have an effect like that of a spike-and-slab prior; sub-sampling predictors for trees or parts of trees is somewhat similar to that. Sampling observations in tree-based models, or data augmentation for neural networks, instead has more of an effect of emphasizing patterns that are seen in most of the data.
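To make the combination concrete, here is a minimal sketch in plain Python of gradient descent on a linear model with an L2 penalty, monitored by patience-based early stopping on a validation set. Everything in it is an assumption made for illustration: the synthetic data, the helper names (`make_data`, `mse`, `train`), and the values of `l2`, `lr` and `patience`. It is not how LightGBM, XGBoost or any neural network library implements either technique.

```python
import random


def make_data(n, n_features, noise, rng):
    """Synthetic linear data (illustrative): only the first feature carries signal."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [2.0 * row[0] + rng.gauss(0, noise) for row in X]
    return X, y


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y), without any penalty."""
    preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def train(X_tr, y_tr, X_val, y_val, l2=0.0, lr=0.05, max_epochs=500, patience=10):
    """Minimize MSE + l2 * ||w||^2 by gradient descent, with early stopping."""
    n_features = len(X_tr[0])
    w = [0.0] * n_features
    best_loss, best_w, best_epoch = float("inf"), list(w), 0
    for epoch in range(1, max_epochs + 1):
        # Gradient of the training MSE with respect to w.
        grad = [0.0] * n_features
        for row, t in zip(X_tr, y_tr):
            err = sum(wj * xj for wj, xj in zip(w, row)) - t
            for j, xj in enumerate(row):
                grad[j] += 2.0 * err * xj / len(y_tr)
        # The L2 penalty adds 2 * l2 * w to the gradient, pulling w toward 0.
        w = [wj - lr * (gj + 2.0 * l2 * wj) for wj, gj in zip(w, grad)]
        # Early stopping monitors the unpenalized validation loss.
        val_loss = mse(X_val, y_val, w)
        if val_loss < best_loss:
            best_loss, best_w, best_epoch = val_loss, list(w), epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_w, best_epoch, best_loss


rng = random.Random(0)
X_tr, y_tr = make_data(30, 20, 1.0, rng)     # few rows, many features: easy to overfit
X_val, y_val = make_data(200, 20, 1.0, rng)

results = {}
for l2 in (0.0, 0.1):
    w, epoch, loss = train(X_tr, y_tr, X_val, y_val, l2=l2)
    results[l2] = (w, epoch, loss)
    print(f"l2={l2}: best epoch {epoch}, validation MSE {loss:.3f}")
```

Running it for a few values of `l2` shows how the epoch with the best validation loss shifts as the penalty changes. Whether a given penalty removes the need for early stopping altogether depends on the data and the model, which is why the two are usually kept together.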
