I coded a neural network from scratch. When the regularization parameter is too high and the learning rate too low, the cost increases during training. I suspect that the extra term regularization adds to the loss function is responsible for this. When the regularization parameter is set to zero, I always get a nice decrease in the cost function. Can you please explain what is happening?
Best Answer
$L_2$ regularization basically adds a parabola (a quadratic bowl in weight space) with its minimum at the origin to the loss surface. How steeply that bowl rises depends on the magnitude of the $L_2$ penalty coefficient. If the penalty is too large, the regularization term overwhelms the signal from the cross-entropy loss, because the surface is distorted by the massive cost of increasing the norm of the weights away from 0.
If you imagine starting near zero, moving the weights further from zero increases the penalty quadratically. If that increase is larger than the decrease in cross-entropy loss, the net effect is that the total loss increases.
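Here is a minimal numerical sketch of that effect. The weight vectors and cross-entropy values below are made up for illustration (they are not from your network); the point is only that the same gradient step can lower or raise the total cost depending on the size of the penalty coefficient:

```python
import numpy as np

def l2_penalty(w, lam):
    """Quadratic penalty (lam / 2) * ||w||^2 -- the 'bowl' centred at the origin."""
    return 0.5 * lam * np.sum(w ** 2)

# Hypothetical weights before and after one gradient step that improves the data fit
w_before = np.array([0.5, -0.3, 0.8])
w_after  = np.array([0.6, -0.4, 1.0])

# Hypothetical cross-entropy values: the data loss goes down slightly
ce_before, ce_after = 0.70, 0.65

for lam in (0.0, 0.1, 10.0):
    total_before = ce_before + l2_penalty(w_before, lam)
    total_after  = ce_after  + l2_penalty(w_after, lam)
    trend = "increases" if total_after > total_before else "decreases"
    print(f"lambda={lam:5.1f}  total before={total_before:.3f}  after={total_after:.3f}  ({trend})")
```

With `lam = 0` or a small value the total cost still decreases, but with `lam = 10` the growth of the penalty swamps the small improvement in cross-entropy, and the total cost goes up, which matches what you observed.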
This is one of the reasons I prefer to track the total penalized loss, the penalty term, and the classification loss as separate quantities; logging each one makes this kind of unusual behavior obvious.
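A sketch of what that tracking could look like inside your training loop (the function names, the `weights` list of layer matrices, and the `history` dict are assumptions for illustration, not your code):

```python
import numpy as np

def cross_entropy(probs, y):
    """Mean negative log-likelihood of the true class labels."""
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def l2_penalty(weights, lam):
    """(lam / 2) * sum of squared weights over all layers."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

# Log each component on its own every epoch
history = {"data_loss": [], "penalty": [], "total": []}

def log_losses(probs, y, weights, lam):
    data_loss = cross_entropy(probs, y)
    penalty = l2_penalty(weights, lam)
    history["data_loss"].append(data_loss)
    history["penalty"].append(penalty)
    history["total"].append(data_loss + penalty)
```

If the logged `data_loss` keeps falling while `total` rises, you know the penalty term is dominating and the regularization parameter is too large.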