Specifically, consider Ridge Regression's cost function. Since Ridge Regression is based on the $l_2$ norm, we should expect the cost function to be:
$$J(\theta)=\mathrm{MSE}(\theta) + \alpha\sqrt{\sum_{i=1}^{n}\theta_i^2}$$
The actual cost function, however, is:
$$J(\theta)=\mathrm{MSE}(\theta) + \alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$$
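To relate the two forms to the $l_2$ norm: the expected penalty is $\alpha$ times the norm itself, while the actual Ridge penalty is (half of) the squared norm:
$$\alpha\sqrt{\sum_{i=1}^{n}\theta_i^2}=\alpha\|\theta\|_2, \qquad \frac{\alpha}{2}\sum_{i=1}^{n}\theta_i^2=\frac{\alpha}{2}\|\theta\|_2^2.$$
So the question is why Ridge Regression penalizes the squared $l_2$ norm rather than the $l_2$ norm itself.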
Best Answer
One of the factors to consider is computational simplicity.
By not introducing the square root, we obtain a gradient with a much simpler form, as shown below.
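To see this concretely, differentiate each penalty with respect to a single parameter $\theta_j$ (for the square-root version, assuming $\theta \ne 0$):
$$\frac{\partial}{\partial\theta_j}\left(\frac{\alpha}{2}\sum_{i=1}^{n}\theta_i^2\right)=\alpha\theta_j, \qquad \frac{\partial}{\partial\theta_j}\left(\alpha\sqrt{\sum_{i=1}^{n}\theta_i^2}\right)=\frac{\alpha\theta_j}{\|\theta\|_2}.$$
The squared penalty yields a gradient that is linear in $\theta$ (the factor $\frac{1}{2}$ cancels the 2 from differentiation), whereas the square-root penalty picks up a $1/\|\theta\|_2$ factor and is not differentiable at $\theta = 0$.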
Also, minimizing $\mathrm{MSE}$ subject to $\|\theta\|_2\le c$ is equivalent to minimizing $\mathrm{MSE}$ subject to $\|\theta\|_2^2\le c^2$.
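The equivalence holds because squaring is monotone on nonnegative values, so the two constraints describe exactly the same feasible set:
$$\{\theta : \|\theta\|_2\le c\} = \{\theta : \|\theta\|_2^2\le c^2\},$$
and the penalized objective $\mathrm{MSE}(\theta) + \frac{\alpha}{2}\sum_{i=1}^{n}\theta_i^2$ corresponds (via a Lagrange multiplier) to the squared-norm constrained problem, so nothing is lost by working with the square instead of the norm itself.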