Consider Ridge Regression's cost function specifically. Since Ridge Regression is based on the $\ell_2$ norm, we should expect the cost function to be:

$$J(\theta)=\mathrm{MSE}(\theta) + \alpha\sqrt{\sum_{i=1}^{n}\theta_i^2}$$

But the actual cost function is:

$$J(\theta)=\mathrm{MSE}(\theta) + \alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$$


#### Best Answer

One of the factors to consider is computational simplicity.

Because the square root is not introduced, the gradient takes a simpler form: the penalty term's gradient is just $\alpha\theta$, which is linear in $\theta$.
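To make the contrast concrete, here is a minimal sketch (names and values are illustrative, not from the original post) comparing the gradients of the two penalty terms:

```python
import numpy as np

alpha = 0.1
theta = np.array([1.0, -2.0, 3.0])

# Gradient of the squared penalty alpha * (1/2) * sum(theta_i^2):
# simply alpha * theta -- linear and defined everywhere.
grad_squared = alpha * theta

# Gradient of the sqrt (l2-norm) penalty alpha * sqrt(sum(theta_i^2)):
# alpha * theta / ||theta||_2 -- undefined at theta = 0.
grad_sqrt = alpha * theta / np.linalg.norm(theta)

print(grad_squared)  # proportional to theta itself
print(grad_sqrt)     # a unit vector scaled by alpha
```

Note that the square-root version blows up at $\theta = 0$ (division by zero), which is another reason the squared form is preferred for gradient-based optimization.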

Also, minimizing $\mathrm{MSE}$ subject to $\|\theta\|_2\le c$ is equivalent to minimizing $\mathrm{MSE}$ subject to $\|\theta\|_2^2\le c^2$.
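The equivalence can be spelled out in one line: both sides of the constraint are nonnegative, and $x\mapsto x^2$ is strictly increasing on $[0,\infty)$, so

$$\|\theta\|_2\le c \iff \|\theta\|_2^2\le c^2.$$

The two constraints therefore define the same feasible set, and minimizing $\mathrm{MSE}$ over that set yields the same minimizer either way.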

### Similar Posts:

- Solved – a loss function in decision theory
- Solved – Supervised learning : How did they find the Cost function to minimize
- Solved – Why is the regularization term *added* to the cost function (instead of multiplied etc.)