$$\hat{\theta}=\arg\min_{\theta}\{ \|y-X\theta\|_2^2+\lambda\|\theta\|_2^2\},$$ where $X$ is an $n\times p$ matrix.

If $y=X\theta+\varepsilon$, then we have $$\hat{\theta}^{\text{ridge}}=(X^TX+\lambda I)^{-1}X^Ty.$$ I'm kinda confused, because if $y=X\theta+\varepsilon$, then $\|y-X\theta\|_2^2+\lambda\|\theta\|_2^2=\|\varepsilon\|_2^2+\lambda\|\theta\|_2^2$, so I don't see how to show that

$$(X^TX+\lambda I)^{-1}X^Ty=\arg\min_{\theta}\{ \|y-X\theta\|_2^2+\lambda\|\theta\|_2^2\}.$$ Any help would be much appreciated. Thank you. Edit: I had to edit this because someone said it's a duplicate of an entirely different problem.


#### Best Answer

I think about the problem in summation notation.

The loss is defined, as you said, as $L = \sum_{i=1}^{N}\left(\sum_{j=1}^{M}\theta_{j}X_{ij} - y_{i}\right)^{2}+ \lambda\sum_{j=1}^{M}\theta_{j}^{2}$, where $N=n$ and $M=p$ are the numbers of rows and columns of $X$.

You can differentiate this w.r.t $theta _{k}$ to find:

$\frac{\partial L}{\partial \theta_{k}} =\sum_{i=1}^{N}2\left(\sum_{j=1}^{M}\theta_{j}X_{ij}-y_{i}\right)X_{ik} +2\lambda \theta_{k}$

Note that $\sum_{i=1}^{N}X_{ik}\sum_{j=1}^{M}\theta_{j}X_{ij}=\sum_{i=1}^{N}X_{ik}(X\cdot \theta)_{i}=(X^{T}\cdot X \cdot \theta)_{k}$

and $\sum_{i=1}^{N}X_{ik}y_{i}=(X^{T}\cdot y)_{k}$
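If you want to convince yourself that the index gymnastics are right, here is a small numerical sanity check. It is only a sketch: the sizes `N`, `M`, the penalty `lam`, and the random data are arbitrary values picked for illustration, not anything from the question. It computes the gradient once with the explicit double sum and once with the matrix expressions $X^{T}\cdot X\cdot\theta$ and $X^{T}\cdot y$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 6, 3, 0.5              # arbitrary illustrative sizes and penalty
X = rng.normal(size=(N, M))
y = rng.normal(size=N)
theta = rng.normal(size=M)

# Gradient written exactly as the sums over i and j.
grad_sum = np.zeros(M)
for k in range(M):
    for i in range(N):
        residual_i = sum(theta[j] * X[i, j] for j in range(M)) - y[i]
        grad_sum[k] += 2 * residual_i * X[i, k]
    grad_sum[k] += 2 * lam * theta[k]

# The same gradient using the matrix identities.
grad_mat = 2 * (X.T @ X @ theta - X.T @ y) + 2 * lam * theta

print(np.allclose(grad_sum, grad_mat))
```

The two versions should agree up to floating-point error, so the final line should print `True`.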

Putting this together, setting the derivative to zero for every $k$ and dividing by 2 gives:

$(X^{T}\cdot X\cdot \theta)_{k}-(X^{T}\cdot y)_{k} +\lambda \theta_{k}=0 \hspace{5mm}\forall k$

which you can re-write as a vector equation:

$X^{T}\cdot y= (X^{T}\cdot X + \lambda I)\cdot \theta$

and thus, finally

$\theta = (X^{T}\cdot X + \lambda I)^{-1}\cdot X^{T}\cdot y$
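As a rough illustration of the formula, the snippet below computes $\theta$ with NumPy and checks that the gradient of the loss vanishes there. The function name `ridge_closed_form`, the simulated data, and the value of `lam` are all my own assumptions for the example. It solves the linear system $(X^{T}\cdot X + \lambda I)\cdot\theta = X^{T}\cdot y$ with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y without forming the inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                      # hypothetical design matrix
true_theta = np.array([1.0, -2.0, 0.5, 3.0])      # hypothetical coefficients
y = X @ true_theta + 0.1 * rng.normal(size=50)    # y = X theta + noise
lam = 2.0

theta_hat = ridge_closed_form(X, y, lam)

# Gradient of the ridge loss evaluated at theta_hat.
grad = 2 * X.T @ (X @ theta_hat - y) + 2 * lam * theta_hat
print(np.max(np.abs(grad)))   # should be zero up to rounding error
```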

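So this shows that if your loss is $\|y - X\cdot \theta\|_{2}^{2}+\lambda \|\theta\|_{2}^{2}$ and you want the $\theta$ which minimises it, then $\theta = (X^{T}\cdot X + \lambda I)^{-1}\cdot X^{T}\cdot y$ is the solution. For $\lambda>0$ the loss is strictly convex (its Hessian is $2(X^{T}\cdot X+\lambda I)$, which is positive definite), so this stationary point is the unique minimum. Hope this answers your question.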
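To double-check that the closed form really is the minimiser, one option (a different route from the derivation above, not part of it) is to use the fact that ridge regression equals ordinary least squares on augmented data: stack $\sqrt{\lambda}\,I$ under $X$ and zeros under $y$. The sketch below compares the two answers and also tries random perturbations around the closed-form solution. The helper `ridge_loss`, the sizes, and the data are assumptions made up for the example.

```python
import numpy as np

def ridge_loss(theta, X, y, lam):
    """||y - X theta||_2^2 + lam * ||theta||_2^2."""
    r = y - X @ theta
    return r @ r + lam * (theta @ theta)

rng = np.random.default_rng(2)
n, p, lam = 40, 5, 1.5             # arbitrary illustrative values
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Closed-form ridge solution.
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ordinary least squares on the augmented system [X; sqrt(lam) I], [y; 0].
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
theta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Random perturbations of theta_hat should never lower the loss.
base = ridge_loss(theta_hat, X, y, lam)
never_better = all(
    ridge_loss(theta_hat + 0.1 * rng.normal(size=p), X, y, lam) > base
    for _ in range(1000)
)

print(np.allclose(theta_hat, theta_aug), never_better)
```

The perturbation test is not a proof, of course, just a quick way to see the strict convexity at work.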