$$\hat{\theta}=\arg\min_{\theta}\{ ||y-X\theta||_2^2+\lambda||\theta||_2^2\},$$ where $X$ is an $n\times p$ matrix.
We have: if $y=X\theta+\varepsilon$, then $$\hat{\theta}^{\text{ridge}}=(X^TX+\lambda I)^{-1}X^Ty.$$ So I'm a bit confused, because if $y=X\theta+\varepsilon$, then $||y-X\theta||_2^2+\lambda||\theta||_2^2=||\varepsilon||_2^2+\lambda||\theta||_2^2.$ But I don't see how to show that
$$(X^TX+\lambda I)^{-1}X^Ty=\arg\min_{\theta}\{ ||y-X\theta||_2^2+\lambda||\theta||_2^2\}.$$ Any help would be much appreciated. Thank you.
Best Answer
I think about the problem in summation notation.
The loss is defined, as you said, as $L = \sum_{i=1}^{N}\left(\sum_{j=1}^{M}\theta_{j}X_{ij} - y_{i}\right)^{2}+ \lambda\sum_{j=1}^{M}\theta_{j}^{2}$
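As a quick aside (my own sketch, not part of the original derivation; the sizes `N`, `M` and the penalty `lam` are made-up illustrative values), the summation form above agrees numerically with the matrix form $||y-X\cdot\theta||_{2}^{2}+\lambda||\theta||_{2}^{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 50, 5, 0.3          # illustrative sizes and penalty
X = rng.normal(size=(N, M))
y = rng.normal(size=N)
theta = rng.normal(size=M)

# Summation form: sum_i (sum_j theta_j X_ij - y_i)^2 + lambda * sum_j theta_j^2
loss_sum = sum((X[i] @ theta - y[i]) ** 2 for i in range(N)) + lam * np.sum(theta ** 2)

# Matrix form: ||y - X theta||^2 + lambda ||theta||^2
loss_mat = np.linalg.norm(y - X @ theta) ** 2 + lam * np.linalg.norm(theta) ** 2

print(np.isclose(loss_sum, loss_mat))   # True
```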
You can differentiate this w.r.t. $\theta_{k}$ to find:
$\frac{\partial L}{\partial \theta_{k}} =\sum_{i=1}^{N}2\left(\sum_{j=1}^{M}\theta_{j}X_{ij}-y_{i}\right)X_{ik} +2\lambda \theta_{k}$
Note that $\sum_{i=1}^{N}X_{ik}\sum_{j=1}^{M}\theta_{j}X_{ij}=\sum_{i=1}^{N}X_{ik}(X\cdot \theta)_{i}=(X^{T}\cdot X \cdot \theta)_{k}$
and $\sum_{i=1}^{N}X_{ik}y_{i}=(X^{T}\cdot y)_{k}$
Putting this together and setting the derivative to zero (dividing through by $2$):
$(X^{T}\cdot X\cdot \theta)_{k}-(X^{T}\cdot y)_{k} +\lambda \theta_{k}=0 \hspace{5mm}\forall k$
which you can re-write as a vector equation:
$X^{T}\cdot y= (X^{T}\cdot X + \lambda I)\cdot \theta$
and thus, finally
$\theta = (X^{T}\cdot X + \lambda I)^{-1}\cdot X^{T}\cdot y$
So this has shown that if you assume your loss is given by $||y - X\cdot \theta ||_{2}^{2}+\lambda ||\theta||_{2}^{2}$ and you wish to find the $\theta$ which minimises this loss, then $\theta = (X^{T}\cdot X + \lambda I)^{-1}\cdot X^{T}\cdot y$ is the solution. Note that for $\lambda>0$ the loss is strictly convex (its Hessian $2(X^{T}\cdot X+\lambda I)$ is positive definite), so this stationary point is the unique global minimiser and $X^{T}\cdot X+\lambda I$ is always invertible. Hope this answers your question.
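If it helps, here is a small numerical check (again my own sketch; the data and the penalty value `lam` are made up for illustration) that the closed-form expression matches a direct numerical minimisation of the penalised loss:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, M, lam = 80, 4, 0.5          # illustrative problem size and penalty
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

def ridge_loss(theta):
    # ||y - X theta||^2 + lambda ||theta||^2
    return np.linalg.norm(y - X @ theta) ** 2 + lam * np.linalg.norm(theta) ** 2

# Closed-form solution: (X^T X + lambda I)^{-1} X^T y
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Direct numerical minimisation of the same loss
theta_numeric = minimize(ridge_loss, x0=np.zeros(M)).x

print(np.allclose(theta_closed, theta_numeric, atol=1e-4))   # True
```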