I'm struggling with the mathematics behind linear regression. Below I have pasted the text from the book Pattern Recognition and Machine Learning (p. 46), where the author derives the regression function $\mathbb{E}_{t}[t | \mathbf{x}]$. I want to understand the procedure from equation (2) to the final result. Could somebody please give me some useful pointers (and/or links) to the concepts from the calculus of variations that I should study?
The average, or expected, loss is given by
$$
\mathbb{E}[L] = \int \int L(t, y(\mathbf{x})) \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt.
\tag{1}
$$
A common choice of loss function in linear regression is the squared loss given by $L(t, y(\mathbf{x})) = \{ y(\mathbf{x}) - t \}^{2}$. In this case, the expected loss can be written as
$$
\mathbb{E}[L] = \int \int \{ y(\mathbf{x}) - t \}^{2} \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt.
\tag{2}
$$
Our goal is to choose $y(\mathbf{x})$ so as to minimize $\mathbb{E}[L]$. We can do this using the calculus of variations to give
$$
\dfrac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})} = 2 \int \{ y(\mathbf{x}) - t \} \, p(\mathbf{x}, t) \, dt = 0.
\tag{3}
$$
Solving for $y(\mathbf{x})$, and using the sum and product rules of probability, we obtain
$$
y(\mathbf{x}) = \dfrac{\int t \, p(\mathbf{x}, t) \, dt}{p(\mathbf{x})} = \int t \, p(t | \mathbf{x}) \, dt = \mathbb{E}_{t}[t | \mathbf{x}].
\tag{4}
$$
Best Answer
I am assuming your difficulty is in the jump between Eq. (2) and Eq. (3). All you need is the Euler–Lagrange equation. In its standard notation, the integrand $F(x, y, \dot{y})$ would be your inner integral $\int \{ y(\mathbf{x}) - t \}^{2} \, p(\mathbf{x}, t) \, dt$, which involves no $\dot{y}$ at all, so that $\partial F / \partial \dot{y} = 0$, for instance.
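
To spell that out (my own elaboration, in the notation of the quoted passage): rewrite (2) with the $t$-integral done first,
$$
\mathbb{E}[L] = \int F(\mathbf{x}, y) \, d\mathbf{x}, \qquad F(\mathbf{x}, y) = \int \{ y(\mathbf{x}) - t \}^{2} \, p(\mathbf{x}, t) \, dt.
$$
The Euler–Lagrange equation for a functional $\int F(x, y, \dot{y}) \, dx$ is
$$
\frac{\partial F}{\partial y} - \frac{d}{dx} \frac{\partial F}{\partial \dot{y}} = 0.
$$
Because $F$ here contains no $\dot{y}$, the second term vanishes and stationarity reduces to $\partial F / \partial y = 0$. Differentiating under the integral sign then gives
$$
\frac{\partial F}{\partial y} = 2 \int \{ y(\mathbf{x}) - t \} \, p(\mathbf{x}, t) \, dt = 0,
$$
which is exactly equation (3). So the concepts to study are functional derivatives and the Euler–Lagrange equation; Bishop's own Appendix D (Calculus of Variations) covers precisely this step.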
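
Getting from (3) to (4) is then ordinary algebra plus the sum and product rules: since $y(\mathbf{x})$ does not depend on $t$, move it out of the integral,
$$
y(\mathbf{x}) \int p(\mathbf{x}, t) \, dt = \int t \, p(\mathbf{x}, t) \, dt,
$$
apply the sum rule $\int p(\mathbf{x}, t) \, dt = p(\mathbf{x})$ on the left, divide by $p(\mathbf{x})$, and use the product rule $p(\mathbf{x}, t) = p(t | \mathbf{x}) \, p(\mathbf{x})$ inside the remaining integral to obtain (4).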
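
If a numerical sanity check helps intuition, here is a minimal sketch (my own toy example, not from the book; the distribution and all names are assumptions): with squared loss, the conditional mean $\mathbb{E}[t | x]$ should beat any vertically shifted predictor.

```python
# Toy check (not from Bishop): for squared loss, y(x) = E[t | x]
# minimizes the average loss. Here t = sin(x) + noise, so E[t | x] = sin(x).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 2.0 * np.pi, size=n)
t = np.sin(x) + rng.normal(scale=0.3, size=n)

def avg_squared_loss(y):
    """Monte Carlo estimate of E[{y(x) - t}^2] for a predictor y."""
    return np.mean((y(x) - t) ** 2)

# Compare the conditional mean against shifted predictors y(x) = sin(x) + c.
for c in (-0.2, -0.1, 0.0, 0.1, 0.2):
    loss = avg_squared_loss(lambda u, c=c: np.sin(u) + c)
    print(f"shift c = {c:+.1f}: average loss ~ {loss:.4f}")
# The loss is smallest at c = 0.0 (about 0.09, the noise variance)
# and grows by roughly c**2 for the shifted predictors.
```

The excess loss of $c^2$ for a shift $c$ mirrors the decomposition of the expected squared loss into a term measuring the deviation of $y(\mathbf{x})$ from $\mathbb{E}[t | \mathbf{x}]$ plus irreducible noise.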