I took a look at a couple of derivations of this decomposition, and I think they all require that $E[\epsilon \hat{f}] = 0$.

The most transparent one I found was from Wikipedia here.

I reproduce the steps below for your convenience.

The data generating process is $y = f(x) + \epsilon$. As a shorthand, let $f = f(x)$ and let $\hat{f} = \hat{f}(x)$ be the fitted model.

$$\begin{align}
\mathrm{E}\big[(y - \hat{f})^2\big]
&= \mathrm{E}[y^2 + \hat{f}^2 - 2 y\hat{f}] \\
&= \mathrm{E}[y^2] + \mathrm{E}[\hat{f}^2] - \mathrm{E}[2y\hat{f}] \\
&= \mathrm{Var}[y] + \mathrm{E}[y]^2 + \mathrm{Var}[\hat{f}] + \mathrm{E}[\hat{f}]^2 - 2f\mathrm{E}[\hat{f}] \\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat{f}] + (f - \mathrm{E}[\hat{f}])^2 \\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat{f}] + \mathrm{E}[f - \hat{f}]^2 \\
&= \sigma^2 + \mathrm{Var}[\hat{f}] + \mathrm{Bias}[\hat{f}]^2
\end{align}$$
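As a sanity check, the final identity can be verified numerically. Below is a minimal sketch of my own (not from Wikipedia), using a deliberately biased toy estimator at a single fixed $x$; the training noise and the test noise are drawn independently, so $E[\epsilon \hat{f}] = 0$ holds by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single fixed x with f(x) = 2 and noise eps ~ N(0, 1),
# so sigma^2 = 1.
f_x, sigma, n_train, n_trials = 2.0, 1.0, 10, 200_000

# Each trial draws a fresh training set and a fresh test point, so the
# training noise (inside f_hat) and the test noise are independent.
train = rng.normal(f_x, sigma, size=(n_trials, n_train))
f_hat = 0.8 * train.mean(axis=1)             # deliberately biased estimator
y = f_x + rng.normal(0.0, sigma, n_trials)   # y = f(x) + epsilon

mse = np.mean((y - f_hat) ** 2)
bias_sq = (f_x - f_hat.mean()) ** 2
variance = f_hat.var()

# mse should match sigma^2 + Var[f_hat] + Bias[f_hat]^2
# up to Monte Carlo error
print(mse, sigma**2 + variance + bias_sq)
```

Here the true bias is $2 - 0.8 \cdot 2 = 0.4$, so the squared-bias term should come out near $0.16$.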

My question is about the third step, where they use

$$ E[2y\hat{f}] = 2fE[\hat{f}] $$

I think what they are assuming is that $E[\epsilon \hat{f}] = 0$.
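Expanding $y = f + \epsilon$ makes the role of this assumption explicit:

$$ E[2y\hat{f}] = 2E[(f + \epsilon)\hat{f}] = 2fE[\hat{f}] + 2E[\epsilon\hat{f}] $$

so the step goes through exactly when $E[\epsilon\hat{f}] = 0$.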

Is it because we are assuming that $X$ is independent of $\epsilon$? Is that necessary for this derivation? It seems like such a big assumption that I feel I must be missing something. Is there another decomposition that doesn't require it?


#### Best Answer

I find the most difficult part of this subject to be keeping track of what is being integrated over (the expectation with respect to what?). Many expositions take this lightly, and the end result can be highly confusing. Here is The Elements of Statistical Learning on this point:

> Discussions of error rate estimation can be confusing, because we have to make clear which quantities are fixed and which are random.

I think you're seeing a bit of that in the Wikipedia exposition.

**My Preferred Derivation**

Here's a detailed breakdown. Working out the exposition this way is what finally cleared up the various concepts at play for me. I hope it helps you too.

The first step is to condition on $x$ and take the expectation with respect to $y$. There are no assumptions in this part; $g$ is a general function:

$$\begin{align*}
ESE(f; x) &= E_Y\left[\left(y - g(x)\right)^2 \mid x\right] \\
&= E_Y\left[\left(y - E[y \mid x] + E[y \mid x] - g(x)\right)^2 \mid x\right] \\
&= E_Y\left[\left(y - E[y \mid x]\right)^2 \mid x\right] + E_Y\left[\left(E[y \mid x] - g(x)\right)^2 \mid x\right] \\
&\quad + 2 E_Y\left[\left(y - E[y \mid x]\right)\left(E[y \mid x] - g(x)\right) \mid x\right]
\end{align*}$$

The trick here is introducing $E[y \mid x]$ into the middle of the sum. This expectation is over $y$ (the only thing that makes sense in this context), so $E[y \mid x]$ is just a function of $x$, same as $g$.

Now we may observe that in

$$ 2 E_Y\left[\left(y - E[y \mid x]\right)\left(E[y \mid x] - g(x)\right) \mid x\right] $$

the *second* factor has no dependence on $y$, and the *first* factor is zero in expectation, so this term dies, and we are left with

$$ ESE(f; x) = E_Y\left[\left(y - E[y \mid x]\right)^2 \mid x\right] + E_Y\left[\left(E[y \mid x] - g(x)\right)^2 \mid x\right] $$

The first term is the irreducible error of the model. It's your $\sigma^2$ above.

It's worth noting that only the integrand of the *first* term in this result has any dependence on $y$, so the $E_Y$ may be dropped from the second term.
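Spelled out, the cross term dies because the $y$-free factor pulls out of the conditional expectation and what remains integrates to zero:

$$\begin{align*}
& 2 E_Y\left[\left(y - E[y \mid x]\right)\left(E[y \mid x] - g(x)\right) \mid x\right] \\
&\quad = 2\left(E[y \mid x] - g(x)\right) E_Y\left[\,y - E[y \mid x] \mid x\,\right] \\
&\quad = 2\left(E[y \mid x] - g(x)\right)\left(E[y \mid x] - E[y \mid x]\right) = 0
\end{align*}$$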

To get to the bias-variance breakdown, we introduce an expectation over the *sampling* distribution of the training data. That is, we consider $g$ itself as a random variable: training a learning algorithm takes us from a data set $D$ to a function $g$. Thus, it makes sense to take expectations over this sampling distribution with $g$ as a random variate. I'll write $g(x, D)$ to emphasize this dependence of $g$ on the training dataset $D$. With this notation, the second term in our final equation above becomes

$$ \left[\left(E[y \mid x] - g(x,D)\right)^2 \mid x\right] $$

Taking the expectation of everything over $D$, and writing $Eg(x)$ for the expectation of $g(x, D)$ with respect to this sampling distribution, we can break this down further

$$\begin{align*}
& E_{D}\left[\left(E[y \mid x] - g(x,D)\right)^2 \mid x\right] \\
&= E_{D}\left[\left(E[y \mid x] - Eg(x) + Eg(x) - g(x,D)\right)^2 \mid x\right] \\
&= E_{D}\left[\left(E[y \mid x] - Eg(x)\right)^2 \mid x\right] + E_{D}\left[\left(Eg(x) - g(x, D)\right)^2 \mid x\right] \\
&\quad + 2 E_{D}\left[\left(E[y \mid x] - Eg(x)\right)\left(Eg(x) - g(x,D)\right) \mid x\right]
\end{align*}$$

The same trick applies. In this breakdown the *first* factor has no dependence on $D$, and the *second* is zero in expectation, so the cross term dies. Therefore

$$ E_{D}\left[\left(E[y \mid x] - g(x,D)\right)^2 \mid x\right] = E_{D}\left[\left(E[y \mid x] - Eg(x)\right)^2 \mid x\right] + E_{D}\left[\left(Eg(x) - g(x, D)\right)^2 \mid x\right] $$

The first term here is the (squared) *bias*, and the second is the *variance*.

Of course, I kept the conditioning on $x$ the entire time, so what we have really arrived at is the *pointwise* decomposition. To get the usual full decomposition of the total error, all that is needed is to integrate out $x$ as well.
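To see the pointwise decomposition in action, here is a small simulation of my own construction (the setup is illustrative, not from the question): $E[y \mid x] = \sin(x)$, and $g(\cdot, D)$ is a degree-1 polynomial fit, which underfits $\sin$ and so carries genuine bias at most points. Note the identity holds *exactly* for the sample moments, since `np.var` uses the population convention:

```python
import numpy as np

rng = np.random.default_rng(1)

# E[y | x] = sin(x), noise sigma = 0.3, linear fit evaluated at x0 = 2
sigma, n, x0, n_datasets = 0.3, 30, 2.0, 20_000

g_at_x0 = np.empty(n_datasets)
for i in range(n_datasets):
    x = rng.uniform(0.0, 3.0, n)                 # draw a training set D
    y = np.sin(x) + rng.normal(0.0, sigma, n)
    coeffs = np.polyfit(x, y, deg=1)             # train g(., D)
    g_at_x0[i] = np.polyval(coeffs, x0)          # evaluate at the fixed x0

lhs = np.mean((np.sin(x0) - g_at_x0) ** 2)   # E_D[(E[y|x0] - g(x0,D))^2 | x0]
bias_sq = (np.sin(x0) - g_at_x0.mean()) ** 2
variance = g_at_x0.var()

print(lhs, bias_sq + variance)  # equal up to floating point
```

Repeating this over a grid of $x$ values and averaging would give the full (integrated-over-$x$) decomposition.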

**How This Relates to the Wikipedia Breakdown**

Wikipedia writes $f(x)$ for my $E[y \mid x]$. I think making this explicit makes the exposition easier to follow. Wikipedia writes

$$ y = f(x) + \epsilon $$

with the assumption that $E(\epsilon) = 0$. What is really meant is $E(\epsilon \mid x) = 0$. This is implicit in my exposition; I simply use $E[y \mid x]$ throughout.
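Under that assumption the two notations coincide:

$$ E[y \mid x] = E[f(x) + \epsilon \mid x] = f(x) + E(\epsilon \mid x) = f(x) $$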

Note that no independence assumptions are necessary for the bias-variance decomposition. Where Wikipedia says

> Thus, since $\epsilon$ and $\hat{f}$ are independent

what they really mean is that $\hat{f}$ can be considered a constant when conditioning on $x$, so it can be taken outside the expectation.