Solved – Probabilistic interpretation of regression for justifying squared loss function

I was reading Andrew Ng's CS229 lecture notes (page 12) about justifying the squared-loss risk as a means of estimating regression parameters.

Andrew explains that we first need to assume that the target variable $y^{(i)}$ can be written as:

$$ y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} $$

where $\epsilon^{(i)}$ is an error term that captures unmodeled effects and random noise. Further assume that this noise is distributed as $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Thus:

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

We can see that the error term is a function of $y^{(i)}$, $x^{(i)}$ and $\theta$:

$$\epsilon^{(i)} = f(y^{(i)}, x^{(i)}; \theta) = y^{(i)} - \theta^T x^{(i)}$$

so we can substitute for $\epsilon^{(i)}$ in the density above:

$$p(y^{(i)} - \theta^T x^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

Now we know that:

$p(\epsilon^{(i)}) = p(y^{(i)} - \theta^T x^{(i)}) = p(f(y^{(i)}, x^{(i)}; \theta))$

This is a function of the random variables $x^{(i)}$ and $y^{(i)}$ (and the non-random parameter $\theta$). Andrew then favors $x^{(i)}$ as the conditioning variable and writes:

$p(\epsilon^{(i)}) = p(y^{(i)} \mid x^{(i)})$
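For context on where this density leads: if we take $p(y^{(i)} \mid x^{(i)})$ to be this Gaussian, maximizing the likelihood over $\theta$ is the same as minimizing the sum of squared residuals. A small numerical sketch can confirm this; the data, the 1-D $\theta$, and all values below are made up for illustration, not taken from the notes:

```python
import numpy as np

# Simulate y = theta_true * x + Gaussian noise (hypothetical values).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
theta_true, sigma = 2.0, 0.5
y = theta_true * x + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(theta):
    # Negative log of prod_i N(y_i - theta * x_i; 0, sigma^2):
    # a constant plus the sum of squared residuals over 2*sigma^2.
    resid = y - theta * x
    return n * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(resid**2) / (2 * sigma**2)

# Maximize the likelihood by a coarse grid search over theta.
thetas = np.linspace(0.0, 4.0, 4001)
theta_mle = thetas[int(np.argmin([neg_log_likelihood(t) for t in thetas]))]

# Closed-form least-squares solution for the 1-D case.
theta_ols = float(np.sum(x * y) / np.sum(x**2))

# The two estimates agree up to the grid resolution: maximizing the
# Gaussian likelihood is minimizing the squared loss.
```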

However, I can't seem to justify why we would favor expressing $p(\epsilon^{(i)})$ as $p(y^{(i)} \mid x^{(i)})$ and not the other way round, $p(x^{(i)} \mid y^{(i)})$.

The problem I have with his derivation is that all we are given is the distribution of the error, which, to me, seems symmetric with respect to $x$ and $y$:

$$\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

I can't see why we would favor $p(y^{(i)} \mid x^{(i)})$ over $p(x^{(i)} \mid y^{(i)})$ as the expression for $p(\epsilon^{(i)})$. "Because we are interested in $y$" is not enough of a justification for me: the fact that $y$ is our quantity of interest does not mean the equation has to take the form we want, i.e. it does not follow that it should be $p(y^{(i)} \mid x^{(i)})$, at least not from a purely mathematical perspective.

Another way of expressing my problem is the following:

The normal density seems to be symmetric in $x^{(i)}$ and $y^{(i)}$, so why favor $p(y^{(i)} \mid x^{(i)})$ over $p(x^{(i)} \mid y^{(i)})$? Furthermore, in a supervised learning setting we observe the pairs $(x^{(i)}, y^{(i)})$ together, right? It's not as though we get one first and then the other.

Basically, I am just trying to understand why $p(y^{(i)} \mid x^{(i)})$ is the correct substitution for $p(\epsilon^{(i)})$ and why $p(x^{(i)} \mid y^{(i)})$ is not.

Overall, you're correct: $p(x \mid y)$ will also be a normally distributed function of the size of the error. However, in general you will be using multiple exogenously fixed input variables $x$ to predict a single output variable $y$, so we're rarely interested in guessing $x$ based on what we know about $y$.

An example will be helpful here. Suppose you have a set of pictures of animals and you want to know the type of animal present in each picture. Your $x$ will be an image, and $y$ will be the type of animal in the image. $p(y \mid x)$ makes a lot of sense: we're trying to find, probabilistically, the correct class label for each image.

$p(x \mid y)$ is kind of odd: it's the probability of a single image, given that the image's label is, say, "cat". If you had a 256 × 256 image with 16-bit pixels, there are $2^{2^{20}}$ different images you could make, which makes any individual image's probability so tiny as to pretty much defy interpretation.

If we wanted to know $p(x \mid y)$, we would use Bayes' law to compute $p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$.
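As a sanity check on that formula, here's a small discrete toy example (all probabilities below are made up) showing that inverting $p(y \mid x)$ with Bayes' law yields a valid conditional $p(x \mid y)$:

```python
import numpy as np

# Hypothetical discrete world: two inputs x in {0, 1}, two labels y in {0, 1}.
p_x = np.array([0.6, 0.4])                  # prior over x
p_y_given_x = np.array([[0.9, 0.1],         # rows index x, columns index y
                        [0.2, 0.8]])

p_y = p_x @ p_y_given_x                     # marginal p(y) = sum_x p(y|x) p(x)

# Bayes' law: p(x|y) = p(y|x) p(x) / p(y), done for all (x, y) at once
# via broadcasting.
p_x_given_y = (p_y_given_x * p_x[:, None]) / p_y[None, :]

# Each column of p_x_given_y is a proper distribution over x: it sums to 1.
```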

On the other hand, $p(y \mid x)$ can be represented as a single-variable normal distribution describing our belief about $y$ given that we know $x$, which is usually the more tractable task, and hence the one we're usually more interested in solving.
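Concretely, under the model above, $p(y \mid x)$ is just a one-dimensional normal density centered at $\theta^T x$. A minimal sketch, where the values of $\theta$ and $\sigma$ are hypothetical stand-ins for fitted parameters:

```python
import math

# Assumed (not fitted) parameters, for illustration only.
theta, sigma = 2.0, 0.5

def p_y_given_x(y, x):
    # Density of y at a fixed input x: N(theta * x, sigma^2).
    mu = theta * x
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

# For any fixed x, this is a proper density in the single variable y:
# it integrates to 1 over y.
```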
