In Bayesian estimation, we need to compute the normalizing factor $P(X)$. Say that our parameter space is $y$. Then, in order to compute the Bayesian evidence, we'd need to marginalize the likelihood, weighted by the prior, over all the possible parameters in our parameter space:

$$P(X) = \int_y P(X \mid y)\, P(y)\, dy$$
I don't understand why that is so difficult to compute. Is it not a straightforward application of the law of total probability?
Best Answer
I like @ShijiaBian's answer. I would add the following.
The normalizing constant is important because without it, (1) you won't have a valid probability distribution and (2) you can't assess the relative probabilities of different values of the parameter. For example, if you modeled data $x_t$ as Gaussian conditional on a mean $\theta$ that was itself modeled as Poisson, you would not be able to average the likelihood over the values of the parameter analytically, because the infinite sum over the product of the two PDFs' kernels is not available in closed form. Mathematically:
$$
\begin{align}
p(\theta) &= \text{Poisson}(\lambda)\\
p(x_t|\theta) &= \mathcal{N}(\theta, \sigma^2)\\
p(\theta | \mathbf{X}) &= \frac{p(\theta)\prod_t p(x_t|\theta)}{\sum_\Theta p(\theta)\prod_t p(x_t|\theta)}
\end{align}
$$
Expanding the numerator, you'll find that:
$$
p(\theta | \mathbf{X}) \propto \frac{1}{\theta!}\lambda^\theta(2\pi\sigma^2)^{-T/2}\prod_t\exp\left[\frac{-1}{2\sigma^2}\left(x_t^2 - 2\theta x_t + \theta^2\right) - \frac{\lambda}{T}\right]
$$
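For concreteness, here is a minimal numerical sketch of that kernel, assuming made-up values for $\lambda$, $\sigma^2$, and $T$, and simulated data $x_t$ (none of these numbers come from the question; they're purely illustrative):

```python
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(0)
lam, sigma, T = 4.0, 1.5, 50                  # hypothetical prior rate, noise sd, sample size
x = rng.normal(loc=3.0, scale=sigma, size=T)  # simulated data with a true mean of 3

def unnormalized_log_posterior(theta):
    """log of p(theta) * prod_t p(x_t | theta), i.e. the kernel above, up to the normalizer."""
    log_prior = poisson.logpmf(theta, mu=lam)               # log Poisson(lambda) pmf at theta
    log_lik = norm.logpdf(x, loc=theta, scale=sigma).sum()  # sum_t log N(x_t | theta, sigma^2)
    return log_prior + log_lik

# These are kernel values, not probabilities: the sum over theta = 0, 1, 2, ...
# needed to normalize them has no closed form.
for theta in range(6):
    print(theta, unnormalized_log_posterior(theta))
```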
To normalize this function, you'd have to sum over all the possible (discrete) values of $\theta$: $0, 1, 2, \ldots, \infty$. This is impossible analytically because there is no closed-form expression for an infinite sum of the above form. If you don't do this, however, your function will not sum to $1$ and you won't have a valid probability distribution. Furthermore, normalizing ensures that for each value $\theta = \theta^*$, you can exactly determine the probability of $\theta^*$ relative to other values of $\theta$.
Expanding on this second point, if you only normalized over values of $\theta$ from, say, $0$ through $10$, then you could not compare how likely values of $\theta$ outside that range are to values inside it. This does suggest, however, that if you have some belief about the range of values to which $\theta$ may be restricted, you could truncate your distribution to that range and perform the summation numerically within it, say from $0$ to $10$. Then you would have a valid (truncated) discrete probability distribution over the values from $0$ to $10$. This is much harder when $\theta$ is continuous (say, Beta or Gamma distributed), although you could perform numerical integration. Numerical integration is difficult in high dimensions, however, so you'd have to restrict the dimension of $\theta$ to something that is computationally feasible.
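Here is a sketch of that truncation idea under the same hypothetical setup as the snippet above: restrict $\theta$ to $0$ through $10$, normalize numerically over that range, and compare relative probabilities within it.

```python
import numpy as np
from scipy.stats import norm, poisson

lam, sigma = 4.0, 1.5                                 # same made-up prior rate and noise sd as above
x = np.random.default_rng(0).normal(3.0, sigma, 50)   # simulated data

thetas = np.arange(0, 11)                             # truncated support 0..10
log_kernel = np.array([poisson.logpmf(t, mu=lam) + norm.logpdf(x, loc=t, scale=sigma).sum()
                       for t in thetas])

# Normalize in log space (log-sum-exp) for numerical stability, then exponentiate.
post = np.exp(log_kernel - np.logaddexp.reduce(log_kernel))

print(post.sum())         # ~1.0: a valid pmf on the truncated range 0..10
print(post[3] / post[5])  # relative probability of theta = 3 vs theta = 5 within that range
```

Of course, the caveat above still applies: these probabilities say nothing about how likely values of $\theta$ greater than $10$ are.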