# Estimation of parameters as the mode of the posterior distribution

I was reading http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation. It says that estimating the parameter $\theta$ amounts to finding the mode of the posterior distribution. I don't see why this is true.


It is high time to convert my comment to an answer.

Bayesian inference has a theoretical foundation, namely Bayesian decision theory. The key tools of Bayesian decision theory are loss functions.

A Bayesian estimate is derived from Bayesian decision theory as follows. Define a "distance" $d(\theta,\theta')$ on the parameter space (I will come back to this point in the last paragraph of my answer). This is a loss function for the estimation problem: the Bayesian estimate associated with this loss function is the value $\theta_0$ minimizing the expected posterior loss $$\int d(\theta_0,\theta)\,\pi(\theta \mid x)\,\mathrm{d}\theta,$$ where, in accordance with the usual notation, $\pi(\theta \mid x)$ denotes the posterior distribution.
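This minimisation can be sketched numerically on a grid. The posterior below is a hypothetical choice (Beta(3, 5), not anything from the question), used only to illustrate the construction:

```python
import numpy as np

# Sketch of expected-posterior-loss minimisation on a grid.
# Hypothetical posterior: Beta(3, 5), whose mean is 3/8 = 0.375.
theta = np.linspace(0.001, 0.999, 999)
dx = theta[1] - theta[0]
posterior = theta**2 * (1 - theta)**4      # unnormalised Beta(3, 5) density
posterior /= posterior.sum() * dx          # normalise numerically

def bayes_estimate(loss):
    """Return the grid point theta_0 minimising the Riemann sum
    approximating  int loss(theta_0, theta) pi(theta | x) dtheta."""
    risks = [np.sum(loss(t0, theta) * posterior) * dx for t0 in theta]
    return theta[np.argmin(risks)]

# Squared loss recovers the posterior mean of Beta(3, 5):
est = bayes_estimate(lambda t0, t: (t0 - t) ** 2)   # ≈ 0.375
```

Any loss function can be plugged into `bayes_estimate`; the next paragraph lists the classical choices.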

When $d(\theta,\theta')=(\theta-\theta')^2$ is the squared loss, the corresponding Bayesian estimate turns out to be the mean of the posterior distribution. When $d(\theta,\theta')=|\theta-\theta'|$ is the absolute deviation loss, the Bayesian estimate is the median of the posterior distribution. The posterior mode arises as the limiting case of the $0$–$1$ loss $d(\theta,\theta')=\mathbb{1}\{|\theta-\theta'|>\epsilon\}$ as $\epsilon \to 0$; see the excellent book The Bayesian Choice by Christian Robert. The choice of the Bayesian estimate is thus driven by the choice of the loss function, in other words by the question: "what do you want to minimize?"
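The mean and median claims can be checked numerically. Here the posterior is taken to be Gamma(2, 1) on a grid (an illustrative assumption, chosen because it is skewed: its mean is 2 while its median is about 1.678, so the two losses pick visibly different estimates):

```python
import numpy as np

# Numerical check: squared loss -> posterior mean, absolute loss -> posterior median.
# Illustrative grid posterior: Gamma(2, 1), density proportional to theta * exp(-theta).
theta = np.linspace(0.01, 15.0, 1500)
dx = theta[1] - theta[0]
post = theta * np.exp(-theta)              # unnormalised Gamma(2, 1) density
post /= post.sum() * dx                    # normalise numerically

def argmin_expected_loss(loss):
    """Grid point theta_0 minimising  int loss(theta_0, theta) pi(theta | x) dtheta."""
    risks = [np.sum(loss(t0, theta) * post) * dx for t0 in theta]
    return theta[np.argmin(risks)]

mean_est = argmin_expected_loss(lambda a, b: (a - b) ** 2)   # ≈ 2.0  (posterior mean)
med_est = argmin_expected_loss(lambda a, b: np.abs(a - b))   # ≈ 1.68 (posterior median)
```

On a symmetric posterior the two estimates would coincide; skewness is what separates them.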

Now let me note that choosing a distance $d(\theta,\theta')$ between the possible values of the parameter does not sound like a sensible idea. For example, the difference between the Poisson distributions ${\cal P}(\theta)$ and ${\cal P}(\theta')$ is clearly not as pronounced for $\theta=1000$ and $\theta'=1010$ as for $\theta=1$ and $\theta'=11$. Thus it is more sensible to use a "distance" between the sampling distributions $p_\theta$ and $p_{\theta'}$ rather than between the parameters $\theta$ and $\theta'$. In my paper I studied Bayesian inference based on such a "distance" (the intrinsic discrepancy loss function) for a very simple model. Other possible choices of an "intrinsic loss function" are given in Robert's paper cited in the references.
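The Poisson example can be made concrete. The Kullback–Leibler divergence between ${\cal P}(\theta)$ and ${\cal P}(\theta')$ has the closed form $\theta\log(\theta/\theta') + \theta' - \theta$, and the intrinsic discrepancy is the smaller of the two directed divergences. A minimal sketch (my own illustration, not code from the paper):

```python
import math

def kl_poisson(a, b):
    """KL divergence KL(Poisson(a) || Poisson(b)) in closed form."""
    return a * math.log(a / b) + b - a

def intrinsic_discrepancy(a, b):
    """Minimum of the two directed KL divergences (a symmetric 'distance')."""
    return min(kl_poisson(a, b), kl_poisson(b, a))

# Same Euclidean gap |theta - theta'| = 10 in both cases, but:
d_small = intrinsic_discrepancy(1000.0, 1010.0)   # ≈ 0.05: nearly identical distributions
d_large = intrinsic_discrepancy(1.0, 11.0)        # ≈ 7.6:  very different distributions
```

The intrinsic discrepancy thus registers the difference between the sampling distributions, which the raw parameter distance $|\theta-\theta'|$ misses entirely.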
