# Solved – explanation of MCMC and bayesian estimation

I have some questions about bayesian methods.

First of all I have a set of iid observations, calling \$y_1,ldots,y_n\$ that come from a certain distribution \$f\$ with unknown parameters(for example Normal distribution). Then I can form likelihood by \$P(D|vec{theta})=prod_i^n f(y_i)\$. Now I am going to use bayesian inference to estimate unknown parameters of posterior.

My questions are,

• How to choose prior?
• Given a prior (for example a gamma distribution) how can I estimate parameters?

Let denote parameters and data by \$theta=(theta_1,ldots,theta_p)\$, \$D=(y_1,ldots,y_n)\$ respectively. As far as I realize the posterior is

\$\$
Pos(theta|D)propto P(theta)P(D|theta)
\$\$

Because \$theta\$ is unknown, how can I compute \$P(theta)\$ and \$P(D|theta)\$?

I would be very thankful If somebody give me an explanation of MCMC and connection to above paragraphs.

I want to appreciate all of you for your help: I also found a nice tutorial for bayesian and MCMC algorithm here :https://theoreticalecology.wordpress.com/2010/09/17/metropolis-hastings-mcmc-in-r/

Contents

I sympathize with your struggle because I was formally trained frequentist and had to pick up a lot of the Bayesian material on my own. What you should realize is that unlike the frequentist approach, the Bayesian approach does not produce point estimates for model parameters but rather a distribution for them conditional on the data. That is the posterior distribution \$theta|D sim Pos(theta,|,D)\$ using your notation. So where we would ordinarily have standard errors that approach zero asymptotically as the number of observations increases, we instead have a posterior distribution who's tails will get thinner and thinner as the number of observations increase, converging on single values asymptotically.

When it comes to actually doing Bayesian inference, MCMC is the most popular tool. This is because MCMC can produce a sample from the posterior even when the mathematical form for the posterior is not known (which is the case when one is using non-conjugate priors). In other words for \$G\$ iterations of MCMC we can simulate \$theta^{(1)},theta^{(2)},…theta^{(G)} stackrel{iid}{sim} Pos(theta,|,D)\$. This posterior sample can then be used to calculate the probability of any hypothesis, i.e. \$Pr(theta_1<0)\$.

I will use the Metropolis algorithm to exemplify how MCMC works as this was how I came to understand it. (Gibbs sampling is the other popular MCMC algorithm, and there are some others too). To begin we must specify both the likelihood and the prior. For this example say we are trying to gather information on the parameters of a variable we know is normally distributed, so the parameters, \$theta_1\$ and \$theta_2\$ are the mean and variance respectively. The likelihood can then be written as

\$\$ P(y_1…y_n|theta_1,theta_2)=prod_{i=1}^{n}f(y_i|theta_1,theta_2) \$\$ where \$f(y_i ; | ; theta_1, theta_2) = frac{1}{sqrt{2pitheta_2} } ; e^{ -frac{(y_i-theta_1)^2}{2theta_2} }\$

A prior must also be specified for \$theta_1\$ and \$theta_2\$. There are many different priors we can use, for this example assume \$theta_1\$ and \$theta_2\$ to be independent and that \$theta_1\$ is distributed normal with a known mean \$mu\$ and variance \$sigma^2\$ and \$theta_2\$ is distributed chi-squared with known degrees of freedom \$k\$. So \$\$theta_1 sim P_1(theta_1) equiv frac{1}{sigmasqrt{2pi} } ; e^{ -frac{(theta_1-mu)^2}{2sigma^2} }\$\$ and \$\$ theta_2 sim P_2(theta_2) equiv begin{cases} frac{theta_2^{(k/2-1)} e^{-theta_2/2}}{2^{k/2} Gammaleft(frac{k}{2}right)}, &if;; theta_2 geq 0; \ 0, &if;; text{otherwise}. end{cases} \$\$

There is nothing entirely special about these priors, other than my needing to choose something for this example. In research, priors should always be developed in a thoughtful manner.

We can then write the posterior as being proportional to the product of the likelihood and prior, although I will use different notation here then you did in your question \$\$ P(theta_1,theta_2|y_1…y_n) propto P(y_1…y_n|theta_1,theta_2)P(theta_1,theta_2)=P(y_1…y_n|theta_1,theta_2)P_1(theta_1)P_2(theta_2) \$\$ For ease of notation, define \$boldsymbol{theta}=(theta_1,theta_2)\$. Notice that if I get values for \$boldsymbol{theta}\$ I can plug them into the likelihood and priors to get a value for \$P(y_1…y_n|boldsymbol{theta})P(boldsymbol{theta})\$

For the Metropolis algorithm, we must define yet another distribution \$J(boldsymbol{theta}_i,|,boldsymbol{theta}_j)\$ called a jump or proposal distribution where we can draw a parameter vector \$boldsymbol{theta}_i\$ conditional on a previous draw \$boldsymbol{theta}_j\$. This distribution will usually be symmetric (for non-symmetric cases we must use the more general Metropolis-Hastings algorithm) and should be a close approximation to the posterior. A common choice for the proposal is a multivariate normal with mean vector \$boldsymbol{theta}_j\$.

The Metropolis algorithm then proceeds as follows:

1. Choose an initial starting value \$boldsymbol{theta}_0\$ for which the likelihood and priors are both positive
2. For \$t in {1,2,3,…}\$ draw \$boldsymbol{theta}_{*}\$ from \$J(boldsymbol{theta}_t,|,boldsymbol{theta}_{t-1})\$ and calculate the ratio \$\$ r=frac{P(boldsymbol{theta}_{*},|,y_1…y_n)}{P(boldsymbol{theta}_{t-1},|,y_1…y_n)}=frac{P(y_1…y_n,|,boldsymbol{theta}_{*})P(boldsymbol{theta}_{*})}{P(y_1…y_n,|,boldsymbol{theta}_{t-1})P(boldsymbol{theta}_{t-1})} \$\$
3. Set \$\$ boldsymbol{theta}_t = left{ begin{array}{lr} boldsymbol{theta}_{*} ;;mathrm{with;;probability};;min{r,1}\ boldsymbol{theta} ;;mathrm{otherwise} end{array} right. \$\$ The sequence \${boldsymbol{theta}_t}\$ converges in distribution to \$P(boldsymbol{theta},|,y_1…y_n)\$. This sequence forms the Monte Carlo sample discussed above which you can use to calculate the probabilities of hypotheses about \$boldsymbol{theta}\$. There are many proofs available so instead of hacking up my own google "Metropolis-Hastings Algorithm".