I know there have already been a lot of questions about why the likelihood is not a probability density function, and I've read most of the answers. However, it is still not clear to me why the likelihood is not a pdf. Several arguments have been given, mainly that
- it does not integrate to 1
- it is a distribution over the parameters with the data fixed
However, there has also been an accepted answer that says "it is the probability (density) of the data given the parameter value", which to me sounds like a probability then.
My general confusion about the likelihood comes down to the following points:
1.) The likelihood is (often) defined as $L(\theta|X)=p(X|\theta)$. But that IS a (conditional) pdf. I see no way to interpret it otherwise (e.g. by holding either $X$ or $\theta$ fixed). The above expression means that I have a pdf over the random variable $X$ conditioned on the parameter $\theta$, which is in fact a conditional pdf, no?
2.) One could argue (as on Wikipedia) that the likelihood function is defined as $L(\theta|X)=p(X;\theta)$, i.e., explicitly not as a conditional pdf. However, in Bayes' theorem the likelihood is always a conditional pdf, since Bayes' theorem is in principle only a consequence of the definition of conditional probability (density). Therefore, in Bayes' theorem I have to interpret the likelihood as a conditional probability density.
3.) I am also confused about the definition of the likelihood in the frequentist and Bayesian frameworks. In the former, one assumes the data to be random variables and the parameters to be fixed unknowns; in the latter, one assumes the data to be fixed and the parameters to be random variables. So it seems that the interpretation of the likelihood also depends on the framework I am working in?
4.) The pdf of a given distribution is often written as a conditional probability, e.g., the Gaussian is often written as $p(X=x|\mu, \sigma)$ and then treated as a likelihood when using e.g. Bayes' theorem. In that case we explicitly assume the likelihood to be a (conditional) pdf. How is this justified (if the likelihood is not a conditional pdf)?
5.) Why do many textbooks in applied statistics and machine learning simply use the likelihood as a conditional pdf, just like in point 4.), if this is not correct?
EDIT: The discussions I have looked at include:
What is the reason that a likelihood function is not a pdf?
How to rigorously define the likelihood?
What is the difference between "likelihood" and "probability"?
To understand why the likelihood is not a pdf, we first have to be clear about what a function is. A function has a parameter (its argument) and maps values of this parameter to some output. In particular, a pdf takes a continuous variable as its input (parameter) and maps it to a probability density.
Therefore, a pdf must integrate to 1 when integrated over this parameter. E.g. for $p(X|\theta)$ the variable $X$ is the parameter, and therefore we must have $\int_\Omega p(X|\theta)\, dX = 1$, where $\Omega$ is the space from which $X$ can be chosen. The most important point here is that $\theta$ is not a parameter in this sense. This is a bit confusing because in other branches of mathematics anything between "(" and ")" is a parameter. It is better to think of $p(X|\theta)$ as a special way of writing $p_\theta(X)$.
For the likelihood, we have $L(\theta|X)$, and now $\theta$ is the parameter and $X$ is not. Again, think of this as a special way of writing $L_X(\theta)$. So for the likelihood to be a pdf, it would have to integrate to 1 when integrated over $\theta$. However, usually we have $\int_\Theta L(\theta|X)\, d\theta \neq 1$, where $\Theta$ is the complete set of possible model parameters.
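As a rough numerical illustration of the two statements above, here is a minimal sketch. It assumes a binomial model with $n = 10$ trials, which is not part of the question; because the data are discrete in this model, a sum over $X$ takes the place of the integral over $\Omega$. The names `binom_pmf` and `midpoint_integral` are made up for this example.

```python
from math import comb

def binom_pmf(x, n, theta):
    """p(x | theta): probability of x successes in n Bernoulli(theta) trials."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def midpoint_integral(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

n = 10  # assumed number of trials

# Fixed theta, varying data x: a genuine probability distribution over x.
theta = 0.3
total_over_x = sum(binom_pmf(x, n, theta) for x in range(n + 1))

# Fixed data x, varying theta: the likelihood L(theta | x) as a function of theta.
x_obs = 7
area_over_theta = midpoint_integral(lambda t: binom_pmf(x_obs, n, t), 0.0, 1.0)

print(total_over_x)     # close to 1
print(area_over_theta)  # close to 1/(n+1), i.e. about 0.0909, not 1
```

The same formula is evaluated in both cases; only the variable that is summed or integrated over changes. For this particular model the area under the likelihood is exactly $1/(n+1)$ for every observed $x$. Some models do happen to give an area of 1 (a Gaussian with known variance, viewed as a function of its mean, for instance), which is why the statement above says "usually".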
So why can the likelihood be used in Bayes' theorem? In essence, the likelihood values for different data (different $X$) but the same parameters (same $\theta$) form a conditional pdf, namely $p(X|\theta)$. This means that if you only need the value $p(X|\theta)$ (as in Bayes' theorem), it does not matter where you get it from. Thus, you can use the likelihood to calculate that value. That is the difference between viewing the likelihood as a function (over $\theta$) and using a single value of it.
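To make this concrete, the following sketch plugs likelihood values into Bayes' theorem on a grid of $\theta$ values. It assumes a binomial model with a uniform prior on $[0, 1]$, neither of which appears in the question, and the function names are hypothetical. The likelihood itself is never normalized over $\theta$; dividing by the evidence is what turns the product of likelihood and prior into a proper pdf over $\theta$.

```python
from math import comb

def likelihood(theta, x, n):
    """Value of p(x | theta) for a binomial model, read as a function of theta."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def prior(theta):
    """Assumed uniform prior density on [0, 1]."""
    return 1.0

n, x_obs = 10, 7  # assumed data: 7 successes in 10 trials

# Midpoint grid over theta in [0, 1].
steps = 10_000
h = 1.0 / steps
grid = [(i + 0.5) * h for i in range(steps)]

# Numerator of Bayes' theorem at each grid point: p(x | theta) * p(theta).
unnormalized = [likelihood(t, x_obs, n) * prior(t) for t in grid]

# Evidence p(x): integral of the numerator over theta.
evidence = h * sum(unnormalized)

# Posterior density p(theta | x) on the grid.
posterior = [u / evidence for u in unnormalized]

posterior_area = h * sum(posterior)
posterior_mean = h * sum(t * p for t, p in zip(grid, posterior))

print(evidence)        # area under the likelihood, about 0.0909 here
print(posterior_area)  # close to 1: the posterior is a pdf over theta
print(posterior_mean)  # close to (x_obs + 1) / (n + 2), about 0.667
```

Each entry of `unnormalized` uses only a single value $p(X|\theta)$ at the observed data, which is all Bayes' theorem asks of the likelihood.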