Say $X$ depends on $\alpha$. Rigorously speaking,

if $X$ and $\alpha$ are both random variables, we could write $p(X \mid \alpha)$;

however, if $X$ is a random variable and $\alpha$ is a parameter, we have to write $p(X; \alpha)$.
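To make the distinction concrete, here is an illustrative Gaussian example (not from the question itself) written out in both notations:

```latex
% Bayesian reading: \mu is itself a random variable with a prior,
% so the density of x given \mu is a genuine conditional:
p(x \mid \mu) = \mathcal{N}(x; \mu, \sigma^2), \qquad \mu \sim p(\mu).

% Frequentist reading: \mu is a fixed but unknown parameter,
% so the density is merely indexed by \mu:
p(x; \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
```

The functional form is the same in both lines; only the status of $\mu$ differs.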

I have noticed several times that **the machine learning community seems to ignore this distinction and abuse the notation.**

Take the famous LDA model, for example, where $\alpha$ is the Dirichlet parameter rather than a random variable.

Shouldn't it be $p(\theta; \alpha)$? Yet I see a lot of people, including the original LDA authors, write it as $p(\theta \mid \alpha)$.


#### Best Answer

I think this is more about Bayesian/non-Bayesian statistics than about machine learning vs. statistics.

In Bayesian statistics, parameters are modelled as random variables, too. If you have a joint distribution for $X, \alpha$, then $p(X \mid \alpha)$ is a conditional distribution, no matter what the physical interpretation of $X$ and $\alpha$ is. If one considers only fixed values of $\alpha$, or otherwise does not put a probability distribution over $\alpha$, the computations with $p(X; \alpha)$ are exactly the same as those with $p(X \mid \alpha)$.

Furthermore, one can at any point decide to extend a model with fixed values of $\alpha$ to one with a prior distribution over $\alpha$. To me at least, it seems strange that the notation for the distribution-given-$\alpha$ should change at that point, which is why some Bayesians prefer the conditioning notation even when one has not (yet?) bothered to define all parameters as random variables.
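A minimal sketch of that point, using a hand-rolled Dirichlet density (the function name `dirichlet_pdf` and the concrete values are my own illustration, not from the answer): whether we read $\alpha$ as a fixed parameter or as a conditioned-on random variable, the number we compute is identical.

```python
import math

def dirichlet_pdf(theta, alpha):
    """Dirichlet density at theta for concentration vector alpha."""
    norm = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(t ** (a - 1) for t, a in zip(theta, alpha))

alpha = [0.5, 0.5, 0.5]   # hypothetical concentration values
theta = [0.2, 0.3, 0.5]   # a point on the probability simplex

# "Frequentist" reading: p(theta; alpha), with alpha a fixed parameter.
p_semicolon = dirichlet_pdf(theta, alpha)

# "Bayesian" reading: p(theta | alpha), with alpha a value of a random
# variable we condition on. The function evaluated is the same; only
# the interpretation (and whether a p(alpha) exists) differs.
p_mid = dirichlet_pdf(theta, alpha)

assert p_semicolon == p_mid
```

The `assert` is trivially true because both readings evaluate the same function; the notational debate is about interpretation, not computation.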

The argument about whether one may write $p(X; \alpha)$ as $p(X \mid \alpha)$ has also arisen in the comments of Andrew Gelman's blog post *Misunderstanding the $p$-value*. For example, Larry Wasserman took the position that $\mid$ is not allowed when there is no conditioning from a joint distribution, while Andrew Gelman held the opposite view.
