Cross-validation vs empirical Bayes for estimating hyperparameters

Given a hierarchical model $p(x \mid \phi, \theta)$, I want a two-stage process to fit the model: first fix a handful of hyperparameters $\theta$, then do Bayesian inference on the remaining parameters $\phi$. For fixing the hyperparameters I am considering two options.

  1. Use empirical Bayes (EB) and maximize the marginal likelihood $p(\mbox{all data} \mid \theta)$ (integrating out the rest of the model, which contains high-dimensional parameters).
  2. Use cross-validation (CV) techniques such as $k$-fold cross-validation to choose the $\theta$ that maximizes the likelihood $p(\mbox{test data} \mid \mbox{training data}, \theta)$.
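
Written out, the two criteria look as follows. This is only a sketch of the notation: the prior $p(\phi \mid \theta)$, the fold symbols $D_k$ (held-out fold) and $D_{-k}$ (the remaining data), and the assumption that the folds are conditionally independent given $\phi$ and $\theta$ are mine and are not part of the model statement above.

$$\hat\theta_{\mathrm{EB}} = \arg\max_{\theta} \, \log \int p(\mbox{all data} \mid \phi, \theta) \, p(\phi \mid \theta) \, d\phi$$

$$\hat\theta_{\mathrm{CV}} = \arg\max_{\theta} \, \sum_{k=1}^{K} \log p(D_k \mid D_{-k}, \theta), \qquad p(D_k \mid D_{-k}, \theta) = \int p(D_k \mid \phi, \theta) \, p(\phi \mid D_{-k}, \theta) \, d\phi$$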

The advantage of EB is that I can use all the data at once, whereas for CV I need to (potentially) compute the model likelihood multiple times while searching for $\theta$. The performance of EB and CV is comparable in many cases (*), and EB is often faster to compute.

Question: Is there a theoretical foundation that links the two (say, that EB and CV are the same in the limit of large data)? Or one that links EB to some generalizability criterion such as empirical risk? Can someone point to good reference material?

(*) As an illustration, here is a figure from Murphy's Machine Learning, Section 7.6.4, where he says that for ridge regression both procedures yield very similar results:

[Figure: Murphy, empirical Bayes vs. CV]
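
To make the comparison concrete, here is a minimal sketch of both criteria for Bayesian ridge regression. It is not Murphy's code and does not reproduce his figure. The assumptions are mine: synthetic data, a known noise variance `sigma2`, a prior $w \sim N(0, I/\alpha)$ with the prior precision `alpha` playing the role of $\theta$, and a simple grid search. All function names are hypothetical.

```python
import numpy as np

def log_evidence(X, y, alpha, sigma2):
    """Log marginal likelihood log p(y | X, alpha), weights integrated out."""
    n = len(y)
    C = sigma2 * np.eye(n) + X @ X.T / alpha  # marginal covariance of y
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def log_predictive(Xtr, ytr, Xte, yte, alpha, sigma2):
    """Joint log density log p(yte | ytr, alpha) of a held-out block."""
    d = Xtr.shape[1]
    # Gaussian posterior over the weights given the training block.
    S = np.linalg.inv(alpha * np.eye(d) + Xtr.T @ Xtr / sigma2)
    m = S @ Xtr.T @ ytr / sigma2
    # Gaussian posterior predictive for the held-out block.
    cov = sigma2 * np.eye(len(yte)) + Xte @ S @ Xte.T
    r = yte - Xte @ m
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(yte) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

def cv_score(X, y, alpha, sigma2, k=5):
    """k-fold CV criterion: sum over folds of log p(test | training, alpha)."""
    idx = np.arange(len(y))
    total = 0.0
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        total += log_predictive(X[train], y[train], X[test], y[test], alpha, sigma2)
    return total

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, d, sigma2 = 40, 8, 0.25
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Grid search over the single hyperparameter under each criterion.
alphas = np.logspace(-3, 3, 25)
alpha_eb = alphas[np.argmax([log_evidence(X, y, a, sigma2) for a in alphas])]
alpha_cv = alphas[np.argmax([cv_score(X, y, a, sigma2) for a in alphas])]
print("EB choice:", alpha_eb, "CV choice:", alpha_cv)
```

One check that applies to this sketch: by the chain rule of probability, `log_evidence` on all the data equals `log_evidence` on a training block plus `log_predictive` of the remaining block given it. EB therefore scores one sequential pass through the data starting from the prior, while CV sums held-out scores that are each conditioned on most of the data.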

Murphy also says that the principal practical advantage of empirical Bayes (he calls it the "evidence procedure") over CV arises when $\theta$ consists of many hyperparameters (e.g. a separate penalty for each feature, as in automatic relevance determination, or ARD). There it is not possible to use CV at all.

I doubt there is a theoretical link showing that CV and evidence maximisation are asymptotically equivalent, because the evidence tells us the probability of the data given the assumptions of the model. If the model is mis-specified, the evidence may therefore be unreliable. Cross-validation, on the other hand, gives an estimate of the probability of the data whether or not the modelling assumptions are correct. This means that the evidence may be the better guide when the modelling assumptions are correct, since it needs less data, but cross-validation will be robust against model mis-specification. CV is asymptotically unbiased, but I would assume that the evidence isn't unless the model assumptions happen to be exactly correct.

This is essentially my intuition/experience; I would also be interested to hear about research on this.

Note that for many models (e.g. ridge regression, Gaussian processes, kernel ridge regression/LS-SVM, etc.), leave-one-out cross-validation can be performed at least as efficiently as estimating the evidence, so there isn't necessarily a computational advantage there.
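
For ridge regression, the efficient leave-one-out computation can be sketched as below. It uses the standard identity that the leave-one-out residual equals the ordinary residual divided by $1 - h_{ii}$, where $h_{ii}$ is a diagonal entry of the hat matrix. The assumptions here are a fixed penalty `lam`, no unpenalised intercept, and synthetic data. The function names are hypothetical, and the brute-force version is included only as a check.

```python
import numpy as np

def loo_residuals_closed_form(X, y, lam):
    """Leave-one-out residuals from a single fit, via e_i / (1 - h_ii)."""
    d = X.shape[1]
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T, so that fitted values are H y.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    return (y - H @ y) / (1.0 - np.diag(H))

def loo_residuals_brute_force(X, y, lam):
    """The same residuals obtained by refitting n times (for checking only)."""
    n, d = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask that drops observation i
        A = X[keep].T @ X[keep] + lam * np.eye(d)
        w = np.linalg.solve(A, X[keep].T @ y[keep])
        out[i] = y[i] - X[i] @ w
    return out

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=30)

fast = loo_residuals_closed_form(X, y, lam=0.5)
slow = loo_residuals_brute_force(X, y, lam=0.5)
print("max abs difference:", np.max(np.abs(fast - slow)))
print("LOO mean squared error:", np.mean(fast ** 2))
```

The closed form costs one linear solve per value of `lam`, which is the sense in which leave-one-out need not be more expensive than evaluating the evidence for this model.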

Addendum: Both the marginal likelihood and cross-validation performance estimates are evaluated over a finite sample of data, and hence there is always a possibility of over-fitting if a model is tuned by optimising either criterion. For small samples, the difference in the variance of the two criteria may decide which works best. See my paper

Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(Jul):2079-2107, 2010.
