What do we mean by hyperparameters?

Can anyone give me full details about what we mean by hyperparameters, and which quantities in the Dirichlet distribution are called hyperparameters? A practical example of estimating those parameters would also be useful.

I suspect what is meant by hyper-parameter depends on the context, but here goes:

I would say that the parameters of a model are those that are directly fitted to the data, and the hyper-parameters are those that are set by the user or are only indirectly fitted to the data. For instance, in ridge regression the parameters are the regression coefficients and the ridge parameter is the hyper-parameter. In this case, the regression parameters are determined by minimising the negative log-likelihood with a penalty term, usually via the normal equations

$\vec{\beta} = [X^TX + \lambda I]^{-1}X^T\vec{y}$

whereas the ridge parameter, $\lambda$, is set by the user (perhaps just to ensure the matrix is invertible) or might be tuned by minimising the cross-validation error, or generalised cross-validation. In that case, $\lambda$ is tuned using the data, but only indirectly.
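To make the ridge example concrete, here is a minimal sketch in Python with NumPy. The synthetic data, the candidate grid for the ridge parameter and the helper names `ridge_fit` and `cv_error` are all my own assumptions for illustration. The coefficients come straight from the normal equations for a given ridge parameter, while the ridge parameter itself is chosen by 5-fold cross-validation, i.e. only indirectly from the data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Parameters: solve the normal equations (X^T X + lam I) beta = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean squared error of ridge regression under k-fold cross-validation."""
    indices = np.arange(len(y))
    errors = []
    for held_out in np.array_split(indices, k):
        train = np.setdiff1d(indices, held_out)
        beta = ridge_fit(X[train], y[train], lam)  # fitted directly to the data
        errors.append(np.mean((X[held_out] @ beta - y[held_out]) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption, purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
true_beta = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=60)

# Hyper-parameter: chosen indirectly, by comparing cross-validation errors.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: cv_error(X, y, lam))

# Parameters: refit on all the data with the selected ridge parameter.
beta_hat = ridge_fit(X, y, best_lam)
```

Note that `ridge_fit` never searches over the ridge parameter; it only ever sees a fixed value handed to it from the outer selection step, which is the sense in which the ridge parameter is a hyper-parameter here.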

Sometimes there is no real statistical distinction between parameters and hyper-parameters, other than that there is a computationally efficient way to determine the values of one set of parameters given the values of the others. The first set then gets called the "parameters" and the second set the "hyper-parameters", but it is really just a matter of convenience. For example, I (and Mrs Marsupial) have found [1] that it can be better to tune the kernel parameters of a kernel machine (e.g. LS-SVM) directly on the data (with an additional regularisation term), so we treat them as parameters, rather than following the usual approach, which treats them as hyper-parameters (tuned e.g. via cross-validation).

I don't think that a Dirichlet distribution has hyper-parameters as such, but if a Dirichlet distribution is used as a prior in a Bayesian analysis then the parameters of the Dirichlet distribution become the hyper-parameters of the model. It is the parameters of the model that are directly determined from the data for a given Dirichlet prior (and the hyper-parameters are indirectly tuned to the data by e.g. maximising the evidence for the model).
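As a practical example of the Dirichlet case, here is a minimal sketch, assuming a categorical model with a symmetric Dirichlet prior (all concentration parameters equal to a single value). The counts, the grid of candidate values and the function names are assumptions of mine for illustration. For a given prior, the category probabilities are estimated directly from the counts via the posterior mean; the concentration parameter is then chosen indirectly, by maximising the log evidence

$\log p(\vec{n} \mid \vec{\alpha}) = \log\Gamma(A) - \log\Gamma(A + N) + \sum_k \left[\log\Gamma(\alpha_k + n_k) - \log\Gamma(\alpha_k)\right]$

where $A = \sum_k \alpha_k$ and $N = \sum_k n_k$ (the multinomial coefficient is omitted as it does not depend on $\vec{\alpha}$).

```python
from math import lgamma

def posterior_mean(counts, alpha):
    """Parameters: posterior mean of the category probabilities under a
    Dirichlet(alpha) prior, i.e. (alpha_k + n_k) / (A + N)."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]

def log_evidence(counts, alpha):
    """Log marginal likelihood of the observed sequence, with the category
    probabilities integrated out under the Dirichlet(alpha) prior."""
    A, N = sum(alpha), sum(counts)
    return (lgamma(A) - lgamma(A + N)
            + sum(lgamma(a + n) - lgamma(a) for a, n in zip(alpha, counts)))

# Hypothetical observed counts for three categories.
counts = [8, 1, 1]

# Hyper-parameter: pick the symmetric concentration that maximises the evidence.
grid = [0.1, 0.3, 1.0, 3.0, 10.0]
best_a = max(grid, key=lambda a: log_evidence(counts, [a] * len(counts)))

# Parameters: estimated directly from the counts for the chosen prior.
theta_hat = posterior_mean(counts, [best_a] * len(counts))
```

A coarse grid is used only to keep the sketch short; in practice the evidence could be maximised with a numerical optimiser, or the hyper-parameters could be given a prior of their own.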

[1] Cawley, G. C. and Talbot, N. L. C., "Kernel learning at the first level of inference", Neural Networks, Volume 53, Pages 69–80, May 2014.
