I'm a bit confused about Gaussian Process models for classification. In chapter 3 of http://www.gaussianprocess.org/gpml/ it is claimed that you can use a logit or probit model without any additional noise assumptions. This is made explicit on page 40:
We have tacitly assumed that the latent Gaussian process is
noise-free, and combined it with smooth likelihood functions, such as
the logistic or probit. However, one can equivalently think of adding
independent noise to the latent process in combination with a
step-function likelihood. In particular, assuming Gaussian noise and a
step-function likelihood is exactly equivalent to a noise-free
latent process and probit likelihood, see exercise 3.10.1
Also, in the example code, they never add any additional noise in the covariance function for classification.
I don't understand why it's not necessary to add noise to the covariance function for classification. In regression, in chapter 2, we saw that you need the noise to regularize the Gaussian process. If, for example, you set the noise to 0, the model will completely overfit, and the predictions at the training inputs will exactly equal the observed values.
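The overfitting in the noise-free regression case can be reproduced with a few lines of NumPy. This is only a minimal sketch: the squared-exponential kernel, the five data points, and the two noise variances are made up for illustration and are not taken from the book.

```python
import numpy as np

def se_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def posterior_mean_at_train(x, y, noise_var):
    """GP regression posterior mean evaluated at the training inputs:
    K (K + noise_var * I)^{-1} y."""
    K = se_kernel(x, x)
    alpha = np.linalg.solve(K + noise_var * np.eye(x.size), y)
    return K @ alpha

# Hypothetical toy data.
x = np.array([-2.0, -1.0, 0.0, 1.5, 3.0])
y = np.array([0.3, -1.1, 0.8, 0.2, -0.5])

# Essentially zero noise (a tiny value keeps the solve numerically stable).
fit_noise_free = posterior_mean_at_train(x, y, noise_var=1e-10)
# A clearly non-zero noise variance.
fit_noisy = posterior_mean_at_train(x, y, noise_var=0.5)

print("max |mean - y|, noise ~ 0  :", np.max(np.abs(fit_noise_free - y)))
print("max |mean - y|, noise = 0.5:", np.max(np.abs(fit_noisy - y)))
```

With (almost) no noise the posterior mean passes through every observation, while a non-zero noise variance pulls it away from the data.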
Why is this then not the case in classification? The problem might be that I'm so used to the 'weight-space' view, which the classification chapter leaves out for the GP; it is unclear to me what the prior on the weights is in classification.
Furthermore, I can almost understand it for the logit model, since logistic regression can also be seen as a latent variable model (see Wikipedia), just like the GP. In that case, however, I don't understand Wikipedia: it says the noise model does not need to be specified with any parameters, so why was a noise parameter needed in regression? And why is regularization needed for logistic regression but not for GPs?
So I guess the smoothing works for GPs because all possible values of the weights are integrated out, but it remains a mystery to me why we don't need a parameter to specify the regularization in GP classification. GP classifiers are compared to SVMs, but how can GPs perform well if the regularization parameter cannot be learned and tweaked?
Hopefully someone can point me a bit in the right direction 🙂 thanks!
I think I figured it out myself, in case anybody is interested. In classification, the likelihood function (such as the probit or error function) is itself a source of noise, since it makes the output y non-deterministic given the latent value. This is actually the same as in regression: there, too, the Gaussian noise comes from the likelihood.
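The equivalence from the quoted passage can be checked numerically. Below is a minimal Monte Carlo sketch, assuming unit-variance Gaussian noise added to a latent value f followed by a step function, compared against the probit Φ(f). The function names, sample size, and test values of f are my own choices for illustration.

```python
import math
import numpy as np

def probit(f):
    """Standard normal CDF, i.e. the probit likelihood p(y = +1 | f)."""
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))

def step_with_noise(f, n_samples=200_000, seed=0):
    """Monte Carlo estimate of p(y = +1 | f) when y = +1 iff f + eps > 0,
    with eps drawn from N(0, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    return float(np.mean(f + eps > 0.0))

for f in (-2.0, -0.5, 0.0, 1.0):
    print(f"f = {f:+.1f}  probit = {probit(f):.4f}  "
          f"noisy step = {step_with_noise(f):.4f}")
```

The two columns should agree up to Monte Carlo error, which is the sense in which the smooth likelihood already contains the noise.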
In logistic regression no noise parameter is needed, because the scale of the weight vector w plays that role: enlarging w, for example, reduces the noise, since the output is squashed further towards 0 or 1 by the logistic function.
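A small sketch of this effect, with a hypothetical weight vector and input: as w is scaled up, the predicted probability moves towards 0 or 1 and the entropy of the label (one way to measure how noisy it is) shrinks.

```python
import numpy as np

def sigmoid(a):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def label_entropy(p):
    """Entropy (in nats) of a Bernoulli label with success probability p."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

# Hypothetical weights and input, chosen only for illustration.
w = np.array([1.0, -0.5])
x = np.array([0.8, 0.3])

entropies = []
for scale in (0.1, 1.0, 10.0):
    p = sigmoid((scale * w) @ x)  # same direction of w, different length
    entropies.append(label_entropy(p))
    print(f"scale = {scale:5.1f}  p(y=1|x) = {p:.4f}  entropy = {entropies[-1]:.4f}")
```

So the length of w acts like an inverse noise level, which is why no separate noise parameter is needed.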
Finally, the GP model is regularized as well; this is best seen from page 144. For GP classification models, the parameter controlling the latent variable variation in the covariance function ($\sigma_f$) controls the regularization (again, see page 144). This regularization comes from the smoothness assumption that the Gaussian process model enforces on the latent variable.
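To illustrate the role of $\sigma_f$, here is a rough sketch that draws latent functions from a GP prior and pushes them through a logistic function. The squared-exponential kernel, the input grid, the length scale, and the helper names are assumptions of mine, not something from the book. A small $\sigma_f$ keeps the class probabilities near 0.5 (strong regularization), a large one lets them saturate.

```python
import numpy as np

def se_kernel(x, length_scale=1.0):
    """Unit-variance squared-exponential covariance on 1-D inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def mean_confidence(sigma_f, n_draws=200, seed=0):
    """Average |p - 0.5| over prior draws, where p = sigmoid(f) and
    f ~ GP(0, sigma_f^2 * k). Larger values mean more confident labels."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-3.0, 3.0, 30)
    # Small jitter on the diagonal keeps the Cholesky factorization stable.
    L = np.linalg.cholesky(se_kernel(x) + 1e-6 * np.eye(x.size))
    z = rng.standard_normal((x.size, n_draws))
    f = sigma_f * (L @ z)          # scaling the draw scales the latent amplitude
    p = 1.0 / (1.0 + np.exp(-f))   # logistic likelihood p(y = +1 | f)
    return float(np.mean(np.abs(p - 0.5)))

for sigma_f in (0.1, 1.0, 10.0):
    print(f"sigma_f = {sigma_f:5.1f}  mean |p - 0.5| = {mean_confidence(sigma_f):.4f}")
```

In practice $\sigma_f$ would be learned along with the other hyperparameters, so the amount of regularization is tuned through it rather than through a separate noise term.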