I read a few times that the mean prediction of a GP should be equivalent to KRR. I tested this empirically and found that their predictions differ (the dataset is y = 2x + Gaussian noise):

Two explanations for this come to mind:

- The GP is Bayesian, so it trains by maximizing the log marginal likelihood, which is sometimes called the Bayesian Occam's razor. This would, however, contradict the common saying (KRR = GP mean).
- The GP can train its hyperparameters (lengthscale and variance) by gradient descent, whereas the sklearn code I'm using only does a grid search over KRR's hyperparameters (can we tune KRR's hyperparameters, the regularization term alpha and the lengthscale, by gradient descent?), which could make the GP better empirically.

Are these right? Or is there something else going on here?

```python
import GPy
import matplotlib.pyplot as plt
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

np.random.seed(100)

# Make data: y = 2x + Gaussian noise.
X = np.arange(-10, 10, .25)[:, None]
Y = 2 * X + np.random.randn(X.shape[0], 1) * 5
plt.scatter(X, Y, c='green')

# GP regression: hyperparameters fitted by maximizing the log marginal likelihood.
krbf = GPy.kern.RBF(1)
m = GPy.models.GPRegression(X, Y, krbf)
m.optimize()
plt.plot(X, m.predict(X)[0], label='gp')

# Kernel ridge regression: hyperparameters chosen by grid search with cross validation.
kr = GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1), cv=5,
                  param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
                              "gamma": np.array([1, 5, 10, 15, 20])})
kr.fit(X, Y)
plt.plot(X, kr.predict(X), label='kr')

plt.legend()
plt.show()
```


#### Best Answer

The mean of the predictive distribution for GP regression is equal to the prediction using kernel ridge regression *when using the same kernel and hyperparameters*.

Suppose we have observed training inputs $X = \{x_1, \dots, x_n\}$ with corresponding real-valued outputs $y = [y_1, \dots, y_n]^T$. We're interested in predicting the output for some new test input $x_*$. Assume the model:

$$y_i = f(x_i) + \epsilon_i$$

where the $\epsilon_i$ represent i.i.d. Gaussian noise with variance $\sigma_n^2$. Let $k$ be a kernel function (a.k.a. covariance function in the case of GP regression). Let $K$ denote the kernel matrix for the training points, so $K_{ij} = k(x_i, x_j)$. And, let the vector $k_* = [k(x_*, x_1), \dots, k(x_*, x_n)]^T$ contain the result of evaluating the kernel function between the test input and each training input.

In **kernel ridge regression**, the predicted output at $x_*$ is:

$$\hat{y}_* = k_*^T (K + \lambda I)^{-1} y$$

where $\lambda$ is the regularization parameter and $I$ is the identity matrix.

In **GP regression** we have a Gaussian posterior predictive distribution over the output at $x_*$, with mean:

$$\bar{f}_* = k_*^T (K + \sigma_n^2 I)^{-1} y$$

where the noise variance $\sigma_n^2$ is considered a hyperparameter. See Rasmussen and Williams, Ch. 2.2, for details.

You can see that the predictions for kernel ridge regression and GP regression are equal, as long as $\lambda = \sigma_n^2$ and the same kernel function is used in both cases (including any hyperparameters).
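A quick numerical check of this equivalence, implementing both formulas directly in NumPy (the RBF lengthscale and noise variance below are arbitrary illustrative values):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2)).
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * lengthscale**2))

rng = np.random.default_rng(0)
X = np.linspace(-10, 10, 50)[:, None]
y = 2 * X[:, 0] + rng.normal(scale=5, size=50)
x_star = np.array([[0.5], [3.0]])          # test inputs

sigma_n2 = 1.0                             # GP noise variance (illustrative)
lam = sigma_n2                             # KRR regularization, set equal to it

K = rbf_kernel(X, X)                       # kernel matrix over training points
k_star = rbf_kernel(x_star, X)             # kernel between test and training points

# Kernel ridge prediction: k_*^T (K + lambda I)^{-1} y
krr_pred = k_star @ np.linalg.solve(K + lam * np.eye(50), y)
# GP posterior mean: k_*^T (K + sigma_n^2 I)^{-1} y -- the same expression
gp_mean = k_star @ np.linalg.solve(K + sigma_n2 * np.eye(50), y)

print(np.allclose(krr_pred, gp_mean))      # the two predictions coincide
```

The two formulas are literally identical once $\lambda = \sigma_n^2$, so the agreement holds to machine precision.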

In your example, GP regression and kernel ridge regression give different predictions because you're fitting the hyperparameters separately for each, using different methods. You're using cross validation for kernel ridge regression, while GPy (like sklearn's `GaussianProcessRegressor`) chooses GP regression hyperparameters by maximizing the marginal likelihood. So, you're probably ending up with different hyperparameters for each method.
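You can see the mismatch directly with an sklearn-only sketch of your setup (using `GaussianProcessRegressor` in place of GPy; the kernel choices and grids below mirror your example but are otherwise arbitrary):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(100)
X = np.arange(-10, 10, 0.25)[:, None]
y = 2 * X[:, 0] + rng.normal(scale=5, size=X.shape[0])

# GP: lengthscale, signal variance, and noise variance are fitted by
# maximizing the log marginal likelihood with a gradient-based optimizer.
gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1.0)).fit(X, y)

# KRR: alpha and gamma are chosen by grid search with cross validation.
kr = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1.0, 0.1, 1e-2, 1e-3], "gamma": [1, 5, 10, 15, 20]},
    cv=5,
).fit(X, y)

# The two selection procedures generally land on different hyperparameters,
# so the predictions generally disagree.
print(gp.kernel_)        # fitted GP kernel and its hyperparameters
print(kr.best_params_)   # CV-selected KRR hyperparameters
```

Refit KRR with $\lambda$ set to the GP's fitted noise variance and `gamma` matched to its lengthscale ($\gamma = 1/(2\ell^2)$), and the predictions should line up again.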
