I'm looking for references that study the behaviour of Gaussian process regression in the two settings of interpolation and extrapolation.
I've found answers (e.g. this one) that state that when extrapolating far enough from the data points, the GP simply regresses to the mean function. However, I would like a formal reference, ideally one that explores this issue in more detail.
Best Answer
Well, "regression to the mean" for a GP is well-known folklore, but it is actually not always true.
Let $m:D\rightarrow\mathbb{R}$ be the (prior) mean function and $K:D\times D\rightarrow\mathbb{R}$ be the covariance function of a GP. After observing data $(x_i)$, your posterior mean function will be $$ \hat{m}(x) = m(x) + \sum_i \alpha_i K(x_i, x) $$ for suitable $\alpha_i$ (see Rasmussen/Williams, Equation (2.27)).
Since the $\alpha_i$ are constants, the behaviour far enough from the data points is entirely determined by $m$ and $K$. Let us assume for simplicity that $m=0$, i.e. a zero-mean GP. Then "regression to the mean" means small values of $K(x_i,x)$ for $x$ far away from the data $x_i$. This is true for many popular kernels in GP regression, most notably the Gaussian kernel $K(x,y)=\exp(-\|x-y\|^2)$. But this is not the only possible choice: not all kernels have $K(x_i,x)\rightarrow 0$ as $x$ moves far from the data. Counterexamples are kernels such as the exponential kernel $\exp(\langle x,y\rangle)$, linear kernels, or even the constant kernel $K(x,y)=c$.
If you are interested in interpolation and approximation errors, the machine learning literature is probably not the right place to look. Look at books such as Wendland's *Scattered Data Approximation* or the lecture notes by Schaback.