# Solved – How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done

The data-generating process is: $y = \text{sin}\Big(x+I(d=0)\Big) + \text{sin}\Big(x+4*I(d=1)\Big) + I(d=0)z^2 + 3I(d=1)z^2 + \mathbb{N}\left(0,1\right)$

Let $x, z$ each be a sequence from $-4$ to $4$ of length $100$, and let $d$ be the corresponding factor, $d \in \{0,1\}$. Take all possible combinations of $x, z, d$ to calculate $y$. Using the (uncentered) B-spline basis for $x$ and $z$ for each level of $d$ is not feasible because of the partition-of-unity property (rows sum to 1): such a model is not identifiable (even without an intercept).

Example: (Setting: 5 inner knot-intervals (uniformly distributed), B-Spline of degree 2, the `spline`-function is a custom one)

```r
# drawing the sequence
n <- 100
x <- seq(-4,4,length.out=n)
z <- seq(-4,4,length.out=n)
d <- as.factor(0:1)
data <- CJ(x=x,z=z,d=d)
set.seed(100)

# setting up the model
data[,y := sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(n,0,1)]

# creating the uncentered B-Spline-Basis for x and z
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)]
head(X)
     x.1d0 x.2d0 x.3d0 x.4d0 x.5d0 x.6d0 x.7d0 x.1d1 x.2d1 x.3d1 x.4d1 x.5d1 x.6d1 x.7d1
[1,]   0.5   0.5     0     0     0     0     0   0.0   0.0     0     0     0     0     0
[2,]   0.0   0.0     0     0     0     0     0   0.5   0.5     0     0     0     0     0
[3,]   0.5   0.5     0     0     0     0     0   0.0   0.0     0     0     0     0     0

Z <- data[,spline(z,min(z),max(z),5,2,by=d)]
head(Z)
         z.1d0     z.2d0      z.3d0 z.4d0 z.5d0 z.6d0 z.7d0     z.1d1     z.2d1      z.3d1 z.4d1 z.5d1 z.6d1
[1,] 0.5000000 0.5000000 0.00000000     0     0     0     0 0.0000000 0.0000000 0.00000000     0     0     0
[2,] 0.0000000 0.0000000 0.00000000     0     0     0     0 0.5000000 0.5000000 0.00000000     0     0     0
[3,] 0.4507703 0.5479543 0.00127538     0     0     0     0 0.0000000 0.0000000 0.00000000     0     0     0
     z.7d1
[1,]     0
[2,]     0
[3,]     0

# lm will drop one spline-column for each factor
lm(y ~ -1+X+Z,data=data)

Call:
lm(formula = y ~ -1 + X + Z, data = data)

Coefficients:
 Xx.1d0   Xx.2d0   Xx.3d0   Xx.4d0   Xx.5d0   Xx.6d0   Xx.7d0   Xx.1d1   Xx.2d1   Xx.3d1   Xx.4d1   Xx.5d1
 23.510   19.912   18.860   22.177   23.080   19.794   18.727   68.572   69.185   67.693   67.082   68.642
 Xx.6d1   Xx.7d1   Zz.1d0   Zz.2d0   Zz.3d0   Zz.4d0   Zz.5d0   Zz.6d0   Zz.7d0   Zz.1d1   Zz.2d1   Zz.3d1
 69.159   67.496    1.381  -11.872  -19.361  -21.835  -19.698  -11.244       NA   -1.329  -38.449  -62.254
 Zz.4d1   Zz.5d1   Zz.6d1   Zz.7d1
-69.993  -61.438  -39.754       NA
```
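The `NA` coefficients are a symptom of rank deficiency, which can be checked directly. The following is only a minimal sketch: `make_basis()` is a hypothetical stand-in for the custom `spline` function (which is not shown), built on `splines::splineDesign` with uniformly spaced knots, and the factor `d` is left out to keep the example small.

```r
library(splines)

# Hypothetical stand-in for the custom `spline()` helper (an assumption):
# uniform B-spline basis of the given degree with `n_intervals` inner intervals.
make_basis <- function(v, n_intervals = 5, degree = 2) {
  pad <- 0.001 * diff(range(v))  # keep all data strictly inside the inner knots
  inner <- seq(min(v) - pad, max(v) + pad, length.out = n_intervals + 1)
  h <- diff(inner)[1]
  knots <- c(inner[1] - h * (degree:1), inner,
             inner[length(inner)] + h * (1:degree))
  splineDesign(knots = knots, x = v, ord = degree + 1)
}

grid <- expand.grid(x = seq(-4, 4, length.out = 30),
                    z = seq(-4, 4, length.out = 30))
BX <- make_basis(grid$x)
BZ <- make_basis(grid$z)

# partition of unity: every row of each basis sums to 1
range(rowSums(BX))
range(rowSums(BZ))

# both bases span the constant function, so the combined matrix loses one rank
full <- cbind(BX, BZ)
ncol(full)
qr(full)$rank
```

With a `by` factor the same thing happens once per factor level, which is why `lm` drops one column for each level of `d`.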

To overcome this problem, Wood (Generalized Additive Models: An Introduction with R, pp. 163-164) proposes the sum (or mean) centering constraint:

$\boldsymbol{1}^T\boldsymbol{\tilde{X}_j}\boldsymbol{\tilde{\beta}_j}=0$

This can be done by reparametrization if a matrix $\boldsymbol{Z}$ is found such that

$\boldsymbol{1}^T\boldsymbol{\tilde{X}_j}\boldsymbol{Z}=0$

$\boldsymbol{Z}$-matrix can be found by the QR-decomposition of the constraint matrix $\boldsymbol{C}^T = (\boldsymbol{1}^T\boldsymbol{\tilde{X}_j})^T = \boldsymbol{\tilde{X}_j}^T\boldsymbol{1}$.

Note that $\boldsymbol{\tilde{X}_j}^T\boldsymbol{1}$ is $\boldsymbol{1}$ by the partition of unity-property.
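A short derivation may help to see why the QR decomposition yields a suitable $\boldsymbol{Z}$. The symbols $p$ (number of columns of $\boldsymbol{\tilde{X}_j}$), $\boldsymbol{Q}$, $R$, $\boldsymbol{q}_1$ and $\boldsymbol{\beta}_Z$ are introduced here for illustration only and follow the usual full-QR convention:

```latex
\boldsymbol{C}^T = \boldsymbol{\tilde{X}_j}^T\boldsymbol{1}
  = \boldsymbol{Q}\begin{pmatrix} R \\ \boldsymbol{0} \end{pmatrix},
\qquad
\boldsymbol{Q} = \begin{pmatrix} \boldsymbol{q}_1 & \boldsymbol{Z} \end{pmatrix}
  \in \mathbb{R}^{p \times p},
\quad R \in \mathbb{R}

\boldsymbol{C}\boldsymbol{Z}
  = \boldsymbol{1}^T\boldsymbol{\tilde{X}_j}\boldsymbol{Z}
  = R\,\boldsymbol{q}_1^T\boldsymbol{Z}
  = \boldsymbol{0}
  \quad\text{(columns of } \boldsymbol{Q} \text{ are orthonormal)}

\boldsymbol{\tilde{\beta}_j} = \boldsymbol{Z}\boldsymbol{\beta}_Z
\;\Longrightarrow\;
\boldsymbol{1}^T\boldsymbol{\tilde{X}_j}\boldsymbol{\tilde{\beta}_j}
  = \boldsymbol{1}^T\boldsymbol{\tilde{X}_j}\boldsymbol{Z}\boldsymbol{\beta}_Z = 0
  \quad\text{for every } \boldsymbol{\beta}_Z \in \mathbb{R}^{p-1}
```

So $\boldsymbol{Z}$ consists of the last $p-1$ columns of $\boldsymbol{Q}$, and the constrained design matrix $\boldsymbol{\tilde{X}_j}\boldsymbol{Z}$ has one column fewer than $\boldsymbol{\tilde{X}_j}$.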

The centered/constrained-version of my B-Spline-Matrix is:

```r
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=TRUE)]
head(X)
         x.1d0      x.2d0      x.3d0      x.4d0      x.5d0       x.6d0     x.1d1      x.2d1      x.3d1      x.4d1
[1,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000  0.0000000  0.0000000  0.0000000
[2,] 0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.00000000 0.2271923 -0.3225655 -0.3225655 -0.3225655
[3,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000  0.0000000  0.0000000  0.0000000
          x.5d1       x.6d1
[1,]  0.0000000  0.00000000
[2,] -0.2728077 -0.05790256
[3,]  0.0000000  0.00000000

Z <- data[,spline(z,min(z),max(z),5,2,by=d,intercept=TRUE)]
head(Z)
         z.1d0      z.2d0      z.3d0      z.4d0      z.5d0       z.6d0     z.1d1      z.2d1      z.3d1      z.4d1
[1,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000  0.0000000  0.0000000  0.0000000
[2,] 0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.00000000 0.2271923 -0.3225655 -0.3225655 -0.3225655
[3,] 0.2875283 -0.3066501 -0.3079255 -0.3079255 -0.2604260 -0.05527458 0.0000000  0.0000000  0.0000000  0.0000000
          z.5d1       z.6d1
[1,]  0.0000000  0.00000000
[2,] -0.2728077 -0.05790256
[3,]  0.0000000  0.00000000
```

My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?

```r
# comparing with gam from mgcv
mod.gam <- gam(y~d+s(x,bs="ps",by=d,k=7)+s(z,bs="ps",by=d,k=7),data=data)
X.gam <- model.matrix(mod.gam)
head(X.gam)
  (Intercept) d1 s(x):d0.1   s(x):d0.2  s(x):d0.3  s(x):d0.4  s(x):d0.5   s(x):d0.6 s(x):d1.1   s(x):d1.2
1           1  0 0.5465301 -0.05732768 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000  0.00000000
2           1  1 0.0000000  0.00000000  0.0000000  0.0000000  0.0000000  0.00000000 0.5465301 -0.05732768
3           1  0 0.5465301 -0.05732768 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000  0.00000000
   s(x):d1.3  s(x):d1.4  s(x):d1.5   s(x):d1.6 s(z):d0.1    s(z):d0.2  s(z):d0.3  s(z):d0.4  s(z):d0.5
1  0.0000000  0.0000000  0.0000000  0.00000000 0.5465301 -0.057327680 -0.2351708 -0.2259983 -0.1201207
2 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000  0.000000000  0.0000000  0.0000000  0.0000000
3  0.0000000  0.0000000  0.0000000  0.00000000 0.5471108 -0.031559945 -0.2302910 -0.2213227 -0.1176356
    s(z):d0.6 s(z):d1.1    s(z):d1.2  s(z):d1.3  s(z):d1.4  s(z):d1.5   s(z):d1.6
1 -0.01043987 0.0000000  0.000000000  0.0000000  0.0000000  0.0000000  0.00000000
2  0.00000000 0.5465301 -0.057327680 -0.2351708 -0.2259983 -0.1201207 -0.01043987
3 -0.01022388 0.0000000  0.000000000  0.0000000  0.0000000  0.0000000  0.00000000
```

The dotted line corresponds to my fit, the straight line to the gam-version.

Here is a simpler example using the link from Nemo. The question I answer is

> How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done?

I answer this because it is the title, and because

> My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?

is rather unclear, for reasons I give at the end. Here is the answer to the above question:

```r
# simulate data
library(splines)
set.seed(100)
n <- 1000
x <- seq(-4,4,length.out=n)
df <- expand.grid(d = factor(c(0, 1)), x = x)
# one noise term per row (nrow(df), not length(df), which is the number of columns)
df <- cbind(y = sin(df$x) + rnorm(nrow(df),0,1), df)
x <- df$x

# we start the other way and find the knots `mgcv` uses to make sure we have
# the same knots...
library(mgcv)
mod_gam <- gam(y ~ s(x, bs="ps", k = 7), data = df)
knots <- mod_gam$smooth[[1]]$knots

# find constrained basis as OP describes
X <- splineDesign(knots = knots, x)
C <- rep(1, nrow(X)) %*% X
qrc <- qr(t(C))
Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)]
XZ <- X%*%Z
rep(1, nrow(X)) %*% XZ # all ~ zero as they should
#R              [,1]          [,2]          [,3]          [,4]          [,5]          [,6]
#R [1,] 2.239042e-13 -2.112754e-13 -3.225198e-13 -6.993017e-14 -2.011724e-13 -3.674838e-14

# now we get roughly the same basis
all.equal(model.matrix(mod_gam)[, -1], XZ, check.attributes = FALSE)
#R [1] TRUE

# if you want to use a binary by value
mod_gam <- gam(y ~ s(x, bs="ps", k = 7, by = d), data = df)
all.equal(
  model.matrix(mod_gam)[, -1],
  cbind(XZ * (df$d == 0), XZ * (df$d == 1)), check.attributes = FALSE)
#R [1] TRUE
```

You can do better in terms of computation speed than explicitly computing

```r
Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)]
XZ <- X%*%Z
```

as described on page 211 of

Wood, Simon N.. Generalized Additive Models: An Introduction with R, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). CRC Press.
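One way to avoid forming `Z` explicitly is to let base R apply the Householder reflection stored in the QR object, via `qr.qty()`. This is a minimal sketch of that idea, not necessarily the exact scheme in the book; the knot vector and the names `XZ_explicit` and `XZ_implicit` are assumptions made for illustration.

```r
library(splines)

# small illustrative basis (assumed setup): quadratic B-splines, uniform knots
x <- seq(-3.99, 3.99, length.out = 200)      # strictly inside the inner knots
degree <- 2
inner <- seq(-4, 4, length.out = 6)          # 5 inner knot intervals
h <- diff(inner)[1]
knots <- c(inner[1] - h * (degree:1), inner, inner[6] + h * (1:degree))
X <- splineDesign(knots = knots, x = x, ord = degree + 1)

C <- matrix(colSums(X), nrow = 1)            # 1 x p constraint matrix 1'X
qrc <- qr(t(C))                              # QR decomposition of the p x 1 matrix C'

# explicit version: build Z, then multiply
Z <- qr.Q(qrc, complete = TRUE)[, -1]
XZ_explicit <- X %*% Z

# implicit version: X Q = t(Q' X'), then drop the first column;
# qr.qty() applies Q' without ever forming Q or Z
XZ_implicit <- t(qr.qty(qrc, t(X)))[, -1]

all.equal(XZ_explicit, XZ_implicit, check.attributes = FALSE)
max(abs(colSums(XZ_implicit)))               # column sums, numerically zero
```

For a single constraint `Q` is one Householder reflection, so applying it costs on the order of `n * p` operations rather than the `n * p^2` of the explicit matrix product.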

There are some issues in the OP's code:

```r
# drawing the sequence
n <- 100
x <- seq(-4,4,length.out=n)
z <- seq(-4,4,length.out=n)
d <- as.factor(0:1)
library(data.table) # OP did not load the library
data <- CJ(x=x,z=z,d=d)
set.seed(100)

# setting up the model
data[, y :=
     # OP only simulates n random terms -- there are 20000 rows
     sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(n,0,1)]

# creating the uncentered B-Spline-Basis for x and z
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)] # gets an error
#R Error in spline(x, min(x), max(x), 5, 2, by = d, intercept = FALSE) :
#R   unused arguments (by = d, intercept = FALSE)
str(formals(spline)) # here are the formals for `stats::spline`
#R Dotted pair list of 8
#R  $ x     : symbol
#R  $ y     : NULL
#R  $ n     : language 3 * length(x)
#R  $ method: chr "fmm"
#R  $ xmin  : language min(x)
#R  $ xmax  : language max(x)
#R  $ xout  : symbol
#R  $ ties  : symbol mean
```

As for

> My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?

I do not see why you would expect to get the same columns. You may have used different knots, and I do not see how the `spline` function would yield the correct results here.

> The dotted line corresponds to my fit, the straight line to the gam-version

If your fit is obtained with `lm`, then it is unpenalized, so the results should be expected to differ?
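The effect of the penalty can be checked with a small sketch. It assumes simulated data similar to the above and uses `fx = TRUE` in `s()`, which asks `mgcv` for an unpenalized (fixed degrees of freedom) regression spline; the object names are chosen for illustration only.

```r
library(mgcv)

set.seed(1)
n <- 400
dat <- data.frame(x = seq(-4, 4, length.out = n))
dat$y <- sin(dat$x) + rnorm(n)

# default gam fit: smoothing parameter is estimated, so the fit is penalized
fit_pen <- gam(y ~ s(x, bs = "ps", k = 7), data = dat)

# same basis as an unpenalized regression spline
fit_unpen <- gam(y ~ s(x, bs = "ps", k = 7, fx = TRUE), data = dat)

# ordinary least squares on the constrained basis columns used by gam
B <- model.matrix(fit_pen)[, -1]
fit_lm <- lm(dat$y ~ B)

# unpenalized gam and lm should agree up to numerical error
max(abs(fitted(fit_unpen) - fitted(fit_lm)))

# the penalized fit will in general differ from the lm fit
max(abs(fitted(fit_pen) - fitted(fit_lm)))
```

If the two curves in the plot come from `lm` on one side and a default `gam` call on the other, a small visible gap is therefore not evidence of a wrong basis.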
