The data-generating-process is: $y = text{sin}Big(x+I(d=0)Big) + text{sin}Big(x+4*I(d=1)Big) + I(d=0)z^2 + 3I(d=1)z^2 + mathbb{N}left(0,1right)$
Let $x,z$ be a sequence from $-4$ to $4$ of length $100$ and $d$ to be the corresponding factor $din{0,1}$. Take all possible combinations of $x,z,d$ to calculate $y$:
Using the (uncentered) B-spline-Basis for $x,z$ for each level of $d$ will not be feasible by the parition-of-unity-property (rows sum to 1). Such a model will not be identifiable (even without intercept).
Example: (Setting: 5 inner knot-intervals (uniformly distributed), B-Spline of degree 2, the spline
-function is a custom one)
# drawing the sequence n <- 100 x <- seq(-4,4,length.out=n) z <- seq(-4,4,length.out=n) d <- as.factor(0:1) data <- CJ(x=x,z=z,d=d) set.seed(100) # setting up the model data[,y := sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(n,0,1)] # creating the uncentered B-Spline-Basis for x and z X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)] > head(X) x.1d0 x.2d0 x.3d0 x.4d0 x.5d0 x.6d0 x.7d0 x.1d1 x.2d1 x.3d1 x.4d1 x.5d1 x.6d1 x.7d1 [1,] 0.5 0.5 0 0 0 0 0 0.0 0.0 0 0 0 0 0 [2,] 0.0 0.0 0 0 0 0 0 0.5 0.5 0 0 0 0 0 [3,] 0.5 0.5 0 0 0 0 0 0.0 0.0 0 0 0 0 0 Z <- data[,spline(z,min(z),max(z),5,2,by=d)] head(Z) z.1d0 z.2d0 z.3d0 z.4d0 z.5d0 z.6d0 z.7d0 z.1d1 z.2d1 z.3d1 z.4d1 z.5d1 z.6d1 [1,] 0.5000000 0.5000000 0.00000000 0 0 0 0 0.0000000 0.0000000 0.00000000 0 0 0 [2,] 0.0000000 0.0000000 0.00000000 0 0 0 0 0.5000000 0.5000000 0.00000000 0 0 0 [3,] 0.4507703 0.5479543 0.00127538 0 0 0 0 0.0000000 0.0000000 0.00000000 0 0 0 z.7d1 [1,] 0 [2,] 0 [3,] 0 # lm will drop one spline-column for each factor lm(y ~ -1+X+Z,data=data) Call: lm(formula = y ~ -1 + X + Z, data = data) Coefficients: Xx.1d0 Xx.2d0 Xx.3d0 Xx.4d0 Xx.5d0 Xx.6d0 Xx.7d0 Xx.1d1 Xx.2d1 Xx.3d1 Xx.4d1 Xx.5d1 23.510 19.912 18.860 22.177 23.080 19.794 18.727 68.572 69.185 67.693 67.082 68.642 Xx.6d1 Xx.7d1 Zz.1d0 Zz.2d0 Zz.3d0 Zz.4d0 Zz.5d0 Zz.6d0 Zz.7d0 Zz.1d1 Zz.2d1 Zz.3d1 69.159 67.496 1.381 -11.872 -19.361 -21.835 -19.698 -11.244 NA -1.329 -38.449 -62.254 Zz.4d1 Zz.5d1 Zz.6d1 Zz.7d1 -69.993 -61.438 -39.754 NA
To overcome this problem, Wood, Generalized Additive Models: An Introduction with R, page 163-164 proposes the sum (or mean) centering constraint:
$boldsymbol{1}^Tboldsymbol{tilde{X}_j}boldsymbol{tilde{beta}_j}=0$
This can be done by reparametrization if a matrix $boldsymbol{Z}$ is found such that
$boldsymbol{1}^Tboldsymbol{tilde{X}_j}boldsymbol{Z}=0$
$boldsymbol{Z}$-matrix can be found by the QR-decomposition of the constraint matrix $boldsymbol{C}^T = (boldsymbol{boldsymbol{1}^Tboldsymbol{tilde{X}_j}})^T = boldsymbol{tilde{X}_j}^Tboldsymbol{1}$.
Note that $boldsymbol{tilde{X}_j}^Tboldsymbol{1}$ is $boldsymbol{1}$ by the partition of unity-property.
The centered/constrained-version of my B-Spline-Matrix is:
X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=TRUE)] head(X) x.1d0 x.2d0 x.3d0 x.4d0 x.5d0 x.6d0 x.1d1 x.2d1 x.3d1 x.4d1 [1,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000 0.0000000 0.0000000 0.0000000 [2,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.2271923 -0.3225655 -0.3225655 -0.3225655 [3,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000 0.0000000 0.0000000 0.0000000 x.5d1 x.6d1 [1,] 0.0000000 0.00000000 [2,] -0.2728077 -0.05790256 [3,] 0.0000000 0.00000000 Z <- data[,spline(z,min(z),max(z),5,2,by=d,intercept=TRUE)] head(Z) z.1d0 z.2d0 z.3d0 z.4d0 z.5d0 z.6d0 z.1d1 z.2d1 z.3d1 z.4d1 [1,] 0.2271923 -0.3225655 -0.3225655 -0.3225655 -0.2728077 -0.05790256 0.0000000 0.0000000 0.0000000 0.0000000 [2,] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.2271923 -0.3225655 -0.3225655 -0.3225655 [3,] 0.2875283 -0.3066501 -0.3079255 -0.3079255 -0.2604260 -0.05527458 0.0000000 0.0000000 0.0000000 0.0000000 z.5d1 z.6d1 [1,] 0.0000000 0.00000000 [2,] -0.2728077 -0.05790256 [3,] 0.0000000 0.00000000
My question is : Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?
# comparing with gam from mgcv mod.gam <- gam(y~d+s(x,bs="ps",by=d,k=7)+s(z,bs="ps",by=d,k=7),data=data) X.gam <- model.matrix(mod.gam) head(X.gam) (Intercept) d1 s(x):d0.1 s(x):d0.2 s(x):d0.3 s(x):d0.4 s(x):d0.5 s(x):d0.6 s(x):d1.1 s(x):d1.2 1 1 0 0.5465301 -0.05732768 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000 0.00000000 2 1 1 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.5465301 -0.05732768 3 1 0 0.5465301 -0.05732768 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000 0.00000000 s(x):d1.3 s(x):d1.4 s(x):d1.5 s(x):d1.6 s(z):d0.1 s(z):d0.2 s(z):d0.3 s(z):d0.4 s(z):d0.5 1 0.0000000 0.0000000 0.0000000 0.00000000 0.5465301 -0.057327680 -0.2351708 -0.2259983 -0.1201207 2 -0.2351708 -0.2259983 -0.1201207 -0.01043987 0.0000000 0.000000000 0.0000000 0.0000000 0.0000000 3 0.0000000 0.0000000 0.0000000 0.00000000 0.5471108 -0.031559945 -0.2302910 -0.2213227 -0.1176356 s(z):d0.6 s(z):d1.1 s(z):d1.2 s(z):d1.3 s(z):d1.4 s(z):d1.5 s(z):d1.6 1 -0.01043987 0.0000000 0.000000000 0.0000000 0.0000000 0.0000000 0.00000000 2 0.00000000 0.5465301 -0.057327680 -0.2351708 -0.2259983 -0.1201207 -0.01043987 3 -0.01022388 0.0000000 0.000000000 0.0000000 0.0000000 0.0000000 0.00000000
Thw dotted line corresponds to my fit, the straight line to the gam-version
Best Answer
Here is a simpler example using the link from Nemo. The question I answer is
How exactly is the sum (or mean) centering constraint for splines (also w.r.t. gam from mgcv) done?
I answer this as this is the title and as
My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?
is rather unclear for reason I provide in the end. Here is the answer to the above question
# simulate data library(splines) set.seed(100) n <- 1000 x <- seq(-4,4,length.out=n) df <- expand.grid(d = factor(c(0, 1)), x = x) df <- cbind(y = sin(x) + rnorm(length(df),0,1), df) x <- df$x # we start the other way and find the knots `mgcv` uses to make sure we have # the same knots... library(mgcv) mod_gam <- gam(y ~ s(x, bs="ps", k = 7), data = df) knots <- mod_gam$smooth[[1]]$knots # find constrained basis as OP describes X <- splineDesign(knots = knots, x) C <- rep(1, nrow(X)) %*% X qrc <- qr(t(C)) Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)] XZ <- X%*%Z rep(1, nrow(X)) %*% XZ # all ~ zero as they should #R [,1] [,2] [,3] [,4] [,5] [,6] #R [1,] 2.239042e-13 -2.112754e-13 -3.225198e-13 -6.993017e-14 -2.011724e-13 -3.674838e-14 # now we get roughtly the same basis all.equal(model.matrix(mod_gam)[, -1], XZ, check.attributes = FALSE) #R [1] TRUE # if you want to use a binary by value mod_gam <- gam(y ~ s(x, bs="ps", k = 7, by = d), data = df) all.equal( model.matrix(mod_gam)[, -1], cbind(XZ * (df$d == 0), XZ * (df$d == 1)), check.attributes = FALSE) #R [1] TRUE
You can do better in terms of computation speed than explicitly computing
Z <- qr.Q(qrc,complete=TRUE)[,(nrow(C)+1):ncol(C)] XZ <- X%*%Z
as described on page 211 of
Wood, Simon N.. Generalized Additive Models: An Introduction with R, Second Edition (Chapman & Hall/CRC Texts in Statistical Science). CRC Press.
There are some issues in the OP's code
# drawing the sequence n <- 100 x <- seq(-4,4,length.out=n) z <- seq(-4,4,length.out=n) d <- as.factor(0:1) library(data.table) # OP did not load the library data <- CJ(x=x,z=z,d=d) set.seed(100) # setting up the model data[, y := # OP only simulate n random terms -- there are 20000 rows sin(x+I(d==0)) + sin(x+4*I(d==1)) + I(d==0)*z^2 + 3*I(d==1)*z^2 + rnorm(n,0,1)] # creating the uncentered B-Spline-Basis for x and z X <- data[,spline(x,min(x),max(x),5,2,by=d,intercept=FALSE)] # gets an error #R Error in spline(x, min(x), max(x), 5, 2, by = d, intercept = FALSE) : #R unused arguments (by = d, intercept = FALSE) str(formals(spline)) # here are the formals for `stats::spline` #R Dotted pair list of 8 #R $ x : symbol #R $ y : NULL #R $ n : language 3 * length(x) #R $ method: chr "fmm" #R $ xmin : language min(x) #R $ xmax : language max(x) #R $ xout : symbol #R $ ties : symbol mean
To
My question is: Even though the fit is very similar, why do my constrained B-Spline-columns differ from what gam provides? What did I miss?
then I do not get how you would expect to get the same. You may have used different knots and I do not see how the spline
function would yield the correct results here.
Thw dotted line corresponds to my fit, the straight line to the gam-version
If the latter is fitted with lm
then it is un-penalized so the results should differ?