In a comment to this question, user @whuber cited the possibility of using a periodic version of splines to fit periodic data. I would like to know more about this method, in particular the equations defining the splines and how to implement them in practice (I'm mostly an R user, but I can make do with MATLAB or Python if the need arises). Also, as a "nice to have", it would be great to know about possible advantages/disadvantages with respect to trigonometric polynomial fitting, which is how I usually deal with this kind of data (unless the response is not very smooth, in which case I switch to a Gaussian process with a periodic kernel).
Best Answer
Splines are used in regression modeling to capture possibly complex, non-linear functional forms. A spline-smoothed trend consists of piecewise polynomials, joined continuously, whose leading coefficient changes at each breakpoint, or knot. The spline is specified by the polynomial degree of the trend and by the breakpoints. A spline representation of a covariate expands a single vector of observed values into a matrix whose number of columns is the polynomial degree plus the number of knots.
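To see the column count concretely, here is a small sketch using the stock bs() basis from the splines package (the particular degree and knot values are arbitrary choices for illustration):

library(splines)
x <- seq(0, 10, length.out = 50)
# cubic B-spline basis with two interior knots
B <- bs(x, degree = 3, knots = c(3, 7))
dim(B)  # 50 rows, degree + number of knots = 3 + 2 = 5 columns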
A periodic version of splines is merely a periodic version of any regression: the data are cut into replicates of the length of the period. For instance, modeling a diurnal trend in a multiday experiment on rats would require recoding the time of the experiment into 24-hour increments, so the 154th hour becomes 10, the value of 154 modulo 24 (154 = 6*24 + 10). If you fit a linear regression on the cut data, it estimates a sawtooth waveform for the trend. If you fit a step function somewhere in the period, it estimates a square waveform. The spline is capable of expressing a much more sophisticated waveform. For what it's worth, the splines package has a function, periodicSpline, which does exactly this.
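As a minimal interpolation sketch of periodicSpline (the sample points and period here are illustrative choices, not prescribed by the package):

library(splines)
xx <- seq(0, 2*pi, length.out = 17)[-17]  # one period, endpoint dropped
yy <- sin(xx)
psp <- periodicSpline(xx, yy, period = 2*pi)
# the fitted spline wraps around, so it can be evaluated over several periods
newx <- seq(-2*pi, 4*pi, length.out = 400)
plot(newx, predict(psp, newx)$y, type = "l")
points(xx, yy)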
I don't find R's default spline implementation, bs, useful for interpretation, so I wrote my own script below. For a spline of degree $p$ with $n_k$ knots, this representation gives the first $p$ columns the standard polynomial representation; the $(p+i)$-th column ($i \le n_k$) is evaluated as $S_{p+i} = (X - k_i)^p\,\mathcal{I}(X > k_i)$, where $k$ is the sorted vector of knots.
myspline <- function(x, degree, knots) {
  knots <- sort(knots)
  # x itself plus the shifted terms (x - k_i); clamp negative parts to zero,
  # then raise to the power, giving x^degree and the truncated power terms
  # (x - k_i)^degree * I(x > k_i)
  val <- cbind(x, outer(x, knots, `-`))
  val[val < 0] <- 0
  val <- val^degree
  nms <- paste0('spline', degree, '.', seq(length(knots) + 1))
  if (degree > 1) {
    # prepend the lower-order polynomial terms x, x^2, ..., x^(degree-1)
    val <- cbind(outer(x, 1:(degree - 1), `^`), val)
    nms <- c(paste0('spline', 1:(degree - 1), '.1'), nms)
  }
  colnames(val) <- nms
  val
}
For a little case study, approximate a sinusoidal trend on the domain from 0 to $2\pi$ (or $\tau$) like so:
x <- seq(0, 2*pi, by=pi/2^8)
y <- sin(x)
plot(x, y, type='l')
s <- myspline(x, 2, pi)
fit <- lm(y ~ s)
yhat <- predict(fit)
lines(x, yhat)
You'll see they're quite concordant. Further, the naming convention enables interpretation. In the regression output you see:
> summary(fit)

Call:
lm(formula = y ~ s)

Residuals:
      Min        1Q    Median        3Q       Max 
 -0.04564  -0.02050   0.00000   0.02050   0.04564 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -0.033116   0.003978   -8.326 7.78e-16 ***
sspline1.1   1.268812   0.004456  284.721  < 2e-16 ***
sspline2.1  -0.400520   0.001031 -388.463  < 2e-16 ***
sspline2.2   0.801040   0.001931  414.878  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02422 on 509 degrees of freedom
Multiple R-squared:  0.9988,	Adjusted R-squared:  0.9988 
F-statistic: 1.453e+05 on 3 and 509 DF,  p-value: < 2.2e-16
The spline1.1 coefficient is the linear term of the polynomial trend on the first domain, before the first breakpoint: it is the slope of the tangent at the origin, $X=0$. This is nearly 1, as the derivative of the sinusoid there would indicate ($\cos(0) = 1$), but we must bear in mind that these are approximations, and extrapolating the quadratic trend out to $\pi/2$ accumulates error. The spline2.1 quadratic term is negative, indicating a concave shape on the first domain. The spline2.2 term is the difference in the quadratic coefficient after the knot: adding it to spline2.1 gives a positive leading coefficient of roughly 0.4, indicating an upward, convex shape on the second domain. So we now have an interpretation available for the spline output and can judge the inference and estimates accordingly.
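One can check the recombination directly from the fitted coefficients (using the names lm assigned in the output above):

b <- coef(fit)
# leading quadratic coefficient after the knot at pi:
unname(b["sspline2.1"] + b["sspline2.2"])  # roughly +0.40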
I'm going to assume that you know the periodicity of the data at hand. If the data lack a growth or moving-average component, you may transform a long time series into replicates of a short series, each one period long. You then have replicates and can estimate the recurrent trend from them.
Suppose I generate the following somewhat noisy, very long time series:
x <- seq(1, 100, by=0.01)
y <- sin(x) + rnorm(length(x), 0, 10)
xp <- x %% (2*pi)  # fold time onto a single period
s <- myspline(xp, degree=2, knots=pi)
fit <- lm(y ~ s)
Despite the noise (and the correspondingly tiny R-squared), the coefficient estimates remain close to those from the noiseless fit:
> summary(fit)

Call:
lm(formula = y ~ s)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.585  -6.736   0.013   6.750  37.389 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.48266    0.38155  -1.265 0.205894    
sspline1.1   1.52798    0.42237   3.618 0.000299 ***
sspline2.1  -0.44380    0.09725  -4.564 5.09e-06 ***
sspline2.2   0.76553    0.18198   4.207 2.61e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.949 on 9897 degrees of freedom
Multiple R-squared:  0.006406,	Adjusted R-squared:  0.006105 
F-statistic: 21.27 on 3 and 9897 DF,  p-value: 9.959e-14
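To see the recovered trend against the underlying sinusoid, one possible follow-up plot using the objects defined above:

# overlay the fitted periodic trend (solid) on the true sinusoid (dashed)
ord <- order(xp)
plot(xp[ord], sin(xp)[ord], type = "l", lty = 2,
     xlab = "time within period", ylab = "trend")
lines(xp[ord], predict(fit)[ord])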