I am testing the R package changepoint
, and the results are strange.
Here is my code:
x = 1:100 x = x + rnorm(100, 0, 10) cpt1 <- cpt.mean(x, test.stat="Normal", method="PELT") x = 1:100/10 x = x + rnorm(100, 0, 1) cpt2 <- cpt.mean(x, test.stat="Normal", method="PELT") par(mfcol=c(2,1)) plot(cpt1) plot(cpt2)
I don't think this makes any sense, because first graph has smaller increasing means. I think there should be fewer change points for the means.
Best Answer
As JeromeLaurent said, I think the problem lies in the range of the data. Specifically, the errors in your first plot are an order of magnitude larger than the second plot. Therefore, the changepoint analysis can more easily mistake the larger random fluctuations/noise as a real change unless you specify the penalty appropriately. The default penalty for cpt.mean is "SIC". (I am uncertain why this penalty is not good in this case.)
In the changepoint article (http://www.jstatsoft.org/v58/i03/paper), Killick and Eckley state, "The choice of appropriate penalty is still an open question and typically depends on many factors including the size of the changes and the length of segments, both of which are unknown prior to analysis." They give useful papers such as Birge and Massart 2007 for current research on penalty choice.
Although Killick and Eckley recommend plotting the data and choosing penalty based on what visually appears reasonable, I found other answers: Penalty value in changepoint analysis and Change point analysis to be helpful as they show that you can use elbow-plots to pick an appropriate penalty. According to those answers, you should choose the penalty at the elbow of the plot. This is because at low penalties, you are getting spurious change points that drop off as you increase the penalty.
Here is code and the elbow plots to show that you should specify a much higher penalty for the data with higher random noise. Then your changepoint analysis will not give such strange answers.
x1 <- 1:100 x1 <- x1 + rnorm(100, 0, 10) cpt1 <- cpt.mean(x1, test.stat="Normal", method="PELT") x2 <- 1:100/10 x2 <- x2 + rnorm(100, 0, 1) cpt2 <- cpt.mean(x2, test.stat="Normal", method="PELT") cptfn <- function(data, pen) { ans <- cpt.mean(data, test.stat="Normal", method = "PELT", penalty = "Manual", pen.value = pen) length(cpts(ans)) +1 } elbowplot1 <- unlist(lapply(c(1:1000), function(p) cptfn(data = x1, pen = p))) elbowplot2 <- unlist(lapply(c(1:1000), function(p) cptfn(data = x2, pen = p))) par(mfrow=c(3,2)) plot(cpt1) title(main = "Default penalty for x1") plot(cpt2) title(main = "Default penalty for x2") plot(elbowplot1) title(main = "# changepoints for given penalty for x1") plot(elbowplot2) title(main = "# changepoints for given penalty for x2") cpt1penalty <- cpt.mean(x1, test.stat="Normal", method="PELT", penalty = "Manual", pen.value = 400) cpt2penalty <- cpt.mean(x2, test.stat="Normal", method="PELT", penalty = "Manual", pen.value = 10) plot(cpt1penalty) title(main = "Penalty = 400 for x1") plot(cpt2penalty) title(main = "Penalty = 10 for x2")
Just one more thought: it would seem like this type of data is better described by a line, rather than change points?