I'm trying to implement Pettitt test in R following papers like this pdf (pp. 5 & 6), or this pdf. But, I'm misunderstanding something, because having tested it with some data, I think that output is not correct.
Here is the code:
pettitt <- function(x, alpha=0.99) { # Pettitt AN. 1979 A non-parametric approach to the change point detection. # x is a vector # alpha, integer, level of significance x <- na.omit(x) o <- rank(x) s <- c() L <- length(x) for (i in 1:(L-1)) { s <- c(s, 2*(colSums(as.matrix(o[1:i]))) - (i*(L+1)) ) } vc <- sqrt((-1) * log(alpha) * (L^3 + L^2)/6) output <- list(abs(s), vc) return(output) }
Testing with larain
and tempdub
dataset from TSA package
:
library(TSA) data(larain) data(tempdub) pettitt(larain) [[1]] [1] 78 118 180 76 30 30 144 90 124 148 224 334 314 298 362 444 356 334 [19] 300 302 194 121 83 55 45 57 25 95 175 195 193 287 181 231 175 213 [37] 301 331 421 345 392 322 282 354 372 274 194 130 188 248 175 97 85 153 [55] 105 171 181 189 245 297 401 375 449 557 467 551 594 576 602 490 406 354 [73] 262 266 362 248 244 214 208 200 247 147 89 13 9 15 97 5 9 83 [91] 3 95 123 63 31 12 44 6 48 34 72 108 208 164 170 282 214 148 [109] 202 140 104 6 102 86 [[2]] [1] 50.69224 > max(pettitt(larain)[[1]]) [1] 602 pettitt(tempdub) [[1]] [1] 83 161 226 235 164 60 80 169 220 219 188 74 57 177 266 281 228 147 [19] 19 82 125 140 102 41 100 197 235 254 233 141 1 97 144 153 112 26 [37] 73 206 255 258 235 137 28 49 98 101 46 29 149 252 281 274 247 160 [55] 43 70 115 126 79 22 157 248 317 328 287 224 96 27 86 79 27 82 [73] 225 348 407 406 351 256 125 10 58 77 32 61 200 314 381 386 353 216 [91] 124 40 35 70 35 36 173 302 365 386 321 242 131 10 51 38 19 146 [109] 241 319 342 359 330 223 89 45 113 144 111 2 123 228 280 275 250 177 [127] 34 50 89 102 59 22 131 248 334 359 302 198 73 46 83 100 73 [[2]] [1] 70.96777 > max(pettitt(tempdub)[[1]]) [1] 407
I don't know if I lost something in pettitt test or there are error in my code.
Best Answer
After read carefully Pettitt paper, I can understand the method and check my code.
Pettitt starts with a general method expresed:
$$U_{t,T} = sum_{i=1}^{t} sum_{j=t+1}^{T} D_{ij}$$
where $D_{ij} = sign(X_i – X_j)$
Then Pettitt follows expressing a test for discrete Bernoulli and Binomial data and two variants for continuos data.
For discrete test, he use $U_{t,T} = T cdot (S_t – frac{tS_t}{T})$, where $S_t = sum_{1}^{T} X_j$, $S_T = sum_{1}^{T} X_j$ and $X_j$ is eval function expresed like Bernoulli serie where values are 0 or 1.
For continuos test, he use:
- $U_{t,T} = 2 cdot W_t – t(T+1)$, when $W_t=sum R_j$ and $R_j$ are the rank or the data. (There is a variation when ties exists.)
- $U_{t,T} = U_{t-1,T} + V_{t,T}$, with $t=2,dots,T$ and $V_{t,T} = sum_{j=1}^{T} sign(X_t – X_j)$.
Then, my original code are variant 1 for continuos test and results are correct, except for guessing critical values in Pettitt series results.
I calculate these critical values solving $K_T$ from $P_{OA} = 2e^{(frac{-6{K_{T}}^2}{T^3+T^2})}$, but although $P_{OA}$ is solved correctly, the critical value is not. Why? Actually, I don't know and I can't reach others papers where shows how to calculate table expresed in, for example, Sahin et al paper (link in previous comments).
So that, I change my function code in order to include all variants and $P_{OA}$ calcul. Here are:
<!-- language: lang-R --> pettitt<-function(x,alpha=0.01,method=c("discrete","continuos"), alternative=c("rank","variation")) { # Pettitt AN. 1979 A non-parametric approach to the change point detection. # (Sección 2.3) # # x is a numeric vector # alpha is a integer x<-na.omit(x) orden<-rank(x) T<-length(x) method<-match.arg(method) alternative<-match.arg(alternative) # U.t.T<-c() V.i.T<-c() P.o.a<-NULL P.o.a.p<-NULL P.o.a.n<-NULL k.T<-NULL k.T.p<-NULL k.T.n<-NULL # if (!is.numeric(x)) stop("'x' must be a numeric vector) # # Discrete values if (method == "discrete") { x.j<-sign(x) x.j[which(x.j==-1)]<-0 S.T<-sum(x.j) for (i in 1:T) { S.t<-sum(x.j[1:i]) U.t.T<-c(U.t.T, T*(S.t-(i*S.T/T))) } k.T<-max(abs(U.t.T)) k.T.p<-max(U.t.T) k.T.n<-min(U.t.T) P.o.a<-exp((-2*k.T.p^2)/(S.T*(T^2-T*S.T))) critical<-sqrt((log(alpha)*(S.T*(T^2-T*S.T)))/-2) } # # Continuos value. if (method == "continuos" & alternative == "rank") { TIES<-length(unique(x)) < T if (!TIES) { for (i in 1:T) { U.t.T<-c(U.t.T, 2*(colSums(as.matrix(orden[1:i]))) -(i*(T+1)) ) } } else { frequency<-as.vector(table(x)) total.frequency<-sum(frequency) for (i in 1:length(frequency)) { U.t.T<-c(U.t.T, 1-(total.frequency*(frequency[i]^2-1))/(T*(T^2-1)) ) } } k.T<-max(abs(U.t.T)) P.o.a<-2*exp((-6*k.T^2)/(T^3+T^2)) critical<-sqrt((log(alpha/2)*(T^3+T^2))/-6) } if (method == "continuos" & alternativa == "variation") { V.i.T<-matrix(rep(NA,T^2),ncol=T) for (i in 1:T) { for (j in 1:T) { V.i.T[j,i]<-sign(x[i]-x[j]) } if (i==1) { U.t.T<-sum(V.i.T[,i],na.rm=T) } else { U.t.T<-c(U.t.T, U.t.T[(i-1)]+sum(V.i.T[,i],na.rm=T) ) } } V.i.T<-colSums(V.i.T,na.rm=T) k.T.p<-max(U.t.T) k.T.n<-min(U.t.T) k.T<-max(abs(U.t.T)) P.o.a.p<-exp((-6*k.T.p^2)/(T^3+T^2)) P.o.a.n<-exp((-6*k.T.n^2)/(T^3+T^2)) P.o.a<-2*exp((-6*k.T^2)/(T^3+T^2)) critical<-sqrt((log(alpha/2)*(T^3+T^2))/-6) } output<-list(U.t.T,V.i.T,P.o.a,P.o.a.p,P.o.a.n,k.T,k.T.p,k.T.n,critical) return(output) }
If anyone knows how calcul critical values are welcome.
So many thanks.