# Solved – How to program the deleted d-jack-knife

I have a dataset containing 45 observations. I want to sample times from this dataset, but with sample size equal to 35 each time. So each time I want to delete 10 datapoints from the original dataset. In total there are \${45}choose{10}\$ possible ways to delete d points from the sample. I want all these possibilities to be exploited during the resampling.

Can someone help me to program this? I must add that the code \${45}choose{10}\$ seems not to work in R because it exceeds the memory capacity.

Contents

It is rare in statistical analysis that such code is truly needed (and it is not needed for the Jackknife), but here (for the record) is a solution.

A reliable method to generate a non-duplicating collection of \$k\$-subsets of \$n\$ things is to associate a unique combination with each integer \$x\$ in the set \${0, 1, ldots, binom{n}{k}-1}\$. This can be done by walking upwards in Pascal's Triangle back to its apex, exploiting the relation \$binom{n}{k} = binom{n-1}{k-1} + binom{n-1}{k}\$, and picking elements at each row where you move left (from \$k\$ to \$k-1\$).

Start with the triple \$(x, n, k)\$ with \$0 le x lt binom{n}{k}\$ and \$0 le k le n\$. When \$x lt binom{n-1}{k-1}\$, select the first of \$n\$ elements, decrement \$x\$ to \$\$x^prime = x – binom{n-1}{k-1},\$\$ and proceed to select \$k-1\$ out of \$n-1\$ elements based on the triple \$(x^prime, n-1, k-1)\$. Otherwise do not select the first of \$n\$ elements, do not change \$x\$, and proceed to select \$k\$ out of \$n-1\$ elements based on the triple \$(x, n-1, k)\$. In any event, if \$n = k\$, then pick all \$k\$ elements.

The proof that this works depends on some easily established invariants:

1. At the outset, \$0 le x lt binom{n}{k}\$.

2. Exactly \$k\$ elements will be picked.

You can check that the following code satisfies these invariants.

``choose.int <- function(x, n, k) {   if(n <= k) return(rep(TRUE, k))   u <- choose(n-1, k-1)   pick <- x < u   if (pick) y <- choose.int(x, n-1, k-1) else y <- choose.int(x-u, n-1, k)   return(c(pick, y)) } ``

The proof that it generates all such subsets, without duplication, can be accomplished with an inductive argument on \$n\$.

Thus, to select (say) \$1000\$ unique \$10\$-subsets of \$45\$ things, obtain a random sample without replacement from the set \${0, 1, ldots, binom{45}{10}-1}\$. Using `choose.int`, convert each value \$x\$ in this set into a vector indicating which of the \$45\$ elements to select:

``n <- 45; k <- 10 sample <- sapply(sample.int(choose(n, k), 1000)-1, choose.int, n=n, k=k) ``

Timing indicates about \$10,000\$ such subsets can be generated per second (because `R` is not efficient with recursive functions).

Because this is a largish array, let's illustrate with smaller values. How about picking \$6\$ of the \$10 = binom{5}{3}\$ three-subsets of five things:

``n <- 5; k <- 3 sapply(sample.int(choose(n, k), 6)-1, choose.int, n=n, k=k) ``

The output will vary with the random seed, but in one case it was

``      [,1]  [,2]  [,3]  [,4]  [,5]  [,6] [1,]  TRUE  TRUE FALSE  TRUE  TRUE FALSE [2,]  TRUE FALSE  TRUE  TRUE FALSE  TRUE [3,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE [4,]  TRUE  TRUE FALSE FALSE FALSE  TRUE [5,] FALSE  TRUE  TRUE FALSE  TRUE FALSE ``

Each column indicates which elements to include in the sample. They all have \$3\$ `TRUE` values for the elements picked. No two of the columns are the same.

Rate this post