I have a dataset containing 45 observations. I want to resample from this dataset repeatedly, with a sample size of 35 each time. So each time I want to delete 10 data points from the original dataset. In total there are $\binom{45}{10}$ possible ways to delete $d = 10$ points from the sample. I want all these possibilities to be available during the resampling.
Can someone help me to program this? I must add that generating all $\binom{45}{10}$ combinations explicitly seems not to work in R because it exceeds the memory capacity.
Many thanks in advance!
Best Answer
It is rare in statistical analysis that such code is truly needed (and it is not needed for the Jackknife), but here (for the record) is a solution.
A reliable method to generate a non-duplicating collection of $k$-subsets of $n$ things is to associate a unique combination with each integer $x$ in the set $\{0, 1, \ldots, \binom{n}{k}-1\}$. This can be done by walking upwards in Pascal's Triangle back to its apex, exploiting the relation $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$, and picking elements at each row where you move left (from $k$ to $k-1$).
Start with the triple $(x, n, k)$, where $0 \le x \lt \binom{n}{k}$ and $0 \le k \le n$. When $x \lt \binom{n-1}{k-1}$, select the first of the $n$ elements, leave $x$ unchanged, and proceed to select $k-1$ out of the remaining $n-1$ elements based on the triple $(x, n-1, k-1)$. Otherwise, do not select the first of the $n$ elements, decrement $x$ to $$x^\prime = x - \binom{n-1}{k-1},$$ and proceed to select $k$ out of the remaining $n-1$ elements based on the triple $(x^\prime, n-1, k)$. In any event, once $n = k$, pick all $k$ remaining elements.
The proof that this works depends on some easily established invariants:

- At the outset, $0 \le x \lt \binom{n}{k}$.
- Exactly $k$ elements will be picked.
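The first invariant is preserved throughout the recursion: the selecting branch leaves $x \lt \binom{n-1}{k-1}$ unchanged, while in the other branch Pascal's relation gives
$$0 \le x - \binom{n-1}{k-1} \lt \binom{n}{k} - \binom{n-1}{k-1} = \binom{n-1}{k},$$
so every recursive call again receives a value in the required range.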
You can check that the following code satisfies these invariants.
    choose.int <- function(x, n, k) {
      # Base case: all remaining elements must be taken.
      if (n <= k) return(rep(TRUE, k))
      u <- choose(n-1, k-1)
      # Select the first element exactly when x falls in the first block of indices.
      pick <- x < u
      if (pick) y <- choose.int(x, n-1, k-1) else y <- choose.int(x-u, n-1, k)
      return(c(pick, y))
    }
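As a small illustration of how this unwinds, consider decoding $x = 7$ among the $\binom{5}{3} = 10$ three-subsets of five things. Because $7 \ge \binom{4}{2} = 6$, the first element is skipped and the remainder $1$ is decoded against the remaining four elements, which ends up picking elements $2$, $3$, and $5$:

    choose.int(7, 5, 3)
    # [1] FALSE  TRUE  TRUE FALSE  TRUE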
The proof that it generates all such subsets, without duplication, can be accomplished with an inductive argument on $n$.
Thus, to select (say) $1000$ unique $10$-subsets of $45$ things, obtain a random sample without replacement from the set $\{0, 1, \ldots, \binom{45}{10}-1\}$. Using choose.int, convert each value $x$ in this sample into a logical vector indicating which of the $45$ elements to select:
    n <- 45; k <- 10
    # Each column of the 45-by-1000 result flags one distinct 10-subset.
    sample <- sapply(sample.int(choose(n, k), 1000) - 1, choose.int, n = n, k = k)
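To connect this back to the question, here is a sketch of how one might use the result, assuming the $45$ observations are stored in a data frame named dat (a hypothetical name, not defined above). Each column of sample flags the $10$ observations to delete, so dropping them leaves a subsample of size $35$:

    # Sketch (assumes a 45-row data frame `dat`): drop the 10 flagged rows in
    # each column to obtain 1000 subsamples of size 35.
    subsamples <- lapply(seq_len(ncol(sample)),
                         function(j) dat[!sample[, j], , drop = FALSE])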
Timing indicates about $10,000$ such subsets can be generated per second (because R is not efficient with recursive functions).
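If you would like to check that figure on your own machine, one rough timing run (my addition, using the n and k already defined above) is:

    # Rough benchmark: generate 10,000 subsets and time it.
    system.time(sapply(sample.int(choose(n, k), 10000) - 1, choose.int, n = n, k = k))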
Because this is a largish array, let's illustrate with smaller values. How about picking $6$ of the $10 = \binom{5}{3}$ three-subsets of five things:
    n <- 5; k <- 3
    sapply(sample.int(choose(n, k), 6) - 1, choose.int, n = n, k = k)
The output will vary with the random seed, but in one case it was
          [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
    [1,]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
    [2,]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
    [3,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
    [4,]  TRUE  TRUE FALSE FALSE FALSE  TRUE
    [5,] FALSE  TRUE  TRUE FALSE  TRUE FALSE
Each column indicates which elements to include in the sample. They all have $3$ TRUE values for the elements picked. No two of the columns are the same.
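As a quick sanity check (my own addition, storing the small example in a hypothetical variable m), both properties can be verified programmatically:

    # Verify: every column has exactly k TRUEs and no two columns coincide.
    m <- sapply(sample.int(choose(n, k), 6) - 1, choose.int, n = n, k = k)
    stopifnot(all(colSums(m) == k), !any(duplicated(t(m))))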