If I have data of length n
and want to generate a random sample of length N
, does the following use of the prob
argument make each observation equi-probable?
random.sample = sample(mydata, N, replace=TRUE, prob=rep(1/n, times=n))
As an example, the following density plot shows the density of a sample generated with (red) and without (black) the above prob
argument:
The overall shape of the curve and the position of peaks and troughs has not changed so how does prob
influence the result?
Following from the above train of thought:
- Is it even possible, statistically speaking, to make such an "equi-probable" sample? If yes, how?
- More generally, how does the
prob
argument work and in which cases is it used in random sampling?
Best Answer
My answer is going to be based on Whuber's comment from above. Like Whuber said, by default, sample
should be sampling with equal probability. However, if you specify it yourself using the prob
option, the two methods do not return the same answer. However, the difference between the two is systematic. In fact, it turns out (if you set the random seed) the sample will be exactly the same minus one. That is, if you use the prob
option then you should need to only subtract 1 from your samples to get back what sample would have returned had you not used the prob
option. Here is some very short code illustrating that point.
N = 100 n = length(50:90) set.seed(1) random.sample1 = sample(50:90, N, replace=TRUE, prob=rep(1/n, times=n)) set.seed(1) random.sample2 = sample(50:90, N, replace=TRUE) plot(density(random.sample1),col="blue") lines(density(random.sample2),col="red") summary(random.sample1) summary(random.sample2) random.sample1 random.sample2
which yields
> summary(random.sample1) Min. 1st Qu. Median Mean 3rd Qu. Max. 50.00 63.00 70.00 71.27 82.00 90.00 > summary(random.sample2) Min. 1st Qu. Median Mean 3rd Qu. Max. 50.00 62.75 69.50 70.68 81.00 90.00 > > random.sample1 [1] 61 66 74 88 59 87 89 78 76 53 59 58 79 66 82 71 80 50 66 82 89 59 77 56 61 [26] 66 51 66 86 64 70 75 71 58 84 78 83 55 80 67 84 77 83 73 72 83 51 70 81 79 [51] 70 86 68 61 53 55 63 72 78 67 88 63 69 64 77 61 70 82 54 86 64 85 65 64 70 [76] 87 86 66 82 90 68 80 67 64 82 59 80 55 61 56 60 53 77 86 82 83 69 67 84 75 > random.sample2 [1] 60 65 73 87 58 86 88 77 75 52 58 57 78 65 81 70 79 90 65 81 88 58 76 55 60 [26] 65 50 65 85 63 69 74 70 57 83 77 82 54 79 66 83 76 82 72 71 82 50 69 80 78 [51] 69 85 67 60 52 54 62 71 77 66 87 62 68 63 76 60 69 81 53 85 63 84 64 63 69 [76] 86 85 65 81 89 67 79 66 63 81 58 79 54 60 55 59 52 76 85 81 82 68 66 83 74 >
So as you can see, the two sample are the same with one being shifted by minus 1. Why this occurs I have not figured out by my guess is that by default the sample
command assigns the equal probabilities differently then the user would. Also, a disclaimer, the above idea works in the case when the sample is an integer, however, when I ran the same code sampling from numbers with decimals, I could not get the above results to hold. Most likely it must have to do something with the comment: "The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero."
Update:
So after digging a bit deeper we see that the sample
command actually relies on the command sample.int
. However, withing sample.int
, if you do not specify the prob
option then the command calls .Internal(sample2())
which I have not figured out how to see inside of. However, if someone know how to see what the function sample2
is doing, then we will have our answer as to how they specify the prob
option when not explicitly given.
> sample function (x, size, replace = FALSE, prob = NULL) { if (length(x) == 1L && is.numeric(x) && x >= 1) { if (missing(size)) size <- x sample.int(x, size, replace, prob) } else { if (missing(size)) size <- length(x) x[sample.int(length(x), size, replace, prob)] } } <bytecode: 0x000000000ff22210> <environment: namespace:base> > sample.int function (n, size = n, replace = FALSE, prob = NULL) { if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, size)) else .Internal(sample(n, size, replace, prob)) } <bytecode: 0x000000000ffa3478> <environment: namespace:base>