Solved – Equi-probable sampling in R using the prob argument in sample()

If I have data of length n and want to generate a random sample of length N, does the following use of the prob argument make each observation equi-probable?

random.sample = sample(mydata, N, replace=TRUE, prob=rep(1/n, times=n))   

As an example, the following density plot shows the density of a sample generated with (red) and without (black) the above prob argument:

enter image description here

The overall shape of the curve and the position of peaks and troughs has not changed so how does prob influence the result?

Following from the above train of thought:

  1. Is it even possible, statistically speaking, to make such an "equi-probable" sample? If yes, how?
  2. More generally, how does the prob argument work and in which cases is it used in random sampling?

My answer is going to be based on Whuber's comment from above. Like Whuber said, by default, sample should be sampling with equal probability. However, if you specify it yourself using the prob option, the two methods do not return the same answer. However, the difference between the two is systematic. In fact, it turns out (if you set the random seed) the sample will be exactly the same minus one. That is, if you use the prob option then you should need to only subtract 1 from your samples to get back what sample would have returned had you not used the prob option. Here is some very short code illustrating that point.

N = 100 n = length(50:90)  set.seed(1) random.sample1 = sample(50:90, N, replace=TRUE, prob=rep(1/n, times=n))  set.seed(1) random.sample2 = sample(50:90, N, replace=TRUE)  plot(density(random.sample1),col="blue") lines(density(random.sample2),col="red")  summary(random.sample1) summary(random.sample2)  random.sample1 random.sample2 

which yields

> summary(random.sample1)    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    50.00   63.00   70.00   71.27   82.00   90.00  > summary(random.sample2)    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    50.00   62.75   69.50   70.68   81.00   90.00  >  > random.sample1   [1] 61 66 74 88 59 87 89 78 76 53 59 58 79 66 82 71 80 50 66 82 89 59 77 56 61  [26] 66 51 66 86 64 70 75 71 58 84 78 83 55 80 67 84 77 83 73 72 83 51 70 81 79  [51] 70 86 68 61 53 55 63 72 78 67 88 63 69 64 77 61 70 82 54 86 64 85 65 64 70  [76] 87 86 66 82 90 68 80 67 64 82 59 80 55 61 56 60 53 77 86 82 83 69 67 84 75 > random.sample2   [1] 60 65 73 87 58 86 88 77 75 52 58 57 78 65 81 70 79 90 65 81 88 58 76 55 60  [26] 65 50 65 85 63 69 74 70 57 83 77 82 54 79 66 83 76 82 72 71 82 50 69 80 78  [51] 69 85 67 60 52 54 62 71 77 66 87 62 68 63 76 60 69 81 53 85 63 84 64 63 69  [76] 86 85 65 81 89 67 79 66 63 81 58 79 54 60 55 59 52 76 85 81 82 68 66 83 74 >  

So as you can see, the two sample are the same with one being shifted by minus 1. Why this occurs I have not figured out by my guess is that by default the sample command assigns the equal probabilities differently then the user would. Also, a disclaimer, the above idea works in the case when the sample is an integer, however, when I ran the same code sampling from numbers with decimals, I could not get the above results to hold. Most likely it must have to do something with the comment: "The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero."

enter image description here

Update:

So after digging a bit deeper we see that the sample command actually relies on the command sample.int. However, withing sample.int, if you do not specify the prob option then the command calls .Internal(sample2()) which I have not figured out how to see inside of. However, if someone know how to see what the function sample2 is doing, then we will have our answer as to how they specify the prob option when not explicitly given.

> sample function (x, size, replace = FALSE, prob = NULL)  {     if (length(x) == 1L && is.numeric(x) && x >= 1) {         if (missing(size))              size <- x         sample.int(x, size, replace, prob)     }     else {         if (missing(size))              size <- length(x)         x[sample.int(length(x), size, replace, prob)]     } } <bytecode: 0x000000000ff22210> <environment: namespace:base>  > sample.int function (n, size = n, replace = FALSE, prob = NULL)  {     if (!replace && is.null(prob) && n > 1e+07 && size <= n/2)          .Internal(sample2(n, size))     else .Internal(sample(n, size, replace, prob)) } <bytecode: 0x000000000ffa3478> <environment: namespace:base> 

Similar Posts:

Rate this post

Leave a Comment