# Solved – Equi-probable sampling in R using the prob argument in sample()

If I have data of length `n` and want to generate a random sample of length `N`, does the following use of the `prob` argument make each observation equi-probable?

```r
random.sample = sample(mydata, N, replace=TRUE, prob=rep(1/n, times=n))
```

As an example, a density plot of a sample generated with (red) and without (black) the above `prob` argument shows that the overall shape of the curve and the positions of the peaks and troughs have not changed. So how does `prob` influence the result?

Following from the above train of thought:

1. Is it even possible, statistically speaking, to make such an "equi-probable" sample? If yes, how?
2. More generally, how does the `prob` argument work and in which cases is it used in random sampling?

My answer is based on Whuber's comment above. As Whuber said, `sample` draws with equal probability by default. However, if you specify equal probabilities yourself through the `prob` argument, the two calls do not return the same values, and the difference between them is systematic. In fact, with the same random seed, the two samples are identical up to a shift of one: if you use the `prob` argument, you only need to subtract 1 from your draws to recover what `sample` would have returned without it. Here is some very short code illustrating that point.

```r
N = 100
n = length(50:90)

# Same seed, equal probabilities passed explicitly through prob
set.seed(1)
random.sample1 = sample(50:90, N, replace=TRUE, prob=rep(1/n, times=n))

# Same seed, default behaviour (prob = NULL)
set.seed(1)
random.sample2 = sample(50:90, N, replace=TRUE)

# Overlay the two density estimates
plot(density(random.sample1), col="blue")
lines(density(random.sample2), col="red")

summary(random.sample1)
summary(random.sample2)

random.sample1
random.sample2
```

which yields

```
> summary(random.sample1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  50.00   63.00   70.00   71.27   82.00   90.00 
> summary(random.sample2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  50.00   62.75   69.50   70.68   81.00   90.00 
>
> random.sample1
   61 66 74 88 59 87 89 78 76 53 59 58 79 66 82 71 80 50 66 82 89 59 77 56 61
   66 51 66 86 64 70 75 71 58 84 78 83 55 80 67 84 77 83 73 72 83 51 70 81 79
   70 86 68 61 53 55 63 72 78 67 88 63 69 64 77 61 70 82 54 86 64 85 65 64 70
   87 86 66 82 90 68 80 67 64 82 59 80 55 61 56 60 53 77 86 82 83 69 67 84 75
> random.sample2
   60 65 73 87 58 86 88 77 75 52 58 57 78 65 81 70 79 90 65 81 88 58 76 55 60
   65 50 65 85 63 69 74 70 57 83 77 82 54 79 66 83 76 82 72 71 82 50 69 80 78
   69 85 67 60 52 54 62 71 77 66 87 62 68 63 76 60 69 81 53 85 63 84 64 63 69
   86 85 65 81 89 67 79 66 63 81 58 79 54 60 55 59 52 76 85 81 82 68 66 83 74
>
```
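Rather than comparing the two vectors by eye, the relationship can be checked programmatically. The sketch below hard-codes only the first 25 draws of each vector, copied from the printed output, and takes the paired differences modulo `n`; the helper name `shift_mod_n` is made up for this illustration. The modulo matters because the 18th pair is 50 versus 90, which is a shift of one only if the range 50:90 wraps around.

```r
# First 25 draws of each vector, copied from the printed output above.
with_prob <- c(61, 66, 74, 88, 59, 87, 89, 78, 76, 53, 59, 58, 79,
               66, 82, 71, 80, 50, 66, 82, 89, 59, 77, 56, 61)
without_prob <- c(60, 65, 73, 87, 58, 86, 88, 77, 75, 52, 58, 57, 78,
                  65, 81, 70, 79, 90, 65, 81, 88, 58, 76, 55, 60)

n <- length(50:90)  # 41 candidate values

# Hypothetical helper: paired difference, wrapped into 0..(n - 1).
# R's %% takes the sign of the divisor, so 50 - 90 = -40 maps to 1.
shift_mod_n <- function(a, b, n) (a - b) %% n

offsets <- shift_mod_n(with_prob, without_prob, n)
table(offsets)      # how often each offset occurs
all(offsets == 1)   # TRUE for these 25 pairs
```

The same helper can be applied to the full `random.sample1` and `random.sample2` vectors, although whether a fresh run reproduces the printed draws may depend on the R version and its sampling settings.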

So, as you can see, the two samples are the same, with one shifted by one relative to the other (wrapping around at the end of the range: the 18th draw is 50 with `prob` and 90 without). Why this occurs I have not figured out, but my guess is that by default the `sample` command assigns the equal probabilities differently than the user would. Also, a disclaimer: the above holds when the values being sampled are integers; when I ran the same code sampling from numbers with decimals, I could not get the same result. Most likely it has something to do with the following comment: "The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero."

# Update:

So, after digging a bit deeper, we see that the `sample` command actually relies on the command `sample.int`. Within `sample.int`, `.Internal(sample2())` is called only when sampling without replacement, with no `prob`, with `n > 1e7`, and with `size <= n/2`; in every other case, including the example above, `.Internal(sample())` is called with `prob` either supplied or `NULL`. I have not figured out how to see inside these internal functions. However, if someone knows how to see what they are doing, then we will have our answer as to how the equal probabilities are assigned when `prob` is not explicitly given.

```
> sample
function (x, size, replace = FALSE, prob = NULL) 
{
    if (length(x) == 1L && is.numeric(x) && x >= 1) {
        if (missing(size)) 
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size)) 
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x000000000ff22210>
<environment: namespace:base>

> sample.int
function (n, size = n, replace = FALSE, prob = NULL) 
{
    if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) 
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
<bytecode: 0x000000000ffa3478>
<environment: namespace:base>
```
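The listing suggests that, for a vector with more than one element, `sample(x, ...)` is just `x` indexed by the result of `sample.int(length(x), ...)`, with `prob` passed through unchanged. A minimal sketch to confirm this, assuming the installed R defines `sample` essentially as shown (the variable names are made up for the illustration):

```r
x <- 50:90
n <- length(x)
N <- 100

# Without prob: sample() versus manual indexing via sample.int()
set.seed(1)
via_sample <- sample(x, N, replace = TRUE)
set.seed(1)
via_index <- x[sample.int(n, N, replace = TRUE)]

# With prob: the weights are handed straight to sample.int()
w <- rep(1 / n, times = n)
set.seed(1)
via_sample_p <- sample(x, N, replace = TRUE, prob = w)
set.seed(1)
via_index_p <- x[sample.int(n, N, replace = TRUE, prob = w)]

identical(via_sample, via_index)      # should be TRUE
identical(via_sample_p, via_index_p)  # should be TRUE
```

If both comparisons hold, any difference between the `prob` and no-`prob` results must originate inside the internal routine that `sample.int` calls, not in the R-level wrapper.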
