If I have data of length `n`

and want to generate a random sample of length `N`

, does the following use of the `prob`

argument make each observation equi-probable?

`random.sample = sample(mydata, N, replace=TRUE, prob=rep(1/n, times=n)) `

As an example, the following density plot shows the density of a sample generated with (red) and without (black) the above `prob`

argument:

The overall shape of the curve and the position of peaks and troughs has not changed so how does `prob`

influence the result?

Following from the above train of thought:

- Is it even possible, statistically speaking, to make such an "equi-probable" sample? If yes, how?
- More generally, how does the
`prob`

argument work and in which cases is it used in random sampling?

**Contents**hide

#### Best Answer

My answer is going to be based on Whuber's comment from above. Like Whuber said, by default, `sample`

should be sampling with equal probability. However, if you specify it yourself using the `prob`

option, the two methods do not return the same answer. However, the difference between the two is systematic. In fact, it turns out (if you set the random seed) the sample will be exactly the same minus one. That is, if you use the `prob`

option then you should need to only subtract 1 from your samples to get back what sample would have returned had you not used the `prob`

option. Here is some very short code illustrating that point.

`N = 100 n = length(50:90) set.seed(1) random.sample1 = sample(50:90, N, replace=TRUE, prob=rep(1/n, times=n)) set.seed(1) random.sample2 = sample(50:90, N, replace=TRUE) plot(density(random.sample1),col="blue") lines(density(random.sample2),col="red") summary(random.sample1) summary(random.sample2) random.sample1 random.sample2 `

which yields

`> summary(random.sample1) Min. 1st Qu. Median Mean 3rd Qu. Max. 50.00 63.00 70.00 71.27 82.00 90.00 > summary(random.sample2) Min. 1st Qu. Median Mean 3rd Qu. Max. 50.00 62.75 69.50 70.68 81.00 90.00 > > random.sample1 [1] 61 66 74 88 59 87 89 78 76 53 59 58 79 66 82 71 80 50 66 82 89 59 77 56 61 [26] 66 51 66 86 64 70 75 71 58 84 78 83 55 80 67 84 77 83 73 72 83 51 70 81 79 [51] 70 86 68 61 53 55 63 72 78 67 88 63 69 64 77 61 70 82 54 86 64 85 65 64 70 [76] 87 86 66 82 90 68 80 67 64 82 59 80 55 61 56 60 53 77 86 82 83 69 67 84 75 > random.sample2 [1] 60 65 73 87 58 86 88 77 75 52 58 57 78 65 81 70 79 90 65 81 88 58 76 55 60 [26] 65 50 65 85 63 69 74 70 57 83 77 82 54 79 66 83 76 82 72 71 82 50 69 80 78 [51] 69 85 67 60 52 54 62 71 77 66 87 62 68 63 76 60 69 81 53 85 63 84 64 63 69 [76] 86 85 65 81 89 67 79 66 63 81 58 79 54 60 55 59 52 76 85 81 82 68 66 83 74 > `

So as you can see, the two sample are the same with one being shifted by minus 1. Why this occurs I have not figured out by my guess is that by default the `sample`

command assigns the equal probabilities differently then the user would. Also, a disclaimer, the above idea works in the case when the sample is an integer, however, when I ran the same code sampling from numbers with decimals, I could not get the above results to hold. Most likely it must have to do something with the comment: "The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero."

# Update:

So after digging a bit deeper we see that the `sample`

command actually relies on the command `sample.int`

. However, withing `sample.int`

, if you do not specify the `prob`

option then the command calls `.Internal(sample2())`

which I have not figured out how to see inside of. However, if someone know how to see what the function `sample2`

is doing, then we will have our answer as to how they specify the `prob`

option when not explicitly given.

`> sample function (x, size, replace = FALSE, prob = NULL) { if (length(x) == 1L && is.numeric(x) && x >= 1) { if (missing(size)) size <- x sample.int(x, size, replace, prob) } else { if (missing(size)) size <- length(x) x[sample.int(length(x), size, replace, prob)] } } <bytecode: 0x000000000ff22210> <environment: namespace:base> > sample.int function (n, size = n, replace = FALSE, prob = NULL) { if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, size)) else .Internal(sample(n, size, replace, prob)) } <bytecode: 0x000000000ffa3478> <environment: namespace:base> `