Solved – Probability of two values being equal in a sample drawn from a continuous distribution

I am reading about the Kolmogorov–Smirnov test from the book Probability and Statistics by DeGroot and Schervish. In the initial few lines on this topic, the authors state the following:

Suppose that the random variables X1,…,Xn form a random sample from some continuous distribution, and let x1,…,xn denote the observed values of X1,…,Xn. Since the observations come from a continuous distribution, there is probability 0 that any two of the observed values x1,…,xn will be equal. Therefore, we shall assume for simplicity that all n values are different.

My question is – For a sample from a continuous distribution, will the probability of two sample values being equal be exactly zero or approximately zero? If it is the former, can you please give me a hint on how to prove it mathematically?

Intuitively, the probability being approximately zero makes sense: however rare it might be, it is possible to have two equal values generated from a distribution. I tried to check this computationally by running a simple R script (written below), and after running it 100 times, I got the probability to be equal to zero in all instances. Maybe running it a few million times would produce better results, but that would be cruel to my Dell Core i3, 2 GB RAM laptop.

probOfCommonObs <- rep(0, 100)
noOfCommonObs <- rep(0, 100)
for (i in 1:100) {
  gaussianSample <- rnorm(1000, sample(1:50, 1), sample(1:50, 1))
  for (j in 1:999) {
    for (k in (j + 1):1000) {
      if (gaussianSample[j] == gaussianSample[k]) {
        noOfCommonObs[i] <- noOfCommonObs[i] + 1
      }
    }
  }
  probOfCommonObs[i] <- noOfCommonObs[i] / 1000
}
noOfCommonObs
probOfCommonObs
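As an aside, the pairwise double loop above can be replaced by R's built-in duplicated(), which makes repeating the experiment many times cheap. A sketch of the same check (not part of the original script; the seed is arbitrary):

```r
# Count repeated values in each of 100 samples of 1000 normal draws.
# duplicated(x) flags every element that equals an earlier element,
# so sum(duplicated(x)) counts the ties in a sample.
set.seed(1)
ties <- replicate(100, {
  gaussianSample <- rnorm(1000, sample(1:50, 1), sample(1:50, 1))
  sum(duplicated(gaussianSample))
})
sum(ties)  # with double-precision draws this is essentially always 0
```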

I guess a theoretical explanation would help clarify my doubt and any help would be really appreciated.

I have kept the posting instructions in mind while writing this post but would like to apologise if I have made any mistakes. Thanks!

The answer is exactly 0 in theory and approximately 0 in practice.

Let $X_i$ and $X_j$ be two independent observations from a continuous distribution. Then $Y=X_i-X_j$ is also a continuous random variable.

If $P(Y=0)=0$ then the probability of two observations of $X$ being equal is $0$, since $$P(X_i=X_j)=P(X_i-X_j=0)=P(Y=0)=0.$$ If $P(Y=0)>0$ then the probability of doublets is greater than $0$.

To see that $P(Y=x)>0$ is an impossibility for any $x$, note that $Y$ being continuous means that $F(x)=P(Y\leq x)$ is continuous in $x$. Thus, since $P(a<Y\leq b)=F(b)-F(a)$,

$$P(Y=x)=\lim_{\epsilon\rightarrow 0} P(x-\epsilon<Y\leq x+\epsilon)=\lim_{\epsilon\rightarrow 0}\Big( F(x+\epsilon)-F(x-\epsilon)\Big)=0.$$

Thus $P(X_i=X_j)=P(Y=0)=0.$

This works in the same way as does length. The length of a single point is $0$, but the length of an interval containing an uncountably infinite number of points is more than $0$. Similarly, the probability of $Y=x$ is $0$, but the probability that $Y\in (x-\epsilon,x+\epsilon)$ is greater than $0$.

Real data, on the other hand, is never continuous. Even measurements with great precision have a finite number of decimals attached to them. This means that there actually is a small probability of getting doublets.

Let $X_{obs}$ be the observed value of $X$, rounded to four decimal places. Then, as an example, $$P(X_{obs}=2.5934)=P(|X-2.5934|<0.00005)>0.$$ The probability of getting the same observation again is therefore the probability that $X$ falls into a small interval surrounding it, as this will yield the same $X_{obs}$ again.
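This rounding effect is easy to see in simulation. A rough sketch in R (the seed is arbitrary): rounding 1,000 standard normal draws to four decimals sorts them into a modest number of bins, and the birthday problem then makes at least one doublet overwhelmingly likely, even though the raw double-precision values are all distinct.

```r
set.seed(42)
x <- rnorm(1000)
sum(duplicated(x))            # almost certainly 0: raw doubles are all distinct
sum(duplicated(round(x, 4)))  # > 0 with overwhelming probability: rounding creates doublets
```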

Even though no real data are truly continuous, continuous distributions are very useful as approximations, since working with integrals is often much easier than working with complicated sums (which is what we would get if we always tried to use highly granular discrete distributions).

Edit: thanks to Procrastinator, Didier and Stéphane for helping to improve this answer. 🙂
