I have a random variable that I know is uniformly distributed, and I expect its range to be $[0,1]$. I then generate (simulate) 100 realizations of the variable.
The question: What is the best way to find, from the realizations, the probability that the underlying "actual" range of the variable is in fact $[0,1]$?
It's clear that if the 100 realizations all lie within $[0.51, 0.69]$, the probability should be very low, while if they lie within $[0.01, 0.99]$, it should be high.
I tried simply computing the ratio of the observed range to the expected range,
$$
\frac{0.69-0.51}{1-0} = 0.18,
$$
and took this as the probability $p$ of each realization; the probability of it happening 100 times is then $p^{100}$.
The problem is that, for a correct case in which the original range was in fact $[0,1]$, the realized range turned out to be $[0.02,0.93]$, and the computed probability is very low, $0.91^{100} \approx 8\times 10^{-5}$, which does not make sense.
So, what is it that I am getting completely wrong?
EDIT: clarifications, thanks to @whuber:
- I am attempting to check that the range of the underlying variable is in fact $[0,1]$ (for validation purposes).
- I just want to check the range and know how likely it is that it is in fact $[0,1]$.
- The realizations are independent: rand(1,100) in Matlab.
Best Answer
Although it is meaningless to find a probability (unless you first specify a prior distribution of the endpoints), you can find the relative likelihood. A good basis for comparison would be the alternative hypothesis that the numbers are drawn from a uniform distribution between a lower bound $L$ and upper bound $U$.
Sufficient statistics are the minimum $X$ and maximum $Y$ of all the data (assuming each number is obtained independently). It doesn't matter whether you draw the data in batches or not. When drawn from the interval $[0,1]$, the joint distribution of $(X, Y)$ is continuous and has density
$$\begin{aligned} f(x,y) &= \binom{n}{1,\,n-2,\,1}(y-x)^{n-2}\,\mathcal{I}(0\le x\le y\le 1) \\ &= n(n-1)(y-x)^{n-2}\,\mathcal{I}(0\le x\le y\le 1). \end{aligned}$$
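As a quick sanity check of this density (my addition, not part of the original answer): it implies the expected range is $E[Y-X] = (n-1)/(n+1)$, which a short simulation in R reproduces:

n <- 100                                      # sample size, chosen for illustration
sim <- replicate(1e5, diff(range(runif(n))))  # simulated ranges Y - X under the [0,1] hypothesis
mean(sim)                                     # should be close to (n-1)/(n+1) = 99/101 ~ 0.9802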
When scaled by $U-L$ and shifted by $L$, this density becomes
$$f_{(L,U)}(x,y) = (U-L)^{-n}\, n(n-1)(y-x)^{n-2}\,\mathcal{I}(L\le x\le y\le U).$$
Obviously this is greatest when $L = x$ and $U=y$.
The relative likelihood is their ratio, best expressed as a logarithm:
$$\Lambda(X,Y) = \log\left(\frac{f_{(X,Y)}(X,Y)}{f(X,Y)}\right) = -n\log(Y-X).$$
A small value of this is evidence for the hypothesis $(L,U)=(0,1)$; larger values are evidence against it. Of course if $X < 0$ or $Y > 1$ the hypothesis is controverted. But when the hypothesis is true, for large $n$ (greater than $20$ or so), $2\Lambda(X,Y)$ will have approximately a $\chi^2(4)$ distribution. Assuming $X \ge 0$ and $Y \le 1$, this enables you to reject the hypothesis when the chance of a $\chi^2(4)$ variable exceeding $2\Lambda(X,Y)$ becomes so small you can no longer suppose the large value can be attributed to chance alone.
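A minimal R sketch of this test (the function name uniform01.test and its return structure are my own, not from the answer):

# Test whether data look like n iid uniform(0,1) draws using
# the statistic 2*Lambda = -2 n log(Y - X), approximately chi^2(4) for large n.
uniform01.test <- function(x) {
  n <- length(x)
  if (min(x) < 0 || max(x) > 1)              # hypothesis contradicted outright
    return(list(statistic = Inf, p.value = 0))
  stat <- -2 * n * log(diff(range(x)))       # 2*Lambda
  list(statistic = stat,
       p.value = pchisq(stat, df = 4, lower.tail = FALSE))
}
uniform01.test(runif(100))                   # under the hypothesis: typically a moderate p-value

Keep in mind that the $\chi^2(4)$ approximation is only claimed for $n$ greater than about $20$, so p-values from small samples should be treated with caution.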
I will not attempt to prove that the $\chi^2(4)$ distribution is the one to use; I will merely show that it works by simulating a large number of independent values of $2\Lambda(X,Y)$ when the hypothesis is true. Since you have the ability to generate large values of $n$, let's take $n=500$ as an example.
$100{,}000$ simulated values of $2\Lambda(X,Y)$ for $n=500$ are shown as a histogram (see the R code at the end); the red curve, graphing the density of a $\chi^2(4)$ variable, closely agrees with it.
As a worked example, consider the situation posed in the question where $n=100$, $X= 0.51$, and $Y=0.69$. Now
$$2\Lambda(0.51, 0.69) = -2\left(100\log(0.69 - 0.51)\right) \approx 343.$$
The corresponding $\chi^2(4)$ probability is less than $10^{-72}$: although we would never trust the accuracy of the $\chi^2$ approximation this far out into the tail (even with $n=100$ observations), this value is so small that certainly these data were not obtained from $100$ independent uniform$(0,1)$ variables!
In the second situation, where $X=0.01$ and $Y=0.99$,
$$2\Lambda(0.01, 0.99) = -2\left(100\log(0.99 - 0.01)\right) \approx 4.04.$$
Now the $\chi^2(4)$ probability is $0.40 = 40\%$, quite consistent with the hypothesis that $(L,U)=(0,1)$.
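Both tail probabilities can be reproduced in R with pchisq (a quick check of the arithmetic above):

n <- 100
pchisq(-2 * n * log(0.69 - 0.51), df = 4, lower.tail = FALSE)  # < 1e-72: reject
pchisq(-2 * n * log(0.99 - 0.01), df = 4, lower.tail = FALSE)  # about 0.40: consistent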
BTW, here's R code to perform the simulations. I have reset it to just $10{,}000$ iterations so that it will take less than one second to complete.
n <- 500 # Sample size
N <- 1e4 # Number of simulation trials
lambda <- apply(matrix(runif(n*N), nrow=n), 2,
                function(x) -2 * n * log(diff(range(x))))
#
# Plot the results.
#
hist(lambda, freq=FALSE, breaks=seq(0, ceiling(max(lambda)), 1/4),
     border="#00000040", main="Histogram", xlab="2*Lambda")
curve(dchisq(x, 4), add=TRUE, col="Red", lwd=2)
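As one further sanity check (again my addition, not part of the original answer), the simulated rejection rate at the $5\%$ level should be near $0.05$ when the hypothesis is true:

mean(lambda > qchisq(0.95, df = 4))  # rejection rate at the 5% level; should be near 0.05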