Solved – How to test if some data points are drawn from a distribution with a linear PDF

I have some data in the range [0, 1], and from the histogram below, it seems that they might be drawn from a distribution with a linear probability density function (what is that kind of distribution called?). How do I estimate the parameters of that distribution, and how do I test in R how plausible it is that the data were drawn from it?

Histogram of the data

Now I know (thanks to Glen_b) that if I can define the PDF (e.g., $P(x) = ax + b$), I can test it with the Kolmogorov-Smirnov test, but how do I estimate $a$ and $b$ from the sample?

At least as a rough approximation, one might regard that as having a pdf that increases linearly from 0:

(Figure: the histogram with a pdf that increases linearly from 0 drawn over it.)

(However, you should check whether there's actually a spike at exactly 1, and perhaps a small one at exactly 0. Are the values near 1 all less than 1, or are a number of them exactly 1? Similarly with 0.)
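One quick way to run that check is to count the exact boundary values directly. The sketch below uses a made-up sample (the vector `x` and its composition are assumptions for illustration, not the asker's data):

```r
# Hypothetical sample: continuous values plus a few exact 0s and 1s.
set.seed(1)
x <- c(rbeta(200, 2, 1), rep(1, 15), rep(0, 3))

n_zero <- sum(x == 0)   # number of values at exactly 0
n_one  <- sum(x == 1)   # number of values at exactly 1
c(zeros = n_zero, ones = n_one, interior = sum(x > 0 & x < 1))

# Keep only the interior values when fitting a continuous density.
x_interior <- x[x > 0 & x < 1]
```

If the counts at 0 or 1 are clearly nonzero, a purely continuous model on [0, 1] cannot describe the whole sample, and the spikes need to be handled separately.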

Such a linearly increasing pdf might be regarded as a special case of the triangular distribution or as a particular beta distribution (a beta(2,1)).
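As a small sanity check of the beta(2,1) claim, the density should equal $2x$ and the CDF $x^2$ on [0, 1]. The grid `x_grid` is just an illustrative choice:

```r
x_grid <- seq(0, 1, by = 0.1)

# Beta(2, 1) density and CDF compared with the closed forms 2x and x^2.
dens_ok <- all.equal(dbeta(x_grid, 2, 1), 2 * x_grid)
cdf_ok  <- all.equal(pbeta(x_grid, 2, 1), x_grid^2)
c(density = isTRUE(dens_ok), cdf = isTRUE(cdf_ok))
```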

You can do a hypothesis test for a fully-specified distribution using a Kolmogorov-Smirnov test (for example; there are other choices). In R, that's ks.test.

```r
f <- function(x) pbeta(x, 2, 1)
ks.test(x, f)
```

```
	One-sample Kolmogorov-Smirnov test

data:  x
D = 0.1309, p-value = 0.3291
alternative hypothesis: two-sided
```

However, two caveats:

1) If you chose the shape to test for based on the same data you run the test on, the p-values are pretty much meaningless.

2) In any case, even if that wasn't at issue, a formal hypothesis test is usually not what you want if your question is "is it reasonable to use this as a model?" — it mostly answers the wrong question.

Alternatively, perhaps you might have regarded the pdf as starting not from 0 but from something higher than 0:

(Figure: the histogram with a linear pdf that starts above 0 drawn over it.)

That's not a triangle, but a trapezoid (often called a trapezium if you're not in the US).

Both the triangle I initially drew and this are linear of course.

> if I can define the PDF (e.g., $P(x) = ax + b$), I can test it with Kolmogorov-Smirnov test, but how do I estimate $a$ and $b$ from the sample?

Some comments:

1) If you intended a triangular pdf (with its peak at the right), you have $a=2$ and $b=0$.

2) If you instead assume that it's a trapezoid (trapezium) as drawn above, there is only one free parameter, not two. Think instead of the height of the density at 0.5 and the slope: the height at 0.5 is restricted to be exactly 1, leaving you only with the slope.

3) The K-S test is for fully specified distributions. The triangle I had thought you meant was fully specified; if we're now dealing with the trapezoid above, you're estimating a parameter. If you use the same test statistic as the Kolmogorov-Smirnov, you are now doing what's called a Lilliefors test, and you'll need to simulate to obtain the distribution of the test statistic.

4) You said you had spikes at exactly 0 and 1; this trapezoid-shaped model we're now discussing doesn't. If you have spikes in your data, a model without them won't fit. But perhaps you mean to have spikes as well as some linear density in between. The density in between doesn't look to me to be particularly linear though (which is why I suggested a beta model), but it might do well enough.

Let's imagine that we condition only on the data between 0 and 1 and we're fitting that trapezoidal-shaped (linear pdf) model, $f(x) = 1+\beta(x-\tfrac{1}{2}),\quad 0\leq x\leq 1,\quad -2\leq\beta\leq 2$.

(The term "trapezoidal distribution" usually means a shape with the parallel sides parallel to the x-axis, not the y-axis as here. So perhaps I should stick with your 'linear' characterization after all.)
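To make the model concrete, here is a minimal sketch of the density, its CDF, and an inverse-CDF sampler. The function names `dlinear`, `plinear` and `rlinear` are hypothetical helpers written for this illustration (they are not from any package); the CDF is obtained by integrating the density, $F(x) = x + \tfrac{\beta}{2}(x^2 - x)$, and the sampler solves $F(x) = u$ for $x$.

```r
# Linear density on [0, 1]: f(x) = 1 + beta * (x - 1/2), with -2 <= beta <= 2.
dlinear <- function(x, beta) ifelse(x >= 0 & x <= 1, 1 + beta * (x - 0.5), 0)

# CDF from integrating the density: F(x) = x + (beta / 2) * (x^2 - x).
plinear <- function(q, beta) {
  q <- pmin(pmax(q, 0), 1)               # clamp to the support
  q + (beta / 2) * (q^2 - q)
}

# Inverse-CDF sampler: solve (beta/2) x^2 + (1 - beta/2) x - u = 0 for x.
rlinear <- function(n, beta) {
  u <- runif(n)
  if (abs(beta) < 1e-12) return(u)       # beta = 0 is the uniform case
  b <- 1 - beta / 2
  (-b + sqrt(pmax(b^2 + 2 * beta * u, 0))) / beta
}

total_mass <- integrate(dlinear, 0, 1, beta = 4 / 3)$value
total_mass                               # should be 1 up to numerical error

set.seed(42)
sample_mean <- mean(rlinear(1e5, 4 / 3))
sample_mean                              # compare with 1/2 + beta/12
```

With $\beta = 2$ the sampler reduces to $\sqrt{u}$, which is the beta(2,1) case discussed earlier.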

This parameter may be estimated in a number of ways (the most common approaches would be method of moments and maximum likelihood estimation).

Let's look at MLE.

$$\mathcal{L}(\beta) = \prod_{i=1}^n \left[1+\beta\left(x_i-\tfrac{1}{2}\right)\right]$$

$$\log \mathcal{L}(\beta)=\ell(\beta) = \sum_{i=1}^n \log\left(1+\beta\left(x_i-\tfrac{1}{2}\right)\right)$$

Now this is a nice smooth function and it's quite possible to take the derivative, but we can't solve for $\hat{\beta}$ in closed form (or at least I don't think so) by setting that derivative to zero. However, we can take our sample and evaluate the log-likelihood at any value of $\beta$, so we can use optimization routines to locate the maximum.

(Figure: the log-likelihood as a function of $\beta$ for this sample.)

The maximum for this sample occurs at about 1.2875. (The actual population value I generated the data from was $\frac{4}{3}\approx 1.3333$.)
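A sketch of how such a fit might be done with R's one-dimensional optimizer `optimize`. The data here are simulated afresh, so the estimate will not reproduce the 1.2875 above; the sample size, the seed, and the inverse-CDF sampling step are all assumptions made for illustration.

```r
# Simulate from the linear density by inverse CDF (valid for beta != 0).
set.seed(123)
beta_true <- 4 / 3
u <- runif(1000)
b <- 1 - beta_true / 2
x <- (-b + sqrt(b^2 + 2 * beta_true * u)) / beta_true

# Log-likelihood of the linear model for a given slope beta.
loglik <- function(beta, x) sum(log(1 + beta * (x - 0.5)))

# One-dimensional maximisation over the admissible range of beta.
fit <- optimize(loglik, interval = c(-2, 2), x = x, maximum = TRUE)
beta_mle <- fit$maximum
beta_mle
```

Restricting the search to $[-2, 2]$ keeps every term inside the logarithm positive for data strictly inside (0, 1), so the log-likelihood is finite throughout the search.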

Method of moments:

The mean of a random variable with the linear pdf is $\int_0^1 x \left[ 1+\beta\left(x-\frac{1}{2}\right)\right] dx = \left[\frac{x^2}{2} + \beta \left(\frac{x^3}{3} -\frac{x^2}{4}\right)\right]_0^1 = \frac{1}{2} + \frac{\beta}{12}$

Equating sample and population mean, $\hat{\beta}=12\left(\bar{x}-\frac{1}{2}\right)$

This has the very nice advantage of great simplicity, but of course if $\bar{x}$ is not between $\frac{1}{3}$ and $\frac{2}{3}$, this yields an impossible estimate for $\beta$ (in that the resulting density isn't actually a density).
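In code the moment estimator is a one-liner; the sketch below wraps it in a hypothetical helper, `beta_mom`, that also warns when the estimate falls outside $[-2, 2]$. The input vector is a toy example, not the data used in this answer.

```r
# Hypothetical helper: moment estimate of beta, flagged if outside [-2, 2].
beta_mom <- function(x) {
  est <- 12 * (mean(x) - 0.5)
  if (abs(est) > 2) warning("estimate outside [-2, 2]; not a valid density")
  est
}

beta_mom(c(0.2, 0.5, 0.7, 0.8, 0.9))   # mean 0.62, so 12 * 0.12 = 1.44
```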

In the data I used in the MLE example, the method of moments estimate was 1.324.
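Returning to comment 3, the simulation needed for a Lilliefors-type test could be sketched as follows. Everything here is an assumption made for illustration: the helper names (`rlin`, `plin`, `est_beta`, `ks_stat`), the use of the moment estimator clipped to $[-2, 2]$, the stand-in data, and the number of replicates. The KS distance is computed directly rather than through `ks.test`, since only the statistic is needed and not its p-value.

```r
set.seed(2024)

# Inverse-CDF sampler for f(x) = 1 + beta * (x - 1/2) on [0, 1].
rlin <- function(n, beta) {
  u <- runif(n)
  if (abs(beta) < 1e-12) return(u)
  b <- 1 - beta / 2
  (-b + sqrt(pmax(b^2 + 2 * beta * u, 0))) / beta
}

# Matching CDF: F(x) = x + (beta / 2) * (x^2 - x).
plin <- function(q, beta) q + (beta / 2) * (q^2 - q)

# Moment estimate of the slope, clipped to the admissible range.
est_beta <- function(x) max(-2, min(2, 12 * (mean(x) - 0.5)))

# KS distance between the sample and the model fitted to that same sample.
ks_stat <- function(x) {
  n  <- length(x)
  Fx <- plin(sort(x), est_beta(x))
  max(c(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n))
}

x <- rlin(100, 4 / 3)              # stand-in for the observed interior data
d_obs <- ks_stat(x)

# Null distribution: simulate from the fitted model, re-estimating each time.
b_fit <- est_beta(x)
d_sim <- replicate(1000, ks_stat(rlin(length(x), b_fit)))
p_value <- mean(d_sim >= d_obs)
p_value
```

The key point is that the slope is re-estimated inside every simulated replicate, so the simulated statistics reflect the same "fit, then measure" procedure applied to the real data. The earlier caveat still applies: if the linear shape was chosen by looking at the same data, the resulting p-value remains hard to interpret.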
