Solved – Unimodality Test for Discrete Distribution

I have several collections of discrete datasets of integer values, say, from 1 to 10, inclusive. I am interested in characterizing the various distributions in these datasets, and it is important for my purposes whether each distribution is shaped unimodally or multimodally. I am not interested in the explicit discrete mode in the sense that, for example, 8 is the most common value. Rather, I am interested in whether each distribution is shaped unimodally or multimodally. For example, what I would call a unimodal shape:

Unimodal shape

And a bimodal shape:

Bimodal shape

Obviously, the above two distributions exhibit very different shapes. And, as you can see, some of the datasets are very large, containing tens or hundreds of thousands of values.

The discrete nature of the datasets is somewhat problematic because the most common test for unimodality, the Hartigan-Hartigan Dip test, assumes a continuous distribution. I am not a statistician, but it appears as if this is a rather rigid assumption. I tested an R implementation of the Dip test on what appeared to be some (artificial) perfectly unimodal discrete data and detected an incredibly small p-value, suggesting that the data was actually multimodal.

The two other commonly mentioned tests for unimodality appear to be the Silverman test and the excess mass test. I know little about the latter, but this explanation of the former seemed to hint that the Silverman test applies to discrete data, although it was not said explicitly.

So, my questions:

1) Am I thinking about this in the right way? Is "unimodality" the correct term here?

2) What is/are the best statistical test(s) to use for my data? Is the Silverman test an appropriate choice?

3) As I am not a statistician, where, if possible, might I find an already working implementation of the above statistical test (ideally Python or R)?

The implication of the question is that these datasets tabulate counts of values drawn independently from a discrete distribution defined on an ordered set of values such as $1,2,ldots, 10.$ When that is the case, these counts have a multinomial distribution.

If by "mode" we mean a strict local maximum height in the graph (padding the left and right of the graph with zeros), or something like that, and if the counts are all relatively large (more than 5 or so ought to do), then an attractive method to assess the number of modes in the underlying distribution is with bootstrapping. The problem this solves is that the number of modes in the distribution might differ from the number of modes in the data. By reconstructing the experiment from the distribution defined by the data, we can see to what extent the number of modes might vary. This is "bootstrapping."

Carrying out the bootstrapping is easy: write a function to compute the number of modes in a graph and another one to repeatedly sample from the graph's data and apply that function to the sample. Tabulate its results. ExampleR code is below. When given a dataset like the second one in the question, it plots this chart of the bootstrapped mode frequencies:

Figure

In 676 of 1000 bootstrap samples there were two modes; in 293 there were three; and in 31 there were four. This indicates the data are consistent with an underlying distribution with two or perhaps three modes. There is some possibility of four. The likelihood of more than four is tiny.

These results intuitively make sense, because in the dataset the frequencies of the values $8,9,10$ were close and relatively small. It is possible the true frequency of $9$ is less than those of either $8$ or $10,$ causing there to be modes at $1,8,$ and $10.$ The bootstrapping gives us a sense of how much variation in modes is likely based on the random variation implied by the assumed sampling scheme.

The results for the first set of data are always two modes. That is because the variation among counts in the thousands or tens of thousands is so small that it is extremely unlikely these data came from a distribution with any other modes besides the obvious ones at $1$ and $8.$

# # Compute strict modes. # Input consists of the counts in the data, in order, including any zeros. # n.modes <- function(x) {   n <- length(x)+1   i <- c(0, x) < c(x, 0)   sum(i[-n] & !i[-1]) } # # Bootstrap the mode count in a dataset. # n.modes.boot <- function(x, n.boot=1e3)      tabulate(apply(rmultinom(n.boot, sum(x), x), 2, n.modes), ceiling(length(x)/2+1)) # # Plot the bootstrap results. # library(ggplot2) n.modes.plot <- function(f) {   X <- data.frame(Frequency=f / sum(f))   X$Count <- factor(1:nrow(X))   X <- subset(X, Frequency > 0)   ggplot(X, aes(Count, Frequency, fill=Count)) + geom_col(show.legend=FALSE) } # # Show some examples. # x <- c(70, 30,20,40,60,70,110,170,180,165) f <- n.modes.boot(x) print(n.modes.plot(f)) 

Similar Posts:

Rate this post

Leave a Comment