I have distributions that show n-modal behavior. I need to find the values of the largest and smallest modes. For example, in the histogram below I need to find the values representing the yellow lines (first one is around 20 and the last one is around 190). The red ones are not important to me for now. One of the problems is that I cannot guarantee that any given mode has a normal distribution. In fact, I cannot guarantee any distribution at all. Also I cannot know beforehand how many modes I can find in graph.
Is there any analysis that I can do to find those values?
Figure 1: SAMPLE DISTRIBUTION HISTOGRAM
Best Answer
A very long time ago I learned an effective technique in the geological literature. (I apologize for not remembering the source.) It consists of studying the modes of a kernel density estimator (KDE) as the bandwidth is varied.
What happens is that with a very large bandwidth, the data look like a big lump with a single mode. This one uses a bandwidth of 60 and its mode is near 110:
As the bandwidth shrinks, the KDE outlines what the eye sees more closely and more modes appear. This one uses a bandwidth of 10 and has three obvious modes with a fourth just beginning to show near 60:
When the bandwidth shrinks too far, the KDE is too detailed. This one with a bandwidth of 1 has 36 modes:
You can explore this behavior with a "mode trace." For each bandwidth within the full range (from no detail to too detailed) it plots the modes. I have tracked the evolution of each mode and colored them accordingly. For instance, the single mode in the first figure corresponds to the central red line (shaped almost like a question mark); the four modes in the second figure correspond to the four traces rising to a height (bandwidth) of 10; the 36 modes in the third figure correspond to all 36 traces:
It's probably a good idea to use a logarithmic scale for the bandwidth, as shown here.
A glance at the mode trace will indicate how many modes to identify. I have chosen four. To determine their locations, I have found the points where the traces are the most vertical among all bandwidths smaller than the one at which all four modes first appear: at these locations the locations are stable even when the bandwidth changes. It is comforting (but not really essential) that all four locations are found using comparable bandwidths. (One really should take a bit more care in case multiple stable points appear along a trace: I would opt for the one with the largest bandwidth less than the bandwidth at which all the modes appear.)
Having located the modes, we may plot them on the original histogram:
It's then a simple matter to select the extreme modes. The mode trace will tell you how sensitive their locations are to both the number of modes you identify and to the bandwidth you use. In this example it suggests a tendency for the highest mode to grow even greater with smaller bandwidths before it splits into multiple modes, but the other three modes remain relatively stable (their traces remain nearly vertical at low bandwidths).
It doesn't much matter what shape kernel you pick. The original paper suggested using a Gaussian kernel, which I have done here. The use of a Gaussian is not tantamount to any assumption that the peaks will even approximately have Gaussian shapes. Because Gaussians are (infinitely) smooth, so is the KDE, which means you can analyze it with Calculus techniques to your heart's content.
To be perfectly clear, here is a mathematical account of the mode trace. Let the Kernel function $K$ have unit area and unique mode at $0$ and let the data be $x_1, ldots, x_n.$ The KDE of the data with bandwidth $hge 0$ is the convolution
$$f(x,h) = frac{1}{nh}sum_{i=1}^n Kleft(frac{x-x_i}{h}right).$$
For each $hge 0,$ let $M(h)$ be the set of modes of the distribution function $xto f(x,h).$ The "mode trace" of the data is the union of $M(h)$ as $h$ ranges over an interval $(0, A)$ where $A$ has been chosen so large that $M(h)$ contains a unique element for all $hge A.$
The mode trace has additional structure: it can be decomposed (not necessarily uniquely) into the disjoint union of graphs of continuous partial functions of $h$ defined on intervals. This decomposition is maximal in the sense that the only points any two distinct such functions can possibly have in common are at the endpoints of their domains. I have used colors to designate these partial functions.
Apart from selecting the number of modes to use–which depends very much on your concept of the correct resolution at which to analyze your data–everything can be automated. Here is the R
code I used to generate sample data, analyze them, and make the figures. Its results will be contained in a dataframe X
recording the mode trace and an array modes
containing information about the selected modes.
BTW, if you code your own, note that the KDE is obtained most efficiently using the Fast Fourier Transform (FFT). The most efficient method transforms the data once and then multiplies that by a sequence of transformed kernels, inverting each product to produce the KDE. To determine the range of bandwidths to search, make the largest approximately one-quarter the range of the data and the smallest perhaps 3% or 1% of that.
# # Generate random values from a mixture distribution. # rmix <- function(n, mu, sigma, p) { matrix(rnorm(length(mu)*n, mu, sigma), ncol=n)[ cbind(sample.int(length(mu), n, replace=TRUE, prob=p), 1:n)] } mu <- c(25, 60, 130, 190) # Means sigma <- c(8, 13, 15, 19) # SDs p <- c(.18, .2, .24, .28) # Relative proportions (needn't sum to 1) n <- 1e4 # Sample size x <- rmix(n, mu, sigma, p) # # Find the modes of a KDE. # (Quick and dirty: it assumes no mode spans more than one x value.) # findmodes <- function(kde) { kde$x[which(c(kde$y[-1],NA) < kde$y & kde$y > c(NA,kde$y[-length(kde$y)]))] } # # Compute the mode trace by varying the bandwidth within a factor of 10 of # the default bandwidth. Track the modes as the bandwidth is decreased from # its largest to its smallest value. # This calculation is fast, so we can afford a detailed search. # m <- mean(x) id <- 1 bw <- density(x)$bw * 10^seq(1,-1, length.out=101) modes.lst <- lapply(bw, function(h) { m.new <- sort(findmodes(density(x, bw=h))) # -- Associate each previous mode with a nearest new mode. if (length(m.new)==1) delta <- Inf else delta <- min(diff(m.new))/2 d <- outer(m.new, m, function(x,y) abs(x-y)) i <- apply(d, 2, which.min) g <- rep(NA_integer_, length(m.new)) g[i] <- id[1:ncol(d)] #-- Create new ids for new modes that appear. k <- is.na(g) g[k] <- (sum(!k)+1):length(g) id <<- g m <<- m.new data.frame(bw=h, Mode=m.new, id=g) }) X <- do.call(rbind, args=modes.lst) X$id <- factor(X$id) # # Locate the modes at the most vertical portions of their traces. # minslope <- function(x, y) { f <- splinefun(x, y) e <- diff(range(x)) * 1e-4 df2 <- function(x) ((f(x+e)-f(x-e)) / (2*e))^2 # Numerical derivative, squared v <- optimize(df2, c(min(x),max(x))) c(bw=v$minimum, slope=v$objective, Mode=f(v$minimum)) } # # Retain the desired modes. # n.modes <- 4 # USER SELECTED: Not automatic bw.max <- max(subset(X, id==n.modes)$bw) modes <- sapply(1:n.modes, function(i) { Y <- subset(X, id==i & bw <= bw.max) minslope(Y$bw, Y$Mode) }) # # Plot the results. # library(ggplot2) ggplot(X, aes(bw, Mode)) + geom_line(aes(col=id), size=1.2, show.legend=FALSE) + geom_point(aes(bw, Mode), data=as.data.frame(t(modes)), size=3, col="Black", alpha=1/2) + scale_x_log10() + coord_flip() + ggtitle("Mode Trace") ggplot(data.frame(x), aes(x, ..density..)) + geom_histogram(bins=500, fill="#2E75B2") + geom_vline(data=as.data.frame(t(modes)), mapping=aes(xintercept=Mode), col="#D18A4e", size=1) + ggtitle("Histogram With Modes")
Similar Posts:
- Solved – Finding minimum/maximum peaks in a n-modal distribution
- Solved – Finding minimum/maximum peaks in a n-modal distribution
- Solved – Given a 10D MCMC chain, how can I determine its posterior mode(s) in R
- Solved – Compare means of two groups with a variable that has multiple sub-group
- Solved – How to compare mean distance traveled by two groups