I have a question regarding fitting mixture of normal distribution.
I have the following data:
I identified three peaks and using mixtools
package, I fitted mixture of normal distributions to my data as follow:
mixture <- mixtools::normalmixEM(data, k = 3)
Consequently, I got 3 $mu$ , 3 $sigma$, and 3 $lambda$ values.
My question is now how should I calculate the area under the peaks corresponding to the estimated normal distribution.
i.e., I need the area under the new curves which are in blue, red and green.
But my question is that, how from the calculate $mu$ and $sigma$, I reach to these three curves that I can consequently calculate the area under them.
Thanks.
Best Answer
Recall that mixture distribution $f$ is a weighted sum of it's component distributions $f_1,dots,f_k$:
$$ f(x) = lambda_1 f_1(x;mu_1, sigma_1) + lambda_2 f_2(x;mu_2, sigma_2) + dots + lambda_k f_k(x;mu_k, sigma_k) $$
where $sum_{i=1}^k lambda_i = 1$. Since $f$ integrates to unity and each of $f_i$ integrates to unity, you need to take $lambda_i$ fraction of each of them. So height of each of $f_i$ densities gets multiplied by $lambda_i$.
Below you can find simulated example that illustrates it.
set.seed(123) N <- 1e5 mu <- c(2,4,6) sigma <- rep(0.5, 3) lambda <- (1:3)/6 dat <- c( rnorm(N*lambda[1], mu[1], sigma[1]), rnorm(N*lambda[2], mu[2], sigma[2]), rnorm(N*lambda[3], mu[3], sigma[3]) ) xx <- seq(0, 8, by = 0.01) hist(dat, 100, freq = FALSE, col = "lightgray", border = 1, xlab = "", main = "") lines(xx, lambda[1]*dnorm(xx, mu[1], sigma[1]), col = "red", lwd = 2) lines(xx, lambda[2]*dnorm(xx, mu[2], sigma[2]), col = "blue", lwd = 2) lines(xx, lambda[3]*dnorm(xx, mu[3], sigma[3]), col = "orange", lwd = 2)
normalmixEM
function correctly identifies the parameters:
summary(mixtools::normalmixEM(dat, k = 3)) ## summary of normalmixEM object: ## comp 1 comp 2 comp 3 ## lambda 0.166202 0.334826 0.498973 ## mu 1.998120 3.999630 6.004135 ## sigma 0.498775 0.504762 0.497422 ## loglik at estimate: -165967.4
As about your second question in the comments, if your data is only fy
points (density estimates) and corresponding y
values, then the easiest way to estimate underlying model would be to create simulated dataset by sampling y
values with probabilities proportional to fy
and then use it for estimating your model as usual. Below you can see plot showing the kind of data that you describes and code example illustrating how to obtain the results.
yy <- sample(y, 5000, prob = fy, replace = TRUE) summary(mixtools::normalmixEM(yy, k = 3)) ## number of iterations= 106 ## summary of normalmixEM object: ## comp 1 comp 2 comp 3 ## lambda 0.168503 0.324642 0.506855 ## mu 1.973119 3.980462 5.976660 ## sigma 0.471415 0.496197 0.527898 ## loglik at estimate: -8360.809
As you can see, with sample of just 25 pairs of y
,fy
values we were able to find the parameter values (notice that they are presented in different order then in the first example).