Solved – How to test/prove data is zero inflated

I've got a problem that I think should be simple but can't quite figure it out. I'm looking at seed pollination, I have plants (n=36) that flower in clusters, I sample 3 flower clusters from each plant, and 6 seed pods from each cluster (18 seed pods in total from each plant). A pod can have between 0 to at most 4 seeds pollinated. So, the data is count, with an upper bound. I'm finding an average of ~10% of seeds are pollinated, but anywhere between 1 – 30% on a given plant, so over dispersed data, and of course, there are 4 missing cluster replicates on 3 plants, so not perfectly symmetrical.

The question I'm asking is if this data supports the idea this plant requires pollinators for seed set.

I'm finding that the distribution for the number of seeds in a pod looks like there are more 0 pollinated seed pods (6-9 pods out of 16) and more 3 and 4 pollinated seed pods (2-4 for each) than would be expected if seeds in the population were just randomly pollinated. Basically, I think this is classic example for zero inflated data, first a insect either does or does not visit the flower at all (one zero generator) and if it does, then pollinates 0-4 of the seeds in another distribution. The alternative hypothesis is the plant is partially selfing, and it would then be expected that every seed would have the same probability of being pollinated (this data suggests a roughly 0.1 chance, which means 0.01 chance for two seeds in the same pod, etc).

But I simply want to demonstrate the data best fits one or the other distribution, not actually DO an ZIP or ZINB on the data. I think whatever method I use should take into account the actual number of pollinated seeds and the number of pods sampled on each plant. The best thing I have come up with is to do some sort of boot strap thing where I just randomly assign the number of pollinated seeds for a given plant into the number of seed pods I sampled, do that 10,000 times and see how likely it is the experimental data for the given plant came out of that random distribution.

I just feel there is something about this that should be a lot easier than brute force bootstrapping, but after days of thinking and searching I’m giving up. I can’t just compare to a Poisson distribution because it’s upper bound, it’s not binomial because I need to generate the expected distribution somehow 1st. Any thoughts? And I’m using R so advice there (especially how to most elegantly generate 10,000 random distributions of n balls into 16 boxes that can each contain at most 4 balls) would be most welcome.

ADDED 9/07/2012
First, thanks to you all for all the interest and help. Reading over the answers has made me think to reword my question a bit. What I’m saying is that I have one hypothesis (which for now I am thinking of as the null) that seeds are pollinated randomly across pods, and my alternative hypothesis is that a seed pod with at least 1 pollinated seed is more likely to have multiple pollinated seeds than would be expected by a random process. I’ve provided real data from three plants as examples to illustrate what I’m talking about. First column is the # of pollinated seeds in a pod, second column is the frequency of pods with that seed count.

plant 1 (total 3 seeds: 4% pollination)

num.seeds ::pod.freq

0::16

1::1

2::1

3::0

4::0

plant 2 (total 19 seeds: 26% pollination)

num.seeds::pod.freq

0::12

1::1

2::1

3::0

4::4

plant 3 (total 16 seeds: 22% pollination)

num.seeds::pod.freq

0::9

1::4

2::3

3::2

4::0

In plant #1, only 3 seeds were pollinated in 18 pods, one pod had one seed, and one pod had two seeds. Thinking about a process of adding one seed to the pods at random, the first two seeds each go into their own pod, but for the 3rd seed, there are 6 spots available in pods that already have one seed but 64 spots in the 16 pods with no seeds, so the highest probability of a pod with 2 seeds here is 6/64= 0.094. That’s a bit low, but not really extreme, so I’d say that this plant fits the hypothesis of random pollination across all seeds with a ~4% chance of pollination occurring. But plant 2 looks much more extreme to me, with 4 pods completely pollinated, yet 12 pods with nothing. I’m not quite sure how to calculate the odds of this distribution directly (hence my bootstrap idea) but I’d guess the odds of this distribution occurring at random if each seed has a ~25% chance of pollination are quite low. Plant #3 I really have no idea, I think there are more 0’s and 3’s than one should expect for a random distribution but my gut feeling is that this distribution for this number of seeds is much more likely than the distribution for plant #2, and may not be that unlikely. But obviously I want to know for sure, and across all plants.

In the end I’m looking to write a statement like “The distribution of pollinated seeds in seed pods fits (or does not fit) the hypothesis that plants are not simply partially self compatible, but require visitation of a pollinator to effect seed set. (results of statistical test).” This is really just part of my forward looking section, where I’m talking about what experiments to conduct next, so I’m not desperate for this to be one thing or the other, but I want to know for myself, if possible. If I can’t do what I’m trying to do with this data, I’d like to know that too!

I did ask a rather broad question at first, since I’m curious as to whether or not there are any good tests to show if data should go into a zero inflated model in the first place. All of the examples I’ve seen seem to say –“look, there are a lot of zeros here, and there is a reasonable explanation for that, so let’s use a zero inflated model”. That’s what I’m doing right now on this forum, but I had an experience on my last chapter where I used a Poisson glm for count data and my one of my supervisors said “No, glms are too complex and unnecessary, this data should go into a contingency table” and then sent me a data dump of the massive contingency table generated by their expensive stats package that gave the same p values for all my factors + interactions to three significant digits!! So, I’m trying to keep the stats clear and simple, and make sure I understand them well enough to robustly defend my choices, which I don’t feel I can do for a zero inflated model right now. I’ve used both a quasibinomial (for whole plants to get rid of pesudoreplicaiton) and a mixed model for the above data to compare treatments and answer my main experimental questions, either seems to do the same job, but I’m going to also play around with ZINB’s tonight, to see how well that performs. I’m thinking if I can explicitly demonstrate that this data is strongly clustered (or zero inflated) at first, then provide a good biological reason for that occurring, I’ll be much better set up to subsequently pull out a ZINB, than to just compare one to a quasibinomial/mixed model and argue since it gives better results, that’s what I should use.

But I don’t want to distract too much from my primary question, how can I determine if my data really is more zero inflated than expected from a random distribution? In my case the answer to that is what is of real interest to me, with the possible benefit for model justification being a bonus.

Thanks again for all your time and help!

Cheers,
BWGIA

This seems like a relatively straightforward (nonlinear) mixed model to me. You have seed pods nested into clusters nested into plants, and you can fit a binomial model with random effects at each stage:

    library(lme4)     binre <- lmer( pollinated ~ 1 + (1|plant) + (1|cluster), data = my.data, family = binomial) 

or with covariates if you have them. If the flowers self-pollinate, then you might see some mild effects due to natural variability in how viable the plants are by themselves. However if most of the variability in the response is driven by say cluster variability, you would have a stronger evidence of pollination by insects that might visit only selected clusters on a plant. Ideally, you would want a non-parametric distribution of the random effects rather than Gaussian: a point mass at zero, for no insect visits, and a point mass at a positive value — this is essentially the mixture model Michael Chernick thought about. You can fit this with GLLAMM Stata package, I'd be surprised if this were not possible in R.

Probably for a clean experiment, you would want to have the plants inside, or at least in a location with no insect access, and see how many seeds would be pollinated. That would probably answer all your questions in a more methodologically rigorous way.

Similar Posts:

Rate this post

Leave a Comment