It is common practice, when fitting a gamma GLM, to replace zeros in the outcome variable with a small number such as 1 or 0.1, provided the zero observations make up no more than, say, 10% of the data.
However, this value is arbitrary and adds some kind of randomness to the model.
I'd like to know what implications this value has for the model and for predictions from it, and in what extreme situations it might cause serious problems.
Can you think of a real example, and how could I demonstrate the problem?
Best Answer
The description "common practice" may fit the literature you know, but I have not often seen this elsewhere. Similarly, "randomness" is the wrong word here: "arbitrariness" perhaps fits the case better.
Note that the replacement number, even if you favour this approach, should certainly not be any value that might genuinely occur otherwise, as is likely for 1. Furthermore, "a very small number" is the wrong way to think about it: the logarithm of such a number is large and negative, and that could create massive outliers on the log scale.
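To make the point about the log scale concrete, here is a minimal sketch. The outcome values are hypothetical, made up purely for illustration, and `log_z_score` is a helper name introduced here. It measures how far the logarithm of each candidate replacement sits from the logs of the genuine positive values, in standard deviation units.

```python
import math

# Hypothetical positive outcomes (illustration only, not real data).
positives = [12.0, 35.0, 8.5, 60.0, 22.0, 15.0, 41.0, 9.0, 27.0]

logs = [math.log(v) for v in positives]
mean_log = sum(logs) / len(logs)
sd_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1))

def log_z_score(replacement):
    """Distance of log(replacement) from the mean log outcome, in SD units."""
    return (math.log(replacement) - mean_log) / sd_log

# The smaller the replacement, the more extreme it is on the log scale.
for c in (1.0, 0.1, 0.001):
    print(f"replacement {c:>6}: log = {math.log(c):7.2f}, z = {log_z_score(c):7.2f}")
```

The smaller the replacement, the further it lies below the bulk of the data on the log scale, which is the sense in which "very small" works against you.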
If there are substantive grounds for regarding zeros as really small positives, some adjustment might make sense.
But the fudge should be unnecessary here. Even if you use a log link with the gamma family in a GLM, the key point is not whether there are zeros in the data (so that the log of zero makes no sense) but the assumption that the means are positive: the link function is applied to the mean, not to the individual observations. If you use some other link, the same point applies.
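One way to see this, sketched in standard GLM notation that is not part of the original answer: write $\eta_i = x_i^\top \beta$, $\mu_i = \exp(\eta_i)$ for the log link, and $V(\mu) = \mu^2$ for the gamma variance function. The estimating equations for $\beta$ are then

```latex
\sum_{i=1}^{n} \frac{y_i - \mu_i}{V(\mu_i)} \,
               \frac{\partial \mu_i}{\partial \eta_i} \, x_i
  \;=\; \sum_{i=1}^{n} \frac{y_i - \mu_i}{\mu_i} \, x_i
  \;=\; 0 .
```

Each $y_i$ enters only linearly and the only division is by $\mu_i$, so an observation with $y_i = 0$ is well defined as long as $\mu_i > 0$. Whether a particular package accepts zeros with the gamma family is a separate matter of implementation.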
The comments here naturally do not rule out other checks, e.g. sensitivity checks to possible adjustment of recorded zeros, or indeed other models should they seem appropriate (zero-inflated gammas, or whatever).
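A sensitivity check of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are hypothetical, `fit_gamma_log` is a helper written here in plain Python for a single predictor, and it estimates only the mean parameters by iteratively reweighted least squares (IRLS). The model is fitted with the zeros kept and again with the zeros replaced by several candidate values.

```python
import math

def fit_gamma_log(x, y, iters=100):
    """Fit E[y] = exp(b0 + b1*x) by IRLS for a gamma GLM with log link.

    With variance function mu^2 and a log link the IRLS weights are all 1,
    so each step is an ordinary least-squares fit of the working response
    z = eta + (y - mu) / mu on x. Only mu has to be positive, not y.
    """
    n = len(y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = math.log(sum(y) / n), 0.0  # start from the overall mean
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]
        z = [e + yi / math.exp(e) - 1.0 for e, yi in zip(eta, y)]
        zbar = sum(z) / n
        b1 = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
        b0 = zbar - b1 * xbar
    return b0, b1

# Hypothetical data (illustration only): two recorded zeros at low x.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
y = [0.0, 3.1, 4.8, 0.0, 9.5, 11.2, 17.0, 20.5, 31.0, 38.0, 55.0, 70.0]

results = {}
for c in (0.0, 0.001, 0.1, 1.0):  # c = 0.0 keeps the zeros as recorded
    y_adj = [c if yi == 0 else yi for yi in y]
    results[c] = fit_gamma_log(x, y_adj)
    b0, b1 = results[c]
    print(f"zeros -> {c:<6}: intercept = {b0:.4f}, slope = {b1:.4f}")
```

Because the outcome enters the working response linearly, replacements far below the fitted means should move the estimates only slightly, while a replacement comparable to genuine values, such as 1 here, shifts them more. Comparing the rows shows how much the conclusions depend on the choice.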
Similar Posts:
- Solved – A model for non-negative data with many zeros: pros and cons of Tweedie GLM
- Solved – Rules for Percentage of zeros in a zero inflated model
- Solved – Is having too high variance a problem when doing t-test
- Solved – Comparing and visualising highly skewed distributions