I have biomass data (a continuous response variable). If sufficient data are collected, log(Biomass) follows a normal distribution. However, I am splitting the overall biomass by family (i.e., biomass for each family), and at some sites a given family was not recorded, so the biomass of that family there is 0.
Up to now I have applied a log(x + 1) transformation to the family biomass data. This gives me positive values for the non-zero biomasses and 0s where the family biomass was 0.
The non-zero transformed values follow a normal distribution. If I fit a stan_glmer model using only the non-zero values, the pp_check and residuals look fine (in R).
But that is excluding my 0's! (which is excluding part of the reality)
I wanted to account for the 0s, and a hurdle model was suggested to me: one that uses a binomial distribution for the probability of getting a 0 versus a positive value, and then fits another distribution to the non-zero data. Investigating this further led me to the brms package (which has the hurdle_lognormal function). My question: is there a similar function that does hurdle_normal or hurdle_gaussian?
If I fit hurdle_lognormal to my transformed family biomass data, the predictive fit underestimates the observed data (e.g., the residuals are scattered around 2.5 instead of around 0); this also happens with my non-transformed data. I think this happens because the model uses a lognormal distribution for the non-zero values; with a normal distribution I think the fit would improve tremendously.
I really want to do this using Bayesian techniques (e.g., Stan). However, I did look at other packages that offer hurdle models (e.g., hurdlr or pscl), and I still couldn't specify a normal distribution for the non-zero values.
Any comments or suggestions about how to proceed? Thank you very much in advance!
If you want to model data that essentially follow a normal distribution for the positive values but have a point mass at zero, you could start with a Gaussian model censored at zero. In the econometric literature this is known as the tobit model.
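To make the censoring idea concrete, here is a minimal sketch of the tobit log-likelihood, written in plain Python purely for illustration (it is not how crch or any R package implements it). The function name `tobit_loglik`, the toy data, and the constant latent mean are assumptions made up for this example. A zero contributes the probability that the latent Gaussian falls at or below zero; a positive value contributes the ordinary normal density.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi


def tobit_loglik(y, mu, sigma):
    """Log-likelihood of a Gaussian model left-censored at zero.

    y     : observed responses (zeros are the censored observations)
    mu    : latent means, one per observation (e.g. a linear predictor)
    sigma : common standard deviation of the latent Gaussian (> 0)
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi <= 0:
            # Censored observation: P(latent <= 0) = Phi(-mu / sigma)
            total += log(STD_NORMAL.cdf(-mi / sigma))
        else:
            # Uncensored observation: normal density evaluated at yi
            total += log(STD_NORMAL.pdf((yi - mi) / sigma)) - log(sigma)
    return total


# Hypothetical toy data: two zeros and three positive values
y = [0.0, 0.0, 1.2, 2.5, 3.1]
mu = [1.0] * len(y)  # intercept-only latent mean, chosen arbitrarily
ll = tobit_loglik(y, mu, sigma=1.5)
print(ll)
```

Note that a single pair of parameters (latent mean and standard deviation) governs both how often zeros occur and how the positive values are distributed, which is the restriction the two-part model relaxes.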
The next step would be to fit a two-part model with (1) a binary hurdle for zero vs. non-zero (e.g., using a probit link, corresponding to an underlying Gaussian distribution) and (2) a zero-truncated Gaussian model for the positive observations. In the econometrics literature this is known as the Cragg model. The tobit model is nested in the Cragg model, namely if the scaled coefficients from both parts coincide. See also: Is a "hurdle model" really one model? Or just two separate, sequential models?
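The two-part structure and the nesting claim can be checked numerically. The following is a minimal sketch in plain Python, for illustration only; the function names, the toy data, and the parameter values are assumptions invented for this example. The hurdle part uses a probit index `gamma`, and the positive part uses a Gaussian truncated at zero. When `gamma` equals `mu / sigma`, the hurdle probability cancels the truncation term and the log-likelihood collapses to the censored (tobit) one.

```python
from math import isclose, log
from statistics import NormalDist

N01 = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi


def cragg_loglik(y, gamma, mu, sigma):
    """Two-part log-likelihood: probit hurdle + zero-truncated Gaussian.

    gamma : probit index per observation, P(y > 0) = Phi(gamma)
    mu    : location of the (untruncated) Gaussian per observation
    sigma : scale of the (untruncated) Gaussian (> 0)
    """
    total = 0.0
    for yi, gi, mi in zip(y, gamma, mu):
        if yi <= 0:
            total += log(N01.cdf(-gi))  # P(y = 0) = 1 - Phi(gamma)
        else:
            total += log(N01.cdf(gi))  # hurdle crossed
            total += log(N01.pdf((yi - mi) / sigma)) - log(sigma)
            total -= log(N01.cdf(mi / sigma))  # renormalise to y > 0
    return total


def tobit_loglik(y, mu, sigma):
    """Censored-at-zero Gaussian log-likelihood, for comparison."""
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi <= 0:
            total += log(N01.cdf(-mi / sigma))
        else:
            total += log(N01.pdf((yi - mi) / sigma)) - log(sigma)
    return total


# Hypothetical toy data and parameters
y = [0.0, 0.0, 1.2, 2.5, 3.1]
mu = [0.5, 0.8, 1.0, 2.0, 2.5]
sigma = 1.5

# Nested case: the probit index is the scaled location, gamma = mu / sigma
nested = cragg_loglik(y, [m / sigma for m in mu], mu, sigma)
tobit = tobit_loglik(y, mu, sigma)
print(isclose(nested, tobit, rel_tol=1e-9))

# General case: the hurdle has its own index, so the value differs
free = cragg_loglik(y, [0.0] * len(y), mu, sigma)
```

Because the zero part and the positive part share no parameters in the general case, the two pieces can be estimated separately, and comparing the fitted two-part model against the censored one gives a check on whether the tobit restriction is reasonable for the data.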
A potential caveat is that while the usual (un-censored and un-truncated) Gaussian regression is consistent under heteroscedasticity, the same does not hold for the censored and truncated versions. Hence, taking heteroscedasticity into account might matter.
An R package that implements heteroscedastic censored or truncated models is crch at https://CRAN.R-project.org/package=crch. A paper introducing the package along with a worked example that compares the censored model with the two-part hurdle model is Messner, Mayr, Zeileis (2016), "Heteroscedastic Censored and Truncated Regression with crch", "The R Journal", 8(1), 173-181. https://journal.R-project.org/archive/2016/RJ-2016-012/