R glmer: distribution for strongly right-skewed data

I have the following dataset with daily home range sizes (meter95, meter50) per individual:

      trackId Date        rain  temp windSp distance flights age   sex   meter95 meter50
      <fct>   <date>     <dbl> <dbl>  <dbl>    <int>   <int> <fct> <fct>   <int>   <int>
    1 AP002   2017-12-12  0     15.2   2.88     2311       5 adult male      123      10
    2 AP002   2017-12-13  0.06  13.5   3.11     4289       9 adult male       50       8
    3 AP002   2017-12-14  0.23  13.6   2.73     4722      11 adult male      111       4
    4 AP002   2017-12-15  0.39  13.2   1.33     9297      28 adult male      164     110
    5 AP002   2017-12-16  0.02  12.8   1.28     7848      20 adult male      155      29
    6 AP002   2017-12-17  0.01  14.1   1.78     7252      16 adult male      198      91

I am trying to figure out which distribution to fit to the home range data. However, the data seem to be very right-skewed:


The data do not include many zeros (only 10/356), unlike the zero-heavy data discussed in these posts:
Probability distribution for heavy zero, right skewed data
Fitting a heavy right skewed distribution

I also tried fitting several common distributions using fitdistr(), but none of them fit well (only a few shown here):
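For context, the fitting attempts looked roughly like this (a sketch using the fitdistrplus package rather than MASS::fitdistr(); the data frame name `dat` and the choice of distributions are assumptions):

```r
library(fitdistrplus)

# meter95 is the daily 95% home-range size; drop the zeros before fitting
# strictly positive distributions such as the lognormal or gamma
hr <- dat$meter95[dat$meter95 > 0]

fit_ln  <- fitdist(hr, "lnorm")    # lognormal
fit_gam <- fitdist(hr, "gamma")    # gamma
fit_wb  <- fitdist(hr, "weibull")  # Weibull

# compare the candidate fits visually and by goodness-of-fit statistics
denscomp(list(fit_ln, fit_gam, fit_wb),
         legendtext = c("lognormal", "gamma", "Weibull"))
gofstat(list(fit_ln, fit_gam, fit_wb))

# Cullen and Frey (skewness-kurtosis) graph
descdist(hr, boot = 500)
```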


I also drew a Cullen and Frey graph to see which distribution fits best (as suggested here: How to determine which distribution fits my data best?), and it suggests a beta distribution:

Cullen and Frey

I am quite new to this, so I am not sure how to proceed, or whether the Cullen and Frey graph really identifies the right distribution; I have read on other forums that it does not always give the best fit. I also thought my data might fit an inverse-Gaussian distribution, for example, but Cullen and Frey does not include that option.

Also, I wonder whether it might be possible to transform my data so that it fits one of the more common distributions. Is that possible when building glmer() models?

First, use your knowledge to simplify the problem; you know things the data do not.

  • Are the high values that cause the positive skew mistakes? Are they meaningful observations? Basically, is there any reason in the data collection process that might mean we should simply toss these observations?

  • If you need to model them, think about the goals of your analysis. Do they require you to model the dependent variable as continuous? If not, can you cut the dependent variable into meaningful, discrete groups? I am not suggesting categorizing the data on something arbitrary like a median split; rather, use your domain knowledge to decide whether meaningful groups can be made out of the dependent variable. If they can, the situation becomes more tenable: you predict the probability of belonging to one of these buckets using some type of binomial or ordinal regression.
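If meaningful size classes exist, the ordinal route might look like this (a sketch only: the cut points are hypothetical placeholders, and I am assuming the ordinal package's clmm() for the random intercept per individual):

```r
library(ordinal)

# hypothetical, domain-informed cut points for daily home-range size;
# replace with boundaries that are biologically meaningful for your species
dat$sizeClass <- cut(dat$meter95,
                     breaks = c(-Inf, 50, 150, Inf),
                     labels = c("small", "medium", "large"),
                     ordered_result = TRUE)

# cumulative link mixed model: ordinal response, random intercept per individual
m_ord <- clmm(sizeClass ~ rain + temp + windSp + (1 | trackId), data = dat)
summary(m_ord)
```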

If the above isn't possible, we can move on to finding the error distribution that fits your data. It is funny that the Cullen and Frey graph recommends the beta distribution, because the beta distribution can take on almost any shape; it is as if the algorithm is saying, "None of these other distributions fit, so here, take the one that can take on all kinds of shapes!" (Full disclosure: I don't know much about the Cullen and Frey graph.)

However, beta regression requires your dependent variable to lie strictly between 0 and 1. This can be accomplished fairly easily by rescaling the data, as suggested by Smithson and Verkuilen (2006) in Psychological Methods, p. 57.
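In code, the Smithson and Verkuilen rescaling amounts to squeezing the variable onto [0, 1] and then compressing it slightly so the boundary values 0 and 1 never occur exactly (the data frame name `dat` is assumed):

```r
# Smithson & Verkuilen (2006) compression:
# 1) rescale onto [0, 1]
y01 <- (dat$meter95 - min(dat$meter95)) /
       (max(dat$meter95) - min(dat$meter95))

# 2) shrink away from the boundaries: (y * (n - 1) + 0.5) / n
n <- nrow(dat)
dat$meter95_beta <- (y01 * (n - 1) + 0.5) / n  # now strictly in (0, 1)
```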

The tougher part is that your data seem to be multilevel (judging from your mention of the glmer function). That is, some observations are dependent on others because they come from the same individuals or tracks. The gamlss package offers a large set of error distributions (including the beta distribution), and it handles random effects as well. There are a number of articles and even a few books based on this package.
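A sketch of what that might look like, assuming a column `meter95_beta` holding the response rescaled onto (0, 1) (BE is gamlss's beta family, and random() fits a normal random intercept for each level of a factor):

```r
library(gamlss)

# beta-distributed response on (0, 1) with a random intercept per individual
m_beta <- gamlss(meter95_beta ~ rain + temp + windSp + random(trackId),
                 family = BE, data = dat)
summary(m_beta)

# gamlss also makes it easy to swap in other skewed families on the raw
# response, e.g. inverse Gaussian (IG) or Box-Cox Cole-Green (BCCG),
# and to compare candidate models by generalized AIC
GAIC(m_beta)
```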

I worry that you will still run into the issue of fitting a model with many distributional assumptions (e.g., conditional values of the DV follow a beta distribution, intercepts follow a normal distribution, etc.) that might be violated (and to which the algorithm won't be robust), or that will make it very difficult to converge on a solution.

I would try out some of the latter methods by looking into the very flexible gamlss package, but I would also try to simplify the analysis if you can. You do not need to model the conditional values of the dependent variable perfectly to get at a useful result.
