I have the following dataset with daily home range sizes (meter95, meter50) per individual:

```
  trackId Date        rain  temp windSp distance flights age   sex   meter95 meter50
  <fct>   <date>     <dbl> <dbl>  <dbl>    <int>   <int> <fct> <fct>   <int>   <int>
1 AP002   2017-12-12  0     15.2   2.88     2311       5 adult male      123      10
2 AP002   2017-12-13  0.06  13.5   3.11     4289       9 adult male       50       8
3 AP002   2017-12-14  0.23  13.6   2.73     4722      11 adult male      111       4
4 AP002   2017-12-15  0.39  13.2   1.33     9297      28 adult male      164     110
5 AP002   2017-12-16  0.02  12.8   1.28     7848      20 adult male      155      29
6 AP002   2017-12-17  0.01  14.1   1.78     7252      16 adult male      198      91
```

I am trying to figure out which distribution to fit to the home range data. However, the data seem to be very right-skewed.

It does not include many zeros (only 10 out of 356 observations), unlike the data discussed in these posts:

Probability distribution for heavy zero, right skewed data

Fitting a heavy right skewed distribution

I tried to fit other common distributions with `fitdistr()`, but none of these fit well.

I also drew a Cullen and Frey graph to see which distribution fits best (as suggested here: How to determine which distribution fits my data best?), and it points to a beta distribution:

I am quite new to this, so I am not sure how to go from here and whether the Cullen and Frey graph really gives me the right distribution. I read on other forums that it doesn't always give the best fit. I also thought my data could maybe fit an inverse-Gaussian distribution, for example, but Cullen and Frey does not include that option.

Also, I wonder whether it might be possible to transform my data so that it does fit one of the more common distributions. Is that possible when building `glmer()` models?


#### Best Answer

First, use your knowledge to simplify the problem; you know things the data do not.

Are the high values that cause the positive skew mistakes? Are they meaningful observations? Basically, is there any reason in the data collection process that might mean we should simply toss these observations?

If you need to model them, think about the goals of your analysis. Do they require you to model the dependent variable as a continuous one? If not, are there meaningful ways you can cut the dependent variable into discrete groups? I am *not* suggesting categorizing the data with something arbitrary like a median split; what I *am* saying is to use your domain knowledge to determine whether any meaningful groups can be made out of the dependent variable. If you can, the situation becomes more tractable: you predict the probability of belonging to one of these buckets using some type of binomial or ordinal regression.
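As a minimal sketch of this idea: the cut points and data below are entirely made up (real thresholds must come from domain knowledge about home ranges, not from the data), and `polr()` from the MASS package stands in for any ordinal regression.

```r
library(MASS)  # ships with R; provides polr() for ordinal regression

# Simulated stand-in for the real data (column names follow the question)
set.seed(1)
df <- data.frame(trackId = rep(c("AP002", "AP003"), each = 50),
                 rain    = runif(100),
                 temp    = rnorm(100, mean = 14),
                 windSp  = runif(100, 1, 4),
                 meter95 = rgamma(100, shape = 2, scale = 80))

# Hypothetical cut points (100 and 300 m^2) -- placeholders only
df$hr_class <- cut(df$meter95,
                   breaks = c(-Inf, 100, 300, Inf),
                   labels = c("small", "medium", "large"),
                   ordered_result = TRUE)

# Proportional-odds model; note polr() ignores the repeated-measures
# structure -- ordinal::clmm() could add a random intercept per trackId
fit <- polr(hr_class ~ rain + temp + windSp, data = df, Hess = TRUE)
summary(fit)
```

The point is only that once the response is a small set of ordered categories, the skewness problem disappears entirely.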

If the above isn't possible, we can move on to finding an error distribution that fits your data. It is funny that the Cullen and Frey graph recommends the beta distribution, because the beta distribution can take on almost any shape; it is as if the algorithm is saying, "None of these other distributions fit, so here, take the one that can take on all kinds of shapes!" (Full disclosure: I don't know much about the Cullen and Frey graph.)

However, beta regression requires your dependent variable to lie strictly *between* 0 and 1. This can be accomplished fairly easily by rescaling the data, as suggested by Smithson and Verkuilen (2006) in *Psychological Methods*, p. 57.
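In base R, the Smithson and Verkuilen rescaling first squeezes the variable onto [0, 1] and then compresses it slightly with y' = (y(n − 1) + 0.5)/n so the boundary values 0 and 1 never occur (the numbers below are just the `meter95` values from the question plus an illustrative zero):

```r
# Squeeze a positive variable into the open interval (0, 1) so a
# beta model can be fit, following Smithson & Verkuilen (2006)
y <- c(123, 50, 111, 164, 155, 198, 0)   # e.g. meter95, with one zero
n <- length(y)

# Step 1: rescale to the closed interval [0, 1]
y01 <- (y - min(y)) / (max(y) - min(y))

# Step 2: compress away the boundary values 0 and 1
y_beta <- (y01 * (n - 1) + 0.5) / n

range(y_beta)  # strictly inside (0, 1)
```

With zeros and maxima moved off the boundary, the rescaled variable is a legal response for a beta likelihood.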

The tougher part is that your data seem to be multilevel (judging from your mention of the `glmer` function). That is, some observations are dependent on others (because they come from the same individuals or tracks). The `gamlss` package has a *ton* of error distributions (including the beta distribution) to choose from, and it handles random effects as well. There are a number of articles and even a few books based on this package: http://www.gamlss.com/books-articles/.
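A minimal sketch of what such a model could look like, with simulated data standing in for the real dataset (the column names follow the question; `BE` is the gamlss beta family, and `random()` adds a random intercept per individual):

```r
library(gamlss)  # CRAN package; install.packages("gamlss") if needed

# Simulated stand-in for the real data
set.seed(42)
df <- data.frame(trackId = factor(rep(paste0("AP", 1:8), each = 30)),
                 rain    = runif(240),
                 temp    = rnorm(240, mean = 14, sd = 2))
# Fake response already squeezed into (0, 1), as a beta model requires
df$hr01 <- rBE(240, mu = 0.3, sigma = 0.4)

# Beta-distributed response with a random intercept per individual
fit <- gamlss(hr01 ~ rain + temp + random(trackId),
              family = BE, data = df)
summary(fit)
```

In a real analysis the response would be the rescaled `meter95` (or `meter50`), and you would compare candidate families with `GAIC()` or the worm plots `gamlss` provides rather than trusting any single diagnostic.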

I worry that the issue you will still run into is that you will be fitting a model with many distributional assumptions (e.g., conditional values of the DV follow a beta distribution, random intercepts follow a normal distribution, etc.) that might be violated (and to which the algorithm won't be robust), or that the fitting routine may struggle to converge on a solution.

I would try out some of the latter methods by looking into the very flexible `gamlss` package, but I would also try to simplify the analysis if you can. You do not need to model the conditional values of the dependent variable perfectly to get a useful result.

### Similar Posts:

- Solved – What to do if no probability distribution accurately represents the data
- Solved – How to use Cullen and Frey graphs for downstream statistical analysis?
- Solved – How to determine the type of probability distribution for a dataset