# Solved – model for continuous dependent variable bounded between 0 and 1

I'm fitting a multiple regression model in which the predicted variable is runoff ratio: the ratio of watershed discharge to the precipitation input. This should generally be bounded by [0,1]; however, measurement error produces some values > 1.

Originally, I modeled this with the predicted variable untransformed, but logistic regression has been suggested to me, and I have also heard Beta regression suggested. I'm not sure how to proceed, or whether these transformations are appropriate for my data. My questions are:
1) Is a logistic regression appropriate for these data? and
2) If I were to proceed with logistic regression, would I need to convert the runoff ratios to proportions, or would I apply the logit to the values as they are?

Sorry if these are obtuse questions – I'm new to logit and most of the information I have found is for binary response variables.

As a simple version: I am modeling runoff ratio (rr) as a function of precipitation (pcp) and antecedent water table position (ant):

rr ~ pcp + ant

rr is a continuous variable. I am not interested in the probability of specific values, rather I'm interested in the values themselves – both to assess the significance of the predictors and as a predictive model.

Conceptually, I was fine modeling it un-transformed. However, a simple linear regression allows predicted values outside of the physical range of [0,1]. As mentioned above, measurement error does lead to values >1, which I'll eventually have to deal with.


Let \$run\$ be the runoff, as measured with error, so that the measured runoff ratio \$rr\$ is \$run/pcp\$. The stated model and its alternatives appear to be in the form

\$\$rr = \frac{run}{pcp} \sim F(\beta_{pcp} (pcp) + \beta_{ant} (ant) + \beta_0)\$\$

where \$F\$ is some family of distributions (such as Beta distributions) and the \$\beta_{*}\$ are coefficients to be estimated. The main problem with this is that unless the dispersion of the measurement error in \$run\$ is directly proportional to \$pcp\$, the structure of \$F\$ will be unnecessarily complicated. Why not algebraically rewrite the relationship as

\$\$run = \beta_{pcp} (pcp)^2 + \beta_{ant} (ant)(pcp) + \beta_0(pcp) + \varepsilon\$\$

where \$\varepsilon\$ represents the measurement error? The absence of several simple terms in this formula (such as one depending directly on \$ant\$ as well as a constant term) suggests that the proposed model may be artificially limited. Thus, ordinary regression (using \$run\$ or some re-expression thereof, such as a square or cube root, as the dependent variable) to fit a model like

\$\$run = \alpha_0 + \alpha_{pcp}(pcp) + \alpha_{ant}(ant) + \alpha_{pcp2}(pcp)^2 + \alpha_{ant,pcp}(ant)(pcp) + \varepsilon\$\$

would be a good way to begin an analysis. And if indeed the variance of \$\varepsilon\$ depends on \$pcp\$, that can be modeled in various straightforward ways. This approach seems more natural, realistic, and interpretable than hoping the ratio \$rr\$ would satisfy the more restrictive assumptions of Beta or Logistic regression.
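To make the suggestion concrete, here is a minimal sketch of fitting the five-term model for \$run\$ by least squares, using only the Python standard library. Everything in it is an assumption for illustration: the data are synthetic, the values in `TRUE_ALPHA` are made up, and the helper names (`design_row`, `fit_least_squares`, `predict_run`) are hypothetical. The optional `weights` argument shows one simple way to handle error variance that depends on \$pcp\$: if the standard deviation of \$\varepsilon\$ is taken to be proportional to \$pcp\$, weighting each observation by \$1/pcp^2\$ is the usual weighted least squares choice.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unmodified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def design_row(pcp, ant):
    """Regressors in the order: 1, pcp, ant, pcp^2, ant*pcp."""
    return [1.0, pcp, ant, pcp * pcp, ant * pcp]


def fit_least_squares(rows, y, weights=None):
    """Solve the (weighted) normal equations X'WX a = X'Wy."""
    if weights is None:
        weights = [1.0] * len(y)  # ordinary least squares
    k = len(rows[0])
    xtx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
            for j in range(k)] for i in range(k)]
    xty = [sum(w * r[i] * yi for r, yi, w in zip(rows, y, weights))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_run(alpha, pcp, ant):
    """Predicted runoff for given precipitation and antecedent position."""
    return sum(c * v for c, v in zip(alpha, design_row(pcp, ant)))


# Synthetic data only: coefficients and noise level are invented.
random.seed(0)
TRUE_ALPHA = [0.2, 0.3, -0.1, 0.05, 0.15]
pcp_obs = [random.uniform(0.5, 5.0) for _ in range(500)]
ant_obs = [random.uniform(0.0, 2.0) for _ in range(500)]
# Measurement error whose standard deviation grows with pcp.
run_obs = [predict_run(TRUE_ALPHA, p, a) + random.gauss(0.0, 0.1 * p)
           for p, a in zip(pcp_obs, ant_obs)]

rows = [design_row(p, a) for p, a in zip(pcp_obs, ant_obs)]
alpha_hat = fit_least_squares(rows, run_obs)
alpha_hat_w = fit_least_squares(
    rows, run_obs, weights=[1.0 / p ** 2 for p in pcp_obs])

print("OLS estimates:", [round(v, 3) for v in alpha_hat])
print("WLS estimates:", [round(v, 3) for v in alpha_hat_w])
# A runoff ratio can still be reported by dividing predicted runoff by pcp.
print("Implied rr at pcp=2, ant=1:", predict_run(alpha_hat_w, 2.0, 1.0) / 2.0)
```

In practice one would use a statistics package to get standard errors and diagnostics for each term; the point of the sketch is only that the model is linear in the \$\alpha\$ coefficients, so ordinary or weighted least squares applies directly, and a predicted runoff ratio is still available afterwards as predicted \$run\$ divided by \$pcp\$.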
