I'm attempting a multiple regression model where the predicted variable is runoff ratio, the ratio of watershed discharge to the precipitation input. This should generally be bounded within [0,1]; however, due to measurement error, some values > 1 occur.
Originally, I modeled this with the predicted variable untransformed, but logistic regression has been suggested to me, and I have also heard Beta regression suggested. I'm not sure how to proceed, or whether these transformations are appropriate for my data.
My questions are:
1) Is a logistic regression appropriate for these data? and
2) If I were to proceed with logistic regression, would I need to convert the runoff ratios to proportions, or would I apply the logit to the values as they are?
Sorry if these are obtuse questions – I'm new to logit and most of the information I have found is for binary response variables.
Edited for suggested additions:
As a simple version: I am modeling runoff ratio (rr) as an effect of precipitation (pcp) and antecedent water table position (ant):
rr ~ pcp + ant
rr is a continuous variable. I am not interested in the probability of specific values; rather, I'm interested in the values themselves, both to assess the significance of the predictors and to use as a predictive model.
Conceptually, I was fine modeling it un-transformed. However, a simple linear regression allows predicted values outside of the physical range of [0,1]. As mentioned above, measurement error does lead to values >1, which I'll eventually have to deal with.
Best Answer
Let "$run$" be the runoff, as measured with error, so that the measured runoff ratio $rr$ is $run/pcp$. The stated model and its alternatives appear to be in the form
$$rr = \frac{run}{pcp} \sim F(\beta_{pcp} (pcp) + \beta_{ant} (ant) + \beta_0)$$
where $F$ is some family of distributions (such as Beta distributions) and the $\beta_{*}$ are coefficients to be estimated. The main problem with this is that unless the dispersion of the measurement error in $run$ is directly proportional to $pcp$, the structure of $F$ will be unnecessarily complicated. Why not algebraically rewrite the relationship as
$$run = \beta_{pcp} (pcp)^2 + \beta_{ant} (ant)(pcp) + \beta_0(pcp) + \varepsilon$$
where $\varepsilon$ represents the measurement error? The absence of several simple terms in this formula (such as one depending directly on $ant$ as well as a constant term) suggests that the proposed model may be artificially limited. Thus, ordinary regression (using $run$ or some re-expression thereof, such as a square or cube root, as the dependent variable) to fit a model like
$$run = \alpha_0 + \alpha_{pcp}(pcp) + \alpha_{ant}(ant) + \alpha_{pcp2}(pcp)^2 + \alpha_{ant,pcp}(ant)(pcp) + \varepsilon$$
would be a good way to begin an analysis. And if indeed the variance of $\varepsilon$ depends on $pcp$, that can be modeled in various straightforward ways. This approach seems more natural, realistic, and interpretable than hoping the ratio $rr$ would satisfy the more restrictive assumptions of Beta or Logistic regression.
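As a rough illustration of that starting point, here is a minimal Python sketch that fits the expanded model for $run$ by ordinary least squares with NumPy. The data are simulated, and the coefficients used to generate them are arbitrary assumptions rather than estimates from any real watershed. The helper names `design_matrix` and `predict_rr` are hypothetical, introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data, purely for illustration (assumed values, not real measurements).
n = 200
pcp = rng.uniform(5.0, 50.0, size=n)   # precipitation input
ant = rng.uniform(0.0, 1.0, size=n)    # antecedent water table position
# Assumed "true" runoff ratio, then runoff measured with additive error.
true_rr = 0.1 + 0.005 * pcp + 0.3 * ant
run = true_rr * pcp + rng.normal(0.0, 1.0, size=n)


def design_matrix(pcp, ant):
    """Columns: 1, pcp, ant, pcp^2, ant*pcp (the expanded model for run)."""
    pcp = np.asarray(pcp, dtype=float)
    ant = np.asarray(ant, dtype=float)
    return np.column_stack([np.ones_like(pcp), pcp, ant, pcp**2, ant * pcp])


# Ordinary least squares fit of run on the expanded set of terms.
X = design_matrix(pcp, ant)
alpha, *_ = np.linalg.lstsq(X, run, rcond=None)

# Residuals on the runoff scale.
resid = run - X @ alpha


def predict_rr(pcp_new, ant_new, alpha):
    """Predict runoff from the fitted model, then convert to a runoff ratio."""
    pcp_new = np.asarray(pcp_new, dtype=float)
    return (design_matrix(pcp_new, ant_new) @ alpha) / pcp_new


rr_hat = predict_rr(pcp, ant, alpha)
print("estimated alphas:", np.round(alpha, 4))
```

Plotting `resid` against `pcp` is one simple way to check whether the spread of the error changes with precipitation before deciding how to model it.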