I am conducting a multiple first-order regression analysis of genetic data. The vectors of y-values do not all follow a normal distribution, so I need to implement a non-parametric regression using ranks.
Is the lm() function in R suitable for this, i.e.,

lin.reg <- lm(Y ~ X * Z)

where Y, X and Z are vectors of ordinal categorical variables?
I am interested in the p-value assigned to the coefficient of the interaction term in the first-order model. The lm() function obtains this from a t-test, i.e., a test of whether the interaction coefficient is significantly different from zero. Is the automatic use of a t-test to determine this p-value appropriate when the regression is carried out on data as described?
Thanks.
EDIT
Sample data for clarity:
Y <- c(4, 1, 2, 3) # A vector of ranks
X <- c(0, 2, 1, 1) # A vector of genotypes (0 = aa, 1 = ab, 2 = bb)
Z <- c(2, 2, 1, 0)
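And for reference, how I would extract the interaction p-value from the fit (a minimal sketch on the toy data above; note the caveat in the comments):

lin.reg <- lm(Y ~ X * Z)
# The "X:Z" row holds the interaction estimate and its t-test p-value.
# Caveat: with only 4 observations the model has 4 coefficients, leaving
# 0 residual degrees of freedom, so the p-values are NA on this toy data;
# a real data set needs more observations for the test to be defined.
coef(summary(lin.reg))["X:Z", ]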
Best Answer
If your response variable is ordinal, you may want to consider an "ordered logistic regression". This is basically where you model the cumulative probabilities {in the simple example, you would model $\Pr(Y\leq 1), \Pr(Y\leq 2), \Pr(Y\leq 3)$}. This incorporates the ordering of the response into the model, without the need for an arbitrary assumption which transforms the ordered response into a numerical one (although, having said that, such a transformation can be a useful first step in exploratory analysis, or in selecting which $X$ and $Z$ variables are not necessary).
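As a concrete starting point, one common implementation is polr() from the MASS package. A minimal sketch, assuming the full data sit in a data frame dat with columns Y, X and Z (these names are illustrative):

library(MASS)                            # provides polr() for ordered logistic regression
dat$Y <- factor(dat$Y, ordered = TRUE)   # the response must be an ordered factor
ologit <- polr(Y ~ X * Z, data = dat, Hess = TRUE)
summary(ologit)  # coefficients (including X:Z) with standard errors and t-values
# Wrap X and Z in factor() here if they should be treated as categorical
# rather than as numeric scores.

Note that summary() on a polr fit reports t-values rather than p-values directly.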
There is a way that you can get the glm() function in R to give you the MLEs for this model (otherwise you would need to write your own algorithm to get the MLEs). You define a new set of variables, say $W$, where these are defined as
$$W_{1jk} = \frac{Y_{1jk}}{\sum_{i=1}^{I} Y_{ijk}}, \qquad W_{2jk} = \frac{Y_{2jk}}{\sum_{i=2}^{I} Y_{ijk}}, \qquad \dots, \qquad W_{I-1,jk} = \frac{Y_{I-1,jk}}{\sum_{i=I-1}^{I} Y_{ijk}}$$
where $i=1,\dots,I$ indexes the $Y$ categories, $j=1,\dots,J$ indexes the $X$ categories, and $k=1,\dots,K$ indexes the $Z$ categories. Then fit a glm() of $W$ on $X$ and $Z$ using the complementary log-log link function. Denoting the cumulative probability by $\theta_{ijk}=\Pr(Y_{ijk}\leq i)$, the MLEs of the $\theta$'s (assuming a multinomial distribution for the $Y_{ijk}$ values) are then
$$\hat{\theta}_{ijk}=\hat{W}_{ijk}+\hat{\theta}_{(i-1)jk}(1-\hat{W}_{ijk}), \qquad i=1,\dots,I-1,$$
where $\hat{\theta}_{0jk}=0$, $\hat{\theta}_{Ijk}=1$, and the $\hat{W}_{ijk}$ are the fitted values from the glm.
You can then use the deviance table (use the anova() function on the glm object) to assess the significance of the regressor variables.
EDIT: One thing I forgot to mention in my original answer: in the glm() function, you need to specify weights when fitting the model to $W$; these are equal to the denominators in the respective fractions defining each $W$.
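Putting the pieces together, here is a minimal sketch of this construction in R. It assumes the data have been cross-tabulated into an $I\times J\times K$ array of counts named counts (the array name and the long-format assembly are my own illustrative choices, as is including the interaction to match the original question; it also assumes every tail sum in the denominators is positive):

I <- dim(counts)[1]; J <- dim(counts)[2]; K <- dim(counts)[3]

# One row per (i, j, k) with i = 1, ..., I-1: the W's and their denominators
dat <- expand.grid(i = 1:(I - 1), j = 1:J, k = 1:K)
dat$n <- mapply(function(i, j, k) sum(counts[i:I, j, k]), dat$i, dat$j, dat$k)
dat$W <- mapply(function(i, j, k) counts[i, j, k], dat$i, dat$j, dat$k) / dat$n

# Complementary log-log link; the weights are the denominators, per the EDIT above.
# factor(j) and factor(k) stand in for the X and Z categories.
fit <- glm(W ~ factor(j) * factor(k), family = binomial(link = "cloglog"),
           weights = n, data = dat)
anova(fit, test = "Chisq")  # deviance table for assessing the regressors

# Recover the fitted cumulative probabilities via the recursion given above
dat$What <- fitted(fit)
theta <- array(0, dim = c(I, J, K))
for (i in 1:(I - 1)) for (j in 1:J) for (k in 1:K) {
  prev <- if (i == 1) 0 else theta[i - 1, j, k]   # theta_hat_{0jk} = 0
  w <- dat$What[dat$i == i & dat$j == j & dat$k == k]
  theta[i, j, k] <- w + prev * (1 - w)
}
theta[I, , ] <- 1  # theta_hat_{Ijk} = 1 by definition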
You could also try a Bayesian approach, but you would most likely need to use sampling techniques to get your posterior. Using the multinomial likelihood (but parameterised with respect to $\theta_{ijk}$, so the likelihood function will have differences of the form $\theta_{ijk}-\theta_{i-1,jk}$), the MLEs are a good "first crack" at genuinely fitting the model, and give an approximate Bayesian solution (as you may have noticed, I prefer Bayesian inference).
This method is in my lecture notes, so I'm not really sure how to reference it (there are no references given in the notes) apart from what I've just said.
Just another note, and I won't harp on it, but p-values are not all they are cracked up to be. A good post discussing this can be found here. I like Harold Jeffreys's quote about p-values (from his book Theory of Probability): "A null hypothesis may be rejected because it did not predict something that was not observed" (this is because p-values ask for the probability of events more extreme than what was observed).