SNP genotype coding in regression

I would like to analyze some biological traits with a regression model.
The response variable is continuous. One important independent variable is the SNP genotype (wild type, heterozygous, or homozygous variant). There are different ways to code it.
It can be treated as a nominal or an ordinal variable (e.g. 1, 2, 3). Is anyone familiar with the difference, and are there any classic references on it?
Thank you for any suggestions.

If you treat the variable as ordinal (a numeric allele count), you are assuming a gene-dosage effect: each additional copy of the allele shifts the mean response by the same amount. This is essentially a one-degree-of-freedom test, since you are testing whether the slope of the regression line is significantly different from $0$. If you treat the variable as nominal, you are not assuming any gene-dosage effect; instead you are doing a one-way ANOVA with 3 groups, which is a two-degrees-of-freedom test. The gene-dosage model is more powerful when the effect really is roughly additive, because it uses information about the genotype groups (whether the group has 0, 1 or 2 copies of the wild-type allele), whereas in the categorical approach the model knows nothing about the genotype groups (they could just as well be called A, B and C). Treating the genotype as ordinal is generally the preferred approach. I should also mention that if you believe, for example, that the wild-type allele is dominant, then you can merge the heterozygous individuals into the wild-type homozygous group and treat them as one group.
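To make the difference between the codings concrete, here is a minimal sketch in plain Python (standard library only). The function names `additive_f` and `genotypic_f` and the six-observation toy data set are made up for illustration; they are not from any particular package or study. Genotypes are coded as the number of wild-type allele copies (0, 1 or 2).

```python
from statistics import fmean


def additive_f(genotypes, y):
    """Gene-dosage model: regress y on allele count.

    Returns (F statistic with 1 numerator df, estimated slope).
    """
    n = len(y)
    g_bar, y_bar = fmean(genotypes), fmean(y)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    sxy = sum((g - g_bar) * (v - y_bar) for g, v in zip(genotypes, y))
    slope = sxy / sxx
    ss_total = sum((v - y_bar) ** 2 for v in y)
    ss_model = slope * sxy            # sum of squares explained by the line
    ss_resid = ss_total - ss_model
    return (ss_model / 1) / (ss_resid / (n - 2)), slope


def genotypic_f(genotypes, y):
    """Nominal model: one-way ANOVA across genotype groups.

    Returns the F statistic with (number of groups - 1) numerator df.
    """
    n = len(y)
    y_bar = fmean(y)
    groups = {}
    for g, v in zip(genotypes, y):
        groups.setdefault(g, []).append(v)
    k = len(groups)
    ss_between = sum(len(vals) * (fmean(vals) - y_bar) ** 2
                     for vals in groups.values())
    ss_within = sum((v - fmean(vals)) ** 2
                    for vals in groups.values() for v in vals)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical toy data: group means rise linearly (2, 3, 4) with allele count.
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]

f_add, slope = additive_f(genotypes, trait)
f_geno = genotypic_f(genotypes, trait)

# Dominant coding: heterozygotes merged with wild-type homozygotes.
dominant = [1 if g >= 1 else 0 for g in genotypes]
f_dom, diff = additive_f(dominant, trait)

print(f"additive:  F(1, 4) = {f_add:.3f}, slope = {slope:.3f}")
print(f"genotypic: F(2, 3) = {f_geno:.3f}")
print(f"dominant:  F(1, 4) = {f_dom:.3f}, group difference = {diff:.3f}")
```

In this toy data set the group means are exactly linear in allele count, so both models explain the same sum of squares (4 out of 10). The additive model spends only one degree of freedom on it and gets F = 2.667, while the genotypic model spends two and gets F = 1.0. With real data you would normally fit both models in a statistics package and obtain p-values from the corresponding F distributions.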
