Solved – R Formula that only uses a subset of a factor

What is the least painful way to do the following in R? We want to run a model with a formula like this:

Model #1:  Speed ~ male + female 

Unfortunately, our data frame has just a single column, Gender, which has levels 'female', 'male', and 'unknown'. We could then write a formula like this:

Model #2:  Speed ~ Gender 

However, we don't want to treat 'unknown' as its own gender value. We want the model's semantics to be such that the 'unknown' rows will comprise the baseline for the model, and then the effects of the 'male' and 'female' variables will go on top of that. We do NOT want to simply select a subset of rows that are either 'female' or 'male' and run model #2; we need to include the 'unknown' levels as baseline. How can this be done?

(I understand rearranging the data frame would allow for running model #1. Assume for the moment that this is not feasible.)

Thanks!

If I understand correctly what you're asking, you'll want to use the following code:

yourdata$Gender <- factor(yourdata$Gender, levels = c('unknown', 'male', 'female'))

and then fit your model. I'll use a basic linear model here as an example.

linearmod <- lm(Speed ~ Gender, data = yourdata)

This fits a model of the form $\hat{y} = \beta_0 + \beta_1 \cdot \text{male} + \beta_2 \cdot \text{female}$, where $\text{male}$ and $\text{female}$ are indicator (dummy) variables that take only the values 0 or 1. For an 'unknown' row both indicators are 0, so $\beta_0$ is the predicted speed ($\hat{y}$) of a person of unknown gender. Similarly, the predicted speed of a person known to be male is $\beta_0 + \beta_1$, and that of a person known to be female is $\beta_0 + \beta_2$.

The purpose of the factor() call is to order the levels the way you want, with 'unknown' first as the baseline, instead of alphabetically, which is the default. Setting the base level to 'unknown' doesn't change the fitted values or the overall fit of the model produced by lm(); it only changes the parameterization, so that the reported coefficients directly compare 'male' to 'unknown' and 'female' to 'unknown'. Alternatively, you could use the relevel() function to change the base level of your factor variable.
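To make the above concrete, here is a minimal sketch on simulated data. The data frame contents, the seed, and the effect sizes are invented purely for illustration; only the Speed and Gender column names and the three levels come from the question. It shows both the factor() and relevel() routes, and checks that the intercept is the mean of the 'unknown' rows and that the choice of baseline leaves the fitted values unchanged.

```r
# Simulated data for illustration only; the numbers are made up.
set.seed(1)
yourdata <- data.frame(
  Gender = rep(c("female", "male", "unknown"), each = 20),
  stringsAsFactors = TRUE
)
yourdata$Speed <- 10 +
  2 * (yourdata$Gender == "male") +
  1 * (yourdata$Gender == "female") +
  rnorm(nrow(yourdata), sd = 0.5)

# The default level order is alphabetical, so 'female' would be the baseline.
levels(yourdata$Gender)

# Option A: spell out the full level order, baseline first.
releveled_a <- factor(yourdata$Gender, levels = c("unknown", "male", "female"))

# Option B: move only the reference level to the front.
releveled_b <- relevel(yourdata$Gender, ref = "unknown")

# Fit with 'unknown' as the baseline (using option A here).
yourdata$Gender <- releveled_a
fit <- lm(Speed ~ Gender, data = yourdata)
coef(fit)  # names: (Intercept), Gendermale, Genderfemale

# The intercept equals the mean Speed of the 'unknown' rows.
unknown_mean <- mean(yourdata$Speed[yourdata$Gender == "unknown"])
all.equal(unname(coef(fit)["(Intercept)"]), unknown_mean)

# Refit with the alphabetical order ('female' as baseline) for comparison:
# the coefficients differ, but the fitted values are the same.
fit_default <- lm(
  Speed ~ factor(Gender, levels = c("female", "male", "unknown")),
  data = yourdata
)
all.equal(unname(fitted(fit)), unname(fitted(fit_default)))
```

Options A and B differ only in the order of the non-baseline levels, which affects the order in which coefficients are printed but not their values.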
