# R formula that only uses a subset of a factor

What is the least painful way to do the following in R? We want to run a model with a formula like this:

``Model #1:  Speed ~ male + female ``

Unfortunately, our data frame has just a single column, `Gender`, which has levels `'female'`, `'male'`, and `'unknown'`. We could then write a formula like this:

``Model #2:  Speed ~ Gender ``

However, we don't want to treat `'unknown'` as its own gender value. We want the model's semantics to be such that the `'unknown'` rows will comprise the baseline for the model, and then the effects of the `'male'` and `'female'` variables will go on top of that. We do NOT want to simply select a subset of rows that are either `'female'` or `'male'` and run model #2; we need to include the `'unknown'` levels as baseline. How can this be done?

(I understand rearranging the data frame would allow for running model #1. Assume for the moment that this is not feasible.)

Thanks!


If I understand correctly what you're asking, you'll want to use the following code:

``data$Gender <- factor(data$Gender, levels=c('unknown','male','female')) ``

and then fit your model. I'll use a basic linear model here as an example.

``linearmod <- lm(Speed ~ Gender, data=data) ``

This fits a model of the form $\hat{y} = \beta_0 + \beta_1 \cdot \text{male} + \beta_2 \cdot \text{female}$, where $\text{male}$ and $\text{female}$ are indicator (dummy) variables that take only the values 0 and 1. For an 'unknown' row, both $\text{male}$ and $\text{female}$ are 0, so $\beta_0$ is the predicted speed ($\hat{y}$) of a person with unknown gender. Similarly, the predicted speed of a person known to be male is $\beta_0 + \beta_1$, and that of a person known to be female is $\beta_0 + \beta_2$. The purpose of the `factor()` call is to order the levels the way you want (with 'unknown' first, as the baseline) instead of alphabetically, which is the default. Setting the base level to 'unknown' doesn't change the fit produced by `lm()`: the fitted values are the same, and only the parameterisation differs. It does make the output convenient to read, because the coefficients then directly compare 'male' to 'unknown' and 'female' to 'unknown'. Alternatively, you could use the `relevel()` function to change the base level of your factor variable.
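As a minimal sketch of the points above, here is a small self-contained run. The data frame and its values are invented purely for illustration (only the column names `Speed` and `Gender` come from the question), and `fit_default`, `fit`, and `releveled` are just names chosen here.

```r
# Hypothetical data: column names follow the question, values are invented.
data <- data.frame(
  Gender = rep(c("female", "male", "unknown"), each = 4),
  Speed  = c(10, 11, 12, 13,  14, 15, 16, 17,  6, 7, 8, 9)
)

# Default level order is alphabetical, so 'female' would be the baseline.
fit_default <- lm(Speed ~ factor(Gender), data = data)

# Put 'unknown' first so that it becomes the baseline (reference) level.
data$Gender <- factor(data$Gender, levels = c("unknown", "male", "female"))
fit <- lm(Speed ~ Gender, data = data)

# (Intercept) is the mean Speed of the 'unknown' rows; Gendermale and
# Genderfemale are the differences of those groups from 'unknown'.
print(coef(fit))

# Same fitted values either way: only the parameterisation differs.
print(all.equal(unname(fitted(fit)), unname(fitted(fit_default))))

# relevel() only moves the chosen level to the front of the level order.
releveled <- relevel(factor(c("female", "male", "unknown")), ref = "unknown")
print(levels(releveled))
```

With `factor(..., levels = ...)` you control the full level order, whereas `relevel()` changes only the reference level and leaves the remaining levels in their existing order. Either is enough to make 'unknown' the baseline.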
