If I have a model, let's say y = a + b1·male + b2·large + b3·medium + b4·male×large + b5·male×medium, where male is dummy coded 1 for male and 0 for female, and large and medium are dummies (also 0/1 coded) for company size with small as the reference category, how can I test whether the effect of male varies significantly with company size?
I presume that the t-test for the coefficients tells me the significance of the effect of male in the respective dummy category compared to the reference category. So, e.g., the effect of male differs significantly between large and small companies (= p-value for b4). But since the t-test always compares against the reference group, how can I test "in general" whether the effect of male varies with company size?
My intuitive approach would be to run a hierarchical regression with the interactions included in the second block. If the change in R² is significant, this would mean that the interaction with at least one category of the dummy is significant. Is this correct?
And how can I figure out which categories show significant differences? That's particularly hard if a dummy has a large number of categories. Do I have to run repeated regressions, manually changing the reference category each time?
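In R, my block comparison would look something like this (data frame and variable names are made up):

```r
# Made-up data frame 'dat': y is the outcome, male is a 0/1 dummy,
# size is a factor with levels "small", "medium", "large".
fit_main <- lm(y ~ male + size, data = dat)  # block 1: main effects only
fit_int  <- lm(y ~ male * size, data = dat)  # block 2: adds male:size terms

# F-test for the change in R²: significant if at least one
# interaction coefficient differs from zero.
anova(fit_main, fit_int)
```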
Best Answer
It's almost never a good idea to break a (quasi-)continuous variable into categories for regression, as explained nicely on this page. For example, with your cutoff between 49 and 50 employees, do you really think that adding one more employee places a company into a whole new category? You might need to find a transformation of the employee-number scale so that it has a linear relation to your outcome variable, but if you have the actual employee numbers as a measure of company size, it would be more methodologically sound to use the numbers rather than the categorization. If you treat company size/employee number as a continuous variable, then relatively simple interpretations of an interaction term with gender may be possible.
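For illustration, the continuous alternative could look something like this (hypothetical variable names; log() is just one plausible transform for a right-skewed size measure, not a recommendation for your data):

```r
# Hypothetical: n_employees is the raw employee count.
fit_cont <- lm(y ~ male * log(n_employees), data = dat)

# The male:log(n_employees) coefficient tests whether the gender
# effect changes with (log) company size.
summary(fit_cont)
```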
If for some reason you do need to use categories, then you can analyze your linear regression as an ANOVA to get the type of output you desire. In R, for example, you can run a linear regression with the lm() function, then wrap that output in anova() to present the results in a way that tests "in general" for significance of a multi-category factor or its interactions. You will, however, need to pay attention to the way that variance is partitioned, as explained on this page. The default for anova() in R is to use Type I sums of squares, hierarchically associating as much variance as possible with each main effect in the order the effects were specified in the model, then proceeding similarly with interactions. I think that's what you had in mind in the third paragraph of your question. Some statistical software instead uses Type III sums of squares.
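In code, that might look like the following sketch (same hypothetical dat as above; car::Anova() from the car package is one option if you need Type III tests):

```r
fit <- lm(y ~ male * size, data = dat)

# Type I (sequential) sums of squares: the male:size row gives a
# single overall test of the interaction, fitted after the main effects.
anova(fit)

# Type III tests, if required; needs the car package and
# sum-to-zero contrasts to be meaningful:
# options(contrasts = c("contr.sum", "contr.poly"))
# car::Anova(fit, type = "III")
```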
Tests of differences among different categories of a predictor variable are better done in a systematically parallel way rather than by resetting the reference category as you propose. You need to take into account the multiple testing of hypotheses if you do not have just a few pre-specified hypotheses to test. For example, the TukeyHSD() function in R provides a way to do this if the numbers of cases in the different categories are not too imbalanced. TukeyHSD() expects input from an aov() rather than an lm() analysis in R, but aov() is simply a particular wrapper for the underlying functionality of lm(). ANOVA and linear models are essentially just different ways of thinking about the same underlying structure.
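A minimal sketch of that workflow with the same hypothetical dat (TukeyHSD() adjusts the pairwise p-values for the multiple comparisons):

```r
# TukeyHSD() works on factor terms, so code male as a factor here.
dat$male <- factor(dat$male, levels = c(0, 1),
                   labels = c("female", "male"))

fit_aov <- aov(y ~ male * size, data = dat)

# Tukey-adjusted pairwise comparisons among the size categories and
# among the male:size cells.
TukeyHSD(fit_aov, which = c("size", "male:size"))
```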