I realize this might be too general a question, so I'll first describe what I'm doing right now.
I'm working for a virtual insurance company and I have this dataset. It has severity (meaning payment/number of claims), type of car, gender, marital status, age of car, risk type (high and low), company name (ours and others), deductible range. All the variables are categorical. The question I'm trying to answer is: Is our company paying out more on claims than other companies do?
First I suppose I should choose main effects. The way I do it is that, if I can come up with a story or theory, then I include it. For example, I choose to include marital status because probably married people drive more cautiously since they have families. I include company because…well, this is what I need to study.
Now here are my questions:
1. After reading many posts, it occurred to me that merely looking at the test statistics is not a good idea. Sometimes even if the p-value is greater than 0.05, we should still keep the variable. Is there a rule of thumb or something? For example, if the p-value of the test statistic for marital_status is 0.8, then I guess it's obvious that we should not include it. But what about 0.08? 0.07?
2. After selecting main effects, I should probably start considering interactions between these variables. How do I even start? My variable type_of_car has around 20 different values (20 different types of car). I thought about running a regression with all possible interactions and then dropping those with insignificant p-values. That didn't really work out, because every time I dropped one, the p-values of the variables in the new model changed drastically.
I apologize for such a long post. I don't have a solid background in statistics (I'm a math major) and I really have to teach myself all the stuff. Any suggestion on how to proceed would be appreciated!
The long post is fine – it provides context, which is vital.
Regarding your point 1, the typical rule-of-thumb cutoff for the p-value is either 0.05 or 0.01, but you are rejecting these for sound reasons. Choosing another p-value cutoff as a rule of thumb would be just as bad (or good). My view is that you should include a main effect if its effect size is large or if including it changes the effect size of your main independent variable, which is company.
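To make the "does it change the effect of company" check concrete, here is a minimal sketch using only made-up numbers. The records, the `company_gap` helper, and the choice of a simple stratified average (rather than a fitted regression) are all assumptions for illustration. In a real model you would compare the company coefficient with and without the covariate, but the idea is the same.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (company, marital_status, severity). Not real data.
rows = (
    [("ours", "single", 1500.0)] * 3
    + [("ours", "married", 1000.0)]
    + [("others", "single", 1400.0)]
    + [("others", "married", 900.0)] * 3
)


def company_gap(records):
    """Mean severity for 'ours' minus mean severity for 'others'."""
    ours = [s for c, _, s in records if c == "ours"]
    others = [s for c, _, s in records if c == "others"]
    return mean(ours) - mean(others)


# Crude gap: ignores marital status entirely.
crude_gap = company_gap(rows)

# Adjusted gap: average of the within-stratum gaps,
# weighted by each marital-status stratum's share of all records.
strata = defaultdict(list)
for rec in rows:
    strata[rec[1]].append(rec)
adjusted_gap = sum(
    len(recs) / len(rows) * company_gap(recs) for recs in strata.values()
)

print("crude gap:   ", crude_gap)
print("adjusted gap:", adjusted_gap)
```

In this toy data the company gap shrinks a lot once marital status is accounted for, because "ours" insures mostly single drivers. A shift of that size is a reason to keep marital status in the model whatever its p-value is.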
Regarding point 2, I would look at interactions for the same "story" reasons that you chose main effects. An interaction says that the effect of one IV on the DV is different at different levels of the other IV. One interaction that springs to mind in your case is gender and marital status – on the theory that unmarried men will be more accident-prone than either unmarried women or married men.
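The definition of an interaction can be illustrated as a "difference of differences" in cell means. The sketch below uses invented records and a hypothetical `cell_mean` helper. It is not a fitted GLM, just the arithmetic behind what a gender by marital status interaction term would capture.

```python
from statistics import mean

# Hypothetical toy records: (gender, marital_status, severity). Not real data.
rows = [
    ("male", "single", 1500.0), ("male", "single", 1700.0),
    ("male", "married", 950.0), ("male", "married", 1050.0),
    ("female", "single", 1050.0), ("female", "single", 1150.0),
    ("female", "married", 980.0), ("female", "married", 1020.0),
]


def cell_mean(gender, marital):
    """Mean severity in one gender x marital-status cell."""
    return mean(s for g, m, s in rows if g == gender and m == marital)


# Effect of being single (vs. married) within each gender.
single_effect_men = cell_mean("male", "single") - cell_mean("male", "married")
single_effect_women = cell_mean("female", "single") - cell_mean("female", "married")

# Interaction contrast: how much the "single" effect differs between genders.
# A value near zero would suggest the two main effects alone are enough.
interaction = single_effect_men - single_effect_women

print("single effect, men:  ", single_effect_men)
print("single effect, women:", single_effect_women)
print("interaction contrast:", interaction)
```

With around 20 car types, checking a handful of theory-driven contrasts like this one is far more manageable than fitting every possible interaction and pruning by p-value.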
On another note, I was surprised that age of driver was not included.