I am trying to do a linear regression with 6 independent variables and 1 dependent variable.
I used this formula in R to run the linear regression:
multilm <- lm(insdata$Payment ~ insdata$Bonus + insdata$Make + insdata$Kilometres +
    insdata$Zone + insdata$Claims + insdata$Insured)
summary(multilm)
My results don't look right compared to the explanations I have seen. Here are my results:
Call:
lm(formula = insdata$Payment ~ insdata$Bonus + insdata$Make +
    insdata$Kilometres + insdata$Zone + insdata$Claims + insdata$Insured)

Residuals:
    Min      1Q  Median      3Q     Max
-806775  -16943   -6321   11528  847015

Coefficients:
                      Estimate  Std. Error t value  Pr(>|t|)
(Intercept)        -21733.7415   6338.1119  -3.429  0.000617 ***
insdata$Bonus        1182.8983    773.7481   1.529  0.126462
insdata$Make         -754.2676    610.6802  -1.235  0.216917
insdata$Kilometres   4768.5641   1085.7279   4.392 0.0000118 ***
insdata$Zone         2322.8967    773.5080   3.003  0.002703 **
insdata$Claims       4315.8778     18.9465 227.793   < 2e-16 ***
insdata$Insured        27.8802      0.6652  41.913   < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 70830 on 2175 degrees of freedom
Multiple R-squared:  0.9952,    Adjusted R-squared:  0.9952
F-statistic: 7.462e+04 on 6 and 2175 DF,  p-value: < 2.2e-16
The residual standard error of 70830: isn't that pretty awful? I'm wondering if there is something else I can test, based on this model, to determine whether it is bad. For one thing, I already know that there is a pretty high correlation between Payment and Insured and between Payment and Claims, but maybe that doesn't matter. Unfortunately, I don't know enough yet to determine this.
Best Answer
The units of residual standard error are precisely those of the response or outcome or (in your terms) dependent variable. This is an international forum and you don't say what currency you are using, but in most currencies I know about I'd expect that large insurance payments could easily be millions, so that a standard error of that size doesn't seem at all worrying in itself. Why do you think it's "pretty awful"? It's on all fours with not knowing whether someone with height 80 is tall. You need to know the units. If the units are inches, the person is tall. If the units are cm, not so. If the units are feet, there is an error somewhere.
That's true of all quantitative analysis. In regression, your immediate reference is the standard deviation of the response, which is not given here but is easy to get.
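In R the comparison takes two lines (a minimal sketch, assuming the multilm object and insdata frame from your question):

sd(insdata$Payment)   # standard deviation of the response, the natural yardstick
sigma(multilm)        # residual standard error, 70830 here

If the residual standard error is far below the raw standard deviation of Payment, the regression has accounted for most of the variation, whatever the absolute size of the number.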
In fact, you do have evidence on how bad that residual standard error is. It's bundled up in the $R^2$ of 0.9952, which is spectacularly good; unfortunately, it is so spectacularly good that it is probably too good to be true, at least in the sense of being interestingly or helpfully true. Are there one or more enormous outliers that are dominating the results? Do some of the predictors make a good regression an unhelpful tautology?
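One quick way to look for such dominating points, again assuming the fitted multilm object:

par(mfrow = c(2, 2))
plot(multilm)                          # residual, Q-Q, scale-location, leverage plots
tail(sort(cooks.distance(multilm)))    # the most influential observations

A handful of Cook's distances far above the rest would be a strong hint that the fit, and that $R^2$, are being driven by a few cases.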
In moving further you might consider the following:
With data of this kind I would expect to get better results by working on logarithmic scale, preferably with a generalized linear model that guarantees positive predictions (which your present model does not!). Very likely, some of your predictors should be transformed too. This would help with nonlinearity, outliers if there are any, and so forth.
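A minimal sketch of that idea, assuming insdata as above; the subset name posdata is mine, and a Gamma response must be strictly positive, so zero rows are dropped here for illustration:

posdata <- subset(insdata, Payment > 0 & Claims > 0 & Insured > 0)
glmfit  <- glm(Payment ~ log(Claims) + log(Insured),
               family = Gamma(link = "log"), data = posdata)
summary(glmfit)

The log link keeps fitted payments positive by construction; the Gamma family here is illustrative, not a recommendation tailored to your data.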
It seems that some of your predictors are not helpful and some only trivially helpful. Unless you're explicitly testing a hypothesis about their effect, you'd be better off without them. I would start with log Payment versus log Claims and then add log Insured.
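As a concrete starting point (same caveat about positive values, reusing the posdata subset sketched above):

fit1 <- lm(log(Payment) ~ log(Claims), data = posdata)
fit2 <- update(fit1, . ~ . + log(Insured))
anova(fit1, fit2)   # does log Insured add anything beyond log Claims?

summary(fit1) alone may already tell most of the story.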
Disclaimers: Necessarily, I know nothing about your precise aims, which should guide the analysis. I have no special expertise with insurance data. It sounds banal that there are bigger payouts on bigger claims, but the precise relation could still be interesting.