Solved – Puzzlingly large residual standard error

I am trying to do a linear regression with 6 independent variables and 1 dependent variable.

I used this formula in R to run the linear regression:

multilm <- lm(insdata$Payment ~ insdata$Bonus + insdata$Make + insdata$Kilometres + insdata$Zone + insdata$Claims + insdata$Insured)
summary(multilm)

My results don't look right compared to the explanations I have seen. Here are my results:

Call:
lm(formula = insdata$Payment ~ insdata$Bonus + insdata$Make +
    insdata$Kilometres + insdata$Zone + insdata$Claims + insdata$Insured)

Residuals:
    Min      1Q  Median      3Q     Max
-806775  -16943   -6321   11528  847015

Coefficients:
                      Estimate  Std. Error t value  Pr(>|t|)
(Intercept)        -21733.7415   6338.1119  -3.429  0.000617 ***
insdata$Bonus        1182.8983    773.7481   1.529  0.126462
insdata$Make         -754.2676    610.6802  -1.235  0.216917
insdata$Kilometres   4768.5641   1085.7279   4.392 0.0000118 ***
insdata$Zone         2322.8967    773.5080   3.003  0.002703 **
insdata$Claims       4315.8778     18.9465 227.793   < 2e-16 ***
insdata$Insured        27.8802      0.6652  41.913   < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 70830 on 2175 degrees of freedom
Multiple R-squared:  0.9952,    Adjusted R-squared:  0.9952
F-statistic: 7.462e+04 on 6 and 2175 DF,  p-value: < 2.2e-16

The residual standard error is 70830. Isn't that pretty awful? I'm wondering whether there is something else I can test, based on this model, to determine whether this is bad. For one thing, I already know that Payment is highly correlated with both Insured and Claims, but maybe that doesn't matter. Unfortunately, I don't know enough yet to determine this.

The units of residual standard error are precisely those of the response or outcome or (in your terms) dependent variable. This is an international forum and you don't say what currency you are using, but in most currencies I know about I'd expect that large insurance payments could easily be millions, so that a standard error of that size doesn't seem at all worrying in itself. Why do you think it's "pretty awful"? It's on all fours with not knowing whether someone with height 80 is tall. You need to know the units. If the units are inches, the person is tall. If the units are cm, not so. If the units are feet, there is an error somewhere.

That's true with all quantitative analysis. In regression your immediate reference is the standard deviation of the response, not given here but easy to get.
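As a minimal sketch of that comparison, the R code below sets the residual standard error next to the standard deviation of the response. The data are simulated stand-ins (the original insdata is not available here), so the variable names and every number are illustrative assumptions, not results from the question.

```r
# Simulated stand-in data: names echo the question, values are invented.
set.seed(1)
n <- 500
claims  <- rpois(n, lambda = 50)
payment <- 4000 * claims + rnorm(n, sd = 50000)
d <- data.frame(payment, claims)

fit <- lm(payment ~ claims, data = d)

sd_response <- sd(d$payment)  # spread of the response before any modelling
rse         <- sigma(fit)     # residual standard error, in the units of payment

# A ratio well below 1 means the model removes much of the raw spread.
c(sd_response = sd_response, rse = rse, ratio = rse / sd_response)
```

With the real data, the analogous calls would presumably be sd(insdata$Payment) and sigma(multilm).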

In fact, you do have evidence on how bad that residual standard error is. It's bundled up in the $R^2$ of 0.9952, which is spectacularly good; unfortunately, it is so spectacularly good that it is likely too good to be true at least in the sense of interestingly or helpfully true. Are there one or more enormous outliers that are dominating results? Do some of the predictors make a good regression an unhelpful tautology?
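To make that link concrete: for a least-squares fit with an intercept, the adjusted $R^2$ equals 1 minus the ratio of the squared residual standard error to the sample variance of the response. The short R sketch below uses only the two figures printed in the summary to back out the implied standard deviation of Payment. Because the adjusted $R^2$ is reported to only four decimals, the result is a rough approximation.

```r
# Figures copied from the summary(multilm) output in the question.
rse    <- 70830    # residual standard error
adj_r2 <- 0.9952   # adjusted R-squared, rounded to four decimals in the printout

# adj_r2 = 1 - rse^2 / var(y)  =>  sd(y) = rse / sqrt(1 - adj_r2)
implied_sd <- rse / sqrt(1 - adj_r2)
implied_sd  # roughly 1.02 million, in the units of Payment
```

On this reckoning, the residual standard error is about 7% of the spread of the response, which is the same information the $R^2$ carries.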

In moving further you might consider the following:

  1. With data of this kind I would expect to get better results by working on a logarithmic scale, preferably with a generalized linear model that guarantees positive predictions (which your present model does not!). Very likely, some of your predictors should be transformed too. This would help with nonlinearity, outliers if there are any, and so forth.

  2. It seems that some of your predictors are not helpful and some only trivially helpful. Unless you're explicitly testing a hypothesis about their effect, you'd be better off without them. I would start with log Payment versus log Claims and then add log Insured.
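The two suggestions above might be sketched in R as follows. This is one possible reading, not a definitive recipe: the data are simulated, the Gamma family is an assumption (the answer names only a generalized linear model with positive predictions), and both payments and claims are assumed strictly positive. Real data containing zeros would need separate handling before taking logarithms.

```r
# Simulated stand-in data: names echo the question, values are invented.
set.seed(2)
n <- 400
insured <- rlnorm(n, meanlog = 5, sdlog = 1.5)      # invented exposure
claims  <- rpois(n, lambda = 0.05 * insured) + 1    # +1 keeps log() finite in this toy data
payment <- 4000 * claims * rlnorm(n, sdlog = 0.3)   # strictly positive payments
d <- data.frame(payment, claims, insured)

# Start with log Payment versus log Claims, then add log Insured.
fit_log1 <- lm(log(payment) ~ log(claims), data = d)
fit_log2 <- lm(log(payment) ~ log(claims) + log(insured), data = d)
anova(fit_log1, fit_log2)  # does log Insured add anything beyond log Claims?

# Alternative: a GLM with a log link, so predictions are positive by construction.
fit_glm <- glm(payment ~ log(claims) + log(insured),
               family = Gamma(link = "log"), data = d)
range(fitted(fit_glm))     # fitted values on the original payment scale
```

The log-scale linear model and the log-link GLM answer slightly different questions (mean of the log versus log of the mean), so their coefficients need not agree exactly.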

Disclaimers: Necessarily I know nothing about your precise aims, which should guide analysis. I have no special expertise with insurance data. It sounds banal that there are bigger payouts on bigger claims but the precise relation could still be interesting.
