I want to train a linear regression model to predict a non-linear variable. This is how the two independent variables correlate with the response (points are jittered):

And the residuals against the fitted values:

Most of the response values are zero. The result is very strong heteroscedasticity:

```
	studentized Breusch-Pagan test

data:  model
BP = 55483.84, df = 2, p-value < 2.2e-16
```

even though the predictors are strongly correlated with the response:

```
Call:
lm(formula = response ~ predictor1 + predictor2, data = train_predictors)

Residuals:
    Min      1Q  Median      3Q     Max
-7.6996 -0.0268 -0.0238 -0.0182  4.8785

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.748e-02  2.825e-04   97.28   <2e-16 ***
predictor1   8.491e-05  6.574e-07  129.16   <2e-16 ***
predictor2  -3.934e-10  8.298e-12  -47.41   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1561 on 498498 degrees of freedom
Multiple R-squared:  0.0365,	Adjusted R-squared:  0.0365
F-statistic:  9442 on 2 and 498498 DF,  p-value: < 2.2e-16
```

Should I consider adopting non-linear models, or could I first try correcting the non-linearity of the response?


#### Best Answer

I don't know the details of your model, but in my opinion you need to deal with the large number of zero responses. Look into compound models with a point mass at zero, such as the Tweedie model.
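As a rough illustration of that suggestion, here is a minimal sketch of a Tweedie GLM in R. It makes several assumptions that are not in the question: the data are simulated (a compound Poisson-gamma response with many exact zeros), the `statmod` package is installed to provide the `tweedie()` family, a log link is used, and the variance power is fixed at 1.5 rather than estimated. The variable names only mirror the ones in the question.

```r
# Sketch only: simulated data, not the asker's data set.
library(statmod)  # assumed to be installed; provides the tweedie() GLM family

set.seed(1)
n <- 5000
predictor1 <- runif(n, 0, 2)
predictor2 <- runif(n, 0, 2)

# True mean on the log scale (coefficients chosen arbitrarily for the demo)
mu <- exp(-5 + 1.0 * predictor1 - 0.5 * predictor2)

# Compound Poisson-gamma draw with variance power p and dispersion phi:
# a Poisson number of events, each contributing a gamma-distributed amount.
p   <- 1.5
phi <- 1
lambda    <- mu^(2 - p) / (phi * (2 - p))   # Poisson rate
shape_one <- (2 - p) / (p - 1)              # gamma shape per event
scale_one <- phi * (p - 1) * mu^(p - 1)     # gamma scale per event

n_events <- rpois(n, lambda)
response <- numeric(n)                      # stays exactly 0 when there are no events
pos <- n_events > 0
response[pos] <- rgamma(sum(pos),
                        shape = n_events[pos] * shape_one,
                        scale = scale_one[pos])

zero_share <- mean(response == 0)           # share of exact zeros in the response

# Tweedie GLM: link.power = 0 is the log link; var.power is assumed known here
fit <- glm(response ~ predictor1 + predictor2,
           family = tweedie(var.power = 1.5, link.power = 0))

print(zero_share)
print(coef(fit))
```

With real data the variance power would not be known in advance. It is usually chosen between 1 and 2 when the response has exact zeros and positive continuous values, for example by profile likelihood (the `tweedie` package has `tweedie.profile()` for this).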
