How should I handle a data set with a high number of zeros? I am trying to predict automobile insurance claims, and most of the records are zeros. If `claim == 1`, then `claim_amount` is a positive integer; otherwise `claim = 0` and `claim_amount = 0`. About 90% of the observations are zeros. How can I develop a predictive model that predicts both `claim` and `claim_amount`?

I tried the `zeroinfl` function from the `pscl` R package on CRAN, but the accuracy is very low. The solution should use Python or R. The data has around 22000 rows with 7 predictor variables and 2 target variables (`claim`, `claim_amount`).


#### Best Answer

Zero inflation makes things difficult, and we don't know what accuracy you expect. A simple way to let the algorithm handle the zero inflation is a tree model (packages `rpart` or `party`) or many trees (packages `randomForest` or `party`). 22000 rows are a lot; if 10% of them are non-zero and there are 7 predictor variables, this may even be enough for a sensible neural net.
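As a rough illustration of how the tree suggestion could map onto the two targets in the question, the sketch below fits one regression tree to `claim_amount` and one classification tree to `claim`. The data frame `policies`, the predictors `age` and `vehicle`, and all probabilities and amounts are made up for illustration; only the target names come from the question.

```r
library(rpart)

set.seed(1)
n <- 22000

# Hypothetical stand-in for the real data: two invented predictors plus the
# two targets described in the question.
policies <- data.frame(
  age     = sample(18:80, n, replace = TRUE),
  vehicle = factor(sample(c("small", "medium", "large"), n, replace = TRUE))
)

# Invented claim probabilities, chosen so that roughly 90% of rows are zeros.
p_claim <- ifelse(policies$age < 25, 0.60, 0.05)
policies$claim <- rbinom(n, 1, p_claim)
policies$claim_amount <- policies$claim * sample(100:5000, n, replace = TRUE)

# Regression tree fitted directly to the zero-heavy amount.
amount_tree <- rpart(claim_amount ~ age + vehicle, data = policies,
                     method = "anova")

# Classification tree for the binary target.
claim_tree <- rpart(factor(claim) ~ age + vehicle, data = policies,
                    method = "class")

# Expected amount per policy and estimated probability of a claim.
pred_amount <- predict(amount_tree, newdata = policies)
pred_prob   <- predict(claim_tree, newdata = policies, type = "prob")[, "1"]
```

With real data, the formula would list the actual 7 predictors, and predictions should be judged on held-out rows rather than on the training data.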

More R packages for machine learning are listed in the CRAN task view: https://CRAN.R-project.org/view=MachineLearning

In the following simulated example you can see how well a simple tree models both the different rules that generate zeros and the rule that generates the non-zero numbers:

```r
library(rpart)
library(rpart.plot)

expl.data <- data.frame(A = sample(1:3, 22000, TRUE),
                        B = sample(1:3, 22000, TRUE),
                        C = sample(1:10, 22000, TRUE),
                        D = runif(22000, 0, 100),
                        response = rep(NA, 22000))

rules <- function(A, B, C, D) {
  if (D < 20) return(0)
  if (D > 80) return(0)
  if (C < 3) return(0)
  if (A == 1 & B == 1) return(0)
  return(sample(1:20 * A, 1))
}

for (i in 1:nrow(expl.data))  # this can be done faster, this is most readable
  expl.data$response[i] <- rules(expl.data$A[i], expl.data$B[i],
                                 expl.data$C[i], expl.data$D[i])

prp(rpart(response ~ ., data = expl.data))
```
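To put a number on "how well", one option is to compare which rows the tree predicts as zero with which rows are actually zero. The sketch below regenerates the same kind of data in vectorized form (the faster variant the loop comment alludes to) and builds a small confusion table. The names `sim`, `is_zero`, and `pred_zero`, the seed, and the 0.5 cutoff are assumptions made for this illustration, not part of the original example.

```r
library(rpart)

set.seed(42)
n <- 22000

sim <- data.frame(
  A = sample(1:3, n, TRUE),
  B = sample(1:3, n, TRUE),
  C = sample(1:10, n, TRUE),
  D = runif(n, 0, 100)
)

# Same zero rules as in the example above, applied to whole columns at once.
is_zero <- with(sim, D < 20 | D > 80 | C < 3 | (A == 1 & B == 1))

# Non-zero responses are a random multiple of A, drawn from (1:20) * A.
sim$response <- ifelse(is_zero, 0, sample(1:20, n, TRUE) * sim$A)

fit  <- rpart(response ~ ., data = sim)
pred <- predict(fit, newdata = sim)

# A regression tree predicts leaf means, so a leaf containing only zeros
# predicts exactly 0. The 0.5 cutoff is an arbitrary choice.
pred_zero <- pred < 0.5

confusion <- table(actual_zero = is_zero, predicted_zero = pred_zero)
accuracy  <- mean(pred_zero == is_zero)

print(confusion)
print(accuracy)
```

This measures in-sample agreement only; on real claims data a train/test split or cross-validation would give a fairer picture.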
