# Solved – How to handle a zero-inflated data set

How do I handle a data set with a large number of zeros? I am trying to predict automobile insurance claims, and most of the records are zeros. If `claim == 1`, then `claim_amount` is a positive integer; otherwise `claim = 0` and `claim_amount = 0`. About 90% of the records are zeros. How can I develop a predictive model that predicts both `claim` and `claim_amount`?

I tried the `zeroinfl` function from the `pscl` R package on CRAN, but the accuracy is very low. The solution should use Python or R. The data has around 22,000 rows, with 7 predictor variables and 2 target variables (`claim`, `claim_amount`).

Zero inflation makes things difficult, and we don't know what accuracy you expect. A simple way to let the algorithm handle the zero inflation is a tree model (packages `rpart` or `party`) or an ensemble of many trees (packages `randomForest` or `party`). 22,000 rows is a lot; if 10% of them are non-zero and there are 7 predictor variables, this may even be enough for a sensible neural net.
```r
library(rpart)
library(rpart.plot)

# Simulated example data: four predictors and an empty response column
expl.data <- data.frame(A = sample(1:3, 22000, TRUE), B = sample(1:3, 22000, TRUE),
                        C = sample(1:10, 22000, TRUE), D = runif(22000, 0, 100),
                        response = rep(NA, 22000))

# Rules that produce many zeros; otherwise a positive value that scales with A
rules <- function(A, B, C, D){
    if(D<20) return(0)
    if(D>80) return(0)
    if(C<3) return(0)
    if(A==1 & B==1) return(0)
    return(sample(1:20*A,1))
}

for (i in 1:nrow(expl.data)) # this can be done faster, this is most readable
    expl.data$response[i] <- rules(expl.data$A[i],
                                   expl.data$B[i],
                                   expl.data$C[i],
                                   expl.data$D[i])

# Fit a regression tree and plot it
prp(rpart(response ~ ., data=expl.data))
```
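To put the accuracy question in perspective, here is a minimal Python sketch (Python being the other language the question allows) that re-creates the simulated rules from the R example and measures how many zeros they produce. It then scores a trivial predictor that always says "no claim". The helper names `simulate_claim`, `make_data`, `zero_share` and `always_zero_accuracy` are made up for this illustration, and the code uses only the standard library. Working through the rules by hand, a row is non-zero with probability 0.6 · 0.8 · 8/9 ≈ 0.43, so roughly 57% of the simulated responses should be zero.

```python
import random


def simulate_claim(a, b, c, d, rng):
    """Python counterpart of the R `rules` function: many rows yield 0."""
    if d < 20 or d > 80:
        return 0
    if c < 3:
        return 0
    if a == 1 and b == 1:
        return 0
    # R's sample(1:20*A, 1) draws one value from {A, 2A, ..., 20A}
    return a * rng.randint(1, 20)


def make_data(n=22000, seed=0):
    """Return n rows of (A, B, C, D, response), mirroring expl.data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.randint(1, 3)
        b = rng.randint(1, 3)
        c = rng.randint(1, 10)
        d = rng.uniform(0, 100)
        rows.append((a, b, c, d, simulate_claim(a, b, c, d, rng)))
    return rows


def zero_share(rows):
    """Fraction of rows whose response is exactly zero."""
    return sum(1 for r in rows if r[-1] == 0) / len(rows)


def always_zero_accuracy(rows):
    """Accuracy of a classifier that predicts 'no claim' for every row."""
    hits = sum(1 for r in rows if r[-1] == 0)
    return hits / len(rows)


rows = make_data()
print(f"share of zeros:       {zero_share(rows):.3f}")
print(f"always-zero accuracy: {always_zero_accuracy(rows):.3f}")
```

The two numbers coincide by construction. On the data described in the question, with about 90% zeros, the same do-nothing predictor would already be about 90% accurate on `claim`, so it helps to compare any model against that baseline rather than judging raw accuracy alone.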