Solved – How to handle a zero-inflated data set

How do I handle a data set with a high number of 0's? I am trying to predict people's automobile insurance claims, and most of them are zeros. If claim == 1, then claim_amount is a positive integer; otherwise claim = 0 and claim_amount = 0. About 90% of the rows are 0's. How can I develop a predictive model that predicts both claim and claim_amount?

I tried the zeroinfl function from the pscl R package on CRAN, but the accuracy is very low. The solution should use Python or R. The data has around 22000 rows, with 7 predictor variables and 2 target variables (claim, claim_amount).

Zero inflation makes things difficult, and we don't know what accuracy you expect. A simple way to let the algorithm handle the zero inflation is a tree model (packages rpart or party) or many trees (packages randomForest or party). 22000 rows is a lot; if 10% of them are non-zero and there are 7 predictor variables, this may even be enough for a sensible neural net.

More R packages on machine learning at

In the following simulated example you can see how well a simple tree models the different rules for generating zeros, as well as the rule for generating the non-zero numbers:

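    library(rpart)
    library(rpart.plot)

    # NOTE: the data frame's name was lost in the original post; "df" is a placeholder.
    df <- data.frame(A = sample(1:3, 22000, TRUE),
                     B = sample(1:3, 22000, TRUE),
                     C = sample(1:10, 22000, TRUE),
                     D = runif(22000, 0, 100),
                     response = rep(NA, 22000))

    # Rules that generate the response: four zero rules, then a random number
    rules <- function(A, B, C, D) {
      if (D < 20) return(0)
      if (D > 80) return(0)
      if (C < 3) return(0)
      if (A == 1 && B == 1) return(0)
      return(sample(1:20 * A, 1))  # one draw from A, 2A, ..., 20A
    }

    # this can be done faster, this is most readable
    for (i in 1:nrow(df)) {
      df$response[i] <- rules(df$A[i], df$B[i], df$C[i], df$D[i])
    }

    # Fit a regression tree on all predictors and plot it
    prp(rpart(response ~ ., data = df))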

Which leads to:

[Figure: graph of the decision tree detecting the zero rules etc.]
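The simulated example above models a single response. For the two targets in the question (claim and claim_amount), one possible way to apply the same tree idea is a two-part setup: a classification tree for claim, plus a regression tree for claim_amount fitted only on the rows that have a claim. The following is only a minimal sketch under stated assumptions, not a recommended final model: the data is simulated, and the column names (age, region, mileage) and the rules that generate claims are hypothetical stand-ins for the real predictors.

```r
library(rpart)

set.seed(1)  # fixed seed so the sketch is reproducible

# Simulated stand-in for the insurance data. All column names and
# generating rules here are hypothetical, not taken from the real data.
n <- 22000
sim <- data.frame(
  age     = sample(18:80, n, TRUE),
  region  = factor(sample(c("north", "south", "west"), n, TRUE)),
  mileage = runif(n, 0, 100)
)

# Roughly 10% of rows get a claim: mostly young, high-mileage drivers.
p_claim   <- ifelse(sim$mileage > 70 & sim$age < 40, 0.9, 0.01)
sim$claim <- rbinom(n, 1, p_claim)

# Positive amount when there is a claim, 0 otherwise.
sim$claim_amount <- ifelse(
  sim$claim == 1,
  round(500 + 20 * sim$mileage + rnorm(n, 0, 100)),
  0
)

# Part 1: classification tree for claim vs. no claim.
claim_tree <- rpart(factor(claim) ~ age + region + mileage,
                    data = sim, method = "class")

# Part 2: regression tree for the amount, fitted on claim rows only,
# so the many zeros do not dominate the amount model.
amount_tree <- rpart(claim_amount ~ age + region + mileage,
                     data = sim[sim$claim == 1, ], method = "anova")

# Combine: predicted amount is 0 unless a claim is predicted.
pred_claim  <- predict(claim_tree, sim, type = "class")
pred_amount <- ifelse(pred_claim == "1", predict(amount_tree, sim), 0)

# In-sample confusion table for the claim indicator.
print(table(actual = sim$claim, predicted = pred_claim))
```

Note that the confusion table here is computed on the training data, so it is optimistic. For a real accuracy estimate you would hold out part of the 22000 rows and evaluate both trees on that part only.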
