How to handle a data set with a high number of 0's? I am trying to predict automobile insurance claims. If claim == 1, then claim_amount is a positive integer; otherwise claim == 0 and claim_amount == 0. About 90% of the rows are zeros. How can I develop a predictive model for both claim and claim_amount?

I tried the zeroinfl method from the CRAN pscl R package, but the accuracy is very low. The solution should use Python or R. The data has around 22000 rows, 7 predictor variables, and 2 target variables (claim, claim_amount).
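For reference, a zeroinfl call on data like this typically looks along the lines of the sketch below; the data frame claims and the predictor names x1 ... x3 are placeholders, not the actual columns.

library(pscl)

## Minimal sketch of a zero-inflated negative binomial fit.
## `claims`, `x1`, `x2`, `x3` are hypothetical names for illustration.
fit <- zeroinfl(claim_amount ~ x1 + x2 + x3 | x1 + x2 + x3,
                data = claims, dist = "negbin")
summary(fit)

## Expected claim amount and probability from the zero-inflation component:
pred_amount <- predict(fit, newdata = claims, type = "response")
pred_zero   <- predict(fit, newdata = claims, type = "zero")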
Best Answer
Zero inflation makes things difficult, and we don't know what your expectations for accuracy are. A simple way to let the algorithm handle the zero inflation itself would be a tree model (packages rpart or party) or a lot of trees (package randomForest or party). 22000 rows are a lot; if 10% of them are non-zero and there are 7 predictor variables, this may even be enough for a sensible neural net.
More R packages for machine learning are listed at https://CRAN.R-project.org/view=MachineLearning
In the following simulated example you can see how well a simple tree recovers both the different rules that generate zeros and the rule that generates the non-zero amounts:
library(rpart)
library(rpart.plot)

## simulate predictors; the response is filled in below
expl.data <- data.frame(A = sample(1:3, 22000, TRUE),
                        B = sample(1:3, 22000, TRUE),
                        C = sample(1:10, 22000, TRUE),
                        D = runif(22000, 0, 100),
                        response = rep(NA, 22000))

## several rules produce zeros, otherwise a positive amount depending on A
rules <- function(A, B, C, D){
  if (D < 20) return(0)
  if (D > 80) return(0)
  if (C < 3)  return(0)
  if (A == 1 & B == 1) return(0)
  return(sample(1:20 * A, 1))
}

for (i in 1:nrow(expl.data))  # this can be done faster, this is most readable
  expl.data$response[i] <- rules(expl.data$A[i], expl.data$B[i],
                                 expl.data$C[i], expl.data$D[i])

## fit a regression tree and plot it
prp(rpart(response ~ ., data = expl.data))
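If a single tree turns out to be too coarse, the same idea extends to the "lot of trees" option mentioned above. Below is a rough sketch fitting randomForest to the simulated expl.data from the example; ntree and nodesize are illustrative values, not tuned settings.

library(randomForest)

## regression forest on the simulated data built above
rf <- randomForest(response ~ ., data = expl.data, ntree = 200, nodesize = 20)

## variable importance shows which predictors drive the zeros and the amounts
importance(rf)

## predictions work the usual way
head(predict(rf, newdata = expl.data))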