# Solved – Explanatory variables with many zeros

I am trying to fit a linear model to a price response variable. Many of the predictor variables consist mainly of zeros. For example, one possible predictor variable is "drill holes". Not many parts have a drilled hole, but if they do, it would make sense that it affects the price. I am using the `caret` package in `R` to train the model and choose the appropriate variables. I have already removed all variables with zero variance.

I have found a lot of literature about count data for a response variable with many zeros and zero-inflation models. But what I am wondering is, how should explanatory variables with many zeros (many are NOT count data) be handled? Is there an appropriate transformation? Or are explanatory variables with many zeros allowable since I am dealing with explanatory variables and not the response variable?


You are focusing on zeros as part of the distributions of several predictors, but the central questions for modelling include (a) what kind of response variable you have and (b) what kind of relationship you expect between the response and the predictors or explanatory variables.

Zeros in the predictors themselves rule out little except straight logarithmic transformation.
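A quick check in R shows why the straight logarithm is the one thing ruled out. This is only a sketch: `drill_holes` is a made-up vector standing in for a mostly-zero predictor, not data from the question.

```r
# Hypothetical predictor: number of drilled holes per part, mostly zero.
drill_holes <- c(0, 0, 0, 2, 0, 1, 0, 0, 4, 0)

# A straight log maps every zero to -Inf, which a regression cannot use.
log_holes <- log(drill_holes)
print(log_holes)

# The untransformed values are all finite, so they can enter a model as is.
print(all(is.finite(drill_holes)))
```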

From your description, the starting point is that price is the response and prices are necessarily positive. That immediately suggests a regression model with a log link, and quite possibly Poisson regression. (The fact that price is not a count is secondary here: what matters is that the mean of the response is modelled on the log scale. See for example http://blog.stata.com/tag/poisson-regression/ and its literature for explanation.)
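As a rough illustration of a log-link fit in R, here is a minimal sketch on simulated data. Everything in it is an assumption made for the example: the variable names `drill_holes` and `weight`, the coefficients, and the choice of `quasipoisson` rather than `poisson`. The quasi-Poisson family gives the same coefficient estimates as Poisson regression, but it estimates a dispersion parameter and does not warn about non-integer responses, which is convenient for prices.

```r
set.seed(1)

# Simulated data; all names and coefficients are made up for illustration.
n <- 200
drill_holes <- rbinom(n, size = 1, prob = 0.1) * rpois(n, lambda = 2)
weight      <- runif(n, min = 1, max = 10)

# Positive, non-integer prices generated on a multiplicative (log) scale.
price <- exp(1 + 0.3 * drill_holes + 0.1 * weight + rnorm(n, sd = 0.2))
parts <- data.frame(price, drill_holes, weight)

# Regression with a log link. The mostly-zero predictor enters untransformed.
fit <- glm(price ~ drill_holes + weight,
           family = quasipoisson(link = "log"),
           data = parts)

# exp(coefficient) is the multiplicative change in expected price
# for a one-unit increase in the predictor.
print(exp(coef(fit)))
```

The zeros in `drill_holes` cause no difficulty here: a part with no holes simply contributes nothing from that term, and each additional hole multiplies the expected price by a constant factor.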

From that, how to represent your predictors depends on their relationship with the response as much as, or more than, on their marginal distributions. Your post supplies no information to guide advice on that relationship, but I'd start by including the predictors as they come and then consider whether you need other representations, e.g. roots, squares, or a set of indicator variables.
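The alternative representations mentioned above could be tried along these lines. This is a sketch on simulated data, with hypothetical names throughout; the simulated price depends only on whether a part has any hole at all, so the indicator version is the one expected to do well here. Comparing residual deviances is a crude first look, not a formal model comparison.

```r
set.seed(2)

# Simulated data; names and effect sizes are made up for illustration.
n <- 200
drill_holes <- rbinom(n, size = 1, prob = 0.15) * (1 + rpois(n, lambda = 1.5))
has_hole    <- as.integer(drill_holes > 0)   # indicator: any hole at all
price       <- exp(1 + 0.4 * has_hole + rnorm(n, sd = 0.2))
parts       <- data.frame(price, drill_holes, has_hole)

fam <- quasipoisson(link = "log")

# The same mostly-zero predictor in four representations.
fit_raw  <- glm(price ~ drill_holes,                    family = fam, data = parts)
fit_sqrt <- glm(price ~ sqrt(drill_holes),              family = fam, data = parts)
fit_sq   <- glm(price ~ drill_holes + I(drill_holes^2), family = fam, data = parts)
fit_ind  <- glm(price ~ has_hole,                       family = fam, data = parts)

# Residual deviance of each fit (smaller means a closer fit to these data).
deviances <- sapply(
  list(raw = fit_raw, sqrt = fit_sqrt, square = fit_sq, indicator = fit_ind),
  deviance
)
print(deviances)
```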
