I am currently working on a project at work that attempts to predict an environmental change variable. I am personally not a huge fan of the project, but I still want to do the best job possible. Let me first describe the properties of the data, and then state my question.

The environmental variable we are trying to model is continuous and ranges from 0 to 25 (it can take values such as 0.1345, 1.2335 or 5.674). It also has a significant mass at zero (around 30% of the data are zero values), and most of the remaining data lie strictly between 0 and 1. To complicate things further, around 3% of the data have extreme values greater than 2. In my opinion, the distribution of the data resembles a Tweedie distribution.

We have around 5 million observations in the dataset. We are going to predict the environmental variable using a set of eight explanatory variables.

We have been trying to model the environmental variable for a week, and the predictive power of our results has been meager. Here are the different modeling approaches I have used:

GLM with a Tweedie distribution (glm function in R). This approach overestimates the environmental change when the variable has low values. Most values between 0 and 1 are significantly overestimated, and the model does not predict high values very well either. The GLM approach produces a very bad fit for our data.

Generalized Additive Model with a Tweedie distribution (bam function from the mgcv package in R). This approach fits the data slightly better than the model above, but still significantly overestimates values in the low range.

Regression tree model (rpart in R). The regression tree predicts values in the lower range of the distribution most accurately, but it fails to predict zero values and also performs poorly for values greater than 2.

Boosted regression model (dismo package in R using gbm.step). The model significantly overestimates values in the lower range of the distribution. For example, if the true value is 0.23, it predicts 1.67, and so forth. I believe the gbm.step model to be inadequate for our data, since the package only supports the bernoulli (=binomial), poisson, laplace or gaussian families, none of which accurately describes our distribution.

Since our dataset is very large, I split the data 50/50 into a training and a test dataset, and evaluated model fit using a variety of test statistics on the test dataset.

Knowing the structure of our data, can anyone think of other modeling approaches that might produce better predictive results? I am still new to machine learning techniques, and I hope that someone might know of other techniques that could be employed.

My strongest programming language is R, but I can also do this analysis in Python or Stata.


#### Best Answer

There are still a lot of questions that would need answering if someone wanted to sit down and really plan out a solution, mostly relating to the quality of the inputs / data and the needed accuracy / acceptability of certain types of errors.

Based on your description, I would suggest you stop trying to model the problem as a whole and start modeling parts of it.

Example: you mention 30% of your data is exactly 0. If you can accurately solve the binary classification problem (is the output exactly zero or non-zero?), that would give you a powerful way to make progress. Then you can focus on building a model on the remainder of the data, or further break it down into parts.
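As a minimal sketch of the split described above, here is one way to wire the two parts together in Python. Everything named here is hypothetical: `TwoPartModel`, the `prob_nonzero` method, and the two constant placeholder estimators exist only to keep the example self-contained and runnable. In practice you would swap in real models for both stages (with a scikit-learn classifier, for instance, the non-zero probability would be the positive-class column of `predict_proba`).

```python
from statistics import fmean


class ConstantNonzeroClassifier:
    """Placeholder stage 1 (hypothetical): predicts the training share of non-zero targets."""

    def fit(self, X, is_nonzero):
        self.p_ = fmean(is_nonzero)
        return self

    def prob_nonzero(self, X):
        return [self.p_ for _ in X]


class ConstantPositiveRegressor:
    """Placeholder stage 2 (hypothetical): predicts the training mean of the positive targets."""

    def fit(self, X, y):
        self.mean_ = fmean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class TwoPartModel:
    """Stage 1 decides zero vs. non-zero; stage 2 models the size of non-zero values."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        # Stage 1 is trained on every row with a binary target.
        is_nonzero = [1 if value > 0 else 0 for value in y]
        self.classifier.fit(X, is_nonzero)
        # Stage 2 only ever sees the rows with a strictly positive target.
        positive_rows = [(x, value) for x, value in zip(X, y) if value > 0]
        self.regressor.fit([x for x, _ in positive_rows],
                           [value for _, value in positive_rows])
        return self

    def predict_expected(self, X):
        # E[y | x] = P(y > 0 | x) * E[y | y > 0, x]
        probs = self.classifier.prob_nonzero(X)
        sizes = self.regressor.predict(X)
        return [p * m for p, m in zip(probs, sizes)]

    def predict_hard(self, X, threshold=0.5):
        # Emit an exact zero whenever stage 1 considers a zero more likely.
        probs = self.classifier.prob_nonzero(X)
        sizes = self.regressor.predict(X)
        return [m if p >= threshold else 0.0 for p, m in zip(probs, sizes)]


# Toy data, invented for illustration: half of the targets are exactly zero.
X_demo = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y_demo = [0.0, 0.0, 0.5, 1.5, 0.0, 4.0]

model = TwoPartModel(ConstantNonzeroClassifier(), ConstantPositiveRegressor()).fit(X_demo, y_demo)
expected = model.predict_expected(X_demo)
hard = model.predict_hard(X_demo, threshold=0.6)
```

`predict_expected` is the natural choice if you are judged on average error, while `predict_hard` can return exact zeros, which a single regression model rarely does. Which one is appropriate depends on the types of errors you can accept.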

The other suggestion would be to learn / explore more of the fitting options in R. Apart from the GLM (which is only linear) and the GAM, the models you tried are all tree based / related. There are many other algorithms that may do better or worse on your data.
