I'm working on building a predictive model for the number of singles a baseball hitter generates over the course of a single game. Since singles per game are count data, I figured the most natural model would be a Poisson regression. While the Poisson model works fairly well, when I cross-validate my model the RMSE is always lower under a Gaussian assumption. This makes me feel like there is some adjustment I need to make to my data or my assumptions, but I can't for the life of me figure out what that might be.

From what I can tell, my data fit a Poisson assumption very well. When I plot the empirical density of singles over a Poisson distribution with lambda equal to the sample mean, I get an almost perfect overlap, so I don't suspect zero inflation. Changing the link from log to sqrt improves the accuracy somewhat, but the Gaussian model still outperforms the Poisson. I'm working with around 4,400 data points and only around 5–10 predictors, so I don't suspect sample size is an issue. I also suspected overdispersion, but quasi-Poisson and negative binomial models only increased or maintained the RMSE. Finally, when I plot the OLS residuals, they clearly do not satisfy the normality assumption of a Gaussian model. (Screenshots included below.)

Why could this possibly be the case? Is the lower RMSE under a Gaussian assumption suggestive of a problem with my data or model? For reference, the difference in RMSE is usually around 0.02. My only goal is to minimize the RMSE of my model; if OLS is better at reducing RMSE, would it be good practice to just ignore the clearly wrong assumptions it makes?


#### Best Answer

Imagine (so that the models are comparable) that we fit the same model for the mean using least squares and Poisson regression (or indeed any other estimation method with the same model for how the mean relates to the predictors).

To simplify the notation further, suppose the model is for the conditional mean and is linear in the predictors ($E(Y\mid x) = x\beta$), so that we're comparing ordinary linear regression with Poisson regression (now with the identity link, because of the previous assumption).

[Further (to simplify ideas), assume the model for the mean is correct, so our estimates are unbiased. This is not necessary, though.]

Then your question amounts to asking "if I use an estimator that *minimizes* the mean square residual, will that smallest-possible mean square residual be smaller than an estimator that does anything else?"

The answer is hopefully now obvious: when you measure fit by the very criterion that one estimator minimizes, that estimator must win when compared with anything else.
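This can be checked directly with a small simulation (a sketch of my own, not part of the answer; all numbers are made up): generate identity-link Poisson data, fit the same linear mean model once by least squares and once by Poisson maximum likelihood, and compare in-sample RMSE. OLS cannot lose, because RMSE is exactly the criterion it minimizes.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.5, 3.0, n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(X @ np.array([1.0, 2.0]))   # identity-link Poisson data

# Least-squares fit of the linear mean model
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Poisson MLE of the *same* mean model: minimize the negative log-likelihood
def nll(beta):
    mu = X @ beta
    return np.inf if np.any(mu <= 0) else np.sum(mu - y * np.log(mu))

beta_pois = minimize(nll, beta_ols, method="Nelder-Mead").x

def rmse(beta):
    return np.sqrt(np.mean((y - X @ beta) ** 2))

rmse_ols, rmse_pois = rmse(beta_ols), rmse(beta_pois)
# rmse_ols <= rmse_pois by construction: least squares minimizes exactly this.
```

(Out of sample the gap shrinks but the same logic applies on average, which is consistent with the small 0.02 difference reported in the question.)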

So the question then becomes *why would you use RMSE to compare the two models?*

Edit: Note that counts whose mean is near 0 will have smaller variance than counts whose mean is larger (indeed, if they're Poisson, the variance will be equal to the mean). So a large value is less "precise" than a value near zero (on average it will be further from the mean of the distribution that generated it). So why would you weight the squared deviation for all points equally?

If you want the most precise estimate (in the MSE sense, say) of the Poisson mean, you want to minimize the MSE of the estimator of that mean (i.e. minimize $E[(\hat{\mu}-\mu)^2]$, or equivalently minimize its square root). That's not the same as minimizing the MSE in the data.

If your estimator is unbiased, then asymptotically (as $n\to\infty$) in most situations the MLE will have the lowest variance and so will also be asymptotically minimum-MSE. In small samples (for easy cases) you can often compute the MSE and compare directly.
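The distinction between MSE against the data and MSE of the estimator can also be checked by simulation (again my own sketch; the helper `fit_poisson_identity` is an assumed name, implementing the identity-link Poisson MLE via its IRLS form, i.e. weighted least squares with weights $1/\hat{\mu}$). Here we score each fit against the *true* means rather than the observed counts:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 500
x = rng.uniform(0.0, 4.5, n)
X = np.column_stack([np.ones(n), x])
mu_true = X @ np.array([1.0, 2.0])       # true conditional means, all positive

def fit_poisson_identity(X, y, iters=25):
    """Identity-link Poisson MLE via IRLS: repeated WLS with weights 1/mu."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting values
    for _ in range(iters):
        mu = np.clip(X @ beta, 1e-6, None)
        w = 1.0 / mu
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return beta

mse_ols = mse_pois = 0.0
for _ in range(reps):
    y = rng.poisson(mu_true)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_pois = fit_poisson_identity(X, y)
    mse_ols += np.mean((X @ b_ols - mu_true) ** 2)    # error of the estimator,
    mse_pois += np.mean((X @ b_pois - mu_true) ** 2)  # not error against data
mse_ols /= reps
mse_pois /= reps
# Expect mse_pois < mse_ols: the MLE recovers the mean more precisely,
# even though OLS always wins on in-sample RMSE against the data.
```

The design choice here is deliberate: the means span roughly 1 to 10, so the Poisson variance (and hence the ideal weighting) varies by an order of magnitude across points, which is where the weighted fit gains over OLS.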