I am doing a Multiple Linear Regression on a data set where:
The response variable is continuous
One of the explanatory variables is continuous and the rest are binary (categorical): 1 if the attribute is present, 0 if it is not.
I fit the multiple linear regression to my data and found that it had non-constant variance, so I used a Box-Cox transformation.
The Box-Cox transformation seemed to work very well: the residuals vs. fitted values plot looked good, the residuals were approximately normally distributed, and the R-squared and adjusted R-squared values were good.
The data I applied the Box-Cox transformation to was a training set. I now need to validate the model on the test set. I am using R for my calculations. When I use the predict function in R, the predicted values are on the transformed scale.
I would also like to use the cv.lm function in R, which performs cross-validation using a given model and data set. When I use it, I am not sure which data set to pass in: the original or the transformed. Information on cv.lm can be found at http://www.statmethods.net/stats/regression.html and http://www.inside-r.org/packages/cran/DAAG/docs/CVlm
My questions are:
Once I have the predicted values, can I just apply the inverse of the Box-Cox transformation to get my values back to the original scale?
If not, how do I proceed from here to make sense of my model? I have looked in a lot of places online and would really appreciate some insight or expertise on this.
Thanks in advance.
Best Answer
It's common to think of two very different goals when fitting statistical models: inference and prediction. It seems like you might be confusing the two.
The most common use of the Box-Cox transformation is to make the residuals "better behaved"; that is, iid Normal(0, $\sigma^2 I$). If the residuals conform to this assumption after the transformation then the hypothesis tests (namely the F-test and t-tests) that one might like to perform to assess the significance of the estimated regression parameters are valid. To be clear, without the iid Normal(0, $\sigma^2 I$) assumption, the hypothesis tests are invalid. This is what I mean by inference.
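As a rough illustration of the inference use case, here is a minimal R sketch that picks a Box-Cox parameter with boxcox() from the MASS package and refits the model on the transformed response. The data are simulated, and the names x1, x2, lambda and bc_transform are made up for the example; they are not taken from the question.

```r
library(MASS)  # provides boxcox()

set.seed(1)
n  <- 200
x1 <- runif(n, 1, 10)      # simulated continuous predictor
x2 <- rbinom(n, 1, 0.5)    # simulated binary predictor
# Simulated response: log(y) is linear in the predictors, so the
# variance is not constant on the original scale.
y  <- exp(0.5 + 0.3 * x1 + 0.4 * x2 + rnorm(n, sd = 0.3))
train <- data.frame(y, x1, x2)

fit_raw <- lm(y ~ x1 + x2, data = train)

# Profile log-likelihood over a grid of lambda values; keep the maximizer
bc <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Box-Cox transform: log(y) when lambda is 0, (y^lambda - 1) / lambda otherwise
# (hypothetical helper, written out by hand for the example)
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

train$y_bc <- bc_transform(train$y, lambda)
fit_bc <- lm(y_bc ~ x1 + x2, data = train)

summary(fit_bc)                        # t-tests and F-test on the transformed scale
plot(fitted(fit_bc), resid(fit_bc))    # residuals vs. fitted values
```

The coefficients and tests from summary(fit_bc) refer to the transformed response, so they should be interpreted on that scale.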
Prediction, on the other hand, does not require such assumptions. You merely fit your model on the training data and predict on the holdout data.
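A hand-rolled k-fold cross-validation loop makes the "fit on training, predict on holdout" idea concrete. This is only a sketch on simulated data and is a stand-in for cv.lm, not a reproduction of its interface; the variable names and the choice of 5 folds are assumptions for the example.

```r
set.seed(2)
n   <- 200
dat <- data.frame(x1 = runif(n, 1, 10), x2 = rbinom(n, 1, 0.5))  # simulated predictors
dat$y <- exp(0.5 + 0.3 * dat$x1 + 0.4 * dat$x2 + rnorm(n, sd = 0.3))

k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)      # fit on the training folds only
  pred  <- predict(fit, newdata = test)       # predict the held-out fold
  rmse[i] <- sqrt(mean((test$y - pred)^2))    # error on the original scale of y
}

mean(rmse)  # average held-out RMSE across folds
```

Because the error is computed on the original scale of the response, the same loop can be used to compare an untransformed model with a transformed one, provided the transformed model's predictions are back-transformed before the error is computed.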
So it really just depends on your goal. If you're only trying to make good predictions, there's no need to fiddle with Box-Cox. But if you're interested in statistical significance, it's useful to consider it. If your goal is to do both, then there's no reason you can't apply the inverse of the transformation to your predictions.
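Doing both might look like the following sketch: estimate lambda on the training set only, fit on the transformed response, predict on the test set, and invert the transformation before measuring error. The data are simulated, and bc_transform and bc_inverse are hypothetical helpers written for this example.

```r
library(MASS)  # provides boxcox()

set.seed(3)
n   <- 300
dat <- data.frame(x1 = runif(n, 1, 10), x2 = rbinom(n, 1, 0.5))  # simulated predictors
dat$y <- exp(0.5 + 0.3 * dat$x1 + 0.4 * dat$x2 + rnorm(n, sd = 0.3))

idx   <- sample(n, 200)
train <- dat[idx, ]
test  <- dat[-idx, ]

bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
# Inverse of bc_transform; returns NaN if lambda * z + 1 is not positive
bc_inverse <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

# Estimate lambda from the training set only
fit_raw <- lm(y ~ x1 + x2, data = train)
bc      <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda  <- bc$x[which.max(bc$y)]

train$y_bc <- bc_transform(train$y, lambda)
fit_bc     <- lm(y_bc ~ x1 + x2, data = train)

pred_bc <- predict(fit_bc, newdata = test)   # predictions on the transformed scale
pred_y  <- bc_inverse(pred_bc, lambda)       # back on the original scale

rmse_test <- sqrt(mean((test$y - pred_y)^2)) # test error on the original scale
rmse_test
```

One caveat: a prediction that is back-transformed this way is generally closer to a conditional median than a conditional mean of the original response, so it can sit below the mean when the original response is right-skewed.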