I fit a simple linear regression model to my experimental measurements in order to make predictions. I have read that you should not calculate predictions for points that depart too far from the available data. However, I could not find any guidance on how far I can extrapolate. For example, if I calculate the reading speed for a disk size of 50GB, I would guess the result will be close to reality. What about a disk size of 100GB, or 500GB? How do I know whether my predictions are close to reality?
The details of my experiment are:
I am measuring the reading speed of a piece of software using different disk sizes. So far I have measured it from 5GB to 30GB, increasing the disk size by 5GB between experiments (6 measurements in total).
The results look linear, and the standard errors are small, in my opinion.
Best Answer
The term you're searching for is 'extrapolation'. The problem is that, no matter how much data you have and how many intermediate levels you have between your endpoints on disk size (i.e., between 5 and 30), it is always possible that there is some degree of curvature in the true underlying function that you simply don't have the power to detect. When you extrapolate far beyond the endpoint, that small degree of curvature becomes magnified: the true function moves further and further away from your fitted line. Another possibility is that the true function really is perfectly straight within the range examined, but that there is a change-point at some distance beyond the endpoint of your study.

These possibilities are impossible to rule out; the question is how likely they are, and how inaccurate your prediction would be if they turn out to be real. I don't know how to provide an analytical answer to those questions. My hunch is that 500 is an awfully long way off when the range under study was [5, 30], but there is no real reason to think my hunches are more worthwhile than yours.

Standard formulas for computing prediction intervals will show you an interval that expands as you move away from $\bar{x}$; seeing what that interval looks like might be helpful. Nonetheless, you need to bear in mind that you are making a theoretical assumption that the line really is perfectly straight, and remains so all the way out to the $x$-value you will use for the prediction. The legitimacy of that prediction is contingent on both the data and fit, and on that assumption.
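For reference, the standard prediction interval for a new observation at $x_0$ in simple linear regression is

$$\hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}},$$

where $s$ is the residual standard error. The $(x_0-\bar{x})^2$ term is what makes the interval widen as you move away from $\bar{x}$: with your six disk sizes (5, 10, ..., 30), $\bar{x} = 17.5$ and $\sum_i (x_i-\bar{x})^2 = 437.5$, so at $x_0 = 500$ that term dwarfs everything else under the square root.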
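To see this concretely, here is a minimal sketch in Python using statsmodels; the speed values are invented placeholders standing in for your six real measurements, and 50, 100, and 500 are the extrapolation points from your question:

```python
# Minimal sketch: prediction intervals from a simple linear regression.
# The speed values below are made-up placeholders for the real data.
import numpy as np
import statsmodels.api as sm

size = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])          # disk size, GB
speed = np.array([110.0, 118.0, 131.0, 139.0, 152.0, 160.0])  # hypothetical MB/s

fit = sm.OLS(speed, sm.add_constant(size)).fit()

# Predict inside the measured range (15) and increasingly far outside it.
new_size = np.array([15.0, 50.0, 100.0, 500.0])
pred = fit.get_prediction(sm.add_constant(new_size))
print(pred.summary_frame(alpha=0.05)[["mean", "obs_ci_lower", "obs_ci_upper"]])
```

The `obs_ci_*` columns are the prediction interval; it widens mechanically with $(x_0-\bar{x})^2$, but it still presumes the relationship stays perfectly linear all the way out to 500GB, which is exactly the assumption discussed above.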