Suppose I want to know whether a baseball team's winning percentage from the previous season or its Pythagorean expectation from the previous season is a better predictor of its winning percentage next season. (I use percentages because the number of games played per season has changed over time, so percentages seem easier to compare.)
For each of these independent variables, there's good reason to suspect there will be a linear relationship with the dependent variable, so linear regression seems like it would work.
The dependent variable is always within the [0,1] interval, so logistic regression also seems like it would work. But the dependent variable isn't exactly a probability (well, I suppose it's the probability that the team beats an unknown opponent team), and it is never actually very close to 0 or 1 (almost without exception it falls between .3 and .7).
So, with all this in mind, which would be the more natural method to use, linear regression or logistic regression? Are they both valid approaches?
As the comments suggest, either method could work in a practical sense and @Macro might be right that the results should be similar so long as the diagnostics check out.
Particularly when the response is centred around 0.5, linear regression is often not a bad approximation. However, it falls apart as the responses get towards 0 and 1, because a) the variance of the response tends to get smaller at those points, which violates the constant-variance assumption of OLS, and b) you start getting predicted values outside of the allowable (0,1) range.
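To make point a) concrete, here is a minimal sketch of how the variance of an observed win proportion changes with the underlying win probability. It assumes a simple binomial model with independent games, and the 162-game season length is just an illustrative choice, not something taken from the question.

```python
def proportion_variance(p, n_games):
    """Variance of an observed win proportion under a binomial model."""
    return p * (1.0 - p) / n_games


N_GAMES = 162  # assumed season length, for illustration only

# Variance at a few underlying win probabilities.
variances = {
    p: proportion_variance(p, N_GAMES)
    for p in (0.05, 0.3, 0.5, 0.7, 0.95)
}

for p, v in variances.items():
    print(f"p = {p:.2f}  variance = {v:.6f}")
```

Under this model the variance at p = 0.3 or p = 0.7 is 0.21/0.25 = 0.84 times the variance at p = 0.5, so within the range the question describes the non-constant variance is mild. It only becomes severe much closer to 0 or 1.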
Because of that, I think the answer to your question "which is the natural method to use?" is definitely logistic regression. As @Dason pointed out, if you know the number of games you can easily do this (e.g., in R, if the response is a proportion, you can set the weights to be the number of games).
In contrast, I can't think of any reason why you'd prefer linear to logistic regression.
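For what it's worth, the weighted fit on proportions mentioned above can be sketched without any libraries. The following is a rough Python illustration, not the R call itself: it fits a one-predictor logistic regression by iteratively reweighted least squares, with the number of games as the binomial weights. The function names, the coefficients `TRUE_B0` and `TRUE_B1`, and all of the data are hypothetical; the responses are generated from a known logistic curve purely so that the fit can be checked against it.

```python
import math


def expit(eta):
    """Inverse logit: maps a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_binomial_logit(x, y, n, n_iter=25):
    """Fit logit(E[y]) = b0 + b1 * x by IRLS.

    x: predictor values, y: observed proportions, n: games per season
    (the binomial weights).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi, ni in zip(x, y, n):
            eta = b0 + b1 * xi
            mu = expit(eta)
            var = mu * (1.0 - mu)
            w = ni * var                 # IRLS weight
            z = eta + (yi - mu) / var    # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Weighted least squares of z on x (2x2 normal equations).
        det = sw * swxx - swx * swx
        b1 = (sw * swxz - swx * swz) / det
        b0 = (swz - b1 * swx) / sw
    return b0, b1


# Hypothetical data: previous-season win percentages, centred at 0.5.
prev_pct = [0.35, 0.42, 0.48, 0.50, 0.55, 0.61, 0.66]
x = [p - 0.5 for p in prev_pct]
games = [154, 154, 162, 162, 162, 162, 162]

# Synthetic responses from a known logistic curve (made-up coefficients).
TRUE_B0, TRUE_B1 = 0.1, 2.0
next_pct = [expit(TRUE_B0 + TRUE_B1 * xi) for xi in x]

b0_hat, b1_hat = fit_binomial_logit(x, next_pct, games)
extreme_prediction = expit(b0_hat + b1_hat * (0.99 - 0.5))
print(b0_hat, b1_hat, extreme_prediction)
```

Because predictions go through `expit`, they stay strictly inside (0, 1) even for an implausibly extreme input such as a .990 previous season, which is the practical advantage over a straight line.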