Anyone who follows baseball has likely heard about the out-of-nowhere MVP-type performance of Toronto's Jose Bautista. In the four previous seasons, he hit roughly 15 home runs per season. Last year he hit 54, a number surpassed by only 12 players in baseball history.
In 2010 he was paid 2.4 million, and he's asking the team for 10.5 million for 2011. They're offering 7.6 million. If he can repeat that performance in 2011, he'll easily be worth either amount. But what are the odds of him repeating? How hard can we expect him to regress to the mean? How much of his performance can we expect was due to chance? What can we expect his regression-to-the-mean adjusted 2010 totals to be? How do I work it out?
I've been playing around with the Lahman Baseball Database and squeezed out a query that returns home run totals for all players in the previous five seasons who've had at least 50 at-bats per season.
The table looks like this (notice Jose Bautista in row 10)
    first    last      hr_2006 hr_2007 hr_2008 hr_2009 hr_2010
 1  Bobby    Abreu          15      16      20      15      20
 2  Garret   Anderson       17      16      15      13       2
 3  Bronson  Arroyo          2       1       1       0       1
 4  Garrett  Atkins         29      25      21       9       1
 5  Brad     Ausmus          2       3       3       1       0
 6  Jeff     Baker           5       4      12       4       4
 7  Rod      Barajas        11       4      11      19      17
 8  Josh     Bard            9       5       1       6       3
 9  Jason    Bartlett        2       5       1      14       4
10  Jose     Bautista       16      15      15      13      54
and the full result (232 rows) is available here.
I really don't know where to start. Can anyone point me in the right direction? Some relevant theory, and R commands would be especially helpful.
Thanks kindly
Tommy
Note: The example is a little contrived. Home runs definitely aren't the best indicator of a player's worth, and home run totals ignore that the number of chances a batter gets to hit home runs (plate appearances) varies from season to season. Nor do they reflect that some players play in more favourable stadiums, or that league-average home runs change from year to year. Etc. Etc. If I can grasp the theory behind accounting for regression to the mean, I can use it on more suitable measures than HRs.
Best Answer
I think that there's definitely a Bayesian shrinkage or prior correction that could help prediction but you might want to also consider another tack…
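To make the shrinkage idea concrete, here is a minimal Python sketch of one common form of it, a normal-normal empirical Bayes estimate. It is only an illustration under loud assumptions: it uses the ten-row excerpt from the question rather than the full 232 rows, it treats home run totals as normally distributed around a fixed "true talent" per player, and the names `hr`, `shrink`, `mu`, `sigma2` and `tau2` are my own. R was requested, but the same steps translate directly.

```python
from statistics import mean, variance

# (hr_2006..hr_2009, hr_2010), copied from the ten-row excerpt in the question.
hr = {
    "Bobby Abreu":     ([15, 16, 20, 15], 20),
    "Garret Anderson": ([17, 16, 15, 13], 2),
    "Bronson Arroyo":  ([2, 1, 1, 0], 1),
    "Garrett Atkins":  ([29, 25, 21, 9], 1),
    "Brad Ausmus":     ([2, 3, 3, 1], 0),
    "Jeff Baker":      ([5, 4, 12, 4], 4),
    "Rod Barajas":     ([11, 4, 11, 19], 17),
    "Josh Bard":       ([9, 5, 1, 6], 3),
    "Jason Bartlett":  ([2, 5, 1, 14], 4),
    "Jose Bautista":   ([16, 15, 15, 13], 54),
}

prior_seasons = [seasons for seasons, _ in hr.values()]
n_prior = 4  # seasons per player used to estimate the variances

# Assumed model: talent_i ~ N(mu, tau2), season total ~ N(talent_i, sigma2).
player_means = [mean(s) for s in prior_seasons]
mu = mean(player_means)                             # group mean
sigma2 = mean([variance(s) for s in prior_seasons]) # within-player (season-to-season) variance
# Variance of player means overstates talent variance by sigma2 / n_prior.
tau2 = max(variance(player_means) - sigma2 / n_prior, 0.0)

def shrink(observed_mean, n_seasons):
    """Posterior mean of talent given an average over n_seasons seasons."""
    weight = tau2 / (tau2 + sigma2 / n_seasons)
    return mu + weight * (observed_mean - mu)

seasons, hr_2010 = hr["Jose Bautista"]
only_2010 = shrink(hr_2010, 1)                   # as if 2010 were all we knew
all_five = shrink(mean(seasons + [hr_2010]), 5)  # 2010 as one of five seasons

print(f"group mean: {mu:.1f}")
print(f"2010 alone, shrunk: {only_2010:.1f}")
print(f"all five seasons, shrunk: {all_five:.1f}")
```

The two estimates bracket very different views of 2010: one ignores his history entirely, the other assumes his talent never changed. A real analysis would probably sit somewhere in between and would use rates rather than raw totals.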
Look up players in history, not just the last few years, who've had breakout seasons after a couple of years in the majors (dramatic increases, perhaps 2x) and see how they did in the following year. It's possible that the probability of maintaining performance in that group is the right predictor.
There are a variety of ways to look at this problem but, as mpiktas said, you're going to need more data. If you just want to deal with recent data, then you're going to have to look at overall league stats and the pitchers he's up against. It's a complex problem.
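As a rough illustration of that lookup, the Python sketch below flags a "breakout" as a season with at least twice the player's average over all earlier seasons in the table (requiring at least two earlier seasons), then reports the following season. The 2x threshold, the two-season minimum, and the names `find_breakouts` and `rows` are assumptions of mine, and the data is just the ten-row excerpt from the question, so this shows the mechanics rather than any real base rate.

```python
from statistics import mean

YEARS = [2006, 2007, 2008, 2009, 2010]

# Home run totals per season, copied from the ten-row excerpt in the question.
rows = {
    "Bobby Abreu":     [15, 16, 20, 15, 20],
    "Garret Anderson": [17, 16, 15, 13, 2],
    "Bronson Arroyo":  [2, 1, 1, 0, 1],
    "Garrett Atkins":  [29, 25, 21, 9, 1],
    "Brad Ausmus":     [2, 3, 3, 1, 0],
    "Jeff Baker":      [5, 4, 12, 4, 4],
    "Rod Barajas":     [11, 4, 11, 19, 17],
    "Josh Bard":       [9, 5, 1, 6, 3],
    "Jason Bartlett":  [2, 5, 1, 14, 4],
    "Jose Bautista":   [16, 15, 15, 13, 54],
}

def find_breakouts(table, factor=2.0, min_prior=2):
    """Return (player, year, breakout_hr, next_year_hr) for each breakout
    season that has a following season in the table."""
    found = []
    for player, totals in table.items():
        # t indexes the candidate breakout season; t + 1 must exist.
        for t in range(min_prior, len(totals) - 1):
            baseline = mean(totals[:t])
            if baseline > 0 and totals[t] >= factor * baseline:
                found.append((player, YEARS[t], totals[t], totals[t + 1]))
    return found

breakouts = find_breakouts(rows)
for player, year, hr_breakout, hr_next in breakouts:
    kept = hr_next / hr_breakout
    print(f"{player}: {hr_breakout} HR in {year}, {hr_next} the next year "
          f"({kept:.0%} retained)")
```

Bautista's own 2010 does not appear in the output because the table has no 2011 season to compare against; that is exactly the number being predicted. With the full historical database, the distribution of the "retained" fraction across many such seasons would be the quantity of interest.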
And then there's just considering Bautista's own data. Yes, that was his best year, but it was also the first time since 2007 that he had over 350 ABs (569). You might want to convert his totals to a rate, such as home runs per at-bat, and look at the percentage increase in that rate instead.
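For reference, the arithmetic on Bautista's own numbers looks like this in Python. Only the 2010 at-bat figure (569) is given here, so the per-at-bat rate can be computed for 2010 alone; the earlier seasons' at-bats would have to come from the database. The variable names are mine.

```python
# Bautista's home run totals from the table in the question.
prior_hr = [16, 15, 15, 13]   # 2006-2009
hr_2010 = 54
ab_2010 = 569                 # at-bats in 2010, as quoted above

baseline = sum(prior_hr) / len(prior_hr)                 # average of the prior four seasons
pct_increase = 100 * (hr_2010 - baseline) / baseline     # increase in raw totals
rate_2010 = hr_2010 / ab_2010                            # home runs per at-bat in 2010

print(f"baseline: {baseline:.2f} HR per season")
print(f"increase in raw totals: {pct_increase:.0f}%")
print(f"2010 rate: {rate_2010:.4f} HR per at-bat")
```

Repeating the last line for each earlier season, once the at-bats are known, would show how much of the jump in totals is simply due to more playing time.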