# Solved – Measuring Regression to the Mean in Hitting Home Runs

Anyone who follows baseball has likely heard about the out-of-nowhere MVP-type performance of Toronto's Jose Bautista. In the previous four seasons, he hit roughly 15 home runs per season. Last year he hit 54, a number surpassed by only 12 players in baseball history.

In 2010 he was paid 2.4 million, and he's asking the team for 10.5 million for 2011. They're offering 7.6 million. If he can repeat that performance in 2011, he'll easily be worth either amount. But what are the odds of him repeating it? How hard can we expect him to regress to the mean? How much of his performance was likely due to chance? What would his 2010 totals look like after adjusting for regression to the mean? How do I work it out?

I've been playing around with the Lahman Baseball Database and squeezed out a query that returns home run totals over the previous five seasons for all players who had at least 50 at-bats in each of those seasons.
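For readers who want to reproduce this kind of filter, here is a minimal sketch in Python (not the original query, which isn't shown). It assumes the batting data has already been loaded as rows of `(player_id, season, at_bats, home_runs)`; the sample rows, the player IDs, and the function name `wide_home_runs` are all hypothetical. It sums rows per player-season on the assumption that a player can have more than one row in a season (for example after a mid-season trade).

```python
from collections import defaultdict

SEASONS = range(2006, 2011)  # 2006 through 2010 inclusive
MIN_AT_BATS = 50

# Hypothetical sample rows: (player_id, season, at_bats, home_runs).
rows = [
    ("player_a", 2006, 410, 12),
    ("player_a", 2007, 200, 5),   # two rows in 2007 (assumed trade)
    ("player_a", 2007, 260, 9),
    ("player_a", 2008, 480, 18),
    ("player_a", 2009, 350, 10),
    ("player_a", 2010, 520, 21),
    ("player_b", 2006, 300, 9),
    ("player_b", 2007, 310, 5),
    ("player_b", 2008, 40, 1),    # below the 50 at-bat cutoff
    ("player_b", 2009, 280, 6),
    ("player_b", 2010, 290, 3),
    ("player_c", 2006, 500, 20),  # no 2010 season at all
    ("player_c", 2007, 510, 22),
    ("player_c", 2008, 490, 19),
    ("player_c", 2009, 450, 15),
]

def wide_home_runs(rows, seasons, min_at_bats):
    """Return {player_id: [hr per season]} for players who reach
    min_at_bats in every one of the given seasons."""
    at_bats = defaultdict(int)
    home_runs = defaultdict(int)
    for player, season, ab, hr in rows:
        # Sum multiple rows for the same player and season.
        at_bats[(player, season)] += ab
        home_runs[(player, season)] += hr

    table = {}
    for player in sorted({r[0] for r in rows}):
        # A missing season counts as 0 at-bats, so the player is dropped.
        if all(at_bats[(player, s)] >= min_at_bats for s in seasons):
            table[player] = [home_runs[(player, s)] for s in seasons]
    return table

table = wide_home_runs(rows, SEASONS, MIN_AT_BATS)
print(table)
```

Only `player_a` survives here: `player_b` misses the cutoff in 2008 and `player_c` has no 2010 season.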

The table looks like this (note Jose Bautista in row 10):

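| row | first | last | hr_2006 | hr_2007 | hr_2008 | hr_2009 | hr_2010 |
|---|---|---|---|---|---|---|---|
| 1 | Bobby | Abreu | 15 | 16 | 20 | 15 | 20 |
| 2 | Garret | Anderson | 17 | 16 | 15 | 13 | 2 |
| 3 | Bronson | Arroyo | 2 | 1 | 1 | 0 | 1 |
| 4 | Garrett | Atkins | 29 | 25 | 21 | 9 | 1 |
| 5 | Brad | Ausmus | 2 | 3 | 3 | 1 | 0 |
| 6 | Jeff | Baker | 5 | 4 | 12 | 4 | 4 |
| 7 | Rod | Barajas | 11 | 4 | 11 | 19 | 17 |
| 8 | Josh | Bard | 9 | 5 | 1 | 6 | 3 |
| 9 | Jason | Bartlett | 2 | 5 | 1 | 14 | 4 |
| 10 | Jose | Bautista | 16 | 15 | 15 | 13 | 54 |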

and the full result (232 rows) is available here.
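To make the "roughly 15 home runs per season" figure concrete, the sketch below computes each player's 2006 to 2009 average and how far the 2010 total sits from it, using only the ten rows shown. It is written in Python, and the function name `prior_vs_2010` is my own. This is descriptive arithmetic only, not a regression-to-the-mean estimate.

```python
from statistics import mean

# The ten rows shown above: (first, last, hr_2006, ..., hr_2010).
players = [
    ("Bobby", "Abreu", 15, 16, 20, 15, 20),
    ("Garret", "Anderson", 17, 16, 15, 13, 2),
    ("Bronson", "Arroyo", 2, 1, 1, 0, 1),
    ("Garrett", "Atkins", 29, 25, 21, 9, 1),
    ("Brad", "Ausmus", 2, 3, 3, 1, 0),
    ("Jeff", "Baker", 5, 4, 12, 4, 4),
    ("Rod", "Barajas", 11, 4, 11, 19, 17),
    ("Josh", "Bard", 9, 5, 1, 6, 3),
    ("Jason", "Bartlett", 2, 5, 1, 14, 4),
    ("Jose", "Bautista", 16, 15, 15, 13, 54),
]

def prior_vs_2010(row):
    """Return (name, mean HR for 2006-2009, 2010 HR minus that mean)."""
    first, last, *hrs = row
    prior_mean = mean(hrs[:4])
    return f"{first} {last}", prior_mean, hrs[4] - prior_mean

summary = [prior_vs_2010(r) for r in players]
for name, prior, change in summary:
    print(f"{name:16s} prior mean {prior:5.2f}  2010 change {change:+6.2f}")
```

For Bautista this gives a prior mean of 14.75 and a 2010 total 39.25 home runs above it.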

I really don't know where to start. Can anyone point me in the right direction? Some relevant theory and R commands would be especially helpful.

Thanks kindly

Tommy

Note: The example is a little contrived. Home runs definitely aren't the best indicator of a player's worth, and home run totals don't account for the varying number of chances a batter gets to hit home runs each season (plate appearances). Nor do they reflect that some players play in more favourable stadiums, or that league-average home runs change from year to year. Etc. Etc. If I can grasp the theory behind accounting for regression to the mean, I can use it on more suitable measures than HRs.
