Beyond the Favorite Toy: New Prediction Intervals for Home Run Totals by Jesse Frey

Introduction:

For as long as there have been baseball players, baseball fans, and readily available statistics, fans have speculated about the future statistics of their favorite players. Special interest has long centered on round numbers such as 500 home runs, 3000 hits, and 300 wins. In the past few years, with Hank Aaron's home run record seemingly at risk, increasing interest has focused on the question of which, if any, of the current crop of stars will break that record.

One well-known method for addressing these sorts of questions is the Favorite Toy, which was developed by Bill James and presented in his annual Baseball Abstracts during the 1980s. To estimate the probability of a player reaching a certain career total of, say, home runs, one begins by computing that player's established level. This established level is computed via the formula EL = (3*Y0 + 2*Y1 + Y2)/6, where Y0 is the total for the most recent year, Y1 is the total for the previous year, and Y2 is the total for the year before that. One then estimates the number of years remaining in the player's career via the formula YR = max(0.6*(40-age), 1.5). Thus a 20-year-old player is estimated to have 12 years remaining, and no active player is estimated to have fewer than 1.5 years remaining. The player's expected remaining home run total is then the product EXP = EL*YR. The probability P that the player in fact hits at least X additional home runs in his career is estimated via P = EXP/X - 0.5, where P is of course restricted to the interval [0,1].

Example 1: Player A has 200 home runs through his age 26 season and has hit exactly 40 home runs in each of the past three years. We thus compute that his established home run level is EL = 40. We estimate his remaining years as YR = 0.6*(40-26) = 8.4 years. His expected remaining home run total is thus 40*8.4 = 336. Since he needs 300 additional home runs to reach 500, his chance of reaching 500 career homers is estimated as P = 336/300 - 0.5 = 0.62.
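The Favorite Toy calculation is short enough to write out as a function. The following is a minimal sketch of the formulas as stated above; the function and argument names are illustrative choices, not part of James's presentation.

```python
def favorite_toy(y0, y1, y2, age, needed):
    """Favorite Toy estimate, following the formulas quoted in the text.

    y0, y1, y2: totals for the most recent season and the two before it.
    age:        player's age in the most recent season.
    needed:     additional total X required to reach the milestone.
    Returns (EL, YR, EXP, P).
    """
    established_level = (3 * y0 + 2 * y1 + y2) / 6
    years_remaining = max(0.6 * (40 - age), 1.5)
    expected_remaining = established_level * years_remaining
    if needed <= 0:
        # Milestone already reached (a convention assumed for this sketch).
        probability = 1.0
    else:
        # P = EXP/X - 0.5, restricted to the interval [0, 1].
        probability = min(1.0, max(0.0, expected_remaining / needed - 0.5))
    return established_level, years_remaining, expected_remaining, probability


# Player A of Example 1: 40 homers in each of three years, age 26, needs 300.
el, yr, exp, p = favorite_toy(40, 40, 40, age=26, needed=300)
print(round(el, 1), round(yr, 1), round(exp), round(p, 2))  # 40.0 8.4 336 0.62
```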
This Favorite Toy formula has a number of nice properties. It completely specifies an estimated distribution for the player's remaining home run total, and this distribution can be used to find upper bounds, lower bounds, and prediction intervals of any desired level. Unfortunately, this estimated distribution is entirely contained in the interval [EXP/1.5, 2*EXP], which is unrealistically short. Consequently, as we will see later on, prediction intervals produced from the formula are much too short to achieve their nominal coverage probabilities.

Example 2: Using player A of Example 1, let us construct a 90 % prediction interval for the player's career home run total. One such interval is the interval between the value that the player will exceed 95 % of the time and the value that he will exceed only 5 % of the time. To find the former, one solves 0.95 = 336/X - 0.5, getting X = 336/1.45 = 232. To find the latter, one solves 0.05 = 336/X - 0.5, getting X = 336/0.55 = 611. Thus the interval (232, 611) is a 90 % prediction interval for the player's remaining homers, and (432, 811) is a 90 % prediction interval for the player's career total.

In this article, we develop a series of formulas, one for each age, that produce prediction intervals that do achieve their nominal coverage probabilities. We then apply those formulas to produce prediction intervals for prominent home run hitters active in the 2002 season.
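Solving P = EXP/X - 0.5 for X gives X = EXP/(P + 0.5), which is all that Example 2 uses. A minimal sketch of that inversion follows; the function name and the equal-tailed choice of interval are assumptions made for illustration.

```python
def favorite_toy_interval(expected_remaining, level, career_total=0):
    """Equal-tailed Favorite Toy prediction interval.

    Inverts P = EXP/X - 0.5 to X = EXP/(P + 0.5), where P is the
    estimated chance of hitting at least X more home runs.
    level is the nominal coverage, e.g. 0.90.
    """
    tail = (1 - level) / 2
    # Value exceeded with probability 1 - tail (the lower endpoint).
    lower = expected_remaining / ((1 - tail) + 0.5)
    # Value exceeded with probability tail (the upper endpoint).
    upper = expected_remaining / (tail + 0.5)
    return career_total + lower, career_total + upper


# Player A: EXP = 336 remaining home runs, 200 already hit.
lo, hi = favorite_toy_interval(336, 0.90)
print(round(lo), round(hi))  # 232 611
lo, hi = favorite_toy_interval(336, 0.90, career_total=200)
print(round(lo), round(hi))  # 432 811
```

Because P only ranges over [0, 1], the endpoints can never leave [EXP/1.5, 2*EXP], which is the source of the short intervals noted above.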

Development of the Formulas:

Postponing for the moment an examination of the accuracy of the Favorite Toy formula, we first develop a set of formulas to compete with the Favorite Toy in predicting future home run totals and forming prediction intervals. A natural approach to take here is that of trying to find formulas which fit baseball's actual historical record. With this goal in mind, the database from www.baseball1.com was downloaded, and a dataset was produced that contained for almost every player-season in baseball history the following variables: name, age (as of 12 AM on July 1), home runs in that season and each of the previous 4 years, and actual remaining career home run total. Those entries corresponding to players who are still active or who hit 0 homers in the year in question were then eliminated from the dataset.

Separately for each age, remaining career home run totals were regressed against the home run totals in the current and the previous 4 seasons. The previous home run totals were then removed from the model chronologically until only those variables which had positive coefficients and made a genuine contribution to predicting the remaining career home run totals were left. The square of home runs was also considered as a variable, but it was in no case found to be useful. The prediction formulas which were obtained were of the form EXP = c0*Y0 + c1*Y1 + c2*Y2, where the coefficients are as tabled below (a blank entry means the variable was dropped from the model for that age):

Table 1:

Age      c0      c1      c2
19    21.47
20    16.21
21    11.68    3.01
22     9.51    1.76
23     7.88    1.87
24     5.98    2.47
25     4.43    1.91    1.35
26     3.93    1.43    1.03
27     3.51    1.07    1.01
28     3.04    1.03    0.90
29     1.94    1.53    0.80
30     2.46    1.17
31     2.58    0.74
33     1.71    0.83
34     1.85    0.43
35     1.60    0.37
36     1.74
37     1.54
38     1.46
39     1.14
40     0.82
41     0.47

What these prediction formulas suggest is that only the home run totals in the most recent year and the two previous years are important in predicting remaining home runs. For very young and very old players, only the most recent year's home run total is important.
Example 3: We use these formulas to predict the remaining home run total of Player A of Example 1. We have that Y0 = Y1 = Y2 = 40, and the player is age 26. Thus EXP = 3.93*40 + 1.43*40 + 1.03*40 = 256, and we would expect player A to have a career total of 200 + 256 = 456 home runs. This may be contrasted with the Favorite Toy's estimate of 536 career home runs.

Given these formulas, we would now like to derive prediction intervals for future home run totals. To do this, we computed for each player-season the difference between the true number of remaining home runs and the predicted value. We then regressed the absolute values of these differences on the predicted values, obtaining a linear formula giving the magnitude for what we might call a typical error. The motivation for this step was the feeling, confirmed by graphs, that the magnitude of the errors produced by the prediction formulas depended on the size of the predicted values. These linear formulas were of the form TE = d0 + d1*EXP, where d0 and d1 were as given below:

Table 2:

Age     d0      d1
19    49.9    0.45
20    53.5    0.25
21    39.2    0.26
22    28.7    0.35
23    22.0    0.36
24    16.3    0.39
25    11.9    0.43
26     9.2    0.43
27     6.1    0.46
28     6.5    0.45
29     4.7    0.50
30     3.9    0.54
31     3.2    0.56
32     2.3    0.59
33     2.3    0.61
34     2.8    0.56
35     1.7    0.64
36     1.2    0.63
37     1.8    0.63
38     2.1    0.60
39     2.1    0.50
40     1.5    0.55
41     0.9    0.79

We then computed for each player-season the ratio between the actual error and the typical error TE. Within each age group, we found the 0.05, 0.10, 0.25, 0.75, 0.90, and 0.95 quantiles of the distribution of these ratios. These quantiles, tabled below, were then used as the coefficients for prediction intervals.
Table 3:

Age   q(.95)   q(.90)   q(.75)   q(.25)   q(.10)   q(.05)
19     3.92     2.57     1.00    -0.36    -0.68    -0.88
20     3.89     2.44     0.84    -0.33    -0.70    -1.00
21     4.06     2.13     0.69    -0.38    -0.73    -0.99
22     3.76     2.21     0.55    -0.47    -0.81    -1.08
23     3.49     1.96     0.51    -0.56    -1.00    -1.28
24     3.32     1.91     0.50    -0.58    -1.03    -1.25
25     3.01     1.76     0.44    -0.71    -1.08    -1.32
26     2.74     1.61     0.32    -0.77    -1.17    -1.37
27     2.40     1.40     0.24    -0.88    -1.23    -1.42
28     2.45     1.43     0.21    -0.88    -1.25    -1.44
29     2.39     1.33     0.19    -0.90    -1.25    -1.42
30     2.23     1.27     0.17    -0.94    -1.25    -1.38
31     2.25     1.29     0.09    -0.94    -1.26    -1.36
32     2.29     1.22     0.06    -1.01    -1.26    -1.35
33     2.22     1.20     0.09    -0.95    -1.25    -1.36
34     2.01     1.13     0.04    -0.95    -1.24    -1.38
35     2.10     1.20    -0.07    -1.01    -1.22    -1.30
36     2.23     0.96     0.01    -1.01    -1.23    -1.35
37     2.52     1.16    -0.02    -0.96    -1.22    -1.30
38     2.62     1.32     0.07    -0.92    -1.13    -1.32
39     1.83     1.47     0.12    -0.91    -1.26    -1.39
40     3.15     1.12     0.09    -0.86    -1.09    -1.32
41     2.73     1.23    -0.36    -0.58    -0.86    -1.03

To find a prediction interval for the remaining home run total for a player using this method, one proceeds in the following way, making sure to use the row in each table corresponding to the age of the player in the most recent season:

1. Using the coefficients in Table 1, find the expected remaining home run total EXP for the player.
2. Using the coefficients in Table 2, find the typical error TE = d0 + d1*EXP.
3. For a 90 % prediction interval, use the interval (EXP + TE*q(0.05), EXP + TE*q(0.95)). For an 80 % prediction interval, use the interval (EXP + TE*q(0.10), EXP + TE*q(0.90)). For a 50 % prediction interval, use the interval (EXP + TE*q(0.25), EXP + TE*q(0.75)).

To find a prediction interval for a player's career total, simply add his current career total to each coordinate of the prediction interval for his remaining home runs.

Example 4: We compute a 90 % prediction interval for the remaining home runs of player A of Example 1. We saw in Example 3 that EXP = 256. Thus the typical error, using Table 2 and the fact that player A is 26, is TE = 9.2 + 0.43*256 = 119. Since q(0.05) = -1.37 and q(0.95) = 2.74 for age 26, our prediction interval goes from 256 - 1.37*119 = 93 to 256 + 2.74*119 = 582. Our 90 % prediction interval for his career total is then (293, 782). This may be contrasted with the interval (432, 811) produced by the Favorite Toy.

These intervals have, by construction, exactly the prediction levels specified, up to rounding error. That is, if one computes 90 % prediction intervals for all of the players of some particular age in the dataset, then 90 % of those intervals will contain the true number of remaining home runs for the players in question. As long as players continue to age in roughly the same way, we can expect our prediction intervals for active and future players to be almost as good.
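The three-step procedure can be sketched in a few lines. The version below hard-codes only the age 26 rows of Tables 1, 2, and 3, and it rounds EXP and TE to whole numbers at each step, as Examples 3 and 4 do, so that the output matches the worked figures. The names are illustrative, and a full implementation would look up the row for the player's actual age.

```python
# Age 26 rows only, copied from Tables 1, 2, and 3.
COEFFS_AGE_26 = (3.93, 1.43, 1.03)   # c0, c1, c2 (Table 1)
TYPICAL_ERROR_AGE_26 = (9.2, 0.43)   # d0, d1 (Table 2)
QUANTILES_AGE_26 = {                 # (lower q, upper q) per level (Table 3)
    90: (-1.37, 2.74),
    80: (-1.17, 1.61),
    50: (-0.77, 0.32),
}


def new_method_interval(recent_totals, level_percent, career_total=0):
    """Prediction interval for an age 26 player.

    recent_totals: (Y0, Y1, Y2), most recent season first.
    level_percent: 90, 80, or 50.
    Intermediate values are rounded as in the article's worked examples.
    """
    # Step 1: expected remaining total EXP = c0*Y0 + c1*Y1 + c2*Y2.
    exp = round(sum(c * y for c, y in zip(COEFFS_AGE_26, recent_totals)))
    # Step 2: typical error TE = d0 + d1*EXP.
    d0, d1 = TYPICAL_ERROR_AGE_26
    te = round(d0 + d1 * exp)
    # Step 3: scale the tabled quantiles by TE and shift by EXP.
    q_low, q_high = QUANTILES_AGE_26[level_percent]
    lower = round(exp + te * q_low)
    upper = round(exp + te * q_high)
    return career_total + lower, career_total + upper


print(new_method_interval((40, 40, 40), 90))                    # (93, 582)
print(new_method_interval((40, 40, 40), 90, career_total=200))  # (293, 782)
```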

Comparing Coverage Probabilities:

For each player-season in the dataset, 90 %, 80 %, and 50 % prediction intervals based on the Favorite Toy method were computed. The proportions of those intervals that contained the true values were as tabled below:

Table 4: Observed coverage of Favorite Toy intervals at each nominal level

Age    90 %     80 %     50 %
19    0.158    0.126    0.084
20    0.214    0.206    0.122
21    0.222    0.180    0.113
22    0.285    0.261    0.158
23    0.275    0.248    0.151
24    0.262    0.233    0.148
25    0.260    0.234    0.156
26    0.262    0.238    0.148
27    0.256    0.235    0.145
28    0.227    0.208    0.121
29    0.204    0.186    0.125
30    0.204    0.185    0.121
31    0.193    0.173    0.111
32    0.201    0.178    0.118
33    0.175    0.168    0.109
34    0.179    0.166    0.100
35    0.171    0.147    0.100
36    0.196    0.172    0.116
37    0.206    0.180    0.118
38    0.217    0.199    0.111
39    0.246    0.225    0.155
40    0.154    0.121    0.099
41    0.109    0.091    0.036

We see from the table that for no age does the nominal 90 % prediction interval achieve a coverage probability exceeding 30 %, and for no age does the nominal 50 % prediction interval achieve a coverage probability exceeding 16 %. The obvious conclusion here is that the Favorite Toy, while easy to work with, does not produce anything even approximating a reasonable estimate of the probabilities it attempts to explore.

To illustrate this point more concretely, we present both types of intervals as of the end of the 1960 season for each player who hit 20 or more home runs in 1960. Table 5 gives 90 % prediction intervals for career totals, Table 6 gives 80 % prediction intervals for career totals, and Table 7 gives 50 % prediction intervals for career totals.

Table 5:

Player              F.T. 90 % Int.   New 90 % Int.   Actual Career Total
Ernie Banks         (466, 789)       (316, 688)      512
Hank Aaron          (439, 799)       (307, 775)      755
Mickey Mantle       (505, 809)       (378, 738)      536
Roger Maris         (280, 580)       (186, 686)      275
Eddie Mathews       (537, 862)       (397, 761)      512
Jim Lemon           (260, 446)       (165, 404)      164
Rocky Colavito      (386, 749)       (250, 712)      374
Ken Boyer           (279, 496)       (175, 428)      282
Harmon Killebrew    (279, 599)       (212, 802)      573
Frank Robinson      (381, 735)       (285, 849)      586
Willie Mays         (419, 647)       (310, 579)      660
Ted Williams        (544, 581)       (523, 566)      521
Roy Sievers         (323, 453)       (251, 402)      318
Bill Skowron        (207, 358)       (132, 311)      211
Joe Adcock          (285, 415)       (218, 383)      336
Orlando Cepeda      (263, 570)       (217, 823)      379
Charlie Maxwell     (201, 317)       (138, 292)      148
Frank Howard        (109, 248)       ( 96, 515)      382
Dick Stuart         (191, 395)       (101, 345)      228
Ron Hansen          (104, 238)       (121, 614)      106
Jim Gentile         ( 83, 182)       ( 43, 227)      179
Willie Kirkland     (174, 365)       ( 97, 362)      148
Woodie Held         (183, 356)       (103, 303)      179
Frank Thomas        (272, 396)       (207, 346)      286
Vada Pinson         (173, 390)       (220, 804)      256
Minnie Minoso       (192, 235)       (169, 250)      186

Table 6:

Player              F.T. 80 % Int.   New 80 % Int.   Actual Career Total
Ernie Banks         (473, 746)       (333, 585)      512
Hank Aaron          (447, 751)       (329, 647)      755
Mickey Mantle       (512, 768)       (396, 644)      536
Roger Maris         (287, 540)       (213, 542)      275
Eddie Mathews       (544, 818)       (415, 666)      512
Jim Lemon           (264, 421)       (171, 333)      164
Rocky Colavito      (394, 701)       (272, 585)      374
Ken Boyer           (284, 467)       (186, 358)      282
Harmon Killebrew    (286, 556)       (240, 620)      573
Frank Robinson      (389, 688)       (312, 675)      586
Willie Mays         (424, 616)       (322, 504)      660
Ted Williams        (545, 576)       (525, 549)      521
Roy Sievers         (326, 436)       (256, 359)      318
Bill Skowron        (210, 338)       (140, 261)      211
Joe Adcock          (288, 398)       (222, 335)      336
Orlando Cepeda      (270, 529)       (250, 628)      379
Charlie Maxwell     (204, 302)       (143, 248)      148
Frank Howard        (112, 229)       (120, 380)      382
Dick Stuart         (195, 367)       (113, 281)      228
Ron Hansen          (107, 220)       (149, 456)      106
Jim Gentile         ( 85, 169)       ( 52, 176)      179
Willie Kirkland     (178, 339)       (110, 289)      148
Woodie Held         (187, 333)       (113, 251)      179
Frank Thomas        (274, 379)       (211, 309)      286
Vada Pinson         (178, 361)       (250, 581)      256
Minnie Minoso       (193, 229)       (171, 221)      186

Table 7:

Player              F.T. 50 % Int.   New 50 % Int.   Actual Career Total
Ernie Banks         (498, 650)       (367, 474)      512
Hank Aaron          (474, 645)       (375, 499)      755
Mickey Mantle       (535, 678)       (430, 531)      536
Roger Maris         (309, 451)       (256, 389)      275
Eddie Mathews       (568, 722)       (449, 552)      512
Jim Lemon           (278, 366)       (187, 257)      164
Rocky Colavito      (422, 593)       (317, 440)      374
Ken Boyer           (300, 403)       (210, 282)      282
Harmon Killebrew    (311, 462)       (298, 438)      573
Frank Robinson      (416, 583)       (368, 501)      586
Willie Mays         (441, 549)       (347, 424)      660
Ted Williams        (548, 565)       (528, 530)      521
Roy Sievers         (335, 397)       (268, 312)      318
Bill Skowron        (221, 293)       (156, 208)      211
Joe Adcock          (298, 360)       (233, 282)      336
Orlando Cepeda      (293, 438)       (293, 421)      379
Charlie Maxwell     (212, 267)       (156, 201)      148
Frank Howard        (123, 188)       (159, 253)      382
Dick Stuart         (211, 307)       (135, 207)      228
Ron Hansen          (117, 180)       (183, 287)      106
Jim Gentile         ( 93, 140)       ( 70, 119)      179
Willie Kirkland     (193, 283)       (136, 206)      148
Woodie Held         (200, 282)       (132, 188)      179
Frank Thomas        (284, 342)       (223, 263)      286
Vada Pinson         (195, 297)       (291, 415)      256
Minnie Minoso       (196, 216)       (176, 196)      186

The performance of the two methods was as given in Table 8. Though neither method performed up to its nominal level on this group of 26 players, the intervals produced by the new method contained the true value more often than did the intervals produced by the Favorite Toy.

Table 8: Correct Intervals:

Method          90 % Int.   80 % Int.   50 % Int.
Favorite Toy       14           9           7
New                22          18           7
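The 90 % column of Table 8 can be reproduced directly from Table 5 by counting, for each method, the intervals that contain the player's actual career total. The sketch below does that count; the data are copied from Table 5, while the function name and the use of strict inequalities at the endpoints are assumptions (no actual total in Table 5 falls exactly on an endpoint, so the choice does not affect the result).

```python
# (player, F.T. low, F.T. high, new low, new high, actual), from Table 5.
TABLE_5 = [
    ("Ernie Banks", 466, 789, 316, 688, 512),
    ("Hank Aaron", 439, 799, 307, 775, 755),
    ("Mickey Mantle", 505, 809, 378, 738, 536),
    ("Roger Maris", 280, 580, 186, 686, 275),
    ("Eddie Mathews", 537, 862, 397, 761, 512),
    ("Jim Lemon", 260, 446, 165, 404, 164),
    ("Rocky Colavito", 386, 749, 250, 712, 374),
    ("Ken Boyer", 279, 496, 175, 428, 282),
    ("Harmon Killebrew", 279, 599, 212, 802, 573),
    ("Frank Robinson", 381, 735, 285, 849, 586),
    ("Willie Mays", 419, 647, 310, 579, 660),
    ("Ted Williams", 544, 581, 523, 566, 521),
    ("Roy Sievers", 323, 453, 251, 402, 318),
    ("Bill Skowron", 207, 358, 132, 311, 211),
    ("Joe Adcock", 285, 415, 218, 383, 336),
    ("Orlando Cepeda", 263, 570, 217, 823, 379),
    ("Charlie Maxwell", 201, 317, 138, 292, 148),
    ("Frank Howard", 109, 248, 96, 515, 382),
    ("Dick Stuart", 191, 395, 101, 345, 228),
    ("Ron Hansen", 104, 238, 121, 614, 106),
    ("Jim Gentile", 83, 182, 43, 227, 179),
    ("Willie Kirkland", 174, 365, 97, 362, 148),
    ("Woodie Held", 183, 356, 103, 303, 179),
    ("Frank Thomas", 272, 396, 207, 346, 286),
    ("Vada Pinson", 173, 390, 220, 804, 256),
    ("Minnie Minoso", 192, 235, 169, 250, 186),
]


def count_covered(rows, low_col, high_col, actual_col=5):
    """Count rows whose actual total lies strictly inside the interval."""
    return sum(
        1 for row in rows if row[low_col] < row[actual_col] < row[high_col]
    )


favorite_toy_hits = count_covered(TABLE_5, 1, 2)
new_method_hits = count_covered(TABLE_5, 3, 4)
print(favorite_toy_hits, new_method_hits, len(TABLE_5))  # 14 22 26
```

With 26 players, a 90 % interval would be expected to succeed about 23 times, so 22 is close to the nominal level while 14 is well short of it.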

Prediction Intervals for Active Players:

Given below are prediction intervals and expectations, produced by the new method, for career home run totals for the active players who, after the 2002 season, were predicted by the new system to finish with more than 300 home runs. Thus, for example, Barry Bonds was predicted to finish with 684 home runs. He was estimated to have a 50 % chance of finishing with 639 to 683 home runs, and a 90 % chance of finishing with 623 to 801 home runs. The intervals are quite wide for younger players, but history suggests that they need to be.

Player               Age   EXP   New 90 % Int.   New 50 % Int.
Barry Bonds          37    684   (623,  801)     (639, 683)
Alex Rodriguez       26    639   (425, 1065)     (519, 688)
Sammy Sosa           33    636   (519,  826)     (554, 644)
Rafael Palmeiro      37    556   (500,  666)     (514, 555)
Fred McGriff         38    522   (484,  596)     (496, 524)
Jim Thome            31    504   (370,  726)     (412, 513)
Ken Griffey          32    503   (472,  554)     (480, 504)
Albert Pujols        22    459   (282, 1079)     (382, 550)
Vladimir Guerrero    26    456   (298,  773)     (367, 493)
Jeff Bagwell         34    454   (393,  543)     (412, 456)
Andruw Jones         25    454   (285,  837)     (363, 510)
Juan Gonzalez        32    450   (411,  516)     (421, 451)
Manny Ramirez        30    439   (338,  603)     (370, 452)
Mike Piazza          33    433   (359,  555)     (381, 438)
Frank Thomas         34    430   (384,  495)     (398, 431)
Troy Glaus           25    423   (251,  814)     (330, 480)
Gary Sheffield       33    413   (349,  516)     (368, 417)
Shawn Green          29    410   (278,  631)     (326, 427)
Matt Williams        36    395   (376,  427)     (380, 395)
Ellis Burks          37    394   (352,  477)     (363, 394)
Larry Walker         35    391   (342,  469)     (353, 388)
Andres Galarraga     41    390   (386,  402)     (388, 389)
Carlos Delgado       30    389   (289,  550)     (321, 401)
Eric Chavez          24    387   (229,  807)     (314, 451)
Mo Vaughn            34    373   (332,  433)     (345, 374)
Greg Vaughn          36    366   (352,  388)     (356, 366)
Todd Helton          28    365   (240,  579)     (289, 384)
Jason Giambi         31    362   (256,  538)     (288, 369)
Chipper Jones        30    361   (275,  501)     (303, 372)
Ron Gant             37    348   (323,  396)     (329, 347)
Miguel Tejada        26    338   (202,  609)     (262, 370)
Alfonso Soriano      24    338   (182,  751)     (265, 400)
Tino Martinez        34    337   (292,  403)     (306, 339)
Lance Berkman        26    336   (185,  639)     (251, 372)
Robin Ventura        34    334   (285,  406)     (300, 335)
Jeff Kent            34    331   (267,  424)     (287, 333)
Raul Mondesi         31    327   (256,  444)     (278, 332)
Richie Sexson        27    326   (200,  540)     (248, 348)
Scott Rolen          27    326   (211,  519)     (255, 345)
Adam Dunn            22    326   (189,  803)     (266, 396)
Luis Gonzalez        34    325   (262,  417)     (282, 327)
Magglio Ordonez      28    325   (202,  535)     (250, 343)
David Justice        36    324   (306,  354)     (311, 324)
Pat Burrell          25    322   (170,  668)     (240, 372)
Ryan Klesko          31    321   (243,  450)     (267, 326)
Tim Salmon           33    321   (275,  396)     (289, 324)
Brian Giles          31    313   (214,  479)     (244, 320)
Tony Batista         28    313   (202,  502)     (245, 329)
Jim Edmonds          32    305   (235,  423)     (252, 308)
Eric Karros          34    301   (273,  340)     (282, 301)

Some Final Notes:

1. Because the coefficients recorded in Tables 1, 2, and 3 were derived based on working with home run totals, they are almost certainly not appropriate coefficients to use for other statistics such as hits or stolen bases. The method by which these coefficients were derived should, however, be applicable to other statistics.
2. The coefficients as presented in Tables 1, 2, and 3 have not been smoothed across ages. In practice, one might feel more confident working with smoothed versions of the coefficients. For example, since it seems unlikely that q(0.95) takes a sudden dip at age 39, one might use a value such as 2.7 which meshes better with surrounding values than does the 1.83 given in Table 3.
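As one illustration of what smoothing across ages might look like, the sketch below applies a three-point running median to the q(.95) column of Table 3. The running median is an assumption chosen here for simplicity, not a smoother proposed in the article, and other choices (moving averages, fitted curves) would give somewhat different values.

```python
from statistics import median

AGES = list(range(19, 42))
# q(.95) column of Table 3, ages 19 through 41.
Q95 = [3.92, 3.89, 4.06, 3.76, 3.49, 3.32, 3.01, 2.74, 2.40, 2.45, 2.39, 2.23,
       2.25, 2.29, 2.22, 2.01, 2.10, 2.23, 2.52, 2.62, 1.83, 3.15, 2.73]


def running_median(values):
    """Replace each interior value by the median of itself and its neighbors.

    The first and last values have only one neighbor and are left unchanged.
    """
    smoothed = list(values)
    for i in range(1, len(values) - 1):
        smoothed[i] = median(values[i - 1:i + 2])
    return smoothed


smoothed_q95 = dict(zip(AGES, running_median(Q95)))
print(smoothed_q95[39])  # 2.62
```

At age 39 this replaces 1.83 with 2.62, which is in the neighborhood of the 2.7 suggested above.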